## 1. Handling Missing Data in the Titanic Dataset
### <b>Task:</b> Identify and handle missing values in the Titanic dataset. Experiment with different strategies such as mean/median imputation, mode imputation, and dropping rows/columns.

In [11]:
# Importing Libraries
import pandas as pd

In [12]:
# Loading the Titanic Dataset
titanic_dataset = pd.read_csv('Datasets/Titanic.csv')  # forward slash works on Windows, macOS and Linux
print(titanic_dataset.shape, '\n')
titanic_dataset.head()

(891, 12) 



,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
# Checking for missing values
titanic_dataset.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

<p> -> The Age (177), Cabin (687) and Embarked (2) columns contain missing values. We will handle them with mean/median imputation, mode imputation, or by dropping rows/columns, choosing the strategy that suits each column.</p>
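
Raw counts are easier to compare as a share of the rows. From the output above, Age is missing in 177/891 ≈ 19.9% of rows, Cabin in 687/891 ≈ 77.1%, and Embarked in 2/891 ≈ 0.2%. The sketch below shows one way to compute such a summary. It uses a small made-up DataFrame (`toy`, not the real Titanic file) so that it runs on its own. The same two lines should work on `titanic_dataset`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (illustrative values, not taken from the real file)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan, 'S'],
    'Fare': [7.25, 71.28, 7.93, 53.10, 8.05],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(1)

# Keep only the columns that have missing values, largest share first
missing_summary = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_summary)
```

A column with a very high missing share is a candidate for dropping, while a low share usually favors imputation.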

In [14]:
# Mean Imputation: Filling the missing values with the mean (average) of the column
titanic_dataset.fillna({'Age': titanic_dataset['Age'].mean()}, inplace=True)

# Median Imputation: Filling the missing values with the median (middle value) of the column
# (Commented out: after the mean fill above, no missing values remain, so this line would have no effect.)
# titanic_dataset.fillna({'Age': titanic_dataset['Age'].median()}, inplace=True)

# Use only one of the methods above, not both. Mode imputation is also possible here.
# Choose the one that best fits the data; the median is less sensitive to outliers than the mean.

print('Missing values in the Age column:', titanic_dataset['Age'].isnull().sum())

Missing values in the Age column: 0
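
To see how the choice between mean and median changes the result, both fills can be compared on copies, without `inplace=True`. This is a minimal sketch on a made-up `ages` series (an assumption for illustration, not the real Age column) that contains one large outlier.

```python
import numpy as np
import pandas as pd

# Made-up ages with one outlier (80) and two missing values
ages = pd.Series([2.0, 22.0, 24.0, 26.0, 80.0, np.nan, np.nan], name='Age')

# fillna without inplace returns a new Series, so `ages` stays unchanged
mean_filled = ages.fillna(ages.mean())      # mean of the known values is 30.8
median_filled = ages.fillna(ages.median())  # median of the known values is 24.0

comparison = pd.DataFrame({'mean_filled': mean_filled, 'median_filled': median_filled})
print(comparison.tail(2))
```

The outlier pulls the mean upward, so the mean-filled rows get a larger value than the median-filled ones. Working on copies also makes it possible to try both strategies before committing to one.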


In [15]:
# Mode Imputation: Filling the missing value with the mode (most occurring value) of the column
titanic_dataset.fillna({'Embarked': titanic_dataset['Embarked'].mode()[0]}, inplace=True)

# Mode imputation suits the Embarked column because it is categorical, so a mean or median cannot be computed for it.

print('Missing values in the Embarked column:', titanic_dataset['Embarked'].isnull().sum())

Missing values in the Embarked column: 0
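
The `[0]` after `mode()` is needed because `mode()` returns a Series, not a single value: if several values tie for most frequent, all of them are returned. A small sketch on a made-up `embarked` series (illustrative values only) shows this.

```python
import numpy as np
import pandas as pd

# Made-up port codes with two missing entries
embarked = pd.Series(['S', 'C', 'S', 'Q', np.nan, 'S', np.nan], name='Embarked')

# mode() ignores missing values by default and returns a Series of the most frequent value(s)
modes = embarked.mode()

# Take the first mode and use it to fill the gaps
embarked_filled = embarked.fillna(modes[0])
print(modes.tolist())
print(embarked_filled.tolist())
```

When there is a tie, `mode()` returns the tied values in sorted order, so `[0]` picks the first of them. It is worth checking `len(modes)` if that choice matters.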


In [16]:
# Dropping rows/columns: Cabin is missing in 687 of 891 rows, so we drop the whole column.
titanic_dataset.drop(columns=['Cabin'], inplace=True)
titanic_dataset.shape  # To check that the Cabin column has been dropped.

(891, 11)
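
The cell above drops a column by name. Rows can be dropped as well, and columns can be dropped by a threshold, using `dropna`. The sketch below uses a made-up `toy` DataFrame, and the 50% threshold is an arbitrary value chosen for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values, not from the real file)
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan],
})

# Drop only the rows where Embarked is missing; other columns may still contain NaN
rows_dropped = toy.dropna(subset=['Embarked'])

# Drop columns that have fewer than 50% non-null values (threshold is an assumption)
min_non_null = int(len(toy) * 0.5)
cols_dropped = toy.dropna(axis=1, thresh=min_non_null)

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
```

Dropping rows is reasonable when only a few are affected, as with the 2 missing Embarked values. Dropping a column fits cases like Cabin, where most values are absent.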

In [17]:
# Checking the results
titanic_dataset.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [18]:
# Printing information
titanic_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


<p>-> The dataset now has no missing values: Age and Embarked were imputed and Cabin was dropped. It is ready for the next steps of analysis and modeling.</p>
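
The three steps can also be collected into one function that returns a cleaned copy and leaves the input untouched. This is only a sketch: `handle_missing` and its parameter names are hypothetical, and it uses the median for the numeric column where the notebook cell above may use the mean.

```python
import numpy as np
import pandas as pd

def handle_missing(df, numeric_col='Age', categorical_col='Embarked', sparse_col='Cabin'):
    """Hypothetical helper: return a cleaned copy of df without modifying df."""
    # drop() returns a new DataFrame, so the original is not changed
    cleaned = df.drop(columns=[sparse_col])
    # Numeric column: fill with the median (swap in .mean() if preferred)
    cleaned[numeric_col] = cleaned[numeric_col].fillna(cleaned[numeric_col].median())
    # Categorical column: fill with the most frequent value
    cleaned[categorical_col] = cleaned[categorical_col].fillna(cleaned[categorical_col].mode()[0])
    return cleaned

# Toy data (illustrative values, not from the real file)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan],
})

cleaned = handle_missing(toy)
print(cleaned.isnull().sum())
```

Keeping the raw DataFrame unchanged makes it easy to rerun the cleaning with a different strategy and compare the results.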

<hr>