### Handling Missing Values

In [69]:
import seaborn as sns
import pandas as pd
import numpy as np

df = sns.load_dataset('titanic')
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [70]:
df.info()  # the Non-Null Count column shows which columns have missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [71]:
def missing(df=df):
    # Count the missing values in each column.
    # Note: the default argument is bound to the `df` object that exists when
    # this function is defined, so it only tracks in-place changes to that object.
    return df.isnull().sum().to_frame('Missing Values')

missing()

,Missing Values
survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


Missing data can be handled in different ways depending on the needs of the analysis. If a column has too many missing values, the column can be dropped. A small number of missing values can be filled by imputation, i.e. by using the mean, median, or mode of the column. A row can be dropped if it has too many missing values.
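
The three options above can be sketched on a small made-up DataFrame. This is only an illustration: the `toy` frame, its column names, and the 0.5 and `thresh=1` cut-offs are assumptions chosen for the example, not values taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, np.nan],
    "b": [np.nan, np.nan, np.nan, 8.0, np.nan],  # mostly missing
    "c": ["x", "y", None, "y", None],
})

# Fraction of missing values in each column
missing_frac = toy.isna().mean()

# Option 1: drop columns where more than half the values are missing
# (the 0.5 cut-off is an arbitrary choice for this sketch)
kept = toy.loc[:, missing_frac <= 0.5]

# Option 2: drop rows that do not have at least 1 non-missing value left
kept = kept.dropna(thresh=1)

# Option 3: impute the remaining numeric gaps with the column median
kept = kept.assign(a=kept["a"].fillna(kept["a"].median()))

print(missing_frac)
print(kept)
```

Column `b` is removed because 80% of its values are missing, the last row is removed because nothing is left in it, and the remaining gap in `a` is filled with the median.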

Since most of the values in 'deck' are missing (only 203 of 891 are non-null), we will simply drop the column. We will impute the 'age' column with a mean. We could drop the rows with missing values too, but that should be avoided where possible because it discards the information in the other columns of those rows.

In [72]:
df.drop(columns=['deck'], inplace=True)
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


We could impute the 'age' column with the average age of all passengers, but we can get a better approximation by using the other columns to our advantage. Here we will impute each missing age with the average age of that passenger's pclass.
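
As a side note, group-wise mean imputation of this kind can also be written compactly with `groupby(...).transform('mean')`. The sketch below uses a small made-up frame (the `toy` data and its values are assumptions for illustration) rather than the Titanic data, and it is an alternative formulation, not the approach used in the cells of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'pclass' and 'age' columns
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, np.nan, 36.0, 20.0, 30.0, np.nan],
})

# Mean age within each class, aligned row by row with the original frame
class_mean = toy.groupby("pclass")["age"].transform("mean")

# Fill only the missing ages with the mean of their own class
toy["age"] = toy["age"].fillna(class_mean)

print(toy)
```

Because `transform` returns a Series with the same index as the input, no explicit mapping from class to mean is needed.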

In [73]:
mean_age = []
for i in range(1,4):
    x = df[df['pclass'] == i]['age'].mean()
    mean_age.append(x)
    print(f"mean age of class {i} is {x} years")

mean age of class 1 is 38.233440860215055 years
mean age of class 2 is 29.87763005780347 years
mean age of class 3 is 25.14061971830986 years


We can see that passengers of first class are older on average than passengers of second and third class.

In [74]:
mapping = {i+1 : mean_age[i] for i in range(3)}
mapping

{1: np.float64(38.233440860215055),
 2: np.float64(29.87763005780347),
 3: np.float64(25.14061971830986)}

In [75]:
df['age'] = df['age'].fillna(df['pclass'].map(mapping))

In [76]:
missing()

,Missing Values
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


We will now check which rows still have missing values.

In [77]:
df[df.isnull().any(axis=1)]

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
61,1,1,female,38.0,0,0,80.0,,First,woman,False,,yes,True
829,1,1,female,62.0,0,0,80.0,,First,woman,False,,yes,True


Here the two rows are missing both 'embarked' and 'embark_town'. Since these are string (categorical) columns, a mean or median does not apply, but they can be imputed with the mode. Here we will just remove the rows for demonstration.
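
For reference, mode imputation of a string column might look like the sketch below. It runs on a small made-up Series (the values are assumptions for illustration), not on the Titanic data.

```python
import pandas as pd

# Hypothetical toy column with two missing entries
embarked = pd.Series(["S", "C", None, "S", "Q", None], name="embarked")

# mode() ignores missing values and returns a Series (there can be ties),
# so take the first entry as the fill value
most_common = embarked.mode()[0]

filled = embarked.fillna(most_common)
print(filled.tolist())
```

If several values are tied for most frequent, `mode()` returns all of them in sorted order, so taking `[0]` is a simple but arbitrary tie-break.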

In [78]:
df.dropna(inplace=True)

In [79]:
missing()

,Missing Values
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
who,0


In [86]:
len(df)

889

The DataFrame now has two fewer rows (889 instead of the original 891).
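
Note that `dropna()` without arguments removes every row that has a missing value in any column. When only certain columns matter, the `subset` parameter limits the check to those columns. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row is missing 'embarked', another is missing 'fare'
toy = pd.DataFrame({
    "fare": [80.0, np.nan, 7.25, 8.05],
    "embarked": [None, "S", "S", "C"],
})

# Only rows missing 'embarked' are removed; the row missing 'fare' is kept
cleaned = toy.dropna(subset=["embarked"])

rows_dropped = len(toy) - len(cleaned)
print(rows_dropped)
```

Returning a new frame instead of using `inplace=True` keeps the original available, which makes it easy to count how many rows were removed.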