# Dealing With Missing Values

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('../../titanic.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

- The Age and Cabin columns have many missing values (177 and 687 of 891 rows, respectively)
- The Embarked column has only 2 missing values, few enough to be ignored
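Raw counts are easier to compare as a share of all rows. The sketch below assumes a small made-up frame (`toy`, not the Titanic file) and relies on `isnull()` returning booleans, so that `mean()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields True/False, so the column mean is the fraction of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the columns that need the most attention at the top.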

## Deleting Missing Values

We can get rid of missing values by deleting either the rows or the columns that contain them.

In [4]:
# dropping all rows with missing values

data_row_drop = data.dropna(axis=0)
data_row_drop.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [5]:
data.shape, data_row_drop.shape

((891, 12), (183, 12))

However, we lose a significant amount of information: only 183 of the 891 rows remain.
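One way to limit the loss is to drop rows only when specific columns are missing, using the `subset` parameter of `dropna`. This is a sketch on a made-up frame (`toy` is an assumption, not the Titanic data), comparing it with dropping every incomplete row.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values spread over several columns (made-up values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# Drop every row that has at least one missing value
all_dropped = toy.dropna(axis=0)

# Drop only the rows where Embarked is missing; other gaps are kept for later handling
subset_dropped = toy.dropna(subset=["Embarked"])

print(all_dropped.shape, subset_dropped.shape)
```

Restricting the drop to a column with few gaps keeps most rows, at the cost of still having to handle the remaining missing values some other way.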

In [67]:
# dropping columns with fewer than 500 non-missing values (only Cabin here)

data_col_drop = data.dropna(thresh=500, axis=1)
data_col_drop.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64
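The `thresh` argument is the minimum number of non-missing values a column must have to be kept, which is why Age and Embarked survive above while Cabin does not. A minimal sketch on a made-up frame (`toy` and its values are assumptions) contrasts it with the default behaviour:

```python
import numpy as np
import pandas as pd

# Toy frame: Age has 3 non-missing values, Cabin has 1, Fare has 4 (made-up values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.1],
})

# Keep columns with at least 3 non-missing values: Cabin is dropped, Age stays
kept = toy.dropna(thresh=3, axis=1)

# Default behaviour: drop every column that has any missing value
complete_only = toy.dropna(axis=1)

print(list(kept.columns), list(complete_only.columns))
```

Without `thresh`, `dropna(axis=1)` would also remove columns with only a handful of gaps.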

## Replacing With New Value

In [68]:
data.Cabin.fillna(value='Unknown')

0      Unknown
1          C85
2      Unknown
3         C123
4      Unknown
        ...   
886    Unknown
887        B42
888    Unknown
889       C148
890    Unknown
Name: Cabin, Length: 891, dtype: object

In [69]:
data.Age.fillna(value=99)

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    99.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [70]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Since `fillna` returns a new object and we neither assigned the result back nor set `inplace=True`, the original data is unchanged.
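The same behaviour holds when filling several columns at once. As a sketch, `DataFrame.fillna` accepts a dict that maps column names to fill values. The frame `toy` below is made up for illustration, and the fill values mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in each column (made-up values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan],
    "Cabin": [None, "C85"],
})

# fillna returns a new DataFrame; the dict gives one fill value per column
filled = toy.fillna(value={"Age": 99, "Cabin": "Unknown"})

print(toy.isnull().sum().sum())     # missing values left in the original
print(filled.isnull().sum().sum())  # missing values left in the result
```

Keeping the result in a new variable leaves the raw data available for trying other strategies.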

In [74]:
data_copy = data.copy()

data_copy['Age'] = data_copy['Age'].fillna(value=99)
data_copy.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [76]:
data.Cabin.isnull().astype('int')

0      1
1      0
2      1
3      0
4      1
      ..
886    1
887    0
888    1
889    0
890    1
Name: Cabin, Length: 891, dtype: int32

In [77]:
data_copy['Cabin_NA'] = data.Cabin.isnull().astype('int')
data_copy.head(7)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_NA
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1
5,6,0,3,"Moran, Mr. James",male,99.0,0,0,330877,8.4583,,Q,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0
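The indicator column records which values were originally missing, so that information survives a later fill. In this sketch (on a made-up `toy` frame, not the Titanic data) the flag is built before the column is filled, because after filling, `isnull()` would no longer find anything.

```python
import pandas as pd

# Toy frame with missing cabins (made-up values)
toy = pd.DataFrame({"Cabin": [None, "C85", None]})

# Build the indicator first: 1 where the value was missing, 0 otherwise
toy["Cabin_NA"] = toy["Cabin"].isnull().astype("int")

# Then replace the missing values with a placeholder
toy["Cabin"] = toy["Cabin"].fillna(value="Unknown")

print(toy)
```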


## Imputing Missing Values Using Central Tendency

We can impute a numerical column using the mean of that column.

In [78]:
avg_age = data.Age.mean()
avg_age

29.69911764705882

In [82]:
# making a copy
data_imputed = data.copy()

# imputing missing values
data_imputed['Age'] = data['Age'].fillna(value=avg_age)
data_imputed.Age.isnull().sum()

0

We can impute a non-numerical column using its mode (the most frequent value).

In [83]:
data.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [84]:
embarked_mode = data.Embarked.mode()[0]
embarked_mode

'S'

In [86]:
data_imputed['Embarked'] = data['Embarked'].fillna(value=embarked_mode)
data_imputed.Embarked.isnull().sum()

0
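The two steps above can be combined into one helper. This is only a sketch: `impute_central` is a hypothetical name, the frame `toy` is made up, and the helper assumes every column has at least one non-missing value (otherwise `mode()[0]` would fail).

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_central(df):
    """Hypothetical helper: fill numeric columns with the mean, others with the mode."""
    out = df.copy()  # leave the caller's frame untouched
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            fill = out[col].mean()
        else:
            # mode() ignores missing values; take the first if there are ties
            fill = out[col].mode()[0]
        out[col] = out[col].fillna(value=fill)
    return out


# Toy frame with one numeric and one non-numeric column (made-up values)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

imputed = impute_central(toy)
print(imputed)
```

Working on a copy mirrors the `data.copy()` pattern used above, so the original frame stays available for comparison.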