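1. missing completely at random (MCAR) --> 
the chance that a value is missing has no relation to any other values, observed or missing
2. missing not at random (MNAR) --> 
the missingness is related to the missing value itself or to other unobserved factors (eg cabin and age in this notebook)
3. missing at random (MAR) --> 
the missingness is related to other observed variables, eg men tend to hide salaries and women tend to hide age, so the missingness depends on the observed sex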
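
The three mechanisms above can be simulated on synthetic data to see how they differ. This is only a minimal sketch: the `toy` frame, its column names, and all the probabilities are made up for illustration and have nothing to do with the Titanic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical synthetic data: column names and probabilities are assumptions.
toy = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "salary": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its salary.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends on an observed column (sex), not on salary itself.
mar_prob = np.where(toy["sex"] == "male", 0.4, 0.1)
mar_mask = rng.random(n) < mar_prob

# MNAR: the chance depends on the value that goes missing (high salaries are hidden).
mnar_prob = np.where(toy["salary"] > 60_000, 0.6, 0.05)
mnar_mask = rng.random(n) < mnar_prob

# Mean of the values that remain observed under each mechanism.
true_mean = toy["salary"].mean()
mcar_mean = toy.loc[~mcar_mask, "salary"].mean()
mnar_mean = toy.loc[~mnar_mask, "salary"].mean()

# Missing rate per observed group under MAR.
mar_rate_by_sex = pd.Series(mar_mask).groupby(toy["sex"]).mean()

print(f"true mean: {true_mean:.0f}, MCAR mean: {mcar_mean:.0f}, MNAR mean: {mnar_mean:.0f}")
print(mar_rate_by_sex)
```

In this simulation the observed mean stays close to the true mean under MCAR, is pulled downwards under MNAR because high salaries are hidden more often, and under MAR the missing rate differs between the two observed groups.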

## Techniques to handle missing values
1. mean/median/mode replacement
2. random sample imputation
3. capturing NaN values with a new feature
4. end of distribution imputation
5. arbitrary value imputation
6. frequent category imputation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
df=pd.read_csv(r'C:\Users\HP\DS_LAB_10\titanic.csv')

In [8]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [10]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [11]:
df[df['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [12]:
df['cabin_null']=np.where(df['Cabin'].isnull(),1,0)

In [13]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,cabin_null
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


In [16]:
# fraction of rows with a missing Cabin value (multiply by 100 for a percentage)
df['cabin_null'].mean()

0.7710437710437711

In [17]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'cabin_null'],
      dtype='object')

In [18]:
df.groupby(['Survived'])['cabin_null'].mean()

Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

### 1. Mean median mode replacement
It assumes that the data is missing completely at random (MCAR).

###### advantages
1. easy to implement
2. robust to outliers (when the median is used; the mean is not)
3. fast

###### disadvantages
1. distorts the original variance
2. distorts the correlation (covariance) with other variables
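
The variance distortion listed above can be seen on a small synthetic example. This is a hedged sketch: the `ages` series is randomly generated (not the Titanic `Age` column), and the 20% missing rate is an assumption chosen to roughly match this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical synthetic ages; about 20% are blanked out completely at random.
ages = pd.Series(rng.normal(30, 14, size=1_000))
ages_missing = ages.mask(rng.random(1_000) < 0.2)

# Fill every gap with the median of the observed values.
ages_median = ages_missing.fillna(ages_missing.median())

std_observed = ages_missing.std()   # spread of the observed values only (NaNs skipped)
std_imputed = ages_median.std()     # spread after median imputation

print(f"std of observed values: {std_observed:.2f}")
print(f"std after median fill:  {std_imputed:.2f}")
```

Because every gap receives the same central value, the imputed column is more concentrated around the median than the observed data, so its standard deviation shrinks.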


In [20]:
df=pd.read_csv(r'C:\Users\HP\DS_LAB_10\titanic.csv', usecols=['Age','Fare','Survived'])

In [21]:
df.isnull().mean()

Survived    0.000000
Age         0.198653
Fare        0.000000
dtype: float64

In [22]:
def impute_nan(df,variable,median):
    # add a new column (e.g. "Agemedian") with NaNs replaced by the given median; modifies df in place
    df[variable + "median"] = df[variable].fillna(median)

In [23]:
median=df['Age'].median()

In [24]:
median

28.0

In [25]:
impute_nan(df,'Age',median)

In [26]:
df.head()

Unnamed: 0,Survived,Age,Fare,Agemedian
0,0,22.0,7.25,22.0
1,1,38.0,71.2833,38.0
2,1,26.0,7.925,26.0
3,1,35.0,53.1,35.0
4,0,35.0,8.05,35.0


### 2. Random Sample Imputation
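Take random observations from the non-missing values of the variable and use them to replace the NaN values.
It assumes that the data is missing completely at random (MCAR).
###### advantages
1. easy to implement
2. less distortion of the variance than mean/median replacement

###### disadvantages
1. randomness does not work in every situation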
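
The "less distortion" point can be checked on synthetic data before touching the Titanic columns. The helper `random_sample_fill` below is hypothetical and written only for this illustration. Unlike the notebook code, it samples with replacement (an assumption, so that it also works when there are more gaps than observed values) and returns a new series instead of adding a column.

```python
import numpy as np
import pandas as pd

def random_sample_fill(series, random_state=0):
    """Return a copy of `series` with NaNs replaced by random draws from its observed values."""
    filled = series.copy()
    missing = filled.isnull()
    # Draw one observed value per gap; replace=True is an assumption of this sketch.
    draws = series.dropna().sample(missing.sum(), replace=True, random_state=random_state)
    # Assign by position so the index labels of the drawn rows are ignored.
    filled[missing] = draws.to_numpy()
    return filled

rng = np.random.default_rng(1)

# Hypothetical synthetic ages with about 20% of the values removed at random.
ages = pd.Series(rng.normal(30, 14, size=1_000))
ages_missing = ages.mask(rng.random(1_000) < 0.2)

ages_random = random_sample_fill(ages_missing)
ages_median = ages_missing.fillna(ages_missing.median())

std_observed = ages_missing.std()
std_random = ages_random.std()
std_median = ages_median.std()

print(f"observed: {std_observed:.2f}, random fill: {std_random:.2f}, median fill: {std_median:.2f}")
```

The random fill draws from the same distribution as the observed values, so its standard deviation stays close to the observed one, while the median fill shrinks it.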

In [27]:
df=pd.read_csv(r'C:\Users\HP\DS_LAB_10\titanic.csv', usecols=['Age','Fare','Survived'])

In [29]:
df.isnull().mean()

Survived    0.000000
Age         0.198653
Fare        0.000000
dtype: float64

In [30]:
df

Unnamed: 0,Survived,Age,Fare
0,0,22.0,7.2500
1,1,38.0,71.2833
2,1,26.0,7.9250
3,1,35.0,53.1000
4,0,35.0,8.0500
...,...,...,...
886,0,27.0,13.0000
887,1,19.0,30.0000
888,0,,23.4500
889,1,26.0,30.0000


In [34]:
# random sample 
# gives 177 (no of nan values) random samples
df['Age'].dropna().sample(df['Age'].isnull().sum(),random_state=0)

423    28.00
177    50.00
305     0.92
292    36.00
889    26.00
       ...  
539    22.00
267    25.00
352    15.00
99     34.00
689    15.00
Name: Age, Length: 177, dtype: float64

In [35]:
df.isnull().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [38]:
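def impute_nan_2(df,variable,median):
    # median-imputed copy of the column, kept for comparison
    df[variable+"median"]=df[variable].fillna(median)
    # start the random-imputed column as a copy that still contains the NaNs
    df[variable+"random"]=df[variable]

    # draw one observed value for each missing entry (fixed seed for reproducibility)
    random_sample=df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)

    # pandas assigns by index label, so give the sample the index of the missing rows
    random_sample.index=df[df[variable].isnull()].index
    df.loc[df[variable].isnull(),variable+"random"]=random_sample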