**Why are some values missing?**
1. Respondents hesitate to share the information.
2. Survey responses are not always reliable or complete.
3. Salary: some men do not want to disclose their salary.
4. Age: some women do not want to disclose their age.
5. The person may have died, so the value is recorded as NaN.

**What are the different types of Missing Data?**

**1. Missing Completely at Random, MCAR:**<br>
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data are MCAR, there is no relationship between the missingness and any other values (such as the target feature), observed or missing, within the dataset. In other words, the missing data points are a random subset of the data. Nothing systematic makes some values more likely to be missing than others. <br>

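**2. Missing Not At Random (MNAR): systematic missing values** <br>
The probability of a value being missing depends on the missing value itself (or on other unobserved information). For example, people with a high salary may be less willing to report it. <br>

**3. Missing At Random (MAR)** <br>
The probability of a value being missing depends only on other observed variables in the dataset, not on the missing value itself.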
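
To make the three mechanisms concrete, here is a small sketch on synthetic data. The columns (`age`, `salary`), the sample size and all probabilities are invented for illustration and do not come from the Titanic dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic, fully observed data (hypothetical columns).
full = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "salary": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every salary has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends only on an observed column (age), not on salary itself.
mar_mask = rng.random(n) < np.where(full["age"] > 50, 0.40, 0.05)

# MNAR: the chance depends on the value that goes missing (high salaries are hidden).
mnar_mask = rng.random(n) < np.where(full["salary"] > 60_000, 0.50, 0.05)


def observed_mean(mask):
    """Mean salary computed from the rows that are still observed."""
    return full.loc[~mask, "salary"].mean()


print("true mean:", round(full["salary"].mean()))
print("MCAR mean:", round(observed_mean(mcar_mask)))
print("MAR  mean:", round(observed_mean(mar_mask)))
print("MNAR mean:", round(observed_mean(mnar_mask)))
```

Under MCAR the mean of the observed salaries stays close to the true mean. Under MNAR it is pulled down, because high salaries are removed more often. In this toy setup age and salary are independent, so the MAR mean is also close to the true mean; with correlated columns it would generally be biased unless age is taken into account.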

## All the techniques of handling missing values for numerical features

1. Mean/Median/Mode replacement
2. Iterative Imputer (predicting missing values)
3. KNN Imputer
4. Random Sample Imputation
5. Capturing NAN values with a new feature
6. End of Distribution imputation
7. Arbitrary imputation
8. Frequent categories imputation
9. Dropping rows with missing values
10. Dropping columns with missing values

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1. Mean/ Median/Mode replacement

**How it works ?** <br>
We replace the missing value of a particular column with the mean/mode/median of that column.

**When should we apply?** <br>
Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace each NaN with the mean or median of the observed values of that variable (or with the mode, the most frequent value, for discrete variables).

**Advantages**<br>
1. Easy to implement (the median variant is also robust to outliers)
2. Fast way to obtain a complete dataset

**Disadvantages**<br>
1. Distorts (reduces) the original variance
2. Distorts correlations with other variables

**Note:** <br>
If a feature has many outliers, we replace with the median since it is not affected by outliers; otherwise we use the mean.
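
A tiny illustration of the note above, using made-up numbers: one extreme value drags the mean far from the typical values, while the median barely moves. The same sketch also shows the loss of variance listed under the disadvantages.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with one extreme outlier and two missing values.
s = pd.Series([20.0, 22.0, 25.0, 27.0, 30.0, 500.0, np.nan, np.nan])

mean_filled = s.fillna(s.mean())      # mean is pulled up by the outlier
median_filled = s.fillna(s.median())  # median ignores the outlier

print("mean used for filling:  ", s.mean())    # 104.0
print("median used for filling:", s.median())  # 26.0

# Filling with a single constant adds points with zero spread,
# so the variance of the imputed column is lower than the observed variance.
print("variance before:", s.var())
print("variance after mean imputation:", mean_filled.var())
```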

In [61]:
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
df.isna().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [62]:
# 1.Way
# DYNAMIC
# Instead of mean we can use mode or median

def impute_nan(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].mean())  # reassign the column rather than relying on inplace=True
        
    return df

In [63]:
df = impute_nan(df, ['Age'])
df.isna().sum()

Survived    0
Age         0
Fare        0
dtype: int64

**Note:**<br>
strategy can be 'mean', 'median', 'most_frequent' (the mode), or 'constant' if we want to fill NaN values with a particular value given by fill_value. <br>
missing_values specifies how missing values are denoted. They can be denoted as np.nan, ' ', '?', etc.<br>
fill_value=x fills missing values with x when strategy='constant'. <br>
add_indicator=True adds new features that capture the NaN values (1 if the value is missing, 0 otherwise).<br>

**If we set add_indicator=True, the output contains extra indicator columns, so converting the array to a DataFrame with the original column names will raise an error.**
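
One possible way around the column mismatch is to extend the column list with one name per indicator column. The sketch below uses toy values rather than the real Titanic file, and the `_nan` suffix is only a naming choice. It relies on the fitted imputer's `indicator_.features_` attribute, which holds the indices of the features that had missing values during fit.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for the Titanic columns (values are made up).
X = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan],
                  "Fare": [7.25, 71.28, 7.92, 53.10]})

imputer = SimpleImputer(strategy="median", add_indicator=True)
arr = imputer.fit_transform(X)

# With add_indicator=True the output has one extra column per feature
# that contained missing values during fit, so the column list must grow too.
missing_cols = X.columns[imputer.indicator_.features_]
columns = list(X.columns) + [f"{c}_nan" for c in missing_cols]

X_imputed = pd.DataFrame(arr, columns=columns, index=X.index)
print(X_imputed)
```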

In [88]:
# 2.Way
from sklearn.model_selection import train_test_split
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [135]:
from sklearn.impute import SimpleImputer


def impute_nan_SingleImputer(X_train, X_test, missing_values=np.nan, strategy='mean', fill_value=None, add_indicator=False):
    
    imputer = SimpleImputer(missing_values=missing_values, strategy=strategy, fill_value=fill_value,
                            add_indicator=add_indicator)
    arr_train = imputer.fit_transform(X_train)
    new_X_train =pd.DataFrame(arr_train, columns=X_train.columns)

    arr_test = imputer.transform(X_test)
    new_X_test = pd.DataFrame(arr_test, columns=X_test.columns)
    
    return new_X_train, new_X_test

In [90]:
new_X_train, new_X_test = impute_nan_SingleImputer(X_train, X_test, missing_values=np.nan, strategy='mean')

In [91]:
new_X_train.head()

| | Age | Fare |
|---|---|---|
| 0 | 37.0 | 7.925 |
| 1 | 8.0 | 29.125 |
| 2 | 33.0 | 7.775 |
| 3 | 16.0 | 7.75 |
| 4 | 29.639409 | 14.4583 |


In [92]:
new_X_test.head()

| | Age | Fare |
|---|---|---|
| 0 | 23.0 | 263.0 |
| 1 | 29.639409 | 8.05 |
| 2 | 29.639409 | 14.4583 |
| 3 | 17.0 | 14.4583 |
| 4 | 51.0 | 8.05 |


### 2. Iterative Imputer

**How it works?**<br>
This approach predicts the missing values with an ML estimator that models each feature with missing values as a function of the other features. For a given feature, the training data are the rows where that feature is observed, and the rows where it is missing are the ones to predict. It does so in an iterated round-robin fashion: at each step, a feature column (with missing values) is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in turn and then repeated for max_iter imputation rounds. The results of the final imputation round are returned.

**Note:**<br>
Since it fits an ML estimator to the data, all features must already be in numerical format. <br>
Calling transform on the test data means the NaN values of the test data are imputed with the same estimators that were fitted on the training data. <br>
We can use different estimators with IterativeImputer to predict the missing values, such as DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, or BayesianRidge (the default).
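
As a rough sketch of swapping in a different estimator, the example below passes a DecisionTreeRegressor to IterativeImputer through its `estimator` parameter. The data are synthetic (column `b` is roughly twice column `a`), and the tree depth and split sizes are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# Made-up numeric data in which column "b" is roughly 2 * "a".
rng = np.random.default_rng(0)
a = rng.uniform(0, 10, size=200)
X = pd.DataFrame({"a": a, "b": 2 * a + rng.normal(0, 0.1, size=200)})
X.loc[::10, "b"] = np.nan  # hide every 10th value of "b"

X_train, X_test = X.iloc[:150], X.iloc[150:]

imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=0),
    max_iter=10,
    random_state=0,
)
# Fit on the training rows only, then reuse the fitted estimators on the test rows.
train_imputed = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns)
test_imputed = pd.DataFrame(imputer.transform(X_test), columns=X.columns)

print(test_imputed.isna().sum())
```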

In [104]:
df=pd.read_csv('titanic.csv')
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [108]:
from sklearn.model_selection import train_test_split
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [112]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

def impute_nan(X_train, X_test, max_iter=10, random_state=1):
    
    imputer = IterativeImputer(max_iter=max_iter, random_state=random_state)
    arr_train = imputer.fit_transform(X_train)
    new_X_train =pd.DataFrame(arr_train, columns=X_train.columns)

    arr_test = imputer.transform(X_test)
    new_X_test = pd.DataFrame(arr_test, columns=X_test.columns)
    
    return new_X_train, new_X_test

In [113]:
new_X_train, new_X_test = impute_nan(X_train, X_test, max_iter=10)

In [114]:
new_X_train.head()

| | Age | Fare |
|---|---|---|
| 0 | 29.0 | 211.3375 |
| 1 | 21.0 | 73.5 |
| 2 | 33.0 | 20.525 |
| 3 | 29.490884 | 39.6 |
| 4 | 17.0 | 110.8833 |


In [115]:
new_X_test.head()

| | Age | Fare |
|---|---|---|
| 0 | 41.0 | 39.6875 |
| 1 | 40.0 | 134.5 |
| 2 | 18.0 | 7.8542 |
| 3 | 32.0 | 8.3625 |
| 4 | 43.0 | 8.05 |


### 3. KNN Imputer

**How it works?**<br>
KNNImputer fills missing values using the k-Nearest Neighbors algorithm. <br>
A) For each record/row that contains a missing value, it measures the Euclidean distance between that record and all other records (by default a NaN-aware Euclidean distance computed over the coordinates present in both rows). <br>
B) It selects the k nearest neighbors, i.e. the k records with the lowest distance that have a value for the missing feature. <br>
C) It averages the values of that feature over the k nearest records and assigns the result to the missing value. With weights='distance', closer neighbors contribute more to the average.

**Note:**<br>
The dataset passed to KNN Imputer must be all in numerical format. <br>
Calling transform on the test data means the neighbors of each test row are searched among the training rows the imputer was fitted on.
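
The steps above can be traced on a very small matrix. The values below are made up; with `n_neighbors=2` and uniform weights, each gap is filled with the plain average of its two nearest rows that have a value in that column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small made-up matrix; np.nan marks the missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# Row 0, column 2: the two nearest rows that have column 2 are rows 1 and 2,
# so the imputed value is (3 + 5) / 2 = 4.
# Row 2, column 0: the two nearest rows that have column 0 are rows 1 and 3,
# so the imputed value is (3 + 8) / 2 = 5.5.
```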

In [128]:
from sklearn.model_selection import train_test_split
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [129]:
from sklearn.impute import KNNImputer

def impute_nan(X_train, X_test, missing_values=np.nan, n_neighbors=5, weights='uniform', metric='nan_euclidean',
               add_indicator=False):
    
    imputer = KNNImputer(missing_values=missing_values, n_neighbors=n_neighbors, 
                         weights=weights, metric=metric, add_indicator=add_indicator)
    arr_train = imputer.fit_transform(X_train)
    new_X_train =pd.DataFrame(arr_train, columns=X_train.columns)

    arr_test = imputer.transform(X_test)
    new_X_test = pd.DataFrame(arr_test, columns=X_test.columns)
    
    return new_X_train, new_X_test

In [130]:
new_X_train, new_X_test = impute_nan(X_train, X_test)

In [131]:
new_X_train.head()

| | Age | Fare |
|---|---|---|
| 0 | 45.0 | 13.5 |
| 1 | 34.2 | 0.0 |
| 2 | 29.0 | 7.75 |
| 3 | 33.0 | 15.85 |
| 4 | 48.0 | 7.8542 |


In [132]:
new_X_test.head()

| | Age | Fare |
|---|---|---|
| 0 | 33.4 | 7.775 |
| 1 | 34.0 | 7.75 |
| 2 | 45.0 | 7.75 |
| 3 | 54.0 | 59.4 |
| 4 | 0.75 | 19.2583 |


### 4. Random Sample Imputation

**How it works ?** <br>
Random sample imputation takes random observations from the observed values of a variable and uses them to replace the NaN values.

**When should it be used?** <br>
It assumes that the data are missing completely at random (MCAR).

**Advantages**
1. Easy to implement
2. Less distortion of the variance than mean/median imputation

**Disadvantages**
1. Randomness does not suit every situation (for example, when the data are not MCAR)
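
To illustrate the variance claim, the sketch below compares random sample imputation with mean imputation on synthetic ages. The helper `random_sample_impute` is a hypothetical function written for this example; it draws replacements from the observed training values only, so the test set is filled without looking at its own distribution.

```python
import numpy as np
import pandas as pd


def random_sample_impute(train, test, column, seed=0):
    """Fill NaNs in `column` with values drawn from the observed training values."""
    train, test = train.copy(), test.copy()
    observed = train[column].dropna()
    for part in (train, test):
        mask = part[column].isna()
        # replace=True lets this work even when there are more gaps than observed values
        draws = observed.sample(int(mask.sum()), replace=True, random_state=seed)
        part.loc[mask, column] = draws.to_numpy()
    return train, test


# Made-up data: 1,000 ages with roughly 20% of them removed at random.
rng = np.random.default_rng(1)
age = pd.Series(rng.normal(30, 14, size=1000))
age[rng.random(1000) < 0.2] = np.nan
df_demo = pd.DataFrame({"Age": age})
train, test = df_demo.iloc[:800], df_demo.iloc[800:]

train_rs, test_rs = random_sample_impute(train, test, "Age")
train_mean = train["Age"].fillna(train["Age"].mean())

print("std observed:     ", round(train["Age"].std(), 2))
print("std random sample:", round(train_rs["Age"].std(), 2))
print("std mean-imputed: ", round(train_mean.std(), 2))
```

Mean imputation always shrinks the standard deviation, while the randomly sampled values are drawn from the observed distribution and so tend to keep its spread.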

In [14]:
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
df.isna().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [16]:
def impute_nan(df, features): 
    for variable in features:
        random_sample=df[variable].dropna().sample(df[variable].isnull().sum(), random_state=1)
        random_sample.index=df[df[variable].isnull()].index
        df.loc[df[variable].isnull(), variable] = random_sample
        
    return df

In [17]:
df = impute_nan(df, ['Age'])
df.isna().sum()

Survived    0
Age         0
Fare        0
dtype: int64

### 5. Capturing NAN values with a new feature


**How it works?**<br>
We create a new column that is 1 if the value of the feature is missing and 0 otherwise. The missing values of the original column are then replaced with the mean/median/mode.

**When to apply?**<br>
It works well if the data are not missing completely at random.

**Advantages**
1. Easy to implement
2. Captures the importance of missing values

**Disadvantages**
1. Creates additional features (curse of dimensionality)

In [22]:
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
df.isna().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [23]:
# 1.Way
# instead of mean we can use median/mode.
def impute_nan(df, features):
    for feature in features:
        df[feature+'_nan'] = np.where(df[feature].isnull(),1,0)
        df[feature] = df[feature].fillna(df[feature].mean())
        
    return df

In [24]:
df = impute_nan(df, ['Age'])
df.head()

| | Survived | Age | Fare | Age_nan |
|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 0 |
| 1 | 1 | 38.0 | 71.2833 | 0 |
| 2 | 1 | 26.0 | 7.925 | 0 |
| 3 | 1 | 35.0 | 53.1 | 0 |
| 4 | 0 | 35.0 | 8.05 | 0 |


In [152]:
# 2.Way

from sklearn.model_selection import train_test_split
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [153]:
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion


def impute_nan(X_train, X_test, strategy='mean', missing_values=np.nan):
    # Stack the imputed features and the missing-value indicators side by side.
    transformer = FeatureUnion(transformer_list=[
        ('features', SimpleImputer(missing_values=missing_values, strategy=strategy)),
        ('indicators', MissingIndicator(missing_values=missing_values))])

    # Both transformers are unsupervised, so no target is needed for fitting.
    transform_X_train = transformer.fit_transform(X_train)
    transform_X_test = transformer.transform(X_test)

    return transformer, transform_X_train, transform_X_test

In [154]:
transformer, new_X_train, new_X_test = impute_nan(X_train, X_test, strategy='mean', missing_values=np.nan)

In [155]:
new_X_train

array([[17.        , 57.        ,  0.        ],
       [21.        ,  8.05      ,  0.        ],
       [14.5       , 14.4542    ,  0.        ],
       ...,
       [19.        , 26.2833    ,  0.        ],
       [34.        , 13.        ,  0.        ],
       [29.48216015, 52.        ,  1.        ]])

In [156]:
new_X_test

array([[ 26.        ,  16.1       ,   0.        ],
       [ 34.        ,  26.        ,   0.        ],
       [  9.        ,  31.3875    ,   0.        ],
       [ 16.        ,  18.        ,   0.        ],
       [ 16.        ,   9.2167    ,   0.        ],
       [ 45.5       ,   7.225     ,   0.        ],
       [ 37.        ,   9.5875    ,   0.        ],
       [ 29.48216015,   7.75      ,   1.        ],
       [  4.        ,  29.125     ,   0.        ],
       [ 29.48216015,   7.8792    ,   1.        ],
       [ 29.48216015,  22.3583    ,   1.        ],
       [ 50.        , 133.65      ,   0.        ],
       [ 29.48216015,   8.05      ,   1.        ],
       [ 29.48216015,   8.05      ,   1.        ],
       [ 54.        ,  59.4       ,   0.        ],
       [ 32.        ,  15.5       ,   0.        ],
       [ 29.48216015,   8.05      ,   1.        ],
       [ 17.        ,   8.6625    ,   0.        ],
       [ 29.        ,   9.5       ,   0.        ],
       [ 49.        ,   0.     

In [157]:
transformer

FeatureUnion(transformer_list=[('features', SimpleImputer()),
                               ('indicators', MissingIndicator())])

**Of course, we cannot use the transformer to make any predictions. We should wrap this in a Pipeline with a classifier (e.g., a DecisionTreeClassifier) to be able to make predictions.**

In [158]:
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = make_pipeline(transformer, DecisionTreeClassifier())
clf = clf.fit(X_train, y_train)
accuracy_score(y_test, clf.predict(X_test))

0.6547085201793722

### 6. End of Distribution imputation

**How it works?**<br>
We replace the missing values with a value at the far end of the variable's distribution, here the mean plus three standard deviations.

In [25]:
df=pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])
df.isna().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [27]:
def impute_nan(df, features):
    for variable in features:
        extreme = df[variable].mean() + 3 * df[variable].std()  # 3 standard deviations above the mean
        df[variable] = df[variable].fillna(extreme)
        
    return df

In [28]:
df = impute_nan(df, ['Age'])
df.isna().sum()

Survived    0
Age         0
Fare        0
dtype: int64

### 7. Arbitrary Value Imputation

**How it works?**<br>
It consists of replacing NAN by an arbitrary value that we choose.

**Advantages**<br>
1. Easy to implement
2. Captures the importance of missingness, if there is any

**Disadvantages** <br>
1. Distorts the original distribution of the variable
2. If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
3. Hard to decide which value to use

In [33]:
df=pd.read_csv("titanic.csv", usecols=["Age","Fare","Survived"])
df.isna().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [36]:
# arb_val is the arbitrary value used for imputation (e.g. 0, 100, 200).
def impute_nan(df, features, arb_val):
    for variable in features:
        df[variable] = df[variable].fillna(arb_val)
    
    return df

In [37]:
df = impute_nan(df, ['Age'], arb_val=0)  # here 0 is chosen as the arbitrary value
df.isna().sum()

Survived    0
Age         0
Fare        0
dtype: int64

## Techniques of handling missing values for categorical features

1. Mode Imputation
2. Add a new variable to capture nan values
3. Replacing nan values with new category 'Missing'
4. Dropping rows with missing values
5. Dropping columns with missing values
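
The two dropping options in the list above can be sketched with `DataFrame.dropna`. The frame below is made up to mimic the Titanic pattern (a mostly empty `Cabin` column and a single gap in `Embarked`), and the threshold of three present values is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Made-up frame: "Cabin" is mostly missing, "Embarked" has a single gap.
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Drop rows: reasonable when only a few rows are affected.
rows_dropped = demo.dropna(subset=["Embarked"])

# Drop columns: keep a column only if at least 3 of its 5 values are present.
cols_dropped = demo.dropna(axis=1, thresh=3)

print(rows_dropped.shape)          # (4, 3)
print(list(cols_dropped.columns))  # ['Sex', 'Embarked']
```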

### 1. Mode Imputation

**Advantages**
1. Easy to implement
2. Fast to apply

**Disadvantages**
1. Since we fill with the most frequent label, that label may become over-represented if there are many NaNs
2. It distorts the relationship between the most frequent label and other variables

In [41]:
df=pd.read_csv('titanic.csv')
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [42]:
def impute_nan(df, features):
    for variable in features:
        most_frequent_category = df[variable].mode()[0]
        df[variable] = df[variable].fillna(most_frequent_category)
        
    return df

In [43]:
df = impute_nan(df, ['Cabin'])
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         2
dtype: int64

### 2. Adding a variable to capture NAN

In [44]:
df=pd.read_csv('titanic.csv')
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [45]:
def impute_nan(df, features):
    for variable in features:
        # Store the indicator in a new column so the original values are not overwritten.
        df[variable+'_nan'] = np.where(df[variable].isnull(),1,0)
        mode = df[variable].mode()[0]
        df[variable] = df[variable].fillna(mode)

    return df

In [46]:
df = impute_nan(df, ['Cabin'])
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         2
Cabin_nan        0
dtype: int64

### 3. Replacing Nan values with new category 'Missing'

In [47]:
df=pd.read_csv('titanic.csv')
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [48]:
def impute_nan(df, features):
    for variable in features:
        df[variable] = np.where(df[variable].isnull(), "Missing", df[variable])
        
    return df

In [49]:
df = impute_nan(df, ['Cabin'])
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         2
dtype: int64