#### There are three different types of missing data

1) Missing completely at random (MCAR)
2) Missing at random (MAR)
3) Not missing at random (NMAR)

#### Popular ways to impute data for cross-sectional datasets
Source: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

###  1. Do Nothing:

We can just let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the 
best imputation values for the missing data based on the training loss reduction (e.g. XGBoost). 
Some others have the option to just ignore them (e.g. LightGBM with use_missing=false). However, other algorithms will 
throw an error complaining about the missing values (e.g. scikit-learn's LinearRegression). 
In that case, you will need to handle the missing data and clean it before feeding it to the algorithm.

### 2. Imputation Using (Mean/Median) Values:

This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within 
each column separately and independently from the others. It can only be used with numeric data.
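
To make the per-column behaviour concrete, here is a minimal sketch on a made-up 4x2 array (the data and variable names are assumptions for illustration). It computes the column statistics by hand and checks that SimpleImputer with strategy='mean' gives the same result.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (made up): each column has one missing entry
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [9.0, 50.0]])

# Column statistics computed by hand, ignoring NaN
col_means = np.nanmean(X, axis=0)      # [4.0, 30.0]
col_medians = np.nanmedian(X, axis=0)  # [2.0, 30.0]

# Fill each column's gaps with that column's own mean
X_mean = np.where(np.isnan(X), col_means, X)

# SimpleImputer performs the same per-column replacement
X_sklearn = SimpleImputer(strategy="mean").fit_transform(X)
print(np.allclose(X_mean, X_sklearn))
```

The first column shows why the choice of statistic matters: the value 9.0 pulls the mean up to 4.0 while the median stays at 2.0.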

#### Pros:
Easy and fast.
Works well with small numerical datasets.

#### Cons:
Doesn’t factor in the correlations between features; it works only at the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.   

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from math import sqrt
import random
import numpy as np
random.seed(0)

#Fetching the dataset
import pandas as pd
dataset = fetch_california_housing()
train, target = pd.DataFrame(dataset.data), pd.DataFrame(dataset.target)
train.columns = ['0','1','2','3','4','5','6','7']
train.insert(loc=len(train.columns), column='target', value=target)

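#Randomly replace 40% of the first column with NaN values
column = train['0']
print(column.size)
missing_pct = int(column.size * 0.4)
#Sample row positions without replacement so that exactly 40% of the rows are hit
i = random.sample(range(column.shape[0]), missing_pct)
#Assign through .loc on the DataFrame to avoid chained-assignment problems
train.loc[i, '0'] = np.nan
print(column.shape[0])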

#Impute the values using the scikit-learn SimpleImputer class
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer( strategy='mean') #for median imputation replace 'mean' with 'median'
imp_mean.fit(train)
imputed_train_df = imp_mean.transform(train)

20640
20640


### 3. Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values and YES!! It works with categorical features 
(strings or numerical representations) by replacing missing data with the most frequent values within each column.
    
#### Pros:
Works well with categorical features.

#### Cons:
It also doesn’t factor in the correlations between features.
It can introduce bias in the data.

Zero or constant imputation, as the name suggests, replaces the missing values with either zero or any constant 
value you specify.
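
A small sketch of both strategies on a categorical column follows. The colour values and the "unknown" placeholder are made up for illustration; strategy='most_frequent', strategy='constant' and fill_value are existing SimpleImputer options.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column (made up) with two missing entries
colours = np.array([["red"], ["blue"], [np.nan], ["red"], [np.nan]], dtype=object)

# Most frequent: gaps take the mode of the column ("red" here)
most_frequent = SimpleImputer(strategy="most_frequent")
filled_mode = most_frequent.fit_transform(colours)

# Constant: gaps take a placeholder of your choosing
constant = SimpleImputer(strategy="constant", fill_value="unknown")
filled_const = constant.fit_transform(colours)

print(filled_mode.ravel())
print(filled_const.ravel())
```

Filling with the mode makes "red" even more dominant than it was, which is the bias mentioned in the cons, whereas the constant placeholder keeps the missing entries identifiable as their own category.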


In [2]:
#Impute the values using the scikit-learn SimpleImputer class

from sklearn.impute import SimpleImputer
imp_freq = SimpleImputer( strategy='most_frequent') #for constant imputation use strategy='constant' with fill_value
imp_freq.fit(train)
imputed_train_df = imp_freq.transform(train)

### 4. Imputation Using k-NN:
    
The algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned 
a value based on how closely it resembles the points in the training set. This can be very useful in making predictions about 
the missing values by finding the k closest neighbours to the observation with missing data and then imputing them based 
on the non-missing values in the neighbourhood. 

One implementation (Impyute's fast_knn, as used in the source article) creates a basic mean impute, then uses the resulting 
complete list to construct a KDTree. It then uses the KDTree to compute the nearest neighbours (NN). After it finds the k-NNs, 
it takes their weighted average.
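
The mean-impute, KDTree, weighted-average procedure described above can be sketched roughly as follows. This is not the code of any particular library: the helper name kdtree_knn_impute, the inverse-distance weighting and the toy data are all assumptions made for illustration, and only sklearn.neighbors.KDTree is an existing API.

```python
import numpy as np
from sklearn.neighbors import KDTree

def kdtree_knn_impute(X, k=2, eps=1e-12):
    """Hypothetical helper: mean-impute, build a KDTree, then refill each gap
    with the inverse-distance weighted average of its k nearest rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: basic mean impute so that every row is complete
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    # Step 2: build the tree on the completed rows
    tree = KDTree(filled)
    result = filled.copy()
    for row in np.flatnonzero(missing.any(axis=1)):
        # Step 3: ask for k + 1 neighbours, since the row is its own nearest neighbour
        dist, idx = tree.query(filled[row:row + 1], k=k + 1)
        dist, idx = dist[0], idx[0]
        keep = idx != row
        dist, idx = dist[keep][:k], idx[keep][:k]
        # Step 4: inverse-distance weights (an assumed weighting scheme)
        weights = 1.0 / (dist + eps)
        for col in np.flatnonzero(missing[row]):
            result[row, col] = np.average(filled[idx, col], weights=weights)
    return result

# Toy data (made up): the second row is missing its second feature
X_toy = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [3.0, 6.0],
                  [10.0, 20.0]])
X_done = kdtree_knn_impute(X_toy, k=2)
print(X_done)
```

With k=2 the gap is filled from the rows [3, 6] and [1, 2], so the distant row [10, 20] does not contribute directly, although it still influences the initial mean impute that the distances are computed from.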

#### Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).

#### Cons:
Computationally expensive. KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)

Let's discuss KNN Imputer with an example:

In [3]:
!pip3 install -U scikit-learn



#### How does KNN Imputer work?
According to the scikit-learn docs: Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close. By default, a euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors.
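
To see what "a euclidean distance metric that supports missing values" means in practice, the sketch below computes the distance by hand and compares it with scikit-learn's nan_euclidean_distances. The helper name nan_euclidean and the two sample rows are assumptions for illustration. The rule, as documented by scikit-learn, is to use only the coordinates present in both rows and scale the squared distance by (total coordinates / present coordinates).

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

def nan_euclidean(x, y):
    """Distance over coordinates present in both rows, rescaled to all coordinates."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    present = ~np.isnan(x) & ~np.isnan(y)
    weight = x.size / present.sum()
    return np.sqrt(weight * np.sum((x[present] - y[present]) ** 2))

# Two sample rows (made up); only the first two coordinates are present in both
a = [112, 30, np.nan]
b = [90, 45, 40]

by_hand = nan_euclidean(a, b)  # sqrt(3/2 * (22**2 + 15**2))
from_sklearn = nan_euclidean_distances([a], [b])[0, 0]
print(by_hand, from_sklearn)
```

The rescaling keeps distances comparable between pairs of rows that share different numbers of observed features.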

In [11]:
# Creating Dataframe with Missing Values
import numpy as np
import pandas as pd

df= {'first': [112, 90, np.nan, 89],
    'second': [30,45,56, np.nan],
    'Third':[np.nan, 40, 80, 98]}

df= pd.DataFrame(df)
df

   first  second  Third
0  112.0    30.0    NaN
1   90.0    45.0   40.0
2    NaN    56.0   80.0
3   89.0     NaN   98.0


In [15]:
# Initialize KNNImputer
# You can define your own n_neighbors value (as is typical of the KNN algorithm)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)

df_filled = imputer.fit_transform(df)
df_filled

array([[112. ,  30. ,  69. ],
       [ 90. ,  45. ,  40. ],
       [100.5,  56. ,  80. ],
       [ 89. ,  43. ,  98. ]])