In [1]:
# use kaggle Titanic data set
# https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/

## 1. Delete

In [None]:
data.isnull()

In [None]:
data.dropna(inplace=True)
# sum() must be called; confirms that no missing values remain
data.isnull().sum()

Pros:
* Removing every record with missing values leaves a complete dataset, which can give a robust and accurate model when few records are affected
* Deleting a row or column that carries little information costs little, since it has low weight in the model

Cons:
* Loss of information and data
* Works poorly if a large share of the dataset (say 30%) is missing
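
The trade-off above can be checked before anything is dropped. A minimal sketch, assuming a small made-up frame (`toy`, its values, and the 30% cutoff are illustrative, not taken from the Titanic data): columns that are mostly empty are removed first, then the few remaining incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()

# Drop columns where more than 30% of the values are missing (assumed cutoff)
sparse_cols = missing_share[missing_share > 0.30].index
reduced = toy.drop(columns=sparse_cols)

# Then drop the rows that still contain a missing value
complete = reduced.dropna()
print(missing_share)
print(complete)
```

Dropping the sparse column first keeps four of the five rows here; calling `dropna()` directly on `toy` would leave none.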

## 2. Replacing With Mean/Median/Mode

In [None]:
data['Age'].isnull().sum()

In [None]:
data['Age'].mean()

In [None]:
# fillna avoids the NumPy dependency (np.NaN was removed in NumPy 2.0)
data['Age'].fillna(data['Age'].mean()).head(15)

In [None]:
# or with median
data['Age'].median()

In [None]:
# or with mode(); it returns a Series because several values can tie
data['Age'].mode()

Pros:
* This is a better approach when the data size is small
* It avoids the data loss that comes from removing rows and columns

Cons:
* Imputed approximations add bias and distort the variance
* Works poorly compared to multiple-imputation methods
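
The three replacements can be compared side by side. A minimal sketch, assuming a short made-up `ages` series rather than the real `Age` column:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; two entries are missing
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 22.0, np.nan], name="Age")

mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())
# mode() returns a Series (ties are possible), so take its first entry
mode_filled = ages.fillna(ages.mode().iloc[0])

# Mean imputation keeps the mean but changes the spread of the column
print(ages.std(), mean_filled.std())
```

Printing the standard deviation before and after is a quick way to see how much an imputation has altered the column.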

## 3. Assigning A Unique Category

In [None]:
data['Cabin'].head(10)

In [None]:
data['Cabin'].fillna('U').head(10)

Pros:
* Adds only one extra category, so variance stays low after one-hot encoding, since the feature is categorical
* Negates the loss of data by adding a unique category

Cons:
* Adds less variance
* Adds another feature to the model while encoding, which may result in poor performance
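
The extra feature mentioned above becomes visible once the column is encoded. A minimal sketch, assuming a made-up `cabin` series and `pd.get_dummies` as the encoder (the notebook does not say which encoder it would use):

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values; two are missing
cabin = pd.Series(["C85", np.nan, "E46", np.nan], name="Cabin")

# 'U' marks an unknown cabin, as in the cell above
filled = cabin.fillna("U")

# One-hot encoding turns the placeholder into its own indicator column
encoded = pd.get_dummies(filled, prefix="Cabin")
print(encoded)
```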

## 4. Predicting The Missing Values

In [None]:
# use linear regression

Pros:
* Imputing the missing variable is an improvement as long as the bias it introduces is smaller than the omitted-variable bias
* Yields unbiased estimates of the model parameters

Cons:
* Bias also arises when an incomplete conditioning set is used for a categorical variable
* Considered only as a proxy for the true value
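
The cell above only names linear regression, so here is one way it could look. A minimal sketch, assuming a made-up frame in which `Age` is predicted from `Fare` alone (the choice of predictor is an assumption for illustration, not the notebook's method), fitted with `np.polyfit`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Age is missing in two rows
toy = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Age": [20.0, 25.0, np.nan, 35.0, np.nan],
})

known = toy[toy["Age"].notnull()]
missing = toy["Age"].isnull()

# Least-squares fit of Age = slope * Fare + intercept on the complete rows
slope, intercept = np.polyfit(known["Fare"], known["Age"], deg=1)

# Fill only the missing entries with the fitted values
toy.loc[missing, "Age"] = slope * toy.loc[missing, "Fare"] + intercept
print(toy)
```

The filled values lie exactly on the fitted line, which is why they are only a proxy: the scatter of real ages around that line is lost.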

## 5. Using Algorithms Which Support Missing Values

In [None]:
# KNN (though not the scikit-learn implementation)
# Random Forests

Pros:
* Does not require creation of a predictive model for each attribute with missing data in the dataset
* Correlation of the data is neglected

Cons:
* Is very time-consuming, which can be critical in data mining where large databases are processed
* The result depends on the choice of distance function (Euclidean, Manhattan, etc.), and these do not always yield a robust result
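
To make the distance-function point concrete, here is a minimal sketch of a KNN classifier that tolerates missing values, assuming made-up data and hypothetical helper names (`nan_euclidean`, `knn_predict`). It measures distance only over the features present in both rows and, as a simplification, does not rescale for the number of features skipped.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over the features observed in both rows."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # no feature in common, treat as infinitely far
    return np.sqrt(np.sum((a[shared] - b[shared]) ** 2))

def knn_predict(X, y, query, k=3):
    """Majority label among the k rows of X nearest to query."""
    dists = np.array([nan_euclidean(row, query) for row in X])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training data with gaps in both features
X = np.array([
    [1.0, 1.0],
    [1.2, np.nan],
    [0.9, 1.1],
    [5.0, 5.0],
    [np.nan, 5.2],
])
y = np.array([0, 0, 0, 1, 1])

# The query row is itself incomplete
pred = knn_predict(X, y, np.array([1.1, np.nan]), k=3)
print(pred)
```

Every prediction scans all training rows, which is the time cost listed above; swapping the squared difference for an absolute difference would give the Manhattan variant.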