Hi everyone. 

I was literally overwhelmed by the responses to my previous [notebook](https://www.kaggle.com/twinkle0705/an-interactive-eda-of-electricity-consumption) and the [dataset](https://www.kaggle.com/twinkle0705/state-wise-power-consumption-in-india) I uploaded last week. Your support and suggestions keep me going, so keep them coming.

As it is said - data science is not only about model fitting and great accuracy scores through tuning, it's more about what goes on behind the curtains. Data preprocessing is the backstage manager which makes your model work best when you've tuned it to the best parameters. 

So, here is my first take on data preprocessing, where I cover various methods of filling missing values in your data. Handling missing values is usually the first step in data preprocessing, so it is my first step too. I have tried to cover all the methods I know of, and I hope to improve this notebook with more advanced techniques in future versions.

Comment below for suggestions on any other methods you know of to help me and others alike. 
Do upvote if you find my effort worth it.

# MISSING VALUES

The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways. 

**Why do we need to treat missing data?**

Missing data in the training data set can reduce the power / fit of a model, or can lead to a biased model because the behavior of a variable and its relationship with other variables are not analysed correctly. This can lead to wrong predictions or classifications.


**Why does the data have missing values?**

* Missing Completely At Random (MCAR) - The reason for missingness is totally independent of the predictors and response i.e., the probability of missingness is the same for each unit in your sample.

* Missing At Random (MAR) - Missingness depends only on other available information (e.g., other predictors). For example, when we collect data on age, women may have more missing values than men.

* Missing Not At Random (MNAR): depends on unobserved predictors - This is the case when the missing values are not random and are related to an unobserved input variable. For example, in a medical study, if a particular diagnostic causes discomfort, there is a higher chance of dropping out of the study. Here "discomfort" is not an input variable.

* Missing Not At Random II (MNAR): depends on the missing value itself - Missingness depends on the (potentially missing) variable itself. For example, people with higher earnings are less likely to reveal them.
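To make the mechanisms above more concrete, here is a minimal sketch that simulates them on synthetic data. The column names, the missingness rates (10%, 30%, 5%) and the random seed are all assumptions chosen for illustration; they do not come from either dataset used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: one fully observed column and one column we will punch holes in.
df = pd.DataFrame({
    "is_female": rng.random(n) < 0.5,
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),
})

# MCAR: every row has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of being missing depends on an observed column (is_female).
mar_mask = rng.random(n) < np.where(df["is_female"], 0.30, 0.05)

# MNAR: the chance of being missing depends on the hidden value itself
# (incomes above the median are withheld more often).
is_high = df["income"] > df["income"].median()
mnar_mask = rng.random(n) < np.where(is_high, 0.30, 0.05)

income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_mnar = df["income"].mask(mnar_mask)

# Missing rate per group under MAR, and the mean before/after MNAR masking.
print(income_mar.isna().groupby(df["is_female"]).mean())
print(df["income"].mean(), income_mnar.mean())
```

Under MNAR the observed mean ends up below the true mean, because the high values are the ones that tend to go missing. That is why this mechanism is the hardest to correct with simple imputation.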

Let us go ahead and look at the various ways we can detect and handle missing data.

# 1.Loading the libraries and data # 

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns

Reading the two datasets that are going to be used to demonstrate various methods of handling missing values.

In [None]:
data = pd.read_csv('../input/loan-prediction/train.csv')
cat_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

Taking a first look at our data gives us a rough idea about the variables and the kind of data it holds. 

In [None]:
data.head()

The .info() function gives us the number of non-null data points in each column and their datatype.

In [None]:
data.info()

Sometimes the datatype of a column seems inconsistent with the kind of data it holds. For example, the 'Dependents' column refers to the number of dependents of the applicant, so an integer datatype makes more sense for it than object. Such problems can be solved by typecasting the column.
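As a rough sketch of such a typecast, the snippet below uses a small made-up series rather than the loan data. It assumes the counts are stored as strings and that a capped category such as '3+' may appear (an assumption for illustration). Stripping the '+' treats it as a plain 3, which loses the "or more" meaning, so whether that is acceptable depends on the analysis.

```python
import pandas as pd

# Hypothetical values standing in for a 'Dependents'-style column.
dependents = pd.Series(["0", "1", "3+", None, "2"], name="Dependents")

# Remove the trailing '+' so that every non-missing entry is a plain number.
cleaned = dependents.str.replace("+", "", regex=False)

# Convert to numbers, then to pandas' nullable integer type,
# which can hold integers and missing values in the same column.
as_int = pd.to_numeric(cleaned).astype("Int64")

print(as_int)
```

The nullable `Int64` dtype is used here because a regular integer column cannot store missing values.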

# 2.Checking for Missing values visually # 

We can import the missingno library that can be used for graphical analysis of missing values and it is compatible with pandas.

Using the matrix, we can quickly find the pattern of missingness in the dataset. The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.

In [None]:
import missingno as msno
msno.matrix(data)

This bar chart shows how many values are present in each column, so the shorter the bar, the more values are missing. We can also change the colour and size of the figure as we wish.

In [None]:
msno.bar(data, color = 'y', figsize = (10,8))

The heatmap shows the correlation of missingness between every pair of columns.

* A value near -1 means that if one variable appears, the other variable is very likely to be missing.
* A value near 0 means there is no dependence between the occurrence of missing values of the two variables.
* A value near 1 means that if one variable appears, the other variable is very likely to be present.

In [None]:
msno.heatmap(data)
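To make the heatmap numbers concrete, the same kind of nullity correlation can be computed with plain pandas by correlating the boolean missingness masks. This is a sketch on a tiny made-up frame (the column names `a` to `d` are hypothetical), not a reproduction of the plot for the loan data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    # b is missing in exactly the same rows as a.
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    # c is present only in the rows where a is missing.
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
    # d overlaps with a in only one missing row.
    "d": [1.0, 2.0, np.nan, np.nan, 5.0, 6.0],
})

# True where a value is missing; correlating these masks gives
# the pairwise nullity correlation.
nullity_corr = toy.isnull().corr()
print(nullity_corr.round(2))
```

Here `a` and `b` correlate at 1 (always missing together) and `a` and `c` at -1 (one is present exactly when the other is missing).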

A dendrogram plot is a tree diagram of missingness that reveals trends deeper than the pairwise ones visible in the correlation heatmap.

For detailed explanation you can refer to the link at the end of the notebook.

In [None]:
ax = msno.dendrogram(data)

# 3.Checking for Missing values numerically

In [None]:
data.describe()

Missing values are frequently indicated by out-of-range entries; perhaps a negative number in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0. Our current dataset does not show an error of that sort, but this function comes in handy if we want to check the inconsistency in the numeric values.
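If such placeholder entries did show up, one way to handle them is to convert them to NaN so that the usual missing-value tools can see them. The sketch below uses made-up columns and assumes -1 and 0 are the placeholders; in a real dataset the sentinel values have to be confirmed from the data description.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "age": [34, -1, 52, 41, -1],              # -1 used as a placeholder (assumed)
    "blood_pressure": [120, 0, 135, 0, 110],  # 0 is not a plausible reading (assumed)
})

# describe() exposes the suspicious minimum values.
print(readings.describe().loc["min"])

# Replace each column's sentinel with NaN; other values are left untouched.
cleaned = readings.replace({"age": {-1: np.nan}, "blood_pressure": {0: np.nan}})
print(cleaned.isnull().sum())
```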

In [None]:
data.isnull()

The isnull function returns a dataframe that shows True for each missing value. Scanning this output by eye is impractical for large datasets, but with a small tweak to the code we can create a table that shows the total number and the percentage of missing values in each column.

In [None]:
total = data.isnull().sum().sort_values(ascending=False)
percent = ((data.isnull().sum()/data.isnull().count())*100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

Just as isnull().sum() counts the null values in each column, notnull().sum() counts the non-null values.

In [None]:
data.notnull().sum().sort_values(ascending=False)

# 4.Handling Missing values

**4.1 NUMERICAL VALUES**

**4.1.1 Deletion**

One way to handle missing values is to drop them. We can drop all rows with 'any' NAs in a particular column (should be done only when proportion of missing values is very small or else we can lose information). Below is an example of how to do it.

In [None]:
data.dropna(subset = ['Loan_Amount_Term'], axis = 0, how = 'any', inplace = True)

If we find that most of the values of a column are missing, we might want to drop the column altogether. Keep in mind that dropping data is not a good habit, as we might lose information, so we should drop a column only if it is insignificant and most of its values are missing. Below is an example.

In [None]:
# data.drop(['column_name'], axis = 1, inplace = True)
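Choosing which columns to drop can also be done from the missing-value fraction instead of by name. The following is a minimal sketch on a made-up frame; the 60% cut-off is an arbitrary assumption, and a column above it should still be checked for significance before it is removed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],
    "complete": [1, 2, 3, 4, 5],
})

max_missing_fraction = 0.6  # hypothetical threshold, tune per dataset

# Fraction of missing values per column (mean of the boolean mask).
missing_fraction = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold.
cols_to_drop = missing_fraction[missing_fraction > max_missing_fraction].index

trimmed = toy.drop(columns=cols_to_drop)
print(list(trimmed.columns))
```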

**4.1.2 Imputation**

The other way to handle missing numeric data is to fill the columns. We can use a summary statistic like the mean, median or mode to fill in, depending on the kind of values that the column holds.

By checking the skewness and presence of outliers in the data, we can decide whether to fill with the mean or the median.

When the data is roughly normally distributed, the mean is a good choice. But if the data is skewed or has extreme values, the median is a better representative of our sample, because outliers pull the mean towards them.

In [None]:
columns = ['LoanAmount','ApplicantIncome','CoapplicantIncome']
sns.pairplot(data[columns])

In [None]:
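# Assign the result back; inplace=True on a selected column may not update `data` in newer pandas
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].median())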
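The mean-versus-median rule of thumb can be sketched as a small helper. The function name `fill_numeric` and the skewness cut-off of 1.0 are assumptions made for this example; the plots remain the better guide for a real column.

```python
import numpy as np
import pandas as pd

def fill_numeric(series, skew_threshold=1.0):
    """Fill NaNs with the median if the column is strongly skewed, else with the mean."""
    # Series.skew() ignores NaN values by default.
    if abs(series.skew()) > skew_threshold:
        fill_value = series.median()
    else:
        fill_value = series.mean()
    return series.fillna(fill_value)

symmetric = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, np.nan])
skewed = pd.Series([1.0, 2.0, 3.0, 4.0, 500.0, np.nan])  # one extreme value

filled_symmetric = fill_numeric(symmetric)  # uses the mean of the observed values
filled_skewed = fill_numeric(skewed)        # uses the median of the observed values

print(filled_symmetric.iloc[-1], filled_skewed.iloc[-1])
```

For the skewed series the mean of the observed values is 102, which describes none of the actual data points, while the median stays at 3.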

mode() returns a series, because a column can have more than one most frequent value, so to avoid errors we should use .mode()[0] to get a single value.

In [None]:
data['Dependents'] = data['Dependents'].fillna(data['Dependents'].mode()[0])

**4.2 CATEGORICAL VALUES**

Categorical values need to be treated differently. One way is to check the most frequent values in a particular column and fill the column with that value.

In [None]:
data['Gender'].value_counts()

In [None]:
data['Gender'] = data['Gender'].fillna('Male')

The two steps can be merged into a single line of code:

In [None]:
data['Gender'] = data['Gender'].fillna(data['Gender'].value_counts().index[0])

Another way to fill categorical values is to use ffill or bfill. Forward-fill propagates the last observed non-null value forward until another non-null value is encountered. Backward-fill propagates the next observed non-null value backward until another non-null value is met.

When we have a large dataset, I prefer filling the values using ffill/bfill, as it roughly preserves the distribution of the categories instead of inflating the most frequent one.

Below is an example.

In [None]:
# .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
data['Self_Employed'] = data['Self_Employed'].ffill()
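A small made-up series shows what the two directions do, including the edge case each one leaves behind: forward-fill cannot fill a gap at the very start, and backward-fill cannot fill one at the very end. The values are hypothetical and only meant to illustrate the behaviour.

```python
import pandas as pd

answers = pd.Series([None, "Yes", None, None, "No", None], dtype="object")

forward = answers.ffill()   # the first entry stays missing: nothing comes before it
backward = answers.bfill()  # the last entry stays missing: nothing comes after it

# Chaining both directions fills the leftover gaps at either end.
both = answers.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```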

**4.3 Predict the Missing values**

By using the columns or features that don't have missing values, we can predict the null values in other columns using Machine Learning algorithms. In this case we divide our data into 2 sets: the rows where the column of interest is present, and the rows where it is missing. The former becomes our training data, whereas the latter becomes the testing data, and the variable with missing data is our target variable.

One drawback of this method is that if there is no relationship between attributes in the data set and the attribute with missing values, then our model will not be precise for estimating missing values.
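The split-train-predict idea can be sketched with a simple linear regression fitted by least squares in NumPy. Everything here is an assumption for illustration: the data is synthetic, the column names are made up, and a straight-line model is just the simplest stand-in for whichever algorithm suits the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic data where loan_amount really does depend on income.
income = rng.uniform(2_000, 10_000, size=n)
loan_amount = 0.02 * income + rng.normal(0, 5, size=n)
toy = pd.DataFrame({"income": income, "loan_amount": loan_amount})

# Knock out 30 target values to play the role of missing data.
toy.loc[rng.choice(n, size=30, replace=False), "loan_amount"] = np.nan

# Rows with the target present are the training set; the rest are to be predicted.
known = toy[toy["loan_amount"].notnull()]
unknown = toy[toy["loan_amount"].isnull()]

# Fit loan_amount = intercept + slope * income by least squares.
X_train = np.column_stack([np.ones(len(known)), known["income"]])
coef, *_ = np.linalg.lstsq(X_train, known["loan_amount"].to_numpy(), rcond=None)

# Predict the missing targets and write them back into the frame.
X_missing = np.column_stack([np.ones(len(unknown)), unknown["income"]])
toy.loc[unknown.index, "loan_amount"] = X_missing @ coef

print(coef)
```

The fill works well here only because the synthetic target was built from `income`. With unrelated features the predictions would be no better than a constant, which is the drawback described above.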

# 5. A word of caution

We are going to use the House Prices: Advanced Regression Techniques dataset to point out a mistake that people might make while handling missing values. Let us have a quick look at the data and the number of missing values that it holds.

In [None]:
cat_data.head()

In [None]:
total = cat_data.isnull().sum().sort_values(ascending=False)
percent = ((cat_data.isnull().sum()/cat_data.isnull().count())*100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

When we look at the table, we notice that the column PoolQC has 99% of its values missing, and we might be tempted to delete that column altogether. But if we look at the description of the data, PoolQC corresponds to the quality of the pool, and a null value indicates the absence of a pool, which is a factor in determining the price of a house. Hence, before dropping any such data, we should always check the data description to see what a missing value actually means.

Filling the missing values with "none" means that presence of "none" will indicate absence of a pool. 

In [None]:
cat_data['PoolQC'] = cat_data['PoolQC'].fillna("none")
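One plausible reason such nulls appear in the first place: pandas' read_csv treats the string NA as missing by default, so a file that uses NA as a real category gets loaded as NaN. The sketch below uses made-up inline CSV text, not the actual competition file, and assumes NA is meant as the "no pool" label.

```python
import io
import pandas as pd

csv_text = "Id,PoolQC\n1,NA\n2,Gd\n3,NA\n"

# Default behaviour: the string 'NA' is parsed as a missing value.
parsed_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps 'NA' as an ordinary string category.
parsed_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(parsed_default["PoolQC"].isnull().sum())
print(parsed_raw["PoolQC"].tolist())
```

Note that `keep_default_na=False` applies to every column, so numeric columns with genuinely missing entries would then need their own `na_values` handling.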

# 6. Conclusion

In this notebook, I have tried to demonstrate numerous ways of handling missing data. We need to keep experimenting and draw insights from our data to know which method will give best results. 

References: https://github.com/ResidentMario/missingno