Handling Missing Values
=====================
Anyone learning data science or machine learning needs to know how to handle data with missing values.
It is one of the most important steps in preparing data for analysis.
This guide covers the two most popular approaches:
- Removing the data
- Imputing the data

Prerequisites: Python, plus the pandas and NumPy libraries.

Let's start with an example.
The following code imports the libraries used throughout and creates a DataFrame.

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6],
                   'col2': [7, 8, np.nan, 10, 11, 12],
                   'col3': [np.nan, 14, np.nan, 16, 17, 18]})
df
```

We now have a DataFrame, `df`, to use as our dataset.
Let's first check its size.
 
```
df.shape[0]  # Number of rows; df.shape[1] gives the number of columns.
```
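
As a quick check of the row/column convention, here is a minimal sketch that unpacks `df.shape` into both counts. It rebuilds the same `df` so the snippet runs on its own; the names `n_rows` and `n_cols` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6],
                   'col2': [7, 8, np.nan, 10, 11, 12],
                   'col3': [np.nan, 14, np.nan, 16, 17, 18]})

# df.shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 6 3
```
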
Note: Keep in mind that index `0` of `df.shape` refers to rows and index `1` refers to columns.

Once we have the row and column counts, we can check the dataset for missing values.
```
df.isnull().any()  # True for each column that contains at least one missing value.
```
Now we know the dataset has some missing values, so let's check how many there are in each column.
```
df.isnull().sum()  # Number of missing values in each column.

no_nulls = set(df.columns[df.isnull().any()])  # Set of columns that contain missing values.
print(no_nulls)
```
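
To make the per-column picture concrete, here is a minimal sketch that reports both the count and the fraction of missing values in each column. It assumes the same `df` as above (rebuilt here so the snippet is self-contained); `missing_count`, `missing_fraction`, and `summary` are illustrative names, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6],
                   'col2': [7, 8, np.nan, 10, 11, 12],
                   'col3': [np.nan, 14, np.nan, 16, 17, 18]})

# isnull() gives a boolean mask; summing it counts the True values per column.
missing_count = df.isnull().sum()

# The mean of a boolean mask is the fraction of True values per column.
missing_fraction = df.isnull().mean()

# Total number of missing cells in the whole DataFrame.
total_missing = int(missing_count.sum())

summary = pd.DataFrame({'missing': missing_count, 'fraction': missing_fraction})
print(summary)
print('total missing:', total_missing)
```

The fraction column is handy when deciding whether a column is too sparse to keep.
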
We now have an idea of what our dataset looks like and how many missing values it has. The next step is to either
remove the missing values or impute them.
Let's start with removing them.

Method 1: Removing Missing Values
---------------------------------

```
new_df = df.dropna(subset=['col2'], axis=0)  # Drop the rows where col2 is NaN.
new_df
```
To drop columns instead of rows, pass `axis=1`, for example `df.dropna(axis=1)`. This removes every column that contains a missing value. (With `axis=1`, `subset` refers to row labels rather than column names.)

You can remove rows or columns, depending on your dataset.

Removing data can also affect your results, so always be careful about what you remove and what you keep.
If 90% to 100% of the values in a row or column are missing, you can consider removing that row or column.

**Note: Be careful when removing data. Check that the removal does not hurt your model.**
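
The row and column options above can be sketched as follows. This is a minimal example on the same `df` (rebuilt so it runs on its own). The `max_missing` threshold of 0.2 is an assumption chosen only so that the tiny example drops something; the text above suggests a far higher cutoff, around 0.9, for real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6],
                   'col2': [7, 8, np.nan, 10, 11, 12],
                   'col3': [np.nan, 14, np.nan, 16, 17, 18]})

# Drop every column that contains at least one NaN.
# Here all three columns have a NaN, so no columns are left.
no_nan_cols = df.dropna(axis=1)

# Drop only the rows in which every value is missing (row 2 in this dataset).
not_all_missing = df.dropna(axis=0, how='all')

# Keep only the columns whose missing fraction is at or below a threshold.
max_missing = 0.2  # illustrative value, not a recommendation
keep = df.columns[df.isnull().mean() <= max_missing]
trimmed = df[keep]

print(no_nan_cols.shape)            # (6, 0)
print(list(not_all_missing.index))  # [0, 1, 3, 4, 5]
print(list(trimmed.columns))        # ['col2']
```

Filtering on the missing fraction gives finer control than `dropna(axis=1)`, which drops a column even if it has a single missing value.
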

Method 2: Imputing Missing Values
---------------------------------
Imputation means that you fill in a value where one was originally missing.
Below are some common methods to impute the data:
 - **Mean**: take the mean of the column's values and insert it.
 - **Mode**: if you are working with categorical data, use the most frequent value (the **mode**) of the column. For a numeric variable with outliers, the mean gets pulled toward the outliers, so a more robust statistic such as the mode or the median is a safer fill value.
 - Constant: impute 0, a very small number, or a very large number to differentiate missing values from other values.
 - **ML model**: train a model on the existing values, predict the missing ones, and insert the predictions.
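
A few of the options in the list above can be sketched on a small, hypothetical dataset. The `toy` DataFrame, its `color` and `price` columns, and the sentinel value `-1` are all made up for illustration. The median line is an addition for comparison and is not one of the methods listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a categorical column and a numeric column with an outlier.
toy = pd.DataFrame({'color': ['red', 'blue', None, 'red', 'red'],
                    'price': [10.0, 12.0, np.nan, 11.0, 500.0]})

# Mode: mode() returns a Series (there can be ties), so take its first entry.
color_mode = toy['color'].fillna(toy['color'].mode()[0])

# Mean vs. median on a column with an outlier:
# the mean is dragged up by 500.0, the median is not.
price_mean = toy['price'].fillna(toy['price'].mean())
price_median = toy['price'].fillna(toy['price'].median())

# Constant: a sentinel that cannot be confused with a real price.
price_flag = toy['price'].fillna(-1)

print(color_mode.tolist())
print(price_mean[2], price_median[2], price_flag[2])
```

Comparing `price_mean[2]` with `price_median[2]` shows how strongly a single outlier can distort a mean-based fill value.
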
 

Which method works best depends on the data you have, and you may see your model's scores differ from one method to another.
So I would advise analysing the data well before deciding.

The following code shows how to impute with the column mean.
```
fill_mean = lambda col: col.fillna(col.mean())  # Replace a column's NaNs with that column's mean.

fill_df = new_df.apply(fill_mean, axis=0)  # Apply the function to every column.
fill_df
```
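
For the model-based option in the list of imputation methods, here is a rough sketch. It uses a straight-line fit from `np.polyfit` as a stand-in for an ML model, and it assumes `col2` is a sensible predictor for `col1`. Both choices are assumptions made for illustration, and `model_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6],
                   'col2': [7, 8, np.nan, 10, 11, 12],
                   'col3': [np.nan, 14, np.nan, 16, 17, 18]})
new_df = df.dropna(subset=['col2'], axis=0)

# Rows where col1 is observed are the training data.
known = new_df.dropna(subset=['col1'])

# Boolean mask of the rows whose col1 needs to be predicted.
missing = new_df['col1'].isnull()

# Fit col1 = slope * col2 + intercept on the observed rows.
slope, intercept = np.polyfit(known['col2'].to_numpy(),
                              known['col1'].to_numpy(), deg=1)

# Fill the missing col1 values with the fitted line's predictions.
model_df = new_df.copy()
model_df.loc[missing, 'col1'] = slope * model_df.loc[missing, 'col2'] + intercept
print(model_df)
```

In this dataset the observed pairs happen to lie on a straight line, so the prediction for row 3 comes out at about 4.0, whereas mean imputation fills in 3.5. Real data will rarely be this tidy.
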
To try other methods, go through the official documentation of [sklearn](https://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html), and [numpy](https://numpy.org/).
 

In [30]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6],
                   'col2': [7, 8, np.nan, 10, 11, 12],
                   'col3': [np.nan, 14, np.nan, 16, 17, 18]})
df

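|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | 1.0  | 7.0  | NaN  |
| 1 | 2.0  | 8.0  | 14.0 |
| 2 | NaN  | NaN  | NaN  |
| 3 | NaN  | 10.0 | 16.0 |
| 4 | 5.0  | 11.0 | 17.0 |
| 5 | 6.0  | 12.0 | 18.0 |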


In [17]:
df.shape[0]

6

In [16]:
df.isnull().any()

col1    True
col2    True
col3    True
dtype: bool

In [15]:
df.isnull().sum()

col1    2
col2    1
col3    2
dtype: int64

In [22]:
no_nulls = set(df.columns[df.isnull().any()])
print(no_nulls)

{'col2', 'col3', 'col1'}


In [26]:
new_df = df.dropna(subset=['col2'], axis=0)
new_df

|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | 1.0  | 7.0  | NaN  |
| 1 | 2.0  | 8.0  | 14.0 |
| 3 | NaN  | 10.0 | 16.0 |
| 4 | 5.0  | 11.0 | 17.0 |
| 5 | 6.0  | 12.0 | 18.0 |


In [29]:
fill_mean = lambda col: col.fillna(col.mean()) # Mean function

fill_df = new_df.apply(fill_mean, axis=0) #Fill all missing values with the mean of the column.
fill_df

|   | col1 | col2 | col3  |
|---|------|------|-------|
| 0 | 1.0  | 7.0  | 16.25 |
| 1 | 2.0  | 8.0  | 14.0  |
| 3 | 3.5  | 10.0 | 16.0  |
| 4 | 5.0  | 11.0 | 17.0  |
| 5 | 6.0  | 12.0 | 18.0  |
