# Dealing With Missing Values

In [1]:
import numpy as np
import pandas as pd

## 1. What Does Our Data Look Like?

The first thing to do when you get a new dataset is to take a look at it, or at least at a small sample of it. This gives you a brief idea of the dataset, such as the total number of columns (features) and the kind of values they hold.

In [2]:
df = pd.read_csv("datasets/housing.csv")

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
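Besides `head()`, a few other calls give a quick overview of a new dataset. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not the housing data) to show the shape, the column types, and a numeric summary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

n_rows, n_cols = toy.shape        # number of rows and columns (features)
column_types = toy.dtypes         # data type of each column
numeric_summary = toy.describe()  # count, mean, std, min, quartiles, max of numeric columns

print(n_rows, n_cols)
print(column_types)
print(numeric_summary)
```

Note that the `count` row of `describe()` only counts non-missing values, so it already hints at which columns have gaps.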


In [4]:
# set seed for reproducibility
np.random.seed(0) 

## 2. How Many Missing Values Do We Have?

Count the missing values in each column of the dataframe. In pandas, missing values are represented by `NaN` or `None`.

In [5]:
total_missing_values = df.isnull().sum()
total_missing_values

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [6]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

## Percentage Of The Dataset That Is Empty

It is always helpful to know the percentage of the dataset that is empty.

In [7]:
df.shape

(20640, 10)

In [8]:
total_cells = np.prod(df.shape)  # rows * columns (np.product was removed in NumPy 2.0)
total_missing = total_missing_values.sum()  # missing cells across all columns
percentage_missing = (total_missing / total_cells) * 100

percentage_missing

0.1002906976744186

In this case we have quite a clean dataset: only about 0.1% of the cells are empty.
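The overall percentage can hide a column that is mostly empty, so it may also help to compute the percentage per column. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# isna() gives True/False per cell; the mean of booleans is the fraction of True.
missing_pct_per_column = toy.isna().mean() * 100

# Overall share of empty cells, equivalent to missing cells / total cells.
missing_pct_overall = toy.isna().to_numpy().mean() * 100

print(missing_pct_per_column)
print(missing_pct_overall)
```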

## Figuring Out Why We Have Missing Values

Now that we know we have missing values, it is time to clean them up. This is the stage where we really dive into data science: we take a close look at our data and try to understand why it is the way it is, also known as 'data intuition'.

One of the questions to ask yourself is: 

**Is this value missing because it wasn't recorded or because it doesn't exist?**

For example, consider the `total_number_of_cars` of someone who does not have a car. In such cases the value does not exist, so it is better left as `NaN` or `None`. In other cases, like `height`, the value exists but was probably not recorded. In those cases you need to figure out a way to replace the `NaN` somehow. This process is called **imputation**.

According to the English dictionary, imputation means **the assignment of a value to something by inference from the value of the products or processes to which it contributes**.

Before you perform imputation, it is always best practice to read the dataset documentation to figure things out. If you don't have any documentation, online or domain research will take you a long way.

Taking a look, `total_bedrooms` is the only column with missing values (207 of them). Using the documentation and intuition, I don't think a house will have no bedrooms, considering that these are houses where people reside. The best thing to do is to find a way to estimate these missing values. We can do this by using other data from the same row.
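One way to use other data from the same row could look like the sketch below: estimate missing bedrooms from `total_rooms` using the median bedrooms-per-room ratio of the complete rows. The ratio heuristic and the toy values are assumptions for illustration, not something the dataset documentation prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "total_rooms": [1000.0, 2000.0, 3000.0, 4000.0],
    "total_bedrooms": [200.0, 400.0, np.nan, 800.0],
})

# Median bedrooms-per-room ratio; rows with a missing bedroom count are skipped.
ratio = (toy["total_bedrooms"] / toy["total_rooms"]).median()

# Estimate bedrooms for every row, then use the estimate only where the value is missing.
estimated = toy["total_rooms"] * ratio
filled = toy["total_bedrooms"].fillna(estimated)

print(filled)
```

Known values are left untouched; only the `NaN` in the third row is replaced by its row-based estimate.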

## Dropping Missing Values

One way to solve the issue of missing data is to drop the rows that contain missing values. This is not good practice, but it can save you if you are not working on an important project or if the missing values are very few.

The best approach is to get to know your dataset and explore it.

In [9]:
df_cp = df.copy()

df_cp.dropna(inplace = True)
df_cp.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

In [10]:
df.shape

(20640, 10)

In [11]:
df_cp.shape

(20433, 10)

For this particular dataset, dropping NaNs is not much of a loss (207 of 20,640 rows), but in some datasets it can mean dropping every row, which happens when every row has at least one missing value. That is not a good idea at all. It's always best to explore and visually inspect your data before dropping anything.
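`dropna` also has options that limit how much is dropped. The sketch below, on a made-up frame (the name `toy` and its values are assumptions), compares dropping any incomplete row, dropping only rows missing a specific column, keeping rows with a minimum number of known values, and dropping incomplete columns instead of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, np.nan, 1274.0],
    "total_bedrooms": [129.0, np.nan, np.nan, 235.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
})

rows_complete = toy.dropna()                                # drop rows with any NaN
rows_with_bedrooms = toy.dropna(subset=["total_bedrooms"])  # only check one column
rows_mostly_filled = toy.dropna(thresh=2)                   # keep rows with >= 2 known values
cols_complete = toy.dropna(axis=1)                          # drop columns with any NaN

print(len(rows_complete), len(rows_with_bedrooms), len(rows_mostly_filled))
print(list(cols_complete.columns))
```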

## Filling Missing Values

Another way around this is to fill in the missing values with some data. As a simple first attempt, we can replace the NaNs with zeros, keeping in mind that zero bedrooms is not a realistic value for this column.

In [12]:
df_cp2 = df.copy()

df_cp2.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [13]:
df_cp2.fillna(0, inplace = True)

In [14]:
df_cp2.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
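Filling everything with a single value is rarely ideal. `fillna` also accepts a dictionary that maps column names to fill values, so each column can get its own replacement. A minimal sketch on a made-up frame (the name `toy`, its values, and the `"UNKNOWN"` label are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", None, "INLAND", "NEAR BAY"],
})

# One fill value per column: the column median for numbers, a label for text.
fill_values = {
    "total_bedrooms": toy["total_bedrooms"].median(),
    "ocean_proximity": "UNKNOWN",
}
filled = toy.fillna(value=fill_values)

print(filled)
```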

We could also replace each missing value with a neighbouring value from the same column. Let's fill every NaN with the value that comes directly after (below) it in the column, known as a backward fill. A NaN in the last row has no value below it, so we chain a second `fillna(0)` to cover that case.

In [15]:
df_cp3 = df.copy()

In [16]:
# fillna(method='bfill') is deprecated in recent pandas; bfill() is the equivalent
df_cp3 = df_cp3.bfill(axis=0).fillna(0)
df_cp3.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
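To see how the fill direction matters, the sketch below applies a forward fill and a backward fill to a small made-up series (the values are assumptions for illustration). Each direction leaves a gap at one end, which is why a second fill step is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, middle, and end.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()       # copy the previous known value downwards; leading NaN stays
backward = s.bfill()      # copy the next known value upwards; trailing NaN stays
both = s.bfill().ffill()  # backward fill first, then forward fill the leftover tail

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```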

## Using Imputation From sklearn

https://www.kaggle.com/alexisbcook/missing-values

https://www.kaggle.com/alexisbcook/scaling-and-normalization

## Imputing Missing Values

When there are a lot of missing values, it is too costly to simply drop them. A better strategy is to infer them from the known parts of the dataset. This strategy is what we call **imputation**.

### Types Of Imputation

#### 1. Univariate Imputation

A type of imputation that uses only the non-missing values of a single column (the $ n^{th} $ feature dimension) to impute the missing values in that same column. This is the most basic form of imputation.

#### 2. Multivariate Imputation

A type of imputation that uses the entire set of available features to estimate the missing values in a column.

### Univariate Imputation Implementation Using SimpleImputer

This imputation method uses simple statistics such as the mean, median, or mode (most frequent value) of a column, or a constant value, to replace missing values.

In [18]:
from sklearn.impute import SimpleImputer

In [23]:
simple_imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")

In [34]:
df_cp4 = df.copy()

In [35]:
# fit_transform expects 2D input and returns a 2D array; ravel() flattens it back to one column
df_cp4["total_bedrooms"] = simple_imputer.fit_transform(df_cp4[["total_bedrooms"]]).ravel()

In [37]:
df_cp4.total_bedrooms.isna().sum()

0
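The mean is sensitive to outliers, so other strategies can be worth trying. The sketch below uses `SimpleImputer` with `strategy="median"` for a numeric column and `strategy="most_frequent"` for a text column. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame (values are made up); 900.0 acts as an outlier.
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 200.0, 900.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", np.nan, "NEAR BAY"],
})

# Median is less affected by the outlier than the mean would be.
median_imputer = SimpleImputer(strategy="median")
toy["total_bedrooms"] = median_imputer.fit_transform(toy[["total_bedrooms"]]).ravel()

# most_frequent (the mode) also works for non-numeric columns.
mode_imputer = SimpleImputer(strategy="most_frequent")
toy["ocean_proximity"] = mode_imputer.fit_transform(toy[["ocean_proximity"]]).ravel()

print(toy)
```

After fitting, the value each imputer learned is available in its `statistics_` attribute.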

### Multivariate Imputation Implementation Using IterativeImputer

"A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned."

In [39]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [40]:
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
iter_imputer.fit([[10, 15], [30, 35], [40, 46], [np.nan, 85], [70, np.nan]])



IterativeImputer(random_state=0)

In [42]:
X_test = [[np.nan, 25], [60, np.nan], [np.nan, 75]]

iter_imputer.transform(X_test)

array([[19.83170183, 25.        ],
       [60.        , 65.95344903],
       [68.94071447, 75.        ]])
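A single step of the round-robin procedure quoted above can be sketched by hand: treat one column as the output y, fit a regressor on the rows where y is known, and predict y where it is missing. This sketch swaps in `LinearRegression` for simplicity (by default `IterativeImputer` uses `BayesianRidge`) and runs only one step on made-up values, so it is an illustration of the idea rather than a reimplementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: rooms is the input X, bedrooms is the output y with one gap.
rooms = np.array([[1000.0], [2000.0], [3000.0], [4000.0]])
bedrooms = np.array([200.0, 400.0, np.nan, 800.0])

known = ~np.isnan(bedrooms)  # rows where y is observed

# Fit the regressor on (X, y) for known y only.
reg = LinearRegression().fit(rooms[known], bedrooms[known])

# Predict the missing y values and write them into a copy.
imputed = bedrooms.copy()
imputed[~known] = reg.predict(rooms[~known])

print(imputed)
```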

In [47]:
iter_imputer.fit(df[['total_bedrooms', 'total_rooms']])

IterativeImputer(random_state=0)

In [49]:
df_cp5 = df.copy()

In [50]:
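# fit_transform (not fit) returns the imputed array; fit alone returns the estimator itself
df_cp5[['total_bedrooms', 'total_rooms']] = iter_imputer.fit_transform(df[['total_bedrooms', 'total_rooms']])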

In [51]:
df_cp5.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
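Because a zero count from `isna().sum()` does not say whether the filled values are sensible, it may help to inspect the imputed values directly. The sketch below runs `IterativeImputer` on a made-up frame (the name `toy` and its values are assumptions) where bedrooms are roughly a fifth of rooms, and wraps the result back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "total_rooms": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "total_bedrooms": [200.0, 400.0, np.nan, 800.0, 1000.0],
})
cols = ["total_bedrooms", "total_rooms"]

imputer = IterativeImputer(max_iter=10, random_state=0)

# fit_transform returns a NumPy array, so restore the column names and index.
imputed = pd.DataFrame(imputer.fit_transform(toy[cols]), columns=cols, index=toy.index)
toy[cols] = imputed

print(toy)
```

The imputed bedroom count for the third row should land close to the value the other rows suggest, which is a quick sanity check that the columns were passed in the intended order.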