## Imputing Missing Numeric Data

Many machine learning models, including most scikit-learn estimators, can't work with missing numerical data. The process of filling in missing values is called imputation.
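As a rough illustration of why missing values are a problem, the sketch below fits a scikit-learn `LinearRegression` on a tiny made-up array containing one `NaN`. The data and the choice of model are assumptions for demonstration only; they are not part of the weather dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix with one missing value (illustrative data only)
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN values here
    fit_failed = True
    print("fit failed:", err)
```

The estimator refuses to train until the missing value is removed or filled in.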

<img src="https://i.imgur.com/W7cfyOp.png" width="480">

There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the `SimpleImputer` class from `sklearn.impute`.

In [1]:
import pandas as pd
import numpy as np
raw_df = pd.read_csv("weather-dataset-rattle-package/weatherAUS.csv")
numeric_cols = raw_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = raw_df.select_dtypes('object').columns.tolist()

Before we perform imputation, let's check the number of missing values in each numeric column.

In [2]:
raw_df[numeric_cols].isna().sum()

MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustSpeed    10263
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
dtype: int64

The first step in imputation is to `fit` the imputer to the data, i.e., compute the chosen statistic (e.g., the mean) for each column in the dataset.

In [3]:
from sklearn.impute import SimpleImputer

In [10]:
# Show the documentation for SimpleImputer from sklearn.impute.
# In Jupyter, this opens a pop-up pane with the class's docstring.
?SimpleImputer

In [4]:
imputer = SimpleImputer(strategy = 'mean')
#imputer = SimpleImputer(strategy = 'median')

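Other strategies include `'median'` (which works better for skewed columns like salary, where a few extreme values distort the mean), `'most_frequent'`, and `'constant'` (which replaces missing values with a fixed value given by `fill_value`).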
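To make the difference between strategies concrete, here is a minimal sketch on a hypothetical `salary` column with one extreme value and one missing value. The `toy` DataFrame and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column: one outlier (1000.0) and one missing value
toy = pd.DataFrame({"salary": [30.0, 35.0, 40.0, 1000.0, np.nan]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)
constant_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(toy)

# The last row held the missing value; compare what each strategy put there
print(mean_filled[-1, 0])      # mean of the observed values
print(median_filled[-1, 0])    # median of the observed values
print(constant_filled[-1, 0])  # the fixed fill value
```

In this toy column the mean is pulled far above the typical salary by the single outlier, while the median stays close to the bulk of the values.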

In [5]:
imputer.fit(raw_df[numeric_cols])

SimpleImputer()

After calling `fit`, the computed statistic for each column is stored in the `statistics_` attribute of `imputer`, in the same order as the columns passed to `fit`.

In [6]:
list(imputer.statistics_)

[12.19403438096892,
 23.22134827564685,
 2.3609181499166656,
 5.468231522922462,
 7.6111775206611565,
 40.03523007167319,
 14.043425914971502,
 18.662656778887342,
 68.88083133761887,
 51.5391158755046,
 1017.6499397983052,
 1015.2558888309618,
 4.4474612602152455,
 4.509930082924903,
 16.990631415587398,
 21.68339031800974]
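A bare list of numbers is hard to read, so it can help to label each statistic with its column name. The sketch below does this on a small made-up DataFrame (`toy` and `toy_imputer` are hypothetical names); the same idea should carry over to `imputer` and `numeric_cols` above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame with one missing value per column
toy = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 14.0],
    "Rainfall": [0.0, 3.0, np.nan],
})
toy_imputer = SimpleImputer(strategy="mean").fit(toy)

# statistics_ follows the column order of the data passed to fit,
# so the column names can be used directly as the index
stats = pd.Series(toy_imputer.statistics_, index=toy.columns)
print(stats)
```

For the `'mean'` strategy, these values match what `toy.mean()` returns, since both skip missing entries.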

Now we can fill in the missing values using the `transform` method of `imputer`.

In [7]:
raw_df[numeric_cols] = imputer.transform(raw_df[numeric_cols])

In [8]:
raw_df[numeric_cols].isna().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
dtype: int64

We can see that there are no missing values left in the numeric columns.
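As a sanity check on what imputation actually changes, the following sketch runs mean imputation on a tiny made-up column and confirms that only the previously missing cells were filled. The `toy` data is hypothetical, and `fit_transform` is used here as a shorthand for calling `fit` and then `transform` on the same data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative column with two missing values
toy = pd.DataFrame({"Temp9am": [15.0, np.nan, 21.0, np.nan]})
was_missing = toy["Temp9am"].isna()

filled = toy.copy()
filled[["Temp9am"]] = SimpleImputer(strategy="mean").fit_transform(toy[["Temp9am"]])

# Observed values are untouched; missing cells now hold the column mean
print(filled["Temp9am"].tolist())
print(filled["Temp9am"].isna().sum())
```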