In this tutorial, you will learn three approaches to **dealing with missing values**. You will then compare the effectiveness of these approaches on a real-world dataset.

# Introduction

There are many ways data can end up with missing values. For example,
- A 2 bedroom house won't include a value for the size of a third bedroom.
- A survey respondent may choose not to share their income.

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.
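
As a small illustration of that error, the sketch below fits a scikit-learn `LinearRegression` on toy data containing one `NaN`. The arrays `X_toy` and `y_toy` are made up for this example and are not part of the Melbourne dataset; whether an estimator rejects missing values can depend on the estimator and the library version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): the third feature value is missing
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

try:
    LinearRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # Input validation rejects the NaN before any fitting happens
    fit_failed = True
    print("Fitting failed:", err)
```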

# Three Approaches


### 1) A Simple Option: Drop Columns with Missing Values

A simple option is to drop columns with missing values. 

![approach 1](./images/tut3_approach1.png)

Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach.  As an extreme example, consider a dataset with 10,000 rows that contains an important column with only one missing entry.  This approach would drop the column entirely!
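
A minimal sketch of this approach on a toy DataFrame follows. The column names and values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the three columns have a missing entry
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Bathroom": [1.0, np.nan, 2.0, 1.0],
    "Landsize": [150.0, 300.0, np.nan, 220.0],
})

# Find every column with at least one missing value, then drop those columns
toy_cols_with_missing = [col for col in toy.columns if toy[col].isnull().any()]
toy_reduced = toy.drop(columns=toy_cols_with_missing)

print(toy_cols_with_missing)
print(toy_reduced.columns.tolist())
```

Note that `Bathroom` and `Landsize` are each missing only one of four entries, yet both are removed in full.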

### 2) A Better Option: Imputation

**Imputation** fills in the missing values with some number.  For instance, we can fill in the mean value along each column. 

![approach 2](./images/tut3_approach2.png)

The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.  
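
To make this concrete, here is a small sketch of mean imputation using scikit-learn's `SimpleImputer` on toy data. The DataFrame is hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one missing value in the "Bathroom" column
toy = pd.DataFrame({
    "Rooms": [2.0, 3.0, 4.0, 3.0],
    "Bathroom": [1.0, np.nan, 2.0, 3.0],
})

# Replace each missing value with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# The observed Bathroom values are 1, 2 and 3, so the gap is filled with their mean
print(toy_imputed)
```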

### 3) An Extension To Imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.  

![approach 3](./images/tut3_approach3.png)

In this approach, we impute the missing values, as before.  Additionally, for each column with missing entries in the original dataset, we add a new column that records which entries were imputed.

In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
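
The idea can be sketched on toy data as follows. The DataFrame and the `_was_missing` suffix are illustrative choices here; the last lines use the `add_indicator` option of `SimpleImputer`, which appends similar indicator columns automatically.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one missing value in the "Bathroom" column
toy = pd.DataFrame({
    "Rooms": [2.0, 3.0, 4.0, 3.0],
    "Bathroom": [1.0, np.nan, 2.0, 3.0],
})

# Record where values are missing BEFORE imputing (1 = was missing, 0 = observed)
toy_plus = toy.copy()
toy_plus["Bathroom_was_missing"] = toy_plus["Bathroom"].isnull().astype(int)

# Impute as usual; the indicator column has no gaps, so it is left unchanged
imputer = SimpleImputer()
toy_plus_imputed = pd.DataFrame(imputer.fit_transform(toy_plus),
                                columns=toy_plus.columns)
print(toy_plus_imputed)

# Alternative: let SimpleImputer append the indicator columns itself
with_indicator = SimpleImputer(add_indicator=True).fit_transform(toy)
print(with_indicator.shape)
```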

# Example 

In the example, we will work with [Melbourne Housing data](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home).  Our model will use information such as the number of rooms and land size to predict home price.

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and test data in `X_train`, `X_test`, `y_train`, and `y_test`. 

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# load the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# select target
y = data.Price

# to keep things simple, we'll use only numeric predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# divide data into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)

### Define Function to Measure Quality of Each Approach

We define a function `score_dataset` to compare different approaches to dealing with missing values. This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_test, y_train, y_test):
    # fix random_state so the scores are reproducible across runs
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
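
As a quick check on what MAE measures, the sketch below computes it by hand on three made-up prices and compares the result with `mean_absolute_error`. The numbers are hypothetical and unrelated to the Melbourne data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values (hypothetical): actual prices and a model's predictions
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# MAE is the average size of the errors, ignoring their sign:
# (|100 - 110| + |200 - 190| + |300 - 330|) / 3 = (10 + 10 + 30) / 3
mae_manual = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_manual, mae_sklearn)
```

Lower MAE means the predictions are closer to the true values on average.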

### Score from Approach 1 (Drop Columns with Missing Values)

Since we are working with both training and test sets, we are careful to drop the same columns in both DataFrames.  

In [None]:
# get names of columns with missing values
# (X contains all of the rows in both X_train and X_test)
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]

# drop columns in train and test data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

### Score from Approach 2 (Imputation)

Next, we use [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to replace missing values with the mean value along each column.

Although it's simple, filling in the mean value generally performs quite well.  While statisticians have experimented with more complex ways to determine imputed values (such as **regression imputation**), the complex strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.

In [None]:
from sklearn.impute import SimpleImputer

# imputation
my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

We see that **Approach 2** has lower MAE than **Approach 1**, so **Approach 2** performed better on this dataset.
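
The imputation cell calls `fit_transform` on the training data but only `transform` on the test data. A small sketch with made-up values shows the consequence: the column means are learned from the training rows alone and then reused to fill gaps in the test rows. The DataFrames `train` and `test` below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): the test set has much larger values than the training set
train = pd.DataFrame({"Landsize": [100.0, np.nan, 300.0]})
test = pd.DataFrame({"Landsize": [np.nan, 1000.0]})

imputer = SimpleImputer()
train_imputed = imputer.fit_transform(train)  # learns the mean of 100 and 300
test_imputed = imputer.transform(test)        # reuses that mean, ignores test values

print(imputer.statistics_)  # per-column fill values learned from the training data
print(test_imputed)
```

Fitting on the training data only keeps information from the test set out of the preprocessing step.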

### Score from Approach 3 (An Extension to Imputation)

Next, we impute the missing values, while also keeping track of which values were imputed.

In [None]:
# make copy to avoid changing original data (when imputing)
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

# make new columns indicating what will be imputed
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

As we can see, **Approach 3** performed better than **Approach 1** and **Approach 2**!

### So, why did imputation perform better than dropping the columns?

When combined, the training and test data contain 13,580 rows and 12 columns, three of which contain missing data.  In each of those three columns, fewer than half of the entries are missing.  Thus, dropping the columns removes a lot of useful information, and so it makes sense that imputation would perform better.

In [None]:
# print shape of dataset (num_rows, num_columns)
print(X.shape)

# print number of missing values in each column
# (X contains all of the rows in both X_train and X_test)
missing_val_count_by_column = X.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])
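
Raw counts are easier to interpret as fractions of the column length. The sketch below shows one way to compute them, using a toy DataFrame with invented values rather than the tutorial's `X`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "Car" is missing 1 of 4 entries, "YearBuilt" 2 of 4
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 1.0],
    "YearBuilt": [1990.0, np.nan, np.nan, 2005.0],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column
missing_fraction = toy.isnull().mean()
print(missing_fraction[missing_fraction > 0])
```

Applying the same expression to `X` (that is, `X.isnull().mean()`) would give the corresponding fractions for the housing data.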

# Conclusion
As is common, imputing missing values (in **Approach 2**) yielded better results than simply dropping columns with missing values (in **Approach 1**).  We got an additional boost by tracking which values had been imputed (in **Approach 3**).

# Keep Going

...