In this tutorial, you will learn three approaches to dealing with missing values. You will then learn to compare the effectiveness of these approaches on any given dataset.

# Introduction

There are many ways data can end up with missing values. For example,
- A 2 bedroom house won't include a value for the size of a third bedroom.
- A survey respondent may choose not to share their income.

Most libraries (_including scikit-learn_) will give you an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.
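
As a quick illustration of that error, the sketch below fits a `LinearRegression` on a tiny made-up array that contains one `NaN`. The data and the variable names are hypothetical; the point is only that a scikit-learn estimator without native missing-value support rejects such input with a `ValueError`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): one feature value is missing
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])
y = np.array([10.0, 20.0, 30.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn rejects the NaN during input validation
    fit_failed = True
    print("Model could not be fit:", err)
```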

# Three Approaches


### 1) A Simple Option: Drop Columns with Missing Values

A simple option is to drop columns with missing values. If your data is in a DataFrame called `original_data`, you can do that as follows:

```python
reduced_data = original_data.dropna(axis=1)
```

In many cases, you'll have both a training dataset (`original_train_data`) and a test dataset (`original_test_data`).  You will want to drop the same columns in both DataFrames. In that case, you would write:

```python
# get names of columns with missing values (based on the training data)
cols_with_missing = [col for col in original_train_data.columns
                     if original_train_data[col].isnull().any()]
# drop columns in train and test DataFrames
reduced_train_data = original_train_data.drop(cols_with_missing, axis=1)
reduced_test_data = original_test_data.drop(cols_with_missing, axis=1)
```

If those columns had useful information _in the entries that were not missing_, your model loses access to this information when the column is dropped. Also, if your test data has missing values in columns where your training data did not, those columns are kept, and the model will raise an error when it meets the missing values at prediction time.

So, it's usually not the best solution. However, it can be useful when most values in a column are missing.
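
Both drawbacks can be seen on a small example. The sketch below uses two made-up DataFrames (`toy_train` and `toy_test`, with hypothetical values) where the training data is missing an `area` entry and the test data is missing a `year` entry.

```python
import numpy as np
import pandas as pd

# Toy DataFrames (hypothetical values, for illustration only)
toy_train = pd.DataFrame({"rooms": [2, 3, 4],
                          "area": [80.0, np.nan, 120.0],
                          "year": [1990.0, 2005.0, 2010.0]})
toy_test = pd.DataFrame({"rooms": [3, 2],
                         "area": [95.0, 70.0],
                         "year": [np.nan, 1985.0]})

# columns with missing values, judged from the training data only
toy_cols_with_missing = [col for col in toy_train.columns
                         if toy_train[col].isnull().any()]

reduced_toy_train = toy_train.drop(toy_cols_with_missing, axis=1)
reduced_toy_test = toy_test.drop(toy_cols_with_missing, axis=1)

print(toy_cols_with_missing)                  # ['area']
print(reduced_toy_test.isnull().any().any())  # True
```

The two known `area` values in the training data are discarded along with the column, and the reduced test data still contains a missing `year`.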

### 2) A Better Option: Imputation

**Imputation** fills in the missing values with some number. This approach usually gives more accurate models than dropping columns entirely, and it is demonstrated in the code below:
```python
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_data = my_imputer.fit_transform(original_data)
```

By default, `SimpleImputer` fills each missing entry with the mean of its column. Statisticians have researched more complex strategies, but those complex strategies typically give no benefit once you plug the results into sophisticated machine learning models.
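
To make the default concrete, here is a minimal sketch on a made-up DataFrame (`toy_data`, hypothetical values). It compares the default mean with `strategy="median"`, another built-in `SimpleImputer` option that is less sensitive to outliers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values); 'income' has an outlier and a gap
toy_data = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0],
                         "income": [10.0, 20.0, 300.0, np.nan]})

mean_imputer = SimpleImputer()                     # default strategy is "mean"
median_imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so wrap it to keep the column names
mean_filled = pd.DataFrame(mean_imputer.fit_transform(toy_data),
                           columns=toy_data.columns)
median_filled = pd.DataFrame(median_imputer.fit_transform(toy_data),
                             columns=toy_data.columns)

print(mean_filled.loc[3, "income"])    # 110.0
print(median_filled.loc[3, "income"])  # 20.0
```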

### 3) An Extension To Imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.  Here's how it might look:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# make copy to avoid changing original data (when imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# imputation (returns a NumPy array, so restore the column names afterwards)
my_imputer = SimpleImputer()
imputed_data = pd.DataFrame(my_imputer.fit_transform(new_data))
imputed_data.columns = new_data.columns
```

In some cases this approach will meaningfully improve results. In other cases, it doesn't help at all.
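
As an alternative sketch of the same idea, `SimpleImputer` can build the indicator columns itself when given `add_indicator=True`. The DataFrame below is made up for illustration. The indicators are appended after the imputed features, one for each feature that had missing values when the imputer was fit.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): only 'area' has a missing entry
toy_data = pd.DataFrame({"area": [80.0, np.nan, 120.0],
                         "rooms": [2.0, 3.0, 4.0]})

indicator_imputer = SimpleImputer(add_indicator=True)
imputed_with_flags = indicator_imputer.fit_transform(toy_data)

# columns: imputed 'area', imputed 'rooms', indicator for 'area'
print(imputed_with_flags)
```

This is not the code used in the rest of this tutorial, which builds the `_was_missing` columns by hand.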

# Example (Comparing All Solutions)

We will see an example predicting housing prices from the Melbourne Housing data.  

### Set-up

We start by loading the data and defining the target `melb_target` and the numeric predictors `melb_numeric_predictors`.

We divide our data into training and test subsets with [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# load the data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# select the prediction target (sale price)
melb_target = melb_data.Price

# to keep things simple, we'll use only numeric predictors
melb_predictors = melb_data.drop(['Price'], axis=1)
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])

# divide data into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                    melb_target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)
```

### Investigate Missing Values

```python
# print number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
```
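
Counts are easier to judge alongside the share of rows they represent, since a column that is mostly empty is a candidate for dropping. A minimal sketch on a made-up DataFrame (`toy_X`, with hypothetical columns, stands in for `X_train`):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy_X = pd.DataFrame({"garage": [1.0, np.nan, 2.0, 1.0],
                      "area": [np.nan, np.nan, 150.0, np.nan],
                      "rooms": [2, 3, 4, 3]})

missing_count = toy_X.isnull().sum()
missing_share = toy_X.isnull().mean()   # mean of booleans = fraction missing

summary = pd.DataFrame({"n_missing": missing_count,
                        "share_missing": missing_share})

# show only the columns that have at least one missing value
print(summary[summary.n_missing > 0])
```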

### Create Function to Measure Quality of Each Approach

We define a function `score_dataset` to compare different approaches to dealing with missing values. This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
```
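
For intuition about the number `score_dataset` returns, here is a small worked example of MAE on made-up prices. Lower is better, and the result is in the same units as the target.

```python
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only
actual_prices = [100, 200, 300]
predicted_prices = [110, 180, 300]

# MAE = (|100-110| + |200-180| + |300-300|) / 3 = (10 + 20 + 0) / 3
toy_mae = mean_absolute_error(actual_prices, predicted_prices)
print(toy_mae)  # 10.0
```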

### Score from Approach 1 (Dropping Columns with Missing Values)

```python
# get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# drop columns in train and test DataFrames
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Dropping columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
```

### Score from Approach 2 (Imputation)

```python
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
```
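
Note that the imputer is fit on the training data and only applied to the test data. The sketch below, on made-up arrays, shows what that means: the fill value comes from the training rows alone, so nothing about the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays (hypothetical values, for illustration only)
toy_train = np.array([[1.0], [3.0], [np.nan]])   # observed mean = 2.0
toy_test = np.array([[np.nan], [100.0]])

toy_imputer = SimpleImputer()
toy_imputer.fit(toy_train)          # learns the mean from training rows only

filled_test = toy_imputer.transform(toy_test)
print(toy_imputer.statistics_)      # [2.]
print(filled_test[0, 0])            # 2.0
```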

### Score from Approach 3 (An Extension to Imputation)

```python
# make copy to avoid changing original data (when imputing)
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

# make new columns indicating what will be imputed
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
```

# Conclusion
As is common, imputing missing values (**Approach 2**) yielded better results than simply dropping columns with missing values (**Approach 1**). We got an additional boost by tracking which values had been imputed (**Approach 3**).

# Keep Going
Once you've added the imputer and included columns with missing values, you are ready to [add categorical variables](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding), which are non-numeric data representing categories (like the name of the neighborhood a house is in).