In [22]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer

In [23]:
data = pd.read_csv('~/JProjects/kaggle/data/melb_data.csv')

In [24]:
# Select target
y = data.Price

# To keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [25]:
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

There are three approaches to handling missing values in the dataset.

## Drop columns with missing values

Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach. As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!
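The extreme case described above can be reproduced on a small scale. This is a minimal sketch on a hypothetical toy frame (the `toy`, `rooms`, and `area` names are made up for illustration and are not part of the Melbourne data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'area' is missing a single entry, 'rooms' is complete
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "area": [80.0, np.nan, 120.0, 95.0],
})

# Same column-selection logic as Approach 1
cols_to_drop = [col for col in toy.columns if toy[col].isnull().any()]
reduced = toy.drop(cols_to_drop, axis=1)

print(cols_to_drop)   # ['area']
print(reduced.shape)  # (4, 1)
```

One missing entry is enough to discard the whole `area` column, along with the three valid values it contained.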

In [26]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
183550.22137772635


## Imputation

Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column.
This usually gives a more accurate model than dropping the columns entirely.
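To make mean imputation concrete, here is a small sketch on hypothetical toy data (the frame and its column names are illustrative assumptions). `SimpleImputer` uses `strategy="mean"` by default; it is spelled out here only for clarity:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "area": [80.0, np.nan, 120.0, 100.0],
    "rooms": [2.0, 3.0, np.nan, 4.0],
})

toy_imputer = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so column names and index are restored by hand
filled = pd.DataFrame(
    toy_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(filled)
```

The missing `area` entry becomes 100.0 (the mean of 80, 120 and 100) and the missing `rooms` entry becomes 3.0 (the mean of 2, 3 and 4).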

In [27]:
# Imputation
imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation)")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE from Approach 2 (Imputation)
178166.46269899711


## An Extension to Imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.

In this approach, we impute the missing values, as before. And, additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries.

In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
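As an aside, `SimpleImputer` can produce the indicator columns itself through its `add_indicator` parameter. The sketch below is an alternative to the manual approach used in this notebook, not what the notebook does, and it runs on hypothetical toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only 'area' has a missing value
toy_train = pd.DataFrame({
    "area": [80.0, np.nan, 120.0],
    "rooms": [2.0, 3.0, 4.0],
})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted
indicator_imputer = SimpleImputer(strategy="mean", add_indicator=True)
out = indicator_imputer.fit_transform(toy_train)

# Columns of `out`: imputed area, imputed rooms, missing-indicator for area
print(out.shape)  # (3, 3)
print(out)
```

Only `area` gets an indicator column, because `rooms` had no missing values during fitting.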

In [28]:
# We impute the missing values, while also keeping track of which values were imputed

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
# (cols_with_missing was computed in the Approach 1 cell)
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print('MAE from Approach 3 (An Extension to Imputation):')
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
178927.503183954


As we can see, Approach 3 performed slightly worse than Approach 2 (MAE of about 178,928 versus 178,166).


In [29]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64


In scikit-learn's SimpleImputer, the fit_transform and transform methods serve different purposes:

1. fit_transform: This method fits the imputer to the data and transforms that data in a single call. It learns the fill values from the training data (by default, the mean of each column), immediately uses them to fill the missing values, and returns the transformed dataset.
`imputed_X_train = imputer.fit_transform(X_train)`
Here, fit_transform is called on the training data (X_train), so the fill values are computed from X_train and then applied to X_train.

2. transform: This method applies previously learned fill values to new data. It requires that the imputer has already been fitted, and it does not relearn anything from the new data.
`imputed_X_valid = imputer.transform(X_valid)`
Here, transform is called on the validation data (X_valid). Missing values in X_valid are filled with the values learned from the training data, and the imputer is not refitted.

In summary, use fit_transform on the training data to fit the imputer and transform the data in one step, and use transform on new or unseen data to apply the learned fill values without refitting. This keeps the imputation consistent between the training and validation (or test) datasets, and it keeps statistics from the validation data out of the imputer.
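The difference is easy to see on a small example. This is a minimal sketch using hypothetical `toy_train` and `toy_valid` frames (not the Melbourne data). The fitted `statistics_` attribute holds the per-column fill values learned during fitting:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: validation values are on a very different scale
toy_train = pd.DataFrame({"area": [80.0, np.nan, 120.0]})
toy_valid = pd.DataFrame({"area": [np.nan, 300.0]})

demo_imputer = SimpleImputer(strategy="mean")

# Learn the fill value from the training data and apply it there
imputed_train = demo_imputer.fit_transform(toy_train)

# Reuse the training fill value on the validation data (no refitting)
imputed_valid = demo_imputer.transform(toy_valid)

print(demo_imputer.statistics_)  # fill value learned from toy_train
print(imputed_valid)
```

The missing validation entry is filled with 100.0, the mean of the training column, rather than with a value derived from `toy_valid`. Calling `fit_transform` on the validation data instead would have recomputed the mean from `toy_valid`.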