## Automatic selection of best imputation technique with Sklearn

In this notebook we will do a grid search over the imputation methods available in Scikit-learn to determine which imputation technique works best for this dataset and the machine learning model of choice.

We will also train a very simple machine learning model as part of a small pipeline.


In [1]:
import pandas as pd
import numpy as np

# import classes for imputation
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# import extra classes for modelling
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

In [2]:
# let's load the car-data.csv dataset

data = pd.read_csv('C:\\Users\\gusal\\machine learning\\Feature engineering\\car-data_rev2.csv')


In [3]:
data.head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,price
0,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,13495.0
1,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,16500.0
2,1,alfa-romero,gas,,two,hatchback,rwd,front,,16500.0
3,2,audi,gas,,four,,fwd,front,99.8,13950.0
4,2,audi,gas,,four,sedan,4wd,front,99.4,17450.0


In [4]:
data.dtypes

symboling            int64
make                object
fuel-type           object
aspiration          object
num-of-doors        object
body-style          object
drive-wheels        object
engine-location     object
wheel-base         float64
price              float64
dtype: object

In [5]:
# find categorical variables
# those of type 'Object' in the dataset
features_categorical = [c for c in data.columns if data[c].dtypes=='O']

# find numerical variables
# those whose dtype is not object, excluding the target 'price'
features_numerical = [c for c in data.columns if data[c].dtypes!='O' and c !='price']
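
As an alternative to comparing dtypes by hand, pandas can split the columns with `select_dtypes`. The snippet below is only a sketch on a small made-up frame (`toy` and its values are assumptions, not rows from the car dataset); it treats every non-numeric column as categorical.

```python
import pandas as pd

# Hypothetical toy frame that mimics the layout of the car data.
toy = pd.DataFrame({
    'make': ['audi', 'bmw', None],
    'wheel-base': [99.8, None, 101.2],
    'symboling': [2, 1, 0],
    'price': [13950.0, 16430.0, 16925.0],
})

# Numerical columns, excluding the target 'price'.
numerical = toy.select_dtypes(include='number').columns.drop('price').tolist()

# Everything that is not numeric is treated as categorical.
categorical = toy.select_dtypes(exclude='number').columns.tolist()

print(numerical)
print(categorical)
```

Selecting on `'number'` rather than on the object dtype keeps the split independent of how a given pandas version stores text columns.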

In [6]:
# inspect the categorical variables

data[features_categorical].head()

Unnamed: 0,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location
0,alfa-romero,gas,std,two,convertible,rwd,front
1,alfa-romero,gas,std,two,convertible,rwd,front
2,alfa-romero,gas,,two,hatchback,rwd,front
3,audi,gas,,four,,fwd,front
4,audi,gas,,four,sedan,4wd,front


In [7]:
# inspect the numerical variables

data[features_numerical].head()

Unnamed: 0,symboling,wheel-base
0,3,88.6
1,3,88.6
2,1,
3,2,99.8
4,2,99.4


In [8]:
inputs = data.drop(['price'], axis = 1)
target = data['price']

In [9]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    inputs,  # just the features
    target,  # the target
    test_size=0.3,  # the fraction of observations in the test set
    random_state=0)  # for reproducibility

X_train.shape, X_test.shape

((143, 9), (62, 9))

In [10]:
# We create the preprocessing pipelines for both
# numerical and categorical data

# adapted from Scikit-learn code available here under BSD3 license:
# https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numeric_transformer, features_numerical),
        ('categorical', categorical_transformer, features_categorical)])

# Note that the strategies passed to the imputers here are only initial placeholders.
# They will be overridden during the grid search below.

In [11]:
# Append the model to the preprocessing pipeline.
# Lasso is a regressor; the step is simply named 'classifier'.
# Now we have a full prediction pipeline.

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', Lasso(max_iter=2000))])

In [12]:
# now we create the grid with all the parameters that we would like to test

param_grid = {
    'preprocessor__numerical__imputer__strategy': ['mean', 'median'],
    'preprocessor__categorical__imputer__strategy': ['most_frequent', 'constant'],
    'classifier__alpha': [10, 100, 200],
}

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring='r2')

# cv=5 is the number of cross-validation folds
# n_jobs=-1 indicates to use all available cpus
# scoring='r2' indicates to evaluate using the r squared

# for more details in the grid parameters visit:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

When setting up the grid, each parameter is addressed by chaining the step names with double underscores:

'preprocessor__numerical__imputer__strategy': ['mean', 'median'],

The above line indicates that I would like to test the mean and the median in the imputer step of the numerical transformer.

'preprocessor__categorical__imputer__strategy': ['most_frequent', 'constant'],

The above line indicates that I would like to test the most frequent category or a constant value in the imputer step of the categorical transformer.

'classifier__alpha': [10, 100, 200],

The above line indicates that I want to test those 3 values for the alpha parameter of Lasso. Note that Lasso is the 'classifier' step of our last pipeline.
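
To check which names are valid keys for the grid, one option is to list the parameters that the pipeline exposes. The sketch below rebuilds a pipeline with the same step names as above; the column lists `num_cols` and `cat_cols` are placeholders chosen for illustration, and nothing is fitted.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column lists, used only to construct the transformer.
num_cols = ['wheel-base']
cat_cols = ['make']

numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

pre = ColumnTransformer(transformers=[
    ('numerical', numeric, num_cols),
    ('categorical', categorical, cat_cols)])

model = Pipeline(steps=[('preprocessor', pre),
                        ('classifier', Lasso(max_iter=2000))])

# Each nested parameter is exposed as <step>__<sub-step>__<parameter>.
param_names = sorted(model.get_params().keys())
imputer_params = [p for p in param_names if p.endswith('imputer__strategy')]
print(imputer_params)

# set_params accepts the same names; the grid search relies on the same
# mechanism to apply each combination to a copy of the pipeline.
model.set_params(preprocessor__numerical__imputer__strategy='mean')
new_strategy = model.get_params()['preprocessor__numerical__imputer__strategy']
print(new_strategy)
```

A key that does not appear in `get_params()` is rejected when the search is fitted, so printing the names first is a cheap way to catch typos in the grid.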

In [13]:
# and now we train over all the possible combinations of the parameters above
grid_search.fit(X_train, y_train)

# and we print the r2 of the best model over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))

best linear regression from grid search: 0.908




In [14]:
# we can print the best estimator parameters like this
grid_search.best_estimator_

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numerical',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='mean',
                                                         

In [15]:
# and find the best fit parameters like this
grid_search.best_params_

{'classifier__alpha': 10,
 'preprocessor__categorical__imputer__strategy': 'constant',
 'preprocessor__numerical__imputer__strategy': 'mean'}

In [16]:
# here we can see all the combinations evaluated during the gridsearch
grid_search.cv_results_['params']

[{'classifier__alpha': 10,
  'preprocessor__categorical__imputer__strategy': 'most_frequent',
  'preprocessor__numerical__imputer__strategy': 'mean'},
 {'classifier__alpha': 10,
  'preprocessor__categorical__imputer__strategy': 'most_frequent',
  'preprocessor__numerical__imputer__strategy': 'median'},
 {'classifier__alpha': 10,
  'preprocessor__categorical__imputer__strategy': 'constant',
  'preprocessor__numerical__imputer__strategy': 'mean'},
 {'classifier__alpha': 10,
  'preprocessor__categorical__imputer__strategy': 'constant',
  'preprocessor__numerical__imputer__strategy': 'median'},
 {'classifier__alpha': 100,
  'preprocessor__categorical__imputer__strategy': 'most_frequent',
  'preprocessor__numerical__imputer__strategy': 'mean'},
 {'classifier__alpha': 100,
  'preprocessor__categorical__imputer__strategy': 'most_frequent',
  'preprocessor__numerical__imputer__strategy': 'median'},
 {'classifier__alpha': 100,
  'preprocessor__categorical__imputer__strategy': 'constant',
  'pre

In [17]:
# and here are the scores for each one of the above combinations
grid_search.cv_results_['mean_test_score']

array([0.83492106, 0.83019562, 0.84015003, 0.83640128, 0.70817302,
       0.70043471, 0.70601422, 0.69819753, 0.54052612, 0.53452258,
       0.54018959, 0.53405383])
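
Reading the parameter list and the score array side by side is error-prone, so it can help to load `cv_results_` into a DataFrame and sort by rank. The sketch below uses synthetic data and a reduced pipeline (the data, the step name `regressor` and the alpha values are all assumptions made for illustration), so its scores are unrelated to the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with roughly 10% of the values set to missing.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)
X[rng.rand(100, 3) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('regressor', Lasso(max_iter=2000))])

grid = {
    'imputer__strategy': ['mean', 'median'],
    'regressor__alpha': [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, grid, cv=5, scoring='r2')
search.fit(X, y)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
summary = (results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score')
           .reset_index(drop=True))
print(summary)
```

The `std_test_score` column shows how much the score varies across folds, which is useful when two combinations have nearly identical means.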

In [18]:
# and finally let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.763


This model overfits the train set: the r2 is 0.908 on the train set versus 0.763 on the test set.

We will try to reduce this over-fitting later.
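
One way to inspect the gap for every parameter combination, rather than only for the best model, is to ask the grid search to keep the train-fold scores with `return_train_score=True`. The sketch below uses synthetic data with few rows and many features (an assumption chosen to make overfitting likely); it illustrates the diagnostic and is not a result for the car dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data: 60 rows, 40 features, only the first feature is informative.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 40))
y = X[:, 0] * 3.0 + rng.normal(scale=1.0, size=60)

search = GridSearchCV(
    Lasso(max_iter=5000),
    {'alpha': [0.001, 0.1, 1.0]},
    cv=5,
    scoring='r2',
    return_train_score=True)  # also store the scores on the training folds
search.fit(X, y)

train_scores = search.cv_results_['mean_train_score']
valid_scores = search.cv_results_['mean_test_score']
gaps = train_scores - valid_scores  # a large gap suggests overfitting

for params, tr, va in zip(search.cv_results_['params'], train_scores, valid_scores):
    print(params, 'train r2: %.3f' % tr, 'validation r2: %.3f' % va)
```

Note that `mean_test_score` here refers to the validation folds inside cross-validation, not to a held-out test set such as `X_test` above.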