## Frequent category imputation with Scikit-learn: SimpleImputer

Scikit-learn provides a class that implements the most common data imputation techniques.

The **SimpleImputer** class provides basic strategies for imputing missing values, including:

- Mean and median imputation for numerical variables
- Most frequent category imputation for categorical variables
- Arbitrary value imputation for both categorical and numerical variables
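As a rough illustration of the strategies listed above, the sketch below fits one imputer per strategy on a small made-up array and stores the value learnt for each column. The toy values, the `-999.0` filler and the `"Missing"` label are assumptions chosen only for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numerical data (hypothetical values); np.nan marks the missing entries.
X_num = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [1.0, 20.0]])

# Fit one imputer per strategy and keep the value learnt for each column.
learnt = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(X_num)
    learnt[strategy] = imputer.statistics_

# Arbitrary value imputation: strategy="constant" together with fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-999.0)
X_constant = constant_imputer.fit_transform(X_num)

# The same strategy handles categorical data when fill_value is a string.
X_cat = np.array([["std"], [np.nan], ["turbo"]], dtype=object)
label_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
X_cat_filled = label_imputer.fit_transform(X_cat)
```

Only `most_frequent` and `constant` accept non-numerical columns; `mean` and `median` require numerical input.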

### Advantages

- Simple to use if applied to the entire dataframe
- Maintained by the scikit-learn developers: good quality code
- Fast computation (it uses numpy for calculations)
- Allows for grid search over the various imputation techniques
- Allows for different missing values encodings (you can indicate if the missing values are np.nan, or zeroes, etc)
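The last two advantages can be sketched as follows. The first part uses the `missing_values` parameter for data where missing entries are encoded with a sentinel (here `-1`, an assumption for the example). The second part runs a grid search over the imputation strategy on synthetic data; the step names `imputer` and `model`, the linear model and the data itself are all hypothetical choices, not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# 1) Missing values encoded with a sentinel (-1) instead of np.nan.
X_sentinel = np.array([[2.0], [-1.0], [4.0]])
sentinel_imputer = SimpleImputer(missing_values=-1, strategy="mean")
X_filled = sentinel_imputer.fit_transform(X_sentinel)  # -1 becomes the mean of 2 and 4

# 2) Grid search over the imputation strategy, on synthetic regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=60)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", LinearRegression()),
])

# The "<step name>__<parameter>" syntax reaches into the pipeline step.
param_grid = {"imputer__strategy": ["mean", "median", "most_frequent"]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
best_strategy = search.best_params_["imputer__strategy"]
```

Because the imputer sits inside the pipeline, it is refitted on the training portion of every cross-validation fold, so the comparison between strategies does not leak information from the validation folds.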

### Limitations

- Returns a numpy array by default instead of a pandas dataframe, which is inconvenient for data analysis
- Needs additional classes to select which features to impute, which means:
    - it requires more lines of code
    - the additional classes were still marked experimental at the time of writing (may change without warning)
    - it is not so straightforward to use anymore.
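Regarding the first limitation, more recent scikit-learn releases (1.2 and later, as far as I know) let a transformer return a pandas dataframe through `set_output`. The sketch below assumes such a version is installed and falls back to wrapping the array by hand otherwise; the two-column frame is made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical two-column frame with one missing category.
df = pd.DataFrame({
    "aspiration": pd.Series(["std", np.nan, "std", "turbo"], dtype=object),
    "body-style": pd.Series(["sedan", "sedan", "hatchback", "sedan"], dtype=object),
})

imputer = SimpleImputer(strategy="most_frequent")

if hasattr(imputer, "set_output"):
    # Newer scikit-learn: ask the transformer to return a dataframe directly.
    imputer.set_output(transform="pandas")
    df_imp = imputer.fit_transform(df)
else:
    # Older scikit-learn: wrap the returned numpy array by hand.
    df_imp = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns, index=df.index)
```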

### More details about the transformers

- [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)
- [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
- [Stackoverflow](https://stackoverflow.com/questions/54160370/how-to-use-sklearn-column-transformer)


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# these are the objects we need to impute missing data
# with sklearn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# to split the datasets
from sklearn.model_selection import train_test_split

In [2]:
# let's load the car-data_rev2.csv dataset
# we use only a mix of categorical and numerical variables

data = pd.read_csv('C:\\Users\\gusal\\machine learning\\Feature engineering\\car-data_rev2.csv')


In [3]:
data.head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,price
0,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,13495.0
1,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,16500.0
2,1,alfa-romero,gas,,two,hatchback,rwd,front,,16500.0
3,2,audi,gas,,four,,fwd,front,99.8,13950.0
4,2,audi,gas,,four,sedan,4wd,front,99.4,17450.0


In [4]:
# let's check the null values
data.isnull().mean()

symboling          0.000000
make               0.000000
fuel-type          0.000000
aspiration         0.175610
num-of-doors       0.000000
body-style         0.141463
drive-wheels       0.000000
engine-location    0.000000
wheel-base         0.121951
price              0.000000
dtype: float64

The categorical variables body-style and aspiration contain missing data.
The numerical variable wheel-base also contains missing data.

In [5]:
inputs = data.drop(['price'], axis = 1)
target = data.price

In [6]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(inputs, # just the features
                                                    target, # the target
                                                    test_size=0.3, # the percentage of obs in the test set
                                                    random_state=0) # for reproducibility
X_train.shape, X_test.shape

((143, 9), (62, 9))

In [7]:
# let's check the missing data again
X_train.isnull().mean()

symboling          0.000000
make               0.000000
fuel-type          0.000000
aspiration         0.139860
num-of-doors       0.000000
body-style         0.111888
drive-wheels       0.000000
engine-location    0.000000
wheel-base         0.104895
dtype: float64

In [8]:
X_train

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base
40,0,honda,gas,std,four,sedan,fwd,front,96.5
60,0,mazda,gas,std,four,sedan,fwd,front,98.8
56,3,mazda,gas,std,two,hatchback,rwd,front,95.3
101,0,nissan,gas,std,four,sedan,fwd,front,100.4
86,1,mitsubishi,gas,std,four,sedan,fwd,front,96.3
...,...,...,...,...,...,...,...,...,...
67,-1,mercedes-benz,diesel,turbo,four,sedan,rwd,front,110.0
192,0,volkswagen,diesel,turbo,four,sedan,fwd,front,100.4
117,0,peugot,gas,turbo,four,sedan,rwd,front,108.0
47,0,jaguar,gas,std,four,sedan,rwd,front,113.0


In [9]:
X_train.mode()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base
0,0,toyota,gas,std,four,sedan,fwd,front,93.7


### SimpleImputer on the entire dataset

In [10]:
# Now we impute the missing values with SimpleImputer

# create an instance of the simple imputer
# we indicate that we want to impute with the 
# most frequent category

imputer = SimpleImputer(strategy='most_frequent')

# we fit the imputer to the train set
# the imputer will learn the mode of ALL variables
# categorical or not
imputer.fit(X_train)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='most_frequent', verbose=0)

In [11]:
# we can look at the learnt frequent values like this:
imputer.statistics_

array([0, 'toyota', 'gas', 'std', 'four', 'sedan', 'fwd', 'front', 93.7],
      dtype=object)

**Note** that the transformer learns the most frequent value for both categorical AND numerical variables.
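A minimal sketch of this behaviour on made-up data: the imputer is fitted on a frame with one categorical and one numerical column, and the learnt values are compared with the pandas mode. The column values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: one categorical and one numerical column, each with a missing value.
toy = pd.DataFrame({
    "make": pd.Series(["toyota", "toyota", "audi", np.nan], dtype=object),
    "wheel-base": [93.7, np.nan, 93.7, 99.8],
})

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(toy)

# Pair each column name with the value the imputer learnt for it.
learnt = dict(zip(toy.columns, imputer.statistics_))

# pandas computes the same modes (first row of DataFrame.mode()).
pandas_mode = toy.mode().iloc[0].to_dict()
```

If the mode is not the desired replacement for a numerical variable, the imputer has to be restricted to the categorical columns, for example with a ColumnTransformer.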

In [41]:
# and we can investigate the frequent values to corroborate
# the imputer did a good job
X_train.mode()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base
0,0,toyota,gas,std,four,sedan,fwd,front,93.7


In [13]:
# and now we impute the train and test set

# NOTE: the data is returned as a numpy array!!!
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

X_train_imp

array([[0, 'honda', 'gas', ..., 'fwd', 'front', 96.5],
       [0, 'mazda', 'gas', ..., 'fwd', 'front', 98.8],
       [3, 'mazda', 'gas', ..., 'rwd', 'front', 95.3],
       ...,
       [0, 'peugot', 'gas', ..., 'rwd', 'front', 108.0],
       [0, 'jaguar', 'gas', ..., 'rwd', 'front', 113.0],
       [2, 'toyota', 'gas', ..., 'rwd', 'front', 98.4]], dtype=object)

In [14]:
# convert the train set back to a dataframe:
pd.DataFrame(X_train_imp, columns=inputs.columns).head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base
0,0,honda,gas,std,four,sedan,fwd,front,96.5
1,0,mazda,gas,std,four,sedan,fwd,front,98.8
2,3,mazda,gas,std,two,hatchback,rwd,front,95.3
3,0,nissan,gas,std,four,sedan,fwd,front,100.4
4,1,mitsubishi,gas,std,four,sedan,fwd,front,96.3


In [15]:
# checking that the mode of each variable is unchanged after imputation:

pd.DataFrame(X_train_imp, columns=inputs.columns).mode()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base
0,0,toyota,gas,std,four,sedan,fwd,front,93.7


In [17]:
# confirming there are no null values
pd.DataFrame(X_train_imp, columns=inputs.columns).isnull().sum()

symboling          0
make               0
fuel-type          0
aspiration         0
num-of-doors       0
body-style         0
drive-wheels       0
engine-location    0
wheel-base         0
dtype: int64

### SimpleImputer: different procedures on different features

Now, I will impute:

- categorical variables with the frequent category
- numerical variables with the mean.

In [18]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(inputs,
                                                    target,
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((143, 9), (62, 9))

In [19]:
X_train_original = X_train.copy()

In [20]:
# let's look at the missing values
X_train.isnull().mean()

symboling          0.000000
make               0.000000
fuel-type          0.000000
aspiration         0.139860
num-of-doors       0.000000
body-style         0.111888
drive-wheels       0.000000
engine-location    0.000000
wheel-base         0.104895
dtype: float64

In [21]:
# first we need to make lists, indicating which features
# will be imputed with each method

features_numeric = ['wheel-base']
features_categoric = ['aspiration','body-style']

# then we instantiate the imputers, within a pipeline
# we create one mean imputer and one frequent category imputer
# by changing the parameter in the strategy

numeric_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

categoric_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

# then we put the features list and the transformers together
# using the column transformer

preprocessor = ColumnTransformer(transformers=[
    ('numeric_imputer', numeric_imputer, features_numeric),
    ('categoric_imputer', categoric_imputer, features_categoric)
])
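The feature lists above are written by hand. As a sketch, they could also be derived from the data by combining the missing-value check with the column dtypes; the frame below is hypothetical and only mimics the structure of the car dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same kind of columns as the car dataset.
X = pd.DataFrame({
    "make": pd.Series(["audi", "honda", "mazda"], dtype=object),
    "aspiration": pd.Series(["std", np.nan, "turbo"], dtype=object),
    "symboling": [2, 0, 3],
    "wheel-base": [99.8, np.nan, 95.3],
})

# Columns that contain at least one missing value.
cols_with_na = X.columns[X.isnull().any()]

# Split those columns by dtype: numerical vs. everything else.
features_numeric = X[cols_with_na].select_dtypes(include="number").columns.tolist()
features_categoric = X[cols_with_na].select_dtypes(exclude="number").columns.tolist()
```

One caveat of this shortcut: a column that is complete in the train set but has missing values in the test set would not be selected, so the lists should be reviewed before use.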

In [22]:
# now we fit the preprocessor
preprocessor.fit(X_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('numeric_imputer',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0))],
                                          verbose=False),
                                 ['wheel-base']),
                                ('categoric_imputer',
                                 Pipeline(memory=None,
                

In [23]:
# we can explore the transformers like this:

preprocessor.transformers

[('numeric_imputer', Pipeline(memory=None,
           steps=[('imputer',
                   SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                                 missing_values=nan, strategy='mean',
                                 verbose=0))],
           verbose=False), ['wheel-base']),
 ('categoric_imputer', Pipeline(memory=None,
           steps=[('imputer',
                   SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                                 missing_values=nan, strategy='most_frequent',
                                 verbose=0))],
           verbose=False), ['aspiration', 'body-style'])]

In [24]:
# and we can look at the parameters learnt like this:

# for the mean imputer
preprocessor.named_transformers_['numeric_imputer'].named_steps['imputer'].statistics_

array([99.0578125])

In [25]:
# and we can corroborate the value with that one in
# the train set
X_train[features_numeric].mean()

wheel-base    99.057812
dtype: float64

In [26]:
# for frequent category imputer

preprocessor.named_transformers_['categoric_imputer'].named_steps['imputer'].statistics_

array(['std', 'sedan'], dtype=object)

In [27]:
# and we corroborate those values in the train set

X_train[features_categoric].mode()

Unnamed: 0,aspiration,body-style
0,std,sedan


In [28]:
X_train.head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base
40,0,honda,gas,std,four,sedan,fwd,front,96.5
60,0,mazda,gas,std,four,sedan,fwd,front,98.8
56,3,mazda,gas,std,two,hatchback,rwd,front,95.3
101,0,nissan,gas,std,four,sedan,fwd,front,100.4
86,1,mitsubishi,gas,std,four,sedan,fwd,front,96.3


In [29]:
X_train.shape

(143, 9)

In [30]:
# and now we can impute the data
X_train_imp = preprocessor.transform(X_train)
X_test_imp = preprocessor.transform(X_test)

In [31]:
X_train_imp.shape

(143, 3)

In [33]:
# see how the result of the imputation is a 3 column dataset
Data_imp = pd.DataFrame(X_train_imp, columns=features_numeric + features_categoric)

In [34]:
Data_imp.head()

Unnamed: 0,wheel-base,aspiration,body-style
0,96.5,std,sedan
1,98.8,std,sedan
2,95.3,std,hatchback
3,100.4,std,sedan
4,96.3,std,sedan


In this case, the returned dataset contains only the variables that have been imputed, because the ColumnTransformer was created with its default `remainder='drop'`.
Now, we need to add the rest of the columns:


columns_to_be_added = ['symboling', 'make', 'fuel-type', 'num-of-doors',
        'drive-wheels', 'engine-location']

First, we need to reset the index of X_train so that it matches the index of Data_imp.

In [35]:
X_train = X_train.reset_index(drop=True)

In [36]:
X_train.head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base
0,0,honda,gas,std,four,sedan,fwd,front,96.5
1,0,mazda,gas,std,four,sedan,fwd,front,98.8
2,3,mazda,gas,std,two,hatchback,rwd,front,95.3
3,0,nissan,gas,std,four,sedan,fwd,front,100.4
4,1,mitsubishi,gas,std,four,sedan,fwd,front,96.3


In [37]:
# copy the columns that were not imputed from X_train into Data_imp
columns_to_be_added = ['symboling', 'make', 'fuel-type', 'num-of-doors',
                       'drive-wheels', 'engine-location']
Data_imp[columns_to_be_added] = X_train[columns_to_be_added]

In [38]:
Data_imp

Unnamed: 0,wheel-base,aspiration,body-style,symboling,make,fuel-type,num-of-doors,drive-wheels,engine-location
0,96.5,std,sedan,0,honda,gas,four,fwd,front
1,98.8,std,sedan,0,mazda,gas,four,fwd,front
2,95.3,std,hatchback,3,mazda,gas,two,rwd,front
3,100.4,std,sedan,0,nissan,gas,four,fwd,front
4,96.3,std,sedan,1,mitsubishi,gas,four,fwd,front
...,...,...,...,...,...,...,...,...,...
138,110,turbo,sedan,-1,mercedes-benz,diesel,four,rwd,front
139,100.4,turbo,sedan,0,volkswagen,diesel,four,fwd,front
140,108,turbo,sedan,0,peugot,gas,four,rwd,front
141,113,std,sedan,0,jaguar,gas,four,rwd,front


In [39]:
# Confirming that there are no null values
Data_imp.isnull().sum()

wheel-base         0
aspiration         0
body-style         0
symboling          0
make               0
fuel-type          0
num-of-doors       0
drive-wheels       0
engine-location    0
dtype: int64