# Machine Learning Pipeline - Feature Engineering with Classes

In this notebook, we will reproduce the Feature Engineering Pipeline from notebook 2 (02-Machine-Learning-Pipeline-Feature-Engineering). Wherever possible, we will replace the manually written functions with our own classes, so that we can see the value they bring.

Our classes are saved in the preprocessors module file.

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# from scikit-learn
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# from feature-engine
from feature_engine.encoding import RareLabelEncoder

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

# to rank our target values
from scipy.stats import rankdata

# our in-house classes
import preprocessors as pp

In [2]:
# load dataset
data = pd.read_csv('CarPrice_db.csv')

# rows and columns of the data
print(f'Number of rows: {data.shape[0]}')
print(f'Number of columns: {data.shape[1]}')

# visualise the dataset
data.head()

Number of rows: 205
Number of columns: 26


Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,carheight,curbweight,enginetype,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romeo,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romeo,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romeo,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


# Separate dataset into train and test

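It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data. These parameters must be learned from the train set only. Otherwise information from the test set leaks into the transformations, which encourages over-fitting and makes the test performance look better than it is.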

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['car_ID','enginetype','price'],axis=1), 
    data['price'], 
    test_size=0.3, 
    random_state=0)

In [4]:
X_train.shape, X_test.shape

((143, 23), (62, 23))
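
As a minimal sketch of why the seed matters, the cell below splits a small made-up array twice with the same `random_state` and checks that both calls return the same rows. The array and variable names are illustrative only and are not part of the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (made up): 10 observations, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# two independent calls with the same seed
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# with a fixed seed, both calls select exactly the same rows
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
```

Without a fixed `random_state`, repeated runs of the notebook would generally produce different train and test sets, and the parameters learned below would change from run to run.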

# Feature Engineering

In the following cells, we will engineer the variables of the car price dataset so that we tackle:

1. Remapping the symboling variable.
2. Reclassifying the names of cars to their countries of origin.
3. Categorical Variables: convert strings to numbers.
4. Discrete Variables: map strings to numbers using a nonparametric ranking.
5. Scaling the values of the variables to the same range.

## Categorical Variables

### Remap symboling
Symboling values are typically numbered 1 through 6. In our dataset, they have been numbered from -2 to 3. We are going to use an in-house class to re-assign the numbering so that it reflects the domain reality.

In [5]:
remap = pp.RemapVariable(['symboling'])

remap.fit(X_train)

X_train = remap.transform(X_train)
X_test = remap.transform(X_test)
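
The code of `pp.RemapVariable` is not shown in this notebook, so the cell below is only a sketch of how such a class could be written. The class name `RemapSketch`, the `mapping` parameter and the dictionary itself are hypothetical placeholders; the mapping actually used in the preprocessors module may differ.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RemapSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.RemapVariable (not the real class)."""

    def __init__(self, variables, mapping):
        self.variables = variables
        self.mapping = mapping

    def fit(self, X, y=None):
        # nothing is learned from the data; fit only exists so that the
        # class follows the scikit-learn fit/transform convention
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's dataframe
        for var in self.variables:
            X[var] = X[var].map(self.mapping)
        return X


# placeholder mapping (assumption): shift -2..3 onto 1..6
demo_mapping = {-2: 1, -1: 2, 0: 3, 1: 4, 2: 5, 3: 6}

demo = pd.DataFrame({'symboling': [-2, -1, 0, 1, 2, 3]})
remapped = RemapSketch(['symboling'], demo_mapping).fit(demo).transform(demo)
print(remapped['symboling'].tolist())
```

Because the mapping is fixed up front, `fit` learns nothing here. Keeping the method anyway means the class can later sit inside a scikit-learn pipeline next to transformers that do learn parameters.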

### Adjust CarName Cardinality
We will collapse the cardinality of the column to reflect the home countries of the manufacturers/brands.

In [6]:
car_object = pp.CarTransform(variable='CarName')

car_object.fit(X_train)

X_train = car_object.transform(X_train)
X_test = car_object.transform(X_test)
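
As a rough illustration of the idea, and not the actual `pp.CarTransform` code, a brand-to-country transformer could look like the sketch below. The class name `BrandToCountrySketch`, the partial `brand_country` dictionary and the `'other'` fallback for unknown brands are all assumptions made for this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BrandToCountrySketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.CarTransform (not the real class)."""

    def __init__(self, variable, mapping, default='other'):
        self.variable = variable
        self.mapping = mapping
        self.default = default

    def fit(self, X, y=None):
        # the mapping is supplied by the user, so nothing is learned here
        return self

    def transform(self, X):
        X = X.copy()
        # brands missing from the mapping fall back to the default label
        X[self.variable] = X[self.variable].map(self.mapping).fillna(self.default)
        return X


# partial, illustrative mapping only
brand_country = {
    'alfa-romeo': 'italy',
    'audi': 'germany',
    'bmw': 'germany',
    'honda': 'japan',
    'toyota': 'japan',
    'volvo': 'sweden',
}

demo = pd.DataFrame({'CarName': ['alfa-romeo', 'audi', 'toyota', 'unknown-brand']})
countries = BrandToCountrySketch('CarName', brand_country).fit(demo).transform(demo)
print(countries['CarName'].tolist())
```

Collapsing many brands into a handful of countries leaves fewer, better-populated categories, which makes the later encoding steps more stable.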

### Removing Rare Labels

For the categorical variables, we will group the categories that are present in less than 1% of the observations: any such value will be replaced by the string 'Rare'.

In [7]:
# list out the binary variables
binary_vars = [feat for feat in X_train.columns if X_train[feat].nunique() == 2]
print(binary_vars)

# list out the non-binary/discrete variables
non_binary_vars = ['CarName', 'carbody', 'drivewheel', 'cylindernumber', 'fuelsystem']

['fueltype', 'aspiration', 'doornumber', 'enginelocation']


In [8]:
# combine the two classes of categorical variables in one list
cat_vars = binary_vars + non_binary_vars

In [9]:
rare_encoder = RareLabelEncoder(tol=0.01, n_categories=1, variables=cat_vars)

rare_encoder.fit(X_train)

X_train = rare_encoder.transform(X_train)
X_test = rare_encoder.transform(X_test)
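
To make the grouping concrete, here is a pandas-only approximation of the idea on made-up data. It is not feature-engine's implementation: it only mirrors the principle that category frequencies are learned on the train set and then applied to both sets.

```python
import pandas as pd

# toy data (made up): 'spfi' appears in 0.5% of the train observations
train = pd.Series(['mpfi'] * 120 + ['2bbl'] * 79 + ['spfi'] * 1, name='fuelsystem')
test = pd.Series(['mpfi', 'spfi', 'mfi'], name='fuelsystem')

tol = 0.01  # same 1% threshold as above

# learn the frequent categories from the train set only
freqs = train.value_counts(normalize=True)
frequent = freqs[freqs >= tol].index

# anything that is not a frequent train category becomes 'Rare'
train_grouped = train.where(train.isin(frequent), 'Rare')
test_grouped = test.where(test.isin(frequent), 'Rare')

print(train_grouped.value_counts())
print(test_grouped.tolist())
```

In this sketch a category that never appeared in the train set ('mfi') is also sent to 'Rare', because the list of frequent categories comes from the train set alone.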

### Encode binary variables

In [10]:
cat_encoder = pp.CategoricalEncoder(binary_vars)
cat_encoder.fit(X_train)

X_train = cat_encoder.transform(X_train)
X_test = cat_encoder.transform(X_test)

Let's run a quick sanity check to ensure that every feature in the train set is also present in the test set:

In [11]:
for feat in X_train.columns:
    if feat not in X_test.columns:
        print(feat)

enginelocation_rear


In [12]:
X_train['enginelocation_rear'].value_counts()

0    140
1      3
Name: enginelocation_rear, dtype: int64

We can see that the drop-first operation during variable encoding has reduced the feature space in the test set by one variable. This is because the variable is heavily imbalanced and all the instances of the '1' category ended up in the train set after the split. The test set therefore contains a single category for this variable, and dropping the first category leaves no column at all.

We can go ahead and drop this variable from the train set because it is effectively a quasi-constant feature (140 of its 143 values are 0), even though it only slightly misses the threshold for quasi-constant features.

In [13]:
X_train = X_train.drop('enginelocation_rear',axis=1)

In [14]:
# another sanity check to be sure
for feat in X_train.columns:
    if feat not in X_test.columns:
        print(feat)

We are back at parity between both sets of data.
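
The mismatch seen above can be reproduced on a toy dataframe. The sketch assumes that the encoder behaves like `pd.get_dummies` with `drop_first=True`, which this notebook does not confirm, and the values are made up. It also shows one alternative to dropping the column, namely re-indexing the test set onto the train columns.

```python
import pandas as pd

# toy data (made up): 'rear' only appears in the train set
train = pd.DataFrame({'horsepower': [111, 154, 207],
                      'enginelocation': ['front', 'front', 'rear']})
test = pd.DataFrame({'horsepower': [102, 115],
                     'enginelocation': ['front', 'front']})

# encode each set independently, dropping the first category
train_enc = pd.get_dummies(train, columns=['enginelocation'],
                           drop_first=True, dtype=int)
test_enc = pd.get_dummies(test, columns=['enginelocation'],
                          drop_first=True, dtype=int)

# the test set has a single category, so no dummy column is created
missing = [col for col in train_enc.columns if col not in test_enc.columns]
print(missing)

# alternative to dropping: add the missing column to the test set as zeros
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Dropping the column, as done in this notebook, is the simpler choice when the feature carries almost no information. Re-indexing is an option when the rare category is expected to matter for new data.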

### Encode non-binary variables

For our non-binary features, we will transform the strings into numbers that capture the monotonic relationship between the label/category and the target.

In [15]:
ord_encoder = pp.OrdinalEncoder(variables=non_binary_vars, target='price')

ord_encoder.fit(X_train,y_train)

X_train = ord_encoder.transform(X_train)
X_test = ord_encoder.transform(X_test)
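
The internals of `pp.OrdinalEncoder` are not shown here, so the following is only a sketch of one way to build a target-ordered encoding. It assumes that categories are ranked by their mean target value using `rankdata`; the real class may aggregate or rank differently. The data is made up.

```python
import pandas as pd
from scipy.stats import rankdata

# toy train data (made up)
train = pd.DataFrame({
    'drivewheel': ['fwd', 'fwd', 'rwd', 'rwd', '4wd'],
    'price': [7000, 9000, 20000, 24000, 11000],
})

# learn the mean target per category from the train set
means = train.groupby('drivewheel')['price'].mean()

# rank the categories by mean target: the lowest mean gets rank 1
ranks = rankdata(means.values, method='dense').astype(int)
mapping = {label: int(rank) for label, rank in zip(means.index, ranks)}

# apply the learned mapping; the same mapping would be used on the test set
encoded = train['drivewheel'].map(mapping)
print(mapping)
print(encoded.tolist())
```

Because the integer assigned to each category grows with its mean target, a linear model can pick up the relationship with a single coefficient instead of one coefficient per category.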

In [16]:
X_train.head()

Unnamed: 0,symboling,CarName,carbody,drivewheel,wheelbase,carlength,carwidth,carheight,curbweight,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,fueltype_gas,aspiration_turbo,doornumber_two
40,1,7,5,2,96.5,175.4,62.5,54.1,2372,6,110,3,3.15,3.58,9.0,86,5800,27,33,1,0,0
60,1,7,5,2,98.8,177.8,66.5,55.5,2410,6,122,5,3.39,3.39,8.6,84,4800,26,32,1,0,0
56,4,7,4,3,95.3,169.0,65.7,49.6,2380,2,70,2,3.33,3.255,9.4,101,6000,17,23,1,0,1
101,1,7,5,2,100.4,181.7,66.5,55.1,3095,5,181,7,3.43,3.27,9.0,152,5200,17,22,1,0,0
86,2,7,5,2,96.3,172.4,65.4,51.6,2405,6,122,5,3.35,3.46,8.5,88,5000,25,32,1,0,0


## Feature Scaling

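Regularised linear models such as Lasso are sensitive to the scale of the features, so the features need to be scaled. We will use min-max scaling, which maps each feature to the range between 0 and 1 based on the minimum and maximum values seen in the train set.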

In [17]:
# list out the continuous features to be scaled
scaled_features = ['symboling',
                   'CarName',
                   'carbody',
                   'drivewheel',
                   'wheelbase',
                   'carlength',
                   'carwidth',
                   'carheight',
                   'curbweight',
                   'cylindernumber',
                   'enginesize',
                   'fuelsystem',
                   'boreratio',
                   'stroke',
                   'compressionratio',
                   'horsepower',
                   'peakrpm',
                   'citympg',
                   'highwaympg']

Our customised scaler class takes in a subset of continuous features. This is because the original scaler object scales every column it receives and doesn't offer this selective option by itself inside a pipeline.

In [18]:
# create scaler
scaler = pp.ContinuousScaler(variables=scaled_features)

# fit the scaler to the train set
scaler.fit(X_train)

# transform train and test sets with learned parameters
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
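
A scaler restricted to a subset of columns can be sketched as a thin wrapper around `MinMaxScaler`. This is an assumption about how `pp.ContinuousScaler` might work, not its actual code; the class name `SubsetMinMaxScaler` and the data are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler


class SubsetMinMaxScaler(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.ContinuousScaler (not the real class)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn the minimum and maximum of the selected columns only
        self.scaler_ = MinMaxScaler().fit(X[self.variables])
        return self

    def transform(self, X):
        X = X.copy()
        # scale the selected columns, leave every other column untouched
        X[self.variables] = self.scaler_.transform(X[self.variables])
        return X


# toy data (made up): scale horsepower, keep the binary flag as it is
train = pd.DataFrame({'horsepower': [50, 100, 150], 'fueltype_gas': [1, 0, 1]})
test = pd.DataFrame({'horsepower': [75, 200], 'fueltype_gas': [1, 1]})

subset_scaler = SubsetMinMaxScaler(variables=['horsepower']).fit(train)
train_scaled = subset_scaler.transform(train)
test_scaled = subset_scaler.transform(test)
print(test_scaled)
```

Note that a test value outside the range seen in the train set (200 here) is scaled to a number above 1, because the minimum and maximum come from the train set only.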

In [19]:
X_train.head()

Unnamed: 0,symboling,CarName,carbody,drivewheel,wheelbase,carlength,carwidth,carheight,curbweight,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,fueltype_gas,aspiration_turbo,doornumber_two
40,0.0,1.0,1.0,0.5,0.28863,0.485039,0.070707,0.525,0.246106,1.0,0.15625,0.333333,0.419643,0.614379,0.125,0.161905,0.673469,0.56,0.548387,1,0,0
60,0.0,1.0,1.0,0.5,0.355685,0.522835,0.474747,0.641667,0.263017,1.0,0.203125,0.666667,0.633929,0.490196,0.1,0.152381,0.265306,0.52,0.516129,1,0,0
56,0.6,1.0,0.75,1.0,0.253644,0.384252,0.393939,0.15,0.249666,0.2,0.0,0.166667,0.580357,0.401961,0.15,0.233333,0.755102,0.16,0.225806,1,0,1
101,0.0,1.0,1.0,0.5,0.402332,0.584252,0.474747,0.608333,0.567868,0.8,0.433594,1.0,0.669643,0.411765,0.125,0.47619,0.428571,0.16,0.193548,1,0,0
86,0.2,1.0,1.0,0.5,0.282799,0.437795,0.363636,0.316667,0.260792,1.0,0.203125,0.666667,0.598214,0.535948,0.09375,0.171429,0.346939,0.48,0.516129,1,0,0


We now have several classes whose parameters were learned from the training dataset. We can store them and retrieve them at a later stage, so that when a colleague comes with new data, we are in a better position to score it quickly.

Still:

- we would need to save each class
- then we could load each class
- and apply each transformation individually.

Which sounds like a lot of work.

The good news is, we can reduce the amount of work, if we set up all the transformations within a pipeline.
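
As a preview of that idea, the sketch below chains two steps in a scikit-learn `Pipeline` and stores the fitted pipeline as a single object with `joblib`. It uses only built-in scikit-learn classes and made-up data; the step names, the `alpha` value and the in-memory buffer (used instead of a file on disk) are choices made for this example, not part of the notebook's own pipeline.

```python
import io

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# toy data (made up)
X_demo = pd.DataFrame({'horsepower': [50, 100, 150, 200],
                       'curbweight': [2000, 2400, 2800, 3200]})
y_demo = pd.Series([5000, 10000, 15000, 20000])

# every step is fitted in order with a single call to fit
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', Lasso(alpha=0.1, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# one object to save and one object to load, instead of one per class
buffer = io.BytesIO()
joblib.dump(pipe, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

# the restored pipeline gives the same predictions as the original
same_predictions = np.allclose(pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Custom classes that follow the fit/transform convention can be placed in such a pipeline in the same way as the built-in steps used here.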