# Machine Learning Pipeline - Feature Engineering with Classes

In this notebook, we will reproduce the feature engineering pipeline from notebook 2 (02-Machine-Learning-Pipeline-Feature-Engineering), replacing the manually created functions with our own classes wherever possible, so that we can see the value the classes bring.

Our classes are saved in the preprocessors module file.

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# from scikit-learn
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

# to rank our target values
from scipy.stats import rankdata

# our in-house module
import preprocessors as pp

In [2]:
# load dataset
data = pd.read_csv('student-mat.csv', sep=';')

# rows and columns of the data
print(f'Number of rows: {data.shape[0]}')
print(f'Number of columns: {data.shape[1]}')

# visualise the dataset
data.head()

Number of rows: 395
Number of columns: 33


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


# Separate dataset into train and test

It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data. It is important to learn these parameters only from the train set, so that no information from the test set leaks into the transformations. This is to avoid over-fitting.

**Separating the data into train and test involves randomness, so we need to set the seed to make the split reproducible.**

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['G1','G2','G3'],axis=1), 
    data['G3'], 
    test_size=0.3, 
    random_state=0)

In [4]:
X_train.shape, X_test.shape

((276, 30), (119, 30))

# Feature Engineering

In the following cells, we will engineer the variables of the student grades dataset so that we tackle:

1. Binary Categorical Variables: convert strings to numbers with one-hot encoding.
2. Non-binary Categorical Variables: map strings to numbers ranked by their relationship with the target.
3. Scale the values of the variables to the same range.

## Categorical Variables

### Encode binary variables

Next, we need to transform the strings of the categorical variables into numbers. We will apply one-hot encoding on our binary features.

In [5]:
# list out the binary objects
binary_vars = ['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']

In [6]:
cat_vars = [var for var in data.columns if data[var].dtype == 'O']
len(cat_vars)

17

In [7]:
# set up the class
cat_encoder = pp.CategoricalEncoder(binary_vars)

In [8]:
# fit the class to the train set
cat_encoder.fit(X_train)

<preprocessors.CategoricalEncoder at 0x11ba918e0>

In [9]:
# transform the values
X_train = cat_encoder.transform(X_train)
X_test = cat_encoder.transform(X_test)
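
The `preprocessors` module is not shown in this notebook, so the following is only a minimal sketch of how a class like `CategoricalEncoder` could work. It assumes that each binary variable becomes a single 0/1 column named after the second label in sorted order, which is what the column names of the transformed data further down (for example `sex_M`, `internet_yes`) suggest. The class name `BinaryDummyEncoder` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BinaryDummyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.CategoricalEncoder (not the real class)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn, from the train set only, which label is encoded as 1
        self.positive_ = {}
        for var in self.variables:
            categories = sorted(X[var].dropna().unique())
            self.positive_[var] = categories[-1]
        return self

    def transform(self, X):
        X = X.copy()
        for var, label in self.positive_.items():
            # 1 if the row carries the learned label, 0 otherwise
            X[f'{var}_{label}'] = (X[var] == label).astype(int)
        # drop the original string columns
        return X.drop(columns=self.variables)


# toy data, made up for illustration
train = pd.DataFrame({'sex': ['F', 'M', 'F'],
                      'internet': ['yes', 'no', 'yes'],
                      'age': [18, 17, 15]})
test = pd.DataFrame({'sex': ['M'], 'internet': ['yes'], 'age': [16]})

encoder = BinaryDummyEncoder(['sex', 'internet']).fit(train)
encoded_test = encoder.transform(test)
```

Because the label to encode is stored during `fit`, a new batch that happens to contain only one of the two labels is still encoded consistently with the train set.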

### Encode non-binary variables

For our non-binary features, we will transform the strings into numbers that capture the monotonic relationship between the label/category and the target.

In [10]:
# list out the non binary variables
non_binary_vars = ['Mjob', 'Fjob', 'reason', 'guardian']

In [11]:
# set up the class
ord_encoder = pp.OrdinalEncoder(variables=non_binary_vars, target='G3')

In [12]:
# fit the class to the train set
ord_encoder.fit(X_train, y_train)

<preprocessors.OrdinalEncoder at 0x11be31610>

In [13]:
ord_encoder.labels_

{'Mjob': {'health': 1, 'at_home': 2, 'teacher': 3, 'services': 4, 'other': 5},
 'Fjob': {'at_home': 1, 'health': 2, 'teacher': 3, 'services': 4, 'other': 5},
 'reason': {'other': 1, 'home': 2, 'reputation': 3, 'course': 4},
 'guardian': {'other': 1, 'father': 2, 'mother': 3}}

In [14]:
# transform the values with the persisted parameters
X_train = ord_encoder.transform(X_train)
X_test = ord_encoder.transform(X_test)
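
As an illustration of the idea, here is a minimal sketch of a target-guided ordinal encoder. It assumes the categories are ordered by the mean of the target in the train set; the actual `pp.OrdinalEncoder` may rank them differently (the notebook imports `rankdata`, for instance). The class name `MeanTargetOrdinalEncoder` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanTargetOrdinalEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.OrdinalEncoder (not the real class)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y):
        self.labels_ = {}
        for var in self.variables:
            # order the categories from lowest to highest mean target
            ordered = y.groupby(X[var]).mean().sort_values().index
            # assign 1 to the lowest category, 2 to the next, and so on
            self.labels_[var] = {label: rank
                                 for rank, label in enumerate(ordered, start=1)}
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            # categories not seen during fit become NaN
            X[var] = X[var].map(self.labels_[var])
        return X


# toy data, made up for illustration
train = pd.DataFrame({'guardian': ['mother', 'father', 'other', 'mother', 'father']})
grades = pd.Series([14, 10, 6, 12, 11])

encoder = MeanTargetOrdinalEncoder(['guardian']).fit(train, grades)
encoded = encoder.transform(pd.DataFrame({'guardian': ['other', 'mother']}))
```

The mapping is learned once from the train set and then reused unchanged on any other data, which is what keeps the test set out of the parameter estimation.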

## Feature Scaling

For use in linear models, features need to be on a comparable scale. We will apply min-max scaling, which rescales each selected feature relative to the minimum and maximum values learned from the train set.

In [15]:
# list out the features to be scaled
scaled = ['age','Medu','Fedu','Mjob','Fjob','guardian','traveltime','studytime','failures']

We have created a customised scaler class that takes in the subset of continuous features to be scaled. This is because the original scaler object doesn't offer this option inside a pipeline.

In [16]:
# create scaler
scaler = pp.ContinuousScaler(variables=scaled)

# fit the scaler to the train set
scaler.fit(X_train)

# transform train and test sets with learned parameters
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
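
A minimal sketch of such a subset scaler is shown below, assuming it simply wraps scikit-learn's `MinMaxScaler` and applies it to the listed columns only. The class name `SubsetMinMaxScaler` and the toy data are hypothetical; the real `pp.ContinuousScaler` may be implemented differently.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler


class SubsetMinMaxScaler(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.ContinuousScaler (not the real class)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn the minimum and maximum of the selected columns only
        self.scaler_ = MinMaxScaler().fit(X[self.variables])
        return self

    def transform(self, X):
        X = X.copy()
        # rescale the selected columns, leave every other column untouched
        X[self.variables] = self.scaler_.transform(X[self.variables])
        return X


# toy data, made up for illustration
train = pd.DataFrame({'age': [15, 22, 18], 'famrel': [4, 5, 3]})
test = pd.DataFrame({'age': [16], 'famrel': [4]})

subset_scaler = SubsetMinMaxScaler(['age']).fit(train)
scaled_test = subset_scaler.transform(test)
```

Here an age of 16 becomes (16 - 15) / (22 - 15), roughly 0.142857, while `famrel` keeps its original value because it is not in the list of variables.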

In [17]:
X_train.head()

Unnamed: 0,age,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,school_MS,sex_M,address_U,famsize_LE3,Pstatus_T,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
22,0.142857,1.0,0.333333,0.5,1.0,4,1.0,0.0,0.333333,0.0,4,5,1,1,3,5,2,0,1,1,1,1,0,0,0,1,1,1,1,0
241,0.285714,1.0,1.0,0.5,1.0,4,1.0,0.333333,0.333333,0.0,3,3,3,2,3,4,2,0,1,0,1,0,0,1,1,0,1,1,1,0
122,0.142857,0.5,1.0,1.0,0.25,4,0.5,0.333333,0.333333,0.0,4,2,2,1,2,5,2,0,0,1,1,1,0,1,1,1,1,1,1,1
176,0.142857,0.5,0.333333,0.75,1.0,3,1.0,0.333333,0.333333,0.0,3,4,4,1,4,5,2,0,0,1,0,1,0,0,1,1,0,1,1,0
162,0.142857,0.25,0.333333,1.0,1.0,4,1.0,0.333333,0.0,0.333333,4,4,4,2,4,5,0,0,1,1,1,1,0,0,0,1,1,1,0,0


We now have several classes with parameters learned from the training dataset that we can store and retrieve at a later stage, so that when a colleague brings new data, we are in a better position to score it faster.

Still:

- we would need to save each class
- then we could load each class
- and apply each transformation individually.

That sounds like a lot of work.

The good news is that we can reduce the amount of work if we set up all the transformations within a pipeline.
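
As a rough sketch of what this could look like, the example below chains a toy custom transformer and scikit-learn's `MinMaxScaler` in a `sklearn.pipeline.Pipeline`. The `YesNoEncoder` class and the data are hypothetical placeholders for the classes in `preprocessors`; they are not the steps of the actual pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class YesNoEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer standing in for the preprocessors classes."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # nothing to learn for this toy step
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = (X[var] == 'yes').astype(int)
        return X


# all transformations live in one object, applied in order
pipe = Pipeline([
    ('encode', YesNoEncoder(['internet'])),
    ('scale', MinMaxScaler()),
])

# toy data, made up for illustration
train = pd.DataFrame({'internet': ['yes', 'no', 'yes'], 'age': [15, 22, 18]})
new_data = pd.DataFrame({'internet': ['no'], 'age': [16]})

pipe.fit(train)                    # a single fit call learns every parameter
scored = pipe.transform(new_data)  # a single transform call applies every step
```

With this setup there is only one fitted object to save and load, rather than one per transformation.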