# Machine Learning Pipeline - Feature Engineering with Classes

In this notebook, we reproduce the feature engineering pipeline from notebook 2 (02-Machine-Learning-Pipeline-Feature-Engineering), replacing the manually written functions with our own classes wherever possible, so that the value the classes bring becomes clear.

Our classes are saved in the preprocessors module file.

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# from sklearn
from sklearn.model_selection import train_test_split

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

# our in-house pre-processing module
import preprocessors as pp

In [2]:
# load dataset
data = pd.read_csv('campaign.csv')

# rows and columns of the data
print(f'Number of rows: {data.shape[0]}')
print(f'Number of columns: {data.shape[1]}')

# visualise the dataset
data.head()

Number of rows: 2240
Number of columns: 29


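| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |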


# Separate dataset into train and test

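It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data. These parameters must be learned from the train set only; otherwise information from the test set leaks into the pipeline, which encourages over-fitting and makes the test performance look better than it really is.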

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**

In [3]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['ID','Response'], axis=1), # predictive variables
    data['Response'], # target
    test_size=0.2, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

In [4]:
X_train.shape, X_test.shape

((1792, 27), (448, 27))
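
To see why the seed matters, here is a minimal sketch on a made-up toy frame (not the campaign data). The helper name `split` and the toy values are ours, for illustration only. With the same `random_state`, `train_test_split` should return the same rows every time, which keeps the rest of the notebook reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the campaign data (values are made up)
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})


def split(seed):
    """Return the row labels that land in the test set for a given seed."""
    _, X_te, _, _ = train_test_split(
        toy[["x"]], toy["y"], test_size=0.2, random_state=seed
    )
    return list(X_te.index)


# the same seed reproduces the same split
same_seed_matches = split(0) == split(0)
print(same_seed_matches)
```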

# Feature Engineering

In the following cells, we will engineer the variables of the campaign dataset so that we tackle:

1. Constant values in some variables
2. Missing values
3. Temporal variables
4. Categorical variables: ordinal encoding on non-binary variables
5. Oversampling the minority label
6. Scaling the continuous variables to the same range

## Constant Values

We will drop the variables with constant values because they offer no information to the model.

In [5]:
# quick check to confirm constant-value features
[var for var in X_train.columns if X_train[var].nunique()==1]

['Z_CostContact', 'Z_Revenue']

In [6]:
drop_constant = pp.DropConstant()

In [7]:
drop_constant.fit(X_train)

<preprocessors.DropConstant at 0x12321d760>

In [8]:
drop_constant.constant_

['Z_CostContact', 'Z_Revenue']

In [9]:
X_train = drop_constant.transform(X_train)
X_test = drop_constant.transform(X_test)

In [10]:
# confirm features have been dropped
[var for var in X_train.columns if X_train[var].nunique()==1]

[]
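
The preprocessors module is not shown in this notebook, so the following is only a sketch of how a class like pp.DropConstant might work, assuming it follows the usual fit/transform convention. The class name `DropConstantSketch` and the toy frames are hypothetical; only the `constant_` attribute name is taken from the cells above.

```python
import pandas as pd


class DropConstantSketch:
    """Hypothetical stand-in for pp.DropConstant (the real code may differ)."""

    def fit(self, X, y=None):
        # learn, from the training data only, which columns hold one distinct value
        self.constant_ = [col for col in X.columns if X[col].nunique() == 1]
        return self

    def transform(self, X):
        # drop() returns a new frame, so the input frame is left untouched
        return X.drop(columns=self.constant_)


# toy data: "z" is constant in train but not in test
train = pd.DataFrame({"a": [1, 2, 3], "z": [3, 3, 3]})
test = pd.DataFrame({"a": [4, 5], "z": [3, 9]})

dropper = DropConstantSketch().fit(train)
test_out = dropper.transform(test)
print(dropper.constant_, list(test_out.columns))
```

Note that the column is dropped from the test frame because it was constant in the train frame, regardless of the values it takes in the test frame.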

## Mean Imputation

The Income variable contains null values. We will now replace those null values with the mean of the variable. The mean income will be learned from the training set and used to transform both the train and test sets.

In [11]:
# the number of null entries in each dataset before the mean imputation.
len(X_train[X_train['Income'].isnull()]), len(X_test[X_test['Income'].isnull()])

(17, 7)

In [12]:
missing_imputer = pp.MissingImputer(['Income'])

In [13]:
missing_imputer.fit(X_train,y_train)

<preprocessors.MissingImputer at 0x123225670>

In [14]:
# mean value obtained from the train set
missing_imputer.params_

{'Income': 52662.7538028169}

In [15]:
X_train = missing_imputer.transform(X_train,y_train)
X_test = missing_imputer.transform(X_test,y_test)

In [16]:
# all the null values across both sets have now been replaced
len(X_train[X_train['Income'].isnull()]), len(X_test[X_test['Income'].isnull()])

(0, 0)
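
As a rough illustration of the idea, and not the actual pp.MissingImputer code, a mean imputer could be sketched as below. The class name `MeanImputerSketch` and the toy incomes are hypothetical; we only assume, as the cells above suggest, that the learned means are stored in a `params_` dictionary.

```python
import numpy as np
import pandas as pd


class MeanImputerSketch:
    """Hypothetical stand-in for pp.MissingImputer (the real code may differ)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn the mean of each variable from the data passed to fit (the train set)
        self.params_ = {var: X[var].mean() for var in self.variables}
        return self

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's frame
        for var in self.variables:
            X[var] = X[var].fillna(self.params_[var])
        return X


# toy data (made-up incomes)
train = pd.DataFrame({"Income": [10.0, 20.0, np.nan, 30.0]})
test = pd.DataFrame({"Income": [np.nan, 100.0]})

imputer = MeanImputerSketch(["Income"]).fit(train)
test_filled = imputer.transform(test)
print(imputer.params_, test_filled["Income"].tolist())
```

The null in the toy test frame is filled with the train mean, not with the test mean, which is the behaviour described above.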

## Temporal Variables

We will now derive the customers' ages and the length of their patronage, both in years, from the Year_Birth and Dt_Customer variables respectively.

### Dt_Customer

In [17]:
X_train['Dt_Customer'].head()

818     2014-03-29
1281    2012-08-18
1766    2014-06-16
1577    2014-04-28
924     2014-05-18
Name: Dt_Customer, dtype: object

In [18]:
date_transform = pp.TransformDate(variables=['Dt_Customer'],current_year=2022)

In [19]:
date_transform.fit(X_train)

<preprocessors.TransformDate at 0x12321a670>

In [20]:
X_train = date_transform.transform(X_train)
X_test = date_transform.transform(X_test)

In [21]:
X_train['Dt_Customer'].head()

818      8
1281    10
1766     8
1577     8
924      8
Name: Dt_Customer, dtype: int64

### Year_Birth

In [22]:
X_train['Year_Birth'].head()

818     1972
1281    1971
1766    1980
1577    1947
924     1986
Name: Year_Birth, dtype: int64

In [23]:
year_transform = pp.TransformYear(variables=['Year_Birth'],current_year=2022)

In [24]:
year_transform.fit(X_train)

<preprocessors.TransformYear at 0x123225ca0>

In [25]:
X_train = year_transform.transform(X_train)
X_test = year_transform.transform(X_test)

In [26]:
X_train['Year_Birth'].head()

818     50
1281    51
1766    42
1577    75
924     36
Name: Year_Birth, dtype: int64
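
Judging from the outputs above, both transformations appear to compute current_year minus the year found in the variable. The sketch below makes that assumption explicit. It merges the two classes into a single hypothetical class, `ElapsedYearsSketch`, purely for brevity; the real pp.TransformDate and pp.TransformYear may be implemented differently.

```python
import pandas as pd


class ElapsedYearsSketch:
    """Hypothetical: years elapsed between a date or year column and current_year."""

    def __init__(self, variables, current_year):
        self.variables = variables
        self.current_year = current_year

    def fit(self, X, y=None):
        # nothing is learned from the data; fit exists to keep the fit/transform API
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            if pd.api.types.is_numeric_dtype(X[var]):
                years = X[var]  # already a year, e.g. Year_Birth
            else:
                years = pd.to_datetime(X[var]).dt.year  # date string, e.g. Dt_Customer
            X[var] = self.current_year - years
        return X


# the first two rows shown in the cells above
sample = pd.DataFrame(
    {"Dt_Customer": ["2014-03-29", "2012-08-18"], "Year_Birth": [1972, 1971]}
)
elapsed = ElapsedYearsSketch(["Dt_Customer", "Year_Birth"], current_year=2022)
out = elapsed.fit(sample).transform(sample)
print(out)
```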

## Encoding Variables

Next, we need to transform the strings of the categorical variables into numbers. The variables that require encoding in this project are non-binary. To encode them, we will replace the strings with numbers that capture the monotonic relationship between the label and the target variable.

This means we rank the labels of each variable by their response rate (the share of YES responses among the customers with that label) and replace each label with its rank.

In [27]:
X_train[['Education','Marital_Status']].head()

| | Education | Marital_Status |
|---|---|---|
| 818 | Graduation | Married |
| 1281 | 2n Cycle | Divorced |
| 1766 | PhD | Single |
| 1577 | PhD | Together |
| 924 | Graduation | Together |


In [28]:
ord_encoder = pp.OrdinalEncoder(variables=['Education','Marital_Status'], target='Response')

In [29]:
ord_encoder.fit(X_train,y_train)

<preprocessors.OrdinalEncoder at 0x12323f490>

In [30]:
# response rates for the different labels
ord_encoder.params_

{'Education': {'Graduation': 0.12691466083150985,
  '2n Cycle': 0.11464968152866242,
  'PhD': 0.19543147208121828,
  'Master': 0.14385964912280702,
  'Basic': 0.047619047619047616},
 'Marital_Status': {'Married': 0.11285714285714285,
  'Divorced': 0.17714285714285713,
  'Single': 0.2198952879581152,
  'Together': 0.09462365591397849,
  'Widow': 0.21875,
  'Absurd': 0.5,
  'Alone': 0.3333333333333333,
  'YOLO': 0.0}}

In [31]:
# ordinal rankings of the response rates
ord_encoder.ordinal_labels_

{'Education': {'Basic': 1,
  '2n Cycle': 2,
  'Graduation': 3,
  'Master': 4,
  'PhD': 5},
 'Marital_Status': {'YOLO': 1,
  'Together': 2,
  'Married': 3,
  'Divorced': 4,
  'Widow': 5,
  'Single': 6,
  'Alone': 7,
  'Absurd': 8}}

In [32]:
X_train = ord_encoder.transform(X_train)
X_test = ord_encoder.transform(X_test)

In [33]:
X_train[['Education','Marital_Status']].head()

| | Education | Marital_Status |
|---|---|---|
| 818 | 3 | 3 |
| 1281 | 2 | 4 |
| 1766 | 5 | 6 |
| 1577 | 5 | 2 |
| 924 | 3 | 2 |
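
To make the ranking concrete, here is a minimal sketch of a target-guided ordinal encoder, assuming the response rate is the mean of the 0/1 target per label and that ranks start at 1 for the lowest rate, as the outputs above suggest. The class name `TargetOrdinalEncoderSketch` and the toy data are hypothetical, and the real pp.OrdinalEncoder may differ, for example in how it handles ties or labels that were not seen during fit.

```python
import pandas as pd


class TargetOrdinalEncoderSketch:
    """Hypothetical stand-in for pp.OrdinalEncoder (the real code may differ)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y):
        self.params_ = {}
        self.ordinal_labels_ = {}
        for var in self.variables:
            # mean of a 0/1 target per label is the response rate of that label
            rates = y.groupby(X[var]).mean()
            self.params_[var] = rates.to_dict()
            # rank labels from the lowest to the highest response rate, starting at 1
            ordered = rates.sort_values().index
            self.ordinal_labels_[var] = {
                label: rank for rank, label in enumerate(ordered, start=1)
            }
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            # labels that were not seen during fit become NaN in this sketch
            X[var] = X[var].map(self.ordinal_labels_[var])
        return X


# toy data (made up): PhD always responds, Basic never, Master half of the time
train = pd.DataFrame({"Education": ["PhD", "Basic", "PhD", "Master", "Basic", "Master"]})
target = pd.Series([1, 0, 1, 1, 0, 0])
test = pd.DataFrame({"Education": ["Master", "PhD", "Unknown"]})

encoder = TargetOrdinalEncoderSketch(["Education"]).fit(train, target)
encoded = encoder.transform(test)
print(encoder.ordinal_labels_)
```

In this sketch a label that appears only in new data is mapped to a missing value, so such cases would need to be handled before scoring.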


## Feature Scaling

In [34]:
# list the non-binary variables (more than two distinct values) to be scaled
scaled = [var for var in X_train.columns if X_train[var].nunique() > 2]
scaled

['Year_Birth',
 'Education',
 'Marital_Status',
 'Income',
 'Kidhome',
 'Teenhome',
 'Dt_Customer',
 'Recency',
 'MntWines',
 'MntFruits',
 'MntMeatProducts',
 'MntFishProducts',
 'MntSweetProducts',
 'MntGoldProds',
 'NumDealsPurchases',
 'NumWebPurchases',
 'NumCatalogPurchases',
 'NumStorePurchases',
 'NumWebVisitsMonth']

In [35]:
scaler = pp.ContinuousScaler(variables=scaled)

In [36]:
scaler.fit(X_train)

<preprocessors.ContinuousScaler at 0x12325ae80>

In [37]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [38]:
# visualise the scaled dataframe
X_train.head()

| | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 818 | 0.23301 | 0.5 | 0.285714 | 0.095207 | 0.0 | 0.5 | 0.0 | 0.545455 | 0.430295 | 0.070352 | 0.028406 | 0.0 | 0.026718 | 0.17757 | 0.066667 | 0.333333 | 0.071429 | 0.692308 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1281 | 0.242718 | 0.25 | 0.428571 | 0.070264 | 0.0 | 0.0 | 1.0 | 0.909091 | 0.41555 | 0.271357 | 0.138551 | 0.382239 | 0.374046 | 0.370717 | 0.133333 | 0.333333 | 0.25 | 0.769231 | 0.35 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1766 | 0.15534 | 1.0 | 0.714286 | 0.051722 | 0.5 | 0.0 | 0.0 | 0.232323 | 0.010724 | 0.005025 | 0.001159 | 0.0 | 0.0 | 0.003115 | 0.066667 | 0.037037 | 0.0 | 0.230769 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1577 | 0.475728 | 1.0 | 0.142857 | 0.119128 | 0.0 | 0.0 | 0.0 | 0.89899 | 0.839142 | 0.0 | 0.269565 | 0.177606 | 0.133588 | 0.0 | 0.066667 | 0.148148 | 0.178571 | 0.615385 | 0.05 | 0 | 1 | 1 | 0 | 0 | 0 |
| 924 | 0.097087 | 0.5 | 0.142857 | 0.121324 | 0.5 | 0.0 | 0.0 | 0.828283 | 0.544236 | 0.497487 | 0.249855 | 0.915058 | 0.568702 | 0.102804 | 0.066667 | 0.407407 | 0.142857 | 0.769231 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 |
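
The scaled values above all lie between 0 and 1, which is consistent with min-max scaling, although the implementation of pp.ContinuousScaler is not shown here. Under that assumption, a scaler of this kind could be sketched as follows. The class name `MinMaxScalerSketch`, its `min_` and `range_` attributes, and the toy values are hypothetical.

```python
import pandas as pd


class MinMaxScalerSketch:
    """Hypothetical min-max scaler; pp.ContinuousScaler may work differently."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn the minimum and the range of each variable from the train set
        self.min_ = X[self.variables].min()
        self.range_ = X[self.variables].max() - self.min_
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = (X[var] - self.min_[var]) / self.range_[var]
        return X


# toy data (made up); the binary Flag column is deliberately left unscaled
train = pd.DataFrame({"Income": [10.0, 20.0, 30.0], "Flag": [0, 1, 0]})
test = pd.DataFrame({"Income": [40.0], "Flag": [1]})

sketch_scaler = MinMaxScalerSketch(["Income"]).fit(train)
scaled_train = sketch_scaler.transform(train)
scaled_test = sketch_scaler.transform(test)
print(scaled_train["Income"].tolist(), scaled_test["Income"].tolist())
```

Because the minimum and range come from the train set, a test value larger than the train maximum is scaled to a number above 1, as the toy test row shows.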


We now have several classes whose parameters were learned from the training dataset. We can store these fitted objects and retrieve them later, so that when a colleague brings new data, we are in a better position to score it quickly.

Still:

- we would need to save each fitted object
- then load each of them
- and apply each transformation individually.

That sounds like a lot of work.

The good news is that we can reduce this work by setting up all the transformations within a pipeline.
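
As a preview of that idea, here is a minimal sketch that chains two toy transformers with scikit-learn's Pipeline. The step classes `DropConstantStep` and `MeanImputeStep` are hypothetical simplifications written for this example, not the classes from the preprocessors module, and the toy data is made up. The only assumption about our own classes is that they expose fit and transform in a way scikit-learn accepts.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropConstantStep(TransformerMixin, BaseEstimator):
    """Hypothetical step: drop columns that are constant in the train set."""

    def fit(self, X, y=None):
        self.constant_ = [c for c in X.columns if X[c].nunique() == 1]
        return self

    def transform(self, X):
        return X.drop(columns=self.constant_)


class MeanImputeStep(TransformerMixin, BaseEstimator):
    """Hypothetical step: fill nulls with the mean learned from the train set."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        self.params_ = {v: X[v].mean() for v in self.variables}
        return self

    def transform(self, X):
        # fillna with a dict fills each listed column with its own value
        return X.fillna(self.params_)


pipe = Pipeline([
    ("drop_constant", DropConstantStep()),
    ("impute_income", MeanImputeStep(variables=["Income"])),
])

train = pd.DataFrame({"Income": [10.0, np.nan, 30.0], "Z_Revenue": [11, 11, 11]})
new_data = pd.DataFrame({"Income": [np.nan, 50.0], "Z_Revenue": [11, 11]})

pipe.fit(train)                        # one call fits every step in order
prepared = pipe.transform(new_data)    # one call applies every step in order
print(prepared)
```

With this setup there is a single fitted object to store and a single transform call to apply to new data, instead of one per class.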