# Machine Learning Pipeline - Feature Engineering

In the following notebooks, we will go through the implementation of each one of the steps in the Machine Learning Pipeline. 

We will discuss:

1. Data Analysis
2. **Feature Engineering**
3. Feature Selection
4. Model Training
5. Obtaining Predictions / Scoring

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('ticks')

# for the yeo-johnson transformation
import scipy.stats as stats

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to save the trained scaler class
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
data = pd.read_csv('campaign.csv')

print(data.shape)

data.head()

(2240, 29)


Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0


# Separate dataset into train and test

It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data. It is important to learn these parameters only from the train set. This avoids over-fitting and keeps information from the test set from leaking into the engineered features.

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**
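
As a quick illustration of why the seed matters, the sketch below splits a small made-up frame twice with the same `random_state` and checks that both splits select the same rows. The `toy` frame and the `split_indices` helper are hypothetical names used only for this example; they are not part of the campaign data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the campaign data (values are made up)
toy = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) % 2})


def split_indices(seed):
    """Return the train-set row labels produced by a given seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[["x"]], toy["y"], test_size=0.2, random_state=seed
    )
    return list(X_tr.index)


# the same seed should reproduce the same train rows on every run
same_seed_matches = split_indices(0) == split_indices(0)
print(same_seed_matches)
```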

In [3]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['ID','Response'], axis=1), # predictive variables
    data['Response'], # target
    test_size=0.2, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

In [4]:
X_train.shape, X_test.shape

((1792, 27), (448, 27))

In [5]:
X_train.shape

(1792, 27)

In [6]:
X_train.head()

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue
818,1972,Graduation,Married,65685.0,0,1,2014-03-29,54,642,14,49,0,7,57,1,9,2,9,5,0,0,0,0,0,0,3,11
1281,1971,2n Cycle,Divorced,49118.0,0,0,2012-08-18,90,620,54,239,99,98,119,2,9,7,10,7,0,1,0,0,1,0,3,11
1766,1980,PhD,Single,36802.0,1,0,2014-06-16,23,16,1,2,0,0,1,1,1,0,3,5,0,0,0,0,0,0,3,11
1577,1947,PhD,Together,81574.0,0,0,2014-04-28,89,1252,0,465,46,35,0,1,4,5,8,1,0,1,1,0,0,0,3,11
924,1986,Graduation,Together,83033.0,1,0,2014-05-18,82,812,99,431,237,149,33,1,11,4,10,5,0,0,0,1,0,0,3,11


# Feature Engineering

In the following cells, we will engineer the variables of the campaign dataset so that we tackle:

1. Missing values
2. Temporal variables
3. Categorical variables: convert strings to numbers
4. Constant-valued variables: drop them

## Missing Values

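We only have one variable with missing values: Income. The missing values are only a tiny proportion of the samples (24 of the 2,240 rows, about 1%). We will replace the missing values in the affected rows with the mean income, which we learn from the train set only and then apply to both the train and test sets.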
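
The same fit-on-train, apply-to-both pattern can also be written with scikit-learn's `SimpleImputer`. This is only a sketch of an alternative, not what the cells below do; the `toy_train` and `toy_test` frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up stand-ins for X_train / X_test
toy_train = pd.DataFrame({"Income": [10.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"Income": [np.nan, 50.0]})

# learn the mean from the train set only
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train[["Income"]])
learned_mean = imputer.statistics_[0]

# apply the learned mean to both sets
toy_train["Income"] = imputer.transform(toy_train[["Income"]]).ravel()
toy_test["Income"] = imputer.transform(toy_test[["Income"]]).ravel()

print(learned_mean)
print(toy_test["Income"].tolist())
```

A fitted imputer object can be saved and reused at scoring time, which is convenient when more than one column needs imputing.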

In [7]:
X_train['Income'].head()

818     65685.0
1281    49118.0
1766    36802.0
1577    81574.0
924     83033.0
Name: Income, dtype: float64

In [8]:
mean_income = X_train['Income'].mean()
mean_income

52662.7538028169

In [9]:
# number of null income rows on train and test sets
# before the transformation
len(X_train[X_train['Income'].isnull()]), len(X_test[X_test['Income'].isnull()])

(17, 7)

In [10]:
# transform the variable with the saved parameters
X_train['Income'] = np.where(X_train['Income'].isnull(),mean_income,X_train['Income'])
X_test['Income'] = np.where(X_test['Income'].isnull(),mean_income,X_test['Income'])

In [11]:
# number of null income rows on train and test sets
# after the transformation
len(X_train[X_train['Income'].isnull()]), len(X_test[X_test['Income'].isnull()])

(0, 0)

## Temporal Variables

There are two time-related variables in the dataset:
- Dt_Customer: The date the customer patronised the business for the first time.
- Year_Birth: The year of birth of the customer.

We will transform these variables so that they each reflect, respectively:
- The customer's patronage period
- The customer's age.
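
Both transformations can be sketched in a vectorised form with `pd.to_datetime`. The snippet assumes the dates are `YYYY-MM-DD` strings, as in the sample rows, and uses 2022 as the reference year, matching the cells below; the `toy` frame and the new column names `patronage_years` and `age` are hypothetical.

```python
import pandas as pd

REFERENCE_YEAR = 2022  # same reference year the notebook uses

# two rows copied from the sample output, for illustration only
toy = pd.DataFrame({
    "Dt_Customer": ["2014-03-29", "2012-08-18"],
    "Year_Birth": [1972, 1971],
})

# parse the date strings and keep only the year of first patronage
signup_year = pd.to_datetime(toy["Dt_Customer"], format="%Y-%m-%d").dt.year

# years since first patronage, and age in the reference year
toy["patronage_years"] = REFERENCE_YEAR - signup_year
toy["age"] = REFERENCE_YEAR - toy["Year_Birth"]

print(toy[["patronage_years", "age"]])
```

Note that neither transformation learns anything from the data, so there is nothing to fit on the train set here.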

### Dt_Customer

In [12]:
X_train['Dt_Customer'].head()

818     2014-03-29
1281    2012-08-18
1766    2014-06-16
1577    2014-04-28
924     2014-05-18
Name: Dt_Customer, dtype: object

In [13]:
# grab the year from the date
# convert the year to int from str
# subtract the year from 2022 to get patronage period

def adjust_date(row):
    
    year = int(row.split('-')[0])
    patronage_period = 2022 - year
    return patronage_period

In [14]:
X_train['Dt_Customer'] = X_train['Dt_Customer'].apply(adjust_date)
X_test['Dt_Customer'] = X_test['Dt_Customer'].apply(adjust_date)

In [15]:
X_train['Dt_Customer'].head()

818      8
1281    10
1766     8
1577     8
924      8
Name: Dt_Customer, dtype: int64

### Year_Birth

In [16]:
X_train['Year_Birth'].head()

818     1972
1281    1971
1766    1980
1577    1947
924     1986
Name: Year_Birth, dtype: int64

In [17]:
# subtract the customer's year of birth from the reference year (2022) to get the age
X_train['Year_Birth'] = 2022 - X_train['Year_Birth']
X_test['Year_Birth'] = 2022 - X_test['Year_Birth']

In [18]:
X_train['Year_Birth'].head()

818     50
1281    51
1766    42
1577    75
924     36
Name: Year_Birth, dtype: int64

## Categorical Variables

### Encoding of non-binary variables
All the binary variables in the dataset are already encoded as 0 and 1, so they need no further encoding.

For our non-binary variables, we will assign a numerical ordinality to them. This ordinality will be defined by the rate of 'YES' responses in each label - i.e., labels with the highest YES responses will get the highest rankings.
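
The ranking idea can be sketched compactly with `groupby`. This is a minimal illustration on made-up data, not the loop used below; the `toy_train` and `toy_test` frames, the `Education_enc` column and the `Unseen` label are all hypothetical.

```python
import pandas as pd

# made-up train and test data for a single categorical column
toy_train = pd.DataFrame({
    "Education": ["Basic", "Basic", "PhD", "PhD", "Master", "Master"],
    "Response":  [0, 0, 1, 1, 1, 0],
})
toy_test = pd.DataFrame({"Education": ["PhD", "Basic", "Unseen"]})

# fit: YES rate per label, learned from the train set only
yes_rate = toy_train.groupby("Education")["Response"].mean()

# rank the labels from lowest to highest YES rate, starting at 1
ordered = yes_rate.sort_values().index
ordinal_map = {label: rank for rank, label in enumerate(ordered, 1)}

# transform: map the labels to their ranks in both sets
toy_train["Education_enc"] = toy_train["Education"].map(ordinal_map)
toy_test["Education_enc"] = toy_test["Education"].map(ordinal_map)

print(ordinal_map)
print(toy_test)
```

A label that appears in the test set but not in the train set has no rank, so `map` leaves it as `NaN`. It is worth checking for such labels after the mapping.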

In [19]:
non_binary = ['Education','Marital_Status']

In [20]:
train = pd.concat([X_train,y_train],axis=1)
test = pd.concat([X_test,y_test],axis=1)

In [21]:
train[non_binary].head()

Unnamed: 0,Education,Marital_Status
818,Graduation,Married
1281,2n Cycle,Divorced
1766,PhD,Single
1577,PhD,Together
924,Graduation,Together


In [22]:
# empty dictionary to store the parameters of the labels of the non-binary features
# the parameters are the YES rates for each of the labels in the train set

params = {} 
for col in non_binary:
    params[col] = {}

# fit: grab and persist the parameters to the dictionary
# iterate over each column
for column in non_binary:
    
    # iterate over the labels
    for label in train[column].unique():
        
        # grab all the rows for each label where Response = 1 
        label_yes = len(train[(train[column]==label) & (train['Response']==1)])
        label_size = len(train[train[column]==label])
        
        # persist the YES rate to its respective label in the dictionary
        params[column][label] = label_yes / label_size
    
    # rank the label ordinally
    labels = pd.Series(params[column])
    ordered_labels = labels.sort_values().index
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 1)}
    
    print(column, ordinal_label)
    print()
    
    # map the labels to their rankings in the dataset
    train[column] = train[column].map(ordinal_label)
    test[column] = test[column].map(ordinal_label)

Education {'Basic': 1, '2n Cycle': 2, 'Graduation': 3, 'Master': 4, 'PhD': 5}

Marital_Status {'YOLO': 1, 'Together': 2, 'Married': 3, 'Divorced': 4, 'Widow': 5, 'Single': 6, 'Alone': 7, 'Absurd': 8}



In [23]:
# print out the parameters
params

{'Education': {'Graduation': 0.12691466083150985,
  '2n Cycle': 0.11464968152866242,
  'PhD': 0.19543147208121828,
  'Master': 0.14385964912280702,
  'Basic': 0.047619047619047616},
 'Marital_Status': {'Married': 0.11285714285714285,
  'Divorced': 0.17714285714285713,
  'Single': 0.2198952879581152,
  'Together': 0.09462365591397849,
  'Widow': 0.21875,
  'Absurd': 0.5,
  'Alone': 0.3333333333333333,
  'YOLO': 0.0}}

Notice none of the parameters are greater than 50%, i.e. we don't have any labels where there are more YES responses than NO responses.

In [24]:
# visualise the engineered dataset
train.head()

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
818,50,3,3,65685.0,0,1,8,54,642,14,49,0,7,57,1,9,2,9,5,0,0,0,0,0,0,3,11,0
1281,51,2,4,49118.0,0,0,10,90,620,54,239,99,98,119,2,9,7,10,7,0,1,0,0,1,0,3,11,1
1766,42,5,6,36802.0,1,0,8,23,16,1,2,0,0,1,1,1,0,3,5,0,0,0,0,0,0,3,11,0
1577,75,5,2,81574.0,0,0,8,89,1252,0,465,46,35,0,1,4,5,8,1,0,1,1,0,0,0,3,11,0
924,36,3,2,83033.0,1,0,8,82,812,99,431,237,149,33,1,11,4,10,5,0,0,0,1,0,0,3,11,0


In [25]:
# drop the constant-value variables
train = train.drop(['Z_CostContact','Z_Revenue'],axis=1)
test = test.drop(['Z_CostContact','Z_Revenue'],axis=1)
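
The two columns above are dropped by name. As a hedged alternative, constant columns can be detected from the train set with `nunique`; the sketch below uses made-up frames (`toy_train`, `toy_test`) that only borrow the column names.

```python
import pandas as pd

# made-up frames that borrow a few column names from the dataset
toy_train = pd.DataFrame({
    "Recency": [54, 90, 23],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})
toy_test = pd.DataFrame({
    "Recency": [82, 26],
    "Z_CostContact": [3, 3],
    "Z_Revenue": [11, 11],
})

# a column is constant if it holds at most one distinct value in the train set
n_unique = toy_train.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()

# drop the same columns from both sets
toy_train = toy_train.drop(columns=constant_cols)
toy_test = toy_test.drop(columns=constant_cols)

print(constant_cols)
```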

In [26]:
X_train = train.drop('Response',axis=1)
X_test = test.drop('Response',axis=1)

In [27]:
# let's now save the train and test sets for the next notebook!

X_train.to_csv('xtrain_unscaled.csv', index=False)
X_test.to_csv('xtest_unscaled.csv', index=False)

y_train.to_csv('ytrain.csv', index=False)
y_test.to_csv('ytest.csv', index=False)