## What is this notebook used for?

The purpose of this notebook is to prepare the data for fitting the machine learning model.
To do this, I first split the raw data into 2 sets: a training set and a test set.

I then create data pre-processing pipelines that contain the basic steps needed to prepare the data.


## How is it done?

I use scikit-learn, especially train_test_split & ColumnTransformer.

In [1]:
import pandas as pd
import os
import joblib
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

## Data import

In [2]:
cd ..

D:\Projets personnels\employee_leave


In [3]:
# I use os.path.join for the paths so that this code works on different operating systems.
path_data = os.path.join('data', 'raw', 'Employee.csv')
df_raw = pd.read_csv(path_data)

In [4]:
df_raw.columns

Index(['Education', 'JoiningYear', 'City', 'PaymentTier', 'Age', 'Gender',
       'EverBenched', 'ExperienceInCurrentDomain', 'LeaveOrNot'],
      dtype='object')

In [5]:
# input data
col_X = ['Education', 'JoiningYear', 'City', 'PaymentTier', 'Age', 'Gender',
       'EverBenched', 'ExperienceInCurrentDomain']

# output data (target)
col_y = ['LeaveOrNot']


X = df_raw[col_X]
y = df_raw[col_y]

## Split the data in training & test sets

At this stage it is very important to create a test set and set it aside by storing it somewhere, without working on it.

In addition, the "random_state" parameter must be set so that the split is reproducible.

In [6]:
# X and y are DataFrames, both before and after splitting
X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.33, random_state=42)
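As a quick check of the reproducibility point above, here is a minimal sketch on made-up data (the `X_demo` and `y_demo` frames are hypothetical, not the employee data). It also shows the optional `stratify` argument, which this notebook does not use, as one way to keep the class ratio similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the employee table (values are made up).
X_demo = pd.DataFrame({"Age": np.arange(20, 40)})
y_demo = pd.DataFrame({"LeaveOrNot": [0, 1] * 10})

# Same random_state: the same rows end up in the test set on every run.
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
same_split = X_test_a.index.equals(X_test_b.index)

# Optional (not used in this notebook): stratify on the target so that
# the proportion of each class is roughly equal in both sets.
_, _, y_train_s, y_test_s = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42,
    stratify=y_demo["LeaveOrNot"],
)

print("Identical test sets:", same_split)
print("Share of class 1 (train / test):",
      y_train_s["LeaveOrNot"].mean(), y_test_s["LeaveOrNot"].mean())
```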

### Save the data in a human-readable format (CSV) using the DataFrame to_csv() method

An important note: we store the data in CSV format rather than, for example, pickle format, because our data are human-readable. Without storage size constraints, it is worth storing human-readable data in a human-readable file.

In [7]:
train_folder = os.path.join('data', 'training')
test_folder = os.path.join('data', 'test')

# export train data
X_train.to_csv(os.path.join(train_folder, 'x_train.csv'))
y_train.to_csv(os.path.join(train_folder, 'y_train.csv'))

# export test data
X_test.to_csv(os.path.join(test_folder, 'x_test.csv'))
y_test.to_csv(os.path.join(test_folder, 'y_test.csv'))
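One detail worth keeping in mind when reading these files back: `to_csv()` writes the DataFrame index as a first, unnamed column by default. The sketch below uses a made-up frame and a temporary folder (both are assumptions for illustration) to show the difference between a naive reload and a reload with `index_col=0`.

```python
import os
import tempfile

import pandas as pd

# Made-up frame whose index mimics the shuffled row labels left by a split.
X_demo = pd.DataFrame(
    {"City": ["Pune", "Bangalore"], "Age": [27, 34]},
    index=[3, 10],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "x_train.csv")
    X_demo.to_csv(csv_path)  # the index is written as the first column

    # Naive reload: the old index comes back as an extra data column.
    reloaded_naive = pd.read_csv(csv_path)
    # Reload that restores the first column as the index.
    reloaded = pd.read_csv(csv_path, index_col=0)

print(list(reloaded_naive.columns))
print(list(reloaded.columns), list(reloaded.index))
```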

## Data processing pipeline

The steps that we will put in this pipeline are:
- apply **one-hot encoding** to the categorical variables
- encode the binary variables as numerical binary variables
- standardize the numerical variables (in a specific pipeline for models that are sensitive to this).

The first two steps are necessary if we want to use a model on these data.

### One-hot variables

In [8]:
# one-hot encoder
one_hot = OneHotEncoder()
# categorical variables
col_one_hot = ['Education', 'City']

In [9]:
one_hot.fit_transform(X_train[col_one_hot]).shape

(3117, 6)

_Note:_ As a reminder, the "Education" and "City" variables each have 3 categories, so this gives 6 columns once the one-hot encoding is done.
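To make the column count concrete, here is a small sketch on a toy frame (the rows are made up for illustration). The fitted encoder exposes the categories it found per column through `categories_`, and the number of output columns is the sum of their lengths.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with 3 distinct values per column (illustrative values).
toy = pd.DataFrame({
    "Education": ["Bachelors", "Masters", "PHD", "Bachelors"],
    "City": ["Pune", "Bangalore", "New Delhi", "Pune"],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy)  # returns a sparse matrix by default

# One output column per category found in each input column.
n_categories = [len(cats) for cats in encoder.categories_]
print("Categories per column:", n_categories)
print("Encoded shape:", encoded.shape)
```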

### Numerical binary

In [10]:
ord_binary = OrdinalEncoder()
col_binary = ['Gender', 'EverBenched']

In [11]:
ord_binary.fit_transform(X_train[col_binary]).shape

(3117, 2)

_Note:_ We keep the same number of columns here because each categorical value is simply mapped to a number. OrdinalEncoder() assigns integer codes starting at 0 to the categories of each column; since these variables have only 2 categories, the result is a binary 0/1 variable.
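A minimal sketch of this mapping on a toy frame (made-up rows). With its default settings, OrdinalEncoder sorts the categories it finds in each column, so the codes follow alphabetical order rather than order of appearance.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two 2-category columns (illustrative values).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "EverBenched": ["No", "Yes", "No"],
})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)

# categories_ lists the sorted categories; the position of a category
# in that list is its code (0 for the first, 1 for the second).
print([list(cats) for cats in encoder.categories_])
print(codes)
```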

### Standardization of data

In [12]:
col_numerical = [col for col in col_X if col not in col_one_hot + col_binary + ['PaymentTier']]
std_scaler = StandardScaler()

_Note:_ 
The standard score of a sample x is calculated as:

z = (x - u) / s

with u = mean of the variable and s = its standard deviation.
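The formula can be checked by hand against StandardScaler. This is a small sketch on made-up ages; note that StandardScaler uses the population standard deviation (ddof=0), which is also the NumPy default, whereas pandas' `std()` defaults to ddof=1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single numerical variable (one column).
ages = np.array([[22.0], [25.0], [31.0], [40.0]])

scaled = StandardScaler().fit_transform(ages)

# Manual z = (x - u) / s with the population standard deviation (ddof=0).
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

matches = np.allclose(scaled, manual)
print("Matches manual formula:", matches)
print("Mean and std after scaling:", scaled.mean(), scaled.std())
```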

## Preprocessing steps


Two pipelines are created here:
- the first only converts the categorical data into a numerical representation
- the second adds a standardization step for the numerical data.

**Why make these 2 pipelines?**


Because some of the models we are going to use later on need standardized numerical data (logistic regression) and others do not (decision tree models).

In [13]:
# Columns not listed in a transformer are dropped (default remainder='drop').
preprocessing_pipeline = ColumnTransformer([
    ('one_hot', one_hot, col_one_hot),
    ('to_binary', ord_binary, col_binary),
])

# Same steps plus standardization of the numerical columns.
preprocessing_pipeline_linear_model = ColumnTransformer([
    ('one_hot', one_hot, col_one_hot),
    ('to_binary', ord_binary, col_binary),
    ('numerical', std_scaler, col_numerical),
])

In [14]:
# fit the pipelines on the training data
prepared_col = preprocessing_pipeline.fit_transform(X_train)
prepared_col_lin = preprocessing_pipeline_linear_model.fit_transform(X_train)

# save the pipelines
output_path = os.path.join('pipeline', 'preprocessing', 'preprocessing_model.pkl')
output_path_lin = os.path.join('pipeline', 'preprocessing', 'preprocessing_linear_model.pkl')

joblib.dump(preprocessing_pipeline, output_path)
joblib.dump(preprocessing_pipeline_linear_model, output_path_lin)

['pipeline\\preprocessing\\preprocessing_linear_model.pkl']
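The point of saving a fitted pipeline is to reload it later and apply it to new data with `transform()` only, without refitting. Below is a minimal sketch of that round trip; the toy frames, the temporary folder and the file name `preprocessing_demo.pkl` are assumptions for illustration, not the files written above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up train/test frames standing in for X_train and X_test.
train = pd.DataFrame({"City": ["Pune", "Bangalore", "Pune"]})
test = pd.DataFrame({"City": ["Bangalore", "Pune"]})

demo_pipeline = ColumnTransformer([("one_hot", OneHotEncoder(), ["City"])])
demo_pipeline.fit(train)  # fit on the training data only

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessing_demo.pkl")  # hypothetical name
    joblib.dump(demo_pipeline, path)
    reloaded = joblib.load(path)

# transform (not fit_transform) so the test set reuses the categories
# learned on the training set.
test_prepared = reloaded.transform(test)
if hasattr(test_prepared, "toarray"):  # densify if the output is sparse
    test_prepared = test_prepared.toarray()
print(test_prepared)
```

A pickle written by joblib is generally best reloaded with the same scikit-learn version that produced it.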

In [15]:
# transformers can be accessed 
preprocessing_pipeline.transformers_

[('one_hot', OneHotEncoder(), ['Education', 'City']),
 ('to_binary', OrdinalEncoder(), ['Gender', 'EverBenched']),
 ('remainder', 'drop', [1, 3, 4, 7])]

In [16]:
# Note: 2 one-hot variables x 3 categories + 2 binary variables = 8 columns
prepared_col.shape

(3117, 8)

In [17]:
# raw data shape
print("Raw data shape: ", X_train.shape)

# Remove prepared columns from the raw data
X_train_drop = X_train.drop(columns=col_binary + col_one_hot)

# shape after dropping the encoded columns
print("Data shape after drop: ", X_train_drop.shape)

Raw data shape:  (3117, 8)
Data shape after drop:  (3117, 4)


In [18]:
# append the prepared columns to the remaining raw columns
X_prepared = np.concatenate((X_train_drop, prepared_col), axis=1)

In [19]:
X_prepared.shape

(3117, 12)
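The manual drop and concatenate steps above can also be delegated to ColumnTransformer through its `remainder='passthrough'` option, which keeps the unlisted columns instead of dropping them. This is a sketch of that alternative on a toy frame (made-up rows), not the approach used in this notebook. One difference to be aware of: the passed-through columns are placed after the transformed ones, whereas the concatenation above puts them first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame mixing categorical, binary and numerical columns.
toy = pd.DataFrame({
    "City": ["Pune", "Bangalore", "New Delhi"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [27, 34, 41],
    "PaymentTier": [3, 2, 3],
})

demo_pipeline = ColumnTransformer(
    [
        ("one_hot", OneHotEncoder(), ["City"]),
        ("to_binary", OrdinalEncoder(), ["Gender"]),
    ],
    remainder="passthrough",  # keep Age and PaymentTier untouched
)

prepared = demo_pipeline.fit_transform(toy)
if hasattr(prepared, "toarray"):  # densify if the output is sparse
    prepared = prepared.toarray()

# 3 one-hot columns + 1 binary column + 2 passed-through columns.
print(prepared.shape)
print(prepared[:, -2:])  # the passed-through Age and PaymentTier columns
```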

## Conclusion

In this notebook we have formatted the data that was not directly usable by a Machine Learning model.

These elementary steps are stored in a single function in the script src.preprocessing.pipeline_preprocessing.py.

We can now move on to the modelling of the interactions between the variables and the Machine Learning (prediction) part!