## What is this notebook used for ?

The objective of this notebook is to obtain one or more Machine Learning models to predict whether an employee will leave within 2 years from his HR data.
Several models will be optimised and tested in order to select the best one(s) and deploy them afterwards.


## How is it done?

I approach this problem by starting with a model that predicts the majority class in the training data, which is the minimum performance of a useful model (Baseline model).

I then compare 3 Machine Learning models with features relevant to this task: features_importance and probabilities per class.

In order to optimize these models I use RandomizedSearchCV followed by GridSearchCV.


Note: The properties of these models are important because they provide useful functionality for delivery via HR software: 
- one could classify employees being detected as wanting to leave the company by probability and thus prioritise them 
- one can also interpret the coefficients of the models and analyse the factors causing the departure to correct them
- we can also give arguments to HR to communicate with employees.

In [2]:
cd ..

D:\Projets personnels\employee_leave


In [3]:
import pandas as pd
import os
import joblib
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import make_pipeline
from src.preprocessing.pipeline_preprocessing import prepare_raw_data
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.metrics import precision_score

# Load & prepare data

In [4]:
# Load the preprocessing pipeline
path_pipeline = os.path.join('pipeline', 'preprocessing', 'preprocessing_model.pkl')
preprocessing_pipeline = joblib.load(path_pipeline)
# pipeline with standardization
path_pipeline_num = os.path.join('pipeline', 'preprocessing', 'preprocessing_linear_model.pkl')
preprocessing_pipeline_num = joblib.load(path_pipeline_num)

In [5]:
# load & prepare training data
x_train_data_path = os.path.join('data', 'training', 'x_train.csv')
y_train_data_path = os.path.join('data', 'training', 'y_train.csv')

X_train = pd.read_csv(x_train_data_path, 
                      names=['Education', 'JoiningYear', 'City', 'PaymentTier', 'Age',
                       'Gender', 'EverBenched', 'ExperienceInCurrentDomain'],
                      header=1)
                     
y_train = pd.read_csv(y_train_data_path, names=['LeaveOrNot'], header=1)

X_train_prepared = prepare_raw_data(X_train, preprocessing_pipeline)
# data with standardization (for logistic regression model)
X_train_prepared_num = prepare_raw_data(X_train, preprocessing_pipeline_num, numerical_std=True)
y_train_prepared = y_train.LeaveOrNot

## Baseline model

We choose the precision metric, the justification is below.

In [63]:
# the uniform strategy generates predictions uniformly at random
dummy_model = DummyClassifier(strategy="uniform", random_state=0)
scores = cross_val_score(dummy_model, X_train_prepared, y_train_prepared, cv=5, scoring='precision')

mean_score = scores.mean()
print("Cross-validation precision scores : {scores} \nMean precision score : {mean_score}".format(scores=scores, 
                                                                                                  mean_score=mean_score))

Cross-validation precision scores : [0.31578947 0.3498452  0.34365325 0.3622291  0.36842105] 
Mean precision score : 0.3479876160990712


In [66]:
dummy_model.fit(X_train_prepared, y_train_prepared)

# save the dummy model
output_path = os.path.join('pipeline', 'machine_learning', 'dummy_uniform_random.pkl')
joblib.dump(dummy_model, output_path)

['pipeline\\machine_learning\\dummy_uniform_random.pkl']

# Machine Learning Models


## Data representation: polynomial features


Here the idea is to complete the representation of the data by modelling their interactions: 
- the interactions of the variables between them (location, gender, ...)
- the variables with squares or more.

The idea is to optimise the degree of the polynomials together with the Machine Learning models in a pipeline in order to find the best representation-model combination.

In [7]:
poly = PolynomialFeatures()

## Hyperparameters tuning

The approach to be adopted here is to search for the best hyperparameters first using a random search in a relatively large space and then to focus on the most promising area (identified through the random search) using a grid search.

Why not use a grid search directly?
1. Because a grid search can miss interesting areas in the space of hyperparameters by making linear steps
2. In order to control the calculation capacities and the optimisation time (constraints that can often occur in a professional context).

**The last model, XGboost, will be optimised using Bayesian optimisation on these hyperparameters to see several methods.**



## Metrics

Concerning the performance measure of our models, here in the case of a binary classification we have several choices.

From a business point of view this would be represented as having to choose between the following questions: 
- Do I prefer to have a model that is "sure" of itself before telling me that an employee wants to leave?  
- Would I prefer to have a model that detects as many employees as possible who want to leave the company, even if it means detecting some who do not want to leave? 
- Would I prefer a compromise between the two? 


From a statistical point of view, these questions translate into the following:
- maximising precision
- maximising recall
- maximise the f1 score.


Let's imagine that **the company needs to save time for its HR, even if it means not seeing all the employees who can potentially leave within 2 years** (this is the business metric, the real-world problematic that the Machine Learning model should solve).

We will therefore need to **maximise precision** in the searches below.

## Logistic regression

As this method is linear, it may benefit from the modelling of polynomial interactions (linear models benefit greatly from prior modelling of data interactions).

In [8]:
lr = LogisticRegression(random_state=0)
# we put PolynomialFeatures and LogisticRegression in a pipeline to chain them
pipeline_lr = make_pipeline(poly, 
                            lr)
pipeline_lr

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('logisticregression', LogisticRegression(random_state=0))])

In [9]:
# hyperparameter space for randomize search
 
lr_grid = [
    # grid for l2 penalty/solver
     {'polynomialfeatures__degree': [1, 2, 3, 4, 5],
    'logisticregression__penalty': ['l2'],
    'logisticregression__C': np.logspace(-10,10, 10),
    'logisticregression__max_iter': [100, 150, 200, 250],
    'logisticregression__solver': ['lbfgs', 'liblinear']}
    ,
    # grid for l1 penalty/solver
    {'polynomialfeatures__degree': [1, 2, 3, 4, 5],
    'logisticregression__penalty': ['l1'],
    'logisticregression__C': np.logspace(-10,10, 10),
    'logisticregression__max_iter': [100, 150, 200, 250],
    'logisticregression__solver': ['liblinear']},
    # grid for elasticnet penalty/solver
    {'polynomialfeatures__degree': [1, 2, 3, 4, 5],
    'logisticregression__penalty': ['elasticnet'],
    'logisticregression__l1_ratio': np.linspace(0, 1, 10),
    'logisticregression__C': np.logspace(-10,10, 10),
    'logisticregression__max_iter': [100, 150, 200, 250],
    'logisticregression__solver': ['saga']}
]

**Note about the hyperparameter's space:**
- we need to make several dictionaries because the solvers are not compatible depending on the penalty we use
- If one were to test from a grid search all the above combinations one would have to train 2600 models. All this by adding the cross-validations and thus multiplying by 4 (value chosen arbitrarily).
**In the end, 10 400 models should be trained.**

We will randomly select 100 models from this space and compare them.

### Randomize search

In [10]:
rs_lr = RandomizedSearchCV(estimator=pipeline_lr, 
                       param_distributions=lr_grid, 
                       cv=4,
                       n_iter=100,
                       n_jobs=10, # becareful here I have 12 jobs on my computer may be you'll need to change this part
                       scoring='precision',
                       random_state=0)

rs_lr.fit(X_train_prepared_num, y_train_prepared)

RandomizedSearchCV(cv=4,
                   estimator=Pipeline(steps=[('polynomialfeatures',
                                              PolynomialFeatures()),
                                             ('logisticregression',
                                              LogisticRegression(random_state=0))]),
                   n_iter=100, n_jobs=10,
                   param_distributions=[{'logisticregression__C': array([1.00000000e-10, 1.66810054e-08, 2.78255940e-06, 4.64158883e-04,
       7.74263683e-02, 1.29154967e+01, 2.15443469e+03, 3.59381366e+05,
       5....
       5.99484250e+07, 1.00000000e+10]),
                                         'logisticregression__l1_ratio': array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ]),
                                         'logisticregression__max_iter': [100,
                                                                          150,
              

In [11]:
print('Best parameters: %s' % rs_lr.best_params_)
print('Best precision: %.2f' % rs_lr.best_score_)

Best parameters: {'polynomialfeatures__degree': 3, 'logisticregression__solver': 'lbfgs', 'logisticregression__penalty': 'l2', 'logisticregression__max_iter': 250, 'logisticregression__C': 0.07742636826811278}
Best precision: 0.83


### Grid search

In [12]:
gs_grid = {
    'polynomialfeatures__degree': [2, 3, 4],
    'logisticregression__penalty': ['l1'],
    'logisticregression__C': np.logspace(10, 15, 20),
    'logisticregression__max_iter': [100, 150],
    'logisticregression__solver': ['liblinear']
        }


In [13]:
gs_lr = GridSearchCV(estimator=pipeline_lr, 
                       param_grid=gs_grid, 
                       cv=4,
                       n_jobs=10, # may be you'll need to change this part (computer's jobs)
                       scoring='precision'
                    )

gs_lr.fit(X_train_prepared, y_train_prepared)

GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('logisticregression',
                                        LogisticRegression(random_state=0))]),
             n_jobs=10,
             param_grid={'logisticregression__C': array([1.00000000e+10, 1.83298071e+10, 3.35981829e+10, 6.15848211e+10,
       1.12883789e+11, 2.06913808e+11, 3.79269019e+11, 6.95192796e+11,
       1.27427499e+12, 2.33572147e+12, 4.28133240e+12, 7.84759970e+12,
       1.43844989e+13, 2.63665090e+13, 4.83293024e+13, 8.85866790e+13,
       1.62377674e+14, 2.97635144e+14, 5.45559478e+14, 1.00000000e+15]),
                         'logisticregression__max_iter': [100, 150],
                         'logisticregression__penalty': ['l1'],
                         'logisticregression__solver': ['liblinear'],
                         'polynomialfeatures__degree': [2, 3, 4]},
             s

In [14]:
print('Best parameters: %s' % rs_lr.best_params_)
print('Best precision: %.2f' % rs_lr.best_score_)

Best parameters: {'polynomialfeatures__degree': 3, 'logisticregression__solver': 'lbfgs', 'logisticregression__penalty': 'l2', 'logisticregression__max_iter': 250, 'logisticregression__C': 0.07742636826811278}
Best precision: 0.83


In [19]:
# we save the pipeline
output_path = os.path.join('pipeline', 'machine_learning', 'pipeline_logistic_regression.pkl')
# best_estimator_ attribute is the model trained on the entire training dataset
joblib.dump(rs_lr.best_estimator_, output_path)

['pipeline\\machine_learning\\pipeline_logistic_regression.pkl']

## Random forest

This tree-based algorithm generally gives good results and does not require any pre-processing of the data.

**It is based on a set of independent decision trees that can be controlled (depth, number of trees).**
We will apply the same method as before to optimise its hyperparameters.

In [35]:
# random forest pipeline
rf = RandomForestClassifier(random_state=0)
pipeline_rf = make_pipeline(poly, rf)

In [33]:
rf_grid ={
     'polynomialfeatures__degree': [1, 2, 3, 4, 5],
     'randomforestclassifier__bootstrap': [True, False],
     'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
     'randomforestclassifier__max_features': ['auto', 'sqrt'],
     'randomforestclassifier__min_samples_leaf': [1, 2, 3, 4, 5],
     'randomforestclassifier__min_samples_split': [2, 4, 6, 8, 10],
     'randomforestclassifier__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

In [39]:
rs_rf = RandomizedSearchCV(estimator=pipeline_rf, 
                       param_distributions=rf_grid, 
                       cv=4,
                       n_iter=100,
                       n_jobs=10, # becareful here it depends on the number of cores in your PC
                       scoring='precision',
                       random_state=0)

rs_rf.fit(X_train_prepared, y_train_prepared)

RandomizedSearchCV(cv=4,
                   estimator=Pipeline(steps=[('polynomialfeatures',
                                              PolynomialFeatures()),
                                             ('randomforestclassifier',
                                              RandomForestClassifier(random_state=0))]),
                   n_iter=100, n_jobs=10,
                   param_distributions={'polynomialfeatures__degree': [1, 2, 3,
                                                                       4, 5],
                                        'randomforestclassifier__bootstrap': [True,
                                                                              False],
                                        'randomforestclassifier__max_depth': [10,
                                                                              20,
                                                                              30,
                                                             

In [40]:
print('Best parameters: %s' % rs_rf.best_params_)
print('Best precision: %.2f' % rs_rf.best_score_)

Best parameters: {'randomforestclassifier__n_estimators': 2000, 'randomforestclassifier__min_samples_split': 2, 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__max_depth': 10, 'randomforestclassifier__bootstrap': True, 'polynomialfeatures__degree': 5}
Best precision: 0.88


In [42]:
rf_gs_grid ={
     'polynomialfeatures__degree': [5, 6, 7],
     'randomforestclassifier__bootstrap': [True],
     'randomforestclassifier__max_depth': [8, 9, 10],
     'randomforestclassifier__max_features': ['sqrt'],
     'randomforestclassifier__min_samples_leaf': [1],
     'randomforestclassifier__min_samples_split': [1, 2],
     'randomforestclassifier__n_estimators': [2000, 2500, 3000]}

In [43]:
gs_rf = GridSearchCV(estimator=pipeline_rf, 
                       param_grid=rf_gs_grid, 
                       cv=4,
                       n_jobs=10, # becareful here I have 12 jobs on my computer may be you'll need to change this part
                       scoring='precision'
                    )

gs_rf.fit(X_train_prepared, y_train_prepared)

        nan        nan        nan 0.88340789 0.88201124 0.88316591
        nan        nan        nan 0.87639014 0.87523856 0.87494674
        nan        nan        nan 0.89588552 0.89572682 0.89450675
        nan        nan        nan 0.88566752 0.88566752 0.88692927
        nan        nan        nan 0.87480428 0.87129637 0.87159122
        nan        nan        nan 0.89572682 0.8968036  0.8968036
        nan        nan        nan 0.88650258 0.88647448 0.88616331
        nan        nan        nan 0.87321125 0.87052702 0.87267697]


GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('randomforestclassifier',
                                        RandomForestClassifier(random_state=0))]),
             n_jobs=10,
             param_grid={'polynomialfeatures__degree': [5, 6, 7],
                         'randomforestclassifier__bootstrap': [True],
                         'randomforestclassifier__max_depth': [8, 9, 10],
                         'randomforestclassifier__max_features': ['sqrt'],
                         'randomforestclassifier__min_samples_leaf': [1],
                         'randomforestclassifier__min_samples_split': [1, 2],
                         'randomforestclassifier__n_estimators': [2000, 2500,
                                                                  3000]},
             scoring='precision')

In [44]:
print('Best parameters: %s' % gs_rf.best_params_)
print('Best precision: %.2f' % gs_rf.best_score_)

Best parameters: {'polynomialfeatures__degree': 7, 'randomforestclassifier__bootstrap': True, 'randomforestclassifier__max_depth': 8, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 2, 'randomforestclassifier__n_estimators': 2500}
Best precision: 0.90


In [48]:
# we save the pipeline
output_path = os.path.join('pipeline', 'machine_learning', 'pipeline_random_forest_polynomial_7.pkl')
joblib.dump(gs_rf.best_estimator_, output_path)

['pipeline\\machine_learning\\pipeline_random_forest_polynomial_7.pkl']

## XGboost

This algorithm is also an algorithm based on decision trees.
The difference with the previous one is that the trees are built by improving the previous tree, which generally gives better performance. 

This algorithm keeps only the best trees, so **it is a selection of the best "weak learners" to estimate the "strong learner".**


### Hyperparameters tuning: Bayesian Optimisation

For the hyperparameters optimization of this last algorithm we will use another method of hyperparameters tuning: Bayesian optimization using the Hyperopt library.

The principle of this optimization is that the search is done in the direction of the hyperparameters giving good results. 
We can see it as a method combining a RandomizeSearch with a GridSearch in an automatic way.

The following links provide further insight:
- Bayesian optimization: https://www.kaggle.com/prashant111/bayesian-optimization-using-hyperopt
- Bayesian optimization applied to the XGBoost model: https://www.kaggle.com/prashant111/a-guide-on-xgboost-hyperparameters-tuning/notebook.


#### Hyperparameter space

**Note:**

- In bayesian statistics everything is a probability distribution, all hyperparameters are therefore modelled as probability distributions
- There is a work to do so the possibility to model our a priori from mathematical law, for example by putting a uniform law if we have rather an idea "between 0 and 10", or a normal law if we have some knowledge (or another law).

In [41]:
# define the hyperparameter space
params_space = {'max_depth': hp.quniform("max_depth", 2, 10, 1),
                'gamma': hp.uniform ('gamma', 1, 9),
                'reg_alpha' : hp.quniform('reg_alpha', 40, 180, 1),
                'reg_lambda' : hp.uniform('reg_lambda', 0, 1),
                'colsample_bytree' : hp.uniform('colsample_bytree', 0.5, 1),
                'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
                'n_estimators':  hp.quniform('n_estimators', 150, 500, 1),
                'eta': hp.uniform('eta', 0.01, 0.2),
                'seed': 0
                }

In [47]:
def objective_function(params_space):
    """
    The objective of this function is to return a score that hyperopt will seek to minimise. 
    Our objective here being to maximise the average precision score on the cross-validation folds.
    
    Args:
        params_space (dict): parameters space (the parameters must be expressed using the hyperopt library)
        
    Returns:
        (dict): a dictionary containing the score to be minimised and a status
    
    """
    clf = XGBClassifier(
                    n_estimators = int(params_space['n_estimators']), 
                    max_depth = int(params_space['max_depth']), 
                    gamma = params_space['gamma'],
                    reg_alpha = int(params_space['reg_alpha']),
                    min_child_weight=int(params_space['min_child_weight']),
                    colsample_bytree=int(params_space['colsample_bytree'])
                    )    
    # the score is the
    score =  cross_val_score(clf, X_train_prepared, y_train_prepared, cv=4, scoring='precision', verbose=False).mean()
    print ("SCORE:", score)
    # the loss is -score: minimize the loss ==> maximise the score
    return {'loss': -score, 'status': STATUS_OK}

In [54]:
trials = Trials()

best_hyperparams = fmin(fn = objective_function, # function to minimize
                        space = params_space, 
                        algo = tpe.suggest, # minimization algorithm
                        max_evals = 200, # number of model to try
                        trials = trials)

SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9166666666666666                                                                                                     
SCORE:                                                                                                                 
0.9729567307692308                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                  

SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.5                                                                                                                    
SCORE:                                                                                                                 
0.9553794661534598                                                                                                     
SCORE:                                                                                                                 
0.9614906717109899                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                  

SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                                                                                                 
0.9276850677918457                                                                                                     
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.9553794661534598                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9716662236399078                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9553794661534598                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.5                                                                                                                    
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.9290034621537664                                                                                                     
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9561964596175121                                                                                                     
SCORE:                                                                                                                 
0.9588546624960333                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9716662236399078                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.9700189368182024                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.25                                                                                                                   
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.954933833354886                                                                                                      
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.9553794661534598                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9674604572860387                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                  

SCORE:                                                                                                                 
0.9091352528876127                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9716662236399078                                                                                                     
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9166666666666666                                                                                                     
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                  

SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9166666666666666                                                                                                     
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                                                                                                 
0.9807692307692308                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.75                                                                                                                   
SCORE:                                                                                                                 
0.9716662236399078                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.5                                                                                                                    
SCORE:                                                                                                                 
0.0                                                                                                                    
SCORE:                                                                                                                 
0.9068345809459837                                                                                                     
SCORE:                                                                                                                 
0.9729567307692308                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9166666666666666                                                                                                     
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                                                                                                 
0.9579893889104416                                                                                                     
SCORE:                                                                                                                 
0.9794647976149524                                                                                                     
SCORE:                                  

SCORE:                                                                                                                 
0.9729567307692308                                                                                                     
SCORE:                                                                                                                 
0.0                                                                                                                    
100%|█████████████████████████████████████████████| 200/200 [03:07<00:00,  1.07trial/s, best loss: -0.9807692307692308]


In [55]:
print("The best hyperparameters are : \n", best_hyperparams)

The best hyperparameters are : 
 {'colsample_bytree': 0.574097388042144, 'eta': 0.13221495818030032, 'gamma': 6.357711973156068, 'max_depth': 3.0, 'min_child_weight': 4.0, 'n_estimators': 328.0, 'reg_alpha': 82.0, 'reg_lambda': 0.782676232232534}


In [56]:
best_xgb = XGBClassifier(
                    n_estimators = int(best_hyperparams['n_estimators']), 
                    max_depth = int(best_hyperparams['max_depth']), 
                    gamma = best_hyperparams['gamma'],
                    reg_alpha = int(best_hyperparams['reg_alpha']),
                    min_child_weight=int(best_hyperparams['min_child_weight']),
                    colsample_bytree=int(best_hyperparams['colsample_bytree']))  

In [57]:
cross_val = cross_val_score(best_xgb, X_train_prepared, y_train_prepared, cv=5, scoring='precision')
cross_val_mean = cross_val.mean()

print('\nCross_validation precisions : {cross_val}'.format(cross_val=cross_val))
print('Corresponding mean: {cross_val_mean}'.format(cross_val_mean=cross_val_mean))


Cross_validation precisions : [0.93333333 1.         1.         1.         1.        ]
Corresponding mean: 0.9866666666666667


In [59]:
# train the model on the entire training set
best_xgb.fit(X_train_prepared, y_train_prepared)

# save the best xgboost model
output_path = os.path.join('pipeline', 'machine_learning', 'optimized_xgboost.pkl')
joblib.dump(best_xgb, output_path)



['pipeline\\machine_learning\\optimized_xgboost.pkl']

# Testing the models on the test set

Now that we have trained 4 models and tested their performance using cross-validation we can actually test their performance on a set that has never been seen by the models before: the test set.

In [21]:
# load & prepare test data
X_test_data_path = os.path.join('data', 'test', 'x_test.csv')
y_test_data_path = os.path.join('data', 'test', 'y_test.csv')

X_test = pd.read_csv(X_test_data_path, 
                      names=['Education', 'JoiningYear', 'City', 'PaymentTier', 'Age',
                       'Gender', 'EverBenched', 'ExperienceInCurrentDomain'],
                      header=1)
                     
y_test = pd.read_csv(y_test_data_path, names=['LeaveOrNot'], header=1)

X_test_prepared = prepare_raw_data(X_test, preprocessing_pipeline)
X_test_prepared_num = prepare_raw_data(X_test, preprocessing_pipeline_num, numerical_std=True)
y_test_prepared = y_test.LeaveOrNot

In [22]:
def load_and_test(path_model, X_test, y_test):
    """
    Load the Machine Learning model and test it by making a prediction.
    
    Args:
        path_model (str): model's path
        X_test (numpy.ndarray): input test data
        y_test (numpy.ndarray): target test data
    
    Returns:
        precision (float): the precision score between the prediction & the true test data
    """
    model = joblib.load(path_model)
    y_pred = model.predict(X_test)
    precision = precision_score(y_test, y_pred)
    return precision

In [23]:
path_dummy = os.path.join('pipeline', 'machine_learning', 'dummy_uniform_random.pkl')
path_logistic = os.path.join('pipeline', 'machine_learning', 'pipeline_logistic_regression.pkl')
path_random_forest = os.path.join('pipeline', 'machine_learning', 'pipeline_random_forest_polynomial_7.pkl')
path_xgb = os.path.join('pipeline', 'machine_learning', 'optimized_xgboost.pkl')

print('Precision scores: ')
print('Dummy --> ', load_and_test(path_dummy, X_test_prepared, y_test_prepared))
print('Logistic regression --> ', load_and_test(path_logistic, X_test_prepared_num, y_test_prepared))
print('Random Forest --> ', load_and_test(path_random_forest, X_test_prepared, y_test_prepared))
print('XGboost --> ', load_and_test(path_xgb, X_test_prepared, y_test_prepared))

Precision scores: 
Dummy -->  0.3392857142857143
Logistic regression -->  0.8975
Random Forest -->  0.9151193633952255
XGboost -->  1.0


### Interpretation and criticism of results 

- Not surprisingly, the "Dummy" model performed the worst
- The logistic regression performs better than the training data, it is likely that this is a fluke as it is not normal to get a better performance from test than from train
- The random forest model performs similarly to the train model
- The XGboost model gets the "perfect" score of 100% accuracy. 


At first sight it seems astonishing to get a perfect performance but it can be justified in several ways:
- as with logistic regression it is possible that the test set data is particularly "easy" and close to the training set 
- we have optimised the precision and thus the fact that the model is really confident when it predicts that an employee wants to leave the company.

**But is the model perfect? Let's have a look!**

In [113]:
sum(best_xgb.predict(X_test_prepared) == y_test_prepared) / len(y_test)

0.7374592833876221

We can see that the model gets an accuracy that is not huge. 

This is normal, we did not optimise it for this but it explains the good performance obtained: **a number of employees who wanted to leave the company are not detected!**

The model has been optimised to be safe when it predicts that an employee wants to leave, so it only predicts it when there are truly indusputable elements.

In [116]:
from sklearn.metrics import recall_score

y_pred = best_xgb.predict(X_test_prepared)

print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))

1.0
0.24105461393596986


Optimising precision lowers recall and vice versa!

**What recall tells us is that only 24% of employees who want to leave the company are detected by the model!**

Is this a problem? 

It depends, as long as we thought of developing this model to save time for HR who don't want to deal with employees who don't want to leave, the results are excellent.

If a company could keep those employees who want to leave by trading with them, **it would reduce departures by 24%, which is a remarkable business performance.**



# Conclusion


The objective of this modelling work was to correctly identify employees who wanted to leave the company within 2 years, and thus **maximise precision score.**


To do this we trained 3 different models (+ 1 dummy model), and optimised their hyperparameters using 2 approaches:
- a **mixed randomizesearch and gridsearch approach**
- an approach using **Bayesian optimisation.**


The approach that gave the best results and with a lower computation time (this must be qualified slightly because there was no polynomial features step for this model) than the others is the one using XGboost and a Bayesian optimisation.

The XGboost algorithm and library being particularly powerful and optimised, it is not surprising that the best results come from this model (both in terms of performance and calculation time).

**The Bayesian optimisation approach also seems to be faster than the randomizesearch + gridsearch approach.**


The performance of the XGboost model is excellent according to our business metrics.
If the objective of this model becomes to identify more employees, even those where there is doubt, then the models would have to be optimised again!