## Predict Talent Migration with Machine Learning - 03 Grid Build

---

<p style="text-align: center;">
Project split into 5 Notebooks:<br>
<br>
Predict Talent Migration with Machine Learning - 01 Exploratory Analysis<br>
*<br>
Predict Talent Migration with Machine Learning - 02 Modeling<br>
*<br>
Predict Talent Migration with Machine Learning - 03 Grid Build<br>
*<br>
Predict Talent Migration with Machine Learning - 04 Final Take Model A<br>
*<br>
Predict Talent Migration with Machine Learning - 05 Final Take Model B<br> </p>
         
---

This project aims to build a machine learning model that predicts employee departures from an Organization.
We focus the forecast on the Organization's best employees (Top Performers), although a comparison with employees in general is also made. Top Performers are identified through a condition based on the ratings employees received in their evaluation cycles. We also analyse model performance, specifically the model's ability to predict employee departures by **Generation** and **Gender**.

Employees are evaluated on a semiannual basis, and our data covers the last 3 evaluation cycles, from January 1st 2018 to September 30th 2019.

Ratings per evaluation are:

  - Mid Year 2018 (MY2018): **0,1,2,3,4,5**
  - Year End 2018 (YE2018): **0,1,2,3,4,5**
  - Mid Year 2019 (MY2019): **0,1,2,3**\*

**Data from the Organization shows that the rating scale was purposely changed to reduce the granularity of the results.*

In [1]:
import random
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer, OneHotEncoder

from sklearn.model_selection import KFold
from sklearn.inspection import permutation_importance

from category_encoders.ordinal import OrdinalEncoder

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import roc_auc_score

---

## Data:

**In the previous notebook, "Predict Talent Migration with Machine Learning - 02 Modeling", we analysed our models' performance on the train dataset. From that analysis we selected the best-performing models (Random Forest and Gradient Boosting), with and without overestimating variables.**

In [2]:
df = pd.read_csv('projeto_final.csv', index_col = False)

In [3]:
X = df.copy()

In [4]:
y = X.pop('Out')  # remove the target from the feature frame X (popping from df would leave 'Out' inside X)

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
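
The split above is not seeded, so each run of the notebook produces a different train/test partition. A minimal sketch of a reproducible, stratified alternative follows. It assumes `Out` is a binary 0/1 target; the toy frame and the names `toy`, `X_toy`, `y_toy` are made up for illustration and are not part of the project data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the project data (hypothetical values); 'Out' is the target.
toy = pd.DataFrame({
    "Tenure": range(20),
    "Out": [1] * 5 + [0] * 15,
})

X_toy = toy.drop(columns="Out")  # features only, target removed
y_toy = toy["Out"]

# random_state fixes the partition; stratify keeps the share of leavers
# the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when leavers are a small minority, because an unlucky random split could otherwise leave very few of them in the test set.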

---

## Best Models Without Overestimating Features:

Data transformation:

In [6]:
transf_v12 = ColumnTransformer(
    [('cat', OrdinalEncoder(), ['Generation', 'Gender', 'MProximity', 'Rehire', 'Contract Type', 'PeopleManager', 'Country', 'CostCenterH', 'TopPerformer']),
     ('null', SimpleImputer (missing_values= np.nan, strategy = 'median'), ['PayIncrease', 'Age']), 
     ('other', "passthrough", ['Tenure', 'Dependents', 'EmployeeCount', 'HireCount', 'MTenure', 'HomeOffice'])
    ])
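
To make the effect of this transformer concrete, here is a small sketch on a toy frame. It is an illustration under assumptions: it uses scikit-learn's own `OrdinalEncoder` rather than the `category_encoders` one imported above (the two may number categories differently), and the frame `toy` with its `Unused` column is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder  # scikit-learn's encoder, not category_encoders

# Hypothetical miniature of the project data.
toy = pd.DataFrame({
    "Generation": ["X", "Y", "Z", "Y"],
    "PayIncrease": [0.02, np.nan, 0.05, 0.03],
    "Tenure": [10, 4, 1, 6],
    "Unused": [1, 2, 3, 4],  # not listed in any transformer below
})

transf_toy = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["Generation"]),                     # categories -> integer codes
    ("null", SimpleImputer(strategy="median"), ["PayIncrease"]),   # NaN -> column median
    ("other", "passthrough", ["Tenure"]),                          # copied unchanged
])

out = transf_toy.fit_transform(toy)
print(out.shape)
print(out)
```

Columns that are not named in any transformer are dropped, because `remainder` defaults to `'drop'`. That is why `Unused` does not appear in the output.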


**Random Forest:**

In [7]:
model_for_12 = Pipeline([
    ('Feature Eng', transf_v12),
    # Note: a regressor is fitted on the target; its continuous predictions are what ROC AUC ranks
    ('Random Forest', RandomForestRegressor())
])

model_for_12.steps

[('Feature Eng',
  ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                   ['Generation', 'Gender', 'MProximity',
                                    'Rehire', 'Contract Type', 'PeopleManager',
                                    'Country', 'CostCenterH', 'TopPerformer']),
                                  ('null', SimpleImputer(strategy='median'),
                                   ['PayIncrease', 'Age']),
                                  ('other', 'passthrough',
                                   ['Tenure', 'Dependents', 'EmployeeCount',
                                    'HireCount', 'MTenure', 'HomeOffice'])])),
 ('Random Forest', RandomForestRegressor())]

In [8]:
results_for_12 = cross_val_score(model_for_12, X_train, y_train, cv = KFold(n_splits = 5, shuffle = True), scoring = "roc_auc")
results_for_12_mean = results_for_12.mean()
results_for_12_std = results_for_12.std()

In [9]:
results_for_12_mean

0.8802391937839198

In [10]:
results_for_12_std

0.007826395558931245

---

In [11]:
model_for_12.get_params()

{'memory': None,
 'steps': [('Feature Eng',
   ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                    ['Generation', 'Gender', 'MProximity',
                                     'Rehire', 'Contract Type', 'PeopleManager',
                                     'Country', 'CostCenterH', 'TopPerformer']),
                                   ('null', SimpleImputer(strategy='median'),
                                    ['PayIncrease', 'Age']),
                                   ('other', 'passthrough',
                                    ['Tenure', 'Dependents', 'EmployeeCount',
                                     'HireCount', 'MTenure', 'HomeOffice'])])),
  ('Random Forest', RandomForestRegressor())],
 'verbose': False,
 'Feature Eng': ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                  ['Generation', 'Gender', 'MProximity',
                                   'Rehire', 'Contract Type', 'PeopleManager',
         

In [12]:
params_rf = {
    'Random Forest__n_estimators': [10, 20, 50, 100],
    'Random Forest__max_depth': [1, 3, 9, None], 
    'Random Forest__max_features': ['sqrt', 'log2', None]
}

In [13]:
clf_rf = GridSearchCV(estimator = model_for_12, param_grid = params_rf, scoring = 'roc_auc' , cv = KFold(n_splits=5, shuffle=True, random_state=0), n_jobs = -1)
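
Before launching the search it helps to know how much work the grid implies. The sketch below counts the candidates with `ParameterGrid`, which enumerates the same combinations `GridSearchCV` evaluates. The grid is copied from the cell above; the fold count of 5 matches the `KFold` passed to the search.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as params_rf; the 'Random Forest__' prefix routes each value
# to the pipeline step with that name.
params_rf = {
    "Random Forest__n_estimators": [10, 20, 50, 100],
    "Random Forest__max_depth": [1, 3, 9, None],
    "Random Forest__max_features": ["sqrt", "log2", None],
}

grid = ParameterGrid(params_rf)
n_candidates = len(grid)      # 4 * 4 * 3 = 48 combinations
n_fits = n_candidates * 5     # one fit per candidate per fold = 240

print(n_candidates, n_fits)
print(grid[0])                # one candidate, as a dict of step__param -> value
```

With the default `refit=True`, the search performs one additional fit of the best candidate on the full training set once the cross-validation is finished.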

## Fit & Save:

In [14]:
#clf_rf.fit(X_train, y_train)

In [15]:
#clf_rf.cv_results_

In [16]:
#results_1 = pd.DataFrame(clf_rf.cv_results_)

In [17]:
#results_1.to_csv('results_1.csv')
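
Once the search results are saved, they can be reloaded and ranked without refitting. The following is a sketch on synthetic data, not the project's actual results: `make_classification` stands in for the training set, `RandomForestClassifier` is swapped in for the regressor used above, the grid is shrunk to keep it fast, and an in-memory buffer replaces the `results_1.csv` file.

```python
import io

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for X_train / y_train (assumption).
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([("Random Forest", RandomForestClassifier(random_state=0))])
small_grid = {
    "Random Forest__n_estimators": [10, 20],
    "Random Forest__max_depth": [1, 3],
}
search = GridSearchCV(
    pipe, small_grid, scoring="roc_auc",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_syn, y_syn)

# Round-trip through CSV, mirroring the save step in the notebook.
buffer = io.StringIO()
pd.DataFrame(search.cv_results_).to_csv(buffer, index=False)
buffer.seek(0)
results = pd.read_csv(buffer)

# Rank 1 is the best mean score across folds.
cols = [
    "param_Random Forest__n_estimators",
    "param_Random Forest__max_depth",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
ranked = results[cols].sort_values("rank_test_score")
best = ranked.iloc[0]
print(ranked)
```

After the CSV round trip the `params` column holds strings rather than dicts, so the individual `param_...` columns are the more convenient way to read back the winning settings.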

---

**Gradient Boosting:**

In [18]:
model_boost_12 = Pipeline([
    ('Feature Eng', transf_v12),
    ('Gradient Boosting', GradientBoostingClassifier())
])

In [19]:
results_boost_12 = cross_val_score(model_boost_12, X_train, y_train, cv = KFold(n_splits = 5, shuffle = True), scoring = "roc_auc")
results_boost_12_mean = results_boost_12.mean()
results_boost_12_std = results_boost_12.std()

In [20]:
results_boost_12_mean

0.8744620770808875

In [21]:
results_boost_12_std

0.01842666231848989
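
The two baseline scores are close: the Random Forest mean (0.880) lies within one standard deviation of the Gradient Boosting mean (0.874, std 0.018). Because each `cross_val_score` call above shuffles its folds without a seed, the two models were also scored on different splits. A sketch of a paired comparison on identical folds follows. It uses synthetic data and `RandomForestClassifier` in place of the regressor, both of which are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary data as a stand-in for X_train / y_train (assumption).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# One seeded splitter shared by both models, so every fold is identical.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, est in [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("boosting", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]:
    scores[name] = cross_val_score(est, X_syn, y_syn, cv=folds, scoring="roc_auc")
    print(f"{name}: {scores[name].mean():.3f} +/- {scores[name].std():.3f}")

# Fold-by-fold differences are meaningful because the folds match.
diff = scores["forest"] - scores["boosting"]
print(diff)
```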

---

In [22]:
model_boost_12.get_params()

{'memory': None,
 'steps': [('Feature Eng',
   ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                    ['Generation', 'Gender', 'MProximity',
                                     'Rehire', 'Contract Type', 'PeopleManager',
                                     'Country', 'CostCenterH', 'TopPerformer']),
                                   ('null', SimpleImputer(strategy='median'),
                                    ['PayIncrease', 'Age']),
                                   ('other', 'passthrough',
                                    ['Tenure', 'Dependents', 'EmployeeCount',
                                     'HireCount', 'MTenure', 'HomeOffice'])])),
  ('Gradient Boosting', GradientBoostingClassifier())],
 'verbose': False,
 'Feature Eng': ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                  ['Generation', 'Gender', 'MProximity',
                                   'Rehire', 'Contract Type', 'PeopleManager',


In [23]:
params_gb = {
    'Gradient Boosting__n_estimators': [10, 20, 50, 100],
    'Gradient Boosting__max_depth': [1, 3, 9, None], 
    'Gradient Boosting__learning_rate': [0.001, 0.01, 0.1]
}

In [24]:
clf_gb = GridSearchCV(estimator = model_boost_12, param_grid = params_gb, scoring = 'roc_auc' , cv = KFold(n_splits=5, shuffle=True, random_state=0), n_jobs = -1)

## Fit & Save:

In [25]:
#clf_gb.fit(X_train, y_train)

In [26]:
#clf_gb.cv_results_

In [27]:
#results_2 = pd.DataFrame(clf_gb.cv_results_)

In [28]:
#results_2.to_csv("results_2.csv")

---

## Best Models With Overestimating Features:

Data transformation:

In [30]:
transf_v6 = ColumnTransformer(
    [('cat', OrdinalEncoder(), ['Generation', 'Gender', 'MProximity']),
     ('null', SimpleImputer (missing_values= np.nan, strategy = 'median'), ['PayIncrease']), 
     ('other', "passthrough", ['Tenure', 'Dependents', 'TimeJobProfile', 'TimeinPosition'])
    ])

**Random Forest:**

In [31]:
model_for_6 = Pipeline([
    ('Feature Eng', transf_v6),
    ('Random Forest', RandomForestRegressor())
     ])

model_for_6.steps

[('Feature Eng',
  ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                   ['Generation', 'Gender', 'MProximity']),
                                  ('null', SimpleImputer(strategy='median'),
                                   ['PayIncrease']),
                                  ('other', 'passthrough',
                                   ['Tenure', 'Dependents', 'TimeJobProfile',
                                    'TimeinPosition'])])),
 ('Random Forest', RandomForestRegressor())]

In [32]:
results_for_6 = cross_val_score(model_for_6, X_train, y_train, cv = KFold(n_splits = 5, shuffle = True), scoring = "roc_auc")
results_for_6_mean = results_for_6.mean()
results_for_6_std = results_for_6.std()

In [33]:
results_for_6_mean

0.9781951815336406

In [34]:
results_for_6_std

0.003077569187063055
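
The jump in ROC AUC from about 0.88 to about 0.98 comes with the overestimating features. One way to see how strongly a single feature drives a score is `permutation_importance`, which is imported at the top of this notebook. The sketch below is purely illustrative: the data is synthetic, the column named `leaky` is a hypothetical near copy of the target, and nothing here reproduces the project's features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)                 # synthetic binary target
noise = rng.normal(size=(n, 3))                # three uninformative features
leaky = y + rng.normal(scale=0.1, size=n)      # hypothetical near copy of the target
X = np.column_stack([noise, leaky])            # leaky feature is column index 3

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in ROC AUC.
result = permutation_importance(
    model, X_va, y_va, scoring="roc_auc", n_repeats=5, random_state=0
)
print(result.importances_mean)
```

A feature whose shuffling collapses the score while all others barely matter is a candidate for closer inspection before it is trusted in a final model.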

In [35]:
model_for_6.get_params()

{'memory': None,
 'steps': [('Feature Eng',
   ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                    ['Generation', 'Gender', 'MProximity']),
                                   ('null', SimpleImputer(strategy='median'),
                                    ['PayIncrease']),
                                   ('other', 'passthrough',
                                    ['Tenure', 'Dependents', 'TimeJobProfile',
                                     'TimeinPosition'])])),
  ('Random Forest', RandomForestRegressor())],
 'verbose': False,
 'Feature Eng': ColumnTransformer(transformers=[('cat', OrdinalEncoder(),
                                  ['Generation', 'Gender', 'MProximity']),
                                 ('null', SimpleImputer(strategy='median'),
                                  ['PayIncrease']),
                                 ('other', 'passthrough',
                                  ['Tenure', 'Dependents', 'TimeJobProfile',
         

In [36]:
clf_rf_fp = GridSearchCV(estimator = model_for_6, param_grid = params_rf, scoring = 'roc_auc' , cv = KFold(n_splits=5, shuffle=True, random_state=0), n_jobs = -1)

## Fit & Save:

In [37]:
#clf_rf_fp.fit(X_train, y_train)

In [38]:
#clf_rf_fp.cv_results_

In [39]:
#results_3 = pd.DataFrame(clf_rf_fp.cv_results_)

In [40]:
#results_3.to_csv('results_3.csv')

---

Data transformation:

In [41]:
transf_v13 = ColumnTransformer(
    [('cat', OrdinalEncoder(), ['Generation', 'Gender', 'MProximity', 'Rehire', 'Contract Type', 'PeopleManager', 'Country', 'CostCenterH', 'TopPerformer']),
     ('null', SimpleImputer (missing_values= np.nan, strategy = 'median'), ['PayIncrease', 'Age']), 
     ('other', "passthrough", ['Tenure', 'Dependents', 'EmployeeCount', 'HireCount', 'MTenure', 'HomeOffice','TimeinPosition', 'TimeJobProfile', 'MTimeJobProfile', 'TerminationCount'])
    ])

**Gradient Boosting:**

In [42]:
model_boost_13 = Pipeline([
    ('Feature Eng', transf_v13),
    ('Gradient Boosting', GradientBoostingClassifier())
])

In [43]:
results_boost_13 = cross_val_score(model_boost_13, X_train, y_train, cv = KFold(n_splits = 5, shuffle = True), scoring = "roc_auc")
results_boost_13_mean = results_boost_13.mean()
results_boost_13_std = results_boost_13.std()

In [44]:
results_boost_13_mean

0.9673606061148897

In [45]:
results_boost_13_std

0.005089812768855205

---

In [46]:
clf_gb_fp = GridSearchCV(estimator = model_boost_13, param_grid = params_gb, scoring = 'roc_auc' , cv = KFold(n_splits=5, shuffle=True, random_state=0), n_jobs = -1)

## Fit & Save:

In [47]:
#clf_gb_fp.fit(X_train, y_train)

In [48]:
#clf_gb_fp.cv_results_

In [49]:
#results_4 = pd.DataFrame(clf_gb_fp.cv_results_)

In [50]:
#results_4.to_csv("results_4.csv")

---