# HR EMPLOYEE ATTRITION - HYPERPARAMETER TUNING
Data taken from: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

In this case study, an HR dataset was sourced from IBM HR Analytics Employee Attrition & Performance. It contains records for 1,470 employees, with various attributes for each employee. In this section we search for hyperparameter values that improve the respective models over their default settings.

As stated on the IBM website "This is a fictional data set created by IBM data scientists". Its main purpose was to demonstrate the IBM Watson Analytics tool for employee attrition.

## IMPORT LIBRARIES

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn import tree
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import confusion_matrix, classification_report, recall_score
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=FutureWarning)

## OVERVIEW

In [4]:
df = pd.read_csv('attrition_HP.csv')

In [5]:
df.head()

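|   | OverTime_Yes | Age | TotalWorkingYears | JobLevel | MonthlyIncome | DistanceFromHome | YearsAtCompany | WorkLifeBalance | YearsInCurrentRole | NumCompaniesWorked | Attrition |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41 | 8 | 2 | 5993 | 1 | 6 | 0 | 4 | 8 | 1 |
| 1 | 0 | 49 | 10 | 2 | 5130 | 8 | 10 | 2 | 7 | 1 | 0 |
| 2 | 1 | 37 | 7 | 4 | 2090 | 2 | 0 | 2 | 0 | 6 | 1 |
| 3 | 1 | 33 | 8 | 4 | 2909 | 3 | 8 | 2 | 7 | 1 | 0 |
| 4 | 0 | 27 | 6 | 4 | 3468 | 2 | 2 | 2 | 2 | 9 | 0 |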


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   OverTime_Yes        1470 non-null   int64
 1   Age                 1470 non-null   int64
 2   TotalWorkingYears   1470 non-null   int64
 3   JobLevel            1470 non-null   int64
 4   MonthlyIncome       1470 non-null   int64
 5   DistanceFromHome    1470 non-null   int64
 6   YearsAtCompany      1470 non-null   int64
 7   WorkLifeBalance     1470 non-null   int64
 8   YearsInCurrentRole  1470 non-null   int64
 9   NumCompaniesWorked  1470 non-null   int64
 10  Attrition           1470 non-null   int64
dtypes: int64(11)
memory usage: 126.5 KB


## PARAMETERS

In [7]:
# Random Forest Classifier

max_depth = [10, 20, 40, None]  # None (not the string 'None') means no depth limit
min_samples_leaf = [2, 4, 8]
min_samples_split = [2, 10, 100]
n_estimators = [10, 100, 500]

RFC_param = {'max_depth' : max_depth, 
             'min_samples_leaf': min_samples_leaf, 
             'min_samples_split' : min_samples_split, 
             'n_estimators' : n_estimators}

In [15]:
# XGBoost Classifier

max_depth = [3,5,10]
gamma = [1,2]
reg_alpha = [40,180]
reg_lambda = [0,1]
colsample_bytree = [0.5,1]
min_child_weight = [0, 10, 1]
n_estimators = [180, 200, 500]

XGC_param = {'max_depth': max_depth,
             'gamma': gamma,
             'reg_alpha' : reg_alpha,
             'reg_lambda' : reg_lambda,
             'colsample_bytree' : colsample_bytree,
             'min_child_weight' : min_child_weight,
             'n_estimators': n_estimators}
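
To put the two search spaces in perspective, it helps to count how many combinations each one contains. The sketch below copies the candidate lists from the cells above (with `None` written as the Python value) and multiplies their lengths. The helper name `grid_size` is introduced here for illustration only. `RandomizedSearchCV` samples 10 combinations per run unless `n_iter` is set, so each run below covers only a small part of either space.

```python
from math import prod

# Candidate lists copied from the cells above so this block runs on its own.
RFC_param = {'max_depth': [10, 20, 40, None],
             'min_samples_leaf': [2, 4, 8],
             'min_samples_split': [2, 10, 100],
             'n_estimators': [10, 100, 500]}

XGC_param = {'max_depth': [3, 5, 10],
             'gamma': [1, 2],
             'reg_alpha': [40, 180],
             'reg_lambda': [0, 1],
             'colsample_bytree': [0.5, 1],
             'min_child_weight': [0, 10, 1],
             'n_estimators': [180, 200, 500]}


def grid_size(param_grid):
    """Number of distinct combinations in a dict of candidate lists (hypothetical helper)."""
    return prod(len(values) for values in param_grid.values())


rfc_combinations = grid_size(RFC_param)  # 4 * 3 * 3 * 3
xgc_combinations = grid_size(XGC_param)  # 3 * 2 * 2 * 2 * 2 * 3 * 3

default_n_iter = 10  # RandomizedSearchCV default when n_iter is not passed
print('RFC combinations:', rfc_combinations, 'share sampled per run:', default_n_iter / rfc_combinations)
print('XGC combinations:', xgc_combinations, 'share sampled per run:', default_n_iter / xgc_combinations)
```

With 108 and 432 combinations respectively, a single default run tries roughly 9% of the Random Forest space and about 2% of the XGBoost space.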

## SPLIT DATA

In [10]:
# Split target - predictors

X = df.drop(['Attrition'], axis = 1)
y = df['Attrition']

In [11]:
# Split 80% train data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = .8, random_state = 42)
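
The split above is purely random. When the target classes are unbalanced, passing `stratify=y` keeps the class proportions the same in both parts. The sketch below uses made-up data, not the attrition file: 1,470 rows with an assumed positive rate of about 16%. The names `X_demo` and `y_demo` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 1,470 rows, 235 positives (assumed for illustration).
rng = np.random.default_rng(42)
y_demo = np.zeros(1470, dtype=int)
y_demo[:235] = 1
X_demo = rng.normal(size=(1470, 3))

# stratify=y_demo keeps the share of positives equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print('positive rate in train:', round(train_rate, 4))
print('positive rate in test: ', round(test_rate, 4))
```

Whether stratification changes the results in this notebook would need to be checked by rerunning the cells with that argument added.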

In [12]:
X_train.head()

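|   | OverTime_Yes | Age | TotalWorkingYears | JobLevel | MonthlyIncome | DistanceFromHome | YearsAtCompany | WorkLifeBalance | YearsInCurrentRole | NumCompaniesWorked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1097 | 0 | 24 | 2 | 4 | 2296 | 21 | 1 | 2 | 1 | 0 |
| 727 | 0 | 18 | 0 | 4 | 1051 | 5 | 0 | 2 | 0 | 1 |
| 254 | 0 | 29 | 10 | 2 | 6931 | 20 | 3 | 2 | 2 | 2 |
| 1175 | 0 | 39 | 7 | 2 | 5295 | 12 | 5 | 2 | 4 | 4 |
| 1341 | 0 | 31 | 10 | 2 | 4197 | 20 | 10 | 2 | 8 | 1 |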


In [13]:
X_test.head()

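|   | OverTime_Yes | Age | TotalWorkingYears | JobLevel | MonthlyIncome | DistanceFromHome | YearsAtCompany | WorkLifeBalance | YearsInCurrentRole | NumCompaniesWorked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1041 | 0 | 28 | 6 | 2 | 8463 | 5 | 5 | 2 | 4 | 0 |
| 184 | 0 | 53 | 5 | 2 | 4450 | 13 | 4 | 2 | 2 | 1 |
| 1222 | 0 | 24 | 1 | 4 | 1555 | 22 | 1 | 2 | 0 | 1 |
| 67 | 0 | 45 | 25 | 0 | 9724 | 7 | 1 | 2 | 0 | 2 |
| 220 | 0 | 36 | 16 | 2 | 5914 | 5 | 13 | 1 | 11 | 8 |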


## DEFAULT PARAMETER MODEL PREPARATION

In [14]:
# Random Forest Classifier

RFC = RandomForestClassifier().fit(X_train, y_train)

# XGBoost Classifier

XGC = XGBClassifier().fit(X_train, y_train)

## HYPERPARAMETER TUNING

In [16]:

# Random Forest Classifier

def CVRFC (est, xtr, ytr):
    result = RandomizedSearchCV(estimator = est, 
                                param_distributions = RFC_param, 
                                cv=5, scoring = 'accuracy').fit(xtr, ytr)
    
    return result

# XGBoost Classifier

def CVXGC (est, xtr, ytr):
    result = RandomizedSearchCV(estimator = est, 
                                param_distributions = XGC_param, 
                                cv=5, scoring = 'accuracy').fit(xtr, ytr)
    
    return result
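
Both helper functions rely on the defaults of `RandomizedSearchCV`, so each call draws a fresh random sample of combinations and repeated calls can return different winners. The sketch below shows one way to make a search repeatable by fixing `random_state` and setting `n_iter` explicitly. It uses a small synthetic dataset from `make_classification` and a reduced grid so that it runs quickly; `demo_param` and `run_search` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset standing in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Reduced, hypothetical search space to keep the example fast.
demo_param = {'max_depth': [3, 5, None],
              'min_samples_leaf': [2, 4, 8],
              'n_estimators': [10, 20]}


def run_search(seed):
    """Run a randomized search whose sampled combinations depend only on `seed`."""
    search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                                param_distributions=demo_param,
                                n_iter=5,            # combinations tried per run
                                cv=3,
                                scoring='accuracy',
                                random_state=seed)   # fixes which combinations are sampled
    return search.fit(X_demo, y_demo)


first = run_search(seed=42)
second = run_search(seed=42)

print(first.best_params_, round(first.best_score_, 4))
print(second.best_params_, round(second.best_score_, 4))
```

`best_score_` is the mean cross-validated score of the winning combination, which gives a basis for comparing runs before touching the test set.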

In [17]:
# Random Forest Classifier

for i in range(1,4):
    cv_rfc = CVRFC(RFC, X_train, y_train)
    print('Hyper Model', i, cv_rfc.best_params_)

Hyper Model 1 {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 8, 'max_depth': 20}
Hyper Model 2 {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_depth': 10}
Hyper Model 3 {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 8, 'max_depth': 20}


In [18]:
# XGBoost Classifier

for i in range(1,4):
    cv_xgc = CVXGC(XGC, X_train, y_train)
    print('Hyper Model', i, cv_xgc.best_params_)

Hyper Model 1 {'reg_lambda': 1, 'reg_alpha': 180, 'n_estimators': 200, 'min_child_weight': 0, 'max_depth': 3, 'gamma': 2, 'colsample_bytree': 0.5}
Hyper Model 2 {'reg_lambda': 1, 'reg_alpha': 40, 'n_estimators': 180, 'min_child_weight': 10, 'max_depth': 3, 'gamma': 1, 'colsample_bytree': 0.5}
Hyper Model 3 {'reg_lambda': 1, 'reg_alpha': 40, 'n_estimators': 500, 'min_child_weight': 0, 'max_depth': 3, 'gamma': 1, 'colsample_bytree': 1}


## APPLY TUNED PARAMETER

In [19]:
# Random Forest Classifier

hyper_RFC1 = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, min_samples_leaf = 8, max_depth = 20).fit(X_train, y_train)
hyper_RFC2 = RandomForestClassifier(n_estimators = 100, min_samples_split = 10, min_samples_leaf = 4, max_depth = 10).fit(X_train, y_train)
hyper_RFC3 = RandomForestClassifier(n_estimators = 500, min_samples_split = 10, min_samples_leaf = 8, max_depth = 20).fit(X_train, y_train)

In [22]:
# XGBoost Classifier

hyper_XGC1 = XGBClassifier(reg_lambda=1, reg_alpha = 180, n_estimators = 200, min_child_weight = 0, max_depth = 3, gamma = 2, colsample_bytree = 0.5, seed = 0).fit(X_train, y_train)
hyper_XGC2 = XGBClassifier(reg_lambda=1, reg_alpha = 40, n_estimators = 180, min_child_weight = 10, max_depth = 3, gamma = 1, colsample_bytree = 0.5, seed = 0).fit(X_train, y_train)
hyper_XGC3 = XGBClassifier(reg_lambda=1, reg_alpha = 40, n_estimators = 500, min_child_weight = 0, max_depth = 3, gamma = 1, colsample_bytree = 1, seed = 0).fit(X_train, y_train)
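
The tuned values above were typed in by hand from the printed output. As an alternative sketch, the search object itself can supply them: `best_estimator_` is already refit on the full training data (the default `refit=True`), and `best_params_` can be unpacked into a new model. The example below uses synthetic data and a small hypothetical grid so it runs on its own; it is not a rerun of the notebook's searches.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the attrition data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions={'max_depth': [3, 5, None], 'min_samples_leaf': [2, 4, 8]},
    n_iter=4, cv=3, scoring='accuracy', random_state=42,
).fit(X_tr, y_tr)

# Option 1: the winning model, already refit on all of X_tr / y_tr.
tuned = search.best_estimator_

# Option 2: build a fresh model from the winning parameter dict.
rebuilt = RandomForestClassifier(n_estimators=20, random_state=0,
                                 **search.best_params_).fit(X_tr, y_tr)

print('best params:', search.best_params_)
print('test accuracy (best_estimator_):', tuned.score(X_te, y_te))
print('test accuracy (rebuilt):        ', rebuilt.score(X_te, y_te))
```

Either option avoids copy errors when the search is rerun and the winning values change.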

In [23]:
# y_predict for Random Forest Classifier

yp_RFC = RFC.predict(X_test)
yp_hyper_RFC1 = hyper_RFC1.predict(X_test)
yp_hyper_RFC2 = hyper_RFC2.predict(X_test)
yp_hyper_RFC3 = hyper_RFC3.predict(X_test)

# y_predict for XGBoost Classifier

yp_XGC = XGC.predict(X_test)
yp_hyper_XGC1 = hyper_XGC1.predict(X_test)
yp_hyper_XGC2 = hyper_XGC2.predict(X_test)
yp_hyper_XGC3 = hyper_XGC3.predict(X_test)

## MEASURE THE DEFAULT VS THE HYPERPARAMETER TUNED

In [24]:
# Measure the default vs hyperparameter tuned model score for Random Forest Classifier Model

RFC_acc = RFC.score(X_test, y_test)
hyper_RFC1_acc = hyper_RFC1.score(X_test, y_test)
hyper_RFC2_acc = hyper_RFC2.score(X_test, y_test)
hyper_RFC3_acc = hyper_RFC3.score(X_test, y_test)

In [25]:
model_RFC_score = pd.DataFrame({'Random Forest Classifier' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                'Accuracy Score': [RFC_acc, hyper_RFC1_acc, hyper_RFC2_acc, hyper_RFC3_acc]})

In [26]:
model_RFC_score

|   | Random Forest Classifier | Accuracy Score |
|---|---|---|
| 0 | Default | 0.857143 |
| 1 | Hyper Test 1 | 0.857143 |
| 2 | Hyper Test 2 | 0.857143 |
| 3 | Hyper Test 3 | 0.857143 |


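> All four Random Forest Classifier models reach the same test accuracy (0.857143), so tuning gave no measurable gain on this split and any of them can be used. Identical accuracy does not by itself show that the models make identical predictions.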
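
Accuracy alone can hide differences between models when one class is much more common than the other. The sketch below illustrates this with made-up labels, not the notebook's test set: 294 rows with 42 positives, chosen so that a model predicting only the majority class lands on the same 0.857 accuracy figure. Whether the models above behave this way cannot be told from the accuracy table; checking `recall_score` or `confusion_matrix` on the real predictions would show it.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 252 negatives and 42 positives (assumed for illustration).
y_true = [0] * 252 + [1] * 42

# A degenerate "model" that always predicts the majority class.
y_majority = [0] * 294

acc = accuracy_score(y_true, y_majority)   # share of correct predictions: 252 / 294
rec = recall_score(y_true, y_majority)     # share of positives that were found
cm = confusion_matrix(y_true, y_majority)  # rows: true class, columns: predicted class

print('accuracy:', round(acc, 6))
print('recall:  ', rec)
print(cm)
```

In the notebook itself, the same three calls could be applied to `y_test` and each of the `yp_*` prediction arrays.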

In [27]:
# Measure the default vs hyperparameter tuned model score for XGBoost Classifier Model

XGC_acc = XGC.score(X_test, y_test)
hyper_XGC1_acc = hyper_XGC1.score(X_test, y_test)
hyper_XGC2_acc = hyper_XGC2.score(X_test, y_test)
hyper_XGC3_acc = hyper_XGC3.score(X_test, y_test)

In [28]:
model_XGC_score = pd.DataFrame({'XGBoost Classifier' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                'Accuracy Score': [XGC_acc, hyper_XGC1_acc, hyper_XGC2_acc, hyper_XGC3_acc]})

In [29]:
model_XGC_score

|   | XGBoost Classifier | Accuracy Score |
|---|---|---|
| 0 | Default | 0.823129 |
| 1 | Hyper Test 1 | 0.867347 |
| 2 | Hyper Test 2 | 0.867347 |
| 3 | Hyper Test 3 | 0.867347 |


> For the XGBoost Classifier, all three tuned models score higher than the default model (0.867347 versus 0.823129). Since the tuned models tie with each other, any of them can be used.