In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Project Summary: Business Case Analysis
## Data Requirement:
   - The data is provided as an Excel file and concerns a data analytics firm whose employee performance indexes and customer satisfaction levels are declining.
   - Before taking any remedial measures, the firm needs to determine which factors drive poor performance and to be able to predict, from an employee's attributes, whether that employee is performing poorly.
   - We are asked to develop an ML model that analyses the current employees and identifies the factors behind poor performance. The model can additionally predict an employee's performance rating, and hence whether the employee is underperforming.
   - The insights expected from the project are:
        1. Department-wise performance
        2. Top 3 important factors affecting employee performance
        3. A trained model that predicts employee performance from the input factors. This will be used to hire employees.
        4. Recommendations to improve employee performance, based on insights from the analysis.

## Data Loading and basic checks

In [2]:
#Load data from excel
data = pd.read_excel('/Users/subbalakshmivedam/Desktop/datascience projects/IABAC/data/processed.xlsx',index_col = 0)

In [3]:
#See the data
data.head()

Unnamed: 0,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition,PerformanceRating,Male,Female
0,2,2,5,13,1,10,3,4,55,3,...,4,10,2,2,7,0,0,3,1,0
1,2,2,5,13,1,14,4,4,42,3,...,4,20,2,3,7,1,0,3,1,0
2,1,1,5,13,2,5,4,4,48,2,...,3,20,2,3,13,1,0,4,1,0
3,0,0,3,8,1,10,4,2,73,2,...,2,23,2,2,6,12,0,3,1,0
4,2,2,5,13,1,16,4,1,84,3,...,4,10,1,3,2,2,0,3,1,0


In [4]:
#Data info
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1200 entries, 0 to 1199
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   EducationBackground           1200 non-null   int64
 1   MaritalStatus                 1200 non-null   int64
 2   EmpDepartment                 1200 non-null   int64
 3   EmpJobRole                    1200 non-null   int64
 4   BusinessTravelFrequency       1200 non-null   int64
 5   DistanceFromHome              1200 non-null   int64
 6   EmpEducationLevel             1200 non-null   int64
 7   EmpEnvironmentSatisfaction    1200 non-null   int64
 8   EmpHourlyRate                 1200 non-null   int64
 9   EmpJobInvolvement             1200 non-null   int64
 10  EmpJobSatisfaction            1200 non-null   int64
 11  NumCompaniesWorked            1200 non-null   int64
 12  OverTime                      1200 non-null   int64
 13  EmpLastSalaryHikePercent      120

In [5]:
#Details and spread of numerical columns
data.describe()

Unnamed: 0,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition,PerformanceRating,Male,Female
count,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,...,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0
mean,2.235,1.096667,3.215,9.039167,1.075,9.165833,2.8925,2.715833,65.981667,2.731667,...,2.725,11.33,2.785833,2.744167,4.291667,2.194167,0.148333,2.948333,0.604167,0.395833
std,1.31004,0.73105,1.696911,4.754451,0.53816,8.176636,1.04412,1.090599,20.211302,0.707164,...,1.075642,7.797228,1.263446,0.699374,3.613744,3.22156,0.355578,0.518866,0.489233,0.489233
min,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,30.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0
25%,1.0,1.0,1.0,4.0,1.0,2.0,2.0,2.0,48.0,2.0,...,2.0,6.0,2.0,2.0,2.0,0.0,0.0,3.0,0.0,0.0
50%,2.0,1.0,4.0,9.0,1.0,7.0,3.0,3.0,66.0,3.0,...,3.0,10.0,3.0,3.0,3.0,1.0,0.0,3.0,1.0,0.0
75%,3.0,2.0,5.0,13.0,1.0,14.0,4.0,4.0,83.0,3.0,...,4.0,15.0,3.0,3.0,7.0,3.0,0.0,3.0,1.0,1.0
max,5.0,2.0,5.0,18.0,2.0,29.0,5.0,4.0,100.0,4.0,...,4.0,40.0,6.0,4.0,18.0,15.0,1.0,4.0,1.0,1.0


## Scaling:
- Scaling puts every feature on the same range, here 0 to 1, so that features with large raw values do not dominate distance-based models such as KNN.
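
As a worked illustration of what min-max scaling does, the sketch below applies the formula `(x - min) / (max - min)` by hand to the first five `DistanceFromHome` values from `data.head()`, using the minimum (1) and maximum (29) reported by `data.describe()`. The helper name `min_max_scale` is hypothetical and is not part of the notebook.

```python
def min_max_scale(values, lo, hi):
    """Map each value from [lo, hi] onto [0, 1] using (x - lo) / (hi - lo)."""
    span = hi - lo
    return [(v - lo) / span for v in values]

# DistanceFromHome: first five rows from data.head(); min and max from data.describe()
distance = [10, 14, 5, 10, 16]
scaled = min_max_scale(distance, lo=1, hi=29)
print([round(s, 6) for s in scaled])
# → [0.321429, 0.464286, 0.142857, 0.321429, 0.535714]
```

These values agree with the `DistanceFromHome` column shown by `data_scaled.head()`.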

In [6]:
#We'll employ min-max scaling, which maps every feature onto the 0-1 range
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
#Scale every column except the target; keep the original index so the target aligns later
features = data.drop('PerformanceRating',axis=1)
data_scaled = pd.DataFrame(mms.fit_transform(features),columns = features.columns,index = features.index)

In [7]:
data_scaled['y'] = data.PerformanceRating

In [8]:
data_scaled.head()

Unnamed: 0,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,Attrition,Male,Female,y
0,0.4,1.0,1.0,0.722222,0.5,0.321429,0.5,1.0,0.357143,0.666667,...,1.0,0.25,0.333333,0.333333,0.388889,0.0,0.0,1.0,0.0,3
1,0.4,1.0,1.0,0.722222,0.5,0.464286,0.75,1.0,0.171429,0.666667,...,1.0,0.5,0.333333,0.666667,0.388889,0.066667,0.0,1.0,0.0,3
2,0.2,0.5,1.0,0.722222,1.0,0.142857,0.75,1.0,0.257143,0.333333,...,0.666667,0.5,0.333333,0.666667,0.722222,0.066667,0.0,1.0,0.0,4
3,0.0,0.0,0.6,0.444444,0.5,0.321429,0.75,0.333333,0.614286,0.333333,...,0.333333,0.575,0.333333,0.333333,0.333333,0.8,0.0,1.0,0.0,3
4,0.4,1.0,1.0,0.722222,0.5,0.535714,0.75,0.0,0.771429,0.666667,...,1.0,0.25,0.166667,0.666667,0.111111,0.133333,0.0,1.0,0.0,3


In [9]:
#Creating x (predictors) and y (target) -- the usual conventional notation
x = data_scaled.drop('y',axis=1)
y = data_scaled.y

In [10]:
y.value_counts()
#The data is imbalanced, so let's try creating models with both the imbalanced and the balanced data

3    874
2    194
4    132
Name: y, dtype: int64

In [11]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=42)
y_train.value_counts()

3    642
2    157
4    101
Name: y, dtype: int64

In [12]:
#Balancing data through smote
from imblearn.over_sampling import SMOTE
sm = SMOTE()
x_smote,y_smote = sm.fit_resample(x_train,y_train)
y_smote.value_counts()

3    642
4    642
2    642
Name: y, dtype: int64

In [13]:
#Balancing data through sklearn minority resampling.
#Creating copy of train data
data_min_resample = x_train.copy()
data_min_resample['y'] = y_train
data_min_resample.y.value_counts()

3    642
2    157
4    101
Name: y, dtype: int64

In [14]:
#Separating classwise data
x_2 = data_min_resample.loc[data_min_resample.y==2]
x_3 = data_min_resample.loc[data_min_resample.y==3]
x_4 = data_min_resample.loc[data_min_resample.y==4]
print(len(x_2),len(x_3),len(x_4))

157 642 101


In [15]:
from sklearn.utils import resample
x_2_upsampled = resample(x_2,replace=True,n_samples=642,random_state=42)
x_4_upsampled = resample(x_4,replace=True,n_samples=642,random_state=42)

In [16]:
data_resampled = pd.concat([x_2_upsampled,x_4_upsampled,x_3])

In [17]:
#Creating re-sampled training predictor and target data from resampled data.
x1_train = data_resampled.drop('y',axis=1)
y1_train = data_resampled.y
y1_train.value_counts()

2    642
4    642
3    642
Name: y, dtype: int64

In [18]:
#DF for storing results of each model
Results = pd.DataFrame(index=['Accuracy','Precision','Recall','F1'])

In [19]:
#Importing the metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

#### Model developing
- We have three training sets: the original imbalanced data, SMOTE-balanced data, and data balanced by upsampling the minority classes with sklearn's `resample`.
- We'll use multi-class logistic regression, KNN, decision tree, Random Forest and Gradient Boosting algorithms with each of these data sets.
- We'll store the results of each model in a data frame and compare them at the end.
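
Each model cell below repeats the same fit, predict and score steps. One way to avoid the repetition is a small helper like the sketch here. The name `evaluate_model` and the synthetic three-class data are assumptions made for illustration; the notebook itself writes the steps out in every cell.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def evaluate_model(model, x_train, y_train, x_test, y_test):
    """Fit the model and return [accuracy, macro precision, macro recall, macro F1]."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return [
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred, average="macro", zero_division=0),
        recall_score(y_test, y_pred, average="macro", zero_division=0),
        f1_score(y_test, y_pred, average="macro", zero_division=0),
    ]

# Synthetic 3-class stand-in for the employee data (an assumption for this demo)
x_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42
)

scores = evaluate_model(LogisticRegression(max_iter=1000), xd_train, yd_train, xd_test, yd_test)
print(dict(zip(["Accuracy", "Precision", "Recall", "F1"], scores)))
```

The returned list has the same order as the index of the `Results` data frame, so it could be assigned directly to a new column.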

In [20]:
#Logreg with imbalanced data
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(multi_class='ovr')
LR.fit(x_train,y_train)
y_pred = LR.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['LogReg w/o balancing'] = [acc,pr,re,f1]

Accuracy:  0.8133333333333334 
Precision:  0.701640656262505 
Recall Score 0.6266134042830372 
F1 Score 0.6562204490718974 
Confusion Matrix
 [[ 18  18   1]
 [ 16 211   5]
 [  0  16  15]]
              precision    recall  f1-score   support

           2       0.53      0.49      0.51        37
           3       0.86      0.91      0.88       232
           4       0.71      0.48      0.58        31

    accuracy                           0.81       300
   macro avg       0.70      0.63      0.66       300
weighted avg       0.81      0.81      0.81       300
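
The macro-averaged scores printed above can be reproduced from the confusion matrix alone, which makes it clear what `average='macro'` does: each class gets its own precision, recall and F1, and the three values are averaged with equal weight. The sketch below assumes the scikit-learn convention that rows are true classes and columns are predicted classes, in the order 2, 3, 4.

```python
import numpy as np

# Confusion matrix printed for 'LogReg w/o balancing'
# rows = true class, columns = predicted class (order 2, 3, 4)
cm = np.array([[18, 18, 1],
               [16, 211, 5],
               [0, 16, 15]])

tp = np.diag(cm)                     # correctly predicted samples per class
precision = tp / cm.sum(axis=0)      # column sums = samples predicted as each class
recall = tp / cm.sum(axis=1)         # row sums = true samples of each class
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()
print(accuracy, macro_precision, macro_recall, macro_f1)
```

Because every class counts equally in the macro average, the weak recall on classes 2 and 4 pulls the macro scores well below the accuracy of 0.81.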



In [21]:
#Logistic Regression with SMOTE balancing
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(multi_class='ovr')
LR.fit(x_smote,y_smote)
y_pred = LR.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['LogReg with smote'] = [acc,pr,re,f1]

Accuracy:  0.6966666666666667 
Precision:  0.5870518172727724 
Recall Score 0.7696677489502851 
F1 Score 0.6218608253125345 
Confusion Matrix
 [[ 30   4   3]
 [ 47 153  32]
 [  1   4  26]]
              precision    recall  f1-score   support

           2       0.38      0.81      0.52        37
           3       0.95      0.66      0.78       232
           4       0.43      0.84      0.57        31

    accuracy                           0.70       300
   macro avg       0.59      0.77      0.62       300
weighted avg       0.83      0.70      0.72       300



In [22]:
#Logistic Regression with minority resampling from sklearn
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(multi_class='ovr')
LR.fit(x1_train,y1_train)
y_pred = LR.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['LogReg MinReSmpling'] = [acc,pr,re,f1]

Accuracy:  0.6833333333333333 
Precision:  0.5733333162125814 
Recall Score 0.7470324885508424 
F1 Score 0.6033376417420638 
Confusion Matrix
 [[ 29   4   4]
 [ 48 151  33]
 [  2   4  25]]
              precision    recall  f1-score   support

           2       0.37      0.78      0.50        37
           3       0.95      0.65      0.77       232
           4       0.40      0.81      0.54        31

    accuracy                           0.68       300
   macro avg       0.57      0.75      0.60       300
weighted avg       0.82      0.68      0.71       300



In [23]:
#KNN without data balancing
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier()
KNN.fit(x_train,y_train)
y_pred = KNN.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['KNN w/o balance'] = [acc,pr,re,f1]

Accuracy:  0.7566666666666667 
Precision:  0.5510899262438292 
Recall Score 0.45253610117347604 
F1 Score 0.4648328967214416 
Confusion Matrix
 [[ 13  24   0]
 [ 17 211   4]
 [  1  27   3]]
              precision    recall  f1-score   support

           2       0.42      0.35      0.38        37
           3       0.81      0.91      0.85       232
           4       0.43      0.10      0.16        31

    accuracy                           0.76       300
   macro avg       0.55      0.45      0.46       300
weighted avg       0.72      0.76      0.72       300



In [24]:
#KNN with SMOTE balanced data
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier()
KNN.fit(x_smote,y_smote)
y_pred = KNN.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['KNN with SMOTE'] = [acc,pr,re,f1]

Accuracy:  0.5533333333333333 
Precision:  0.486603101957793 
Recall Score 0.6327613765044243 
F1 Score 0.4849153444559986 
Confusion Matrix
 [[ 25   8   4]
 [ 64 119  49]
 [  3   6  22]]
              precision    recall  f1-score   support

           2       0.27      0.68      0.39        37
           3       0.89      0.51      0.65       232
           4       0.29      0.71      0.42        31

    accuracy                           0.55       300
   macro avg       0.49      0.63      0.48       300
weighted avg       0.76      0.55      0.59       300



In [25]:
#KNN with minority resampling from sklearn
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier()
KNN.fit(x1_train,y1_train)
y_pred = KNN.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['KNN MinReSmpling'] = [acc,pr,re,f1]

Accuracy:  0.64 
Precision:  0.5143815212263713 
Recall Score 0.6380851095812163 
F1 Score 0.5378261786919042 
Confusion Matrix
 [[ 22  11   4]
 [ 49 149  34]
 [  1   9  21]]
              precision    recall  f1-score   support

           2       0.31      0.59      0.40        37
           3       0.88      0.64      0.74       232
           4       0.36      0.68      0.47        31

    accuracy                           0.64       300
   macro avg       0.51      0.64      0.54       300
weighted avg       0.76      0.64      0.67       300



In [26]:
#Decision Tree with imbalanced data
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['DT w/o balance'] = [acc,pr,re,f1]

Accuracy:  0.89 
Precision:  0.8088529748283753 
Recall Score 0.8209684935213298 
F1 Score 0.8147955747955749 
Confusion Matrix
 [[ 27  10   0]
 [ 10 215   7]
 [  1   5  25]]
              precision    recall  f1-score   support

           2       0.71      0.73      0.72        37
           3       0.93      0.93      0.93       232
           4       0.78      0.81      0.79        31

    accuracy                           0.89       300
   macro avg       0.81      0.82      0.81       300
weighted avg       0.89      0.89      0.89       300



In [27]:
#Decision Tree with smote
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_smote,y_smote)
y_pred=dt.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['DT with smote'] = [acc,pr,re,f1]

Accuracy:  0.9 
Precision:  0.8423908169670881 
Recall Score 0.8142192526230346 
F1 Score 0.827560580172607 
Confusion Matrix
 [[ 28   9   0]
 [  8 219   5]
 [  0   8  23]]
              precision    recall  f1-score   support

           2       0.78      0.76      0.77        37
           3       0.93      0.94      0.94       232
           4       0.82      0.74      0.78        31

    accuracy                           0.90       300
   macro avg       0.84      0.81      0.83       300
weighted avg       0.90      0.90      0.90       300



In [28]:
#Decision Tree with minority resampling from sklearn
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x1_train,y1_train)
y_pred=dt.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['DT MinReSmpling'] = [acc,pr,re,f1]

Accuracy:  0.89 
Precision:  0.8053446877024383 
Recall Score 0.8495137239575504 
F1 Score 0.8242501377363406 
Confusion Matrix
 [[ 32   5   0]
 [ 14 211   7]
 [  0   7  24]]
              precision    recall  f1-score   support

           2       0.70      0.86      0.77        37
           3       0.95      0.91      0.93       232
           4       0.77      0.77      0.77        31

    accuracy                           0.89       300
   macro avg       0.81      0.85      0.82       300
weighted avg       0.90      0.89      0.89       300



In [29]:
#Random Forest with imbalanced data
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(x_train,y_train)
y_pred = rf_clf.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['RF w/o balance'] = [acc,pr,re,f1]

Accuracy:  0.93 
Precision:  0.907651375224597 
Recall Score 0.8481232901422001 
F1 Score 0.8734262925345728 
Confusion Matrix
 [[ 32   5   0]
 [  5 225   2]
 [  0   9  22]]
              precision    recall  f1-score   support

           2       0.86      0.86      0.86        37
           3       0.94      0.97      0.96       232
           4       0.92      0.71      0.80        31

    accuracy                           0.93       300
   macro avg       0.91      0.85      0.87       300
weighted avg       0.93      0.93      0.93       300



In [30]:
#Random Forest with smote
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(x_smote,y_smote)
y_pred = rf_clf.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['RF with SMOTE'] = [acc,pr,re,f1]

Accuracy:  0.9233333333333333 
Precision:  0.8661782661782661 
Recall Score 0.8714537674493181 
F1 Score 0.8685434242050972 
Confusion Matrix
 [[ 33   4   0]
 [  6 220   6]
 [  0   7  24]]
              precision    recall  f1-score   support

           2       0.85      0.89      0.87        37
           3       0.95      0.95      0.95       232
           4       0.80      0.77      0.79        31

    accuracy                           0.92       300
   macro avg       0.87      0.87      0.87       300
weighted avg       0.92      0.92      0.92       300



In [31]:
#Random Forest with minority resampling from sklearn
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(x1_train,y1_train)
y_pred = rf_clf.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['RF MinReSmpling'] = [acc,pr,re,f1]

Accuracy:  0.9266666666666666 
Precision:  0.8782430213464697 
Recall Score 0.8728905490585136 
F1 Score 0.8744294620244357 
Confusion Matrix
 [[ 33   4   0]
 [  7 221   4]
 [  0   7  24]]
              precision    recall  f1-score   support

           2       0.82      0.89      0.86        37
           3       0.95      0.95      0.95       232
           4       0.86      0.77      0.81        31

    accuracy                           0.93       300
   macro avg       0.88      0.87      0.87       300
weighted avg       0.93      0.93      0.93       300



In [32]:
#Gradient Boosting with imbalanced data
from sklearn.ensemble import GradientBoostingClassifier
gbm=GradientBoostingClassifier()
gbm.fit(x_train,y_train) 
y_pred=gbm.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['GB w/o balance'] = [acc,pr,re,f1]

Accuracy:  0.9266666666666666 
Precision:  0.8716183574879227 
Recall Score 0.8822064556213611 
F1 Score 0.8763986468904501 
Confusion Matrix
 [[ 33   4   0]
 [  7 220   5]
 [  0   6  25]]
              precision    recall  f1-score   support

           2       0.82      0.89      0.86        37
           3       0.96      0.95      0.95       232
           4       0.83      0.81      0.82        31

    accuracy                           0.93       300
   macro avg       0.87      0.88      0.88       300
weighted avg       0.93      0.93      0.93       300



In [33]:
#Gradient Boosting with SMOTE
from sklearn.ensemble import GradientBoostingClassifier
gbm=GradientBoostingClassifier()
gbm.fit(x_smote,y_smote) 
y_pred=gbm.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['GB with SMOTE'] = [acc,pr,re,f1]

Accuracy:  0.9233333333333333 
Precision:  0.863114899414026 
Recall Score 0.8825133531751997 
F1 Score 0.872536849474856 
Confusion Matrix
 [[ 32   5   0]
 [  7 219   6]
 [  0   5  26]]
              precision    recall  f1-score   support

           2       0.82      0.86      0.84        37
           3       0.96      0.94      0.95       232
           4       0.81      0.84      0.83        31

    accuracy                           0.92       300
   macro avg       0.86      0.88      0.87       300
weighted avg       0.92      0.92      0.92       300



In [34]:
#Gradient Boosting with minority resampling from sklearn
from sklearn.ensemble import GradientBoostingClassifier
gbm=GradientBoostingClassifier()
gbm.fit(x1_train,y1_train) 
y_pred=gbm.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['GB MinReSmpling'] = [acc,pr,re,f1]

Accuracy:  0.9166666666666666 
Precision:  0.8495698924731183 
Recall Score 0.8872120173566226 
F1 Score 0.8662732990160581 
Confusion Matrix
 [[ 33   4   0]
 [ 11 216   5]
 [  0   5  26]]
              precision    recall  f1-score   support

           2       0.75      0.89      0.81        37
           3       0.96      0.93      0.95       232
           4       0.84      0.84      0.84        31

    accuracy                           0.92       300
   macro avg       0.85      0.89      0.87       300
weighted avg       0.92      0.92      0.92       300



In [35]:
Results

Unnamed: 0,LogReg w/o balancing,LogReg with smote,LogReg MinReSmpling,KNN w/o balance,KNN with SMOTE,KNN MinReSmpling,DT w/o balance,DT with smote,DT MinReSmpling,RF w/o balance,RF with SMOTE,RF MinReSmpling,GB w/o balance,GB with SMOTE,GB MinReSmpling
Accuracy,0.813333,0.696667,0.683333,0.756667,0.553333,0.64,0.89,0.9,0.89,0.93,0.923333,0.926667,0.926667,0.923333,0.916667
Precision,0.701641,0.587052,0.573333,0.55109,0.486603,0.514382,0.808853,0.842391,0.805345,0.907651,0.866178,0.878243,0.871618,0.863115,0.84957
Recall,0.626613,0.769668,0.747032,0.452536,0.632761,0.638085,0.820968,0.814219,0.849514,0.848123,0.871454,0.872891,0.882206,0.882513,0.887212
F1,0.65622,0.621861,0.603338,0.464833,0.484915,0.537826,0.814796,0.827561,0.82425,0.873426,0.868543,0.874429,0.876399,0.872537,0.866273


#### From the results, it can be seen that the Random Forest model without any data balancing has yielded very good results (accuracy 0.93, macro F1 0.873). Its results stay consistent when it is trained on the balanced data too, which adds to the model's credibility.
- Let us try to enhance the model further through hyperparameter tuning.

In [36]:
#Hyperparameter tuning of random forest model
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
params = {"n_estimators":[int(x) for x in np.linspace(100,250,10)],"max_depth":list(range(1,20)),'min_samples_split':[3,6,9],'min_samples_leaf':list(range(2,20))}
rf = RandomForestClassifier()
rf_cv = RandomizedSearchCV(estimator=rf,scoring='f1',param_distributions = params,cv=3)
rf_cv.fit(x_train,y_train)

RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(),
                   param_distributions={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19],
                                        'min_samples_leaf': [2, 3, 4, 5, 6, 7,
                                                             8, 9, 10, 11, 12,
                                                             13, 14, 15, 16, 17,
                                                             18, 19],
                                        'min_samples_split': [3, 6, 9],
                                        'n_estimators': [100, 116, 133, 150,
                                                         166, 183, 200, 216,
                                                         233, 250]},
                   scoring='f1')

In [37]:
rf_cv.best_params_

{'n_estimators': 166,
 'min_samples_split': 6,
 'min_samples_leaf': 11,
 'max_depth': 13}
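
One point worth checking in the search above is the `scoring='f1'` argument. In scikit-learn that scorer uses the binary F1 average, which is not defined for a three-class target, so the search may score every candidate as NaN and return an arbitrary `best_params_`; with warnings suppressed this can go unnoticed. The sketch below, on synthetic data, shows the binary average being rejected and a search that uses `'f1_macro'` instead. Choosing macro averaging is an assumption made here to match the `average='macro'` metrics used in the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class stand-in for the employee data (an assumption for this demo)
x_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# f1_score defaults to average='binary', which raises for a multiclass target
try:
    f1_score(y_demo, y_demo)
    binary_f1_ok = True
except ValueError:
    binary_f1_ok = False

# A deliberately tiny search space so the demo runs quickly
params = {"n_estimators": [10, 20], "max_depth": [2, 3, 4]}
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=params,
    n_iter=3,
    scoring="f1_macro",   # multiclass-aware scorer
    cv=3,
    random_state=0,       # makes the sampled candidates repeatable
)
search.fit(x_demo, y_demo)
print(binary_f1_ok, search.best_params_, round(search.best_score_, 3))
```

Setting `random_state` on the search also keeps `best_params_` stable between runs, which makes it easier to carry the tuned values into the next cell.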

In [38]:
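#RF with imbalanced data and hyper parameter tuning
#Note: the model is fit on the imbalanced x_train, although the Results column label below says 'MinReSmpling'
#Note: the hyperparameter values used here differ from the best_params_ printed above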
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=200,min_samples_split=6,min_samples_leaf=10,max_depth=11)
rf_clf.fit(x_train,y_train)
y_pred = rf_clf.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['RF MinReSmpling_hyperparameter tuning'] = [acc,pr,re,f1]

Accuracy:  0.93 
Precision:  0.907629041678236 
Recall Score 0.846379610979166 
F1 Score 0.8710639301722104 
Confusion Matrix
 [[ 33   4   0]
 [  5 225   2]
 [  0  10  21]]
              precision    recall  f1-score   support

           2       0.87      0.89      0.88        37
           3       0.94      0.97      0.96       232
           4       0.91      0.68      0.78        31

    accuracy                           0.93       300
   macro avg       0.91      0.85      0.87       300
weighted avg       0.93      0.93      0.93       300



- Randomised search over the hyperparameter values didn't yield any improvement.
- We'll try grid search, which evaluates every combination in the parameter grid.
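
To see why an exhaustive grid search over this parameter grid is slow, the sketch below counts the model fits involved. It uses the same grid as the randomised search above (the `n_estimators` values are copied from the printed search object) and ignores the final refit on the full training set.

```python
from math import prod

params = {
    "n_estimators": [100, 116, 133, 150, 166, 183, 200, 216, 233, 250],
    "max_depth": list(range(1, 20)),
    "min_samples_split": [3, 6, 9],
    "min_samples_leaf": list(range(2, 20)),
}
cv_folds = 3

# Grid search fits one model per parameter combination per fold
n_combinations = prod(len(values) for values in params.values())
grid_fits = n_combinations * cv_folds

# RandomizedSearchCV samples n_iter=10 combinations by default
random_fits = 10 * cv_folds

print(n_combinations, grid_fits, random_fits)
# → 10260 30780 30
```

The grid search therefore trains roughly a thousand times as many forests as the default randomised search.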

In [39]:
#Commenting out this entire section, as grid search takes a lot of time. I did, however, run it once.

#Hyperparameter tuning of random forest model
#from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
#params = {"n_estimators":[int(x) for x in np.linspace(100,250,10)],"max_depth":list(range(1,20)),'min_samples_split':[3,6,9],'min_samples_leaf':list(range(2,20))}
#rf = RandomForestClassifier()
#grid_rf = GridSearchCV(rf,params,cv=3,scoring='f1',verbose=2)
#grid_rf.fit(x_train,y_train)
#grid_rf.best_params_

In [40]:
#RF with imbalanced data and hyper parameter tuning 2 (fit on x_train, although the Results column label says 'MinReSmpling')
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100,min_samples_split=2,min_samples_leaf=2,max_depth=15)
rf_clf.fit(x_train,y_train)
y_pred = rf_clf.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['RF MinReSmpling_hyperparameter tuning2'] = [acc,pr,re,f1]

Accuracy:  0.9266666666666666 
Precision:  0.8905453006701891 
Recall Score 0.8560024150958524 
F1 Score 0.8711141100614785 
Confusion Matrix
 [[ 32   5   0]
 [  6 223   3]
 [  0   8  23]]
              precision    recall  f1-score   support

           2       0.84      0.86      0.85        37
           3       0.94      0.96      0.95       232
           4       0.88      0.74      0.81        31

    accuracy                           0.93       300
   macro avg       0.89      0.86      0.87       300
weighted avg       0.93      0.93      0.93       300



In [41]:
Results

Unnamed: 0,LogReg w/o balancing,LogReg with smote,LogReg MinReSmpling,KNN w/o balance,KNN with SMOTE,KNN MinReSmpling,DT w/o balance,DT with smote,DT MinReSmpling,RF w/o balance,RF with SMOTE,RF MinReSmpling,GB w/o balance,GB with SMOTE,GB MinReSmpling,RF MinReSmpling_hyperparameter tuning,RF MinReSmpling_hyperparameter tuning2
Accuracy,0.813333,0.696667,0.683333,0.756667,0.553333,0.64,0.89,0.9,0.89,0.93,0.923333,0.926667,0.926667,0.923333,0.916667,0.93,0.926667
Precision,0.701641,0.587052,0.573333,0.55109,0.486603,0.514382,0.808853,0.842391,0.805345,0.907651,0.866178,0.878243,0.871618,0.863115,0.84957,0.907629,0.890545
Recall,0.626613,0.769668,0.747032,0.452536,0.632761,0.638085,0.820968,0.814219,0.849514,0.848123,0.871454,0.872891,0.882206,0.882513,0.887212,0.84638,0.856002
F1,0.65622,0.621861,0.603338,0.464833,0.484915,0.537826,0.814796,0.827561,0.82425,0.873426,0.868543,0.874429,0.876399,0.872537,0.866273,0.871064,0.871114


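#### Hyperparameter tuning did not materially change the results: macro F1 stayed at about 0.87 (0.871 for both tuned runs vs 0.873 for the untuned Random Forest on the same data).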

In [42]:
print(y_train.unique(),y_test.unique(),y_smote.unique(),y1_train.unique())

[3 4 2] [3 2 4] [3 4 2] [2 4 3]


In [43]:
#Prepare the labels for XGBoost
#XGBoost expects class labels 0..n_classes-1, but ours are 2, 3 and 4.
#So we remap y: here 2->0 and 3->1 (4 is kept for now), then 4->2 in a later cell
y_train=y_train.map({2:0,3:1,4:4})
y_test=y_test.map({2:0,3:1,4:4})
y_smote=y_smote.map({2:0,3:1,4:4})
y1_train=y1_train.map({2:0,3:1,4:4})

In [44]:
print(y_train.unique(),y_test.unique(),y_smote.unique(),y1_train.unique())

[1 4 0] [1 0 4] [1 4 0] [0 4 1]


In [45]:
y_train=y_train.map({0:0,1:1,4:2})
y_test=y_test.map({0:0,1:1,4:2})
y_smote=y_smote.map({0:0,1:1,4:2})
y1_train=y1_train.map({0:0,1:1,4:2})

In [46]:
print(y_train.unique(),y_test.unique(),y_smote.unique(),y1_train.unique())

[1 2 0] [1 0 2] [1 2 0] [0 2 1]
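
The two remapping steps above could also be done in a single pass, keeping an inverse mapping so that predictions can be reported as the original ratings. This is a small sketch on a toy Series; `label_map`, `inverse_map` and `y_demo` are names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy ratings standing in for y_train (assumption)
y_demo = pd.Series([3, 4, 2, 3, 3, 2])

# One-step mapping from PerformanceRating (2, 3, 4) to XGBoost labels (0, 1, 2)
label_map = {2: 0, 3: 1, 4: 2}
inverse_map = {encoded: rating for rating, encoded in label_map.items()}

y_encoded = y_demo.map(label_map)        # labels for model training
y_decoded = y_encoded.map(inverse_map)   # back to the original ratings

print(y_encoded.tolist())
print(y_decoded.tolist())
```

Applying `inverse_map` to `y_pred` would make the XGBoost classification reports show classes 2, 3 and 4, matching the reports of the other models.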


In [47]:
#XGBoost without data balancing
from xgboost import XGBClassifier
xgbc = XGBClassifier()
xgbc.fit(x_train,y_train)
y_pred = xgbc.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['XGB w/o balancing'] = [acc,pr,re,f1]

Accuracy:  0.91 
Precision:  0.8608298171589311 
Recall Score 0.8261018248504343 
F1 Score 0.8425490920431472 
Confusion Matrix
 [[ 29   8   0]
 [  6 221   5]
 [  0   8  23]]
              precision    recall  f1-score   support

           0       0.83      0.78      0.81        37
           1       0.93      0.95      0.94       232
           2       0.82      0.74      0.78        31

    accuracy                           0.91       300
   macro avg       0.86      0.83      0.84       300
weighted avg       0.91      0.91      0.91       300



In [48]:
#XGBoost with SMOTE
from xgboost import XGBClassifier
xgbc = XGBClassifier()
xgbc.fit(x_smote,y_smote)
y_pred = xgbc.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['XGB with SMOTE'] = [acc,pr,re,f1]

Accuracy:  0.92 
Precision:  0.8770513855259617 
Recall Score 0.8473003036406818 
F1 Score 0.8613983596570084 
Confusion Matrix
 [[ 30   7   0]
 [  6 222   4]
 [  0   7  24]]
              precision    recall  f1-score   support

           0       0.83      0.81      0.82        37
           1       0.94      0.96      0.95       232
           2       0.86      0.77      0.81        31

    accuracy                           0.92       300
   macro avg       0.88      0.85      0.86       300
weighted avg       0.92      0.92      0.92       300



In [49]:
#XGBoost with minority-class resampling from sklearn
from xgboost import XGBClassifier
xgbc = XGBClassifier()
xgbc.fit(x1_train,y1_train)
y_pred = xgbc.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred,average='macro')
re = recall_score(y_test,y_pred,average='macro')
f1 = f1_score(y_test,y_pred,average='macro')
cm = confusion_matrix(y_test,y_pred)
print('Accuracy: ',acc,'\nPrecision: ',pr,'\nRecall Score', re,'\nF1 Score', f1, '\nConfusion Matrix\n', cm)
print(classification_report(y_test,y_pred))
Results['XGB MinReSmpling'] = [acc,pr,re,f1]

Accuracy:  0.9166666666666666 
Precision:  0.8587523587523588 
Recall Score 0.8627516559941476 
F1 Score 0.860488641495393 
Confusion Matrix
 [[ 31   6   0]
 [  8 219   5]
 [  0   6  25]]
              precision    recall  f1-score   support

           0       0.79      0.84      0.82        37
           1       0.95      0.94      0.95       232
           2       0.83      0.81      0.82        31

    accuracy                           0.92       300
   macro avg       0.86      0.86      0.86       300
weighted avg       0.92      0.92      0.92       300



In [50]:
Results

Unnamed: 0,LogReg w/o balancing,LogReg with smote,LogReg MinReSmpling,KNN w/o balance,KNN with SMOTE,KNN MinReSmpling,DT w/o balance,DT with smote,DT MinReSmpling,RF w/o balance,RF with SMOTE,RF MinReSmpling,GB w/o balance,GB with SMOTE,GB MinReSmpling,RF MinReSmpling_hyperparameter tuning,RF MinReSmpling_hyperparameter tuning2,XGB w/o balancing,XGB with SMOTE,XGB MinReSmpling
Accuracy,0.813333,0.696667,0.683333,0.756667,0.553333,0.64,0.89,0.9,0.89,0.93,0.923333,0.926667,0.926667,0.923333,0.916667,0.93,0.926667,0.91,0.92,0.916667
Precision,0.701641,0.587052,0.573333,0.55109,0.486603,0.514382,0.808853,0.842391,0.805345,0.907651,0.866178,0.878243,0.871618,0.863115,0.84957,0.907629,0.890545,0.86083,0.877051,0.858752
Recall,0.626613,0.769668,0.747032,0.452536,0.632761,0.638085,0.820968,0.814219,0.849514,0.848123,0.871454,0.872891,0.882206,0.882513,0.887212,0.84638,0.856002,0.826102,0.8473,0.862752
F1,0.65622,0.621861,0.603338,0.464833,0.484915,0.537826,0.814796,0.827561,0.82425,0.873426,0.868543,0.874429,0.876399,0.872537,0.866273,0.871064,0.871114,0.842549,0.861398,0.860489
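
With this many columns, the table is easier to compare when transposed and sorted by one metric. The sketch below rebuilds a four-column subset of Results (values copied from the table above) under the assumed name `results_demo`; on the real DataFrame the same call would be `Results.T.sort_values('F1', ascending=False)`.

```python
import pandas as pd

# Subset of the Results table above, rebuilt by hand for illustration
results_demo = pd.DataFrame(
    {
        "LogReg w/o balancing": [0.813333, 0.701641, 0.626613, 0.65622],
        "RF w/o balance": [0.93, 0.907651, 0.848123, 0.873426],
        "GB w/o balance": [0.926667, 0.871618, 0.882206, 0.876399],
        "XGB with SMOTE": [0.92, 0.877051, 0.8473, 0.861398],
    },
    index=["Accuracy", "Precision", "Recall", "F1"],
)

# One row per model, best macro F1 first
ranked = results_demo.T.sort_values("F1", ascending=False)
print(ranked)
```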


## TASK 3: MODEL TO EVALUATE AN EMPLOYEE PERFORMANCE
### Conclusions from modeling
- Random Forest turned out to be among the best models, with a macro F1 score of about 0.87 on the unbalanced data and similar results on the resampled, balanced data (Gradient Boosting was comparable). Hyperparameter tuning improved the F1 score in some runs, but not in the run recorded in Results (0.871). The results are not consistent from run to run; I could have made sure the same results come out each time, but rerunning was a way for me to check for better parameters.
- The results of the various models are presented in the DataFrame Results.
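
The run-to-run variation mentioned above can be removed by fixing `random_state`. The following is a minimal sketch, assuming synthetic, imbalanced 3-class data from `make_classification` in place of the project data; the helper `fit_and_score` is hypothetical and only mirrors the Random Forest settings used earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced 3-class data standing in for the employee data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=3,
    weights=[0.12, 0.77, 0.11], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

def fit_and_score(seed):
    """Fit a Random Forest with a fixed seed and return macro F1 on the test split."""
    clf = RandomForestClassifier(
        n_estimators=100, max_depth=15, min_samples_leaf=2, random_state=seed
    )
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average="macro")

score_a = fit_and_score(42)
score_b = fit_and_score(42)   # same seed, same data: identical score
print(score_a, score_b)
```

Different seeds can still be tried deliberately, for example by looping over a few values, which keeps the exploration while making each result repeatable.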

## TASK 4: Recommendations to improve the employee performance based on insights from analysis.
- A higher proportion of underperformers and average performers is seen among employees who have been in the same role, or under the same manager, for many years. This may be interpreted as stagnation-induced monotony. An occasional change of work or role might therefore keep the job from feeling monotonous and lead to better performance.
- As years of experience in the current job role increase, the proportion of better performers comes down. This might be countered by moving senior employees into managerial roles and bringing new, enthusiastic workers into those job roles.
- A midlife crisis seems to be playing a role in overall performance, as we see a relatively lower proportion of better performers among employees aged 35-45. We may address this by ensuring the right work-life balance for that segment of employees. It was also evident that employees with better work-life balance delivered better performance.
- Employees with higher environment satisfaction and job satisfaction had a higher proportion of better performers. In other words, a better work environment and job role can be factors that affect performance.
- Those who haven't worked at any company before had a higher proportion of average performers, while those who have worked at a company before had a higher proportion of better performers. Performance is again closer to average among those who have been hopping across companies. So it may be beneficial to hire people who have worked at a company before rather than those who haven't worked anywhere, and frequent job hoppers are not a great choice.
- Higher salary hikes and quicker promotions may improve an employee's performance, but only where relevant. Since experienced (>10 years) employees have a higher proportion of better performers, we may consider higher salary hikes and quicker promotions for experienced people to motivate their performance.
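
The recommendations above rest on comparing the proportion of each performance rating within groups of a factor. One way such proportions can be computed is a row-normalized crosstab. The sketch below uses a tiny made-up frame (`toy`, not the project data) with two column names taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the dataset
toy = pd.DataFrame({
    "EmpWorkLifeBalance": [1, 1, 2, 2, 3, 3, 3, 4],
    "PerformanceRating":  [2, 3, 3, 3, 3, 4, 4, 4],
})

# Share of each rating within every work-life-balance level (each row sums to 1)
proportions = pd.crosstab(
    toy["EmpWorkLifeBalance"], toy["PerformanceRating"], normalize="index"
)
print(proportions.round(2))
```

On the real data, the same call with `data["EmpWorkLifeBalance"]` and `data["PerformanceRating"]` would give the per-group shares that claims like these can be checked against.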