## Train Model

**Machine learning models are trained and their reports generated in this notebook.**  
   
   
   
**Section 1:  Developing ML model for existing employees**  
In this section, an ML model is developed to predict the performance rating of **existing employees**, or of any employee who is not in the given database.  
   
   
The following techniques are applied to improve accuracy.  
1. Features considered   :   
The list of top features comes from "Section 2" of data_exploratory_analysis.ipynb.   
The least important features are dropped to optimize the model for accuracy.   
2. Encoding             :   
Categorical columns are encoded with two techniques, 'Label Encoding' and 'One Hot Encoding', to find out which gives better accuracy.   
3. Data balancing       :   
Most performance ratings are 3, so the model learns from imbalanced data. The 'SMOTE' technique is used to balance the training data.  
4. Algorithms           :  
Random Forest, XGBoost, Support Vector Machine and Artificial Neural Network algorithms are tried and their parameters tuned.   
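
To make the difference between the two encodings in item 2 concrete, here is a minimal pure-Python sketch on a made-up `MaritalStatus` column. It imitates what scikit-learn's `LabelEncoder` (integer codes assigned in sorted order of the categories) and `pandas.get_dummies` (one indicator column per category) produce. The helper names `label_encode` and `one_hot_encode` and the sample values are illustrative assumptions, not part of the notebook.

```python
def label_encode(values):
    """Map each category to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_of = {cls: code for code, cls in enumerate(classes)}
    return [code_of[v] for v in values], classes


def one_hot_encode(values, prefix):
    """Return one 0/1 indicator column per category, keyed by column name."""
    classes = sorted(set(values))
    return {f"{prefix}_{cls}": [int(v == cls) for v in values] for cls in classes}


# Hypothetical sample values, for illustration only.
marital_status = ["Single", "Married", "Divorced", "Single"]

codes, classes = label_encode(marital_status)
indicators = one_hot_encode(marital_status, prefix="MaritalStatus")

print(classes)     # categories in sorted order
print(codes)       # one integer per row
print(indicators)  # one indicator column per category
```

Label encoding keeps a single column but imposes an arbitrary numeric order on the categories, while one hot encoding avoids that order at the cost of one extra column per category.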
     
      
**Report 1:**  
    All features considered, 
    Label Encoding done, 
    Data remains imbalanced, 
    Random Forest Considered   

Train Accuracy :  100.0 (accuracy on the training data)   
**Test  Accuracy** :  92.5 

-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.89      0.89      0.89        63
           3       0.94      0.96      0.95       264
           4       0.82      0.70      0.75        33
           
           
**Report 2:**   
    All features considered, 
    **One Hot Encoding** done, 
    Data remains imbalanced, 
    Random Forest Considered   

Train Accuracy :  100.0   
**Test  Accuracy :  91.67**        

-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.91      0.84      0.88        63
           3       0.92      0.97      0.95       264
           4       0.84      0.64      0.72        33
  
**Report 3:**   
    All features considered,   
    Label Encoding done,     
    **Data balanced using smote**,         
    Random Forest Considered      
    
Train Accuracy :  100.0     
**Test  Accuracy :  93.06**     

-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.90      0.89      0.90        63
           3       0.95      0.96      0.96       264
           4       0.79      0.79      0.79        33
           
           
**Note 1:** Some less important features were dropped, but the model did not improve any further.   
**Note 2:** XGBoost and Support Vector Classifier were also tried, but neither gave better accuracy than Random Forest.
   
     
       
      
   
    
**Section 2:  Developing ML model for new hire**

In this section, an ML model to predict the performance rating of a new hire is developed.   
During the selection process for a new hire, not all of the features provided in the data are available.    
Only the features that are available at the time of hiring can be used.  

In addition to the accuracy improvement techniques applied in Section 1, the following techniques, algorithms and ideas are considered in this section.
 
1. Judiciously classify the features into those that are available at hiring and those that are not.   
Some features can be classified easily by intuition; for others, data analysis is essential.    
The analysis behind this classification is in "Section 2" of data_processing.ipynb.   
The following features will be used:   
    'Age',  
    'Gender',  
    'EducationBackground',  
    'MaritalStatus',  
    'EmpDepartment',  
    'EmpJobRole',  
    'BusinessTravelFrequency',  
    'DistanceFromHome',  
    'EmpEducationLevel',  
    'EmpHourlyRate',  
    'EmpJobLevel',  
    'NumCompaniesWorked',  
    'EmpLastSalaryHikePercent',  
    'TotalWorkExperienceInYears',    
    
2. In addition to Random Forest and XGBoost, other algorithms such as Support Vector Machine/Classifier and Artificial Neural Network are considered.  
   
   
Each of these algorithms has been tried   
* with its default parameters,   
* then with balanced data, and   
* again with the best tuned parameters.      

Since the target data is imbalanced, accuracy alone does not show how good or bad a model is.  
The recall for each rating also matters.   
Based on accuracy and recall, I recommend the XGBoost algorithm, trained on balanced data with the following hyperparameters.   

'colsample_bylevel': 0.8,  
 'colsample_bytree': 0.8,  
 'learning_rate': 0.4,  
 'max_depth': 8,  
 'min_child_weight': 2,  
 'n_estimators': 100,  
 'subsample': 0.9  
 
 
**Report for the selected model :**   

 
Train Accuracy :  86.03   
**Test  Accuracy :  64.72**

 
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.21      0.22      0.21        63
           3       0.80      0.73      0.77       264
           4       0.50      0.76      0.60        33

    accuracy                           0.65       360



In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_excel('Employee_Data.xls', index_col='EmpNumber')

In [3]:
df.head(2)

EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
E1001000,32,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,10,3,4,...,4,10,2,2,10,7,0,8,No,3
E1001006,47,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,14,4,4,...,4,20,2,3,7,7,1,7,No,3


In [4]:
#Features in the data that hold categorical labels
cat_columns = ['Gender', 'EducationBackground', 'MaritalStatus','EmpDepartment', 'EmpJobRole', 'BusinessTravelFrequency', 
               'Attrition', 'OverTime']

In [5]:
#Features in the data that hold discrete numerical values
num_columns = ['Age', 'DistanceFromHome', 'EmpEducationLevel', 'EmpEnvironmentSatisfaction','EmpHourlyRate', 
               'EmpJobInvolvement', 'EmpJobLevel', 'EmpJobSatisfaction', 'NumCompaniesWorked', 'EmpLastSalaryHikePercent', 
               'EmpRelationshipSatisfaction','TotalWorkExperienceInYears', 'TrainingTimesLastYear', 'EmpWorkLifeBalance', 
               'ExperienceYearsAtThisCompany','ExperienceYearsInCurrentRole', 'YearsSinceLastPromotion',
               'YearsWithCurrManager', 'PerformanceRating']

In [6]:
#Utility function: print train/test accuracy, the confusion matrix and the classification report
from sklearn.metrics import accuracy_score, classification_report
def predict_n_print_report(model, X_train, X_test, y_train, y_test):
    y_predict = model.predict(X_test)
    y_train_predict = model.predict(X_train)
    
    # Convert to percent before rounding, so 72.62 is not printed as 72.61999999999999
    print('Train Accuracy : ', round(accuracy_score(y_train, y_train_predict)*100, 2))
    print('Test  Accuracy : ', round(accuracy_score(y_test,  y_predict)*100, 2))
    print('-----------------------------------------------------')
    # Confusion matrix: rows are actual ratings, columns are predicted ratings
    print(pd.crosstab(y_test, y_predict, rownames=['Actual '+y_test.name], colnames=['Predicted']))
    print('-----------------------------------------------------')
    print( classification_report(y_test,y_predict) )

## Section 1: Predict the performance of existing employees

ML model to predict the performance rating of existing employees

In [7]:
X = df.drop('PerformanceRating', axis=1)

In [8]:
y = df.PerformanceRating

In [9]:
#Label Encode categorical-label columns
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
for labelColumn in cat_columns:
    X[labelColumn] = enc.fit_transform(X[labelColumn])

In [10]:
X.head(2)

EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,...,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition
E1001000,32,1,2,2,5,13,2,10,3,4,...,12,4,10,2,2,10,7,0,8,0
E1001006,47,1,2,2,5,13,2,14,4,4,...,12,4,20,2,3,7,7,1,7,0


In [11]:
#One Hot Encoding
X = pd.get_dummies(X, columns=cat_columns)

In [12]:
#Splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=10, test_size=0.3)

In [13]:
print('Shape of X_train', X_train.shape)
print('Shape of X_test', X_test.shape)
print('Shape of y_train', y_train.shape)
print('Shape of y_test', y_test.shape)

Shape of X_train (840, 61)
Shape of X_test (360, 61)
Shape of y_train (840,)
Shape of y_test (360,)


In [14]:
from collections import Counter
print('Y train data imbalance : ', Counter(y_train))
print('Y test  data imbalance : ', Counter(y_test))

Y train data imbalance :  Counter({3: 610, 2: 131, 4: 99})
Y test  data imbalance :  Counter({3: 264, 2: 63, 4: 33})


In [15]:
from imblearn.over_sampling import SMOTE

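# sampling_strategy gives the target number of training rows per rating after oversampling
smote = SMOTE(sampling_strategy={2: 500, 3: 610, 4: 400})
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)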

X_train = pd.DataFrame(X_train_smote, columns=X_test.columns)

y_train = pd.Series(y_train_smote, name=y_train.name)



In [16]:
from sklearn.ensemble import RandomForestClassifier
modelrf = RandomForestClassifier(random_state=0) 

modelrf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [17]:
predict_n_print_report(modelrf, X_train, X_test, y_train, y_test)

Train Accuracy :  100.0
Test  Accuracy :  91.94
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         55    8   0
3                          4  253   7
4                          3    7  23
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.89      0.87      0.88        63
           3       0.94      0.96      0.95       264
           4       0.77      0.70      0.73        33

    accuracy                           0.92       360
   macro avg       0.87      0.84      0.85       360
weighted avg       0.92      0.92      0.92       360



## Section 2: ML model for new hire

**Note:** Only the features that are available when hiring a new employee are considered. Some features, such as 'Age', are obviously available at the time of hiring, but for others, such as 'EmpLastSalaryHikePercent' or 'OverTime', it is less obvious. The analysis of these less obvious features is in "Section 2" of the Jupyter notebook **IABAC_Project_Submission/src/Data_Processing/data_processing.ipynb**   

In [18]:
#From IABAC_Project_Submission/src/Data_Processing/data_processing.ipynb
new_hire_columns = ['Age',
 'Gender',
 'EducationBackground',
 'MaritalStatus',
 'EmpDepartment',
 'EmpJobRole',
 'BusinessTravelFrequency',
 'DistanceFromHome',
 'EmpEducationLevel',
 #'EmpEnvironmentSatisfaction',
 'EmpHourlyRate',
 #'EmpJobInvolvement',
 'EmpJobLevel',
 #'EmpJobSatisfaction',
 'NumCompaniesWorked',
 #'OverTime',
 'EmpLastSalaryHikePercent',
 #'EmpRelationshipSatisfaction',
 'TotalWorkExperienceInYears',
 #'TrainingTimesLastYear',
 #'EmpWorkLifeBalance',
 #'ExperienceYearsAtThisCompany',
 #'ExperienceYearsInCurrentRole',
 #'YearsSinceLastPromotion',
 #'YearsWithCurrManager',
 #'Attrition',
 #'PerformanceRating'
 ]

In [19]:
X = df[new_hire_columns].copy() # X holds the features available for a new hire; copy() avoids pandas' SettingWithCopyWarning

In [20]:
#Label Encode categorical-label columns
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
for labelColumn in cat_columns:
    if labelColumn in new_hire_columns:
        X[labelColumn] = enc.fit_transform(X[labelColumn])
  


In [21]:
y=df.PerformanceRating

In [22]:
#Splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=10, test_size=0.3)

### Algorithm 1: Random Forest algorithm

In [23]:
from sklearn.ensemble import RandomForestClassifier
modelrf = RandomForestClassifier(n_estimators=100, random_state=0) 
modelrf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [24]:
X_train.columns

Index(['Age', 'Gender', 'EducationBackground', 'MaritalStatus',
       'EmpDepartment', 'EmpJobRole', 'BusinessTravelFrequency',
       'DistanceFromHome', 'EmpEducationLevel', 'EmpHourlyRate', 'EmpJobLevel',
       'NumCompaniesWorked', 'EmpLastSalaryHikePercent',
       'TotalWorkExperienceInYears'],
      dtype='object')

In [25]:
predict_n_print_report(modelrf, X_train, X_test, y_train, y_test)

Train Accuracy :  100.0
Test  Accuracy :  77.22
-----------------------------------------------------
Predicted                 2    3   4
Actual PerformanceRating            
2                         4   54   5
3                         2  250  12
4                         1    8  24
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.57      0.06      0.11        63
           3       0.80      0.95      0.87       264
           4       0.59      0.73      0.65        33

    accuracy                           0.77       360
   macro avg       0.65      0.58      0.54       360
weighted avg       0.74      0.77      0.72       360



In [26]:
from collections import Counter
Counter(y_train)

Counter({2: 131, 3: 610, 4: 99})

**Note :**  
The data is very imbalanced. 
Accuracy is the number of correct predictions divided by the total number of observations; in the above case, 278/360 = 77.22%.  
Even if we predicted every rating as 3, we would get 264/360 = 73.33%,  
so an accuracy of 77.22% is only a small gain over that baseline. The recall for rating=2 is only 0.06, which is also a concern.  
Hence, let's train the model with balanced data.
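
The arithmetic in this note can be reproduced directly from the confusion matrix printed in the report above. The following is a small pure-Python sketch; the variable names are illustrative, and the counts are copied from that report.

```python
# Confusion matrix from the Random Forest report above:
# rows are actual ratings, columns are predicted ratings, both ordered 2, 3, 4.
ratings = [2, 3, 4]
confusion = [
    [4, 54, 5],
    [2, 250, 12],
    [1, 8, 24],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(ratings)))
accuracy = correct / total

# Accuracy of a trivial model that always predicts the most frequent rating.
support = {rating: sum(row) for rating, row in zip(ratings, confusion)}
baseline_accuracy = max(support.values()) / total

# Recall for rating 2: correctly predicted 2s divided by all actual 2s.
recall_rating_2 = confusion[0][0] / support[2]

print(f"accuracy            : {accuracy:.2%}")
print(f"majority baseline   : {baseline_accuracy:.2%}")
print(f"recall for rating 2 : {recall_rating_2:.2f}")
```

The model beats the always-predict-3 baseline by under four percentage points while finding only 4 of the 63 employees actually rated 2.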

### Predicting using Balanced data

In [27]:
from imblearn.over_sampling import SMOTE

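# sampling_strategy gives the target number of training rows per rating after oversampling
smote = SMOTE(sampling_strategy={2: 500, 3: 610, 4: 400})
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)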

X_train = pd.DataFrame(X_train_smote, columns=X_test.columns)

y_train = pd.Series(y_train_smote, name=y_train.name)
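
As background for the cell above: SMOTE builds each synthetic minority row by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration of that idea, not imbalanced-learn's implementation; the function name `smote_like_oversample` and the toy 2-D points are assumptions made for the example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority-class points, for illustration only.
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]
new_points = smote_like_oversample(minority_points, n_new=5)

for point in new_points:
    print(point)
```

In the cell above, the dictionary asks for 500 rows of rating 2 and 400 rows of rating 4, so 369 and 301 synthetic rows are generated from the original 131 and 99. Only the training split is resampled; the test split keeps its original class distribution.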



In [28]:
modelrf_smote = RandomForestClassifier(n_estimators=50, random_state=0) 
modelrf_smote.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [29]:
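#Evaluate the model trained on the SMOTE-balanced data (the output recorded below was produced with modelrf, the model from In [23])
predict_n_print_report(modelrf_smote, X_train, X_test, y_train, y_test)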

Train Accuracy :  78.61
Test  Accuracy :  77.22
-----------------------------------------------------
Predicted                 2    3   4
Actual PerformanceRating            
2                         4   54   5
3                         2  250  12
4                         1    8  24
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.57      0.06      0.11        63
           3       0.80      0.95      0.87       264
           4       0.59      0.73      0.65        33

    accuracy                           0.77       360
   macro avg       0.65      0.58      0.54       360
weighted avg       0.74      0.77      0.72       360



### Using GridSearch to tune Random Forest parameters

In [30]:
from sklearn.model_selection import GridSearchCV
params=[{
    'n_estimators':[20, 50, 100],
    'min_samples_split':[2,3,4,5],
    'criterion':['gini','entropy'],
    'min_samples_leaf':[1,2,3]
}]

modelrf_gs=GridSearchCV(
    estimator=RandomForestClassifier(random_state=0), 
    param_grid=params, 
    #scoring='accuracy',
    cv=8)
modelrf_gs.fit(X_train,y_train)

GridSearchCV(cv=8, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=0,
                                   

In [31]:
modelrf_gs.best_params_

{'criterion': 'entropy',
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 50}
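
To give a sense of how much work this search does, the grid can be enumerated by hand. The sketch below uses only the standard library and copies the parameter values and `cv=8` from the cell above; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the GridSearchCV cell above.
param_grid = {
    "n_estimators": [20, 50, 100],
    "min_samples_split": [2, 3, 4, 5],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 2, 3],
}
cv_folds = 8

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds

print("candidates:", n_candidates)
print("cross-validation fits:", n_cv_fits)
print("first candidate:", candidates[0])
```

Because the `scoring` argument is commented out in the cell above, the search ranks candidates by the classifier's default score, which is accuracy. If recall on the minority ratings is the priority, passing a recall-based scorer such as `scoring='recall_macro'` may be worth trying.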

In [32]:
predict_n_print_report(modelrf_gs, X_train, X_test, y_train, y_test)

Train Accuracy :  100.0
Test  Accuracy :  69.72
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         16   41   6
3                         40  209  15
4                          2    5  26
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.28      0.25      0.26        63
           3       0.82      0.79      0.81       264
           4       0.55      0.79      0.65        33

    accuracy                           0.70       360
   macro avg       0.55      0.61      0.57       360
weighted avg       0.70      0.70      0.70       360



**Note :**  
Here, the training data is relatively balanced.   
GridSearchCV was used to choose the algorithm's parameters.  
Although the accuracy is lower than that of the previous models, the recall for performance rating=2 is much better.  
This model is therefore recommended for screening new hires.  
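
The trade-off described in this note can be summarised with macro recall, the unweighted mean of the per-rating recalls, which treats the three ratings equally regardless of how many employees fall in each. A minimal sketch, using the two Random Forest confusion matrices printed above (the helper names are illustrative):

```python
def per_class_recall(confusion):
    """Recall per class: diagonal count divided by the row (actual) total."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def macro_recall(confusion):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(confusion)
    return sum(recalls) / len(recalls)


# Confusion matrices copied from the reports above (rows/columns: ratings 2, 3, 4).
default_rf = [[4, 54, 5], [2, 250, 12], [1, 8, 24]]    # trained on imbalanced data
tuned_rf = [[16, 41, 6], [40, 209, 15], [2, 5, 26]]    # SMOTE + grid search

for name, matrix in [("default RF", default_rf), ("tuned RF", tuned_rf)]:
    recalls = ", ".join(f"{r:.2f}" for r in per_class_recall(matrix))
    print(f"{name:10s} recall per rating: {recalls} | macro recall: {macro_recall(matrix):.2f}")
```

The tuned model gives up recall on rating 3 in exchange for better recall on ratings 2 and 4, which is why its macro recall is higher even though its accuracy is lower.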

### Algorithm 2: XGBoost

In [33]:
X = df[new_hire_columns].copy() # X holds the features available for a new hire; copy() avoids pandas' SettingWithCopyWarning

In [34]:
#Label Encode categorical-label columns
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
for labelColumn in cat_columns:
    if labelColumn in new_hire_columns:
        X[labelColumn] = enc.fit_transform(X[labelColumn])
  


In [35]:
y=df.PerformanceRating

In [36]:
#Splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=10, test_size=0.3)

In [37]:
from xgboost import XGBClassifier
modelxg = XGBClassifier(random_state=0)
modelxg.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [38]:
predict_n_print_report(modelxg, X_train, X_test, y_train, y_test)

Train Accuracy :  85.0
Test  Accuracy :  77.78
-----------------------------------------------------
Predicted                 2    3   4
Actual PerformanceRating            
2                         4   51   8
3                         1  252  11
4                         1    8  24
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.67      0.06      0.12        63
           3       0.81      0.95      0.88       264
           4       0.56      0.73      0.63        33

    accuracy                           0.78       360
   macro avg       0.68      0.58      0.54       360
weighted avg       0.76      0.78      0.72       360



### Using balanced data

In [39]:
from imblearn.over_sampling import SMOTE

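# sampling_strategy gives the target number of training rows per rating after oversampling
smote = SMOTE(sampling_strategy={2: 500, 3: 610, 4: 400})
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)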

X_train = pd.DataFrame(X_train_smote, columns=X_test.columns)

y_train = pd.Series(y_train_smote, name=y_train.name)



In [40]:
modelxg_smote = XGBClassifier(random_state=0)
modelxg_smote.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [41]:
predict_n_print_report(modelxg_smote, X_train, X_test, y_train, y_test)

Train Accuracy :  86.49
Test  Accuracy :  63.33
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         14   40   9
3                         59  188  17
4                          1    6  26
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.19      0.22      0.20        63
           3       0.80      0.71      0.76       264
           4       0.50      0.79      0.61        33

    accuracy                           0.63       360
   macro avg       0.50      0.57      0.52       360
weighted avg       0.67      0.63      0.65       360



### Tuning XGBoost parameters

In [42]:
from sklearn.model_selection import GridSearchCV

params=[{
    'learning_rate':[0.05, 0.1, 0.4], 
    'max_depth':[2,3,6], 
    'min_child_weight':[2,5,10], 
    'subsample':[0.8,0.9,1],
    #'colsample_bytree':0.8,
    #'colsample_bylevel':1,
    #'n_estimators':100,
    #'gamma':0,
    #'reg_alpha':0, 
    #'reg_lambda':1,
    }]

modelxg_gs=GridSearchCV(
    estimator=XGBClassifier(random_state=0), 
    param_grid=params, 
    cv=8)
modelxg_gs.fit(X_train,y_train)

GridSearchCV(cv=8, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid=[{'learning_rate': [0.05, 0.1, 0.4],
                          'max_depth': [2, 3, 6],
                          'min_child_weight': [2, 5, 10],
        

In [43]:
modelxg_gs.best_params_

{'learning_rate': 0.4, 'max_depth': 6, 'min_child_weight': 2, 'subsample': 0.8}

In [44]:
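#Evaluate the grid-searched model (the output recorded below was produced with modelxg_smote, the untuned model from In [40])
predict_n_print_report(modelxg_gs, X_train, X_test, y_train, y_test)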

Train Accuracy :  86.49
Test  Accuracy :  63.33
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         14   40   9
3                         59  188  17
4                          1    6  26
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.19      0.22      0.20        63
           3       0.80      0.71      0.76       264
           4       0.50      0.79      0.61        33

    accuracy                           0.63       360
   macro avg       0.50      0.57      0.52       360
weighted avg       0.67      0.63      0.65       360



### Tuning more XGBoost parameters

In [46]:
from sklearn.model_selection import GridSearchCV

params=[{
    'learning_rate':[0.4], 
    'max_depth':[6,8], 
    'min_child_weight':[2], 
    'subsample':[0.9],
    'colsample_bytree':[0.8,0.9,1],
    'colsample_bylevel':[0.8,0.9,1],
    'n_estimators':[50,100,150],
    #'gamma':0,
    #'reg_alpha':[0,0.1], 
    #'reg_lambda':[1, 1.1],
    }]

modelxg_gs=GridSearchCV(
    estimator=XGBClassifier(random_state=0), 
    param_grid=params, 
    cv=8)
modelxg_gs.fit(X_train,y_train)

GridSearchCV(cv=8, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_po...t=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid=[{'colsample_bylevel': [0.8, 0.9, 1],
                          'colsample_bytree': [0.8, 0.9, 1],
                          'learning_rate': [0.4], 'max_dep

In [47]:
modelxg_gs.best_params_

{'colsample_bylevel': 0.8,
 'colsample_bytree': 0.9,
 'learning_rate': 0.4,
 'max_depth': 8,
 'min_child_weight': 2,
 'n_estimators': 100,
 'subsample': 0.9}

In [48]:
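#Evaluate the grid-searched model (the output recorded below was produced with modelxg_smote, the untuned model from In [40])
predict_n_print_report(modelxg_gs, X_train, X_test, y_train, y_test)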

Train Accuracy :  86.49
Test  Accuracy :  63.33
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         14   40   9
3                         59  188  17
4                          1    6  26
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.19      0.22      0.20        63
           3       0.80      0.71      0.76       264
           4       0.50      0.79      0.61        33

    accuracy                           0.63       360
   macro avg       0.50      0.57      0.52       360
weighted avg       0.67      0.63      0.65       360



### Algorithm 3: Support Vector Machine/Classifier

In [49]:
X = df[new_hire_columns].copy() # X holds the features available for a new hire; copy() avoids pandas' SettingWithCopyWarning

In [50]:
#Label Encode categorical-label columns
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
for labelColumn in cat_columns:
    if labelColumn in new_hire_columns:
        X[labelColumn] = enc.fit_transform(X[labelColumn])
  


In [51]:
y=df.PerformanceRating

In [52]:
#Splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=10, test_size=0.3)

In [53]:
from sklearn.svm import SVC
model_svm = SVC(
    #C=200, 
    #gamma=0.08, 
    random_state=10)  # C: 0.1 to 1000 (controls the margin); gamma: 0.01 to 10 (controls how smooth the decision boundary is)
model_svm.fit(X_train,y_train) # training model

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=10, shrinking=True, tol=0.001,
    verbose=False)

In [54]:
predict_n_print_report(model_svm, X_train, X_test, y_train, y_test)

Train Accuracy :  72.61999999999999
Test  Accuracy :  73.33
-----------------------------------------------------
Predicted                   3
Actual PerformanceRating     
2                          63
3                         264
4                          33
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.00      0.00      0.00        63
           3       0.73      1.00      0.85       264
           4       0.00      0.00      0.00        33

    accuracy                           0.73       360
   macro avg       0.24      0.33      0.28       360
weighted avg       0.54      0.73      0.62       360





**Note :**  
With default parameters, the SVM gives a very poor prediction: it predicts a performance rating of 3 for every new hire.    
Balancing the data should give a better prediction.  

### Using balanced data

In [55]:
from imblearn.over_sampling import SMOTE

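smote = SMOTE(sampling_strategy={2: 500, 3: 610, 4: 400})  # target row count per class after oversampling
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)  # fit_sample was renamed to fit_resample in newer imbalanced-learn releases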

X_train = pd.DataFrame(X_train_smote, columns=X_test.columns)

y_train = pd.Series(y_train_smote, name=y_train.name)
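To make the oversampling step less of a black box, here is a simplified sketch of the idea behind SMOTE: each synthetic row lies on the line segment between a minority-class row and one of its nearest minority-class neighbours. This is not imbalanced-learn's implementation; `smote_like_samples` and the toy data are hypothetical and exist only for illustration.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=10):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other rows

    # Pairwise Euclidean distances; a row is never its own neighbour.
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # randomly chosen minority rows
    pick = rng.integers(0, k, size=n_new)   # one of the k neighbours of each row
    other = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position along the connecting segment
    return X_minority[base] + gap * (X_minority[other] - X_minority[base])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_min, n_new=6, k=2)
print(X_new.shape)
```

Because the new rows are interpolated, label-encoded categorical columns can receive fractional values that correspond to no real category, which is worth keeping in mind when reading the results that follow.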



In [56]:
model_svm_smote = SVC(
    #C=100, 
    #gamma=0.08, 
    random_state=10)  # C: try 0.1 to 1000 (controls the margin width); gamma: try 0.01 to 10 (controls how smooth the decision boundary is)
model_svm_smote.fit(X_train,y_train) # training model

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=10, shrinking=True, tol=0.001,
    verbose=False)

In [57]:
predict_n_print_report(model_svm_smote, X_train, X_test, y_train, y_test)

Train Accuracy :  64.3
Test  Accuracy :  62.22
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         11   39  13
3                         49  185  30
4                          2    3  28
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.18      0.17      0.18        63
           3       0.81      0.70      0.75       264
           4       0.39      0.85      0.54        33

    accuracy                           0.62       360
   macro avg       0.46      0.57      0.49       360
weighted avg       0.66      0.62      0.63       360



### Tuning SVM parameters

In [58]:
from sklearn.model_selection import GridSearchCV

params=[{
    'C':[0.1,1,10,100,500], 
    'gamma':[0.05, 0.5, 5],
}]

modelsvm_gs=GridSearchCV(
    estimator=SVC(random_state=10), 
    param_grid=params, 
    cv=8)
modelsvm_gs.fit(X_train,y_train)

GridSearchCV(cv=8, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=10, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [0.1, 1, 10, 100, 500],
                          'gamma': [0.05, 0.5, 5]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [59]:
modelsvm_gs.best_params_

{'C': 10, 'gamma': 0.05}

In [60]:
predict_n_print_report(modelsvm_gs, X_train, X_test, y_train, y_test)

Train Accuracy :  100.0
Test  Accuracy :  71.94
-----------------------------------------------------
Predicted                 2    3  4
Actual PerformanceRating           
2                         1   62  0
3                         5  257  2
4                         0   32  1
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.17      0.02      0.03        63
           3       0.73      0.97      0.84       264
           4       0.33      0.03      0.06        33

    accuracy                           0.72       360
   macro avg       0.41      0.34      0.31       360
weighted avg       0.60      0.72      0.62       360
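The tuned model reaches 100% train accuracy but mostly predicts rating 3 on the test set. Two things that may be worth trying are scaling the features inside the cross-validation loop and selecting parameters by macro-F1 instead of the default accuracy. The sketch below shows the mechanics on synthetic data; the dataset, the parameter grid, and the names `pipe` and `search` are assumptions for illustration, not the notebook's own setup or results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced 3-class stand-in for the employee data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3,
    weights=[0.15, 0.7, 0.15], random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=10)

# The scaler is refit on the training part of every CV fold,
# so validation folds never influence the scaling statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(random_state=10)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.05, 0.5]}

# scoring="f1_macro" weights all classes equally when ranking candidates.
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 3))  # macro-F1 on the held-out split
```

With `scoring=None`, as in the cell above, `GridSearchCV` ranks candidates by the estimator's default score, which for `SVC` is accuracy.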



## Algorithm 4: Artificial Neural Network

In [61]:
X = df[new_hire_columns].copy() # features available for a new hire; copy() avoids SettingWithCopyWarning on the assignments below

In [62]:
#Label Encode categorical-label columns
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
for labelColumn in cat_columns:
    if labelColumn in new_hire_columns:
        X[labelColumn] = enc.fit_transform(X[labelColumn])


In [63]:
y=df.PerformanceRating

In [64]:
from sklearn.preprocessing import scale
X_ANN = scale(X)

In [65]:
X_train, X_test, y_train, y_test = train_test_split(
    X_ANN,
    y,
    random_state=10, test_size=0.3)
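Scaling the full matrix before splitting lets the test rows contribute to the mean and standard deviation used for scaling. An alternative order, sketched here on random stand-in data (the arrays and the names `scaler`, `X_tr_scaled`, `X_te_scaled` are hypothetical), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
X_demo = rng.normal(loc=50.0, scale=12.0, size=(200, 3))  # stand-in features
y_demo = rng.integers(2, 5, size=200)                      # stand-in ratings 2..4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=10, test_size=0.3)

# Learn mean and standard deviation from the training rows only ...
scaler = StandardScaler().fit(X_tr)

# ... and apply the same transformation to both splits.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # per-column mean of the training split
print(X_te_scaled.mean(axis=0).round(3))  # per-column mean of the test split
```

The fitted `scaler` can also be reused later to transform data for a new hire in the same way as the training data.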

In [66]:
from sklearn.neural_network import MLPClassifier # Multi-layer perceptron classifier
model_ann = MLPClassifier(hidden_layer_sizes=(90,90,90), random_state=10)
model_ann.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(90, 90, 90), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=10, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [67]:
predict_n_print_report(model_ann, X_train, X_test, y_train, y_test)

Train Accuracy :  100.0
Test  Accuracy :  69.44
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         16   44   3
3                         35  219  10
4                          3   15  15
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.30      0.25      0.27        63
           3       0.79      0.83      0.81       264
           4       0.54      0.45      0.49        33

    accuracy                           0.69       360
   macro avg       0.54      0.51      0.52       360
weighted avg       0.68      0.69      0.69       360



### ANN with balanced data

In [68]:
from imblearn.over_sampling import SMOTE

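smote = SMOTE(sampling_strategy={2: 500, 3: 610, 4: 400})  # target row count per class after oversampling
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)  # fit_sample was renamed to fit_resample in newer imbalanced-learn releases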

X_train = pd.DataFrame(X_train_smote, columns=X.columns)

y_train = pd.Series(y_train_smote, name=y.name)



In [69]:
from sklearn.neural_network import MLPClassifier # Multi-layer perceptron classifier
model_ann = MLPClassifier(hidden_layer_sizes=(90,90,90), random_state=10)
model_ann.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(90, 90, 90), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=10, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [70]:
predict_n_print_report(model_ann, X_train, X_test, y_train, y_test)

Train Accuracy :  100.0
Test  Accuracy :  67.5
-----------------------------------------------------
Predicted                  2    3   4
Actual PerformanceRating             
2                         16   41   6
3                         40  210  14
4                          5   11  17
-----------------------------------------------------
              precision    recall  f1-score   support

           2       0.26      0.25      0.26        63
           3       0.80      0.80      0.80       264
           4       0.46      0.52      0.49        33

    accuracy                           0.68       360
   macro avg       0.51      0.52      0.51       360
weighted avg       0.68      0.68      0.68       360
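To compare the two ANN runs on equal terms, accuracy and macro-F1 can be recomputed directly from the confusion matrices printed above. The helper functions below are an illustrative sketch and not part of the notebook; the matrices are copied from the two reports (rows are actual ratings 2, 3, 4 and columns are predicted ratings).

```python
def macro_f1(confusion):
    """Macro-averaged F1 from a square confusion matrix (rows = actual, columns = predicted)."""
    n = len(confusion)
    scores = []
    for i in range(n):
        tp = confusion[i][i]
        actual = sum(confusion[i])                          # row total
        predicted = sum(confusion[r][i] for r in range(n))  # column total
        # F1 = 2*TP / (actual + predicted); taken as 0 when the class never occurs.
        scores.append(2 * tp / (actual + predicted) if actual + predicted else 0.0)
    return sum(scores) / n

def accuracy(confusion):
    """Share of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

ann_imbalanced = [[16, 44, 3], [35, 219, 10], [3, 15, 15]]
ann_smote = [[16, 41, 6], [40, 210, 14], [5, 11, 17]]

for name, cm in [("imbalanced", ann_imbalanced), ("SMOTE", ann_smote)]:
    print(f"{name:10s} accuracy={accuracy(cm):.4f} macro-F1={macro_f1(cm):.4f}")
```

Both metrics come out slightly lower after SMOTE, while recall for rating 4 rises from 0.45 to 0.52, so for this network the oversampling mainly trades majority-class accuracy for minority-class recall.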

