<h2><center>HR Analytics ML Project</center></h2>
<h3><center>Feature Engineering & Model Training</center></h3>
<h4><center>Author: Akshay Pandurang Paunikar</center></h4>
<h5>Problem Statement:</h5>
Your client is a large MNC with 9 broad verticals across the organisation. One of the problems the client faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The process they currently follow is:

 - They first identify a set of employees based on recommendations and past performance
 - Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills each vertical requires
 - At the end of the program, an employee is promoted based on various factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered)

Under this process, the final promotions are announced only after the evaluation, which delays employees' transition to their new roles. The company therefore needs help identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

They have provided multiple attributes covering employees' past and current performance, along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.

In [32]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

In [33]:
# import the dataset
# raw strings avoid the invalid "\P" escape sequence in the Windows paths
train_data = pd.read_csv(r"E:\iNeuron\Projects\HR_Analytics\notebook\datasets\train_data.csv")
test_data = pd.read_csv(r"E:\iNeuron\Projects\HR_Analytics\notebook\datasets\test_data.csv")

In [34]:
# Split the train data into independent features and target variable
X = train_data.drop(['is_promoted'], axis=1)
y = train_data['is_promoted']

In [35]:
X.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met,awards_won,avg_training_score
0,Sales & Marketing,region_7,Master's & above,Female,sourcing,1,35,5.0,8,Yes,No,49
1,Operations,region_22,Bachelor's,Male,other,1,30,5.0,4,No,No,60
2,Sales & Marketing,region_19,Bachelor's,Male,sourcing,1,34,3.0,7,No,No,50
3,Sales & Marketing,region_23,Bachelor's,Male,other,2,39,1.0,10,No,No,50
4,Technology,region_26,Bachelor's,Male,other,1,45,3.0,2,No,No,73


In [36]:
y

0        No
1        No
2        No
3        No
4        No
         ..
54642    No
54643    No
54644    No
54645    No
54646    No
Name: is_promoted, Length: 54647, dtype: object
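Before modelling, it can help to check how the two classes are distributed, since the support counts printed later in this notebook suggest that promotions are rare. The sketch below uses a small made-up Series (`y_demo` is a hypothetical stand-in for `y`), not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the target column; the real `y` holds string labels.
y_demo = pd.Series(["No"] * 9 + ["Yes"] * 1, name="is_promoted")

# Share of each class, as fractions that sum to 1.
class_share = y_demo.value_counts(normalize=True)
print(class_share)
```
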

In [37]:
# categorical features and numerical features
cat_features = X.select_dtypes(include="object").columns
num_features = X.select_dtypes(exclude="object").columns

print("categorical features:\n", cat_features)
print("numerical features:\n", num_features)

categorical features:
 Index(['department', 'region', 'education', 'gender', 'recruitment_channel',
       'KPIs_met', 'awards_won'],
      dtype='object')
numerical features:
 Index(['no_of_trainings', 'age', 'previous_year_rating', 'length_of_service',
       'avg_training_score'],
      dtype='object')


In [38]:
# create numerical and categorical pipeline
num_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ("one_hot", OneHotEncoder()),
        ("scaler", StandardScaler(with_mean=False))
    ]
)

In [39]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipeline", num_pipeline, num_features),
        ("cat_pipeline", cat_pipeline, cat_features)
    ]
)

In [40]:
# applying preprocessor to features
X = preprocessor.fit_transform(X)
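Fitting the preprocessor on all of `X` before the split lets the scaler and encoder see rows that later land in the hold-out set. One common alternative, sketched below on a tiny made-up frame, is to split first and fit on the training part only. The column names mirror two of the dataset's columns, but the values and the names `demo`, `X_tr`, `X_te` and `demo_preprocessor` are assumptions for illustration. `handle_unknown="ignore"` is an added choice, not part of the original notebook, so that categories unseen during fitting are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for the HR data.
demo = pd.DataFrame({
    "age": [35, 30, 34, 39, 45, 28, 41, 33],
    "gender": ["Female", "Male", "Male", "Male", "Male", "Female", "Female", "Male"],
})

# Split first, so the hold-out rows never influence the fitted statistics.
X_tr, X_te = train_test_split(demo, test_size=0.25, random_state=0)

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ]
)

X_tr_enc = demo_preprocessor.fit_transform(X_tr)  # learn means, scales, categories
X_te_enc = demo_preprocessor.transform(X_te)      # reuse them unchanged
print(X_tr_enc.shape, X_te_enc.shape)
```
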

In [41]:
# Label encode target variable
le = LabelEncoder()
y = le.fit_transform(y)

In [42]:
y

array([0, 0, 0, ..., 0, 0, 0])
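`LabelEncoder` assigns integer codes in sorted order of the class labels, so, assuming the target holds only `"No"` and `"Yes"`, `"No"` maps to 0 and `"Yes"` to 1. A minimal check on made-up labels (`le_demo` is a hypothetical name):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["No", "Yes", "No", "No"])

# classes_ lists the labels in the order of their integer codes.
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into the original labels.
print(le_demo.inverse_transform([0, 1]))
```
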

In [43]:
# divide the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=555)

print("X_train:", X_train.shape)
print("y_train:", y_train.shape)
print("X_test:", X_test.shape)
print("y_test:", y_test.shape)

X_train: (38252, 60)
y_train: (38252,)
X_test: (16395, 60)
y_test: (16395,)
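With an imbalanced target, passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below uses synthetic labels; `X_syn` and `y_syn` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positives.
X_syn = np.arange(100).reshape(-1, 1)
y_syn = np.array([1] * 10 + [0] * 90)

# stratify=y_syn preserves the 10% positive share in both splits.
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=555, stratify=y_syn
)

print(y_a.mean(), y_b.mean())
```
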


#### Model Training:

In [44]:
# import required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [45]:
# Create an evaluation function that returns all metrics after model training
def evaluate_model(true, predicted):
    accuracy = accuracy_score(true, predicted)
    confusionmatrix = confusion_matrix(true, predicted)    
    classificationreport = classification_report(true, predicted)
    return accuracy, confusionmatrix, classificationreport
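Accuracy alone can look high on imbalanced data, so it may be worth also returning precision, recall and F1 for the positive class. The helper below is a hypothetical extension (`evaluate_positive_class` is not part of the original notebook), shown on made-up labels.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate_positive_class(true, predicted):
    """Return precision, recall and F1 for the class encoded as 1."""
    precision = precision_score(true, predicted, zero_division=0)
    recall = recall_score(true, predicted, zero_division=0)
    f1 = f1_score(true, predicted, zero_division=0)
    return precision, recall, f1

# Made-up labels: 1 true positive, 1 false positive, 3 false negatives.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0, 0, 0]

precision, recall, f1 = evaluate_positive_class(y_true_demo, y_pred_demo)
print(precision, recall, f1)
```
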

In [46]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree Classifier': DecisionTreeClassifier(),
    'Random Forest Classifier': RandomForestClassifier(),
    'Gradient Boosting Classifier': GradientBoostingClassifier(),
    'AdaBoost Classifier': AdaBoostClassifier(),
    'Support Vector Classifier': SVC(),
    # 'Gaussian Naive Bayes': GaussianNB(),
    'K-Neighbors Classifier': KNeighborsClassifier(),
    'CatBoost Classifier': CatBoostClassifier(verbose=False),
    'XGBoost Classifier': XGBClassifier()
}

model_list = []
accuracy_list = []

for name, model in models.items():
    model.fit(X_train, y_train) # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Evaluate Train and Test dataset
    train_accuracy, train_confusionmatrix, train_classificationreport = evaluate_model(y_train, y_train_pred)

    test_accuracy, test_confusionmatrix, test_classificationreport = evaluate_model(y_test, y_test_pred)
        
    print(name)
    model_list.append(name)
    
    print('Model performance for Training set')
    print("**Accuracy Score:", train_accuracy)
    print("**Confusion Matrix: \n", train_confusionmatrix)
    print("**Classification Report: \n", train_classificationreport)

    print('-'*35)
    
    print('Model performance for Test set')
    print("**Accuracy Score:", test_accuracy)
    print("**Confusion Matrix: \n", test_confusionmatrix)
    print("**Classification Report: \n", test_classificationreport)
    
    accuracy_list.append(test_accuracy)
    
    print('='*35)
    print('\n')

Logistic Regression
Model performance for Training set
**Accuracy Score: 0.9323172644567604
**Confusion Matrix: 
 [[34780   214]
 [ 2375   883]]
**Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.99      0.96     34994
           1       0.80      0.27      0.41      3258

    accuracy                           0.93     38252
   macro avg       0.87      0.63      0.68     38252
weighted avg       0.92      0.93      0.92     38252

-----------------------------------
Model performance for Test set
**Accuracy Score: 0.9320524550167734
**Confusion Matrix: 
 [[14908    80]
 [ 1034   373]]
**Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.99      0.96     14988
           1       0.82      0.27      0.40      1407

    accuracy                           0.93     16395
   macro avg       0.88      0.63      0.68     16395
weighted avg       0.93      0.93      0.92  

In [47]:
# Results
pd.DataFrame(
    list(zip(model_list, accuracy_list)), columns=['Model Name', 'Accuracy Score']
).sort_values(by=["Accuracy Score"], ascending=False)

Unnamed: 0,Model Name,Accuracy Score
7,CatBoost Classifier,0.940226
8,XGBoost Classifier,0.939555
3,Gradient Boosting Classifier,0.938335
0,Logistic Regression,0.932052
2,Random Forest Classifier,0.93126
5,Support Vector Classifier,0.925953
4,AdaBoost Classifier,0.923452
6,K-Neighbors Classifier,0.910278
1,Decision Tree Classifier,0.894236
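For context, the test-set support counts in the classification report above (14,988 negatives and 1,407 positives) imply that always predicting "not promoted" would already reach about 91.4% accuracy. The arithmetic below uses only numbers printed in this notebook.

```python
# Test-set class counts taken from the classification report above.
n_negative = 14988
n_positive = 1407

# Accuracy of a trivial model that always predicts class 0.
baseline_accuracy = n_negative / (n_negative + n_positive)
print(round(baseline_accuracy, 4))

# Positive-class recall for Logistic Regression, from its test confusion matrix.
true_positives, false_negatives = 373, 1034
lr_recall = true_positives / (true_positives + false_negatives)
print(round(lr_recall, 4))
```

By this measure, the K-Neighbors (0.910) and Decision Tree (0.894) accuracy scores in the table fall below the trivial baseline, which is one reason to compare models on recall or F1 for the promoted class as well.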


In [48]:
# CatBoost Classifier
model_catboost = CatBoostClassifier()

In [49]:
# fit training data
model_catboost.fit(X_train, y_train)
model_catboost.score(X_train, y_train)

Learning rate set to 0.048835
0:	learn: 0.6338370	total: 18.7ms	remaining: 18.7s
1:	learn: 0.5671889	total: 31.9ms	remaining: 15.9s
2:	learn: 0.5233300	total: 43ms	remaining: 14.3s
3:	learn: 0.4806173	total: 54.4ms	remaining: 13.5s
4:	learn: 0.4521115	total: 65.6ms	remaining: 13s
5:	learn: 0.4250118	total: 73.9ms	remaining: 12.2s
6:	learn: 0.4005204	total: 86.2ms	remaining: 12.2s
7:	learn: 0.3826926	total: 95.5ms	remaining: 11.8s
8:	learn: 0.3634308	total: 104ms	remaining: 11.5s
9:	learn: 0.3471403	total: 115ms	remaining: 11.4s
10:	learn: 0.3326624	total: 125ms	remaining: 11.2s
11:	learn: 0.3204431	total: 135ms	remaining: 11.1s
12:	learn: 0.3089806	total: 148ms	remaining: 11.2s
13:	learn: 0.2998307	total: 158ms	remaining: 11.1s
14:	learn: 0.2893552	total: 169ms	remaining: 11.1s
15:	learn: 0.2827460	total: 178ms	remaining: 11s
16:	learn: 0.2745264	total: 188ms	remaining: 10.9s
17:	learn: 0.2641837	total: 199ms	remaining: 10.8s
18:	learn: 0.2583840	total: 209ms	remaining: 10.8s
19:	learn

0.9491268430408868

In [50]:
# make predictions on test data
predictions = model_catboost.predict(X_test)

In [51]:
# performance metrics
# multiply first, then round, to avoid floating-point noise in the printed percentage
print("Accuracy Score:", round(accuracy_score(y_test, predictions) * 100, 2))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
print("Classification Report:\n", classification_report(y_test, predictions))

Accuracy Score: 94.02
Confusion Matrix:
 [[14928    60]
 [  920   487]]
Classification Report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97     14988
           1       0.89      0.35      0.50      1407

    accuracy                           0.94     16395
   macro avg       0.92      0.67      0.73     16395
weighted avg       0.94      0.94      0.93     16395
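The report shows high precision (0.89) but low recall (0.35) for the promoted class. If missing eligible employees is costly, one option is to lower the decision threshold on the predicted probabilities instead of using the default 0.5. The sketch below uses `LogisticRegression` on synthetic data as a stand-in, so that it runs without the HR dataset; classifiers that expose `predict_proba`, including CatBoost's, can be handled in the same way. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives) as a stand-in.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
proba_positive = clf.predict_proba(X_hold)[:, 1]  # probability of class 1

# Compare recall at the default threshold and at a lower one.
recall_default = recall_score(y_hold, (proba_positive >= 0.5).astype(int))
recall_lowered = recall_score(y_hold, (proba_positive >= 0.3).astype(int))
print(recall_default, recall_lowered)
```

Lowering the threshold typically trades precision for recall, so the value is best chosen on a validation split and not on the final test data.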



In [52]:
# Compare actual and predicted values side by side
pred_df = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': predictions})
pred_df

Unnamed: 0,Actual Value,Predicted Value
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
...,...,...
16390,0,0
16391,1,0
16392,0,0
16393,0,0


#### Test Data for Submission:

In [53]:
# applying preprocessor to features
test_data = preprocessor.transform(test_data)

In [54]:
# make predictions on test data
predicted = model_catboost.predict(test_data)

In [55]:
predicted = pd.DataFrame(predicted,columns=["predictions"])
predicted

Unnamed: 0,predictions
0,0
1,0
2,0
3,0
4,0
...,...
23447,0
23448,0
23449,0
23450,0


In [56]:
# change into the datasets directory so the csv file is saved there
%cd "datasets/"

e:\iNeuron\Projects\HR_Analytics\notebook\datasets


In [57]:
# save predictions to a csv file
predicted.to_csv("final_submission.csv")
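Submission files usually need an identifier next to each prediction and no extra index column. The sketch below assumes a hypothetical `employee_id` column and made-up values; the actual identifier column in `test_data.csv` is not shown in this notebook, so the names here are placeholders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical identifiers and predictions; replace with the real ones.
submission = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "is_promoted": [0, 1, 0],
})

# Write to a temporary folder so the sketch does not touch project files.
out_path = os.path.join(tempfile.mkdtemp(), "final_submission.csv")

# index=False avoids writing the DataFrame index as an unnamed first column.
submission.to_csv(out_path, index=False)

saved_columns = pd.read_csv(out_path).columns.tolist()
print(saved_columns)
```
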