## HR Job Promotions Prediction

### Data Analysis Question

#### Context

**Background Information**
HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. HR departments have used analytics for years. However, the collection, processing, and analysis of data have been largely manual, and given the nature of human resources dynamics and HR KPIs, this manual approach has constrained HR. It is therefore surprising that HR departments woke up to the utility of machine learning so late in the game.

**Problem Statement**
Your client is a large Multinational Corporation, and they have nine broad verticals across the organization. One of the problems your client faces is identifying the right people for promotion (only for the manager position and below) and preparing them in time.

Currently, the process they are following is:

* They first identify a set of employees based on recommendations or past performance.
* Selected employees go through a separate training and evaluation program for each vertical.
* These programs are based on the required skill of each vertical. At the end of the program, based on various factors such as training performance, KPI completion (only employees with KPIs completed greater than 60% are considered) etc., the employee gets a promotion.

For the process mentioned above, the final promotions are only announced after the evaluation, and this leads to a delay in transition to their new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that they can expedite the entire promotion cycle.

They have provided multiple attributes on employees’ past and current performance, along with demographics. The task is to predict whether a potential promotee at a checkpoint will be promoted after the evaluation process.

Dataset URL: https://bit.ly/2ODZvLCHRDataset

#### Metric of success

A model with an accuracy score of at least 80% that can predict whether a potential promotee at a checkpoint will be promoted after the evaluation process.

### Data Exploration

Import libraries

In [73]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.preprocessing import MinMaxScaler, RobustScaler, OneHotEncoder

from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
from sklearn.tree import DecisionTreeClassifier     # Decision Tree Classifier
from sklearn.svm import SVC                         # SVM Classifier
from sklearn.naive_bayes import GaussianNB          # Naive Bayes Classifier
from sklearn.neighbors import KNeighborsClassifier  # KNN Classifier
from sklearn.ensemble import RandomForestClassifier # Random Forest Classifier

##### Load Data 

In [13]:
df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')

In [102]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Explore data

In [None]:
df.head(3)

| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |


In [None]:
df.tail(3)

| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
| 54806 | 13614 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
| 54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |


In [None]:
df.shape

(54808, 14)

In [None]:
df.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

In [None]:
df.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64
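
The raw counts are easier to judge as a share of the 54,808 rows. A minimal sketch using the two non-zero counts from the output above; the helper name `missing_share` is hypothetical and not part of the original notebook:

```python
# Counts copied from the df.isnull().sum() and df.shape outputs above.
n_rows = 54808
missing_counts = {"education": 2409, "previous_year_rating": 4124}


def missing_share(count, total):
    """Return the fraction of rows that lack a value."""
    return count / total


for column, count in missing_counts.items():
    print(f"{column}: {missing_share(count, n_rows):.1%}")
```

On the DataFrame itself, `df.isnull().mean()` should give the same fractions for every column in one call.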

In [15]:
df['is_promoted'].value_counts()

0    50140
1     4668
Name: is_promoted, dtype: int64
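
These counts show a strong class imbalance, which matters when accuracy is the success metric. A short sketch of the arithmetic, using only the two counts above, gives the accuracy of a trivial model that always predicts "not promoted":

```python
# Class counts copied from the value_counts() output above.
not_promoted = 50140
promoted = 4668

total = not_promoted + promoted
# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = not_promoted / total
promotion_rate = promoted / total

print(f"promotion rate:    {promotion_rate:.3f}")
print(f"majority baseline: {baseline_accuracy:.3f}")
```

The baseline comes out at roughly 0.915, so the 80% accuracy target can be met without learning anything about promotions.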

In [26]:
df['department'].value_counts()

Sales & Marketing    16840
Operations           11348
Technology            7138
Procurement           7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64
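
Department sizes differ widely, so it may also be worth comparing promotion rates per department. The sketch below uses a small hypothetical frame (`toy`) in place of the real `df`; the values are invented for illustration only:

```python
import pandas as pd

# Hypothetical toy rows; the notebook would use `df` loaded above instead.
toy = pd.DataFrame({
    "department": ["HR", "HR", "Legal", "Legal", "Legal", "R&D"],
    "is_promoted": [0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1.
rate_by_department = (
    toy.groupby("department")["is_promoted"]
    .agg(["count", "mean"])
    .rename(columns={"count": "employees", "mean": "promotion_rate"})
)
print(rate_by_department)
```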

In [14]:
sum(df.duplicated())

0

### Data Preparation

##### Fix column names

In [16]:
df.columns = df.columns.str.lower()

In [18]:
df.rename(columns = {'kpis_met >80%':'kpis_met_greater_than_80', 'awards_won?':'awards_won'}, inplace = True)

##### Handle missing data

In [19]:
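# Assign the result back; inplace=True on a column selection may not update df in recent pandas versions
df['previous_year_rating'] = df['previous_year_rating'].fillna(0)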

In [20]:
# Treat missing education as its own "Other" category
df['education'] = df['education'].fillna("Other")

In [21]:
df.isnull().sum()

employee_id                 0
department                  0
region                      0
education                   0
gender                      0
recruitment_channel         0
no_of_trainings             0
age                         0
previous_year_rating        0
length_of_service           0
kpis_met_greater_than_80    0
awards_won                  0
avg_training_score          0
is_promoted                 0
dtype: int64
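
Filling `previous_year_rating` with 0 places the missing rows at a value that no rated employee in the sample rows has. One optional variation, not used in this notebook, is to record which rows were missing before filling them, so a model can tell "no rating" apart from a low rating. A minimal sketch on hypothetical toy data; the column name `rating_missing` is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['previous_year_rating'].
toy = pd.DataFrame({"previous_year_rating": [5.0, np.nan, 3.0, np.nan]})

# Record which rows were missing before filling them.
toy["rating_missing"] = toy["previous_year_rating"].isna().astype(int)
# Fill the gaps with 0, assigning the result back to the column.
toy["previous_year_rating"] = toy["previous_year_rating"].fillna(0)

print(toy)
```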

##### Encode categorical data

In [56]:
df_encoded = pd.get_dummies(
    df,
    columns=['gender', 'region', 'education', 'department', 'recruitment_channel'],
    drop_first=True,
)

In [63]:
df_encoded.columns = df_encoded.columns.str.lower()

In [64]:
df_encoded.head(2)

Unnamed: 0,employee_id,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_greater_than_80,awards_won,avg_training_score,is_promoted,gender_m,region_region_10,region_region_11,region_region_12,region_region_13,region_region_14,region_region_15,region_region_16,region_region_17,region_region_18,region_region_19,region_region_2,region_region_20,region_region_21,region_region_22,region_region_23,region_region_24,region_region_25,region_region_26,region_region_27,region_region_28,region_region_29,region_region_3,region_region_30,region_region_31,region_region_32,region_region_33,region_region_34,region_region_4,region_region_5,region_region_6,region_region_7,region_region_8,region_region_9,education_below secondary,education_master's & above,education_other,department_finance,department_hr,department_legal,department_operations,department_procurement,department_r&d,department_sales & marketing,department_technology,recruitment_channel_referred,recruitment_channel_sourcing
0,65438,1,35,5.0,8,1,0,49,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
1,65141,1,30,5.0,4,0,0,60,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
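
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical three-row frame with two of the categorical columns; it shows that the alphabetically first category of each column gets no indicator of its own:

```python
import pandas as pd

# Hypothetical toy rows using category values that appear in the dataset.
toy = pd.DataFrame({
    "gender": ["f", "m", "m"],
    "recruitment_channel": ["sourcing", "other", "referred"],
})

encoded = pd.get_dummies(
    toy, columns=["gender", "recruitment_channel"], drop_first=True
)
# "f" and "other" have no column; they are implied when the
# remaining indicators for that feature are all 0.
print(list(encoded.columns))
```

This matches the encoded header above, where `gender_m`, `recruitment_channel_referred`, and `recruitment_channel_sourcing` appear but `gender_f` and `recruitment_channel_other` do not.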


### Data Modelling

Set predictors and the target

In [68]:
X = df_encoded.drop(['is_promoted','employee_id'], axis=1)
Y = df_encoded['is_promoted']

Split data into training and testing sets

In [70]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3, random_state=0)
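
The split above does not pass `stratify`, so the share of promoted employees can differ slightly between the training and test sets. With a minority class this small, a stratified split is a common option. A minimal sketch on hypothetical toy data (the names `X_toy` and `y_toy` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10% positives.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=Y` to the existing call.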

Normalisation

In [100]:
norm = MinMaxScaler().fit(X_train) 
X_train = norm.transform(X_train) 
X_test = norm.transform(X_test)
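
Fitting the scaler on the training set only, as above, keeps test-set information out of the preprocessing. One consequence is that scaled test values can fall outside the 0 to 1 range. A small sketch with hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

# The scaler learns min=0 and max=10 from the training data only.
scaler = MinMaxScaler().fit(train)

print(scaler.transform(train).ravel())  # training values map to 0 and 1
print(scaler.transform(test).ravel())   # 20 maps to 2.0, outside [0, 1]
```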

Instantiate the classifiers

In [91]:
logistic_classifier = LogisticRegression()
decision_classifier = DecisionTreeClassifier()
svm_classifier = SVC()
knn_classifier = KNeighborsClassifier()
naive_classifier = GaussianNB()
random_forest_classifier = RandomForestClassifier()

Fitting

In [94]:
max_iter = 60000  # note: never passed to a classifier; LogisticRegression() above runs with its default max_iter

In [95]:
logistic_classifier.fit(X_train, Y_train)
decision_classifier.fit(X_train, Y_train)
svm_classifier.fit(X_train, Y_train)
knn_classifier.fit(X_train, Y_train)
naive_classifier.fit(X_train, Y_train)
random_forest_classifier.fit(X_train, Y_train)

RandomForestClassifier()
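
An iteration limit only takes effect when it is passed to the estimator's constructor. The sketch below shows one way to do that for logistic regression on hypothetical synthetic data. The `class_weight="balanced"` argument is an optional addition that is not used in this notebook; it reweights the classes to counter imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data with roughly 9 negatives per positive.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

max_iter = 60000
# max_iter must be given here to affect the solver.
clf = LogisticRegression(max_iter=max_iter, class_weight="balanced")
clf.fit(X_demo, y_demo)

print(clf.get_params()["max_iter"])   # the limit the solver was given
print(clf.n_iter_[0] <= max_iter)     # iterations actually used stay within it
```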

Predict results

In [96]:
logistic_y_prediction = logistic_classifier.predict(X_test) 
decision_y_prediction = decision_classifier.predict(X_test) 
svm_y_prediction = svm_classifier.predict(X_test) 
knn_y_prediction = knn_classifier.predict(X_test) 
naive_y_prediction = naive_classifier.predict(X_test) 
random_y_prediction = random_forest_classifier.predict(X_test) 

Print accuracy of classifiers

In [97]:
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the values are unchanged
print(accuracy_score(Y_test, logistic_y_prediction))
print(accuracy_score(Y_test, decision_y_prediction))
print(accuracy_score(Y_test, svm_y_prediction))
print(accuracy_score(Y_test, knn_y_prediction))
print(accuracy_score(Y_test, naive_y_prediction))
print(accuracy_score(Y_test, random_y_prediction))

0.9183239068296539
0.9023900748038679
0.9157088122605364
0.9217296113847838
0.4232196071276531
0.9348050842303716


Print the classification report

In [98]:
print('Logistic classifier:')
print(classification_report(Y_test, logistic_y_prediction))

print('Decision Tree classifier:')
print(classification_report(Y_test, decision_y_prediction))

print('SVM Classifier:')
print(classification_report(Y_test, svm_y_prediction))

print('KNN Classifier:')
print(classification_report(Y_test, knn_y_prediction))

print('Naive Bayes Classifier:')
print(classification_report(Y_test, naive_y_prediction)) 

print('Random Forest Classifier:')
print(classification_report(Y_test, random_y_prediction)) 

Logistic classifier:
              precision    recall  f1-score   support

           0       0.92      0.99      0.96     15057
           1       0.58      0.12      0.19      1386

    accuracy                           0.92     16443
   macro avg       0.75      0.55      0.58     16443
weighted avg       0.89      0.92      0.89     16443

Decision Tree classifier:
              precision    recall  f1-score   support

           0       0.95      0.94      0.95     15057
           1       0.43      0.47      0.45      1386

    accuracy                           0.90     16443
   macro avg       0.69      0.71      0.70     16443
weighted avg       0.91      0.90      0.90     16443

SVM Classifier:
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     15057
           1       0.00      0.00      0.00      1386

    accuracy                           0.92     16443
   macro avg       0.46      0.50      0.48     16443
weighted av

  _warn_prf(average, modifier, msg_start, len(result))
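
The warning fragments above come from scikit-learn computing precision for a class that was never predicted, as in the SVM report. A minimal sketch on hypothetical labels shows how `zero_division=0` reports that case as 0 without a warning, and how a confusion matrix and class-1 recall expose a model that predicts 0 for everyone:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Hypothetical labels: 8 negatives, 2 positives, and a model that always says 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

# zero_division=0 sets the undefined precision for class 1 to 0 silently.
print(classification_report(y_true, y_pred, zero_division=0))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))             # high despite finding no positives
print(recall_score(y_true, y_pred, pos_label=1))  # share of positives recovered
```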


### Summary of Findings and Recommendations

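Five of the six classifiers exceeded the 80% accuracy target; Gaussian Naive Bayes was the exception at 42%. Random Forest performed best, with an accuracy of 93.5%, and is the recommended starting point for the promotion prediction solution. Accuracy should be read with care here: about 91.5% of the employees in the dataset were not promoted, so a model that never predicts a promotion already scores close to that figure. The SVM did exactly this (recall of 0.00 on the promoted class), and logistic regression recovered only 12% of promoted employees, so recall and F1 on the promoted class should be checked alongside accuracy before the model is put to use.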