<h1 align='center'>  HR ANALYTICS CHALLENGE </h1>
<h3 align='center'> <b>Predict Whether a Potential Promotee Will be Promoted or Not</b> </h3>

### **The Challenge**

HR analytics is changing how human resources departments operate, making them more efficient and improving outcomes. Human resources has used analytics for years, but the collection, processing and analysis of data has been largely manual, and given the nature of HR dynamics and HR KPIs, that manual approach has held HR back. It is therefore surprising that HR departments took so long to adopt machine learning.

## 0. Import relevant Dependencies

In case you get an error saying a package is not installed when running the cell below, you can either:
- pip install ________.
- google 'How to install ________'.

In [1]:
# Import Dependencies -To see the graphs in the notebook.
%matplotlib inline   

# Python Imports
import math,time,random,datetime

# Data Manipulation
import numpy as np
import pandas as pd

# Visualization -This is where the graphs come in.
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
plt.style.use('fivethirtyeight')

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine Learning
import catboost
from sklearn.model_selection import train_test_split
from sklearn import model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier, Pool, cv

# Performance Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

# Display all the columns/rows of the DataFrame.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 0. Loading the required Data

In [2]:
# Import the train data.
train = pd.read_csv('Final_train.csv')

## 1. Model Building

### Algorithms
From here, we will be running the following algorithms.

- Logistic Regression
- KNN
- Naive Bayes
- Stochastic Gradient Descent
- Linear SVC
- Decision Tree
- Gradient Boosted Trees
- Random Forest
- CatBoost Algorithm

In [3]:
train.head()

Unnamed: 0,is_promoted,department,education,gender,recruitment_channel,no_of_trainings,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,0,7,2,0,2,-0.415276,1.385021,0.50046,1.356878,-0.154018,-1.075931
1,0,4,0,1,0,-0.415276,1.385021,-0.437395,-0.736986,-0.154018,-0.253282
2,0,7,0,1,2,-0.415276,-0.259125,0.265996,-0.736986,-0.154018,-1.001145
3,0,7,0,1,0,1.226063,-1.903271,0.969387,-0.736986,-0.154018,-1.001145
4,0,8,0,1,0,-0.415276,-0.259125,-0.906322,-0.736986,-0.154018,0.718939


In [4]:
X = train.drop(columns= 'is_promoted')
y = train['is_promoted']

### Overcoming Class Imbalance

In [5]:
from imblearn.combine import SMOTETomek
from collections import Counter

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.4, random_state= 42)

In [7]:
# Named `smote_tomek` rather than `os` to avoid shadowing the standard-library module name.
smote_tomek = SMOTETomek(random_state= 42)
X_train_ns,y_train_ns = smote_tomek.fit_resample(X_train,y_train)
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))

The number of classes before fit Counter({0: 30000, 1: 2884})
The number of classes after fit Counter({0: 29566, 1: 29566})
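
To make the effect of the resampling concrete, the sketch below redoes the arithmetic on the class counts printed above. The helper name `minority_share` is introduced here for illustration only. SMOTE oversamples the minority class and the Tomek-link cleaning step then removes some samples, which is why the majority class ends up slightly smaller than before.

```python
from collections import Counter

# Class counts copied from the output above.
before = Counter({0: 30000, 1: 2884})
after = Counter({0: 29566, 1: 29566})


def minority_share(counts):
    """Fraction of samples that belong to the smallest class."""
    return min(counts.values()) / sum(counts.values())


share_before = minority_share(before)      # promoted employees before resampling
share_after = minority_share(after)        # promoted employees after resampling
removed_majority = before[0] - after[0]    # majority samples dropped by the cleaning step

print(f"Minority share before: {share_before:.4f}")
print(f"Minority share after:  {share_after:.4f}")
print(f"Majority samples removed: {removed_majority}")
```

Before resampling, fewer than 9% of the training samples are promotions; afterwards the two classes are balanced.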


For each model, we follow three steps:

- Fit the model on the resampled training data and record its training accuracy.
- Run K-Fold Cross Validation (here K = 10) to get out-of-fold predictions.
- Compute the accuracy, precision, recall and F1-score of those cross-validation predictions.

**We will be running a whole bunch of models to figure out which model is best suited for our data.**
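
Since the same three steps are repeated for every model, they could be wrapped in a single helper. The following is only a sketch: the function name `fit_and_cross_validate` and the synthetic data from `make_classification` are assumptions made for illustration, not part of the original notebook, which runs each model in its own cell.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict


def fit_and_cross_validate(estimator, X, y, cv=10):
    """Hypothetical helper: run the three steps and return the metrics."""
    start = time.time()
    # Step 1: fit on all the data and record the training accuracy.
    fitted = estimator.fit(X, y)
    train_acc = round(fitted.score(X, y) * 100, 2)
    # Step 2: out-of-fold predictions from K-fold cross validation.
    oof_pred = cross_val_predict(estimator, X, y, cv=cv)
    # Step 3: metrics computed on the out-of-fold predictions.
    return {
        "train_acc": train_acc,
        "cv_acc": round(accuracy_score(y, oof_pred) * 100, 2),
        "cv_precision": precision_score(y, oof_pred),
        "cv_recall": recall_score(y, oof_pred),
        "cv_f1": f1_score(y, oof_pred),
        "seconds": time.time() - start,
    }


# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
results = fit_and_cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)
print(results)
```

With the notebook's data, the call would take `X_train_ns` and `y_train_ns` in place of the synthetic arrays.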

#### Model 1: Logistic Regression

In [8]:
start_time = time.time()
algorithm = LogisticRegression()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
log_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
log_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
log_acc_cv = round(metrics.accuracy_score(y_train_ns, log_train_pred)*100, 2)

log_pre_cv = precision_score(y_train_ns, log_train_pred)
log_rec_cv = recall_score(y_train_ns, log_train_pred)
log_f1_cv = f1_score(y_train_ns, log_train_pred)

log_time = (time.time()- start_time)

In [9]:
# Logistic Regression
print('Accuracy of the model is: ', log_acc)
print('Accuracy of 10-Fold CV is: ', log_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= log_time))

print('Precision: ', log_pre_cv)
print('Recall: ', log_rec_cv)
print('F1-Score: ', log_f1_cv)


Accuracy of the model is:  72.05
Accuracy of 10-Fold CV is:  72.06
Running time is:  0:00:13.493614
Precision:  0.7227060068982003
Recall:  0.7157884056010282
F1-Score:  0.7192305731618209


#### Model 2: K-Nearest Neighbours

In [10]:
start_time = time.time()
algorithm = KNeighborsClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
knn_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
knn_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
knn_acc_cv = round(metrics.accuracy_score(y_train_ns, knn_train_pred)*100, 2)

knn_pre_cv = precision_score(y_train_ns, knn_train_pred)
knn_rec_cv = recall_score(y_train_ns, knn_train_pred)
knn_f1_cv = f1_score(y_train_ns, knn_train_pred)

knn_time = (time.time()- start_time)

In [11]:
# K-Nearest Neighbours
print('Accuracy of the model is: ', knn_acc)
print('Accuracy of 10-Fold CV is: ', knn_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= knn_time))

print('Precision: ', knn_pre_cv)
print('Recall: ', knn_rec_cv)
print('F1-Score: ', knn_f1_cv)

Accuracy of the model is:  93.24
Accuracy of 10-Fold CV is:  89.62
Running time is:  0:04:21.812483
Precision:  0.8634910783553142
Recall:  0.9411486166542651
F1-Score:  0.9006489618229193


#### Model 3: Gaussian Naive Bayes

In [12]:
start_time = time.time()
algorithm = GaussianNB()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
gnb_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
gnb_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
gnb_acc_cv = round(metrics.accuracy_score(y_train_ns, gnb_train_pred)*100, 2)

gnb_pre_cv = precision_score(y_train_ns, gnb_train_pred)
gnb_rec_cv = recall_score(y_train_ns, gnb_train_pred)
gnb_f1_cv = f1_score(y_train_ns, gnb_train_pred)

gnb_time = (time.time()- start_time)

In [13]:
# Gaussian Naive Bayes
print('Accuracy of the model is: ', gnb_acc)
print('Accuracy of 10-Fold CV is: ', gnb_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= gnb_time))

print('Precision: ', gnb_pre_cv)
print('Recall: ', gnb_rec_cv)
print('F1-Score: ', gnb_f1_cv)

Accuracy of the model is:  67.51
Accuracy of 10-Fold CV is:  67.49
Running time is:  0:00:01.809234
Precision:  0.7563454293079516
Recall:  0.5160319285665967
F1-Score:  0.6134947122924123


#### Model 4: Linear Support Vector Machines (SVC)

In [14]:
start_time = time.time()
algorithm = LinearSVC()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
svc_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
svc_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
svc_acc_cv = round(metrics.accuracy_score(y_train_ns, svc_train_pred)*100, 2)

svc_pre_cv = precision_score(y_train_ns, svc_train_pred)
svc_rec_cv = recall_score(y_train_ns, svc_train_pred)
svc_f1_cv = f1_score(y_train_ns, svc_train_pred)

svc_time = (time.time()- start_time)

In [15]:
# Linear Support Vector Machines
print('Accuracy of the model is: ', svc_acc)
print('Accuracy of 10-Fold CV is: ', svc_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= svc_time))

print('Precision: ', svc_pre_cv)
print('Recall: ', svc_rec_cv)
print('F1-Score: ', svc_f1_cv)

Accuracy of the model is:  72.44
Accuracy of 10-Fold CV is:  72.45
Running time is:  0:03:30.388529
Precision:  0.7242416052969394
Recall:  0.7251234526144896
F1-Score:  0.7246822606814495


#### Model 5: Stochastic Gradient Descent

In [16]:
start_time = time.time()
algorithm = SGDClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
SGD_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
SGD_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
SGD_acc_cv = round(metrics.accuracy_score(y_train_ns, SGD_train_pred)*100, 2)

SGD_pre_cv = precision_score(y_train_ns, SGD_train_pred)
SGD_rec_cv = recall_score(y_train_ns, SGD_train_pred)
SGD_f1_cv = f1_score(y_train_ns, SGD_train_pred)

SGD_time = (time.time()- start_time)

In [17]:
# Stochastic Gradient Descent
print('Accuracy of the model is: ', SGD_acc)
print('Accuracy of 10-Fold CV is: ', SGD_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= SGD_time))

print('Precision: ', SGD_pre_cv)
print('Recall: ', SGD_rec_cv)
print('F1-Score: ', SGD_f1_cv)

Accuracy of the model is:  72.56
Accuracy of 10-Fold CV is:  72.76
Running time is:  0:00:08.297752
Precision:  0.7120572292953485
Recall:  0.7642224176418859
F1-Score:  0.7372181800385006


#### Model 6: Decision Tree Classifier

In [18]:
start_time = time.time()
algorithm = DecisionTreeClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
dt_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
dt_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
dt_acc_cv = round(metrics.accuracy_score(y_train_ns, dt_train_pred)*100, 2)

dt_pre_cv = precision_score(y_train_ns, dt_train_pred)
dt_rec_cv = recall_score(y_train_ns, dt_train_pred)
dt_f1_cv = f1_score(y_train_ns, dt_train_pred)

dt_time = (time.time()- start_time)

In [19]:
#  Decision Tree Classifier
print('Accuracy of the model is: ', dt_acc)
print('Accuracy of 10-Fold CV is: ', dt_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= dt_time))

print('Precision: ', dt_pre_cv)
print('Recall: ', dt_rec_cv)
print('F1-Score: ', dt_f1_cv)

Accuracy of the model is:  98.88
Accuracy of 10-Fold CV is:  93.74
Running time is:  0:00:04.647428
Precision:  0.9437238350147553
Recall:  0.9301900832036799
F1-Score:  0.9369080874838183


#### Model 7: Gradient Boost Trees

In [20]:
start_time = time.time()
algorithm = GradientBoostingClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
gbt_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
gbt_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
gbt_acc_cv = round(metrics.accuracy_score(y_train_ns, gbt_train_pred)*100, 2)

gbt_pre_cv = precision_score(y_train_ns, gbt_train_pred)
gbt_rec_cv = recall_score(y_train_ns, gbt_train_pred)
gbt_f1_cv = f1_score(y_train_ns, gbt_train_pred)

gbt_time = (time.time()- start_time)

In [21]:
# Gradient Boost Trees
print('Accuracy of the model is: ', gbt_acc)
print('Accuracy of 10-Fold CV is: ', gbt_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= gbt_time))

print('Precision: ', gbt_pre_cv)
print('Recall: ', gbt_rec_cv)
print('F1-Score: ', gbt_f1_cv)

Accuracy of the model is:  88.07
Accuracy of 10-Fold CV is:  87.74
Running time is:  0:01:49.096689
Precision:  0.858533607505462
Recall:  0.9037746059663126
F1-Score:  0.8805734058329214


#### Model 8: Random Forest


In [22]:
start_time = time.time()
algorithm = RandomForestClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
rf_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
rf_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
rf_acc_cv = round(metrics.accuracy_score(y_train_ns, rf_train_pred)*100, 2)

rf_pre_cv = precision_score(y_train_ns, rf_train_pred)
rf_rec_cv = recall_score(y_train_ns, rf_train_pred)
rf_f1_cv = f1_score(y_train_ns, rf_train_pred)

rf_time = (time.time()- start_time)

In [23]:
print('Accuracy of the model is: ', rf_acc)
print('Accuracy of 10-Fold CV is: ', rf_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= rf_time))

print('Precision: ', rf_pre_cv)
print('Recall: ', rf_rec_cv)
print('F1-Score: ', rf_f1_cv)

Accuracy of the model is:  98.88
Accuracy of 10-Fold CV is:  95.03
Running time is:  0:01:57.698702
Precision:  0.9520592129833972
Recall:  0.948420482987215
F1-Score:  0.9502363645605651


### Model Results

Now let's see which model has the best cross-validation accuracy.

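- <b>NOTE:</b> We care more about the cross-validation accuracy than the training accuracy. The training accuracy is measured on the same data the model was fitted on, so it tends to be optimistic, whereas cross-validation scores each sample with a model that did not see it during fitting.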
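
The gap between training and cross-validation accuracy is easiest to see with a model that can memorise its training data, such as an unpruned decision tree (98.88 versus 93.74 in the results above). The sketch below reproduces that pattern on synthetic, deliberately noisy data; the data and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so memorising it does not generalise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# Training accuracy: scored on the same samples the tree was fitted on.
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Cross-validation accuracy: each fold is scored by a tree that never saw it.
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=10).mean()

print(f"Training accuracy:         {train_acc:.3f}")
print(f"Cross-validation accuracy: {cv_acc:.3f}")
```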

In [24]:
cv_models = pd.DataFrame({'Model':['Logistic Regression', 'K-Nearest Neighbours', 'Gaussian Naive Bayes', 
                                'Linear Support Vector Machines (SVC)', 'Stochastic Gradient Descent', 
                                'Decision Tree Classifier', 'Gradient Boost Trees', 'Random Forest'],
                      'Score':[log_acc_cv, knn_acc_cv, gnb_acc_cv, svc_acc_cv, SGD_acc_cv, dt_acc_cv, gbt_acc_cv, rf_acc_cv]})

print('-----Cross-Validation Accuracy Scores-----')
cv_models.nlargest(8,'Score')

-----Cross-Validation Accuracy Scores-----


Unnamed: 0,Model,Score
7,Random Forest,95.03
5,Decision Tree Classifier,93.74
1,K-Nearest Neighbours,89.62
6,Gradient Boost Trees,87.74
4,Stochastic Gradient Descent,72.76
3,Linear Support Vector Machines (SVC),72.45
0,Logistic Regression,72.06
2,Gaussian Naive Bayes,67.49


### Precision and Recall

Precision and Recall are the metrics to look at when you have an imbalanced classification problem, where accuracy alone can be misleading.

- Recall - a metric which measures a model's ability to find all relevant cases in a dataset.
- Precision - a metric which measures a model's ability to identify only the relevant cases.

Combining Precision and Recall (as their harmonic mean) gives us the **F1 score.**

All three fall between 0 and 1, with 1 being the best.
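
As a small worked example of these definitions, the sketch below computes all three metrics from confusion-matrix counts. The function name and the counts are hypothetical; the last lines check that the F1 score reported for Logistic Regression above is the harmonic mean of its reported precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")

# Values printed for Logistic Regression earlier in the notebook.
log_precision = 0.7227060068982003
log_recall = 0.7157884056010282
log_f1 = 2 * log_precision * log_recall / (log_precision + log_recall)
print(f"Harmonic mean of the reported precision and recall: {log_f1:.6f}")
```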

In [25]:
f1_cv_models = pd.DataFrame({'Model':['Logistic Regression', 'K-Nearest Neighbours', 'Gaussian Naive Bayes', 
                                'Linear Support Vector Machines (SVC)', 'Stochastic Gradient Descent', 
                                'Decision Tree Classifier', 'Gradient Boost Trees', 'Random Forest'],
                      'F1-Score':[log_f1_cv, knn_f1_cv, gnb_f1_cv, svc_f1_cv, SGD_f1_cv, dt_f1_cv, gbt_f1_cv, rf_f1_cv]})

print('-----Cross-Validation F1 Scores-----')
f1_cv_models.nlargest(8,'F1-Score')

-----Cross-Validation F1 Scores-----


Unnamed: 0,Model,F1-Score
7,Random Forest,0.950236
5,Decision Tree Classifier,0.936908
1,K-Nearest Neighbours,0.900649
6,Gradient Boost Trees,0.880573
4,Stochastic Gradient Descent,0.737218
3,Linear Support Vector Machines (SVC),0.724682
0,Logistic Regression,0.719231
2,Gaussian Naive Bayes,0.613495


> **Recall = TP/(TP + FN)**
- Here the cross-validated Recall is high for the top-ranked models. This means there are few False Negatives (predicting 'Not promoted' when the employee was actually 'Promoted').

> **Precision = TP/(TP + FP)**
- The cross-validated Precision is high as well. Thus, we can say that there are few False Positives (predicting 'Promoted' when the employee was actually 'Not promoted').

### Prediction

Let's use the model with the highest cross-validation accuracy score (Random Forest) to make predictions on the hold-out split (`X_test`, `y_test`) that we set aside earlier.

We want to make predictions on the same columns our model was trained on.

So we select that subset of columns from the hold-out dataframe and make predictions with our model.

In [26]:
# Create a list of columns to be used for predictions.
wanted_columns = X_train.columns
wanted_columns

Index(['department', 'education', 'gender', 'recruitment_channel',
       'no_of_trainings', 'previous_year_rating', 'length_of_service',
       'KPIs_met >80%', 'awards_won?', 'avg_training_score'],
      dtype='object')
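
Selecting columns by the training column index also fixes their order, which matters because scikit-learn estimators expect features in the order used for fitting. The sketch below shows one way to check for missing columns before selecting. The toy dataframe, its feature values and the shortened column list are assumptions for illustration.

```python
import pandas as pd

# Shortened, illustrative version of the training column index.
wanted_columns = pd.Index(['department', 'education', 'gender'])

# Toy dataframe with an extra column and a different column order.
new_data = pd.DataFrame({
    'gender': [1, 0],
    'employee_id': [8724, 74430],
    'department': [8, 2],
    'education': [0, 0],
})

# Fail early if any training column is absent from the new data.
missing = wanted_columns.difference(new_data.columns)
if len(missing) > 0:
    raise ValueError(f"Missing columns: {list(missing)}")

# Keep only the training columns, in the training order.
aligned = new_data[wanted_columns]
print(aligned)
```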

In [28]:
# Make predictions on the hold-out split using the Random Forest model
# (`algorithm` still refers to the RandomForestClassifier fitted in Model 8).
predictions = algorithm.predict(X_test[wanted_columns])

In [29]:
#  RandomForest Algorithm
print('Accuracy of the model is: ', accuracy_score(y_test, predictions))
print('Precision: ', precision_score(y_test, predictions))
print('Recall: ', recall_score(y_test, predictions))
print('F1: ', f1_score(y_test, predictions))

Accuracy of the model is:  0.9077723043240284
Precision:  0.4355362946912243
Recall:  0.45067264573991034
F1:  0.44297520661157025
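
To put the hold-out accuracy in context, it helps to compare it with a baseline that always predicts "not promoted". The sketch below estimates that baseline from the training class counts printed earlier, assuming the hold-out split has a similar class mix (plausible for a random split, but not verified here).

```python
# Class counts of the training split before resampling (from the output above).
train_counts = {0: 30000, 1: 2884}

# Accuracy of a model that always predicts the majority class (0).
baseline_accuracy = train_counts[0] / sum(train_counts.values())

# Hold-out metrics reported above for the Random Forest.
reported_accuracy = 0.9077723043240284
reported_f1 = 0.44297520661157025

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Random Forest hold-out accuracy:  {reported_accuracy:.4f}")
print(f"Random Forest hold-out F1:        {reported_f1:.4f}")
```

Under that assumption, the hold-out accuracy is no better than the trivial baseline, so the F1 score is the more informative number here. The hold-out F1 is also far below the cross-validated F1. One possible contributor is that the resampling was applied before cross-validation, so synthetic samples derived from a validation fold could appear in the training folds.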


## 2. Prediction on the Test dataset

Now let's use the same Random Forest model to make predictions on the competition test dataset (`Final_test.csv`).

As before, we predict on the same columns the model was trained on. The test dataframe is already encoded and scaled, so we only need to select those columns.

In [30]:
test = pd.read_csv('Final_test.csv')
test.head()

Unnamed: 0,department,education,gender,recruitment_channel,no_of_trainings,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,8,0,1,2,-0.423094,-0.266732,-1.1432,1.336715,-0.152665,1.024263
1,2,0,0,0,-0.423094,-0.266732,-0.19259,-0.748103,-0.152665,-0.914377
2,7,0,1,0,-0.423094,-1.907786,-0.430243,-0.748103,-0.152665,-1.212629
3,5,0,0,0,2.905264,-1.087259,0.758019,-0.748103,-0.152665,0.129506
4,1,0,1,2,-0.423094,0.553794,0.282714,-0.748103,-0.152665,-0.168746


In [31]:
# Make predictions on the competition test set using the same Random Forest model.
predictions = algorithm.predict(test[wanted_columns])

In [32]:
# Our predictions array is comprised of 0's and 1's.
predictions[:30]

array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 0], dtype=int64)

In [33]:
pd.set_option('display.max_rows',100)
df = pd.read_csv('test.csv')

# Create a dataframe and append the relevant columns.
submission = pd.DataFrame()
submission['employee_id'] = df['employee_id']
submission['is_promoted'] = predictions
submission.head()

Unnamed: 0,employee_id,is_promoted
0,8724,1
1,74430,0
2,72255,0
3,38562,0
4,64486,0


In [34]:
submission['is_promoted'].value_counts()

0    13171
1    10319
Name: is_promoted, dtype: int64
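
A quick sanity check on a submission is to compare the share of predicted promotions with the share seen in training. The sketch below does this with the counts printed above; the helper name `positive_rate` is introduced here for illustration.

```python
# Predicted class counts in the submission (from the output above).
submission_counts = {0: 13171, 1: 10319}

# Class counts of the training split before resampling.
train_counts = {0: 30000, 1: 2884}


def positive_rate(counts):
    """Share of samples labelled 1 (promoted)."""
    return counts[1] / (counts[0] + counts[1])


predicted_rate = positive_rate(submission_counts)
train_rate = positive_rate(train_counts)
ratio = predicted_rate / train_rate

print(f"Predicted promotion rate: {predicted_rate:.4f}")
print(f"Training promotion rate:  {train_rate:.4f}")
print(f"Ratio: {ratio:.1f}x")
```

The predicted promotion rate is roughly five times the training rate. The cause is not established here, but a gap this large is worth investigating before submitting, for example by checking that `Final_test.csv` was preprocessed in the same way as `Final_train.csv`.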

In [35]:
# Are our test and submission the same length?
if len(submission) == len(test):
    print('The submission and the test dataframes are of the same length')
else:
    print('Dataframes mismatched')

The submission and the test dataframes are of the same length


In [36]:
# convert submission dataframe to csv.
submission.to_csv('HR_Analytics-(Before HPT).csv', index= False)
print('Submission csv is ready')

Submission csv is ready
