<h1 align='center'>  HR ANALYTICS CHALLENGE </h1>
<h3 align='center'> <b>Predict Whether a Potential Promotee Will be Promoted or Not</b> </h3>

### **The Challenge**

HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has been using analytics for years. However, the collection, processing and analysis of data has been largely manual, and given the nature of human resources dynamics and HR KPIs, this manual approach has been constraining HR. It is therefore surprising that HR departments woke up to the utility of machine learning so late in the game.

## 0. Import relevant Dependencies

If running the cell below raises an error saying that a package is not installed, you can either:
- run `pip install ________`, or
- search for 'How to install ________'.
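Before running the import cell, it can help to see which packages are missing in one go. The sketch below is a minimal check using only the standard library; the helper name `missing_packages` and the list `required` are assumptions made for illustration, not part of the original notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this notebook. The pip name can differ from the import name:
# `sklearn` is installed as scikit-learn and `imblearn` as imbalanced-learn.
required = ["numpy", "pandas", "matplotlib", "seaborn", "missingno",
            "sklearn", "catboost", "imblearn"]

print("Missing packages:", missing_packages(required))
```

Any name printed here is a candidate for `pip install`, using the pip name where it differs from the import name.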

In [1]:
# Import Dependencies -To see the graphs in the notebook.
%matplotlib inline   

# Python Imports
import math,time,random,datetime

# Data Manipulation
import numpy as np
import pandas as pd

# Visualization -This is where the graphs come in.
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
plt.style.use('fivethirtyeight')

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine Learning
import catboost
from sklearn.model_selection import train_test_split
from sklearn import model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier, Pool, cv

# Performance Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

# Display all the columns/rows of the DataFrame.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 0. Loading the required Data

In [2]:
# Import the train and test data.
train = pd.read_csv('Final_train.csv')
test = pd.read_csv('Final_test.csv')

## 1. Model Building

### Algorithms
From here, we will be running the following algorithms.

- Logistic Regression
- KNN
- Naive Bayes
- Stochastic Gradient Descent
- Linear SVC
- Decision Tree
- Gradient Boosted Trees
- Random Forest
- CatBoost Algorithm

In [3]:
train.head()

| | is_promoted | department | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.808933 | -1.536223 | 1.154111 | -0.415276 | -0.284763 | 1.385021 | 0.50046 | 1.356878 | -0.154018 | -1.075931 |
| 1 | 0 | -0.388183 | 0.650947 | -0.885239 | -0.415276 | -0.284763 | 1.385021 | -0.437395 | -0.736986 | -0.154018 | -0.253282 |
| 2 | 0 | 0.808933 | 0.650947 | 1.154111 | -0.415276 | -0.284763 | -0.259125 | 0.265996 | -0.736986 | -0.154018 | -1.001145 |
| 3 | 0 | 0.808933 | 0.650947 | -0.885239 | 1.226063 | 0.718471 | -1.903271 | 0.969387 | -0.736986 | -0.154018 | -1.001145 |
| 4 | 0 | 1.207972 | 0.650947 | -0.885239 | -0.415276 | 1.721705 | -0.259125 | -0.906322 | -0.736986 | -0.154018 | 0.718939 |


In [4]:
X = train.drop(columns= 'is_promoted')
y = train['is_promoted']

### Overcoming Class Imbalance

In [5]:
from imblearn.combine import SMOTETomek
from collections import Counter

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.4, random_state= 42)

In [7]:
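smt= SMOTETomek(random_state= 42)
# NOTE: the resampler is fitted on the full X, y, not on X_train, y_train. The "before" count below
# describes the training split while the "after" count describes the whole dataset, and the
# hold-out rows in X_test also contribute to the resampled training data.
X_train_ns,y_train_ns = smt.fit_resample(X,y)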
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))

The number of classes before fit Counter({0: 30000, 1: 2884})
The number of classes after fit Counter({0: 49560, 1: 49560})
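
The cell above resamples the full `X, y`, so rows from the hold-out split also feed the balanced training data. A leak-free order is to split first and then balance only the training part. The sketch below illustrates that order on toy data. It swaps in plain random oversampling for SMOTETomek to stay dependency-free, and the helper `oversample_minority` and the toy lists are assumptions for illustration only. With imbalanced-learn, the corresponding call would be `fit_resample(X_train, y_train)`.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=42):
    """Duplicate random rows of the smaller classes until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = rng.choices(group, k=target - len(group))  # empty list for the largest class
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (hypothetical): every fifth row is a positive.
rows = [[i] for i in range(20)]
labels = [1 if i % 5 == 0 else 0 for i in range(20)]

# Split first ...
train_rows, test_rows = rows[:12], rows[12:]
train_labels, test_labels = labels[:12], labels[12:]

# ... then balance the training part only; the hold-out part is left untouched.
balanced_rows, balanced_labels = oversample_minority(train_rows, train_labels)

print("train before:", Counter(train_labels))
print("train after: ", Counter(balanced_labels))
print("hold-out:    ", Counter(test_labels))
```

Because the hold-out rows never pass through the resampler, scores measured on them reflect the original class balance.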


For each model, we follow 3 main steps:

- Fit the model and compute its accuracy (accuracy score) on the training data.
- Perform K-Fold Cross Validation (K needs to be specified; we use K = 10).
- Compute the accuracy of the cross-validated predictions.

**We will be running a whole bunch of models to figure out which model is best suited for our data.**
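
To make the K-Fold step concrete, the sketch below shows how the sample indices are divided so that every row is predicted exactly once by a model that did not train on it. It is a simplified, unshuffled illustration in plain Python; the function names are assumptions. For classifiers, scikit-learn's `cross_val_predict` with an integer `cv` uses stratified folds, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_splits(n_samples, k):
    """Yield (train_indices, validation_indices) pairs, one pair per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train_idx, val_idx in cross_val_splits(n_samples=10, k=5):
    print("train:", train_idx, "validate:", val_idx)
```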

#### Model 1: Logistic Regression

In [9]:
start_time = time.time()
algorithm = LogisticRegression()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
log_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
log_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
log_acc_cv = round(metrics.accuracy_score(y_train_ns, log_train_pred)*100, 2)

log_pre_cv = precision_score(y_train_ns, log_train_pred)
log_rec_cv = recall_score(y_train_ns, log_train_pred)
log_f1_cv = f1_score(y_train_ns, log_train_pred)

log_time = (time.time()- start_time)

In [10]:
# Logistic Regression
print('Accuracy of the model is: ', log_acc)
print('Accuracy of 10-Fold CV is: ', log_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= log_time))

print('Precision: ', log_pre_cv)
print('Recall: ', log_rec_cv)
print('F1-Score: ', log_f1_cv)


Accuracy of the model is:  72.07
Accuracy of 10-Fold CV is:  72.08
Running time is:  0:00:03.472665
Precision:  0.7224639154299655
Recall:  0.7170702179176756
F1-Score:  0.7197569620253165
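
The fit, cross-validate and score pattern used for Logistic Regression above is repeated for every model that follows. A minimal sketch of how it could be wrapped in one function is shown below. The helper `evaluate_model`, its return keys and the synthetic data from `make_classification` are assumptions for illustration, not part of the original notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict


def evaluate_model(estimator, X, y, cv=10, n_jobs=None):
    """Fit on all of (X, y), then score cross-validated predictions on the same data."""
    start = time.time()

    # Step 1: fit the model and measure its accuracy on the training data.
    fitted = estimator.fit(X, y)
    train_acc = round(fitted.score(X, y) * 100, 2)

    # Step 2: out-of-fold predictions from K-fold cross validation.
    cv_pred = cross_val_predict(estimator, X, y, cv=cv, n_jobs=n_jobs)

    # Step 3: metrics computed on the cross-validated predictions.
    return {
        "train_acc": train_acc,
        "cv_acc": round(accuracy_score(y, cv_pred) * 100, 2),
        "precision": precision_score(y, cv_pred),
        "recall": recall_score(y, cv_pred),
        "f1": f1_score(y, cv_pred),
        "seconds": time.time() - start,
    }


# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
results = evaluate_model(LogisticRegression(), X_demo, y_demo, cv=5)
print(results)
```

In the notebook itself, the call would take `X_train_ns, y_train_ns` and one of the estimators listed earlier, which would replace each per-model cell with a single line.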


#### Model 2: K-Nearest Neighbours

In [12]:
start_time = time.time()
algorithm = KNeighborsClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
knn_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
knn_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
knn_acc_cv = round(metrics.accuracy_score(y_train_ns, knn_train_pred)*100, 2)

knn_pre_cv = precision_score(y_train_ns, knn_train_pred)
knn_rec_cv = recall_score(y_train_ns, knn_train_pred)
knn_f1_cv = f1_score(y_train_ns, knn_train_pred)

knn_time = (time.time()- start_time)

In [13]:
# K-Nearest Neighbours
print('Accuracy of the model is: ', knn_acc)
print('Accuracy of 10-Fold CV is: ', knn_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= knn_time))

print('Precision: ', knn_pre_cv)
print('Recall: ', knn_rec_cv)
print('F1-Score: ', knn_f1_cv)

Accuracy of the model is:  93.83
Accuracy of 10-Fold CV is:  90.36
Running time is:  0:11:38.534342
Precision:  0.8700921437294157
Recall:  0.9488498789346247
F1-Score:  0.907765959500415


#### Model 3: Gaussian Naive Bayes

In [14]:
start_time = time.time()
algorithm = GaussianNB()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
gnb_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
gnb_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
gnb_acc_cv = round(metrics.accuracy_score(y_train_ns, gnb_train_pred)*100, 2)

gnb_pre_cv = precision_score(y_train_ns, gnb_train_pred)
gnb_rec_cv = recall_score(y_train_ns, gnb_train_pred)
gnb_f1_cv = f1_score(y_train_ns, gnb_train_pred)

gnb_time = (time.time()- start_time)

In [15]:
# Gaussian Naive Bayes
print('Accuracy of the model is: ', gnb_acc)
print('Accuracy of 10-Fold CV is: ', gnb_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= gnb_time))

print('Precision: ', gnb_pre_cv)
print('Recall: ', gnb_rec_cv)
print('F1-Score: ', gnb_f1_cv)

Accuracy of the model is:  67.72
Accuracy of 10-Fold CV is:  67.73
Running time is:  0:00:03.415667
Precision:  0.7479959354183132
Recall:  0.5347054075867635
F1-Score:  0.623617451875559


#### Model 4: Linear Support Vector Machines (SVC)

In [16]:
start_time = time.time()
algorithm = LinearSVC()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
svc_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
svc_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
svc_acc_cv = round(metrics.accuracy_score(y_train_ns, svc_train_pred)*100, 2)

svc_pre_cv = precision_score(y_train_ns, svc_train_pred)
svc_rec_cv = recall_score(y_train_ns, svc_train_pred)
svc_f1_cv = f1_score(y_train_ns, svc_train_pred)

svc_time = (time.time()- start_time)

In [17]:
# Linear Support Vector Machines
print('Accuracy of the model is: ', svc_acc)
print('Accuracy of 10-Fold CV is: ', svc_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= svc_time))

print('Precision: ', svc_pre_cv)
print('Recall: ', svc_rec_cv)
print('F1-Score: ', svc_f1_cv)

Accuracy of the model is:  72.47
Accuracy of 10-Fold CV is:  72.5
Running time is:  0:06:21.635300
Precision:  0.7237932764676367
Recall:  0.7276634382566586
F1-Score:  0.7257231976656437


#### Model 5: Stochastic Gradient Descent

In [18]:
start_time = time.time()
algorithm = SGDClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
SGD_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
SGD_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
SGD_acc_cv = round(metrics.accuracy_score(y_train_ns, SGD_train_pred)*100, 2)

SGD_pre_cv = precision_score(y_train_ns, SGD_train_pred)
SGD_rec_cv = recall_score(y_train_ns, SGD_train_pred)
SGD_f1_cv = f1_score(y_train_ns, SGD_train_pred)

SGD_time = (time.time()- start_time)

In [19]:
# Stochastic Gradient Descent
print('Accuracy of the model is: ', SGD_acc)
print('Accuracy of 10-Fold CV is: ', SGD_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= SGD_time))

print('Precision: ', SGD_pre_cv)
print('Recall: ', SGD_rec_cv)
print('F1-Score: ', SGD_f1_cv)

Accuracy of the model is:  71.47
Accuracy of 10-Fold CV is:  72.22
Running time is:  0:00:07.511801
Precision:  0.7059032517436751
Recall:  0.7617231638418079
F1-Score:  0.732751676549656


#### Model 6: Decision Tree Classifier

In [20]:
start_time = time.time()
algorithm = DecisionTreeClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
dt_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
dt_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
dt_acc_cv = round(metrics.accuracy_score(y_train_ns, dt_train_pred)*100, 2)

dt_pre_cv = precision_score(y_train_ns, dt_train_pred)
dt_rec_cv = recall_score(y_train_ns, dt_train_pred)
dt_f1_cv = f1_score(y_train_ns, dt_train_pred)

dt_time = (time.time()- start_time)

In [21]:
#  Decision Tree Classifier
print('Accuracy of the model is: ', dt_acc)
print('Accuracy of 10-Fold CV is: ', dt_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= dt_time))

print('Precision: ', dt_pre_cv)
print('Recall: ', dt_rec_cv)
print('F1-Score: ', dt_f1_cv)

Accuracy of the model is:  98.87
Accuracy of 10-Fold CV is:  94.06
Running time is:  0:00:07.487040
Precision:  0.9478865440226419
Recall:  0.9325665859564165
F1-Score:  0.9401641595215575


#### Model 7: Gradient Boost Trees

In [22]:
start_time = time.time()
algorithm = GradientBoostingClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
gbt_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
gbt_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
gbt_acc_cv = round(metrics.accuracy_score(y_train_ns, gbt_train_pred)*100, 2)

gbt_pre_cv = precision_score(y_train_ns, gbt_train_pred)
gbt_rec_cv = recall_score(y_train_ns, gbt_train_pred)
gbt_f1_cv = f1_score(y_train_ns, gbt_train_pred)

gbt_time = (time.time()- start_time)

In [23]:
# Gradient Boost Trees
print('Accuracy of the model is: ', gbt_acc)
print('Accuracy of 10-Fold CV is: ', gbt_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= gbt_time))

print('Precision: ', gbt_pre_cv)
print('Recall: ', gbt_rec_cv)
print('F1-Score: ', gbt_f1_cv)

Accuracy of the model is:  88.31
Accuracy of 10-Fold CV is:  87.62
Running time is:  0:02:34.256252
Precision:  0.8428505747126437
Recall:  0.9247376916868443
F1-Score:  0.881897339683456


#### Model 8: Random Forest


In [24]:
start_time = time.time()
algorithm = RandomForestClassifier()

## Step 1:
model = algorithm.fit(X_train_ns,y_train_ns)      # Creating the model. We will fit the algorithm to the training data.
rf_acc = round(model.score(X_train_ns,y_train_ns)*100, 2)

## Step 2:  --> This code performs Cross Validation automatically.
rf_train_pred = model_selection.cross_val_predict(algorithm, X_train_ns,y_train_ns, cv= 10, n_jobs= -1)

## Step 3:  --> Cross Validation accuracy metric.
rf_acc_cv = round(metrics.accuracy_score(y_train_ns, rf_train_pred)*100, 2)

rf_pre_cv = precision_score(y_train_ns, rf_train_pred)
rf_rec_cv = recall_score(y_train_ns, rf_train_pred)
rf_f1_cv = f1_score(y_train_ns, rf_train_pred)

rf_time = (time.time()- start_time)

In [25]:
print('Accuracy of the model is: ', rf_acc)
print('Accuracy of 10-Fold CV is: ', rf_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= rf_time))

print('Precision: ', rf_pre_cv)
print('Recall: ', rf_rec_cv)
print('F1-Score: ', rf_f1_cv)

Accuracy of the model is:  98.87
Accuracy of 10-Fold CV is:  95.42
Running time is:  0:04:01.479676
Precision:  0.9598929388689114
Recall:  0.9479620661824052
F1-Score:  0.9538901973523918


#### Model 9: CatBoost Algorithm

CatBoost was created by Yandex, a Russian company, as an in-house algorithm and has since been open sourced.

- CatBoost is a state-of-the-art open source gradient boosting on decision trees library.
- It is simple and easy to use. 

For more details --> https://catboost.ai/

In [26]:
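# Define the categorical features for CatBoost model (every column that is not a float).
# `np.float` was removed in NumPy 1.24; the builtin `float` is equivalent here.
cat_features = np.where(X_train_ns.dtypes != float)[0]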
cat_features

array([], dtype=int64)

In [27]:
# We will use CatBoost Pool() function to pool together the training data and the categorical labels
train_pool = Pool(X_train_ns, y_train_ns, cat_features)

In [28]:
# CatBoost Model definition
catboost_model = CatBoostClassifier(iterations= 100, custom_loss= ['Accuracy'], loss_function= 'Logloss')

# Fit CatBoost model
catboost_model.fit(train_pool, plot= False)

# CatBoost accuracy
catboost_acc = round(catboost_model.score(X_train_ns, y_train_ns)*100, 2)

Learning rate set to 0.5
0:	learn: 0.5541546	total: 235ms	remaining: 23.3s
1:	learn: 0.4934685	total: 314ms	remaining: 15.4s
2:	learn: 0.4424429	total: 366ms	remaining: 11.8s
3:	learn: 0.4113296	total: 422ms	remaining: 10.1s
4:	learn: 0.3937579	total: 499ms	remaining: 9.48s
5:	learn: 0.3786337	total: 579ms	remaining: 9.08s
6:	learn: 0.3552970	total: 627ms	remaining: 8.33s
7:	learn: 0.3465555	total: 674ms	remaining: 7.75s
8:	learn: 0.3391982	total: 726ms	remaining: 7.34s
9:	learn: 0.3280376	total: 810ms	remaining: 7.29s
10:	learn: 0.3131784	total: 860ms	remaining: 6.96s
11:	learn: 0.2994968	total: 905ms	remaining: 6.64s
12:	learn: 0.2936664	total: 973ms	remaining: 6.51s
13:	learn: 0.2845793	total: 1.09s	remaining: 6.7s
14:	learn: 0.2786553	total: 1.17s	remaining: 6.61s
15:	learn: 0.2755087	total: 1.32s	remaining: 6.93s
16:	learn: 0.2678949	total: 1.49s	remaining: 7.3s
17:	learn: 0.2646111	total: 1.58s	remaining: 7.21s
18:	learn: 0.2605414	total: 1.66s	remaining: 7.09s
19:	learn: 0.25729

In [29]:
# CatBoost Cross Validation
start_time = time.time()

# Use the same parameters for cross validation as for the initial model
cv_param = catboost_model.get_params()

# Run 10-fold CV
cv_data = cv(train_pool, cv_param, fold_count= 10, plot= False)

# How long does it take?
catboost_time = (time.time()- start_time)

# CatBoost returns the CV results as a dataframe; take the maximum mean test accuracy across iterations
catboost_acc_cv = round(np.max(cv_data['test-Accuracy-mean'])*100, 2)

0:	learn: 0.6735507	test: 0.6735745	best: 0.6735745 (0)
1:	learn: 0.6529239	test: 0.6529738	best: 0.6529738 (1)
2:	learn: 0.6376420	test: 0.6377065	best: 0.6377065 (2)
3:	learn: 0.6251267	test: 0.6251969	best: 0.6251969 (3)	total: 5.62s	remaining: 2m 14s
4:	learn: 0.6113096	test: 0.6114020	best: 0.6114020 (4)
5:	learn: 0.5995553	test: 0.5996653	best: 0.5996653 (5)
6:	learn: 0.5855851	test: 0.5857072	best: 0.5857072 (6)	total: 8.99s	remaining: 1m 59s
7:	learn: 0.5736767	test: 0.5738188	best: 0.5738188 (7)
8:	learn: 0.5615577	test: 0.5617286	best: 0.5617286 (8)
9:	learn: 0.5508472	test: 0.5510363	best: 0.5510363 (9)
10:	learn: 0.5399741	test: 0.5401680	best: 0.5401680 (10)	total: 13.2s	remaining: 1m 46s
11:	learn: 0.5321353	test: 0.5323736	best: 0.5323736 (11)
12:	learn: 0.5214601	test: 0.5217416	best: 0.5217416 (12)
13:	learn: 0.5135052	test: 0.5138169	best: 0.5138169 (13)
14:	learn: 0.5066545	test: 0.5069547	best: 0.5069547 (14)	total: 17.5s	remaining: 1m 39s
15:	learn: 0.5006904	test:

In [30]:
#  CatBoost Algorithm
print('Accuracy of the model is: ', catboost_acc)
print('Accuracy of 10-Fold CV is: ', catboost_acc_cv)
print('Running time is: ', datetime.timedelta(seconds= catboost_time))

Accuracy of the model is:  95.91
Accuracy of 10-Fold CV is:  86.64
Running time is:  0:01:37.487928


### Model Results

Now let's see which model has the best cross-validation accuracy.

- <b>NOTE:</b> We care more about the cross-validation accuracy, because the training accuracy is measured on data the model has already seen and so tends to be optimistic.

In [32]:
cv_models = pd.DataFrame({'Model':['Logistic Regression', 'K-Nearest Neighbours', 'Gaussian Naive Bayes', 
                                'Linear Support Vector Machines (SVC)', 'Stochastic Gradient Descent', 
                                'Decision Tree Classifier', 'Gradient Boost Trees', 'Random Forest', 'CatBoost Algorithm'],
                      'Score':[log_acc_cv, knn_acc_cv, gnb_acc_cv, svc_acc_cv, SGD_acc_cv, dt_acc_cv, gbt_acc_cv, rf_acc_cv, 
                               catboost_acc_cv]})

print('-----Cross-Validation Accuracy Scores-----')
cv_models.nlargest(9,'Score')

-----Cross-Validation Accuracy Scores-----


| | Model | Score |
|---|---|---|
| 7 | Random Forest | 95.42 |
| 5 | Decision Tree Classifier | 94.06 |
| 1 | K-Nearest Neighbours | 90.36 |
| 6 | Gradient Boost Trees | 87.62 |
| 8 | CatBoost Algorithm | 86.64 |
| 3 | Linear Support Vector Machines (SVC) | 72.5 |
| 4 | Stochastic Gradient Descent | 72.22 |
| 0 | Logistic Regression | 72.08 |
| 2 | Gaussian Naive Bayes | 67.73 |


### Precision and Recall

Precision and Recall are especially useful for imbalanced classification problems, where accuracy alone can look good even if the minority class is never predicted.

- Recall - measures a model's ability to find all relevant cases in a dataset.
- Precision - measures a model's ability to identify only the relevant cases.

Combining Precision and Recall as their harmonic mean gives us the **F1 score.**

All three fall between 0 and 1, with 1 being best.
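
As a worked illustration of these definitions, the sketch below computes the three metrics from confusion-matrix counts. It then applies them to a majority-class baseline using the training-split counts printed earlier (30000 negatives, 2884 positives) and checks that the Random Forest F1-Score reported above is the harmonic mean of its precision and recall. The function name `precision_recall_f1` is an assumption for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # no predicted positives -> defined as 0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # no actual positives -> defined as 0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A baseline that predicts "not promoted" for everyone in the training split.
negatives, positives = 30000, 2884
baseline_accuracy = negatives / (negatives + positives)
baseline_p, baseline_r, baseline_f1 = precision_recall_f1(tp=0, fp=0, fn=positives)
print(f"accuracy={baseline_accuracy:.4f} precision={baseline_p} recall={baseline_r} f1={baseline_f1}")

# The Random Forest cross-validation scores reported earlier.
rf_precision, rf_recall = 0.9598929388689114, 0.9479620661824052
rf_f1 = 2 * rf_precision * rf_recall / (rf_precision + rf_recall)
print("Random Forest F1 from precision and recall:", rf_f1)
```

The baseline reaches roughly 91% accuracy while finding no promoted employee at all, which is why accuracy on its own is not enough for this problem.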

In [33]:
f1_cv_models = pd.DataFrame({'Model':['Logistic Regression', 'K-Nearest Neighbours', 'Gaussian Naive Bayes', 
                                'Linear Support Vector Machines (SVC)', 'Stochastic Gradient Descent', 
                                'Decision Tree Classifier', 'Gradient Boost Trees', 'Random Forest'],
                      'F1-Score':[log_f1_cv, knn_f1_cv, gnb_f1_cv, svc_f1_cv, SGD_f1_cv, dt_f1_cv, gbt_f1_cv, rf_f1_cv]})

print('-----Cross-Validation F1 Scores-----')
f1_cv_models.nlargest(8,'F1-Score')

-----Cross-Validation F1 Scores-----


| | Model | F1-Score |
|---|---|---|
| 7 | Random Forest | 0.95389 |
| 5 | Decision Tree Classifier | 0.940164 |
| 1 | K-Nearest Neighbours | 0.907766 |
| 6 | Gradient Boost Trees | 0.881897 |
| 4 | Stochastic Gradient Descent | 0.732752 |
| 3 | Linear Support Vector Machines (SVC) | 0.725723 |
| 0 | Logistic Regression | 0.719757 |
| 2 | Gaussian Naive Bayes | 0.623617 |


In [34]:
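# Named `metric_names` so the list does not shadow the `sklearn.metrics` module imported as `metrics`.
metric_names = ['Precision', 'Recall', 'F1', 'AUC']

# These metrics are computed on the training pool, not on held-out data.
eval_metrics = catboost_model.eval_metrics(train_pool, metrics= metric_names, plot= False)

# Average each metric over the values returned by eval_metrics.
for metric in metric_names:
    print(str(metric)+ ': {}' .format(np.mean(eval_metrics[metric])))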

Precision: 0.9324237850221612
Recall: 0.9235191686844231
F1: 0.9265853789951105
AUC: 0.9769019482035128


> **Recall = TP/(TP + FN)**
- Recall is fairly high, which means there are relatively few False Negatives (predicting 'Not promoted' when the employee was actually 'Promoted').

> **Precision = TP/(TP + FP)**
- Precision is also high, so there are relatively few False Positives (predicting 'Promoted' when the employee was actually 'Not promoted').

### Prediction

Random Forest has the highest cross-validation accuracy score, so we use it to make predictions on the hold-out split (`X_test`, `y_test`) created in `In [6]`.

We want to make predictions on the same columns our model was trained on, so we select that subset of columns from the hold-out dataframe before calling the model.

Note that the resampler was fitted on the full `X, y`, which includes the hold-out rows, so the scores below are likely optimistic.

In [36]:
# Create a list of columns to be used for predictions.
wanted_columns = X_train.columns
wanted_columns

Index(['department', 'gender', 'recruitment_channel', 'no_of_trainings', 'age',
       'previous_year_rating', 'length_of_service', 'KPIs_met >80%',
       'awards_won?', 'avg_training_score'],
      dtype='object')

In [39]:
# Make predictions on the hold-out split using the Random Forest model.
# `algorithm` still refers to the RandomForestClassifier fitted in In [24].
predictions = algorithm.predict(X_test[wanted_columns])

In [40]:
#  RandomForest Algorithm
print('Accuracy of the model is: ', accuracy_score(y_test, predictions))
print('Precision: ', precision_score(y_test, predictions))
print('Recall: ', recall_score(y_test, predictions))
print('F1: ', f1_score(y_test, predictions))

Accuracy of the model is:  0.9763729246487867
Precision:  0.8540268456375839
Recall:  0.8559417040358744
F1:  0.8549832026875701


## 2. Prediction on the Test dataset

We now use the same Random Forest model to make predictions on the competition test dataset (`Final_test.csv`).

Again, we want to predict on the same columns the model was trained on, so we drop the extra `education` column from the test dataframe and select the remaining columns in training order.
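
A quick way to confirm that the test data carries exactly the training columns is to compare the two column lists before predicting. The sketch below uses plain lists; the helper `compare_columns` and the position of `education` in `available` are assumptions for illustration, while the column names are taken from the `wanted_columns` output above.

```python
def compare_columns(wanted, available):
    """Return (missing, extra): wanted columns that are absent, and available columns not wanted."""
    missing = [col for col in wanted if col not in available]
    extra = [col for col in available if col not in wanted]
    return missing, extra


wanted = ['department', 'gender', 'recruitment_channel', 'no_of_trainings', 'age',
          'previous_year_rating', 'length_of_service', 'KPIs_met >80%',
          'awards_won?', 'avg_training_score']

# Hypothetical layout of the test file before `education` is dropped.
available = wanted[:3] + ['education'] + wanted[3:]

missing, extra = compare_columns(wanted, available)
print("missing:", missing)
print("extra:  ", extra)
```

An empty `missing` list means indexing the dataframe with `wanted_columns` will succeed, and that indexing also puts the columns in the order the model was trained on.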

In [41]:
test.drop(columns= 'education', inplace= True)
test.head()

| | department | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.206058 | 0.644516 | 1.154134 | -0.423094 | -1.283525 | -0.266732 | -1.1432 | 1.336715 | -0.152665 | 1.024263 |
| 1 | -1.180154 | -1.551551 | -0.883722 | -0.423094 | -0.282097 | -0.266732 | -0.19259 | -0.748103 | -0.152665 | -0.914377 |
| 2 | 0.808356 | 0.644516 | -0.883722 | -0.423094 | -0.282097 | -1.907786 | -0.430243 | -0.748103 | -0.152665 | -1.212629 |
| 3 | 0.012952 | -1.551551 | -0.883722 | 2.905264 | -0.282097 | -1.087259 | 0.758019 | -0.748103 | -0.152665 | 0.129506 |
| 4 | -1.577856 | 0.644516 | 1.154134 | -0.423094 | -0.282097 | 0.553794 | 0.282714 | -0.748103 | -0.152665 | -0.168746 |


In [42]:
# Make predictions using RandomForest model on wanted columns.
predictions = algorithm.predict(test[wanted_columns])

In [43]:
# Our predictions array is comprised of 0's and 1's.
predictions[:20]

array([1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
      dtype=int64)

In [44]:
pd.set_option('display.max_rows',100)
df = pd.read_csv('test.csv')

# Create a dataframe and append the relevant columns.
submission = pd.DataFrame()
submission['employee_id'] = df['employee_id']
submission['is_promoted'] = predictions
submission.head()

| | employee_id | is_promoted |
|---|---|---|
| 0 | 8724 | 1 |
| 1 | 74430 | 0 |
| 2 | 72255 | 0 |
| 3 | 38562 | 0 |
| 4 | 64486 | 0 |


In [47]:
submission['is_promoted'].value_counts()

1    13338
0    10152
Name: is_promoted, dtype: int64

In [45]:
# Are our test and submission the same length?
if len(submission) == len(test):
    print('The submission and the test dataframes are of the same length')
else:
    print('Dataframes mismatched')

The submission and the test dataframes are of the same length


In [46]:
# convert submission dataframe to csv.
submission.to_csv('HR_Analytics.csv', index= False)
print('Submission csv is ready')

Submission csv is ready
