## Problem Statement:

A multinational corporation with nine distinct areas of operation struggles to identify suitable candidates for promotion (at managerial level and below) and to train them in time. Final promotions are announced only after evaluations are complete, which delays the transition to new roles. The company therefore wants help identifying eligible candidates at specific checkpoints so that it can speed up the promotion process as a whole.

The company has historical data comprising demographic, educational, work-experience, and skill details, along with promotion outcomes.


## Solution:

Because the company has historical data with known promotion outcomes, it makes sense to build a machine learning model that learns the traits of employees who were promoted. The goal is to predict, for each employee, whether they should be promoted, so this is a supervised binary classification problem.

## Step 1: Importing the libraries and reading the training and test data

In [1]:
#!pip install --upgrade pip
#!pip install scikit-learn
#!pip install pandas
#!pip install numpy

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
# Import only the metrics used below instead of a wildcard import
from sklearn.metrics import roc_auc_score, f1_score, recall_score
import pandas as pd
import numpy as np

In [2]:
training_data = pd.read_csv('hr_train.csv')
test_data = pd.read_csv('hr_test.csv')

## Step 2: Data Preprocessing

As the employee_id column is an ID column, it serves no purpose in the modeling process. So, we will delete the employee_id column from the training and test data.

In [4]:
del training_data['employee_id']
del test_data['employee_id']

In [6]:
categorical_columns = [c for c in training_data.columns if training_data[c].dtypes=='object']
numeric_columns = [n for n in training_data.columns if n not in categorical_columns]

In [8]:
true_categorical_columns = ['department', 'region', 'education', 'gender', 'recruitment_channel','awards_won?', 
             'previous_year_rating','length_of_service', 'no_of_trainings']
true_numeric_columns = [c for c in training_data.columns if c not in true_categorical_columns]
true_numeric_columns.remove('is_promoted')
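
The split above separates columns that are categorical by storage type from columns that are numeric in storage but are treated as categorical by choice. One way to inform that choice is to look at how many distinct values each column takes. The snippet below is a minimal sketch on a toy frame; the values are made up and only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    "department": ["Sales", "HR", "Sales", "Tech"],
    "no_of_trainings": [1, 2, 1, 1],
    "avg_training_score": [49, 60, 50, 73],
})

# Columns that are not numeric in storage are categorical by type.
non_numeric = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]

# Distinct values per column; a low count on a numeric column suggests
# it may behave more like a category than a continuous quantity.
cardinality = toy.nunique()

print(non_numeric)
print(cardinality)
```

A low distinct count is only a hint; whether a column such as `no_of_trainings` is better modeled as a category remains a judgment call.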

Very few employees (0.1%) have 30 or more years of service. We treat these records as outliers and remove them from the training data before fitting the model.

In [13]:
training_data = training_data[training_data.length_of_service<30]
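
The share of long-tenured employees can be checked before dropping them. The following is a small sketch on made-up tenure values (not the real data), showing how a boolean mask gives both the share and the filtered frame.

```python
import pandas as pd

# Hypothetical tenure values, used only to illustrate the calculation.
toy = pd.DataFrame({"length_of_service": [2, 5, 31, 7, 12, 30, 3, 9, 1, 4]})

# Boolean mask for employees with 30 or more years of service.
long_tenure = toy["length_of_service"] >= 30

# The mean of a boolean mask is the fraction of True values.
share = long_tenure.mean()

# Keep only the rows below the cutoff, as in the cell above.
filtered = toy[~long_tenure]

print(f"share with 30+ years: {share:.1%}, rows kept: {len(filtered)}")
```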

Next, we impute the missing values in the education and previous_year_rating columns, using the most frequent value for education and the mean for previous_year_rating.

In [14]:
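# Separate the label, then stack train and test so both get the same preprocessing
target = training_data.pop('is_promoted')
all_data = pd.concat([training_data, test_data], axis=0)
# Fill missing education values with the most frequent category
si_most_frequent = SimpleImputer(strategy='most_frequent')
all_data['education'] = si_most_frequent.fit_transform(all_data.education.values.reshape(-1, 1)).ravel()
# Fill missing ratings with the column mean
si_mean = SimpleImputer(strategy='mean')
all_data['previous_year_rating'] = si_mean.fit_transform(all_data.previous_year_rating.values.reshape(-1, 1)).ravel()
# Confirm that no missing values remain
all_data.isnull().sum()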

department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
awards_won?             0
avg_training_score      0
dtype: int64
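
The cell above fits both imputers on the combined training and test rows. A common alternative, shown here only as a sketch and not as what this notebook does, is to fit each imputer on the training rows and reuse the learned fill values on the test rows, so that no test information enters preprocessing. The two small frames and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical rows; only the column names mirror the dataset.
train = pd.DataFrame({
    "education": pd.Series(["Bachelor's", np.nan, "Bachelor's", "Master's & above"], dtype="object"),
    "previous_year_rating": [3.0, np.nan, 5.0, 4.0],
})
test = pd.DataFrame({
    "education": pd.Series([np.nan, "Master's & above"], dtype="object"),
    "previous_year_rating": [np.nan, 2.0],
})

edu_imputer = SimpleImputer(strategy="most_frequent")
rating_imputer = SimpleImputer(strategy="mean")

# Learn the fill values from the training rows only ...
train["education"] = edu_imputer.fit_transform(train[["education"]]).ravel()
train["previous_year_rating"] = rating_imputer.fit_transform(train[["previous_year_rating"]]).ravel()

# ... and apply the same fill values to the test rows.
test["education"] = edu_imputer.transform(test[["education"]]).ravel()
test["previous_year_rating"] = rating_imputer.transform(test[["previous_year_rating"]]).ravel()

print(train)
print(test)
```

In this toy example the test rating is filled with the training mean (4.0) rather than a mean that includes test values.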

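Next, we convert the categorical features into dummy (one-hot) columns. Encoding the combined frame and then splitting it back guarantees that the training and test data end up with the same columns.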

In [15]:
all_data_dummy = pd.get_dummies(all_data)
new_training_data = all_data_dummy.iloc[:len(training_data)]
new_test_data = all_data_dummy.iloc[len(training_data):]
new_training_data.shape, new_test_data.shape

((54752, 57), (23490, 57))
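
To see why the dummies are built on the combined frame, consider a toy case in which the test set contains a category that never appears in training. This is a minimal sketch with made-up values; encoding the stacked frame keeps both parts aligned on the same columns.

```python
import pandas as pd

# Hypothetical rows: 'region_3' appears only in the test frame.
train = pd.DataFrame({"gender": ["m", "f"], "region": ["region_1", "region_2"]})
test = pd.DataFrame({"gender": ["f"], "region": ["region_3"]})

# Encode the stacked frame, then split it back by position.
combined = pd.concat([train, test], axis=0)
dummies = pd.get_dummies(combined)
train_d = dummies.iloc[:len(train)]
test_d = dummies.iloc[len(train):]

# Encoding the test frame on its own would give a different column set.
test_alone = pd.get_dummies(test)

print(train_d.shape, test_d.shape, test_alone.shape)
```

Here both splits share five dummy columns, whereas the test frame encoded on its own has only two.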

We then standardize the features with StandardScaler. The scaler is fitted on the training data only, and the same transformation is applied to the test data.

In [16]:
ss = StandardScaler()
new_training_data = ss.fit_transform(new_training_data)
new_test_data = ss.transform(new_test_data)
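
As a quick illustration of what the scaler does, the sketch below checks on a tiny made-up array that `transform` subtracts the training mean and divides by the training standard deviation, column by column. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: one continuous column and one 0/1 dummy column.
X_train = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
X_test = np.array([[2.5, 1.0]])

# Fit on the training rows only, as in the cell above.
scaler = StandardScaler().fit(X_train)
scaled_test = scaler.transform(X_test)

# Manual z-score using the training mean and (population) standard deviation.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(scaled_test, manual)
```

Note that the dummy columns are standardized too, which is what happens in this notebook because the scaler is applied to the full encoded matrix.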

## Step 3: Model Training (Logistic Regression) and Evaluation

We run stratified k-fold cross-validation on the training data, fitting a model on each training split and evaluating it on the held-out fold. Because the purpose of this exercise is to show how AI/ML model outputs can be used for Decision Intelligence, we keep the modeling process simple and do not compare multiple algorithms. Instead, we train a Logistic Regression model with near-default parameters and look at how its results can be interpreted.

In [21]:
scores = []
# Out-of-fold probabilities: each row is scored by a model that did not see it
probs = np.zeros(len(new_training_data))
y_le = target.values
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold_, (train_ind, val_ind) in enumerate(folds.split(new_training_data, y_le)):
    print('fold:', fold_)
    # X_test / y_test hold the validation fold here, not the final test set
    X_tr, X_test = new_training_data[train_ind], new_training_data[val_ind]
    y_tr, y_test = y_le[train_ind], y_le[val_ind]
    clf = LogisticRegression(max_iter=200, random_state=2020)
    clf.fit(X_tr, y_tr)
    # Probability of the positive class (promoted) for the validation fold
    probs[val_ind] = clf.predict_proba(X_test)[:, 1]
    y = clf.predict_proba(X_tr)[:, 1]
    print('train:',roc_auc_score(y_tr, y),'val :' , roc_auc_score(y_test, (probs[val_ind])))
    print(20 * '-')

    scores.append(roc_auc_score(y_test, probs[val_ind]))

print('log reg  roc_auc=  ', round(np.mean(scores)*100,2))
# Turn probabilities into hard labels at the default 0.5 threshold
probs_rnd = np.where(probs > 0.5, 1, 0)
print('F1 Score= ', round(f1_score(target, probs_rnd)*100,2))
print('Recall Score= ', round(recall_score(target, probs_rnd)*100,2))

fold: 0
train: 0.7931202718718571 val : 0.7819824102253672
--------------------
fold: 1
train: 0.7897501709069539 val : 0.79278434937156
--------------------
fold: 2
train: 0.7868061847019592 val : 0.804318671120831
--------------------
fold: 3
train: 0.7930340720726938 val : 0.7815370354855481
--------------------
fold: 4
train: 0.7943494182608088 val : 0.7806564317616108
--------------------
log reg  roc_auc=   78.83
F1 Score=  44.66
Recall Score=  29.24
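
The F1 and recall figures above depend on the 0.5 cutoff applied to the out-of-fold probabilities. One possible refinement, sketched here on a small made-up example rather than on the notebook's data, is to scan a grid of thresholds and keep the one with the best F1. The labels, probabilities, and grid are all hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Hypothetical labels and out-of-fold probabilities (imbalanced, like promotions).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
oof_probs = np.array([0.05, 0.12, 0.22, 0.17, 0.31, 0.46, 0.37, 0.42, 0.81, 0.27])

# Candidate thresholds from 0.10 to 0.90 in steps of 0.05.
thresholds = np.linspace(0.1, 0.9, 17)

# F1 at each threshold; zero_division=0 covers thresholds with no positive predictions.
f1s = [f1_score(y_true, (oof_probs > t).astype(int), zero_division=0)
       for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]

recall_default = recall_score(y_true, (oof_probs > 0.5).astype(int))
recall_best = recall_score(y_true, (oof_probs > best_t).astype(int))

print(f"best threshold: {best_t:.2f}")
print(f"recall at 0.5: {recall_default:.2f}, recall at best threshold: {recall_best:.2f}")
```

In the notebook, the same scan could be run on `target` and `probs`; whether F1 is the right criterion depends on how the HR team weighs missed candidates against false positives.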


**The model reaches a mean cross-validated ROC AUC of 78.83%, which is a reasonable baseline, although recall at the default 0.5 threshold is only 29.24% (F1 44.66%). Additional techniques could improve these figures. Even so, the HR team can already use the model's predicted probabilities to rank employees and create a list of potential promotion candidates.**
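
A shortlist based on probability scores could be built along the following lines. This is a sketch under assumptions: the IDs and probabilities are made up, the column name `promotion_probability` and the shortlist size are hypothetical, and in the notebook the scores would come from something like `clf.predict_proba(new_test_data)[:, 1]`, with `employee_id` saved before it is deleted in Step 2.

```python
import pandas as pd

# Hypothetical scored employees; real scores would come from the fitted model.
scored = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105],
    "promotion_probability": [0.12, 0.81, 0.47, 0.66, 0.05],
})

# Hypothetical shortlist size chosen by the HR team.
top_k = 2

# Rank employees from most to least likely and keep the top_k.
shortlist = (
    scored.sort_values("promotion_probability", ascending=False)
          .head(top_k)
          .reset_index(drop=True)
)

print(shortlist)
```

Ranking by probability avoids committing to a single cutoff, so the shortlist size can follow the number of open positions or the training capacity.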