# 1.  Defining the Question

# a) Specifying the Data Analysis Question

Build a model that predicts whether a potential promotee at a particular checkpoint will be promoted after the evaluation process.

# b) Defining the Metric for Success

We will have achieved our objective if the model can identify the eligible candidates at a particular checkpoint, so that the delay in employees' transition to new roles is reduced.

# c) Understanding the Context

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems the client faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The current process is:

1.  A set of employees is first identified based on recommendations and past performance.
2.  Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills required by each vertical.
3.  At the end of the program, employees are promoted based on various factors such as training performance and KPI completion (only employees with KPIs completed greater than 60% are considered).

In this process, the final promotions are announced only after the evaluation, which delays employees' transition to their new roles. The company therefore needs your help in identifying the eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

The client has provided multiple attributes covering employees' past and current performance, along with demographics. The task is to predict whether a potential promotee at a checkpoint in the test set will be promoted after the evaluation process.

# d) Recording the Experimental Design



1.  Import libraries
2.  Import the data
3.  Perform data cleaning and preparation
4.  Analyze the data
5.  Build and train the model
6.  Draw conclusions



# e) Data Relevance

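The data was relevant for the analysis: it contains performance, training, and demographic attributes for each employee, together with the `is_promoted` outcome.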

# f) Data Preparation

In [None]:
#importing the libraries
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Importing the data

hr_df = pd.read_csv("https://bit.ly/2ODZvLCHRDataset")
#checking the first 5 records
hr_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [None]:
#checking the last 5 records
hr_df.tail()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
54803,3030,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,0,78,0
54804,74592,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,0,56,0
54805,13918,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,1,0,79,0
54806,13614,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,0,45,0
54807,51526,HR,region_22,Bachelor's,m,other,1,27,1.0,5,0,0,49,0


In [None]:
#renaming 'awards_won?' column to get rid of the question mark and make the naming consistent with other columns
hr_df = hr_df.rename({'awards_won?': 'awards_won'},axis=1)
hr_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [None]:
#checking the size of the data
hr_df.shape

(54808, 14)

In [None]:
#checking missing values
hr_df.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won                 0
avg_training_score         0
is_promoted                0
dtype: int64

Education level and previous year rating are critical in determining eligibility for promotion. We will therefore drop records where these values are not known.

In [None]:
#dropping null values 
hr_df = hr_df.dropna()
hr_df.shape

(48660, 14)
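The shapes reported above imply that 54808 - 48660 = 6148 rows were removed, roughly 11% of the data. Before dropping rows, it can help to measure that loss explicitly. The following is a minimal sketch on a small, made-up frame (`toy_df` and its values are assumptions for illustration, not the notebook's data); it restricts `dropna` to the two columns that actually contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the two columns that had missing values
toy_df = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's", None],
    "previous_year_rating": [5.0, 3.0, np.nan, 1.0, 4.0],
    "is_promoted": [0, 0, 1, 0, 0],
})

rows_before = len(toy_df)

# Drop only rows where one of the two critical columns is missing
cleaned_df = toy_df.dropna(subset=["education", "previous_year_rating"])

rows_dropped = rows_before - len(cleaned_df)
share_dropped = rows_dropped / rows_before

print(f"Dropped {rows_dropped} of {rows_before} rows ({share_dropped:.0%})")
```

Applied to `hr_df`, the same pattern would report how much data the cleaning step costs, which is useful context when judging whether dropping or imputing is the better choice.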

In [None]:
#checking the datatypes
hr_df.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won                int64
avg_training_score        int64
is_promoted               int64
dtype: object

In [None]:
#converting previous year rating to int
hr_df['previous_year_rating'] = hr_df['previous_year_rating'].astype(np.int64)
hr_df.dtypes

employee_id              int64
department              object
region                  object
education               object
gender                  object
recruitment_channel     object
no_of_trainings          int64
age                      int64
previous_year_rating     int64
length_of_service        int64
KPIs_met >80%            int64
awards_won               int64
avg_training_score       int64
is_promoted              int64
dtype: object

In [None]:
#checking for duplicates
hr_df.duplicated().sum()

0

There are no duplicate rows in our dataset.

In [None]:
#checking statistical measures of the data
hr_df.describe()

Unnamed: 0,employee_id,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won,avg_training_score,is_promoted
count,48660.0,48660.0,48660.0,48660.0,48660.0,48660.0,48660.0,48660.0,48660.0
mean,39169.271681,1.251993,35.589437,3.337526,6.31157,0.356473,0.02314,63.603309,0.086971
std,22630.461554,0.604994,7.534571,1.257922,4.20476,0.478962,0.15035,13.273502,0.281795
min,1.0,1.0,20.0,1.0,1.0,0.0,0.0,39.0,0.0
25%,19563.5,1.0,30.0,3.0,3.0,0.0,0.0,51.0,0.0
50%,39154.0,1.0,34.0,3.0,5.0,0.0,0.0,60.0,0.0
75%,58788.25,1.0,39.0,4.0,8.0,1.0,0.0,76.0,0.0
max,78298.0,10.0,60.0,5.0,37.0,1.0,1.0,99.0,1.0


In [None]:
hr_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3,2,0,0,73,0


# g) Building and training the model

In [21]:
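# Keep the numeric columns as features; the categorical (object) columns are
# dropped because DecisionTreeClassifier expects numeric input.
# Note: employee_id is an identifier, not a performance attribute, and it is
# still part of the features here.
features = hr_df.drop(['is_promoted','department','region','education','gender','recruitment_channel'], axis=1)
target = hr_df['is_promoted']

model = DecisionTreeClassifier(random_state=12345)

# Fit the tree on the full cleaned dataset
model.fit(features, target)

# Predict on the same rows the model was trained on
predictions = model.predict(features)

print('Predictions:', predictions)
print('Actual:', target.values)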


Predictions: [0 0 0 ... 0 0 0]
Actual: [0 0 0 ... 0 0 0]
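The feature selection above discards `department`, `region`, `education`, `gender`, and `recruitment_channel` because they are text columns. One common alternative is to one-hot encode them so the tree can use them. The sketch below shows the idea with `pd.get_dummies` on a small, made-up frame (`toy_df` and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column
toy_df = pd.DataFrame({
    "department": ["Sales & Marketing", "Operations", "Technology", "Operations"],
    "gender": ["f", "m", "m", "f"],
    "avg_training_score": [49, 60, 73, 56],
    "is_promoted": [0, 0, 1, 0],
})

features = toy_df.drop(columns=["is_promoted"])

# Replace each categorical column with one indicator column per category
encoded = pd.get_dummies(features, columns=["department", "gender"])

print(encoded.columns.tolist())
```

The encoded frame contains only numeric or boolean columns, so it could be passed to `DecisionTreeClassifier.fit` in place of the reduced feature set.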


# h) Conclusion

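We compared the predictions with the actual values, and in the truncated output they match. However, the predictions were made on the same rows the model was trained on, so this comparison does not show how well the model predicts promotions for employees it has not seen. In addition, only about 8.7% of the employees in the cleaned data were promoted (the mean of `is_promoted` in the summary statistics), so a model that always predicted 0 would also look mostly correct.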
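A hold-out evaluation is one way to test this conclusion: train on part of the data and score the model on rows it has never seen. The sketch below is an illustration under stated assumptions, not the notebook's actual result. The data is synthetic (`toy_df` and the rule that generates its `is_promoted` label are made up), and only the column names and `random_state` are borrowed from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for hr_df (assumption: values are random, not real data)
rng = np.random.default_rng(0)
n_rows = 1000
toy_df = pd.DataFrame({
    "avg_training_score": rng.integers(39, 100, size=n_rows),
    "previous_year_rating": rng.integers(1, 6, size=n_rows),
    "awards_won": rng.integers(0, 2, size=n_rows),
})
# Made-up labelling rule, used only so the example has two classes
toy_df["is_promoted"] = (
    (toy_df["avg_training_score"] > 85) & (toy_df["previous_year_rating"] >= 3)
).astype(int)

features = toy_df.drop(columns=["is_promoted"])
target = toy_df["is_promoted"]

# Hold out 25% of the rows; stratify keeps the promotion rate similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=12345
)

model = DecisionTreeClassifier(random_state=12345)
model.fit(X_train, y_train)

# Score the model only on the held-out rows
test_predictions = model.predict(X_test)

print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
```

Because promotions are rare, the precision and recall for the promoted class in the report are usually more informative than overall accuracy.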