# **Problem Statement**
Your client is a large multinational corporation with nine broad verticals across the organization. One of the problems the client faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time.
Currently, the process they follow is:
* They first identify a set of employees based on recommendations and past performance.
* Selected employees go through a separate training and evaluation program for each vertical.
* These programs are based on the skills required in each vertical. At the end of the program, an employee is promoted based on factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered).

Under this process, final promotions are announced only after the evaluation, which delays the transition into new roles. Hence, the company needs your help identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.
They have provided multiple attributes covering employees' past and current performance, along with demographics. The task is to predict whether a potential promotee at a checkpoint will be promoted after the evaluation process.


# **Data Importation**

In [None]:
#load dataset
import pandas as pd

data = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')

# **Data Exploration**



In [None]:
data.head()

| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |


In [None]:
#no. of rows and columns
data.shape

(54808, 14)

In [None]:
#datatypes
data.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

# **Data Cleaning**

In [None]:
data.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [None]:
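#drop every row that has at least one missing value
#(only education and previous_year_rating contain nulls)
data.dropna(axis=0, how='any', inplace=True)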
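
To make the effect of `how='any'` concrete, here is a minimal sketch on a toy frame. The values are hypothetical and only mimic the two columns that contain nulls in the real data; the names `toy`, `cleaned` and `rows_dropped` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in the same two columns as the real data
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, 3.0, np.nan, 1.0],
    "is_promoted": [0, 0, 1, 0],
})

rows_before = len(toy)

# how="any" drops a row as soon as one of its columns is missing
cleaned = toy.dropna(axis=0, how="any")
rows_dropped = rows_before - len(cleaned)

print(f"dropped {rows_dropped} of {rows_before} rows")
```

Comparing the row count before and after the drop is a quick way to see how much data the cleaning step costs.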

In [None]:
data.isnull().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

In [None]:
data.duplicated().sum()

0

In [None]:
data['is_promoted'].unique()

array([0, 1])

# **Data Preparation**

In [None]:
from sklearn.model_selection import train_test_split

#split the dataset into features and target
#drop the target and every string-valued column; note that employee_id is kept as a feature
features = data.drop(['is_promoted', 'department', 'region', 'education', 'gender', 'recruitment_channel'], axis=1)
target = data['is_promoted']

features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
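
The cell above discards the string columns entirely. As a possible alternative (not what this notebook does), the categorical columns could be one-hot encoded so the tree can still use them. The sketch below assumes a toy frame with hypothetical values and uses `pd.get_dummies`.

```python
import pandas as pd

# Toy frame (hypothetical values) with two string columns and one numeric column
toy = pd.DataFrame({
    "department": ["Technology", "Operations", "Technology"],
    "gender": ["f", "m", "m"],
    "age": [35, 30, 45],
})

# Replace each listed string column with one indicator column per category
encoded = pd.get_dummies(toy, columns=["department", "gender"])

print(encoded.columns.tolist())
```

A column with many categories, such as `region`, would add one indicator column per category, so the feature matrix could grow considerably.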

# **Data Modelling**

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

model.fit(features_train, target_train)
predictions = model.predict(features_valid)

# **Model Evaluation**

In [None]:
#accuracy before tuning (default tree with no depth limit): 0.8709412248253185
#print("Decision Tree Accuracy: ", model.score(features_valid, target_valid))

#tune max_depth over 1..10; the best validation accuracy is at max_depth = 4
for t in range(1, 11):
    model = DecisionTreeClassifier(max_depth=t, random_state=12345)
    model.fit(features_train, target_train)
    print("Decision Tree Accuracy: ", model.score(features_valid, target_valid))

Decision Tree Accuracy:  0.9219893136046033
Decision Tree Accuracy:  0.9219893136046033
Decision Tree Accuracy:  0.9212494862309906
Decision Tree Accuracy:  0.9241265926839293
Decision Tree Accuracy:  0.9237155774763667
Decision Tree Accuracy:  0.9239621866009042
Decision Tree Accuracy:  0.9223181257706535
Decision Tree Accuracy:  0.9219071105630908
Decision Tree Accuracy:  0.9218249075215783
Decision Tree Accuracy:  0.9212494862309906
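
The ten scores above sit in a narrow band around 0.92, so it helps to record the score per depth and to compare against a trivial baseline. If the promoted class is rare, a model that always predicts the majority class can already score high on accuracy. The sketch below runs on synthetic, imbalanced stand-in data from `make_classification`, not on the HR dataset; the names `scores`, `best_depth` and `baseline_acc` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 90% negatives (assumption, not the HR dataset)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.9, 0.1], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_valid, y_valid)

# Same sweep as above, but keeping each score keyed by its depth
scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    scores[depth] = tree.score(X_valid, y_valid)

# Depth with the highest validation accuracy (first one wins on ties)
best_depth = max(scores, key=scores.get)

print("baseline accuracy:", baseline_acc)
print("best max_depth:", best_depth, "accuracy:", scores[best_depth])
```

On the real data, the same comparison would use `features_train`, `features_valid`, `target_train` and `target_valid` in place of the synthetic arrays.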
