
**Instructions**

Introduction to Machine Learning

**Project Deliverables**

You will be required to submit:

● A GitHub repository with your project written in Python or R.

**Background Information**

HR analytics is revolutionising the way human resources departments operate, leading
to higher efficiency and better results overall. Human resources have been using
analytics for years. However, the collection, processing, and analysis of data have been
largely manual, and given the nature of human resources dynamics and HR KPIs, the
approach has been constraining HR. Therefore, it is surprising that HR departments
woke up to the utility of machine learning so late in the game.

**Problem Statement**

Your client is a large multinational corporation with nine broad verticals
across the organization. One of the problems your client faces is identifying the right
people for promotion (only for the manager position and below) and preparing them in
time.

Currently, the process they follow is:

● They first identify a set of employees based on recommendations and past
performance.
● Selected employees go through a separate training and evaluation program for
each vertical.
● These programs are based on the required skills of each vertical. At the end of the
program, based on various factors such as training performance and KPI completion
(only employees with KPIs completed greater than 60% are considered), the
employee gets a promotion.

Under this process, the final promotions are only announced after the
evaluation, which delays the transition to new roles. Hence, the company
needs your help in identifying the eligible candidates at a particular checkpoint so that
they can expedite the entire promotion cycle.

They have provided multiple attributes around employees’ past and current performance
along with demographics. The task is to predict whether a potential promotee at a
checkpoint will be promoted or not after the evaluation process.

**Dataset**

● Dataset URL: https://bit.ly/2ODZvLCHRDataset
● Glossary URL: https://bit.ly/2Wz3sWcGlossary


In [2]:
# Pre-requisites
# import libraries
import pandas as pd
# import the decision tree classifier from scikit-learn
from sklearn.tree import DecisionTreeClassifier

# load and preview the training dataset
hr_train = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')
hr_train

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54803,3030,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,0,78,0
54804,74592,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,0,56,0
54805,13918,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,1,0,79,0
54806,13614,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,0,45,0


In [3]:
hr_train['is_promoted'].value_counts()

0    50140
1     4668
Name: is_promoted, dtype: int64
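
The counts above describe a strongly imbalanced target. As a quick check, the sketch below rebuilds the two counts in a small Series (the names `counts`, `shares`, and `promoted_share` are illustrative and not part of the notebook) and derives the share of promoted employees.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 50140, 1: 4668}, name="is_promoted")

# Share of each class in the training data
shares = counts / counts.sum()
promoted_share = shares.loc[1]

print(f"promoted share: {promoted_share:.3f}")  # prints "promoted share: 0.085"
```

On the full frame, `hr_train['is_promoted'].value_counts(normalize=True)` gives the same shares directly. With about 8.5% positives, a model that always predicts 0 would already be about 91.5% accurate, so accuracy alone is a weak yardstick for this target.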

In [4]:
# preview dataset shape
hr_train.shape

(54808, 14)

In [5]:
# preview datatypes
hr_train.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

In [6]:
# identify duplicate records in the data
hr_train.duplicated().sum()

0

In [7]:
# look for missing records
hr_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  KPIs_met >80%         54808 non-null  int64  
 11  awards_won?           54808 non-null  int64  
 12  avg_training_score    54808 non-null  int64  
 13  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.9+ MB


In [8]:
hr_train.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

The output above shows that the education column has 2,409 missing values and the previous_year_rating column has 4,124.

In [9]:
# drop the identifier column, which carries no predictive information
hr_train.drop(['employee_id'], axis=1, inplace=True)

# drop rows with a missing education value
hr_train.dropna(subset=['education'], inplace=True)

# fill missing previous_year_rating values with the column mean
hr_train['previous_year_rating'] = hr_train['previous_year_rating'].fillna(hr_train['previous_year_rating'].mean())
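
The cleaning steps in this cell can be tried on a small frame first. The sketch below uses a made-up four-row frame called `toy` (its values are invented for illustration and do not come from the HR dataset) to show what dropping rows with a missing `education` value and mean-filling `previous_year_rating` do. Assigning the result of `fillna` back to the column is one way to avoid chained-assignment warnings in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hr_train; the values are made up for illustration
toy = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, 3.0, np.nan, 1.0],
})

# Drop the identifier column and the rows with no education value;
# copy() gives an independent frame that is safe to modify
toy = toy.drop(columns=["employee_id"]).dropna(subset=["education"]).copy()

# Mean of the remaining known ratings: (5.0 + 1.0) / 2 = 3.0
rating_mean = toy["previous_year_rating"].mean()

# Assign the filled column back to the frame
toy["previous_year_rating"] = toy["previous_year_rating"].fillna(rating_mean)

print(toy)
```

Note that the mean is computed after the rows with missing `education` are dropped, which matches the order of operations in the cell above.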

In [10]:
hr_train.isna().sum()

department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

In [11]:
hr_train.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


**Data Modeling**

In [12]:
# one-hot encode the categorical columns
hr_train = pd.get_dummies(hr_train)
hr_train.head()

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,department_Analytics,department_Finance,...,region_region_8,region_region_9,education_Bachelor's,education_Below Secondary,education_Master's & above,gender_f,gender_m,recruitment_channel_other,recruitment_channel_referred,recruitment_channel_sourcing
0,1,35,5.0,8,1,0,49,0,0,0,...,0,0,0,0,1,1,0,0,0,1
1,1,30,5.0,4,0,0,60,0,0,0,...,0,0,1,0,0,0,1,1,0,0
2,1,34,3.0,7,0,0,50,0,0,0,...,0,0,1,0,0,0,1,0,0,1
3,2,39,1.0,10,0,0,50,0,0,0,...,0,0,1,0,0,0,1,1,0,0
4,1,45,3.0,2,0,0,73,0,0,0,...,0,0,1,0,0,0,1,1,0,0


In [13]:
# first, split the dataset into a feature matrix and a target vector

features = hr_train.drop(['is_promoted'], axis=1)
target = hr_train['is_promoted']

print(features.shape)
print(target.shape)

(52399, 58)
(52399,)


In [14]:
# create an untrained decision tree classifier and assign it to a variable
model = DecisionTreeClassifier()
model

DecisionTreeClassifier()

In [15]:
# train a model by calling the fit() method
model.fit(features, target)

DecisionTreeClassifier()
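
The tree here is fitted on every training row, so no labelled data is left over to judge how well it generalises. One common option, sketched below on synthetic data rather than the HR dataset, is to hold out a stratified validation split before fitting. The arrays `X` and `y`, the 20% split size, and the choice of accuracy and F1 as metrics are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for `features` and `target` (made-up data, roughly 10% positives)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.4).astype(int)

# Hold out 20% of the rows, keeping the class ratio the same in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training part only, then score on the held-out part
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)

print("accuracy:", accuracy_score(y_val, val_pred))
print("F1 (positive class):", f1_score(y_val, val_pred))
```

In the notebook, the same call would take `features` and `target` in place of `X` and `y`. Reporting F1 next to accuracy is useful because the promoted class is rare.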

In [19]:
# load and preview the test dataset that the model will make predictions on
hr_test = pd.read_csv("/content/test_2umaH9m.csv")
hr_test

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,8724,Technology,region_26,Bachelor's,m,sourcing,1,24,,1,1,0,77
1,74430,HR,region_4,Bachelor's,f,other,1,31,3.0,5,0,0,51
2,72255,Sales & Marketing,region_13,Bachelor's,m,other,1,31,1.0,4,0,0,47
3,38562,Procurement,region_2,Bachelor's,f,other,3,31,2.0,9,0,0,65
4,64486,Finance,region_29,Bachelor's,m,sourcing,1,30,4.0,7,0,0,61
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23485,53478,Legal,region_2,Below Secondary,m,sourcing,1,24,3.0,1,0,0,61
23486,25600,Technology,region_25,Bachelor's,m,sourcing,1,31,3.0,7,0,0,74
23487,45409,HR,region_16,Bachelor's,f,sourcing,1,26,4.0,4,0,0,50
23488,1186,Procurement,region_31,Bachelor's,m,sourcing,3,27,,1,0,0,70


In [20]:
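# apply the same preprocessing as for the training data:
# drop the identifier column and rows with a missing education value
hr_test.drop(['employee_id'], axis=1, inplace=True)

hr_test.dropna(subset=['education'], inplace=True)

# fill missing previous_year_rating values with the column mean, then one-hot encode
hr_test['previous_year_rating'] = hr_test['previous_year_rating'].fillna(hr_test['previous_year_rating'].mean())
hr_test = pd.get_dummies(hr_test)
hr_test.head()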

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Analytics,department_Finance,department_HR,...,region_region_8,region_region_9,education_Bachelor's,education_Below Secondary,education_Master's & above,gender_f,gender_m,recruitment_channel_other,recruitment_channel_referred,recruitment_channel_sourcing
0,1,24,3.34848,1,1,0,77,0,0,0,...,0,0,1,0,0,0,1,0,0,1
1,1,31,3.0,5,0,0,51,0,0,1,...,0,0,1,0,0,1,0,1,0,0
2,1,31,1.0,4,0,0,47,0,0,0,...,0,0,1,0,0,0,1,1,0,0
3,3,31,2.0,9,0,0,65,0,0,0,...,0,0,1,0,0,1,0,1,0,0
4,1,30,4.0,7,0,0,61,0,1,0,...,0,0,1,0,0,0,1,0,0,1
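
Calling `pd.get_dummies` separately on the training and test frames only yields the same columns when both contain the same category values. Here the two frames happen to line up, but that is not guaranteed in general. A minimal sketch of one way to guard against a mismatch, using made-up frames (`train_raw` and `test_raw` are hypothetical names), is to reindex the encoded test frame to the training columns:

```python
import pandas as pd

# Made-up train/test frames; "Legal" appears only in train, "R&D" only in test
train_raw = pd.DataFrame({"age": [30, 41, 25], "department": ["HR", "Legal", "HR"]})
test_raw = pd.DataFrame({"age": [28, 35], "department": ["HR", "R&D"]})

train_enc = pd.get_dummies(train_raw)
test_enc = pd.get_dummies(test_raw)

# Reorder test columns to match train, add missing ones as 0, drop unseen ones
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

In the notebook this would correspond to `hr_test.reindex(columns=features.columns, fill_value=0)`, applied before calling `model.predict`.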


In [21]:
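# work on a copy so that hr_test keeps only the feature columns
new_features = hr_test.copy()

# predict promotions and store them alongside the features
new_features['is_promoted'] = model.predict(hr_test)

new_features.head()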

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Analytics,department_Finance,department_HR,...,region_region_9,education_Bachelor's,education_Below Secondary,education_Master's & above,gender_f,gender_m,recruitment_channel_other,recruitment_channel_referred,recruitment_channel_sourcing,is_promoted
0,1,24,3.34848,1,1,0,77,0,0,0,...,0,1,0,0,0,1,0,0,1,0
1,1,31,3.0,5,0,0,51,0,0,1,...,0,1,0,0,1,0,1,0,0,0
2,1,31,1.0,4,0,0,47,0,0,0,...,0,1,0,0,0,1,1,0,0,0
3,3,31,2.0,9,0,0,65,0,0,0,...,0,1,0,0,1,0,1,0,0,0
4,1,30,4.0,7,0,0,61,0,1,0,...,0,1,0,0,0,1,0,0,1,0


In [22]:
new_features['is_promoted'].value_counts()

0    20406
1     2050
Name: is_promoted, dtype: int64