
# Introduction to Machine Learning Project

## 1. Defining the Question


### a) Specifying the Data Analysis Question

Predict whether a potential promotee at a checkpoint in the test set will be promoted after the evaluation process.

### b) Defining the Metric for Success

The analysis question will be answered by a model that predicts whether a potential promotee will be promoted after the evaluation process.

### c) Understanding the context 

HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources departments have been using analytics for years. However, the collection, processing, and analysis of data have been largely manual, and given the nature of human resources dynamics and HR KPIs, this approach has been constraining HR. It is therefore surprising that HR departments woke up to the utility of machine learning so late in the game.

Your client is a large multinational corporation with nine broad verticals across the organization. One of the problems your client faces is identifying the right people for promotion (only for the manager position and below) and preparing them in time.

The process they currently follow is:
1. They first identify a set of employees based on recommendations/past performance.
2. Selected employees go through a separate training and evaluation program for each vertical.
3. These programs are based on the required skills of each vertical. At the end of the program, the employee gets a promotion based on various factors such as training performance and KPI completion (only employees with more than 60% of KPIs completed are considered).

Under this process, the final promotions are only announced after the evaluation, which delays the transition to the new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that they can expedite the entire promotion cycle.

They have provided multiple attributes around employees’ past and current performance along with demographics.





### d) Recording the Experimental Design

1. Read in the data from the source so that it is available for analysis
2. Explore the data in order to understand its structure
3. Prepare the data for analysis:
   * Check for and handle missing values
   * Find and remove duplicate records
   * Delete null columns & rows
   * Rename columns
   * Check for uniformity of data in the columns, correcting errors in values and datatypes
4. Model:
   * Define and train the model
   * Make predictions using the model
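
The preparation steps listed above can be sketched as a single helper. This is only an illustration on a tiny made-up frame: the function name `prepare` and the sample values are assumptions and do not come from the notebook.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the preparation steps listed above."""
    out = df.copy()
    # Rename columns: trim whitespace, lowercase, spaces to underscores
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    # Delete columns and rows that are entirely null
    out = out.dropna(axis=1, how="all").dropna(axis=0, how="all")
    # Remove duplicate records
    out = out.drop_duplicates()
    # Handle the remaining missing values (here: drop the affected rows)
    out = out.dropna()
    return out.reset_index(drop=True)


# Made-up data for illustration only
raw = pd.DataFrame({
    " Age ": [30, 30, None, 41],
    "Avg Score": [60, 60, 55, 70],
    "Empty": [None, None, None, None],
})
clean = prepare(raw)
print(clean)
```

On this toy input the all-null column, the duplicate row and the row with a missing age are removed, leaving two rows and two columns.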


### e) Data Relevance

The dataset includes all the employee information needed to answer the research question.


## 2. Reading the Data

In [88]:
# Importing our libraries

import pandas as pd

import numpy as np

In [89]:
# Load the dataset

# Dataset url = https://bit.ly/2ODZvLCHRDataset
hr_df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')

In [90]:
# Checking the first 5 rows of data
hr_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [91]:
# Checking the last 5 rows of data
hr_df.tail()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
54803,3030,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,0,78,0
54804,74592,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,0,56,0
54805,13918,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,1,0,79,0
54806,13614,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,0,45,0
54807,51526,HR,region_22,Bachelor's,m,other,1,27,1.0,5,0,0,49,0


In [92]:
# Determine the size of the dataset
hr_df.shape

(54808, 14)

In [93]:
# Checking datatypes
hr_df.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

In [94]:
# View the features in the dataset
hr_df.columns

Index(['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'KPIs_met >80%', 'awards_won?',
       'avg_training_score', 'is_promoted'],
      dtype='object')

The dataset provided has a total of 54,808 observations and 14 variables. The variables are:
* employee_id - Unique ID for employee
* department - Department of employee
* region - Region of employment (unordered)
* education - Education Level
* gender - Gender of Employee
* recruitment_channel - Channel of recruitment for employee
* no_of_trainings - Number of other trainings completed in the previous year on soft skills, technical skills, etc.
* age - Age of Employee
* previous_year_rating - Employee Rating for the previous year
* length_of_service - Length of service in years
* KPIs_met >80% - 1 if the percentage of KPIs (Key Performance Indicators) met is greater than 80%, else 0
* awards_won? - 1 if the employee won awards during the previous year, else 0
* avg_training_score - Average score in current training evaluations
* is_promoted - (Target) Whether the employee was recommended for promotion

## 3. External Data Source Validation

The dataset was scraped from a contest held by https://www.analyticsvidhya.com/, a data science community, and is therefore considered valid for this analysis.

## 4. Data Preparation

### Performing Data Cleaning

In [95]:
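# Standardize column names - strip whitespaces, convert to lowercase, replace ' ' with '_' and remove '?'
# regex=False treats ' ' and '?' as literal characters ('?' is a special character in regular expressions)
hr_df.columns = hr_df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('?', '', regex=False)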
list(hr_df.columns)

['employee_id',
 'department',
 'region',
 'education',
 'gender',
 'recruitment_channel',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'kpis_met_>80%',
 'awards_won',
 'avg_training_score',
 'is_promoted']

In [96]:
# Checking how many duplicate rows are there in the data

sum(hr_df.duplicated())

0

In [97]:
# Checking if any of the rows are all null (axis=1 checks across the columns of each row)

hr_df.isnull().all(axis=1).any()

False

In [98]:
# Checking if any of the columns are all null (axis=0 checks down the rows of each column)

hr_df.isnull().all(axis=0).any()

False
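
The `axis` argument is easy to mix up in these two checks, so here is a small sketch on a made-up frame (the name `toy` and its values are illustrative assumptions). One column is entirely null, but no row is.

```python
import numpy as np
import pandas as pd

# Made-up frame: column "b" is entirely null, but no row is entirely null
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, np.nan, np.nan],
})

# axis=1 reduces across the columns of each row: one flag per row
any_all_null_row = toy.isnull().all(axis=1).any()
# axis=0 reduces down the rows of each column: one flag per column
any_all_null_col = toy.isnull().all(axis=0).any()

print(any_all_null_row, any_all_null_col)
```

Here the row check is false and the column check is true, which shows that `axis=1` answers the question about rows and `axis=0` the question about columns.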

In [99]:
# Check for missing values

print(hr_df.isnull().values.any())

hr_df.isnull().sum()

True


employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
kpis_met_>80%              0
awards_won                 0
avg_training_score         0
is_promoted                0
dtype: int64

In [100]:
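# Drop observations with missing values
# .copy() makes hr_df1 an independent DataFrame, so later assignments do not raise SettingWithCopyWarning
hr_df1 = hr_df.dropna().copy()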


In [101]:
print(hr_df1.isnull().values.any())

hr_df1.isnull().sum()

False


employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
kpis_met_>80%           0
awards_won              0
avg_training_score      0
is_promoted             0
dtype: int64
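
Dropping rows discards every observation that lacks 'education' or 'previous_year_rating'. As a possible alternative, which this notebook does not use, the gaps could be filled instead. The sketch below assumes mode imputation for the categorical column and median imputation for the numeric one, on made-up rows.

```python
import numpy as np
import pandas as pd

# Made-up rows using the two columns that have gaps in the dataset
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [3.0, 5.0, np.nan, 1.0],
})

imputed = toy.copy()
# Categorical column: fill with the most frequent value (the mode)
imputed["education"] = imputed["education"].fillna(imputed["education"].mode()[0])
# Numeric column: fill with the median of the observed values
imputed["previous_year_rating"] = imputed["previous_year_rating"].fillna(
    imputed["previous_year_rating"].median()
)

print(imputed)
```

Whether imputation is appropriate depends on why the values are missing, so this is a design choice to discuss with HR rather than a drop-in replacement.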

In [102]:
# Get unique values for department variable
hr_df1['department'].unique()

array(['Sales & Marketing', 'Operations', 'Technology', 'Analytics',
       'R&D', 'Procurement', 'Finance', 'HR', 'Legal'], dtype=object)

In [103]:
# Get unique values for region variable
hr_df1['region'].unique()

array(['region_7', 'region_22', 'region_19', 'region_23', 'region_26',
       'region_2', 'region_20', 'region_34', 'region_1', 'region_4',
       'region_29', 'region_31', 'region_15', 'region_14', 'region_11',
       'region_5', 'region_28', 'region_17', 'region_13', 'region_16',
       'region_25', 'region_10', 'region_27', 'region_30', 'region_12',
       'region_21', 'region_32', 'region_6', 'region_33', 'region_8',
       'region_24', 'region_3', 'region_9', 'region_18'], dtype=object)

In [104]:
# Get unique values for education variable
hr_df1['education'].unique()

array(["Master's & above", "Bachelor's", 'Below Secondary'], dtype=object)

In [105]:
# Get unique values for gender variable
hr_df1['gender'].unique()

array(['f', 'm'], dtype=object)

In [106]:
# Get unique values for recruitment_channel variable
hr_df1['recruitment_channel'].unique()

array(['sourcing', 'other', 'referred'], dtype=object)

In [107]:
# Sample dataset

hr_df1.sample()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_>80%,awards_won,avg_training_score,is_promoted
22815,5335,HR,region_22,Bachelor's,m,other,1,27,5.0,2,1,0,48,0


## 5. Solution Implementation

### Variable selection
We will omit the variables 'department', 'region', 'recruitment_channel' and 'employee_id', assuming promotions happen irrespective of them.

In [108]:
# Encode 'education' and 'gender' values as integer codes so that we can include them in the model
# These variables are considered key in promotions

hr_df1.loc[hr_df1['gender'] == "m", 'gender'] = 1
hr_df1.loc[hr_df1['gender'] == "f", 'gender'] = 0
hr_df1.loc[hr_df1['education'] == "Master's & above", 'education'] = 1
hr_df1.loc[hr_df1['education'] == "Bachelor's",'education'] = 0
hr_df1.loc[hr_df1['education'] == "Below Secondary",'education'] = 2




In [109]:
hr_df1.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_>80%,awards_won,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,1,0,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,0,1,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,0,1,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,0,1,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,0,1,other,1,45,3.0,2,0,0,73,0


In [110]:
# Convert the datatype to integer so that they can be included in the model
hr_df1['education'] = hr_df1['education'].astype('int64')
hr_df1['gender'] = hr_df1['gender'].astype('int64')

hr_df1.dtypes



employee_id               int64
department               object
region                   object
education                 int64
gender                    int64
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
kpis_met_>80%             int64
awards_won                int64
avg_training_score        int64
is_promoted               int64
dtype: object
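
The encoding and the integer conversion in the cells above can also be written in one step with `Series.map`. The sketch below uses the same codes as the notebook on a made-up frame; the names `toy`, `gender_codes` and `education_codes` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "gender": ["f", "m", "m"],
    "education": ["Master's & above", "Bachelor's", "Below Secondary"],
})

# Same codes as in the notebook cells above
gender_codes = {"m": 1, "f": 0}
education_codes = {"Master's & above": 1, "Bachelor's": 0, "Below Secondary": 2}

encoded = toy.copy()
# map() looks each value up in the dictionary and returns an integer column
encoded["gender"] = encoded["gender"].map(gender_codes)
encoded["education"] = encoded["education"].map(education_codes)

print(encoded.dtypes)
```

Any value that is missing from the dictionary becomes NaN, so it is worth checking for nulls after mapping.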

### Defining and training the model

We will use the decision tree algorithm to train our model

In [111]:

# import decision tree from the sklearn library
from sklearn.tree import DecisionTreeClassifier

# Declare feature and target variables
features = hr_df1.drop(['employee_id', 'department','region','recruitment_channel','is_promoted'], axis=1)
target = hr_df1['is_promoted']

# create an empty model and assign it to a variable
model = DecisionTreeClassifier()

# train a model by calling the fit() method
model.fit(features, target)

print(model)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')


In [112]:
# Preview the features 
features.sample()

Unnamed: 0,education,gender,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_>80%,awards_won,avg_training_score
1617,0,1,1,39,3.0,12,0,0,51
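
The model above is fitted on every available row, so it is never scored on data it has not seen. A minimal sketch of a hold-out evaluation follows. It uses synthetic stand-in data with a made-up labelling rule, not the HR dataset, and the split size and metrics are assumptions rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target (illustration only)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "kpis_met_>80%": rng.integers(0, 2, n),
    "avg_training_score": rng.integers(40, 100, n),
})
# Made-up rule so that the target is learnable
y = ((X["kpis_met_>80%"] == 1) & (X["avg_training_score"] > 70)).astype(int)

# Hold out 25% of the rows, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Score only on the held-out rows
acc = accuracy_score(y_test, pred)
rec = recall_score(y_test, pred)
print("accuracy:", acc)
print("recall (promoted):", rec)
```

With the real data, `X` and `y` would be replaced by `features` and `target`. Recall on the promoted class is reported alongside accuracy because accuracy alone can look high when one class is much more common than the other.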


### Prediction using the trained model

In [113]:
# Create new observations to be used in the model
new_features = pd.DataFrame(
    [
        [1, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [0, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [2, 0, 1, 26, 4.0, 2, 1, 1, 70],
     
    ],
    columns=features.columns,
)

new_features

Unnamed: 0,education,gender,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_>80%,awards_won,avg_training_score
0,1,1,1,26,4.0,2,1,1,70
1,0,0,1,26,4.0,2,1,1,70
2,2,0,1,26,4.0,2,1,1,70


In [114]:
# Predict the target variable using the new observations
answers = model.predict(new_features)

print(answers)

[1 1 0]


### Recommendations

1. The HR department can adopt the model to predict whether an employee is eligible for promotion using the following historical information: 'education', 'gender', 'no_of_trainings', 'age', 'previous_year_rating', 'length_of_service', 'kpis_met_>80%', 'awards_won' & 'avg_training_score'.
2. The HR department will need to collect and maintain accurate and up-to-date employee data to use the model. Of the 54,808 employees, 2,409 had missing 'education' and 4,124 had missing 'previous_year_rating', both of which are crucial in identifying eligible candidates for promotion.


## 6. Challenging your Solution

In [115]:
# Does the department play a significant role in the promotions?

# Determine unique values in the department variable
hr_df1["department"].unique()

array(['Sales & Marketing', 'Operations', 'Technology', 'Analytics',
       'R&D', 'Procurement', 'Finance', 'HR', 'Legal'], dtype=object)

In [116]:
# Assign numerical values to different departments
hr_df1.loc[hr_df1['department'] == "Sales & Marketing", 'department'] = 0
hr_df1.loc[hr_df1['department'] == "Operations",'department'] = 1
hr_df1.loc[hr_df1['department'] == "Technology",'department'] = 2
hr_df1.loc[hr_df1['department'] == "Analytics", 'department'] = 3
hr_df1.loc[hr_df1['department'] == "R&D",'department'] = 4
hr_df1.loc[hr_df1['department'] == "Procurement",'department'] = 5
hr_df1.loc[hr_df1['department'] == "Finance", 'department'] = 6
hr_df1.loc[hr_df1['department'] == "HR",'department'] = 7
hr_df1.loc[hr_df1['department'] == "Legal",'department'] = 8



In [117]:
# Convert the datatype to integer so that it can be included in the model
hr_df1['department'] = hr_df1['department'].astype('int64')



In [120]:
# Train the model and predict new observations
# Declare feature and target variables
features2 = hr_df1.drop(['employee_id', 'region','recruitment_channel','is_promoted'], axis=1)
target = hr_df1['is_promoted']

# create an empty model and assign it to a variable
model2 = DecisionTreeClassifier()

# train a model by calling the fit() method
model2.fit(features2, target)

# Create new observations to be used in the model
new_features2 = pd.DataFrame(
    [
        [0, 1, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [1, 2, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [2, 0, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [3, 1, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [4, 2, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [5, 0, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [6, 1, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [7, 2, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [8, 1, 1, 1, 26, 4.0, 2, 1, 1, 70],
     
    ],
    columns=features2.columns,
)

# Predict the target variable using the new observations
answers2 = model2.predict(new_features2)

print(answers2)

[1 1 1 1 0 0 1 1 1]


In [123]:
# Preview features
features2.sample()

Unnamed: 0,department,education,gender,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_>80%,awards_won,avg_training_score
11089,3,1,1,1,44,3.0,5,0,0,87


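The department does seem to have an impact on the predictions. Running the same nine observations through the previous model, which did not include the department variable, gives different results for three of them (see the output below). Because neither tree was fitted with a fixed random_state, part of this difference may come from run-to-run variation rather than from the department variable itself.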

In [122]:
# Create the same observations without the department column
# and predict the target variable with the first model
new_features = pd.DataFrame(
    [
        [1, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [2, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [0, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [1, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [2, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [0, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [1, 0, 1, 26, 4.0, 2, 1, 1, 70],
        [2, 1, 1, 26, 4.0, 2, 1, 1, 70],
        [1, 1, 1, 26, 4.0, 2, 1, 1, 70],
     
    ],
    columns=features.columns,
)
answers = model.predict(new_features)

print(answers)

[1 0 1 1 0 1 1 0 1]
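
Comparing a handful of predictions is only an informal check. Another way to probe how much a fitted tree relies on a feature such as department is to inspect its `feature_importances_` attribute. The sketch below uses synthetic data in which the target deliberately ignores the department; the data and the labelling rule are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustration only)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "department": rng.integers(0, 9, n),
    "avg_training_score": rng.integers(40, 100, n),
})
# Made-up target that depends only on the training score
y = (X["avg_training_score"] > 75).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1; a larger share means the feature contributed more
# to reducing impurity across the tree's splits
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Applied to `model2` and `features2.columns`, the same two lines would show the share of the tree's splits attributed to the department relative to the other features.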


## 7. Follow up questions

### a). Did we have the right data?

Yes. The data has enough information to draw recommendations and/or suggestions that would involve additional data analysis.

### b). Do we need other data to answer our question?

Possibly. It is unclear whether gender, education or age should carry more weight, and whether the HR department has preferences around that.

### c). Did we have the right question?

Yes, we did.