# HR Promotion Classification Problem

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems your client faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The process they currently follow is:

- They first identify a set of employees based on recommendations/past performance
- Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills required in each vertical
- At the end of the program, based on various factors such as training performance, KPI completion (only employees with KPIs completed greater than 60% are considered) etc., an employee gets promoted

For the above-mentioned process, the final promotions are announced only after the evaluation, which delays the transition to the new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

- Attribute Information:

- employee_id	Unique ID for employee
- department	Department of employee
- region	Region of employment (unordered)
- education	Education Level
- gender	Gender of Employee
- recruitment_channel	Channel of recruitment for employee
- no_of_trainings	Number of other trainings completed in the previous year on soft skills, technical skills, etc.
- age	Age of Employee
- previous_year_rating	Employee Rating for the previous year
- length_of_service	Length of service in years
- KPIs_met >80%	1 if the percentage of KPIs (Key Performance Indicators) met is >80%, else 0
- awards_won?	1 if awards were won during the previous year, else 0
- avg_training_score	Average score in current training evaluations
- is_promoted	(Target) Recommended for promotion

# Importing required libraries for the project

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

- Loading the performance dataset from GitHub

In [2]:
Perfdf = pd.read_csv('https://raw.githubusercontent.com/Manju410/MLPractice/main/data/train_Performance.csv')

In [3]:
Perfdf.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


- Number of rows and columns in the dataset

In [4]:
Perfdf.shape

(54808, 14)

- There are 54808 rows and 14 columns in the above dataset

- Information about the dataset, such as data types and non-null counts

In [5]:
Perfdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  KPIs_met >80%         54808 non-null  int64  
 11  awards_won?           54808 non-null  int64  
 12  avg_training_score    54808 non-null  int64  
 13  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.9+ MB


# Summary of above output
- The dataset contains 14 columns.
- Eight columns are of integer type, five are of object type, and one is of float type.
- The education and previous_year_rating columns contain null values.
- The dataset has 54808 entries in total.

- Checking null values in dataset

In [6]:
Perfdf.isna().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [7]:
100*Perfdf.isna().sum()/Perfdf.shape[0]

employee_id             0.000000
department              0.000000
region                  0.000000
education               4.395344
gender                  0.000000
recruitment_channel     0.000000
no_of_trainings         0.000000
age                     0.000000
previous_year_rating    7.524449
length_of_service       0.000000
KPIs_met >80%           0.000000
awards_won?             0.000000
avg_training_score      0.000000
is_promoted             0.000000
dtype: float64
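
The two cells above report missing counts and missing percentages separately. As a minimal sketch, the same information can be collected into one table restricted to the columns that actually have gaps. The helper name `missing_summary` and the toy frame are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value count and percentage for columns with any gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame for illustration only (made-up rows, not the real dataset)
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, None, None, 3.0],
    "age": [35, 30, 34, 39],
})
summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(Perfdf)`.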

In [8]:
len(Perfdf.employee_id.unique())

54808

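- Removing employee_id column

The Perfdf dataset contains an employee_id column that is unique for every row and is not related to the outcome (label) variable, so we drop it. The same cell also renames KPIs_met >80% and awards_won? to KPI_Score and awards_won.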

In [9]:
Perfdf.rename(columns={'KPIs_met >80%':'KPI_Score','awards_won?':'awards_won'},inplace=True)
Perfdf.drop('employee_id', axis=1, inplace=True)
Perfdf.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPI_Score,awards_won,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [10]:
Perfdf.columns

Index(['department', 'region', 'education', 'gender', 'recruitment_channel',
       'no_of_trainings', 'age', 'previous_year_rating', 'length_of_service',
       'KPI_Score', 'awards_won', 'avg_training_score', 'is_promoted'],
      dtype='object')

In [11]:
Perfdf.department.unique()

array(['Sales & Marketing', 'Operations', 'Technology', 'Analytics',
       'R&D', 'Procurement', 'Finance', 'HR', 'Legal'], dtype=object)

In [12]:
Perfdf.region.unique()

array(['region_7', 'region_22', 'region_19', 'region_23', 'region_26',
       'region_2', 'region_20', 'region_34', 'region_1', 'region_4',
       'region_29', 'region_31', 'region_15', 'region_14', 'region_11',
       'region_5', 'region_28', 'region_17', 'region_13', 'region_16',
       'region_25', 'region_10', 'region_27', 'region_30', 'region_12',
       'region_21', 'region_8', 'region_32', 'region_6', 'region_33',
       'region_24', 'region_3', 'region_9', 'region_18'], dtype=object)

In [13]:
Perfdf.education.unique()

array(["Master's & above", "Bachelor's", nan, 'Below Secondary'],
      dtype=object)

In [14]:
Perfdf.education.value_counts()

Bachelor's          36669
Master's & above    14925
Below Secondary       805
Name: education, dtype: int64

In [15]:
Perfdf.education = Perfdf.education.fillna("Bachelor's")

In [16]:
Perfdf.gender.unique()

array(['f', 'm'], dtype=object)

In [17]:
Perfdf.recruitment_channel.unique()

array(['sourcing', 'other', 'referred'], dtype=object)

In [18]:
Perfdf.no_of_trainings.unique()

array([ 1,  2,  3,  4,  7,  5,  6,  8, 10,  9])

In [19]:
Perfdf.age.unique()

array([35, 30, 34, 39, 45, 31, 33, 28, 32, 49, 37, 38, 41, 27, 29, 26, 24,
       57, 40, 42, 23, 59, 44, 50, 56, 20, 25, 47, 36, 46, 60, 43, 22, 54,
       58, 48, 53, 55, 51, 52, 21])

In [20]:
Perfdf.previous_year_rating.unique()

array([ 5.,  3.,  1.,  4., nan,  2.])

In [21]:
Perfdf.previous_year_rating.value_counts()

3.0    18618
5.0    11741
4.0     9877
1.0     6223
2.0     4225
Name: previous_year_rating, dtype: int64

In [22]:
Perfdf.previous_year_rating = Perfdf.previous_year_rating.fillna(3.)

In [23]:
Perfdf.previous_year_rating.unique()

array([5., 3., 1., 4., 2.])
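
The two `fillna` calls above hard-code "Bachelor's" and 3.0, which are the most frequent values in the `value_counts()` outputs. A minimal sketch of the same mode imputation, with the fill value computed rather than typed in, might look like this. The helper `fill_with_mode` and the toy frame are hypothetical and for illustration only.

```python
import pandas as pd

def fill_with_mode(series: pd.Series) -> pd.Series:
    """Replace missing values with the most frequent non-null value."""
    # mode() returns all tied values in sorted order; take the first one
    most_frequent = series.mode(dropna=True).iloc[0]
    return series.fillna(most_frequent)

# Toy frame for illustration only (made-up rows, not the real dataset)
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [3.0, None, 5.0, 3.0],
})

for col in ["education", "previous_year_rating"]:
    toy[col] = fill_with_mode(toy[col])

print(toy)
```

Computing the mode keeps the imputation correct if the data changes, at the cost of hiding the chosen value from the reader.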

In [24]:
Perfdf.length_of_service.unique()

array([ 8,  4,  7, 10,  2,  5,  6,  1,  3, 16,  9, 11, 26, 12, 17, 14, 13,
       19, 15, 23, 18, 20, 22, 25, 28, 24, 31, 21, 29, 30, 34, 27, 33, 32,
       37])

In [25]:
Perfdf.KPI_Score.unique()

array([1, 0])

In [26]:
Perfdf.awards_won.unique()

array([0, 1])

In [27]:
Perfdf.avg_training_score.unique()

array([49, 60, 50, 73, 85, 59, 63, 83, 54, 77, 80, 84, 51, 46, 75, 57, 70,
       68, 79, 44, 72, 61, 48, 58, 87, 47, 52, 88, 71, 65, 62, 53, 78, 91,
       82, 69, 55, 74, 86, 90, 92, 67, 89, 56, 76, 81, 45, 64, 39, 94, 93,
       66, 95, 42, 96, 40, 99, 43, 97, 41, 98])

In [28]:
Perfdf.is_promoted.unique()

array([0, 1])

In [48]:
Perfdf.gender = [0 if x=='f' else 1 for x in Perfdf.gender]

In [55]:
def Education(word):
  if word=="Below Secondary": return 0
  elif word=="Bachelor's": return 1
  else: return 2

Perfdf.education = Perfdf.education.map(Education)

In [56]:
Perfdf.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPI_Score,awards_won,avg_training_score,is_promoted
0,Sales & Marketing,region_7,2,0,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,1,1,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,1,1,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,1,1,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,1,1,other,1,45,3.0,2,0,0,73,0
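
The `Education` function sends every label other than "Below Secondary" and "Bachelor's" to 2, so an unexpected or misspelled label would silently be treated as "Master's & above". As a hedged alternative, an explicit dictionary makes the ordinal order visible and lets unknown labels fail loudly. The names `EDUCATION_ORDER` and `encode_education` are hypothetical, and the sketch assumes missing values have already been filled.

```python
import pandas as pd

# Ordinal codes matching the Education function in the notebook
EDUCATION_ORDER = {"Below Secondary": 0, "Bachelor's": 1, "Master's & above": 2}

def encode_education(series: pd.Series) -> pd.Series:
    """Map education labels to ordinal codes, rejecting unknown labels."""
    encoded = series.map(EDUCATION_ORDER)  # labels not in the dict become NaN
    unknown = series[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unexpected education labels: {list(unknown)}")
    return encoded.astype(int)

# Toy series for illustration only
codes = encode_education(
    pd.Series(["Master's & above", "Bachelor's", "Below Secondary"])
)
print(codes.tolist())
```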

In [30]:
depdf = pd.get_dummies(Perfdf.department,prefix='depart')

In [31]:
regdf = pd.get_dummies(Perfdf.region,prefix='reg')

In [32]:
rec_chan = pd.get_dummies(Perfdf.recruitment_channel,prefix='recchan')

In [57]:
Perfdf1 = pd.concat([Perfdf,depdf,regdf,rec_chan],axis=1)

In [58]:
Perfdf1.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPI_Score,awards_won,avg_training_score,is_promoted,depart_Analytics,depart_Finance,depart_HR,depart_Legal,depart_Operations,depart_Procurement,depart_R&D,depart_Sales & Marketing,depart_Technology,reg_region_1,reg_region_10,reg_region_11,reg_region_12,reg_region_13,reg_region_14,reg_region_15,reg_region_16,reg_region_17,reg_region_18,reg_region_19,reg_region_2,reg_region_20,reg_region_21,reg_region_22,reg_region_23,reg_region_24,reg_region_25,reg_region_26,reg_region_27,reg_region_28,reg_region_29,reg_region_3,reg_region_30,reg_region_31,reg_region_32,reg_region_33,reg_region_34,reg_region_4,reg_region_5,reg_region_6,reg_region_7,reg_region_8,reg_region_9,recchan_other,recchan_referred,recchan_sourcing
0,Sales & Marketing,region_7,2,0,sourcing,1,35,5.0,8,1,0,49,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1,Operations,region_22,1,1,other,1,30,5.0,4,0,0,60,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Sales & Marketing,region_19,1,1,sourcing,1,34,3.0,7,0,0,50,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Sales & Marketing,region_23,1,1,other,2,39,1.0,10,0,0,50,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,Technology,region_26,1,1,other,1,45,3.0,2,0,0,73,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [59]:
Perfdf1.drop(columns=['department','region','recruitment_channel'], inplace=True)
Perfdf1.head()

Unnamed: 0,education,gender,no_of_trainings,age,previous_year_rating,length_of_service,KPI_Score,awards_won,avg_training_score,is_promoted,depart_Analytics,depart_Finance,depart_HR,depart_Legal,depart_Operations,depart_Procurement,depart_R&D,depart_Sales & Marketing,depart_Technology,reg_region_1,reg_region_10,reg_region_11,reg_region_12,reg_region_13,reg_region_14,reg_region_15,reg_region_16,reg_region_17,reg_region_18,reg_region_19,reg_region_2,reg_region_20,reg_region_21,reg_region_22,reg_region_23,reg_region_24,reg_region_25,reg_region_26,reg_region_27,reg_region_28,reg_region_29,reg_region_3,reg_region_30,reg_region_31,reg_region_32,reg_region_33,reg_region_34,reg_region_4,reg_region_5,reg_region_6,reg_region_7,reg_region_8,reg_region_9,recchan_other,recchan_referred,recchan_sourcing
0,2,0,1,35,5.0,8,1,0,49,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1,1,1,1,30,5.0,4,0,0,60,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,1,1,1,34,3.0,7,0,0,50,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,1,1,2,39,1.0,10,0,0,50,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1,1,1,45,3.0,2,0,0,73,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
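
The one-hot encoding above is built from three `get_dummies` calls, a `concat`, and a `drop`. As a sketch, `pd.get_dummies` can also take the whole frame with `columns=` and a matching list of prefixes, in which case it replaces the listed columns in a single step. The toy frame below is made up for illustration, and `dtype=int` is an assumption chosen to give 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Toy frame for illustration only (made-up rows, not the real dataset)
toy = pd.DataFrame({
    "department": ["Sales & Marketing", "Operations", "Technology"],
    "region": ["region_7", "region_22", "region_26"],
    "recruitment_channel": ["sourcing", "other", "other"],
    "age": [35, 30, 45],
})

# Encode the three nominal columns and drop the originals in one call
encoded = pd.get_dummies(
    toy,
    columns=["department", "region", "recruitment_channel"],
    prefix=["depart", "reg", "recchan"],
    dtype=int,
)
print(encoded.columns.tolist())
```

The resulting column names follow the same `depart_`, `reg_`, and `recchan_` pattern as in Perfdf1.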


In [60]:
Perfdf1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 56 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   education                 54808 non-null  int64  
 1   gender                    54808 non-null  int64  
 2   no_of_trainings           54808 non-null  int64  
 3   age                       54808 non-null  int64  
 4   previous_year_rating      54808 non-null  float64
 5   length_of_service         54808 non-null  int64  
 6   KPI_Score                 54808 non-null  int64  
 7   awards_won                54808 non-null  int64  
 8   avg_training_score        54808 non-null  int64  
 9   is_promoted               54808 non-null  int64  
 10  depart_Analytics          54808 non-null  uint8  
 11  depart_Finance            54808 non-null  uint8  
 12  depart_HR                 54808 non-null  uint8  
 13  depart_Legal              54808 non-null  uint8  
 14  depart

In [61]:
from google.colab import  drive

In [63]:
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [64]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [65]:
Perfdf1.to_csv('/content/drive/MyDrive/Promotion_Cleanup.csv', index=False)