Employee attrition is a critical challenge for organizations, as losing skilled employees can lead to increased recruitment costs, reduced productivity, and disruption of team dynamics. Predicting which employees are likely to leave allows HR departments to take proactive measures to improve retention and maintain organizational stability.

This project focuses on analyzing an HR dataset containing employee demographics, job-related attributes, and workplace information to predict employee attrition. The dataset includes both numerical features (such as Age, DistanceFromHome, TotalWorkingYears) and categorical features (such as Department, JobRole, OverTime), which are preprocessed and used as inputs for machine learning models.

The primary objective is to build a predictive model that can accurately classify whether an employee is likely to stay or leave. The insights generated from this analysis can help organizations identify risk factors for attrition and design effective employee engagement and retention strategies.

In [1]:
#importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import os
os.chdir('/content/drive/MyDrive/Boosting')

Step 1: Loading the dataset

In [4]:
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [5]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [6]:
#Checking the number of 'Yes' and 'No' in 'Attrition'
df['Attrition'].value_counts()

Attrition,count
No,1233
Yes,237
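
The counts above point to a clear class imbalance. As a quick arithmetic check (the variable names here are illustrative and not part of the notebook), the share of each class and the accuracy of a trivial predictor that always answers 'No' can be worked out from the two counts alone:

```python
# Class counts copied from the value_counts() output above.
counts = {"No": 1233, "Yes": 237}
total = sum(counts.values())  # 1470 rows, matching the df.info() output

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a baseline that always predicts the majority class.
majority_baseline = max(shares.values())

print({label: round(share, 3) for label, share in shares.items()})
print(round(majority_baseline, 3))
```

Roughly 16% of employees left, so any accuracy figure reported later is best read against a baseline of about 84%.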


In [7]:
# checking for missing values in the data
df.isnull().sum()

Unnamed: 0,0
Age,0
Attrition,0
BusinessTravel,0
DailyRate,0
Department,0
DistanceFromHome,0
Education,0
EducationField,0
EmployeeCount,0
EmployeeNumber,0


In [8]:
#checking the info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

Step 2: Feature Engineering

The numeric and categorical fields need to be treated separately. The following steps separate the numeric and categorical fields and drop the target field 'Attrition' from the feature set.

In [9]:
# Extracting the numeric values
df_num = df.select_dtypes(include = ['int64'])

In [10]:
df_num.head()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1102,1,2,1,1,2,94,3,2,...,1,80,0,8,0,1,6,4,0,5
1,49,279,8,1,1,2,3,61,2,2,...,4,80,1,10,3,3,10,7,1,7
2,37,1373,2,2,1,4,4,92,2,1,...,2,80,0,7,3,3,0,0,0,0
3,33,1392,3,4,1,5,4,56,3,1,...,3,80,0,8,3,3,8,7,3,0
4,27,591,2,1,1,7,1,40,3,1,...,4,80,1,6,3,3,2,2,2,2


In [11]:
# extracting the categorical values
df_cat = df.select_dtypes(include = ['object'])

In [12]:
df_cat.head()

Unnamed: 0,Attrition,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime
0,Yes,Travel_Rarely,Sales,Life Sciences,Female,Sales Executive,Single,Y,Yes
1,No,Travel_Frequently,Research & Development,Life Sciences,Male,Research Scientist,Married,Y,No
2,Yes,Travel_Rarely,Research & Development,Other,Male,Laboratory Technician,Single,Y,Yes
3,No,Travel_Frequently,Research & Development,Life Sciences,Female,Research Scientist,Married,Y,Yes
4,No,Travel_Rarely,Research & Development,Medical,Male,Laboratory Technician,Married,Y,No


In [13]:
# dropping the Attrition from df_cat

df_cat = df_cat.drop('Attrition', axis =1)

In [14]:
df_cat.head()

Unnamed: 0,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime
0,Travel_Rarely,Sales,Life Sciences,Female,Sales Executive,Single,Y,Yes
1,Travel_Frequently,Research & Development,Life Sciences,Male,Research Scientist,Married,Y,No
2,Travel_Rarely,Research & Development,Other,Male,Laboratory Technician,Single,Y,Yes
3,Travel_Frequently,Research & Development,Life Sciences,Female,Research Scientist,Married,Y,Yes
4,Travel_Rarely,Research & Development,Medical,Male,Laboratory Technician,Married,Y,No


In [15]:
# encoding using pandas to get dummies

df_cat_encoded = pd.get_dummies(df_cat)

In [16]:
df_cat_encoded.head()

Unnamed: 0,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Human Resources,Department_Research & Development,Department_Sales,EducationField_Human Resources,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,...,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,Over18_Y,OverTime_No,OverTime_Yes
0,False,False,True,False,False,True,False,True,False,False,...,False,False,True,False,False,False,True,True,False,True
1,False,True,False,False,True,False,False,True,False,False,...,False,True,False,False,False,True,False,True,True,False
2,False,False,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,True
3,False,True,False,False,True,False,False,True,False,False,...,False,True,False,False,False,True,False,True,False,True
4,False,False,True,False,True,False,False,False,False,True,...,False,False,False,False,False,True,False,True,True,False


Scaling the numeric fields

The numeric fields are standardized next with StandardScaler(), which rescales each column to zero mean and unit variance. After scaling, the numeric features are merged with the encoded categorical features.
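
The notebook fits the scaler on the full dataset before splitting. A common alternative, sketched here on made-up toy data rather than the HR file, is to split first and learn the scaling and encoding statistics from the training rows only. The frame `toy` and the names `preprocess`, `train_features` and `test_features` are hypothetical; ColumnTransformer and OneHotEncoder are swapped in for the notebook's manual get_dummies and concat steps.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small illustrative frame; the values are made up for this sketch.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32, 59, 30],
    "DistanceFromHome": [1, 8, 2, 3, 2, 2, 3, 24],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})
toy_train, toy_test = train_test_split(toy, train_size=0.75, random_state=100)

# Scale the numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "DistanceFromHome"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["OverTime"]),
])

# Means, standard deviations and categories are learned from training rows only.
train_features = preprocess.fit_transform(toy_train)
# The test rows are transformed with those same training statistics.
test_features = preprocess.transform(toy_test)

print(train_features.shape, test_features.shape)
```

With decision-stump weak learners the scaling itself is unlikely to change the splits much, so the main benefit of this arrangement is that no test-set information enters the preprocessing.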

In [17]:
# scaling the numeric fields
from sklearn.preprocessing import StandardScaler

In [18]:
scaler = StandardScaler()


In [19]:
df_num_scaled = scaler.fit_transform(df_num)



In [20]:
df_num_scaled

array([[ 0.4463504 ,  0.74252653, -1.01090934, ..., -0.0632959 ,
        -0.67914568,  0.24583399],
       [ 1.32236521, -1.2977746 , -0.14714972, ...,  0.76499762,
        -0.36871529,  0.80654148],
       [ 0.008343  ,  1.41436324, -0.88751511, ..., -1.16768726,
        -0.67914568, -1.15593471],
       ...,
       [-1.08667552, -1.60518328, -0.64072665, ..., -0.61549158,
        -0.67914568, -0.31487349],
       [ 1.32236521,  0.54667746, -0.88751511, ...,  0.48889978,
        -0.67914568,  1.08689522],
       [-0.32016256, -0.43256792, -0.14714972, ..., -0.33939374,
        -0.36871529, -0.59522723]])

In [21]:
#converting num array to data frame
df_num_scaled = pd.DataFrame(df_num_scaled, columns = df_num.columns)

In [22]:
df_num_scaled.head()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,0.44635,0.742527,-1.010909,-0.891688,0.0,-1.701283,-0.660531,1.383138,0.379672,-0.057788,...,-1.584178,0.0,-0.932014,-0.421642,-2.171982,-2.49382,-0.164613,-0.063296,-0.679146,0.245834
1,1.322365,-1.297775,-0.14715,-1.868426,0.0,-1.699621,0.254625,-0.240677,-1.026167,-0.057788,...,1.191438,0.0,0.241988,-0.164511,0.155707,0.338096,0.488508,0.764998,-0.368715,0.806541
2,0.008343,1.414363,-0.887515,-0.891688,0.0,-1.696298,1.169781,1.284725,-1.026167,-0.961486,...,-0.658973,0.0,-0.932014,-0.550208,0.155707,0.338096,-1.144294,-1.167687,-0.679146,-1.155935
3,-0.429664,1.461466,-0.764121,1.061787,0.0,-1.694636,1.169781,-0.486709,0.379672,-0.961486,...,0.266233,0.0,-0.932014,-0.421642,0.155707,0.338096,0.161947,0.764998,0.252146,-1.155935
4,-1.086676,-0.524295,-0.887515,-1.868426,0.0,-1.691313,-1.575686,-1.274014,0.379672,-0.961486,...,1.191438,0.0,0.241988,-0.678774,0.155707,0.338096,-0.817734,-0.615492,-0.058285,-0.595227
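
The 0.0 values for EmployeeCount and StandardHours in the scaled output above suggest that these columns hold a single value for every employee, and EmployeeNumber looks like an identifier. Whether to drop such columns is a judgment call the notebook does not make; the sketch below, on a hypothetical toy frame `toy_num`, shows one way to detect and remove constant columns before scaling.

```python
import pandas as pd

# Toy frame standing in for df_num; EmployeeCount and StandardHours are constant,
# mirroring the all-zero columns in the scaled output above.
toy_num = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "EmployeeCount": [1, 1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80, 80],
    "DailyRate": [1102, 279, 1373, 1392, 591],
})

# A column with a single distinct value cannot help separate the classes.
constant_cols = [col for col in toy_num.columns if toy_num[col].nunique() <= 1]
toy_num_reduced = toy_num.drop(columns=constant_cols)

print(constant_cols)
print(toy_num_reduced.columns.tolist())
```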


In [23]:
df_num_scaled.shape

(1470, 26)

In [24]:
df_cat_encoded.shape

(1470, 29)

In [25]:
# combining the numeric and categorical data frames
df_final = pd.concat([df_num_scaled, df_cat_encoded], axis=1)

In [26]:
df_final.head()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,Over18_Y,OverTime_No,OverTime_Yes
0,0.44635,0.742527,-1.010909,-0.891688,0.0,-1.701283,-0.660531,1.383138,0.379672,-0.057788,...,False,False,True,False,False,False,True,True,False,True
1,1.322365,-1.297775,-0.14715,-1.868426,0.0,-1.699621,0.254625,-0.240677,-1.026167,-0.057788,...,False,True,False,False,False,True,False,True,True,False
2,0.008343,1.414363,-0.887515,-0.891688,0.0,-1.696298,1.169781,1.284725,-1.026167,-0.961486,...,False,False,False,False,False,False,True,True,False,True
3,-0.429664,1.461466,-0.764121,1.061787,0.0,-1.694636,1.169781,-0.486709,0.379672,-0.961486,...,False,True,False,False,False,True,False,True,False,True
4,-1.086676,-0.524295,-0.887515,-1.868426,0.0,-1.691313,-1.575686,-1.274014,0.379672,-0.961486,...,False,False,False,False,False,True,False,True,True,False


In [27]:
# assigning the feature and target variables
y = df['Attrition']
X = df_final

Train and test split

In [28]:
from sklearn.model_selection import train_test_split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)
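
Because the target is imbalanced, an optional variation on the split above is to stratify on the label so that both partitions keep nearly the same share of 'Yes' cases. The sketch below uses synthetic labels (`toy_y`, `toy_X` and the other names are hypothetical) and the `stratify` argument of train_test_split; it is not what the notebook ran.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with roughly the notebook's 84/16 class ratio.
toy_y = ["No"] * 84 + ["Yes"] * 16
toy_X = [[i] for i in range(len(toy_y))]

# stratify=toy_y keeps the class ratio nearly identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, train_size=0.7, random_state=100, stratify=toy_y
)

print(Counter(y_tr), Counter(y_te))
```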

Step 3: Model Fitting

AdaBoost Classifier

The most important parameters are base_estimator, n_estimators and learning_rate.
1. base_estimator - The learning algorithm used to train the weak models. The default is a DecisionTreeClassifier with max_depth=1 (a decision stump). Recent scikit-learn releases rename this parameter to estimator.
2. n_estimators - The maximum number of weak models to train sequentially. The default is 50.
3. learning_rate - The factor that shrinks the contribution of each model to the weights. The default is 1.

There is a trade-off between learning_rate and n_estimators: a lower learning rate makes the ensemble learn more slowly, so more estimators are typically needed to reach the same fit, although this sometimes yields better performance scores. Decreasing the learning rate L makes the coefficients α_m smaller, which reduces how much the sample weights of misclassified observations change at each boosting step (per the weight-update formula).
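
To make the effect of the learning rate on the weights concrete, here is a minimal sketch of a single reweighting step in discrete AdaBoost for two classes, where α_m = L · ln((1 − err) / err) and misclassified samples are multiplied by exp(α_m) before renormalizing. The function `boosted_weights` is a hypothetical helper written for this illustration; scikit-learn's internal implementation may differ in detail between versions.

```python
import math


def boosted_weights(weights, misclassified, learning_rate):
    """One discrete AdaBoost reweighting step for a two-class problem."""
    # Weighted error of the current weak learner.
    err = sum(w for w, miss in zip(weights, misclassified) if miss) / sum(weights)
    # Learner coefficient alpha_m, shrunk by the learning rate.
    alpha = learning_rate * math.log((1 - err) / err)
    # Only misclassified samples are up-weighted; then renormalize to sum to 1.
    raw = [w * math.exp(alpha) if miss else w for w, miss in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Ten equally weighted samples, two of which the weak learner gets wrong (err = 0.2).
weights = [0.1] * 10
misclassified = [True, True] + [False] * 8

results = {}
for lr in (1.0, 0.5):
    alpha, new_w = boosted_weights(weights, misclassified, lr)
    results[lr] = (alpha, new_w)
    print(lr, round(alpha, 3), round(new_w[0], 3), round(new_w[-1], 3))
```

With a learning rate of 1 the weight of each misclassified sample rises from 0.1 to 0.25, while with 0.5 it rises only to about 0.167, so each step shifts less attention to the hard cases and more steps are needed.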

In [30]:
from sklearn.ensemble import AdaBoostClassifier

In [31]:
# building the model
adaboost = AdaBoostClassifier(n_estimators=200, random_state=100)

In [32]:
# fitting the model
adaboost.fit(X_train, y_train)

In [33]:
# predicting the value of y

y_pred = adaboost.predict(X_test)

In [34]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [35]:
accuracy_score(y_test, y_pred)

0.8639455782312925

In [36]:
confusion_matrix(y_test, y_pred)

array([[355,  16],
       [ 44,  26]])
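
Reading the matrix above with rows as actual classes and columns as predicted classes (scikit-learn sorts the string labels, so the order is 'No', 'Yes'), the per-class metrics can be derived by hand. This is a small arithmetic sketch with illustrative variable names, not a cell from the notebook:

```python
# Confusion matrix copied from the output above.
# Row 0 = actual "No", row 1 = actual "Yes"; columns follow the same order.
tn, fp = 355, 16
fn, tp = 44, 26
n_test = tn + fp + fn + tp  # 441 test employees

accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)   # of predicted leavers, the share who actually left
recall = tp / (tp + fn)      # of actual leavers, the share the model caught
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / n_test  # accuracy of always predicting "No"

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1), ("baseline", baseline)]:
    print(f"{name}: {value:.3f}")
```

The accuracy matches the 0.864 reported above, while recall for the 'Yes' class is only about 0.37 and the always-'No' baseline already reaches about 0.84.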

Conclusion

The HR Employee Attrition project aimed to predict whether an employee is likely to leave the organization using historical employee data. The dataset contained both numerical features (like Age, DistanceFromHome, MonthlyIncome) and categorical features (like Department, JobRole, OverTime), which were preprocessed as follows: numerical features were scaled, and categorical features were one-hot encoded.

The model used for prediction was AdaBoost Classifier, a robust ensemble learning technique that combines multiple weak learners to improve classification performance. After training and testing the model:

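The accuracy achieved was 86.4% (381 of 441 test employees classified correctly). For context, always predicting "No" would score about 84.1% on this test set, because 371 of the 441 test employees stayed, so accuracy alone overstates how much the model has learned.

The confusion matrix shows that the model predicts employees who stay very well (355 of 371 correct) but identifies only 26 of the 70 employees who actually left, a recall of about 37%. This pattern is typical for imbalanced datasets.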

Overall, the model demonstrates that machine learning can be effectively leveraged to identify employees at risk of leaving, enabling HR departments to implement proactive retention strategies. Further improvements could include addressing class imbalance (e.g., using SMOTE), feature selection, and experimenting with other advanced models like XGBoost or Random Forest for potentially higher predictive performance.