# Modelling for Predicting Employee Attrition 
*By Bhavya Bhargava*<br>

### **Why Build a Classification Model on HR Attrition Data?**  

After preparing HR attrition data, visualizing trends, and conducting statistical analysis, **building a classification model** allows for predictive insights and proactive decision-making.  

📊 **Predicts Employee Attrition** – A classification model helps identify employees who are at high risk of leaving based on historical patterns.  

🔍 **Automates Decision-Making** – Machine learning models enable HR teams to assess attrition risk automatically rather than relying solely on manual analysis.  

📈 **Identifies Key Retention Factors** – Feature importance analysis highlights the most influential factors driving attrition, such as job satisfaction, salary hikes, or work-life balance.  

⚡ **Enables Targeted Retention Strategies** – By categorizing employees into "Likely to Stay" and "Likely to Leave," HR teams can take proactive measures to improve engagement and reduce turnover.  

🚀 **Optimizes Workforce Planning** – Predictive modeling supports long-term HR planning by forecasting potential attrition rates and workforce stability.  

By implementing a classification model, organizations **transform HR attrition analysis from reactive insights to proactive workforce management**, ensuring data-driven retention strategies.
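
As a rough illustration of the ideas above, the sketch below trains a random forest on a small synthetic table, maps its 0/1 predictions to the "Likely to Stay" and "Likely to Leave" categories, and reads off feature importances. The columns `satisfaction`, `overtime`, and `noise`, and the rule that generates the `left` target, are invented for this example and are not taken from the HR dataset; the notebook's own model is built on the real data further on.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: these columns and the rule generating the target
# are invented for illustration and are NOT taken from the HR dataset.
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "satisfaction": rng.integers(1, 5, size=n),  # 1 (low) to 4 (high)
    "overtime": rng.integers(0, 2, size=n),      # 0 = no, 1 = yes
    "noise": rng.normal(size=n),                 # feature unrelated to the target
})
# Invented rule: low satisfaction combined with overtime means the employee left
toy["left"] = ((toy["satisfaction"] <= 2) & (toy["overtime"] == 1)).astype(int)

X = toy.drop(columns="left")
y = toy["left"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Turn the 0/1 predictions into the two categories described above
labels = pd.Series(model.predict(X_test), index=X_test.index).map(
    {0: "Likely to Stay", 1: "Likely to Leave"}
)

# Impurity-based importances: one value per feature, normalised to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)

print(labels.value_counts())
print(importances.sort_values(ascending=False))
```

Impurity-based importances are only one way to rank features, so treat the ranking as a starting point for discussion rather than a causal statement about why employees leave.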
<br>
<br>
_Now let's start creating a comprehensive model for predicting employee attrition using random forests_
<br>
<br>
To begin, let's set up the modelling environment by importing the required libraries and loading our dataset.

In [2]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

# Setting the style for visualizations
plt.style.use('fivethirtyeight')
sns.set_style("whitegrid")
%matplotlib inline

# Loading the dataset
employee_data = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition_Processed.csv')

# Suppressing non-critical warnings to keep the notebook output readable
# (a single 'ignore' filter is enough; an earlier 'always' filter would just be overridden)
warnings.filterwarnings('ignore')

Let's start by taking a first look at our data.

In [3]:
# Displaying first few rows and basic information
print("Dataset Shape:", employee_data.shape)
display(employee_data.head())
employee_data.info()

Dataset Shape: (1470, 51)


|  | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ... | EnvironmentSatisfaction_Encoded | Gender_Encoded | JobInvolvement_Encoded | JobRole_Encoded | JobSatisfaction_Encoded | MaritalStatus_Encoded | OverTime_Encoded | PerformanceRating_Encoded | RelationshipSatisfaction_Encoded | WorkLifeBalance_Encoded |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 41 | 1 | Travel_Rarely | 1102 | Sales | 1 | College | Life Sciences | Medium | Female | ... | 2 | 0 | 0 | 7 | 3 | 2 | 1 | 0 | 1 | 0 |
| 1 | 49 | 0 | Travel_Frequently | 279 | Research & Development | 8 | Below College | Life Sciences | High | Male | ... | 0 | 1 | 2 | 6 | 2 | 1 | 0 | 1 | 3 | 2 |
| 2 | 37 | 1 | Travel_Rarely | 1373 | Research & Development | 2 | College | Other | Very High | Male | ... | 3 | 1 | 2 | 2 | 0 | 2 | 1 | 0 | 2 | 2 |
| 3 | 33 | 0 | Travel_Frequently | 1392 | Research & Development | 3 | Master | Life Sciences | Very High | Female | ... | 3 | 0 | 0 | 6 | 0 | 1 | 1 | 0 | 0 | 2 |
| 4 | 27 | 0 | Travel_Rarely | 591 | Research & Development | 2 | Below College | Medical | Low | Male | ... | 1 | 1 | 0 | 2 | 2 | 1 | 0 | 0 | 3 | 2 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 51 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Age                               1470 non-null   int64  
 1   Attrition                         1470 non-null   int64  
 2   BusinessTravel                    1470 non-null   object 
 3   DailyRate                         1470 non-null   int64  
 4   Department                        1470 non-null   object 
 5   DistanceFromHome                  1470 non-null   int64  
 6   Education                         1470 non-null   object 
 7   EducationField                    1470 non-null   object 
 8   EnvironmentSatisfaction           1470 non-null   object 
 9   Gender                            1470 non-null   object 
 10  HourlyRate                        1470 non-null   int64  
 11  JobInvolvement                    1470 non-null   object 
 12  JobLev

As the shape and features are consistent with the output of the preparation stage, we can move forward with modelling.
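
The consistency check described above can also be made explicit in code. The helper below is a hypothetical sketch, not part of the original notebook: `check_prepared_frame` is an invented name, and the three checks (expected shape, no missing values, a binary 0/1 target) are assumptions about what "consistent" should mean here. It runs on a tiny made-up frame so it does not depend on the CSV file.

```python
import pandas as pd

def check_prepared_frame(df, target="Attrition", expected_shape=None):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Shape check is optional and only runs when an expected shape is supplied
    if expected_shape is not None and df.shape != expected_shape:
        problems.append(f"shape is {df.shape}, expected {expected_shape}")
    # Count missing values across the whole frame
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")
    # The target must exist and contain only 0/1
    if target not in df.columns:
        problems.append(f"target column {target!r} not found")
    elif not set(df[target].unique()) <= {0, 1}:
        problems.append(f"target column {target!r} is not binary 0/1")
    return problems

# Tiny stand-in frame (made-up rows) so the sketch runs without the CSV
demo = pd.DataFrame({"Age": [41, 49, 37], "Attrition": [1, 0, 1]})
print(check_prepared_frame(demo, expected_shape=(3, 2)))
```

In the notebook itself, the equivalent call would be `check_prepared_frame(employee_data, expected_shape=(1470, 51))`, using the shape printed above.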

We can start taking care of...
