# **Employee Attrition Analysis and Prediction Using Machine Learning**

## **Project Introduction**

In today’s competitive business environment, retaining talented employees is critical for organizational success. High employee turnover not only affects productivity but also incurs significant recruitment and training costs. 

This project aims to analyze an HR dataset to uncover key factors that influence employee attrition and to build a predictive model that can help HR departments proactively identify employees at risk of leaving.

## **Objectives**

-  **Exploratory Data Analysis (EDA):** Understand the distribution and relationships between different variables.
-  **Correlation Analysis:** Identify how features relate to each other and especially to the target variable `left`.
-  **Attrition Factor Identification:** Discover the most influential features that drive employee turnover.
-  **Machine Learning Modeling:** Build and evaluate predictive models to forecast the likelihood of an employee leaving the company.

## **Goal**

Provide actionable insights and a predictive tool for HR teams to improve employee retention and reduce turnover-related costs.


## **Business scenario and problem**

The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They referred it to me as a data analytics professional and want me to provide data-driven suggestions based on my understanding of the data. They have the following question: what’s likely to make an employee leave the company?

My goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

If I can predict employees who are likely to quit, it might be possible to identify factors that contribute to their leaving. Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company.

### HR dataset 

In this [dataset](https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction?select=HR_comma_sep.csv), there are 14,999 rows, 10 columns, and these variables: 

| Variable | Description |
|-----|-----|
| satisfaction_level | Employee-reported job satisfaction level [0–1] |
| last_evaluation | Score of employee's last performance review [0–1] |
| number_project | Number of projects employee contributes to |
| average_montly_hours | Average number of hours employee worked per month (the column name is misspelled in the raw data) |
| time_spend_company | How long the employee has been with the company (years) |
| Work_accident | Whether or not the employee experienced an accident while at work |
| left | Whether or not the employee left the company |
| promotion_last_5years | Whether or not the employee was promoted in the last 5 years |
| Department | The employee's department |
| salary | The employee's salary level (categorical, e.g. `low`, `medium`) |

## Importing packages and loading the dataset into a dataframe

### Importing packages

In [21]:
# For data manipulation
import numpy as np
import pandas as pd

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)

# for data preprocessing of duplicate entries
from collections import defaultdict

### Loading the dataset

In [22]:
# Loading dataset into a dataframe
df0 = pd.read_csv("hr_dataset_0.csv")

# first few rows of the dataframe
df0.head()

,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


## Data Exploration (Initial EDA and data cleaning)

- Understanding the variables
- Cleaning the dataset (missing data, redundant data, outliers)
- Dealing with legitimate duplicates (the same entry from two or more people)



### Gathering basic information about the data

In [23]:
# Some basic information about the data 
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


### Descriptive statistics about the data

In [24]:
df0.describe()

,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,0.612834,0.716102,3.803054,201.050337,3.498233,0.14461,0.238083,0.021268
std,0.248631,0.171169,1.232592,49.943099,1.460136,0.351719,0.425924,0.144281
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


### Renaming columns

As a data cleaning step, I rename the columns as needed: standardizing the names so that they are all in `snake_case`, correcting any that are misspelled, and making them more concise.

In [25]:
# Column names
df0.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department', 'salary'],
      dtype='object')
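As a quick sketch of how the non-`snake_case` names could be spotted programmatically, the snippet below checks the column names printed above against a regular expression. The pattern and the variable names (`snake_case`, `not_snake`) are illustrative assumptions, not part of the original notebook.

```python
import re

# Column names copied from the df0.columns output above.
columns = ['satisfaction_level', 'last_evaluation', 'number_project',
           'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
           'promotion_last_5years', 'Department', 'salary']

# Assumed definition of snake_case: lowercase letters and digits,
# with single underscores between the parts.
snake_case = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

not_snake = [c for c in columns if not snake_case.match(c)]
print(not_snake)  # ['Work_accident', 'Department']
```

A pattern check like this only catches casing problems. The misspelling in `average_montly_hours` and the wordy `time_spend_company` still have to be found by reading the names.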

In [26]:
# Renaming column names as needed
df0 = df0.rename(columns={'Work_accident': 'work_accident',
                          'average_montly_hours': 'average_monthly_hours',
                          'time_spend_company': 'tenure',
                          'Department': 'department'})

# All column names after the update

df0.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_monthly_hours', 'tenure', 'work_accident', 'left',
       'promotion_last_5years', 'department', 'salary'],
      dtype='object')

### Checking for missing values

In [27]:
# Checking for missing values
df0.isna().sum()

satisfaction_level       0
last_evaluation          0
number_project           0
average_monthly_hours    0
tenure                   0
work_accident            0
left                     0
promotion_last_5years    0
department               0
salary                   0
dtype: int64

There are no missing values in the data.

### Checking for any duplicate entries in the data and dealing with them.

In [28]:
# Total no. of duplicates
df0.duplicated().sum()

np.int64(3008)

3,008 rows are exact duplicates of an earlier row. That is about 20% of the data.
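As a minimal sketch of how the count and its share of the data can be computed together, the snippet below uses a small hypothetical frame named `toy` (it is not the HR dataset); in the notebook the same two lines would run on `df0`.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 4 repeat row 0.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.11, 0.38],
    "salary": ["low", "medium", "low", "medium", "low"],
})

# duplicated() flags every repeat of an earlier row (keep="first" is the default).
n_dupes = int(toy.duplicated().sum())
share = n_dupes / len(toy)

print(f"{n_dupes} duplicated rows ({share:.1%} of the data)")
```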

In [29]:
# Inspecting the duplicates
df0[df0.duplicated()].head()

,satisfaction_level,last_evaluation,number_project,average_monthly_hours,tenure,work_accident,left,promotion_last_5years,department,salary
396,0.46,0.57,2,139,3,0,1,0,sales,low
866,0.41,0.46,2,128,3,0,1,0,accounting,low
1317,0.37,0.51,2,127,3,0,1,0,sales,medium
1368,0.41,0.52,2,132,3,0,1,0,RandD,low
1461,0.42,0.53,2,142,3,0,1,0,sales,low


The output above shows the first five rows that exactly repeat a row appearing earlier in the dataframe. How likely is it that these are legitimate entries? In other words, how plausible is it that two employees self-reported the exact same response for every column?

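I will perform a likelihood analysis by multiplying the probabilities of finding each value in each column, which treats the columns as independent (the simplifying assumption behind naive Bayes). Duplicated rows whose combined probability is very low are unlikely to have arisen by chance and are dropped; the rest are kept as legitimate entries.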
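To make the calculation concrete, here is a small worked example on hypothetical two-column data (`toy` and its values are invented for illustration). It multiplies the per-column frequencies for one row and compares the result with the frequency actually observed for that combination.

```python
import pandas as pd

# Hypothetical toy data with two columns; the notebook applies the idea to all ten.
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "medium"],
    "left": [1, 1, 0, 0],
})

# Per-column relative frequencies of each value.
p_salary = toy["salary"].value_counts(normalize=True)  # low: 0.75, medium: 0.25
p_left = toy["left"].value_counts(normalize=True)      # 1: 0.5, 0: 0.5

# Product of the per-column frequencies for the row (salary="low", left=1).
row_prob = p_salary.loc["low"] * p_left.loc[1]

# Observed frequency of that exact combination, for comparison.
joint = (toy["salary"].eq("low") & toy["left"].eq(1)).mean()

print(row_prob)  # 0.375
print(joint)     # 0.5
```

The two numbers differ because the columns in the toy data are correlated. The product is therefore best read as a rough plausibility score rather than an exact probability.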

In [None]:
# separating duplicates and non-duplicates
duplicates = df0[df0.duplicated(keep=False)].copy()
non_duplicates = df0[~df0.duplicated(keep=False)].copy()
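The cell above passes `keep=False`, which behaves differently from the default used in the earlier count. A minimal sketch on a hypothetical frame (`toy`) shows the difference:

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 4 are identical.
toy = pd.DataFrame({"a": [1, 2, 1, 3, 1], "b": ["x", "y", "x", "z", "x"]})

# Default keep="first": the first occurrence is not flagged, only later repeats.
repeats_only = toy.duplicated()

# keep=False: every member of a duplicated group is flagged, including the first.
all_copies = toy.duplicated(keep=False)

print(repeats_only.tolist())  # [False, False, True, False, True]
print(all_copies.tolist())    # [True, False, True, False, True]
```

This is why the `duplicates` frame holds more rows than the 3,008 repeats counted earlier: it also contains the first occurrence of each repeated row.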

In [33]:
# Calculating probability of each value in each column
value_probs = defaultdict(dict)

for col in df0.columns:
    col_probs = df0[col].value_counts(normalize=True)
    for val, prob in col_probs.items():
        value_probs[col][val] = prob

In [34]:
# Defining a function to compute row probability
def row_probability(row):
    prob = 1.0
    for col in df0.columns:
        prob *= value_probs[col].get(row[col], 1e-6) 
    return prob

In [35]:
# Applying the row probability function to duplicates
duplicates['row_prob'] = duplicates.apply(row_probability, axis=1)

# Setting a threshold for low probability
threshold = duplicates['row_prob'].quantile(0.05)

# Dropping duplicated rows whose combination of values is very improbable:
# two employees are unlikely to match on such a rare combination by chance,
# so the occurrence of these rows is likely due to errors in data entry
to_drop = (duplicates['row_prob'] < threshold)

# Dropping the identified rows from the duplicates DataFrame
cleaned_duplicates = duplicates[~to_drop].copy()

cleaned_duplicates.drop(columns=['row_prob'], inplace=True)

cleaned_duplicates.shape

(5079, 10)

In [36]:
# Recombining the retained duplicates with the rows that were never duplicated
df1 = pd.concat([cleaned_duplicates, non_duplicates], ignore_index=True)

In [37]:
df0.shape, df1.shape

((14999, 10), (14732, 10))

Successfully removed 267 duplicates (14,999 rows before, 14,732 after). These duplicates had a low probability of occurring due to random chance and likely occurred due to some other fault in the data collection process.
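The reported shapes can be cross-checked with a little arithmetic. The sketch below uses only the three row counts printed in the outputs above; the variable names are illustrative.

```python
# Row counts taken from the notebook outputs above.
n_original = 14999        # df0
n_final = 14732           # df1
n_kept_duplicates = 5079  # cleaned_duplicates

n_removed = n_original - n_final                # rows dropped overall
n_non_duplicates = n_final - n_kept_duplicates  # rows that were never duplicated
n_duplicates = n_original - n_non_duplicates    # rows flagged with keep=False

print(n_removed)                           # 267
print(n_duplicates)                        # 5346
print(round(n_removed / n_duplicates, 3))  # 0.05
```

The dropped rows make up about 5% of the duplicated rows, which matches the 0.05 quantile used as the threshold.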

In [38]:
# Saving the cleaned dataset
df1.to_csv("hr_dataset_1.csv", index=False)