# **IBM HR Analytics Employee Attrition & Performance**



## Introduction

Understanding why employees leave — and how performance trends shape workforce retention — is crucial for building resilient, high-performing teams. This project explores the *IBM HR Analytics Employee Attrition & Performance* dataset to uncover patterns behind employee attrition, performance, and engagement. The goal is not only to identify who is likely to leave, but to reveal *why*, and what actionable strategies might retain talent.

By translating raw data into meaningful insights, this analysis aims to support HR decision-making and shape long-term workforce strategy.

## Business Case
Employee attrition affects productivity, morale, and costs. The goal is to uncover patterns that explain why employees leave and to recommend retention strategies based on data.


## Key Questions to Explore

- What personal and professional attributes (e.g., age, education, tenure, income) are linked to higher attrition risk?
- Do performance ratings correlate with attrition, satisfaction, or promotion likelihood?
- Can we identify early warning signs of voluntary attrition?
- How do departmental trends influence turnover, especially in high-performing teams?




## Project Objectives

- Perform descriptive and exploratory data analysis (EDA) to profile employee groups
- Build visual dashboards to support HR storytelling and data-driven conversations
- Apply pattern recognition and machine learning techniques to model attrition risk
- Translate findings into recommendations for retention and performance optimization

---


# Importing libraries required for this project

In [11]:
# We will use Pandas for data manipulation, NumPy for numerical operations, and Matplotlib & Seaborn for visualisation
%pip install pandas numpy matplotlib seaborn

Note: you may need to restart the kernel to use updated packages.

# Loading & Inspecting the Dataset

In [None]:
import pandas as pd

# Note: This dataset is part of the IBM HR Analytics series; here it is used for workforce planning and predictive modeling.
# Load the dataset from the project's raw-data folder
df = pd.read_csv('../data/inputs/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Display the first few rows of the dataset
df.head()

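|   | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|-----|-----------|----------------|-----------|------------|------------------|-----------|----------------|---------------|----------------|-----|--------------------------|---------------|------------------|-------------------|-----------------------|-----------------|----------------|--------------------|-------------------------|----------------------|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |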


In [None]:
# Display basic statistics of the dataset
df.describe()


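|       | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|-------|-----|-----------|------------------|-----------|---------------|----------------|-------------------------|------------|----------------|----------|-----|--------------------------|---------------|------------------|-------------------|-----------------------|-----------------|----------------|--------------------|-------------------------|----------------------|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |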


# Core Statistical Concepts in Data Analysis

Data analysis is built on fundamental statistical concepts. These principles help us summarize datasets, understand variability, test assumptions, and make predictions. The summary statistics above already illustrate several of them: central tendency (mean), variability (std), range (min to max), and distribution shape (the 25%, 50%, and 75% percentiles).

Below, we explain key concepts and why they are critical.

---

## Mean, Median, and Standard Deviation

- **Mean (Average):**
  The sum of all values divided by the number of values. Represents the "center" of the data.
  
   *Strength:* Simple and widely used.  
   *Weakness:* Sensitive to outliers.

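- **Median:**
  The middle value when data is sorted. More robust than the mean for skewed data. In the summary above, for example, `YearsAtCompany` has a mean of about 7.0 but a median of 5.0, which points to a right-skewed distribution.
  
   *Strength:* Resistant to extreme values.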

- **Standard Deviation (SD):**
  Measures how spread out the data is around the mean. A small SD means the data points sit close to the mean; a large SD means high variability.
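
The sensitivity of the mean to outliers, and the robustness of the median, can be seen in a small sketch. The tenure values below are hypothetical (they are not drawn from the dataset), and the helper `summarize` is an illustrative name. The sketch uses Python's standard `statistics` module, whose `stdev` is the sample standard deviation, the same convention pandas uses for `std` in `describe()`.

```python
import statistics

# Hypothetical tenure values in years; the last value is a deliberate outlier.
tenure_years = [2, 3, 5, 5, 7, 9, 40]


def summarize(values):
    """Return the mean, median, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample SD (divides by n - 1)
    }


with_outlier = summarize(tenure_years)
without_outlier = summarize(tenure_years[:-1])  # drop the outlier

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

Dropping the single extreme value moves the mean and the standard deviation substantially, while the median stays at 5 in both cases.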

---

## Hypothesis Testing

Hypothesis testing helps assess whether observed patterns are due to chance.

- **Null Hypothesis (H₀):** Assumes no effect or difference.
- **Alternative Hypothesis (H₁):** States that an effect or difference exists.
- **p-value:** Probability of obtaining results at least as extreme as those observed, assuming H₀ is true.
- **Significance Level (α):** Commonly set at 0.05. If p-value < α, reject H₀.

*Why important?* It gives analysts a principled way to separate real effects from random variation before acting on them.
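
As one possible illustration of this workflow, the sketch below runs a two-sided two-proportion z-test on the attrition rates of two groups. The counts are hypothetical and not taken from the dataset, the function name `two_proportion_z_test` is an illustrative choice, and the z-test is only one of several tests that could be used for this kind of comparison.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Under H0 both groups share one rate, so pool the counts.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 30 of 100 leavers in group A, 15 of 100 in group B.
z, p_value = two_proportion_z_test(30, 100, 15, 100)

alpha = 0.05
reject_null = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

With these made-up counts the p-value falls below α = 0.05, so H₀ (equal attrition rates) would be rejected.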

---

## Basic Probability

Probability measures how likely an event is to occur (0 = impossible, 1 = certain).

- **Independent Events:** The occurrence of one event does not change the probability of the other.
- **Conditional Probability:** Probability of event A given that B has occurred.

*Application in Data Analysis:* Underpins predictive models, simulations, and risk assessments.
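
Both ideas can be sketched with a handful of records. The `(works_overtime, left_company)` pairs below are hypothetical, not real data, and the helper `probability` is an illustrative name. The sketch estimates P(left) and P(left | overtime) as exact fractions and treats the two events as independent only if those values match.

```python
from fractions import Fraction

# Hypothetical (works_overtime, left_company) records, not taken from the dataset.
records = [
    (True, True), (True, True), (True, False), (True, False),
    (False, True), (False, False), (False, False),
    (False, False), (False, False), (False, False),
]


def probability(rows, event):
    """Share of rows for which event(row) is true, as an exact fraction."""
    return Fraction(sum(1 for row in rows if event(row)), len(rows))


# Unconditional probability of leaving.
p_left = probability(records, lambda row: row[1])

# Conditional probability: restrict to the rows where overtime is True.
overtime_rows = [row for row in records if row[0]]
p_left_given_overtime = probability(overtime_rows, lambda row: row[1])

# The events are independent only if conditioning does not change the probability.
independent = p_left_given_overtime == p_left

print(f"P(left) = {p_left}")
print(f"P(left | overtime) = {p_left_given_overtime}")
print(f"independent: {independent}")
```

In this toy sample, conditioning on overtime raises the probability of leaving from 3/10 to 1/2, so the two events are not independent.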

---

## Summary Table

| Concept              | Role in Data Analysis                                    |
|----------------------|----------------------------------------------------------|
| Mean & Median        | Summarize central trends in data.                        |
| Standard Deviation   | Quantify variability and identify outliers.              |
| Hypothesis Testing   | Validate assumptions and guide data-driven decisions.    |
| Probability          | Foundation for modeling, forecasting, and simulations.   |


## **Dataset Overview and Other Quick Observations**

- Target variable for attrition prediction: `Attrition`.
- Many variables measure satisfaction or involvement (e.g., `JobSatisfaction`, `EnvironmentSatisfaction`, `RelationshipSatisfaction`).
- `EmployeeCount` and `StandardHours` do not vary: the summary statistics above show a standard deviation of 0 for both (every value is 1 and 80, respectively), so they carry no information for modeling.
- Shape: 1,470 rows × 35 columns.
- No missing values in any column.
- Mix of numerical and categorical data.
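
Observations like these can be checked programmatically. The sketch below is one way to do it, assuming pandas is installed; the helper `profile` is an illustrative name, not part of pandas. To stay self-contained it runs on a five-row frame whose values are copied from the `df.head()` output above, restricted to a few columns; on the full dataset the same call would be `profile(df)`.

```python
import pandas as pd

# Five rows copied from the df.head() output, restricted to a few columns.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "Department": ["Sales"] + ["Research & Development"] * 4,
    "EmployeeCount": [1, 1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80, 80],
})


def profile(frame):
    """Report the shape, total missing values, and constant columns of a frame."""
    constant_columns = [
        column for column in frame.columns
        if frame[column].nunique(dropna=False) == 1  # a single distinct value
    ]
    return {
        "shape": frame.shape,
        "missing_values": int(frame.isna().sum().sum()),
        "constant_columns": constant_columns,
    }


report = profile(sample)
print(report)
```

Columns flagged as constant are natural candidates to drop before modeling, since they cannot help distinguish employees who leave from those who stay.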

| Column Name               | Description                                                     | Type              |
| ------------------------- | --------------------------------------------------------------- | ----------------- |
| `Attrition`               | Whether the employee left the company (Yes/No)                  | Categorical       |
| `Age`                     | Employee’s age                                                  | Numeric           |
| `BusinessTravel`          | Frequency of business travel                                    | Categorical       |
| `Department`              | Department name (Sales, Research & Development, Human Resources) | Categorical       |
| `DistanceFromHome`        | Distance from home to work                                      | Numeric           |
| `Education`               | Education level (1–5 scale)                                     | Numeric (ordinal) |
| `EducationField`          | Field of education (Life Sciences, Medical, etc.)               | Categorical       |
| `Gender`                  | Male or Female                                                  | Categorical       |
| `JobRole`                 | Job role/title (e.g., Sales Executive, Research Scientist)      | Categorical       |
| `JobSatisfaction`         | Job satisfaction level (1–4 scale)                              | Numeric (ordinal) |
| `MaritalStatus`           | Marital status (Single, Married, Divorced)                      | Categorical       |
| `MonthlyIncome`           | Monthly salary                                                  | Numeric           |
| `OverTime`                | Whether the employee works overtime                             | Categorical       |
| `TotalWorkingYears`       | Total years of experience                                       | Numeric           |
| `YearsAtCompany`          | Years spent at the current company                              | Numeric           |
| `YearsSinceLastPromotion` | Years since the employee was last promoted                      | Numeric           |
| `WorkLifeBalance`         | Work-life balance rating (1–4 scale)                            | Numeric (ordinal) |
