![HR Analytics](../Images/Data_Preperation.png)

---

**_The key to success in any organization lies in its ability to attract and retain top talent. As part of the Unified Mentor program, I undertook the IBM HR Analytics Employee Attrition & Performance project to analyze critical factors influencing employee turnover._**

**_This analysis is designed to support HR Analysts in identifying the underlying reasons why employees choose to stay or leave an organization. By leveraging data-driven insights, the project highlights patterns related to job satisfaction, work-life balance, compensation, and other key variables impacting attrition._**

**_Understanding these factors empowers HR teams to implement proactive strategies aimed at improving employee retention, enhancing workplace satisfaction, and ultimately reducing the costly cycle of hiring and training new staff. The predictive models and visual insights developed in this project serve as valuable tools to help organizations retain their top performers and foster a stable, productive workforce._**

---
# <span style="background-color:#393be5; color:white; padding:10px;border-radius:15px; text-align:center;">OUTLINE</span>

1. **Import Libraries**  
   Load the necessary Python tools to work with data and perform analysis.

2. **Load the Data**  
   Open the dataset and get it ready for analysis.

3. **Explore and Clean the Data (Data Wrangling)**  
   - Understanding Dataset Dimensions  
   - Exploring Dataset Structure  
   - Attribute Summary  
   - Feature Classification  
   - Decoding Encoded Features  
   - Look for any missing or empty data.  
   - Summarize the numeric data to spot trends (like average salary, age range).  
   - Remove columns that aren’t useful for analysis.  
   - Review category-based columns to understand common values (e.g., most common job role).  
   - Check how many unique values exist in each category column.

4. **Save the Cleaned Data**  
   Store the cleaned dataset in a new file for future use.

---

# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">IMPORTING VARIOUS MODULES</span>


In [1]:
# Libraries for data manipulation
import numpy as np
import pandas as pd
import io
# Preprocessing utility (label encoding)
from sklearn.preprocessing import LabelEncoder
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">PROVIDED DATASET</span>

In [2]:
# Load the dataset
employee_data = '../WA_Fn-UseC_-HR-Employee-Attrition.csv'
df = pd.read_csv(employee_data)

# Display the first 5 rows
df.head()

|  | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [3]:
df.tail()

|  | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1465 | 36 | No | Travel_Frequently | 884 | Research & Development | 23 | 2 | Medical | 1 | 2061 | ... | 3 | 80 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 1466 | 39 | No | Travel_Rarely | 613 | Research & Development | 6 | 1 | Medical | 1 | 2062 | ... | 1 | 80 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 1467 | 27 | No | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | 1 | 2064 | ... | 2 | 80 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 1468 | 49 | No | Travel_Frequently | 1023 | Sales | 2 | 3 | Medical | 1 | 2065 | ... | 4 | 80 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 1469 | 34 | No | Travel_Rarely | 628 | Research & Development | 8 | 3 | Medical | 1 | 2068 | ... | 1 | 80 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |


---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Understanding Dataset Dimensions</span>

In [4]:
# Checking the shape of the dataset
rows, columns = df.shape

rows, columns

(1470, 35)

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Exploring Dataset Structure</span>

In [5]:
# Listing all columns in the dataset
df.columns.tolist()


['Age',
 'Attrition',
 'BusinessTravel',
 'DailyRate',
 'Department',
 'DistanceFromHome',
 'Education',
 'EducationField',
 'EmployeeCount',
 'EmployeeNumber',
 'EnvironmentSatisfaction',
 'Gender',
 'HourlyRate',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'Over18',
 'OverTime',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StandardHours',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Attribute Summary</span>

**We’ll generate basic info:**
- Data types
- Non-null counts
- Memory usage

In [6]:
# Generating basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

***Attribute Summary:***
- Total Entries: 1470
- Total Features: 35

**Data Types:**
- 26 Numerical (int64)
- 9 Categorical (object)

**No missing values detected so far (all columns show 1470 non-null).**
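The 26 / 9 split quoted above can also be counted in code rather than read off the `info()` output. The following is a minimal sketch on a toy frame (the name `toy` and its four columns are assumptions for illustration, not the full dataset); it counts numeric and non-numeric columns with `select_dtypes`.

```python
import pandas as pd

# Toy frame standing in for the HR dataset (a few values from the first rows)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

# Count columns by broad dtype family instead of reading them off df.info()
n_numeric = len(toy.select_dtypes(include="number").columns)
n_other = len(toy.select_dtypes(exclude="number").columns)

print(f"Numerical: {n_numeric}, Categorical: {n_other}")
```

Run against the real `df`, the same two lines should reproduce the counts in the summary above.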


---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Feature Classification</span>
- List all Numerical Features.

- List all Categorical Features.

In [7]:
# Identifying Numerical and Categorical Features
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()

numerical_features, categorical_features

(['Age',
  'DailyRate',
  'DistanceFromHome',
  'Education',
  'EmployeeCount',
  'EmployeeNumber',
  'EnvironmentSatisfaction',
  'HourlyRate',
  'JobInvolvement',
  'JobLevel',
  'JobSatisfaction',
  'MonthlyIncome',
  'MonthlyRate',
  'NumCompaniesWorked',
  'PercentSalaryHike',
  'PerformanceRating',
  'RelationshipSatisfaction',
  'StandardHours',
  'StockOptionLevel',
  'TotalWorkingYears',
  'TrainingTimesLastYear',
  'WorkLifeBalance',
  'YearsAtCompany',
  'YearsInCurrentRole',
  'YearsSinceLastPromotion',
  'YearsWithCurrManager'],
 ['Attrition',
  'BusinessTravel',
  'Department',
  'EducationField',
  'Gender',
  'JobRole',
  'MaritalStatus',
  'Over18',
  'OverTime'])

***Numerical Features (26):***

- **All columns stored as integers:** ['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
- Seven of these (Education, EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, PerformanceRating, RelationshipSatisfaction, WorkLifeBalance) are integer-coded rating scales rather than measurements; they are decoded into labels in the next section.

***Categorical Features (9):***

- **All columns stored as text (object):** ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']
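A dtype-based split treats every integer column as numerical, including rating scales such as Education. One rough way to flag those is to look for integer columns with only a handful of distinct values. The sketch below assumes a toy frame and a hypothetical cutoff `MAX_LEVELS`; both are illustrative, and the result is a list of candidates to review by hand, not a definitive classification.

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32],
    "Education": [2, 1, 2, 4, 1, 2],
    "WorkLifeBalance": [1, 3, 3, 3, 3, 2],
    "DailyRate": [1102, 279, 1373, 1392, 591, 1005],
})

MAX_LEVELS = 5  # hypothetical cutoff: at most this many distinct values

# Number of distinct values in each numeric column
levels = toy.select_dtypes(include="number").nunique()

# Few distinct values suggests a coded scale; many suggests a measurement
ordinal_candidates = levels[levels <= MAX_LEVELS].index.tolist()
continuous_candidates = levels[levels > MAX_LEVELS].index.tolist()

print(ordinal_candidates)
print(continuous_candidates)
```

On the full dataset the cutoff would need tuning, since genuine counts such as StockOptionLevel also take few values.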

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Decoding Encoded Features</span>

**Mappings to Apply:**
- Education {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}
- EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, RelationshipSatisfaction {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}
- PerformanceRating {1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'}
- WorkLifeBalance {1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}

In [8]:
# Defining the mappings
education_map = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}
satisfaction_map = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}
performance_map = {1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'}
worklife_map = {1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}

# Applying mappings
df['Education'] = df['Education'].map(education_map)
df['EnvironmentSatisfaction'] = df['EnvironmentSatisfaction'].map(satisfaction_map)
df['JobInvolvement'] = df['JobInvolvement'].map(satisfaction_map)
df['JobSatisfaction'] = df['JobSatisfaction'].map(satisfaction_map)
df['RelationshipSatisfaction'] = df['RelationshipSatisfaction'].map(satisfaction_map)
df['PerformanceRating'] = df['PerformanceRating'].map(performance_map)
df['WorkLifeBalance'] = df['WorkLifeBalance'].map(worklife_map)

# Checking if mapping is successful
df[['Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobSatisfaction', 
    'RelationshipSatisfaction', 'PerformanceRating', 'WorkLifeBalance']].head()


|  | Education | EnvironmentSatisfaction | JobInvolvement | JobSatisfaction | RelationshipSatisfaction | PerformanceRating | WorkLifeBalance |
|---|---|---|---|---|---|---|---|
| 0 | College | Medium | High | Very High | Low | Excellent | Bad |
| 1 | Below College | High | Medium | Medium | Very High | Outstanding | Better |
| 2 | College | Very High | Medium | High | Medium | Excellent | Better |
| 3 | Master | Very High | High | High | High | Excellent | Better |
| 4 | Below College | Low | High | Medium | Very High | Excellent | Better |
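One caveat with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN` without an error. This also means that re-running the mapping cell on already decoded columns would blank them out. The sketch below uses a small hand-made series (with a deliberately invalid code `7`) to show a check for unmapped codes. It also shows one possible way to keep the label order with an ordered categorical; this is an optional extra, not something the notebook does.

```python
import pandas as pd

education_map = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}

# Hand-made codes; 7 is deliberately invalid to trigger the check
codes = pd.Series([2, 1, 4, 7], name="Education")

decoded = codes.map(education_map)

# Codes that had no entry in the dictionary were turned into NaN
unmapped = codes[decoded.isna()].unique().tolist()
print("Codes without a label:", unmapped)

# Optional: an ordered categorical keeps 'Below College' < ... < 'Doctor'
ordered = pd.Categorical(
    decoded.dropna(),
    categories=list(education_map.values()),
    ordered=True,
)
print(ordered.min(), "to", ordered.max())
```

For the real dataset the `unmapped` list is expected to be empty, which the missing-value check in the next section confirms indirectly.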


---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Validating Data Completeness</span>

- Next, let's check for any missing values across the dataset.

In [9]:
# Checking for missing values
missing_values = df.isnull().sum()

# Filtering columns with missing values
missing_values = missing_values[missing_values > 0]

missing_values

Series([], dtype: int64)

- All columns are complete—no null values in the dataset.
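Because the real dataset has no gaps, the check above returns an empty Series. To show what the same idea reports when gaps do exist, here is a small sketch on a toy frame with missing entries planted on purpose (the frame and the `missing_report` name are assumptions for illustration). It adds a percentage column, which is often easier to judge than raw counts.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately planted gaps
toy = pd.DataFrame({
    "Age": [41, np.nan, 37, 33],
    "Department": ["Sales", None, None, "Sales"],
    "DailyRate": [1102, 279, 1373, 1392],
})

missing_count = toy.isna().sum()

# Count and percentage of missing entries per column
missing_report = pd.DataFrame({
    "missing": missing_count,
    "percent": (missing_count / len(toy) * 100).round(1),
})

# Keep only columns that actually have gaps, worst first
missing_report = (
    missing_report[missing_report["missing"] > 0]
    .sort_values("missing", ascending=False)
)
print(missing_report)
```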

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Statistical Summary of Numerical Data</span>

**We’ll run describe() to get:**

- Count
- Mean
- Std deviation
- Min & Max
- Percentiles

In [10]:
# Descriptive statistics for numerical features
numerical_df = df.select_dtypes(include=['int64', 'float64'])
numerical_df.describe()


|  | Age | DailyRate | DistanceFromHome | EmployeeCount | EmployeeNumber | HourlyRate | JobLevel | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 1.0 | 1024.865306 | 65.891156 | 2.063946 | 6502.931293 | 14313.103401 | 2.693197 | 15.209524 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 0.0 | 602.024335 | 20.329428 | 1.10694 | 4707.956783 | 7117.786044 | 2.498009 | 3.659938 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1009.0 | 2094.0 | 0.0 | 11.0 | 80.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 1.0 | 491.25 | 48.0 | 1.0 | 2911.0 | 8047.0 | 1.0 | 12.0 | 80.0 | 0.0 | 6.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 1.0 | 1020.5 | 66.0 | 2.0 | 4919.0 | 14235.5 | 2.0 | 14.0 | 80.0 | 1.0 | 10.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 1.0 | 1555.75 | 83.75 | 3.0 | 8379.0 | 20461.5 | 4.0 | 18.0 | 80.0 | 1.0 | 15.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 1.0 | 2068.0 | 100.0 | 5.0 | 19999.0 | 26999.0 | 9.0 | 25.0 | 80.0 | 3.0 | 40.0 | 6.0 | 40.0 | 18.0 | 15.0 | 17.0 |


**The table reports the count, mean, standard deviation, minimum, maximum, and quartiles for each numerical column.**

**Points worth noting:**
- Only 19 columns appear: the seven rating features decoded earlier are now text, so `describe()` no longer treats them as numerical.
- StandardHours and EmployeeCount are constants (standard deviation 0; every value is 80 and 1, respectively).
- EmployeeNumber is just an ID.
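Constant and ID-like columns can also be flagged programmatically instead of being spotted in the `describe()` table. The sketch below is one possible heuristic on a toy frame (values illustrative): a column with a single distinct value is constant, and a column with as many distinct values as rows behaves like an identifier. The second rule is only a hint, since a continuous measurement can also be unique in every row.

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "EmployeeNumber": [1, 2, 4, 5],
    "Age": [41, 49, 37, 41],
})

n_unique = toy.nunique()

# One distinct value: the column carries no information
constant_cols = n_unique[n_unique == 1].index.tolist()

# As many distinct values as rows: likely an identifier (heuristic only)
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
```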

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Streamlining the Dataset</span>

**Remove columns that don’t add value for analysis or modeling.**

**Columns to Drop:**

| Column | Reason |
|---|---|
| EmployeeCount | Constant value (always 1) |
| Over18 | Constant value ('Y' for all entries) |
| StandardHours | Constant value (always 80) |
| EmployeeNumber | Just a unique ID, no predictive importance |

**Dropping these will clean up the dataset and avoid noise in further steps like EDA or Modeling.**

In [11]:
# Dropping unnecessary columns
df.drop(['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'], axis=1, inplace=True)

# Verifying the shape after dropping
df.shape

(1470, 31)

**We have successfully dropped unnecessary columns.**

- The dataset now contains 31 columns (down from 35).
- No constant or identifier columns remain; the remaining features are candidates for analysis.
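Note that a drop with `inplace=True` raises a `KeyError` if the cell is run a second time, because the columns are already gone. A re-runnable variant, sketched here on a toy frame (the names `toy` and `trimmed` are assumptions), returns a new frame and passes `errors="ignore"` so that missing columns are skipped.

```python
import pandas as pd

cols_to_drop = ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber']

# Toy frame with the four droppable columns plus two to keep
toy = pd.DataFrame({
    "Age": [41, 49],
    "Attrition": ["Yes", "No"],
    "EmployeeCount": [1, 1],
    "Over18": ["Y", "Y"],
    "StandardHours": [80, 80],
    "EmployeeNumber": [1, 2],
})

# Returns a new frame; columns that are absent are silently skipped
trimmed = toy.drop(columns=cols_to_drop, errors="ignore")

# Running the same drop again is harmless
trimmed_again = trimmed.drop(columns=cols_to_drop, errors="ignore")

print(trimmed.shape, trimmed_again.shape)
```

The trade-off is that `errors="ignore"` also hides typos in column names, so a shape check afterwards remains useful.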

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Summary of Categorical Data</span>

**Run describe() specifically on the categorical features to understand:**
- How many unique categories each has.
- The most frequent category.
- How often that category appears.

In [12]:
# Ensuring proper selection of categorical features after mapping
categorical_columns = df.columns[df.dtypes == 'object']

# Descriptive summary for actual categorical columns
categorical_summary = df[categorical_columns].describe()

categorical_summary


|  | Attrition | BusinessTravel | Department | Education | EducationField | EnvironmentSatisfaction | Gender | JobInvolvement | JobRole | JobSatisfaction | MaritalStatus | OverTime | PerformanceRating | RelationshipSatisfaction | WorkLifeBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 |
| unique | 2 | 3 | 3 | 5 | 6 | 4 | 2 | 4 | 9 | 4 | 3 | 2 | 2 | 4 | 4 |
| top | No | Travel_Rarely | Research & Development | Bachelor | Life Sciences | High | Male | High | Sales Executive | Very High | Married | No | Excellent | High | Better |
| freq | 1233 | 1043 | 961 | 572 | 606 | 453 | 882 | 868 | 326 | 459 | 673 | 1054 | 1244 | 459 | 893 |
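The `freq` row gives raw counts; percentages are often easier to interpret, especially for the target. As a small sketch, the Attrition column can be rebuilt from the counts above (1233 "No" out of 1470, which leaves 237 "Yes" because there are only two categories) and passed to `value_counts(normalize=True)`. The reconstructed series is an illustration only; on the real data the same call would be applied to `df['Attrition']`.

```python
import pandas as pd

# Rebuild Attrition from the summary counts: 1233 'No' out of 1470 rows
n_total, n_no = 1470, 1233
attrition = pd.Series(["No"] * n_no + ["Yes"] * (n_total - n_no), name="Attrition")

# Share of each category, as a percentage rounded to one decimal
shares = attrition.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Roughly 84% of employees stayed and 16% left, so the target is imbalanced, which is worth keeping in mind for later modeling steps.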


---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Exploring Categorical Diversity</span>

**Check unique values in each categorical feature to fully understand category distributions.**

In [13]:
# Checking unique values for each categorical attribute
unique_values = {col: df[col].unique() for col in categorical_columns}

unique_values

{'Attrition': array(['Yes', 'No'], dtype=object),
 'BusinessTravel': array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], dtype=object),
 'Department': array(['Sales', 'Research & Development', 'Human Resources'], dtype=object),
 'Education': array(['College', 'Below College', 'Master', 'Bachelor', 'Doctor'],
       dtype=object),
 'EducationField': array(['Life Sciences', 'Other', 'Medical', 'Marketing',
        'Technical Degree', 'Human Resources'], dtype=object),
 'EnvironmentSatisfaction': array(['Medium', 'High', 'Very High', 'Low'], dtype=object),
 'Gender': array(['Female', 'Male'], dtype=object),
 'JobInvolvement': array(['High', 'Medium', 'Very High', 'Low'], dtype=object),
 'JobRole': array(['Sales Executive', 'Research Scientist', 'Laboratory Technician',
        'Manufacturing Director', 'Healthcare Representative', 'Manager',
        'Sales Representative', 'Research Director', 'Human Resources'],
       dtype=object),
 'JobSatisfaction': array(['Very High', 'Mediu

**Unique Values in Categorical Features:**
- Attrition: ['Yes', 'No']
- BusinessTravel: ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel']
- Department: ['Sales', 'Research & Development', 'Human Resources']
- Education: ['College', 'Below College', 'Master', 'Bachelor', 'Doctor']
- EducationField: ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources']
- EnvironmentSatisfaction: ['Medium', 'High', 'Very High', 'Low']
- Gender: ['Female', 'Male']
- JobInvolvement: ['High', 'Medium', 'Very High', 'Low']
- JobRole: ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources']
- JobSatisfaction: ['Very High', 'Medium', 'High', 'Low']
- MaritalStatus: ['Single', 'Married', 'Divorced']
- OverTime: ['Yes', 'No']
- PerformanceRating: ['Excellent', 'Outstanding']
- RelationshipSatisfaction: ['Low', 'Very High', 'Medium', 'High']
- WorkLifeBalance: ['Bad', 'Better', 'Good', 'Best']
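PerformanceRating shows only two of the four labels defined in its mapping. Comparing the observed labels with the mapping makes such gaps explicit. The sketch below copies the observed labels from the list above; the variable names are assumptions for illustration.

```python
performance_map = {1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'}

# Labels actually present in the PerformanceRating column (from the list above)
observed_labels = ['Excellent', 'Outstanding']

# Labels defined in the mapping that never occur in the data
unused_labels = [
    label for label in performance_map.values()
    if label not in observed_labels
]
print(unused_labels)  # ['Low', 'Good']
```

In other words, every employee in this dataset is rated 3 or 4, so PerformanceRating is effectively a two-level feature.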

---
# <span style="background-color:#258683; color:white; padding:10px;border-radius:15px; text-align:center;">Saving the Refined Dataset</span>

**Save this cleaned and prepared dataset into a CSV file for future steps like EDA, Statistical Analysis, and Modeling.**

In [14]:
cleaned_data_path = '../CSV/employee_attrition_cleaned.csv'
df.to_csv(cleaned_data_path, index=False)

cleaned_data_path

'../CSV/employee_attrition_cleaned.csv'
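A quick round-trip check can confirm that the saved file loads back with the same shape and columns. The sketch below writes a toy frame to a temporary directory rather than to the project path, so it is safe to run anywhere. The file name is reused from above; the toy frame is an assumption.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "Age": [41, 49],
    "Education": ["College", "Below College"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "employee_attrition_cleaned.csv")
    toy.to_csv(path, index=False)   # same call as in the notebook
    reloaded = pd.read_csv(path)    # read it straight back

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Keep in mind that CSV stores only text and numbers, so any special dtypes (for example ordered categoricals) would have to be restored after loading.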

---