# 📊 HR Analytics Dataset: Data Cleaning & Preparation

This section details the initial steps taken to clean and prepare the raw HR dataset for subsequent exploratory data analysis (EDA) and predictive modeling. The focus was on handling duplicates, correcting data types, standardizing categorical features, and managing missing values.

### Key Cleaning Actions Performed:

* **Data Structure Confirmation:** The initial dataset contained 1,480 rows and 38 columns.
* **Data Type Standardization:**
    * The `YearsWithCurrManager` column was rounded and converted from a floating-point type to pandas' **nullable integer type (`Int64`)**, so that whole years of tenure are stored as integers while missing entries are preserved as `<NA>`.
* **Duplicate Removal:**
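    * Identified **10 duplicate rows** based on the unique identifier, `EmpID`. The removal step calls `df.drop_duplicates()` without a `subset`, which drops only rows that are identical in every column, so it guarantees one record per `EmpID` only if the duplicated records match exactly.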
* **Irrelevant Feature Identification:**
    * Identified columns that hold a single constant value, i.e. **zero variance** (`EmployeeCount`, `StandardHours`, `Over18`). They are candidates for removal because they provide no discriminative information for modeling.
* **Categorical Consistency Check:**
    * Noticed an inconsistency in the `BusinessTravel` column (`'Travel_Rarely'` vs. `'TravelRarely'`) which will require a simple string standardization step in a later stage.
* **Missing Value Strategy:**
    * The only column with missing data was `YearsWithCurrManager`, which had **57 missing entries**.
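    * These rows were **removed** (listwise deletion), leaving **1,416 rows** according to the notebook output. This does not reconcile with a starting point of 1,470 unique rows (1,470 − 57 = 1,413). A likely explanation is that `df.drop_duplicates()` compares entire rows and therefore left more than 1,470 rows; this should be verified against `df.shape` after deduplication.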

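The resulting dataset, with duplicates and missing values removed, is saved as `cleaned_dataset.csv` for in-depth analysis and feature engineering. The `BusinessTravel` label fix and the removal of the constant columns are left for a later stage.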


In [6]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv('HR_Analytics.csv')
display(df.head())

Unnamed: 0,EmpID,Age,AgeGroup,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,RM297,18,18-25,Yes,Travel_Rarely,230,Research & Development,3,3,Life Sciences,...,3,80,0,0,2,3,0,0,0,0.0
1,RM302,18,18-25,No,Travel_Rarely,812,Sales,10,3,Medical,...,1,80,0,0,2,3,0,0,0,0.0
2,RM458,18,18-25,Yes,Travel_Frequently,1306,Sales,5,3,Marketing,...,4,80,0,0,3,3,0,0,0,0.0
3,RM728,18,18-25,No,Non-Travel,287,Research & Development,5,2,Life Sciences,...,4,80,0,0,2,3,0,0,0,0.0
4,RM829,18,18-25,Yes,Non-Travel,247,Research & Development,8,1,Medical,...,4,80,0,0,0,3,0,0,0,0.0


In [8]:
df.shape

(1480, 38)

In [9]:
df.dtypes

Unnamed: 0,0
EmpID,object
Age,int64
AgeGroup,object
Attrition,object
BusinessTravel,object
DailyRate,int64
Department,object
DistanceFromHome,int64
Education,int64
EducationField,object


In [11]:
# Round first, then cast to the nullable Int64 dtype so missing values stay as <NA>
df['YearsWithCurrManager'] = df['YearsWithCurrManager'].round().astype('Int64')
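
As a minimal sketch of what this conversion does, the snippet below applies the same `round().astype('Int64')` chain to a small made-up series. The values in `years` are illustrative assumptions, not rows from the HR dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical tenure values, including one missing entry
years = pd.Series([0.0, 2.0, np.nan, 7.0])

# Same chain as in the notebook: round, then cast to the nullable integer dtype
converted = years.round().astype("Int64")

print(converted.dtype)         # nullable integer dtype
print(converted.isna().sum())  # the missing entry is still counted as missing
```

A plain `astype(int)` would raise on the missing value, which is why the nullable `Int64` dtype is the convenient choice while the column still contains gaps.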


In [12]:
df.AgeGroup.unique()

array(['18-25', '26-35', '36-45', '46-55', '55+'], dtype=object)

In [13]:
df.Attrition.unique()

array(['Yes', 'No'], dtype=object)

In [14]:
df.BusinessTravel.unique()

array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel', 'TravelRarely'],
      dtype=object)
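
The output above shows two spellings of the same category, `'Travel_Rarely'` and `'TravelRarely'`. One possible way to standardize them, sketched here on a toy frame rather than on the real data, is an exact-value `replace`. The frame name `sample` is an assumption made for illustration.

```python
import pandas as pd

# Toy frame reproducing the four labels reported by df.BusinessTravel.unique()
sample = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "TravelRarely"]
})

# Map the variant spelling onto the canonical label; other values pass through unchanged
sample["BusinessTravel"] = sample["BusinessTravel"].replace({"TravelRarely": "Travel_Rarely"})

travel_labels = sorted(sample["BusinessTravel"].unique())
print(travel_labels)
```

An explicit mapping keeps the fix narrow: only the known variant is rewritten, so any other unexpected label would still show up in a later `unique()` check.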

In [15]:
df.Department.unique()

array(['Research & Development', 'Sales', 'Human Resources'], dtype=object)

In [16]:
df.EducationField.unique()

array(['Life Sciences', 'Medical', 'Marketing', 'Technical Degree',
       'Other', 'Human Resources'], dtype=object)

In [17]:
df.Gender.unique()

array(['Male', 'Female'], dtype=object)

In [18]:
df.MaritalStatus.unique()

array(['Single', 'Divorced', 'Married'], dtype=object)

In [19]:
df.JobRole.unique()

array(['Laboratory Technician', 'Sales Representative',
       'Research Scientist', 'Human Resources', 'Manufacturing Director',
       'Sales Executive', 'Healthcare Representative',
       'Research Director', 'Manager'], dtype=object)

In [20]:
df.SalarySlab.unique()

array(['Upto 5k', '5k-10k', '10k-15k', '15k+'], dtype=object)

In [21]:
df.Over18.unique()

array(['Y'], dtype=object)
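
A single-valued column such as `Over18` can also be detected programmatically instead of by reading each `unique()` output. The sketch below assumes a small illustrative frame (`sample`, with made-up values) and uses `nunique` to list constant columns. It is one way to find such columns, not necessarily the method used in this notebook.

```python
import pandas as pd

# Hypothetical mini-frame: one informative column and three constant ones
sample = pd.DataFrame({
    "Age": [18, 26, 41],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
})

# A column is constant when it has at most one distinct value (NaN counted as a value)
constant_cols = [col for col in sample.columns if sample[col].nunique(dropna=False) <= 1]

# Drop the constant columns to obtain the reduced feature set
reduced = sample.drop(columns=constant_cols)
print(constant_cols)
```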

In [22]:
df.OverTime.unique()

array(['No', 'Yes'], dtype=object)

In [23]:
print(df.describe())

               Age    DailyRate  DistanceFromHome    Education  EmployeeCount  \
count  1480.000000  1480.000000       1480.000000  1480.000000         1480.0   
mean     36.917568   801.384459          9.220270     2.910811            1.0   
std       9.128559   403.126988          8.131201     1.023796            0.0   
min      18.000000   102.000000          1.000000     1.000000            1.0   
25%      30.000000   465.000000          2.000000     2.000000            1.0   
50%      36.000000   800.000000          7.000000     3.000000            1.0   
75%      43.000000  1157.000000         14.000000     4.000000            1.0   
max      60.000000  1499.000000         29.000000     5.000000            1.0   

       EmployeeNumber  EnvironmentSatisfaction   HourlyRate  JobInvolvement  \
count     1480.000000              1480.000000  1480.000000     1480.000000   
mean      1031.860811                 2.724324    65.845270        2.729730   
std        605.955046            

In [26]:
# keep=False flags every row in a duplicated EmpID group, not only the later copies
dup_empid = df[df.duplicated(subset=['EmpID'], keep=False)]
print(dup_empid)


       EmpID  Age AgeGroup Attrition     BusinessTravel  DailyRate  \
161   RM1465   26    26-35        No      Travel_Rarely       1167   
162   RM1465   26    26-35        No      Travel_Rarely       1167   
210   RM1468   27    26-35        No      Travel_Rarely        155   
211   RM1468   27    26-35        No      Travel_Rarely        155   
327   RM1461   29    26-35        No      Travel_Rarely        468   
328   RM1461   29    26-35        No      Travel_Rarely        468   
457   RM1464   31    26-35        No         Non-Travel        325   
458   RM1464   31    26-35        No         Non-Travel        325   
654   RM1470   34    26-35        No       TravelRarely        628   
655   RM1470   34    26-35        No       TravelRarely        628   
802   RM1466   36    36-45        No  Travel_Frequently        884   
803   RM1466   36    36-45        No  Travel_Frequently        884   
952   RM1463   39    36-45        No      Travel_Rarely        722   
953   RM1467   39   

In [27]:
# No subset is passed, so only rows identical in every column are dropped
df = df.drop_duplicates()
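
Whole-row deduplication and deduplication on `EmpID` can give different results when two records share an ID but differ elsewhere. The following sketch illustrates the difference on made-up data; the IDs `X1` to `X3` and the ages are hypothetical.

```python
import pandas as pd

# Hypothetical records: X1 is an exact duplicate, X2 repeats the ID with a different Age
sample = pd.DataFrame({
    "EmpID": ["X1", "X1", "X2", "X2", "X3"],
    "Age":   [30, 30, 41, 42, 25],
})

# Whole-row comparison: only the exact X1 copy is removed
exact = sample.drop_duplicates()

# Key-based comparison: the first record per EmpID is kept
by_id = sample.drop_duplicates(subset=["EmpID"], keep="first")

print(len(exact), len(by_id))
```

Which variant is appropriate depends on whether conflicting records for the same employee should be kept for review or collapsed to one row.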


In [28]:
df.isnull().sum()

Unnamed: 0,0
EmpID,0
Age,0
AgeGroup,0
Attrition,0
BusinessTravel,0
DailyRate,0
Department,0
DistanceFromHome,0
Education,0
EducationField,0


In [None]:
# Inspect the rows where the manager tenure is missing
missing_manager_years = df[df['YearsWithCurrManager'].isnull()]
print(missing_manager_years)




      EmpID  Age AgeGroup Attrition     BusinessTravel  DailyRate  \
28    RM024   21    18-25        No      Travel_Rarely        391   
31    RM363   21    18-25        No         Non-Travel        895   
45    RM207   22    18-25        No      Travel_Rarely       1136   
99    RM139   25    18-25        No      Travel_Rarely        959   
100   RM256   25    18-25        No      Travel_Rarely        685   
103   RM406   25    18-25       Yes      Travel_Rarely        688   
222   RM405   28    26-35        No      Travel_Rarely       1300   
262   RM072   29    26-35        No      Travel_Rarely       1328   
264   RM206   29    26-35       Yes      Travel_Rarely        121   
268   RM253   29    26-35        No      Travel_Rarely        665   
269   RM255   29    26-35        No      Travel_Rarely       1247   
329   RM008   30    26-35        No      Travel_Rarely       1358   
336   RM140   30    26-35        No      Travel_Rarely       1240   
337   RM144   30    26-35        N

In [30]:
# Listwise deletion: keep only rows where the manager tenure is present
df = df[~df['YearsWithCurrManager'].isnull()].copy()

print("Number of rows after removing missing YearsWithCurrManager:", len(df))

Number of rows after removing missing YearsWithCurrManager: 1416
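
Because the final row count depends on both the deduplication and the missing-value step, it can help to record the count after each stage and check that the numbers add up. The sketch below does this on a small made-up frame. The data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one exact duplicate row and two missing tenure values
sample = pd.DataFrame({
    "EmpID": ["X1", "X1", "X2", "X3", "X4"],
    "YearsWithCurrManager": [1.0, 1.0, np.nan, 3.0, np.nan],
})

n_raw = len(sample)

deduped = sample.drop_duplicates()
n_dedup = len(deduped)

# Count missing values after deduplication, since that is the frame being filtered
n_missing = int(deduped["YearsWithCurrManager"].isna().sum())
cleaned = deduped.dropna(subset=["YearsWithCurrManager"])

# Reconciliation check: rows removed must equal the missing count
assert len(cleaned) == n_dedup - n_missing
print(n_raw, n_dedup, n_missing, len(cleaned))
```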


In [31]:
df.to_csv("cleaned_dataset.csv", index=False)
print("Cleaned dataset saved as 'cleaned_dataset.csv'")

Cleaned dataset saved as 'cleaned_dataset.csv'
