# 🧼 Data Processing

This notebook performs the essential preprocessing steps required to prepare the employee dataset for machine learning modeling. It includes data loading, cleaning, encoding, and saving the processed output.

In [1]:
# 📥 Load Raw Employee Dataset

import pandas as pd

file_path = '../../data/raw/employee_data.xlsx'
df = pd.read_excel(file_path)

df.head()

| | EmpNumber | Age | Gender | EducationBackground | MaritalStatus | EmpDepartment | EmpJobRole | BusinessTravelFrequency | DistanceFromHome | EmpEducationLevel | ... | EmpRelationshipSatisfaction | TotalWorkExperienceInYears | TrainingTimesLastYear | EmpWorkLifeBalance | ExperienceYearsAtThisCompany | ExperienceYearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | Attrition | PerformanceRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E1001007 | 40 | Male | Life Sciences | Married | Sales | Sales Executive | Travel_Frequently | 5 | 4 | ... | 3 | 20 | 2 | 3 | 18 | 13 | 1 | 12 | No | 4 |
| 1 | E1001025 | 30 | Male | Marketing | Divorced | Sales | Sales Executive | Travel_Rarely | 27 | 5 | ... | 4 | 10 | 2 | 2 | 8 | 7 | 7 | 7 | No | 4 |
| 2 | E1001054 | 52 | Male | Marketing | Married | Sales | Manager | Travel_Rarely | 3 | 4 | ... | 1 | 34 | 3 | 4 | 34 | 6 | 1 | 16 | No | 4 |
| 3 | E1001059 | 25 | Female | Medical | Single | Sales | Sales Executive | Travel_Rarely | 26 | 1 | ... | 2 | 6 | 5 | 2 | 6 | 5 | 1 | 4 | No | 4 |
| 4 | E1001064 | 34 | Male | Other | Single | Sales | Sales Executive | Travel_Rarely | 2 | 3 | ... | 3 | 6 | 5 | 3 | 6 | 5 | 1 | 4 | No | 4 |


### 🔍 1. Inspect Dataset Structure

We inspect the dataset's shape, column names, and data types, then review summary statistics for every column.

In [2]:
# 🔍 Inspect Dataset Structure

print("Shape:", df.shape)
print("\nColumns:\n", df.columns.tolist())
print("\nData Types:\n", df.dtypes)

df.describe(include='all')

Shape: (86, 28)

Columns:
 ['EmpNumber', 'Age', 'Gender', 'EducationBackground', 'MaritalStatus', 'EmpDepartment', 'EmpJobRole', 'BusinessTravelFrequency', 'DistanceFromHome', 'EmpEducationLevel', 'EmpEnvironmentSatisfaction', 'EmpHourlyRate', 'EmpJobInvolvement', 'EmpJobLevel', 'EmpJobSatisfaction', 'NumCompaniesWorked', 'OverTime', 'EmpLastSalaryHikePercent', 'EmpRelationshipSatisfaction', 'TotalWorkExperienceInYears', 'TrainingTimesLastYear', 'EmpWorkLifeBalance', 'ExperienceYearsAtThisCompany', 'ExperienceYearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager', 'Attrition', 'PerformanceRating']

Data Types:
 EmpNumber                       object
Age                              int64
Gender                          object
EducationBackground             object
MaritalStatus                   object
EmpDepartment                   object
EmpJobRole                      object
BusinessTravelFrequency         object
DistanceFromHome                 int64
EmpEducationLe

| | EmpNumber | Age | Gender | EducationBackground | MaritalStatus | EmpDepartment | EmpJobRole | BusinessTravelFrequency | DistanceFromHome | EmpEducationLevel | ... | EmpRelationshipSatisfaction | TotalWorkExperienceInYears | TrainingTimesLastYear | EmpWorkLifeBalance | ExperienceYearsAtThisCompany | ExperienceYearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | Attrition | PerformanceRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 86 | 86.0 | 86 | 86 | 86 | 86 | 86 | 86 | 86.0 | 86.0 | ... | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86 | 86.0 |
| unique | 86 | | 2 | 6 | 3 | 4 | 12 | 3 | | | ... | | | | | | | | | 2 | |
| top | E1001007 | | Male | Life Sciences | Married | Research & Development | Sales Executive | Travel_Rarely | | | ... | | | | | | | | | No | |
| freq | 1 | | 53 | 35 | 36 | 41 | 25 | 53 | | | ... | | | | | | | | | 75 | |
| mean | | 37.209302 | | | | | | | 8.918605 | 2.906977 | ... | 2.534884 | 11.790698 | 2.755814 | 3.069767 | 7.55814 | 4.546512 | 2.151163 | 4.197674 | | 4.0 |
| std | | 9.577076 | | | | | | | 8.569735 | 1.047438 | ... | 1.048222 | 8.330124 | 1.094758 | 0.664932 | 7.310918 | 3.92775 | 3.414562 | 3.946166 | | 0.0 |
| min | | 18.0 | | | | | | | 1.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | | 4.0 |
| 25% | | 31.0 | | | | | | | 2.0 | 2.0 | ... | 2.0 | 7.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 1.0 | | 4.0 |
| 50% | | 37.0 | | | | | | | 6.5 | 3.0 | ... | 3.0 | 10.0 | 3.0 | 3.0 | 6.0 | 4.0 | 1.0 | 3.5 | | 4.0 |
| 75% | | 43.0 | | | | | | | 12.0 | 4.0 | ... | 3.0 | 14.75 | 3.0 | 3.75 | 8.75 | 7.0 | 2.0 | 7.0 | | 4.0 |


### ❓ 2. Handle Missing Values

We count the missing values in each column and then impute them. For simplicity, missing numerical values are filled with 0 and missing categorical values with `'Unknown'`, so the dataset is complete before encoding. In this run no column contains missing values, so the fill step leaves the data unchanged.

In [4]:
# ❓ Handle Missing Values (Safe Version)

# Check missing values
missing = df.isnull().sum()
print("Missing values per column:\n", missing[missing > 0])

# Fill missing values safely
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna('Unknown')
    else:
        df[col] = df[col].fillna(0)

# Confirm no missing values remain
df.isnull().sum().sum()

Missing values per column:
 Series([], dtype: int64)


np.int64(0)
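
To see the fill logic act on data that does contain gaps, here is a minimal sketch on a small made-up DataFrame. The values and the helper name `fill_missing` are hypothetical and not part of this notebook. As an assumption about how the dtype check could be made more robust, it uses `pandas.api.types.is_numeric_dtype` rather than comparing against `'object'`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with one gap per column (not the employee dataset)
toy = pd.DataFrame({
    "Age": [40.0, np.nan, 52.0],
    "Gender": ["Male", None, "Female"],
})

def fill_missing(frame):
    """Return a copy with numeric gaps set to 0 and all other gaps set to 'Unknown'."""
    out = frame.copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(0)
        else:
            out[col] = out[col].fillna("Unknown")
    return out

filled = fill_missing(toy)
print(filled)
print("Remaining missing values:", int(filled.isnull().sum().sum()))
```

Working on a copy keeps the raw frame available for comparison, which can help when deciding whether 0 is a sensible placeholder for a given numeric column.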

### 🔄 3. Encode Categorical Variables

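We apply label encoding to convert each categorical column into integer codes. `EmpNumber` is excluded because it is a unique identifier, not a predictive feature. The codes follow the sorted order of the labels in each column and do not imply any ranking between categories.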

In [6]:
# 🔄 Encode Categorical Variables (excluding EmpNumber)

from sklearn.preprocessing import LabelEncoder

# Categorical columns to encode (excluding EmpNumber)
cat_cols = [
    'Gender', 'EducationBackground', 'MaritalStatus',
    'EmpDepartment', 'EmpJobRole', 'BusinessTravelFrequency',
    'OverTime', 'Attrition'
]

# Fit a separate encoder per column and keep it, so the integer codes
# can be mapped back to the original labels later
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
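
The integer codes are easier to interpret when the label-to-code mapping is kept alongside the data. The sketch below is an illustration on made-up values, not the notebook's own method: it fits one `LabelEncoder` per column, reads the mapping from `classes_`, and restores the original labels with `inverse_transform`. The toy frame and the names `encoders`, `mapping`, and `restored` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (not the employee dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Attrition": ["No", "Yes", "No"],
})

encoders = {}
encoded = toy.copy()
for col in ["Gender", "Attrition"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder for this column

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)

# Map the codes back to the original labels
restored = encoders["Gender"].inverse_transform(encoded["Gender"])
print(list(restored))
```

Keeping one encoder per column matters because a single shared encoder only remembers the classes of the last column it was fitted on.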

### 💾 4. Save Processed Dataset

We save the cleaned and encoded dataset to the `data/processed/` directory for use in modeling and analysis.

In [7]:
# 💾 Save Processed Dataset

output_path = '../../data/processed/employee_data_cleaned.csv'
df.to_csv(output_path, index=False)

print("✅ Dataset saved to:", output_path)

✅ Dataset saved to: ../../data/processed/employee_data_cleaned.csv
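
A quick way to confirm that a save step worked is to read the file back and compare it with the in-memory frame. The sketch below is a hedged illustration that writes made-up data to a temporary directory, so it does not touch the project's `data/processed/` folder. The file name `toy_cleaned.csv` and the toy frame are hypothetical. It also creates the output directory first, since `to_csv` does not create missing parent directories.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data (not the employee dataset)
toy = pd.DataFrame({"EmpNumber": ["E1", "E2"], "Age": [40, 30]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists

    out_file = out_dir / "toy_cleaned.csv"
    toy.to_csv(out_file, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(out_file)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```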
