In [1]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score

import xgboost as xgb

In [2]:
# Default display options
pd.set_option("display.max_columns", None)
sns.set_style("whitegrid")

In [3]:
# Loading dataset
df = pd.read_csv("../data/HR_capstone_dataset.csv")

In [4]:
# Displaying first 5 rows
df.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


In [5]:
# Dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


## Initial Dataset Overview

The dataset contains **14,999 employee records** with **10 variables**.

There are:
- 2 continuous variables (satisfaction_level, last_evaluation)
- 6 integer-valued variables: 3 counts (number_project, average_montly_hours, time_spend_company) and 3 binary flags (Work_accident, promotion_last_5years, and the target left)
- 2 categorical variables (Department, salary)

No missing values are present across any columns.

The data is complete, so no imputation is required.

In [10]:
# Descriptive statistics
df.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,0.612834,0.716102,3.803054,201.050337,3.498233,0.14461,0.238083,0.021268
std,0.248631,0.171169,1.232592,49.943099,1.460136,0.351719,0.425924,0.144281
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


## Descriptive Statistics Summary

Key observations:

- Average satisfaction level: **0.61**
- Average evaluation score: **0.72**
- Employees work ~**201 hours per month** on average
- Average tenure: **3.5 years**
- Most employees work on **3–5 projects**
- Promotions in last 5 years are rare (~2%)

Turnover rate (target variable):
- **23.8% of employees left**
- **76.2% stayed**

The dataset shows moderate class imbalance, but not severe enough to prevent modeling. However, evaluation metrics beyond accuracy will be important.

In [7]:
# Checking class distribution
df["left"].value_counts()

left
0    11428
1     3571
Name: count, dtype: int64

In [8]:
df["left"].value_counts(normalize=True)

left
0    0.761917
1    0.238083
Name: proportion, dtype: float64

## Target Variable Distribution

Approximately 24% of employees in the dataset have left the company.

This imbalance suggests that:

- Accuracy alone may be misleading.
- Metrics such as recall, precision, and ROC-AUC should be prioritized.
- Stratified sampling will be used during train-test split.
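
To illustrate why accuracy alone can mislead here, the sketch below scores a hypothetical baseline that predicts "stayed" for every employee. It rebuilds only the target from the class counts shown above; the baseline itself is an assumption for illustration and is not one of the models used in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the value_counts() output above
n_stayed, n_left = 11428, 3571
y_true = np.array([0] * n_stayed + [1] * n_left)

# Hypothetical baseline: predict "stayed" (0) for every employee
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # recall for the positive class (left = 1)

print(f"Accuracy: {baseline_accuracy:.3f}")
print(f"Recall for leavers: {baseline_recall:.3f}")
```

The baseline reaches roughly 76% accuracy while identifying none of the employees who left, which is why recall and ROC-AUC are the more informative metrics for this problem.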

## Analytical Observations
1. Typo in *average_montly_hours*
2. Promotion rate is very low as mean = 0.021, i.e., only ~2% promoted in the last 5 years
3. Work accident rate is ~14% (mean = 0.145); it may be associated with turnover
4. satisfaction_level ranges from 0.09 to 1.0 with a standard deviation of 0.25, so it varies enough across employees to be a candidate predictor

In [12]:
# Checking for duplicates
df.duplicated().sum()

np.int64(3008)

## Duplicate Records Check
The dataset contains 3008 duplicate rows which is approximately 20% of the total records.

Duplicate employee entries can distort model performance and bias evaluation results: identical rows can end up in both the training and test sets, which inflates test scores.

To ensure the integrity of the model, duplicate rows will be removed.
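
Before dropping duplicates, it can help to look at the repeated rows and see how they split across the target classes. The sketch below does this on a small toy frame; the values and the names `toy`, `all_copies`, and `repeats_by_class` are illustrative assumptions, not part of the HR dataset.

```python
import pandas as pd

# Toy frame standing in for the HR data (values are illustrative only)
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.11, 0.80, 0.72],
    "number_project": [2, 5, 2, 7, 5, 5],
    "left": [1, 1, 1, 1, 1, 0],
})

# duplicated() flags every repeat after the first occurrence
n_repeats = toy.duplicated().sum()

# keep=False flags all members of each duplicate group, which helps manual inspection
all_copies = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))

# How the repeated rows split across the target classes
repeats_by_class = toy.loc[toy.duplicated(), "left"].value_counts()

# drop_duplicates() keeps the first occurrence of each group
deduped = toy.drop_duplicates()

print(n_repeats, len(deduped))
print(all_copies)
print(repeats_by_class)
```

If the repeats are concentrated in one class, removing them will shift the class balance, so the target distribution should be rechecked afterwards.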

In [13]:
# Removing duplicate rows
df = df.drop_duplicates()

# Confirming removal
df.duplicated().sum()

np.int64(0)

In [14]:
df.shape

(11991, 10)

In [15]:
# Standardizing column names
df.columns = df.columns.str.lower().str.strip()

# Fixing misspelled column name
df = df.rename(columns={"average_montly_hours": "average_monthly_hours"})

df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_monthly_hours', 'time_spend_company', 'work_accident', 'left',
       'promotion_last_5years', 'department', 'salary'],
      dtype='object')

# Exploratory Data Analysis (EDA)

The goal of this stage is to understand patterns in employee turnover and identify potential predictors for modeling.

We begin by examining:

- Class balance of the target variable
- Relationships between turnover and key features
- Distribution patterns of numerical variables

In [17]:
df["left"].value_counts()

left
0    10000
1     1991
Name: count, dtype: int64

In [18]:
df["left"].value_counts(normalize=True)

left
0    0.833959
1    0.166041
Name: proportion, dtype: float64

## Updated Target Distribution (After Removing Duplicates)

- Employees who stayed: 10,000 (83.4%)
- Employees who left: 1,991 (16.6%)

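The dataset still shows moderate class imbalance, and it is more pronounced than before: the turnover rate dropped from 23.8% to 16.6% because 1,580 of the 3,008 removed duplicate rows belonged to employees who left.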

Implication for modeling:
- Accuracy alone may be misleading.
- Recall for employees who left will be particularly important.
- Stratified sampling will be used during train-test split.
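
A minimal sketch of the stratified split mentioned above, using only the deduplicated class counts. The 80/20 ratio, the seed, and the placeholder feature column `row_id` are assumptions for illustration; the real split would use the feature columns of `df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild only the target column from the counts above
y = pd.Series([0] * 10000 + [1] * 1991, name="left")

# Placeholder feature matrix; the real features would come from df
X = pd.DataFrame({"row_id": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # assumed split ratio
    stratify=y,        # preserve the class proportions in both subsets
    random_state=42,   # assumed seed, for reproducibility
)

print(f"Train turnover rate: {y_train.mean():.3f}")
print(f"Test turnover rate:  {y_test.mean():.3f}")
```

With `stratify=y`, both subsets keep a turnover rate close to the overall 16.6%, so the test metrics are computed on a class mix that matches the full dataset.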

In [19]:
df.groupby("left")["satisfaction_level"].mean()

left
0    0.667365
1    0.440271
Name: satisfaction_level, dtype: float64