
# CRISP-DM Case Study of HR Attrition Modeling

This notebook demonstrates an end‑to‑end CRISP‑DM workflow for employee attrition modeling.

The goal is NOT to force a predictive model, but to show:
- Proper problem framing
- Feature engineering
- Handling class imbalance
- Honest model evaluation
- Responsible risk scoring
- Business interpretation

Dataset: HumanResources_India.csv (synthetic HR data)



## CRISP‑DM Framework

1. Business Understanding  
2. Data Understanding  
3. Data Preparation  
4. Modeling (Baseline + Advanced)  
5. Evaluation  
6. Deployment Concept (Risk Scoring)  
7. Final Report  

Attrition is treated as a binary classification problem:   
1 = Terminated  
0 = Active


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import pointbiserialr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.calibration import CalibratedClassifierCV

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"]=(10,6)


## 1. Data Understanding

In [2]:

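# Load the synthetic HR dataset
df = pd.read_csv("HumanResources_India.csv")

# Parse date columns; missing or unparseable termdate values become NaT
df["hiredate"] = pd.to_datetime(df["hiredate"])
df["termdate"] = pd.to_datetime(df["termdate"], errors="coerce")
df["birthdate"] = pd.to_datetime(df["birthdate"])

# Age and tenure are both measured up to today's date, including for terminated employees
today = pd.Timestamp.today()
df["age"] = ((today - df["birthdate"]).dt.days / 365).astype(int)
df["tenure_years"] = ((today - df["hiredate"]).dt.days / 365).round(2)

# Target: 1 if a termination date exists, else 0
df["is_terminated"] = df["termdate"].notna().astype(int)

df.head()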


|  | employee_id | first_name | last_name | gender | state | city | hiredate | department | job_title | education_level | salary | performance_rating | overtime | birthdate | termdate | age | tenure_years | is_terminated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00-95822412 | Isaac | Bakshi | Female | Maharashtra | Pune | 2016-10-14 | Customer Service | Help Desk Technician | High School | 334245 | Good | No | 1981-06-16 | 2021-07-05 | 44 | 9.33 | 1 |
| 1 | 00-42868828 | Anvi | Konda | Male | Kerala | Kochi | 2017-03-28 | IT | System Administrator | Bachelor | 579262 | Good | No | 1972-02-25 | 2019-06-14 | 53 | 8.88 | 1 |
| 2 | 00-83197857 | Liam | Chaudry | Male | Karnataka | Mangalore | 2016-09-19 | Operations | Logistics Coordinator | Bachelor | 372853 | Good | No | 1996-03-20 | 2021-03-06 | 29 | 9.4 | 1 |
| 3 | 00-13999315 | Gagan | Sami | Male | Karnataka | Mangalore | 2016-01-13 | Operations | Inventory Specialist | Bachelor | 402906 | Good | No | 1986-04-05 | 2018-11-06 | 39 | 10.08 | 1 |
| 4 | 00-90801586 | Ayushman | Chander | Male | Uttar Pradesh | Varanasi | 2015-03-26 | IT | Software Developer | Bachelor | 557955 | Good | No | 1990-12-13 | 2017-11-29 | 35 | 10.88 | 1 |
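
The `tenure_years` column above runs every employee's clock to today's date, which is why the first row (hired 2016-10-14, terminated 2021-07-05) shows more than nine years. A minimal sketch of one alternative, assuming tenure should stop at `termdate` for leavers. The helper name `tenure_years_until` and the reference date are hypothetical and not part of the notebook:

```python
from datetime import date
from typing import Optional


def tenure_years_until(hire: date, term: Optional[date], as_of: date) -> float:
    """Years of service, ending at the termination date when there is one."""
    end = term if term is not None else as_of
    return round((end - hire).days / 365, 2)


# Assumed reference date standing in for "today".
reference = date(2026, 1, 1)

# First row of the preview: hired 2016-10-14, terminated 2021-07-05.
leaver = tenure_years_until(date(2016, 10, 14), date(2021, 7, 5), as_of=reference)
# An active employee with the same hire date is measured up to the reference date.
active = tenure_years_until(date(2016, 10, 14), None, as_of=reference)
print(leaver, active)
```

Whether this definition is appropriate depends on the question being asked: tenure at exit is only known after the fact, so it describes leavers but is not available at prediction time.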


## 2. Feature Engineering

In [3]:

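# Binary overtime flag; missing or unexpected values default to 0
df["overtime_flag"] = df["overtime"].str.lower().map({"yes": 1, "no": 0}).fillna(0)

# Ordinal encoding of performance; unmapped or missing ratings default to 1
perf_map = {"Needs Improvement": 0, "Satisfactory": 1, "Good": 2, "Excellent": 3}
df["perf_score"] = df["performance_rating"].map(perf_map).fillna(1)

# Ordinal encoding of education; unmapped or missing levels default to 1
edu_map = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}
df["edu_ord"] = df["education_level"].map(edu_map).fillna(1)

# Salary relative to the median salary of the employee's department
dept_med = df.groupby("department")["salary"].median()
df["salary_to_dept_median"] = df["salary"] / df["department"].map(dept_med)

# Feature matrix and binary target
features = ["age", "tenure_years", "salary", "salary_to_dept_median", "overtime_flag", "perf_score", "edu_ord"]
X = df[features].fillna(0)
y = df["is_terminated"]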


## 3. Feature Association with Attrition

In [4]:

results=[]
for f in features:
    r,p=pointbiserialr(y,X[f])
    results.append((f,r,p))

pd.DataFrame(results,columns=["feature","pb_corr","p_value"]).sort_values("pb_corr",key=abs,ascending=False)


|  | feature | pb_corr | p_value |
|---|---|---|---|
| 5 | perf_score | 0.018311 | 0.083237 |
| 0 | age | -0.014026 | 0.184575 |
| 6 | edu_ord | -0.011403 | 0.28073 |
| 3 | salary_to_dept_median | 0.010391 | 0.325644 |
| 2 | salary | 0.004719 | 0.655305 |
| 1 | tenure_years | 0.000953 | 0.928137 |
| 4 | overtime_flag | 0.000369 | 0.972121 |



All features show near‑zero correlation with attrition (|r| < 0.02), and none is significant at the 5% level.  
This indicates that static HR attributes alone contain little predictive signal, at least of a linear kind.
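
For reference, the point-biserial coefficient reported above is the Pearson correlation computed with a 0/1 variable. A minimal sketch on made-up numbers (the values and function names below are illustrative, not taken from the dataset) shows that the two formulas agree:

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def point_biserial(labels, values):
    """r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population std."""
    ones = [v for lab, v in zip(labels, values) if lab == 1]
    zeros = [v for lab, v in zip(labels, values) if lab == 0]
    n1, n0, n = len(ones), len(zeros), len(values)
    return (mean(ones) - mean(zeros)) / pstdev(values) * sqrt(n1 * n0 / n**2)


# Toy data: 1 = terminated, 0 = active; values stand in for any numeric feature.
labels = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0]
values = [31, 45, 28, 39, 52, 41, 36, 47, 33, 29]

r_pearson = pearson(labels, values)
r_pb = point_biserial(labels, values)
print(round(r_pearson, 6), round(r_pb, 6))
```

Because it is a Pearson correlation, the coefficient only captures linear association; a feature with a non-linear relationship to attrition could still score near zero here.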


## 4. Baseline Model — Logistic Regression

In [5]:

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,stratify=y,random_state=42)

log_pipe=Pipeline([
    ("scaler",StandardScaler()),
    ("clf",LogisticRegression(class_weight="balanced",max_iter=1000))
])

log_pipe.fit(X_train,y_train)
preds=log_pipe.predict(X_test)

print(classification_report(y_test,preds))


              precision    recall  f1-score   support

           0       0.88      0.50      0.63      1988
           1       0.10      0.45      0.16       250

    accuracy                           0.49      2238
   macro avg       0.49      0.47      0.40      2238
weighted avg       0.79      0.49      0.58      2238




On the terminated class, logistic regression reaches moderate recall (0.45) but very low precision (0.10), so roughly nine in ten employees it flags are false positives. Overall accuracy is 0.49.
This makes it unsuitable for operational HR use.
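
To make the false-positive burden concrete, the counts behind the report can be approximated from its rounded figures. This is a back-of-the-envelope sketch: the recalls are only reported to two decimals, so the counts are estimates rather than the notebook's actual confusion matrix.

```python
# Figures read off the classification report above (rounded to two decimals).
support_active, support_terminated = 1988, 250
recall_active, recall_terminated = 0.50, 0.45

# Approximate confusion-matrix cells implied by those recalls.
true_pos = round(recall_terminated * support_terminated)  # leavers correctly flagged
false_neg = support_terminated - true_pos                 # leavers missed
true_neg = round(recall_active * support_active)          # active staff correctly cleared
false_pos = support_active - true_neg                     # active staff wrongly flagged

flagged = true_pos + false_pos
precision = true_pos / flagged
accuracy = (true_pos + true_neg) / (support_active + support_terminated)

print(f"flagged={flagged}, of which leavers={true_pos}")
print(f"precision~{precision:.2f}, accuracy~{accuracy:.2f}")
```

The estimates reproduce the reported precision and accuracy, and they show that the model flags about half of the test set in order to catch fewer than half of the leavers.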


## 5. Random Forest + Probability Calibration

In [6]:

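# Class-weighted Random Forest
rf = RandomForestClassifier(class_weight="balanced", n_estimators=300, random_state=42)
# Standalone fit; the calibrator below refits clones of rf on each CV fold
rf.fit(X_train, y_train)

# Sigmoid (Platt) calibration with 5-fold cross-validation on the training set
cal = CalibratedClassifierCV(rf, method="sigmoid", cv=5)
cal.fit(X_train, y_train)

# Calibrated probability of termination, thresholded at 0.5
proba = cal.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

print(classification_report(y_test, pred))
print("ROC AUC:", roc_auc_score(y_test, proba))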


              precision    recall  f1-score   support

           0       0.89      1.00      0.94      1988
           1       0.00      0.00      0.00       250

    accuracy                           0.89      2238
   macro avg       0.44      0.50      0.47      2238
weighted avg       0.79      0.89      0.84      2238

ROC AUC: 0.442169014084507





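Despite class weighting and calibration, the Random Forest collapses to majority‑class prediction at the 0.5 threshold: precision and recall on the terminated class are both 0.00.

A ROC AUC of 0.44 is below the 0.5 chance level, so the model's ranking of employees is no better than random on the test set.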
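
ROC AUC can be read as the probability that a randomly chosen leaver receives a higher score than a randomly chosen active employee, with ties counted as half. A minimal pure-Python sketch of that definition (the function name `pairwise_auc` and the toy scores are illustrative assumptions, not notebook output):

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 0, 1, 1]

# Identical scores carry no ranking information.
constant = pairwise_auc(y_true, [0.11] * 6)
# Leavers receive the lowest scores.
inverted = pairwise_auc(y_true, [0.30, 0.25, 0.20, 0.15, 0.10, 0.05])
# Leavers receive the highest scores.
perfect = pairwise_auc(y_true, [0.05, 0.10, 0.15, 0.20, 0.25, 0.30])
print(constant, inverted, perfect)
```

On this scale, 0.5 corresponds to uninformative scores, so a held-out value of 0.44 sits slightly on the wrong side of chance.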


## 6. Demonstration Risk Scoring

In [7]:

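# Score every employee, including rows the model was trained on
df["attrition_probability"] = cal.predict_proba(X)[:, 1]
# Convert probabilities to a 0-100 percentile rank across the workforce
df["risk_percentile"] = (df["attrition_probability"].rank(pct=True) * 100).round(2)

# Ten highest-ranked employees
df[["employee_id", "department", "tenure_years", "salary", "risk_percentile"]].sort_values("risk_percentile", ascending=False).head(10)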


|  | employee_id | department | tenure_years | salary | risk_percentile |
|---|---|---|---|---|---|
| 952 | 00-18653290 | Customer Service | 8.32 | 345114 | 100.0 |
| 584 | 00-40540334 | Customer Service | 3.84 | 337706 | 99.99 |
| 174 | 00-12735359 | IT | 5.06 | 480585 | 99.98 |
| 430 | 00-54915875 | Marketing | 9.36 | 425353 | 99.97 |
| 69 | 00-64547971 | Customer Service | 6.57 | 362463 | 99.96 |
| 840 | 00-43226234 | IT | 3.12 | 619618 | 99.94 |
| 243 | 00-88561612 | Operations | 6.27 | 346062 | 99.93 |
| 460 | 00-70549373 | Sales | 2.51 | 634663 | 99.92 |
| 951 | 00-88667527 | Customer Service | 8.84 | 344853 | 99.91 |
| 929 | 00-47557512 | HR | 9.25 | 421874 | 99.9 |



These risk scores are illustrative only and should NOT be used for decision making: the model that produces them ranks employees no better than chance on the test set (ROC AUC 0.44).
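
One reason a percentile table can look more decisive than it is: percentile ranks are spread evenly up to 100 by construction, however close the underlying probabilities are. The sketch below mimics average-rank percentiles in plain Python. The helper `pct_rank` and the probabilities are made up for illustration and are not the notebook's values:

```python
def pct_rank(values):
    """Percentile rank with ties sharing their average rank, scaled to 0-100."""
    n = len(values)
    ranks = []
    for v in values:
        below = sum(1 for other in values if other < v)
        equal = sum(1 for other in values if other == v)
        avg_rank = below + (equal + 1) / 2  # 1-based average rank among ties
        ranks.append(round(avg_rank * 100 / n, 2))
    return ranks


# Hypothetical calibrated probabilities that differ only in the third decimal.
probs = [0.110, 0.111, 0.112, 0.113, 0.114]
percentiles = pct_rank(probs)
print(percentiles)
```

In this toy case the employee at the top of the ranking is only 0.4 percentage points more likely to leave than the one at the bottom, which is the sense in which a ranking can look confident without being informative.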



# Report - An Executive Summary

## Objective
Demonstrate a professional CRISP‑DM workflow for HR attrition modeling and identify whether static HR attributes can support churn prediction.

## Key Findings

- Individual features show near‑zero correlation with attrition (|r| < 0.02)  
- Logistic regression reaches 0.45 recall on leavers only by flagging about half of the test set (precision 0.10)  
- Random Forest fails entirely: zero recall on leavers and a ROC AUC of 0.44, below chance  
- Class imbalance handling cannot compensate for missing signal  

## Core Insight

In this dataset, attrition cannot be reliably predicted using demographic and snapshot HR variables alone.

This is not a modeling failure but a data limitation.

## Why the Model Fails

The dataset lacks behavioral drivers such as:

- Performance trends over time  
- Promotion velocity  
- Manager changes  
- Engagement survey results  
- Absence patterns  

Without such signals, the models have little to learn from, and performance stays near chance.

## Business Implications

Organizations attempting predictive retention must first invest in richer workforce telemetry.

Static HR attributes are suitable for descriptive analytics, not predictive intervention.

## Recommendations

1. Collect longitudinal behavioral data  
2. Track promotion and manager transitions  
3. Introduce engagement measurements  
4. Evaluate with precision and recall on the attrition class, not accuracy  
5. Treat early risk scores as exploratory signals only  
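
Recommendation 4 follows directly from the test-set class balance. A short worked example using the supports reported earlier (1988 active, 250 terminated) shows what a model that never predicts attrition would score:

```python
# Test-set supports from the classification reports.
n_active, n_terminated = 1988, 250

# A degenerate "model" that labels every employee as active.
true_neg, false_pos = n_active, 0
false_neg, true_pos = n_terminated, 0

accuracy = (true_pos + true_neg) / (n_active + n_terminated)
recall_terminated = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.2f}, recall on leavers={recall_terminated:.2f}")
```

These are the same headline numbers the calibrated Random Forest produced: 0.89 accuracy with zero recall on the terminated class.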

## Conclusion

This case study demonstrates responsible analytics:

- Proper CRISP‑DM methodology  
- Transparent modeling limitations  
- Ethical interpretation of results  
- Actionable guidance for future data strategy  

The most valuable outcome is identifying what data is missing — not forcing a weak model into production.
