# HR Analytics: Predicting Employee Churn

Understanding why employees leave and building a model to predict turnover.

## Problem Statement

Employee turnover is costly. This analysis aims to:
1. Understand factors driving employee churn
2. Build a predictive model to identify at-risk employees
3. Provide actionable insights for HR teams

In [None]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load data
df = pd.read_csv('../data/turnover.csv')
print(f"Dataset shape: {df.shape}")
df.head()

## Data Overview

Let's check the structure and quality of our data.

In [None]:
# basic info
print("Data types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics:")
df.describe()
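
Beyond missing values, two other quick quality checks are duplicated rows and values outside their expected range. The sketch below uses a small made-up frame (`toy` is hypothetical, and the 0-1 range for `satisfaction` is an assumption about this dataset rather than something confirmed above).

```python
import pandas as pd

# Hypothetical rows; the 0-1 range for satisfaction is an assumption about the data
toy = pd.DataFrame({
    "satisfaction": [0.4, 0.9, 0.4, 1.2],
    "churn": [1, 0, 1, 0],
})

# Fully duplicated rows (every column equal to an earlier row)
n_duplicates = int(toy.duplicated().sum())

# Rows whose satisfaction falls outside the assumed [0, 1] range
out_of_range = toy[~toy["satisfaction"].between(0, 1)]

print(f"Duplicate rows: {n_duplicates}")
print(f"Out-of-range satisfaction rows: {len(out_of_range)}")
```

The same two checks could be pointed at `df` once the valid ranges for each column are known.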

In [None]:
# churn distribution
print(f"Churn rate: {df['churn'].mean()*100:.1f}%")
df['churn'].value_counts()
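
If leavers turn out to be a minority, raw accuracy can look strong even for a model that never predicts churn. The sketch below shows that baseline on a made-up label vector (`toy_churn` and its 80/20 split are hypothetical, not taken from `turnover.csv`).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 80 stayers (0) and 20 leavers (1); not from turnover.csv
toy_churn = np.array([0] * 80 + [1] * 20)
toy_features = np.zeros((len(toy_churn), 1))  # the dummy model ignores features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_features, toy_churn)
baseline_pred = baseline.predict(toy_features)

baseline_accuracy = accuracy_score(toy_churn, baseline_pred)
baseline_recall = recall_score(toy_churn, baseline_pred)  # recall on the churn class

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline churn recall: {baseline_recall:.2f}")
```

Any real model should be judged against this kind of baseline, and on recall or AUC as well as accuracy.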

## Exploratory Data Analysis

### Satisfaction vs Churn

In [None]:
# satisfaction and evaluation distributions by churn status
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# satisfaction histogram
axes[0].hist(df.loc[df['churn'] == 0, 'satisfaction'], bins=20, alpha=0.7, label='Stayed')
axes[0].hist(df.loc[df['churn'] == 1, 'satisfaction'], bins=20, alpha=0.7, label='Left')
axes[0].set_xlabel('Satisfaction Level')
axes[0].set_ylabel('Count')
axes[0].set_title('Satisfaction Distribution')
axes[0].legend()

# evaluation histogram
axes[1].hist(df.loc[df['churn'] == 0, 'evaluation'], bins=20, alpha=0.7, label='Stayed')
axes[1].hist(df.loc[df['churn'] == 1, 'evaluation'], bins=20, alpha=0.7, label='Left')
axes[1].set_xlabel('Last Evaluation')
axes[1].set_ylabel('Count')
axes[1].set_title('Evaluation Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# churn by department
dept_churn = df.groupby('department')['churn'].mean().sort_values(ascending=False)
plt.figure(figsize=(10, 5))
dept_churn.plot(kind='bar', color='steelblue')
plt.title('Churn Rate by Department')
plt.xlabel('Department')
plt.ylabel('Churn Rate')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# churn by salary level
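# reindex so the levels print in low -> medium -> high order rather than alphabetically
salary_churn = df.groupby('salary')['churn'].mean().reindex(['low', 'medium', 'high'])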
print("Churn rate by salary:")
print(salary_churn)

# churn by number of projects
project_churn = df.groupby('number_of_projects')['churn'].mean()
plt.figure(figsize=(8, 4))
project_churn.plot(kind='bar', color='coral')
plt.title('Churn Rate by Number of Projects')
plt.xlabel('Number of Projects')
plt.ylabel('Churn Rate')
plt.tight_layout()
plt.show()
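
A group-level churn rate can mislead when a group holds only a handful of employees. As a rough sketch, the rate can be reported next to the group size. The frame below is made up (`toy` and its values are hypothetical); only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical rows; column names mirror the notebook, values are invented
toy = pd.DataFrame({
    "number_of_projects": [2, 2, 3, 3, 3, 4, 7],
    "churn":              [1, 1, 0, 0, 1, 0, 1],
})

# Named aggregation: churn rate and head count per project count
summary = (
    toy.groupby("number_of_projects")["churn"]
       .agg(churn_rate="mean", employees="count")
)
print(summary)
```

A 100% churn rate backed by one or two employees deserves far less weight than the same rate backed by hundreds.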

In [None]:
# correlation heatmap for numeric features
numeric_cols = ['satisfaction', 'evaluation', 'number_of_projects', 
                'average_montly_hours', 'time_spend_company', 'work_accident', 
                'churn', 'promotion']
plt.figure(figsize=(10, 8))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='RdBu_r', center=0)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

## Feature Engineering & Preprocessing

In [None]:
# encode categorical variables
df_encoded = df.copy()

# salary encoding (ordinal)
salary_map = {'low': 0, 'medium': 1, 'high': 2}
df_encoded['salary'] = df_encoded['salary'].map(salary_map)

# department encoding (one-hot)
df_encoded = pd.get_dummies(df_encoded, columns=['department'], drop_first=True)

print(f"Features after encoding: {df_encoded.shape[1]}")
df_encoded.head()
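
To see what the two encodings do, here is a minimal sketch on a four-row frame (the department labels and rows are made up for illustration). It also checks for labels missing from `salary_map`, since `.map()` silently turns those into `NaN`.

```python
import pandas as pd

# Hypothetical rows; column names mirror the notebook, values are invented
toy = pd.DataFrame({
    "salary": ["low", "high", "medium", "low"],
    "department": ["sales", "hr", "technical", "sales"],
})

salary_map = {"low": 0, "medium": 1, "high": 2}
toy["salary"] = toy["salary"].map(salary_map)

# .map() turns any label missing from salary_map into NaN, so count those
unmapped = int(toy["salary"].isna().sum())

# drop_first=True drops the alphabetically first category ('hr') as the baseline
toy_encoded = pd.get_dummies(toy, columns=["department"], drop_first=True)

print(f"Unmapped salary values: {unmapped}")
print(toy_encoded.columns.tolist())
```

With `drop_first=True`, a row with zeros in every department column belongs to the dropped baseline category.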

In [None]:
# prepare features and target
X = df_encoded.drop('churn', axis=1)
y = df_encoded['churn']

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
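
The `stratify=y` argument keeps the churn rate the same in both splits. A small sketch on made-up labels (`toy_y` with 20% positives is hypothetical) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 800 stayers and 200 leavers
toy_y = np.array([0] * 800 + [1] * 200)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, both splits keep the overall positive rate
print(f"Overall: {toy_y.mean():.3f}  train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```

Without stratification, the churn share in a small test set can drift from the overall rate purely by chance.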

## Model Training

### Logistic Regression

In [None]:
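# logistic regression on unscaled features
# max_iter is raised from the default of 100 to give the solver room to converge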
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

print("Logistic Regression Results:")
print(classification_report(y_test, lr_pred))
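
scikit-learn's `LogisticRegression` applies L2 regularization by default, which is sensitive to feature scale. If the features here sit on very different scales (an assumption about this dataset, for example monthly hours next to a satisfaction score), a pipeline with `StandardScaler` is one possible variation. The sketch below uses synthetic data from `make_classification`, not `turnover.csv`, and it is an alternative to try rather than the approach this notebook takes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic mixed feature scales
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
X_demo[:, 0] *= 200.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, then reused on the test split
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_tr, y_tr)

test_accuracy = scaled_lr.score(X_te, y_te)
print(f"Test accuracy with scaling: {test_accuracy:.3f}")
```

Because scaling lives inside the pipeline, passing the pipeline to cross-validation refits the scaler within each fold and avoids leaking test-fold statistics.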

### Random Forest

In [None]:
# random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

print("Random Forest Results:")
print(classification_report(y_test, rf_pred))

## Model Evaluation

In [None]:
# confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# logistic regression cm
cm_lr = confusion_matrix(y_test, lr_pred)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Logistic Regression')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

# random forest cm
cm_rf = confusion_matrix(y_test, rf_pred)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title('Random Forest')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

In [None]:
# ROC curves
fig, ax = plt.subplots(figsize=(8, 6))

# logistic regression ROC
lr_proba = lr.predict_proba(X_test)[:, 1]
fpr_lr, tpr_lr, _ = roc_curve(y_test, lr_proba)
roc_auc_lr = auc(fpr_lr, tpr_lr)
ax.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {roc_auc_lr:.3f})')

# random forest ROC
rf_proba = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_proba)
roc_auc_rf = auc(fpr_rf, tpr_rf)
ax.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {roc_auc_rf:.3f})')

# diagonal line
ax.plot([0, 1], [0, 1], 'k--', label='Random Classifier')

ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves')
ax.legend()
plt.tight_layout()
plt.show()
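
As a sanity check on the AUC computation, the area from `roc_curve` plus `auc` should match `roc_auc_score`. The tiny example below uses four made-up labels and scores so the numbers can be verified by hand. It also computes average precision, a summary of the precision-recall curve that is often reported alongside AUC for imbalanced problems.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, roc_auc_score, roc_curve

# Tiny hand-checkable example (hypothetical labels and scores)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Two routes to the same ROC AUC
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_true, y_score)

# Average precision summarizes the precision-recall curve
avg_precision = average_precision_score(y_true, y_score)

print(f"AUC from curve: {auc_from_curve:.2f}")
print(f"AUC direct:     {auc_direct:.2f}")
print(f"Avg precision:  {avg_precision:.2f}")
```

Here 3 of the 4 positive-negative pairs are ranked correctly, so the AUC is 0.75. Precision is 1 at the first leaver found and 2/3 at the second, which averages to 5/6.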

## Feature Importance

In [None]:
# feature importance from random forest
feature_imp = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_imp['feature'][:10], feature_imp['importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 5 features:")
print(feature_imp.head())
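
The `feature_importances_` attribute is impurity-based and computed from the training data. A possible cross-check is permutation importance on held-out data, which measures how much the score drops when one column is shuffled. The sketch below runs on synthetic data (`X_demo` and the `feature_{idx}` labels are placeholders, not the HR features).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out split 5 times and record the accuracy drop
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

If both methods rank the same features at the top, the ranking is more trustworthy than either one alone.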

## Cross-Validation

In [None]:
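# cross-validation scores
# cross_val_score clones and refits each model per fold, so the fitted lr and rf
# above are not reused; the folds are drawn from the full X, y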
cv_lr = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
cv_rf = cross_val_score(rf, X, y, cv=5, scoring='accuracy')

print("5-Fold Cross-Validation Accuracy:")
print(f"Logistic Regression: {cv_lr.mean():.4f} (+/- {cv_lr.std()*2:.4f})")
print(f"Random Forest:       {cv_rf.mean():.4f} (+/- {cv_rf.std()*2:.4f})")
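
Accuracy alone can hide weak performance on the churn class. One option is `cross_validate`, which scores several metrics in one pass. This is a sketch on synthetic imbalanced data (roughly 80/20, an arbitrary choice), not a rerun of the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (about 80% negatives); not turnover.csv
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Shuffled, stratified folds so each fold keeps the class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "recall", "roc_auc"]

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring=metrics,
)

for metric in metrics:
    values = scores[f"test_{metric}"]
    print(f"{metric:>8}: {values.mean():.4f} (+/- {values.std() * 2:.4f})")
```

Recall on the churn class is often the number HR cares about most, since a missed leaver is usually costlier than a false alarm.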

## Conclusions

### Key Findings

1. **Satisfaction is the strongest predictor** - Employees with low satisfaction scores are much more likely to leave

2. **Workload matters** - Both too few projects (2) and too many (6-7) are associated with higher churn

3. **Time at company** - Employees with 3-5 years of tenure show higher churn, possibly due to unmet career advancement expectations

4. **Low salary = higher risk** - Retention rises with salary level, so the low salary band carries the most churn risk

### Model Performance

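- Random Forest outperforms Logistic Regression, with a higher AUC on the held-out test set
- The model identifies at-risk employees with good precision, that is, a large share of the employees it flags as likely to leave did leave in the test set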
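
One way to make "good precision" concrete for an HR team with limited capacity is precision among the top k highest-risk employees. The helper below, `precision_at_k`, is a hypothetical illustration and not part of scikit-learn; the labels and probabilities are made up.

```python
import numpy as np

def precision_at_k(y_true, y_proba, k):
    """Share of actual leavers among the k highest-scored employees."""
    top_k = np.argsort(y_proba)[::-1][:k]  # indices of the k largest scores
    return float(np.mean(np.asarray(y_true)[top_k]))

# Hypothetical outcomes and predicted churn probabilities
toy_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
toy_proba = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05])

top3_precision = precision_at_k(toy_true, toy_proba, k=3)
print(f"Precision in top 3: {top3_precision:.2f}")
```

On the real test set, the same function could take `y_test` and `rf_proba`, with k set to the number of employees HR can realistically follow up with.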

### Recommendations for HR

1. Monitor satisfaction scores regularly
2. Balance project assignments (a load of 3-5 projects appears to carry the lowest churn risk)
3. Review compensation for long-tenured employees
4. Consider career development programs for 3-5 year employees