# HR Analytics - Employee Attrition Prediction

## 📌 Project Context
Employee attrition is a major concern for organizations because it drives up recruitment and training costs and erodes institutional knowledge. This project analyzes employee data and trains three classifiers (Logistic Regression, Random Forest, and XGBoost) to predict which employees are likely to leave and to identify the key drivers behind those decisions.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model Selection & Evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve

# Advanced Models
try:
    from xgboost import XGBClassifier
except ImportError:
    print("XGBoost not installed. Please install it using 'pip install xgboost'")

import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')

## 1. Data Loading

In [None]:
df = pd.read_excel('../data/hr_analytics.xlsx')
print(f"Dataset Shape: {df.shape}")
df.head()

## 2. Exploratory Data Analysis (EDA)
Understanding the distribution of features and their relationship with the target variable `left`.

In [None]:
# Target Variable Distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='left', data=df, palette='viridis')
plt.title('Distribution of Employee Attrition (0 = Stayed, 1 = Left)')
plt.show()

In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 8))
# numeric_only=True skips the categorical 'Department' and 'salary' columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Heatmap')
plt.show()

In [None]:
# Attrition by Salary Level
plt.figure(figsize=(8, 5))
sns.countplot(x='salary', hue='left', data=df, palette='magma')
plt.title('Attrition vs Salary Level')
plt.show()

In [None]:
# Attrition by Department
plt.figure(figsize=(12, 6))
sns.countplot(y='Department', hue='left', data=df, palette='Set2')
plt.title('Attrition vs Department')
plt.show()

In [None]:
# Satisfaction Level Distribution
plt.figure(figsize=(8, 5))
sns.histplot(x='satisfaction_level', hue='left', data=df, kde=True, palette='Set1')
plt.title('Satisfaction Level vs Attrition')
plt.show()
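
The count plots above show raw counts, which can hide differences in rate when groups differ in size. A minimal sketch of a numeric complement follows: because `left` is coded 0/1, its mean within a group is that group's attrition rate. The `toy` frame and its values are illustrative assumptions, not rows from the HR dataset.

```python
import pandas as pd

# Toy stand-in for the HR dataset (illustrative values, not real data)
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "medium", "medium", "high"],
    "left":   [1,     1,     0,     1,        0,        0],
})

# The mean of a 0/1 target within each group is that group's attrition rate
attrition_by_salary = (
    toy.groupby("salary")["left"]
       .agg(attrition_rate="mean", employees="size")
       .sort_values("attrition_rate", ascending=False)
)
print(attrition_by_salary)
```

The same pattern should apply to the real data by swapping `toy` for `df` and grouping on `'salary'` or `'Department'`.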

## 3. Data Preprocessing
Converting categorical columns to dummy variables and splitting the data into training and testing sets.

In [None]:
# Convert categorical variables into dummy/indicator variables
df_final = pd.get_dummies(df, columns=['Department', 'salary'], drop_first=True)

# Define Features (X) and Target (y)
X = df_final.drop('left', axis=1)
y = df_final['left']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
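
To make the two preprocessing choices above concrete, here is a small sketch on a toy frame (illustrative values only, not the HR data). It shows that `drop_first=True` removes one dummy column per categorical feature, and that `stratify` keeps the share of leavers the same in the training and test sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 20 rows, 5 of which have left == 1 (illustrative values only)
toy = pd.DataFrame({
    "satisfaction_level": [0.2, 0.8, 0.5, 0.9] * 5,
    "salary": ["low", "medium", "high", "medium"] * 5,
    "left": [1, 0, 0, 0] * 5,
})

# drop_first=True drops the first category in sorted order ('high'),
# which then acts as the reference level for the remaining dummies
encoded = pd.get_dummies(toy, columns=["salary"], drop_first=True)
print(list(encoded.columns))

X_toy = encoded.drop("left", axis=1)
y_toy = encoded["left"]

# stratify=y_toy preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(f"Attrition rate, train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}")
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for the Logistic Regression baseline; tree-based models are largely indifferent to it.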

## 4. Model Building & Evaluation

### 4.1 Logistic Regression (Baseline)

In [None]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

print("Logistic Regression Evaluation:")
print(classification_report(y_test, lr_pred))

### 4.2 Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

print("Random Forest Evaluation:")
print(classification_report(y_test, rf_pred))

### 4.3 XGBoost Classifier

In [None]:
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)

print("XGBoost Evaluation:")
print(classification_report(y_test, xgb_pred))

## 5. Model Comparison & Feature Importance

In [None]:
# Plotting Feature Importance for the Random Forest Model
feat_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10, 6))
feat_importances.nlargest(10).plot(kind='barh', color='teal')
plt.title('Top 10 Important Features Driving Attrition')
plt.xlabel('Importance Score')
plt.show()

In [None]:
# ROC-AUC Comparison
models = [lr_model, rf_model, xgb_model]
model_names = ['Logistic Regression', 'Random Forest', 'XGBoost']

plt.figure(figsize=(8, 6))
for model, name in zip(models, model_names):
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.show()

## 6. Conclusion
- **Best Model**: The tree-based models (Random Forest and XGBoost) outperformed the baseline Logistic Regression model on the held-out test set, as shown by the classification reports and ROC curves above.
- **Key Drivers**: `satisfaction_level`, `time_spend_company`, and `number_project` are the most influential factors in predicting employee attrition.
- **Business Action**: To reduce attrition, HR should focus on improving satisfaction levels and monitoring workloads (monthly hours and project count) for long-tenured employees.
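
One way to back the model comparison with numbers is to collect the test-set metrics in a single table. The sketch below is an assumption-laden illustration: `summarize_models` is a hypothetical helper that is not part of the notebook above, and the data comes from `make_classification` so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def summarize_models(fitted_models, X_eval, y_eval):
    """Hypothetical helper: one row of evaluation metrics per fitted classifier."""
    rows = []
    for name, model in fitted_models.items():
        pred = model.predict(X_eval)
        prob = model.predict_proba(X_eval)[:, 1]  # probability of class 1 (left)
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_eval, pred),
            "f1_left": f1_score(y_eval, pred),
            "roc_auc": roc_auc_score(y_eval, prob),
        })
    return pd.DataFrame(rows).set_index("model").sort_values("roc_auc", ascending=False)


# Synthetic stand-in data (not the HR dataset) so the sketch is self-contained
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr),
}
summary = summarize_models(demo_models, X_te, y_te)
print(summary.round(3))
```

In this notebook the helper could be called as `summarize_models({'Logistic Regression': lr_model, 'Random Forest': rf_model, 'XGBoost': xgb_model}, X_test, y_test)`.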