# Employee Attrition Prediction using IBM HR Analytics Dataset
This notebook walks through the full pipeline to build a predictive model for employee attrition and derive actionable HR insights using EDA, machine learning, and SHAP explainability.

## Step 1: Load and Preprocess Data
We start by loading the dataset and applying basic preprocessing: dropping constant or identifier columns, binary-encoding the two-valued fields, and one-hot encoding the remaining categorical variables.

In [None]:
import pandas as pd
df = pd.read_csv("/content/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Drop the ID column (EmployeeNumber) and the columns that hold a single constant value
df.drop(columns=["EmployeeNumber", "EmployeeCount", "Over18", "StandardHours"], inplace=True)

# Binary encoding
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df["OverTime"] = df["OverTime"].map({"Yes": 1, "No": 0})

# One-hot encoding for categorical variables
# dtype=int keeps the dummy columns numeric (pandas 2.x defaults to bool)
categorical_cols = ["BusinessTravel", "Department", "EducationField", "JobRole", "MaritalStatus"]
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)
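
The three preprocessing steps can be illustrated on a small frame. This is a minimal sketch, assuming a hypothetical `toy` DataFrame with made-up values rather than the real dataset; it finds constant columns with `nunique()` instead of listing them by hand and shows that `drop_first=True` leaves one fewer dummy column than there are categories.

```python
import pandas as pd

# Hypothetical toy frame standing in for the HR data (values are made up).
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "OverTime": ["Yes", "No", "No", "Yes"],
    "Department": ["Sales", "Research & Development", "Human Resources", "Sales"],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy = toy.drop(columns=constant_cols)

# Binary encoding: map the two labels to 1/0.
toy["OverTime"] = toy["OverTime"].map({"Yes": 1, "No": 0})

# One-hot encoding: with drop_first=True, k categories become k - 1 columns.
# The first category in sorted order is dropped and acts as the baseline.
encoded = pd.get_dummies(toy, columns=["Department"], drop_first=True, dtype=int)

print(constant_cols)
print(encoded.columns.tolist())
```

Dropping the first level avoids perfectly collinear dummy columns, which matters mainly for the logistic regression model.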


## Step 2: Exploratory Data Analysis (EDA)
We analyze the class distribution and feature correlations to identify which variables are most strongly associated with attrition.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Attrition class distribution
sns.countplot(x="Attrition", data=df)
plt.title("Attrition Distribution")
plt.show()

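# Correlation with target (drop the target's own correlation of 1.0)
corr = df.corr()["Attrition"].drop("Attrition").sort_values(ascending=False)
print("Features most positively correlated with Attrition:\n", corr.head(10))
print("Features most negatively correlated with Attrition:\n", corr.tail(10))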
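
Correlations can also be ranked by magnitude so that positive and negative relationships appear in one list. The following is a minimal sketch on a hypothetical `toy` frame (invented values, not results from the real data) that also reports the class shares behind the count plot.

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented purely for illustration.
toy = pd.DataFrame({
    "Attrition":     [1, 0, 0, 1, 0, 0, 1, 0],
    "OverTime":      [1, 0, 0, 1, 0, 1, 1, 0],
    "MonthlyIncome": [2000, 6000, 5500, 2500, 7000, 4000, 3000, 6500],
})

# Share of each class; a strong imbalance means accuracy alone is misleading.
class_share = toy["Attrition"].value_counts(normalize=True)

# Pearson correlation with the 0/1 target, without the target's own 1.0 entry.
corr = toy.corr()["Attrition"].drop("Attrition")

# Order by magnitude so strong negative relationships are not hidden.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(class_share)
print(ranked)
```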

## Step 3: Train-Test Split and Modeling
We train Random Forest and Logistic Regression models after stratified train-test splitting.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = df.drop("Attrition", axis=1)
y = df["Attrition"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
print("Random Forest Classification Report:")
print(classification_report(y_test, rf_model.predict(X_test)))

# Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, lr_model.predict(X_test)))
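
Attrition data is typically imbalanced, and logistic regression is sensitive to feature scale. One possible refinement, sketched here on synthetic data from `make_classification` (an assumption, not the notebook's dataset), is to scale features inside a pipeline and weight the classes. The names `balanced_lr`, `X_demo`, and `y_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the HR data (roughly 15% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training set only when the pipeline is fitted,
# which avoids leaking test-set statistics into the model.
balanced_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
balanced_lr.fit(X_tr, y_tr)

# Minority-class recall and ROC AUC say more than accuracy on imbalanced data.
proba = balanced_lr.predict_proba(X_te)[:, 1]
recall_pos = recall_score(y_te, balanced_lr.predict(X_te))
auc = roc_auc_score(y_te, proba)
print(f"Recall (positive class): {recall_pos:.2f}, ROC AUC: {auc:.2f}")
```

The same `class_weight="balanced"` option is also accepted by `RandomForestClassifier`, so both models could be compared on equal terms.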

## Step 4: Model Explainability with SHAP
We use SHAP to understand which features contribute most to predictions.

In [None]:
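import shap

# X_test serves as the background data for the explainer
explainer = shap.Explainer(rf_model, X_test)
shap_values = explainer(X_test)

# For a binary classifier the explanation holds one slice per class;
# plot the slice for class 1 (Attrition = Yes)
shap.summary_plot(shap_values[:, :, 1], X_test)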
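
A feature ranking from SHAP can be cross-checked with a different, model-agnostic technique. The sketch below uses scikit-learn's permutation importance (not SHAP) on synthetic data; the feature names `f0` to `f5` and the `forest` model are placeholders, not the notebook's fitted model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the feature names f0..f5 are placeholders.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the score drop.
result = permutation_importance(
    forest, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)

# Mean score drop per feature, largest first.
importance = (
    pd.Series(result.importances_mean, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importance)
```

If the top features from both methods broadly agree, the ranking is less likely to be an artifact of a single explanation method.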

## Step 5: Actionable Insights
Based on the SHAP feature importances, the following patterns suggest possible actions:
- Working overtime is associated with higher attrition risk → reduce workload or offer flexible shifts.
- Lower income is associated with higher attrition → consider salary adjustments.
- Long commutes may contribute to resignations → offer hybrid/remote options.
- Younger employees with little experience may be more likely to leave → improve onboarding and career paths.
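
Each insight above can be checked directly against the data by comparing attrition rates across groups. This is a minimal sketch on a hypothetical `toy` frame with made-up values; the commute bands and their cut points are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with made-up values, not results from the real dataset.
toy = pd.DataFrame({
    "OverTime":         [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "DistanceFromHome": [2, 25, 18, 4, 1, 9, 29, 3, 12, 7],
    "Attrition":        [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 target within a group is that group's attrition rate.
rate_by_overtime = toy.groupby("OverTime")["Attrition"].mean()

# Bin a continuous feature before comparing rates across ranges.
toy["CommuteBand"] = pd.cut(
    toy["DistanceFromHome"], bins=[0, 10, 20, 30], labels=["near", "mid", "far"]
)
rate_by_commute = toy.groupby("CommuteBand", observed=True)["Attrition"].mean()

print(rate_by_overtime)
print(rate_by_commute)
```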