# 🧠 Employee Attrition & Financial Impact Analysis

This project combines **classification and regression** techniques to:
- Predict whether an employee will leave the company
- Estimate their future salary if they stay
- Calculate the expected financial loss if they leave

---


In [None]:
# 📦 Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    f1_score, roc_auc_score, precision_score, recall_score,
    r2_score, mean_squared_error
)
import matplotlib.pyplot as plt
import seaborn as sns


## 📂 Step 2: Load Dataset
Download the IBM HR Analytics Dataset from [Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset) and upload it below.

In [None]:
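# In Google Colab, upload the CSV; elsewhere, place it in the working directory.
try:
    from google.colab import files
    uploaded = files.upload()  # Upload WA_Fn-UseC_-HR-Employee-Attrition.csv
except ImportError:
    pass  # Not running in Colab: the file is read from the working directory

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()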


## 🔧 Step 3: Preprocessing

In [None]:
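# Encode every categorical (object) column as integers and keep the
# fitted encoders so the codes can be mapped back to labels later.
# Attrition becomes 0 = "No", 1 = "Yes".
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Separate the features from the attrition target
X = df.drop(['Attrition'], axis=1)
y = df['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it everywhere
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_scaled = scaler.transform(X)  # full dataset, used in later steps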
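
The IBM dataset includes a few columns that carry no predictive signal: `EmployeeCount`, `Over18`, and `StandardHours` each hold a single value, and `EmployeeNumber` is a row identifier. A minimal sketch of removing such columns before encoding follows. The helper name `drop_uninformative` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def drop_uninformative(frame, id_cols=("EmployeeNumber",)):
    """Return a copy without single-valued columns and listed ID columns."""
    constant = [c for c in frame.columns if frame[c].nunique(dropna=False) <= 1]
    ids = [c for c in id_cols if c in frame.columns and c not in constant]
    return frame.drop(columns=constant + ids)

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3],
    "StandardHours": [80, 80, 80],
    "MonthlyIncome": [5000, 4200, 6100],
})
cleaned = drop_uninformative(toy)
print(list(cleaned.columns))
```

In the notebook this would be applied to `df` before the encoding loop, so the dropped columns never reach the scaler or the models.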


## 🎯 Step 4: Classification - Predicting Employee Attrition

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print("F1 Score:", f1_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_proba))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))

# 5-fold cross-validation on the full dataset
# (X_scaled was scaled with statistics from the training split)
cv_scores = cross_val_score(clf, X_scaled, y, cv=5, scoring='f1')
print("Cross-Validation F1 Score:", np.mean(cv_scores))
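
The cross-validation above runs on features that were scaled once, using statistics from the training split, so the validation folds are not fully independent of the scaler. One way to keep scaling inside each fold is to wrap the scaler and classifier in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the HR features; the sample count and class weights are assumptions chosen only to mimic an imbalanced target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the encoded HR features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# The scaler is refit on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=folds, scoring="f1")
print("Mean F1 across folds:", demo_scores.mean())
```

With the real data, the unscaled `X` and `y` from Step 3 would replace `X_demo` and `y_demo`.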


## 💸 Step 5: Simulating Future Salary

In [None]:
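# Simulated raise: 10% for a performance rating of 4, 5% otherwise
df['Increment'] = df['PerformanceRating'].apply(lambda x: 1.10 if x == 4 else 1.05)
# FutureSalary stays in the same unit as MonthlyIncome (dollars per month)
df['FutureSalary'] = df['MonthlyIncome'] * df['Increment']
df[['PerformanceRating', 'MonthlyIncome', 'FutureSalary']].head()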
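
The same raise rule can be written without a per-row `lambda`. The sketch below applies the 10% / 5% rule with `np.where` on a toy frame; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the HR data (values are made up)
demo = pd.DataFrame({
    "PerformanceRating": [3, 4, 3],
    "MonthlyIncome": [4000, 5000, 6000],
})

# Rating 4 earns a 10% raise, every other rating earns 5%
demo["Increment"] = np.where(demo["PerformanceRating"] == 4, 1.10, 1.05)
demo["FutureSalary"] = demo["MonthlyIncome"] * demo["Increment"]
print(demo)
```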


## 🧮 Step 6: Identify Likely-to-Stay Employees

In [None]:
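# Attrition probability for every employee, from the Step 4 classifier
P_leave = clf.predict_proba(X_scaled)[:, 1]
P_stay = 1 - P_leave
# Row positions of employees with more than a 60% chance of staying
likely_to_stay_idx = np.where(P_stay > 0.6)[0]

X_stay = X.iloc[likely_to_stay_idx]
# Index the target by position as well, so rows stay aligned with X_stay
y_salary = df['FutureSalary'].iloc[likely_to_stay_idx]

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_stay, y_salary, test_size=0.2, random_state=42
)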


## 📈 Step 7: Regression - Predicting Future Salaries

In [None]:
models = {
    "Random Forest": RandomForestRegressor(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "SVR": SVR()
}

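# Fit each regressor on the likely-to-stay subset and score it on held-out rows.
# The features still include MonthlyIncome, from which FutureSalary is derived,
# so high R2 scores here largely reflect the simulation rule from Step 5.
for name, model in models.items():
    model.fit(X_train_reg, y_train_reg)
    preds = model.predict(X_test_reg)
    r2 = r2_score(y_test_reg, preds)
    rmse = np.sqrt(mean_squared_error(y_test_reg, preds))
    print(f"{name} → R2 Score: {r2:.6f} | RMSE: {rmse:.2f}")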
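
Because `FutureSalary` is computed directly from `MonthlyIncome` and `PerformanceRating`, a model that sees those columns is largely relearning the simulation rule. The sketch below illustrates this on synthetic data by fitting `Ridge` with and without the income column. The `YearsAtCompany` values, the value ranges, and the helper `holdout_r2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
sim = pd.DataFrame({
    "MonthlyIncome": rng.integers(1000, 20000, size=n),
    "PerformanceRating": rng.choice([3, 4], size=n),
    "YearsAtCompany": rng.integers(0, 30, size=n),  # unrelated to income here
})
# Same rule as Step 5: 10% raise for rating 4, 5% otherwise
increment = np.where(sim["PerformanceRating"] == 4, 1.10, 1.05)
target = sim["MonthlyIncome"] * increment

def holdout_r2(features):
    """Fit Ridge on a train split and return R2 on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.2, random_state=42
    )
    return r2_score(y_te, Ridge().fit(X_tr, y_tr).predict(X_te))

r2_with_income = holdout_r2(sim)
r2_without_income = holdout_r2(sim.drop(columns=["MonthlyIncome"]))
print(f"With MonthlyIncome: {r2_with_income:.3f}")
print(f"Without MonthlyIncome: {r2_without_income:.3f}")
```

If the goal is to learn salary from other employee attributes, dropping `MonthlyIncome` from the regression features gives a more informative score.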


## 📉 Step 8: Estimate Financial Loss Due to Attrition

In [None]:
# Expected loss per employee = P(leave) x simulated future monthly salary
expected_loss = P_leave * df['FutureSalary']
df['ExpectedLoss'] = expected_loss
total_expected_loss = df['ExpectedLoss'].sum()
print(f"💰 Total Expected Financial Loss: ${total_expected_loss:,.2f}")
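
The expected loss is a probability-weighted salary: each employee contributes their attrition probability times their future salary, and the total is the sum over employees. A small worked example with made-up numbers:

```python
import numpy as np

# Made-up attrition probabilities and monthly future salaries
p_leave_demo = np.array([0.25, 0.50, 0.75])
future_salary_demo = np.array([4000.0, 6000.0, 2000.0])

# Per-employee expected loss, then the total across employees
loss_demo = p_leave_demo * future_salary_demo
print(loss_demo)        # [1000. 3000. 1500.]
print(loss_demo.sum())  # 5500.0
```

Since `FutureSalary` is derived from `MonthlyIncome`, the resulting total is an expected loss in monthly salary dollars.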


In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['ExpectedLoss'], bins=30, kde=True, color='red')
plt.title("Distribution of Expected Financial Loss")
plt.xlabel("Expected Loss ($)")
plt.ylabel("Number of Employees")
plt.show()

### 📉 Expected Financial Loss Distribution
Financial risk per employee: attrition probability multiplied by simulated future monthly salary.

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['MonthlyIncome'], label='Original Salary', kde=True, color='blue')
sns.histplot(df['FutureSalary'], label='Simulated Future Salary', kde=True, color='green')
plt.title("Salary Before vs Future Simulation")
plt.legend()
plt.show()

### 💸 Salary Distribution: Before vs After Simulation
Visual comparison of current and future salaries.

In [None]:
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend()
plt.grid(True)
plt.show()

### 📌 Feature Importance from Random Forest
Which features most impact salary predictions?
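
A minimal sketch of how such a ranking could be produced from the `feature_importances_` attribute of a fitted `RandomForestRegressor`. It runs on synthetic data from `make_regression`, and the feature names `f0` to `f5` are placeholders. In the notebook, the fitted `models["Random Forest"]` and `X_stay.columns` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with placeholder feature names (assumption)
X_imp, y_imp = make_regression(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
names = [f"f{i}" for i in range(X_imp.shape[1])]

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_imp, y_imp)

# Impurity-based importances, sorted from most to least influential
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 and tend to favor high-cardinality features, so they are best read as a rough ranking.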

In [None]:
# Attrition was label-encoded in Step 3: 0 = No, 1 = Yes
sns.countplot(x='Attrition', data=df)
plt.title("Employee Attrition Distribution")
plt.xlabel("Attrition (0 = No, 1 = Yes)")
plt.show()

### 📊 Attrition Distribution
Understanding how many employees are leaving vs staying.

## ✅ Conclusion

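This project combined classification and regression to analyze employee attrition and its salary impact.
- Classification estimated each employee's probability of leaving.
- Regression modeled the simulated future salary of employees likely to stay.
- Multiplying attrition probability by future salary gave an expected financial loss per employee, which supports better HR planning.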