# 🧠 Employee Attrition Prediction Using Random Forest
This notebook walks through a complete machine learning pipeline that predicts employee attrition from the IBM HR dataset. The goal is to identify which employees are likely to leave and to understand why, using Random Forest, a powerful ensemble model whose feature importances aid interpretation.


## 📌 Step 1: Import Libraries
We'll use standard Python libraries for data analysis, modeling, and visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

## 📁 Step 2: Load and Explore the Dataset
The dataset contains information on current and former employees and whether they left the company (`Attrition`).

In [None]:
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition 2.csv")
df.head()
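
Beyond `df.head()`, it helps to check the frame's size, its missing values, and how imbalanced the target is before cleaning. The following is a minimal sketch that assumes a small made-up frame in place of the CSV; the `summarize` helper and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize(frame: pd.DataFrame, target: str = "Attrition") -> dict:
    """Return basic facts about a DataFrame: size, missing values, target balance."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_values": int(frame.isna().sum().sum()),
        # Share of each class in the target column
        "target_share": frame[target].value_counts(normalize=True).to_dict(),
    }

# Hypothetical stand-in for the real data; column names follow the IBM HR dataset.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Attrition": ["Yes", "No", "No", "No"],
})
summary = summarize(sample)
print(summary)
```

On the real data, the same call would be `summarize(df)`, run before `Attrition` is mapped to 0/1.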

## 🔍 Step 3: Data Cleaning and Preprocessing
- Drop the ID column `EmployeeNumber` and the constant columns `EmployeeCount`, `Over18`, `StandardHours`
- Encode the `Attrition` column as binary (Yes=1, No=0)

In [None]:
df.drop(columns=['EmployeeNumber', 'EmployeeCount', 'Over18', 'StandardHours'], inplace=True)
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
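
The list of columns to drop can also be found programmatically rather than by inspection. The sketch below flags columns with a single distinct value; the `constant_columns` helper and the sample rows are hypothetical, and an ID column such as `EmployeeNumber` still has to be dropped by name because it is unique rather than constant.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """List columns that hold a single distinct value and so carry no signal."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]

# Hypothetical rows; on the real data this check would run on `df` before the drop.
sample = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4],      # unique ID: not constant, dropped by name
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "Age": [41, 49, 37],
})
found = constant_columns(sample)
print(found)
```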

## 🧠 Step 4: One-Hot Encode Categorical Variables
We use `pd.get_dummies()` to convert the categorical variables into the numerical format the model requires. With `drop_first=True`, the first level of each variable is dropped and acts as the baseline.

In [None]:
categorical_cols = df.select_dtypes(include='object').columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
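
To see what `drop_first=True` does, here is a small sketch on a made-up three-row frame (the rows are illustrative; `BusinessTravel` and its levels follow the IBM HR dataset). The alphabetically first level, `Non-Travel`, gets no column of its own and is represented by both dummies being off.

```python
import pandas as pd

# Hypothetical rows with one categorical column
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
})

encoded = pd.get_dummies(sample, columns=["BusinessTravel"], drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
print(encoded)
```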

## 🎯 Step 5: Split Features and Target
- `X` contains all features
- `y` is the target variable (`Attrition`)
- Split into training and testing sets (80/20)

In [None]:
X = df_encoded.drop('Attrition', axis=1)
y = df_encoded['Attrition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
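
When the target is imbalanced, a purely random split can leave the test set with a different share of leavers than the training set. One option, not used in the cell above, is to pass `stratify=y` to `train_test_split`. The sketch below uses a made-up target with 20% positives (`X_demo` and `y_demo` are hypothetical) to show that both splits keep the same class share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20 positives out of 100 rows
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```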

## 🌳 Step 6: Train Random Forest Model
**Why Random Forest?**
- Works with numerical features and, once encoded, categorical ones, with no feature scaling needed
- Reduces overfitting by averaging an ensemble of decision trees
- Gives feature importance for interpretation

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
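
Because each tree in the forest is trained on a bootstrap sample, the rows a tree never saw can serve as a built-in validation set. The sketch below, which assumes synthetic data from `make_classification` rather than the HR dataset, turns this on with `oob_score=True`; `X_demo`, `y_demo`, and `rf_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the encoded HR features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=42
)

# oob_score=True scores each row using only the trees that did not train on it
rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_demo.fit(X_demo, y_demo)

print(f"Trees in the forest: {len(rf_demo.estimators_)}")
print(f"Out-of-bag accuracy: {rf_demo.oob_score_:.3f}")
```

The out-of-bag estimate gives a quick sanity check on generalization without touching the test set.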

## 🧪 Step 7: Evaluate Model Performance
- **Accuracy**: Overall correctness
- **Precision**: Correct 'Yes' predictions out of all predicted 'Yes'
- **Recall**: Correct 'Yes' predictions out of all actual 'Yes'
- **F1-Score**: Harmonic mean of Precision & Recall

In [None]:
y_pred = rf_model.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
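
The four metrics can be worked out by hand from the confusion matrix. The sketch below does so on ten made-up labels (`y_true` and `y_hat` are illustrative, not model output) and checks the results against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = left the company, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
precision = tp / (tp + fp)                   # correct 'Yes' out of predicted 'Yes'
recall = tp / (tp + fn)                      # correct 'Yes' out of actual 'Yes'
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Cross-check against scikit-learn
assert abs(precision - precision_score(y_true, y_hat)) < 1e-9
assert abs(recall - recall_score(y_true, y_hat)) < 1e-9
assert abs(f1 - f1_score(y_true, y_hat)) < 1e-9
```

Here accuracy is 0.7 even though only half of the actual leavers are caught, which is why recall on the 'Yes' class deserves separate attention.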

## 📊 Step 8: Feature Importance
**Why?**
- Understand which features impact attrition most
- Provide actionable insights to HR
- Guide feature selection for future models

In [None]:
importances = rf_model.feature_importances_
feature_names = X.columns
feat_importance = pd.Series(importances, index=feature_names).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
feat_importance.head(10).plot(kind='barh', color='teal')
plt.gca().invert_yaxis()
plt.title("Top 10 Important Features for Predicting Attrition")
plt.xlabel("Feature Importance")
plt.tight_layout()
plt.show()
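
`feature_importances_` is impurity-based and computed from the training data, so it can overstate features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below is not part of the original notebook and assumes synthetic data, with hypothetical names such as `feature_0` and `rf_demo`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded HR features
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the drop in accuracy
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)
perm_importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importance)
```

Features that rank high under both measures are the safer ones to report to HR.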

## ✅ Conclusion & Next Steps
**Key Findings:**
- `MonthlyIncome`, `OverTime`, `Age`, and `TotalWorkingYears` rank among the strongest predictors by feature importance.
- The Random Forest reached ~84% accuracy on the test set, with balanced precision and recall.

**Next Steps:**
- Use SHAP for deeper model explainability
- Try other models like XGBoost, LightGBM
- Deploy as a dashboard using Flask or Streamlit