# Employee Attrition Prediction: Final Notebook

## Objective
This notebook walks through the end-to-end process of predicting employee attrition with a Random Forest model. It loads the preprocessed data, evaluates the saved best-performing model on a held-out test set, and extracts actionable insights.

## Workflow
1. Data Preparation: Load and preprocess the dataset.
2. Model Reloading: Load the saved best Random Forest model.
3. Evaluation: Assess the model's performance on the test set.
4. Interpretation: Analyze feature importance and derive insights.
5. Conclusion: Summarize findings and discuss business implications.

In [5]:
# Import necessary libraries
import pandas as pd
import joblib
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay

# Display settings
plt.style.use('seaborn-v0_8-darkgrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("Environment set up successfully.")

Environment set up successfully.


## Data Preparation

In this section, we prepare the dataset for modeling by performing the following steps:
1. Load the pre-engineered dataset (`engineered_dataset.csv`).
2. Inspect the dataset for structure, missing values, and class imbalance.
3. Encode categorical features using one-hot encoding.
4. Split the dataset into training and testing sets.
5. Address the class imbalance in the target variable by applying SMOTE to the training set only, so the test set keeps its original class distribution.
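
Step 2 can be made concrete with a quick class-balance check. The sketch below is a minimal illustration on stand-in labels built from the training-set counts printed later in this notebook (986 zeros, 190 ones). With the real data, the same calls would presumably be applied to `data['Attrition']`.

```python
import pandas as pd

# Stand-in labels using the training-set counts printed later in this notebook
# (986 zeros, 190 ones); illustrative only, not the real rows.
labels = pd.Series([0] * 986 + [1] * 190, name="Attrition")

# Absolute counts and relative frequencies per class
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Share of the positive (minority) class
minority_share = shares.loc[1]
print(counts)
print(f"Minority share: {minority_share:.1%}")
```

A minority share of roughly 16% is the reason a resampling step such as SMOTE is considered at all.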

In [7]:
# Load the dataset
data = pd.read_csv('../data/engineered_dataset.csv')

# Display basic information
print("Dataset Shape:", data.shape)
print("\nFirst Five Rows:")
print(data.head())

Dataset Shape: (1470, 36)

First Five Rows:
   Age  Attrition     BusinessTravel  DailyRate              Department  DistanceFromHome  Education EducationField  EnvironmentSatisfaction  Gender  HourlyRate  JobInvolvement  JobLevel                JobRole  JobSatisfaction MaritalStatus  MonthlyIncome  MonthlyRate  NumCompaniesWorked OverTime  PercentSalaryHike  PerformanceRating  RelationshipSatisfaction  StockOptionLevel  TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager  RoleStability  OverTime_Binary  OT_WorkLifeImpact  SeniorityImpact  SatisfactionBalance
0   41          1      Travel_Rarely       1102                   Sales                 1          2  Life Sciences                        2  Female          94               3         2        Sales Executive                4        Single           5993        19479                   8      Yes                 11                  3          

In [8]:
# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())


Missing Values:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
RoleStability               0
OverTime_Binary        

In [9]:
# Separate features and target variable
target_column = 'Attrition'  # Target column (already encoded as 0/1)
X = data.drop(columns=[target_column])
y = data[target_column]

print("\nFeatures Shape:", X.shape)
print("Target Shape:", y.shape)


Features Shape: (1470, 35)
Target Shape: (1470,)


In [10]:
# Perform one-hot encoding for categorical features
X_encoded = pd.get_dummies(X, drop_first=True)
print("\nEncoded Features Shape:", X_encoded.shape)


Encoded Features Shape: (1470, 49)
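
To see what `drop_first=True` does, here is a small sketch on a hypothetical three-row frame (the values are made up; only the column names echo the dataset). For each categorical column, the first level in sorted order is dropped and becomes the implicit baseline, which is why the feature count grows from 35 to 49 rather than by one column per category level.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "OverTime": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research & Development", "Human Resources"],
})

# Numeric columns pass through unchanged; each categorical column with
# k levels is expanded into k - 1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Here `OverTime_No` and `Department_Human Resources` are the dropped baselines.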


In [11]:
from sklearn.model_selection import train_test_split

# Split data into training (80%) and testing (20%) sets; note that this first split is not stratified
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

print("\nTraining Set Shape:", X_train.shape)
print("Testing Set Shape:", X_test.shape)


Training Set Shape: (1176, 49)
Testing Set Shape: (294, 49)
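
The split above does not pass `stratify`, so the attrition rate in the test set can drift from the rate in the full data; the SMOTE cell later in this notebook repeats the split with `stratify=y`. A minimal sketch with hypothetical labels (`X_toy` and `y_toy` are made up for illustration) compares the two options:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 1000 samples, 16% positive
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# Stratified split: class proportions are preserved in both parts
_, _, y_train_strat, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Plain split: proportions depend on the random shuffle
_, _, y_train_plain, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("Stratified positive rate (train, test):", y_train_strat.mean(), y_test_strat.mean())
print("Plain positive rate (train, test):     ", y_train_plain.mean(), y_test_plain.mean())
```

With a minority class this small, a stratified test set gives more stable recall and ROC AUC estimates.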


In [12]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Define features and target
target_column = 'Attrition'
X = data.drop(columns=[target_column])
y = data[target_column]

# Encode categorical features
X_encoded = pd.get_dummies(X, drop_first=True)

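# Re-split with stratify=y so both sets keep the original attrition rate (this supersedes the earlier split)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42, stratify=y)

# Apply SMOTE to the training set only; the test set keeps its real class distribution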
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Display resampled class distribution
print("Original class distribution:\n", y_train.value_counts())
print("\nResampled class distribution:\n", pd.Series(y_train_resampled).value_counts())

Original class distribution:
 Attrition
0    986
1    190
Name: count, dtype: int64

Resampled class distribution:
 Attrition
0    986
1    986
Name: count, dtype: int64
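
For intuition about where the additional minority rows come from (986 minus 190, i.e. 796 synthetic samples): SMOTE builds each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below is a deliberately simplified NumPy illustration of that interpolation step on made-up 2-D points. It is not the `imblearn` implementation, and the helper name `interpolate_sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in 2-D (illustrative only)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def interpolate_sample(points, index, k, rng):
    """Create one synthetic point between points[index] and one of its k nearest neighbours."""
    base = points[index]
    # Euclidean distance from the base point to every minority point
    distances = np.linalg.norm(points - base, axis=1)
    # Sort by distance; position 0 is the base point itself, so skip it
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = points[rng.choice(neighbour_ids)]
    # Pick a random position on the segment from base to neighbour
    lam = rng.random()
    return base + lam * (neighbour - base)

synthetic = interpolate_sample(minority, index=0, k=2, rng=rng)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, so resampling fills in the minority region rather than duplicating rows.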




## Model Reloading

In this step, we load the pre-trained best-performing Random Forest model from the previous notebook. This ensures consistency in evaluation and avoids retraining.

The model was trained with the following best parameters:
- `n_estimators`: 300
- `max_depth`: 20
- `min_samples_split`: 2 (scikit-learn default)
- `min_samples_leaf`: 1 (scikit-learn default)
- `class_weight`: 'balanced_subsample'
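
How the previous notebook saved the model is not shown here. As a rough sketch, assuming it used `joblib.dump`, the round trip might look like the following. The data is synthetic, the file name `demo_rf_model.joblib` is hypothetical, and `random_state=42` is taken from the printed model representation further down.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (the real notebook uses the attrition dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, weights=[0.84, 0.16], random_state=42
)

# Hyperparameters as listed above
model = RandomForestClassifier(
    n_estimators=300,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    class_weight="balanced_subsample",
    random_state=42,
)
model.fit(X_demo, y_demo)

# Save and reload through a temporary file (not the notebook's real artifact)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_rf_model.joblib")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded estimator should reproduce the original predictions exactly
same_predictions = bool((reloaded.predict(X_demo) == model.predict(X_demo)).all())
print("Reloaded model reproduces predictions:", same_predictions)
```

Loading a joblib file generally requires a scikit-learn version compatible with the one used for saving, so it is worth recording the version alongside the artifact.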

In [13]:
import joblib

# Load the saved best model
model_filename = 'best_rf_model.joblib'
best_rf_model = joblib.load(model_filename)

print(f"Model loaded successfully from {model_filename}.")

Model loaded successfully from best_rf_model.joblib.


In [14]:
# Print model parameters to confirm correct loading
print("\nModel Parameters:")
print(best_rf_model)


Model Parameters:
RandomForestClassifier(class_weight='balanced_subsample', max_depth=20,
                       n_estimators=300, random_state=42)
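
The printed representation lists only parameters that differ from scikit-learn's defaults, which is why `min_samples_split` and `min_samples_leaf` do not appear. A more explicit check uses `get_params()`. The sketch below runs on a stand-in estimator configured like the printed one; in this notebook the same comparison would presumably be made on `best_rf_model`.

```python
from sklearn.ensemble import RandomForestClassifier

# Stand-in estimator configured like the printed model (not the loaded artifact)
rf = RandomForestClassifier(
    class_weight="balanced_subsample", max_depth=20, n_estimators=300, random_state=42
)

# Best parameters as documented in the Model Reloading section
expected = {
    "n_estimators": 300,
    "max_depth": 20,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "class_weight": "balanced_subsample",
}

# Collect any parameter whose actual value differs from the documented one
params = rf.get_params()
mismatches = {name: params[name] for name, value in expected.items() if params[name] != value}
print("Mismatched parameters:", mismatches)
```

An empty dictionary means all five documented parameters match, including the two that the printed representation hides.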
