# Employee Attrition Prediction Using Ensemble Methods (XGBoost + LightGBM)

This notebook demonstrates how to predict employee attrition with a soft-voting ensemble of XGBoost and LightGBM models. The dataset is imbalanced (roughly one in five employees in the held-out test split left the company), so we remove Tomek links and oversample the minority class with ADASYN before fitting. Hyperparameters for both models are tuned with Optuna.

## 1. Import Libraries
First, we import all the necessary libraries for data processing, modeling, and evaluation.

In [24]:
import kagglehub
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, auc
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import ADASYN
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import VotingClassifier
import optuna
import warnings
warnings.filterwarnings('ignore')

## 2. Load the Dataset
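We download the dataset with **kagglehub**, which returns the local directory containing the files. To use a different copy of the data, replace the dataset handle or point `pd.read_csv` at your own file.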

In [5]:
# Load data using kagglehub
path = kagglehub.dataset_download("ziya07/employee-attrition-prediction-dataset")
df = pd.read_csv(f"{path}/employee_attrition_dataset.csv")

# Display the first few rows of the dataset
df.head()

|   | Employee_ID | Age | Gender | Marital_Status | Department | Job_Role | Job_Level | Monthly_Income | Hourly_Rate | Years_at_Company | ... | Overtime | Project_Count | Average_Hours_Worked_Per_Week | Absenteeism | Work_Environment_Satisfaction | Relationship_with_Manager | Job_Involvement | Distance_From_Home | Number_of_Companies_Worked | Attrition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 58 | Female | Married | IT | Manager | 1 | 15488 | 28 | 15 | ... | No | 6 | 54 | 17 | 4 | 4 | 4 | 20 | 3 | No |
| 1 | 2 | 48 | Female | Married | Sales | Assistant | 5 | 13079 | 28 | 6 | ... | Yes | 2 | 45 | 1 | 4 | 1 | 2 | 25 | 2 | No |
| 2 | 3 | 34 | Male | Married | Marketing | Assistant | 1 | 13744 | 24 | 24 | ... | Yes | 6 | 34 | 2 | 3 | 4 | 4 | 45 | 3 | No |
| 3 | 4 | 27 | Female | Divorced | Marketing | Manager | 1 | 6809 | 26 | 10 | ... | No | 9 | 48 | 18 | 2 | 3 | 1 | 35 | 3 | No |
| 4 | 5 | 40 | Male | Divorced | Marketing | Executive | 1 | 10206 | 52 | 29 | ... | No | 3 | 33 | 0 | 4 | 1 | 3 | 44 | 3 | No |


## 3. Prepare the Data
We preprocess the dataset by dropping the `Employee_ID` identifier, adding four ratio-style engineered features, and splitting the data into features (X) and a binary target (y, with "Yes" mapped to 1 and "No" to 0).

In [7]:
def prepare_data(df):
    """Prepare the dataset with advanced feature engineering."""
    # Drop irrelevant columns
    df = df.drop(columns=['Employee_ID'])
    
    # Feature engineering
    df['Salary_to_Experience'] = df['Monthly_Income'] / (df['Years_at_Company'] + 1)
    df['Overtime_Impact'] = df['Overtime'].map({'Yes': 1, 'No': 0}) * df['Monthly_Income']
    df['Years_At_Company_Ratio'] = df['Years_at_Company'] / (df['Number_of_Companies_Worked'] + 1)
    df['Promotion_Rate'] = df['Years_at_Company'] / (df['Years_Since_Last_Promotion'] + 1)
    
    # Define features and target
    X = df.drop(columns=['Attrition'])
    y = df['Attrition'].map({'Yes': 1, 'No': 0})
    
    return X, y

# Prepare the data
X, y = prepare_data(df)

# Display the shape of the dataset
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Features shape: (1000, 28)
Target shape: (1000,)


## 4. Create a Preprocessor
We create a preprocessor to handle numerical and categorical features separately. Numerical features are scaled, and categorical features are one-hot encoded.

In [9]:
def create_preprocessor(X):
    """Create a preprocessor with separate handling for numerical and categorical features."""
    categorical_cols = X.select_dtypes(include=['object']).columns
    numerical_cols = X.select_dtypes(exclude=['object']).columns
    
    return ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_cols),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
        ])

# Create the preprocessor
preprocessor = create_preprocessor(X)
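
To make the effect of the preprocessor concrete, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration and only the column names are borrowed from the dataset; the column lists are written out by hand rather than detected by dtype. It shows that two numeric columns plus a three-level categorical column become five output columns, and that a category unseen during fitting is encoded as all zeros because of `handle_unknown='ignore'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: values are invented; only the column names come from the dataset.
toy = pd.DataFrame({
    "Age": [25, 40, 55],
    "Monthly_Income": [4000, 9000, 14000],
    "Department": ["IT", "Sales", "Marketing"],
})
numerical_cols = ["Age", "Monthly_Income"]
categorical_cols = ["Department"]

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# 2 scaled numeric columns + 3 one-hot columns
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)

# A department that was not present during fit gets an all-zero one-hot block.
new_row = pd.DataFrame({"Age": [30], "Monthly_Income": [5000], "Department": ["Legal"]})
unseen = toy_preprocessor.transform(new_row)
print(unseen[:, 2:].sum())
```

Tree-based models such as LightGBM and XGBoost do not depend on feature scaling, so the `StandardScaler` step is mainly useful if the same preprocessor is later reused with a scale-sensitive model.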

## 5. Define the Optuna Objective Function
We define an Optuna objective function that jointly tunes the hyperparameters of XGBoost and LightGBM. Each trial builds the full pipeline, evaluates it with 5-fold stratified cross-validation, and returns the mean PR-AUC across folds.

In [11]:
def objective(trial, X, y, preprocessor):
    """Optuna objective function for hyperparameter optimization."""
    # LightGBM parameters
    # (suggest_float replaces suggest_uniform/suggest_loguniform, which are deprecated since Optuna 3.0)
    lgb_params = {
        'n_estimators': trial.suggest_int('lgb_n_estimators', 100, 1000),
        'learning_rate': trial.suggest_float('lgb_learning_rate', 1e-3, 0.1, log=True),
        'num_leaves': trial.suggest_int('lgb_num_leaves', 20, 100),
        'max_depth': trial.suggest_int('lgb_max_depth', 3, 12),
        'min_child_samples': trial.suggest_int('lgb_min_child_samples', 5, 100),
        'subsample': trial.suggest_float('lgb_subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('lgb_colsample_bytree', 0.6, 1.0),
        'class_weight': 'balanced',
        'random_state': 42
    }
    
    # XGBoost parameters
    xgb_params = {
        'n_estimators': trial.suggest_int('xgb_n_estimators', 100, 1000),
        'learning_rate': trial.suggest_float('xgb_learning_rate', 1e-3, 0.1, log=True),
        'max_depth': trial.suggest_int('xgb_max_depth', 3, 12),
        'min_child_weight': trial.suggest_int('xgb_min_child_weight', 1, 7),
        'subsample': trial.suggest_float('xgb_subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('xgb_colsample_bytree', 0.6, 1.0),
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1, 10),
        'random_state': 42
    }
    
    # Create models
    lgb = LGBMClassifier(**lgb_params)
    xgb = XGBClassifier(**xgb_params)
    
    # Create ensemble
    ensemble = VotingClassifier(
        estimators=[
            ('lgb', lgb),
            ('xgb', xgb)
        ],
        voting='soft'
    )
    
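    # Create pipeline: Tomek-link removal followed by ADASYN oversampling.
    # imblearn's Pipeline applies the samplers during fit only, never at predict time.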
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('tomek', TomekLinks()),
        ('adasyn', ADASYN(random_state=42)),
        ('classifier', ensemble)
    ])
    
    # Cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    
    for train_idx, val_idx in cv.split(X, y):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        pipeline.fit(X_train, y_train)
        y_pred_proba = pipeline.predict_proba(X_val)[:, 1]
        
        # Use PR-AUC as the metric for imbalanced classification
        precision, recall, _ = precision_recall_curve(y_val, y_pred_proba)
        pr_auc = auc(recall, precision)
        scores.append(pr_auc)
    
    return np.mean(scores)
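
The PR-AUC computed inside `objective` can be reproduced on a handful of made-up scores. This is a sketch with invented labels and probabilities, not output from the notebook. It also computes `average_precision_score`, a step-wise alternative that avoids the trapezoidal interpolation between curve points, and the no-skill reference value, which for a precision-recall curve is the positive rate rather than 0.5.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.55, 0.6, 0.7, 0.9])

# Same computation as in objective(): trapezoidal area under the PR curve.
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc_trapezoid = auc(recall, precision)

# Step-wise summary of the same curve (no linear interpolation).
avg_precision = average_precision_score(y_true, y_score)

# A random ranking has an expected precision equal to the positive rate.
no_skill = y_true.mean()

print(f"PR-AUC (trapezoid): {pr_auc_trapezoid:.3f}")
print(f"Average precision:  {avg_precision:.3f}")
print(f"No-skill baseline:  {no_skill:.3f}")
```

Comparing a PR-AUC against the positive rate of the evaluation set is a quick way to judge whether a score reflects any real ranking ability.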

## 6. Train and Evaluate the Model
We hold out a stratified 20% test set, run 20 Optuna trials on the remaining 80%, refit the ensemble with the best parameters on the full training split, and evaluate it once on the test set.

In [26]:
def train_and_evaluate_model(X, y):
    """Train the final model and evaluate its performance."""
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # Create preprocessor (only column names and dtypes are read here; fitting happens inside the pipeline)
    preprocessor = create_preprocessor(X_train)
    
    # Optimize hyperparameters
    study = optuna.create_study(direction='maximize')
    study.optimize(lambda trial: objective(trial, X_train, y_train, preprocessor), n_trials=20)
    
    # Get best parameters
    best_params = study.best_params
    
    # Create final model with best parameters
    lgb = LGBMClassifier(
        verbosity=-1,
        n_estimators=best_params['lgb_n_estimators'],
        learning_rate=best_params['lgb_learning_rate'],
        num_leaves=best_params['lgb_num_leaves'],
        max_depth=best_params['lgb_max_depth'],
        min_child_samples=best_params['lgb_min_child_samples'],
        subsample=best_params['lgb_subsample'],
        colsample_bytree=best_params['lgb_colsample_bytree'],
        class_weight='balanced',
        random_state=42
    )
    
    xgb = XGBClassifier(
        n_estimators=best_params['xgb_n_estimators'],
        learning_rate=best_params['xgb_learning_rate'],
        max_depth=best_params['xgb_max_depth'],
        min_child_weight=best_params['xgb_min_child_weight'],
        subsample=best_params['xgb_subsample'],
        colsample_bytree=best_params['xgb_colsample_bytree'],
        scale_pos_weight=best_params['scale_pos_weight'],
        random_state=42
    )
    
    ensemble = VotingClassifier(
        estimators=[
            ('lgb', lgb),
            ('xgb', xgb)
        ],
        voting='soft'
    )
    
    # Create final pipeline
    final_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('tomek', TomekLinks()),
        ('adasyn', ADASYN(random_state=42)),
        ('classifier', ensemble)
    ])
    
    # Train final model
    final_pipeline.fit(X_train, y_train)
    
    # Predictions
    y_pred = final_pipeline.predict(X_test)
    y_pred_proba = final_pipeline.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    pr_auc = auc(recall, precision)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Print results
    print("\nModel Performance:")
    print(f"ROC-AUC: {roc_auc:.3f}")
    print(f"PR-AUC: {pr_auc:.3f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    return final_pipeline

# Train and evaluate the model
model = train_and_evaluate_model(X, y)

[I 2025-01-29 18:27:16,109] A new study created in memory with name: no-name-0fae114d-63f0-46a4-9403-7b5250ac4bd0
[I 2025-01-29 18:27:21,835] Trial 0 finished with value: 0.19499344415236824 and parameters: {'lgb_n_estimators': 994, 'lgb_learning_rate': 0.0010134261690233537, 'lgb_num_leaves': 75, 'lgb_max_depth': 5, 'lgb_min_child_samples': 82, 'lgb_subsample': 0.6102034500232334, 'lgb_colsample_bytree': 0.9068875517204755, 'xgb_n_estimators': 536, 'xgb_learning_rate': 0.019028931063476143, 'xgb_max_depth': 6, 'xgb_min_child_weight': 3, 'xgb_subsample': 0.8647885125787671, 'xgb_colsample_bytree': 0.6283259480693634, 'scale_pos_weight': 4.525858736464565}. Best is trial 0 with value: 0.19499344415236824.
[I 2025-01-29 18:27:25,529] Trial 1 finished with value: 0.1922710128131687 and parameters: {'lgb_n_estimators': 246, 'lgb_learning_rate': 0.005539061102584841, 'lgb_num_leaves': 97, 'lgb_max_depth': 5, 'lgb_min_child_samples': 43, 'lgb_subsample': 0.9593342152953701, 'lgb_colsample_by


Model Performance:
ROC-AUC: 0.462
PR-AUC: 0.180

Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       162
           1       0.24      0.16      0.19        38

    accuracy                           0.74       200
   macro avg       0.53      0.52      0.52       200
weighted avg       0.71      0.74      0.72       200



## 7. Save the Model
Finally, we persist the fitted pipeline (preprocessor, resampling steps, and ensemble) with joblib so it can be reloaded for inference later.

In [15]:
import joblib

# Save the model
joblib.dump(model, 'attrition_ensemble_model.pkl')

['attrition_ensemble_model.pkl']
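
Reloading works the same way in reverse. The sketch below uses a small stand-in scikit-learn pipeline on synthetic data (not the notebook's ensemble) and a temporary directory, so it can run on its own; the names `demo_model` and `demo_model.pkl` are placeholders. It checks that the restored object gives the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline on synthetic data; the notebook's `model` is saved and loaded the same way.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, model_path)   # write the fitted pipeline to disk
    restored = joblib.load(model_path)    # read it back as a new object

same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A pickled pipeline is tied to the library versions it was saved with, so it is worth recording the scikit-learn, imbalanced-learn, LightGBM, and XGBoost versions alongside the file, and pickle files should only be loaded from trusted sources.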

# Dataset Summary
### 1. Dataset Overview
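**Source:** The dataset is downloaded from Kaggle (`ziya07/employee-attrition-prediction-dataset`) and contains 1,000 employee records. It resembles data drawn from HR records or employee surveys, with information about each employee and whether they left the company (attrition).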

**Target Variable:** Attrition (binary: "Yes" or "No").

**Features:** The dataset includes features such as:

- Demographic information (e.g., Age, Gender, Marital_Status).
- Job-related information (e.g., Department, Job_Role, Job_Level).
- Financial information (e.g., Monthly_Income, Hourly_Rate).
- Work history (e.g., Years_at_Company, Years_in_Current_Role).
- Satisfaction metrics (e.g., Job_Satisfaction, Work_Life_Balance).
- Behavioral metrics (e.g., Overtime, Absenteeism).

### 2. Class Imbalance
**Imbalance Ratio:** The dataset is imbalanced: in the stratified test split, 162 of 200 employees (81%) belong to the "No Attrition" class and 38 (19%) to the "Attrition" class.

With roughly four "No Attrition" cases for every "Attrition" case, an unweighted model will naturally favor the majority class.

**Impact on Model Performance:**

- Models trained on imbalanced datasets tend to have poor predictive power for the minority class (attrition).
- Metrics like accuracy can be misleading, as the model may achieve high accuracy by simply predicting the majority class.
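
The point about accuracy can be checked directly with the class counts from the classification report in section 6 (162 "No", 38 "Yes"). The sketch below scores a hypothetical classifier that always predicts "No Attrition"; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the classification report: 162 "No" (0), 38 "Yes" (1).
y_test_like = np.array([0] * 162 + [1] * 38)

# Hypothetical classifier that always predicts the majority class.
always_no = np.zeros_like(y_test_like)

baseline_accuracy = accuracy_score(y_test_like, always_no)
baseline_recall = recall_score(y_test_like, always_no)  # recall for the attrition class

print(f"accuracy={baseline_accuracy:.2f}, attrition recall={baseline_recall:.2f}")
```

This trivial baseline reaches an accuracy of 0.81 while finding no leavers at all, which is higher than the 0.74 accuracy reported for the tuned ensemble. Accuracy alone therefore says little here; recall, precision, and PR-AUC on the attrition class are more informative.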

### 3. Challenges in Building a Strong Predictive Model
#### a. Limited Information for the Minority Class
The small number of attrition cases makes it difficult for the model to learn meaningful patterns for predicting attrition.

Techniques like oversampling (e.g., SMOTE, ADASYN) can help, but they may introduce noise or overfitting.
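
To see where that noise can come from, here is a simplified NumPy sketch of SMOTE-style interpolation. It is not imbalanced-learn's implementation, and ADASYN additionally generates more samples around minority points that have many majority-class neighbours. The function name `smote_like_samples` and the toy points are made up for illustration. Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours, so a single outlier can pull synthetic points into regions where no real minority cases exist.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Interpolate between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample.
    base_idx = rng.integers(0, len(X_minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base point to the neighbour.
    gaps = rng.random((n_new, 1))
    return X_minority[base_idx] + gaps * (X_minority[neigh_idx] - X_minority[base_idx])

# Three clustered minority points and one outlier (values invented).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]])
synthetic = smote_like_samples(minority, n_new=6)
print(synthetic.round(2))
```

Any synthetic point that uses the outlier as base or neighbour lands somewhere between the cluster and `[5.0, 5.0]`. This is one reason to resample only inside the training folds, as the imblearn `Pipeline` in this notebook does.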

#### b. Lack of Predictive Features
The dataset may not contain features that strongly correlate with attrition. For example:

Factors like employee morale, workplace culture, or personal circumstances are often not captured in HR datasets.

Without these features, the model may struggle to identify the root causes of attrition.

#### c. Noisy or Irrelevant Features
Some features in the dataset may be irrelevant or noisy, reducing the model's predictive power.

For example, an identifier such as Employee_ID carries no information about attrition (which is why `prepare_data` drops it), and features like Hourly_Rate may also contribute little.
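
One way to screen for uninformative features is to estimate the mutual information between each feature and the target. The sketch below uses synthetic data, not the attrition dataset: one column is built to track the label, one is a running identifier, and one is pure noise. The column construction and variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 500
y_demo = rng.integers(0, 2, size=n)

informative = y_demo + rng.normal(scale=0.3, size=n)  # tracks the label closely
identifier = np.arange(n)                             # row counter, similar to an ID column
pure_noise = rng.normal(size=n)                       # unrelated to the label

X_demo = np.column_stack([informative, identifier, pure_noise])
mi = mutual_info_classif(X_demo, y_demo, random_state=0)

for name, score in zip(["informative", "identifier", "pure_noise"], mi):
    print(f"{name:12s} {score:.3f}")
```

Scores near zero suggest a feature adds little on its own, although mutual information is computed per feature and will miss signal that only appears through interactions.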

#### d. Data Quality Issues
Missing values, outliers, or inconsistencies in the dataset can negatively impact model performance.

For example, if Monthly_Income has missing values or outliers, it may skew the model's predictions.

#### e. Domain-Specific Complexity
Attrition is influenced by complex, often subjective factors (e.g., employee satisfaction, work-life balance) that are difficult to quantify.

Without domain-specific features or external data (e.g., employee surveys), the model may lack the necessary information to make accurate predictions.

### 4. Recommendations for Improving the Dataset
#### a. Collect More Data
Gather additional data, especially for the minority class (attrition cases).

Use external data sources (e.g., employee surveys, industry turnover rates) to enrich the dataset.

#### b. Feature Engineering
Create new features that better capture the reasons for attrition. For example:

- Workload_Score: A metric combining Project_Count and Average_Hours_Worked_Per_Week.
- Satisfaction_Index: A composite score based on Job_Satisfaction, Work_Life_Balance, and Work_Environment_Satisfaction.
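
The two suggested features could be sketched as follows. The rows are invented, the equal weighting is an arbitrary choice, and Job_Satisfaction and Work_Life_Balance are assumed to be numeric ratings on the same scale as Work_Environment_Satisfaction. The helper `min_max` is a hypothetical name introduced here.

```python
import pandas as pd

# Invented rows; column names follow the dataset.
toy = pd.DataFrame({
    "Project_Count": [6, 2, 9],
    "Average_Hours_Worked_Per_Week": [54, 45, 48],
    "Job_Satisfaction": [4, 2, 1],
    "Work_Life_Balance": [3, 2, 1],
    "Work_Environment_Satisfaction": [4, 4, 2],
})

def min_max(series):
    """Rescale a column to [0, 1] so columns with different units can be combined."""
    return (series - series.min()) / (series.max() - series.min())

# Equal-weight average of rescaled project count and weekly hours (assumed weighting).
toy["Workload_Score"] = (
    min_max(toy["Project_Count"]) + min_max(toy["Average_Hours_Worked_Per_Week"])
) / 2

# Mean of the three satisfaction ratings.
satisfaction_cols = ["Job_Satisfaction", "Work_Life_Balance", "Work_Environment_Satisfaction"]
toy["Satisfaction_Index"] = toy[satisfaction_cols].mean(axis=1)

print(toy[["Workload_Score", "Satisfaction_Index"]].round(3))
```

In a real pipeline the minimum and maximum used for rescaling would be learned from the training split only, and `min_max` would need a guard for constant columns to avoid dividing by zero.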

#### c. Address Class Imbalance
Use resampling techniques such as SMOTE or ADASYN, or class-weight adjustments, to give more importance to the minority class.

Consider anomaly detection or one-class classification techniques if the imbalance is extreme.
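
Class weights can be derived from the label counts rather than tuned blindly. The sketch below uses labels with the same 162:38 ratio as the test split in section 6 (an illustrative stand-in for the training labels). It computes scikit-learn's "balanced" weights, which is what `class_weight='balanced'` uses, and the negative-to-positive ratio that the XGBoost documentation suggests as a typical value for `scale_pos_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels with a 162:38 ratio of "No" (0) to "Yes" (1).
y_demo = np.array([0] * 162 + [1] * 38)

# "balanced" weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

# Ratio of negatives to positives, a common starting point for XGBoost's scale_pos_weight.
scale_pos_weight = (y_demo == 0).sum() / (y_demo == 1).sum()

print(f"class weights: {weights.round(3)}")
print(f"scale_pos_weight: {scale_pos_weight:.3f}")
```

Note that this notebook applies ADASYN oversampling, `class_weight='balanced'`, and a tuned `scale_pos_weight` at the same time. Stacking several corrections can over-compensate for the imbalance, so it may be worth testing each one on its own.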

#### d. Improve Data Quality
Handle missing values, remove outliers, and correct inconsistencies in the dataset.

Use robust preprocessing techniques to ensure the data is in a usable format.
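
As a small sketch of what such cleaning might look like, the example below takes an invented Monthly_Income column with one missing value and one implausible outlier, fills the gap with the median, and clips values that fall outside 1.5 times the interquartile range. The values and the choice of median imputation and IQR clipping are assumptions, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Invented values: one missing income and one implausible outlier.
income = pd.Series(
    [6809, 10206, np.nan, 13079, 13744, 15488, 250000],
    name="Monthly_Income",
)

n_missing = income.isna().sum()

# Median imputation is less sensitive to the outlier than the mean.
filled = income.fillna(income.median())

# Clip values outside the 1.5 * IQR fences instead of dropping the rows.
q1, q3 = filled.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clipped = filled.clip(lower=lower, upper=upper)

print(f"missing values: {n_missing}")
print(clipped.round(1).tolist())
```

Clipping keeps every row, which matters when attrition cases are scarce. As with scaling, the median and the fences should be computed on the training split and then applied unchanged to validation and test data.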

#### e. Incorporate Domain Knowledge
Consult HR or domain experts to identify key drivers of attrition and ensure they are adequately represented in the dataset.

Use domain-specific insights to guide feature engineering and model selection.

### 5. Conclusion
The dataset's class imbalance and apparent lack of predictive features make it challenging to build a strong predictive model for employee attrition. In the run shown in section 6, the tuned ensemble scored a ROC-AUC of 0.462 and a PR-AUC of 0.180 on the test set, no better than a random ranking (0.5 and roughly the 19% attrition rate, respectively). Techniques like resampling and feature engineering can help, but the dataset itself may need to be improved by collecting more data, enhancing feature quality, and incorporating domain-specific insights before a model can deliver useful predictions.