 # HR EMPLOYEE ATTIRITION LLM ANALYSIS

# HR Attrition Analysis: ML + LLM Integration

## Project Overview

Employee attrition is defined as the natural process by which employees leave the workforce (Yang & Islam, 2020). Employee turnover is regarded as a key issue for organizations due to its adverse effects on workplace productivity and accomplishing organizational objectives on time.

### Technical Innovation
This project demonstrates the integration of traditional Machine Learning models with Large Language Models (LLMs) to create human-readable, actionable insights from predictive analytics.

### Objectives
- Predict employee attrition using Random Forest and Logistic Regression
- Generate personalized risk assessments using GPT-4
- Create scalable framework for automated insight generation
- Compare ML predictions with LLM explanations for strategic decision-making

### Dataset
- IBM HR Employee Attrition dataset
- 1,470 employees, 36 attributes
- Stratified sample of 249 employees for cost-effective LLM analysis

---

## 1. Data Loading & Preprocessing

In [34]:
# Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mysql.connector
import json
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from openai import OpenAI
import warnings
warnings.filterwarnings('ignore')

# Database Configuration
# Note: Replace with your actual database credentials
db_config = {
    'host': 'localhost',
    'user': 'your_username',
    'password': 'your_password', 
    'database': 'hrdb',
    'use_pure': True
}

In [None]:
# Database Connection
db = mysql.connector.connect(
    host="localhost",
    user="root",
    password="your_password",  # Replace with actual credentials
    database="hrdb",
    use_pure=True
)

print("Database connection established successfully")

# Data Extraction: Join employee demographics with employment details
query = """
SELECT DISTINCT * FROM table1 
JOIN table2 ON table1.EmployeeNumber = table2.EmployeeNumber
"""

# Execute query and create DataFrame
mycursor = db.cursor()
mycursor.execute(query)
result = mycursor.fetchall()
column_names = [i[0] for i in mycursor.description]
df = pd.DataFrame(result, columns=column_names)

print(f"Dataset loaded: {len(df)} employees, {len(df.columns)} features")

Database connection established successfully
Dataset loaded: 1470 employees, 36 features


In [36]:
# Data Extraction: Employee Demographics + Employment Details
# Dataset is normalized across two tables, requiring JOIN operation
query = """
SELECT DISTINCT * FROM table1 
JOIN table2 ON table1.EmployeeNumber = table2.EmployeeNumber
"""

# Execute query and create DataFrame
mycursor.execute(query)
result = mycursor.fetchall()
column_names = [i[0] for i in mycursor.description]
df = pd.DataFrame(result, columns=column_names)

print(f"Full dataset loaded: {len(df)} employees, {len(df.columns)} features")
print(f"Attrition rate: {(df['Attrition'] == 'Yes').mean():.1%}")

Full dataset loaded: 1470 employees, 36 features
Attrition rate: 16.1%


## 1. Data Loading & Strategic Sampling

### Business Context
Employee attrition analysis combined with modern AI explanations to create actionable HR insights.

### Technical Approach
- Traditional ML models for accurate predictions
- LLM integration for human-readable explanations
- Cost-effective sampling strategy for API budget management



In [37]:
# Database Connection
db = mysql.connector.connect(
    host="localhost",
    user="root",
    password="KikiLili4ever$",
    database="hrdb",
    use_pure=True
)

print("Database connection established")

# Data Extraction: Join normalized tables
query = """
SELECT DISTINCT * FROM table1 
JOIN table2 ON table1.EmployeeNumber = table2.EmployeeNumber
"""

mycursor = db.cursor()
mycursor.execute(query)
result = mycursor.fetchall()
column_names = [i[0] for i in mycursor.description]
df = pd.DataFrame(result, columns=column_names)

# Dataset Overview
print(f"Full dataset: {len(df)} employees, {df.shape[1]} features")
print(f"Overall attrition rate: {(df['Attrition'] == 'Yes').mean():.1%}")

# Strategic Sampling with Proper Department Representation
# Each LLM analysis costs ~$0.02, so 249 employees ≈ $5 total

# Create combined stratification variable (Department + Attrition)
df['stratify_var'] = df['Department'] + '_' + df['Attrition']

# Multi-variable stratified sampling
sample_df, _ = train_test_split(
    df,
    test_size=0.83,
    stratify=df['stratify_var'],  # Maintains both department AND attrition ratios
    random_state=42
)

# Clean up temporary column
sample_df = sample_df.drop('stratify_var', axis=1)
df = df.drop('stratify_var', axis=1)

print(f"Sample size: {len(sample_df)} employees")
print(f"Sample attrition rate: {(sample_df['Attrition'] == 'Yes').mean():.1%}")
print("✓ Representative sample created for LLM analysis")

Database connection established
Full dataset: 1470 employees, 36 features
Overall attrition rate: 16.1%
Sample size: 249 employees
Sample attrition rate: 16.1%
✓ Representative sample created for LLM analysis


In [38]:
# Sample Representativeness Validation
print("\nSample Quality Check:")
print("Department Distribution Verification:")
print("Original → Sample:")
for dept in df['Department'].unique():
    orig_pct = (df['Department'] == dept).mean()
    sample_pct = (sample_df['Department'] == dept).mean()
    print(f"  {dept}: {orig_pct:.1%} → {sample_pct:.1%}")

print("\nDepartment Attrition Rate Verification:")
print("Original → Sample:")
for dept in df['Department'].unique():
    orig_rate = (df[df['Department'] == dept]['Attrition'] == 'Yes').mean()
    sample_rate = (sample_df[sample_df['Department'] == dept]['Attrition'] == 'Yes').mean()
    print(f"  {dept}: {orig_rate:.1%} → {sample_rate:.1%}")

print("✓ Sample maintains original data distributions")


Sample Quality Check:
Department Distribution Verification:
Original → Sample:
  Sales: 30.3% → 30.5%
  Research & Development: 65.4% → 65.1%
  Human Resources: 4.3% → 4.4%

Department Attrition Rate Verification:
Original → Sample:
  Sales: 20.6% → 21.1%
  Research & Development: 13.8% → 13.6%
  Human Resources: 19.0% → 18.2%
✓ Sample maintains original data distributions


## 2. Machine Learning Model Development

Traditional ML models provide accurate attrition predictions that will be enhanced with LLM explanations.

In [39]:
# ML Pipeline Setup
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Data Preprocessing
ml_df = sample_df.copy()
categorical_cols = ml_df.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical columns identified: {len(categorical_cols)} features")

# Encode categorical variables
encoded_df = ml_df.copy()
label_encoders = {}

# Target variable encoding
le_target = LabelEncoder()
encoded_df['Attrition'] = le_target.fit_transform(encoded_df['Attrition'])
print(f"Target encoding: No=0, Yes=1")

# Feature encoding
for col in categorical_cols:
    if col != 'Attrition':
        le = LabelEncoder()
        encoded_df[col] = le.fit_transform(encoded_df[col])
        label_encoders[col] = le

print(f"Encoded {len(categorical_cols)-1} categorical features")

Categorical columns identified: 9 features
Target encoding: No=0, Yes=1
Encoded 8 categorical features


In [40]:
# Feature Engineering & Data Preparation
encoded_df = ml_df.copy()

# Target variable encoding
le_target = LabelEncoder()
encoded_df['Attrition'] = le_target.fit_transform(encoded_df['Attrition'])

# Feature encoding
label_encoders = {}
for col in categorical_cols:
    if col != 'Attrition':
        le = LabelEncoder()
        encoded_df[col] = le.fit_transform(encoded_df[col])
        label_encoders[col] = le

# Remove irrelevant features (following Yang & Islam methodology)
irrelevant_features = ['EmployeeNumber', 'EmployeeCount', 'Over18', 'StandardHours', 
                      'DailyRate', 'HourlyRate', 'MonthlyRate']
features_to_remove = [col for col in irrelevant_features if col in encoded_df.columns]

X = encoded_df.drop(['Attrition'] + features_to_remove, axis=1)
y = encoded_df['Attrition']

print(f"Final feature set: {X.shape[1]} features")
print(f"Removed irrelevant features: {features_to_remove}")

Final feature set: 27 features
Removed irrelevant features: ['EmployeeNumber', 'EmployeeCount', 'Over18', 'StandardHours', 'DailyRate', 'HourlyRate', 'MonthlyRate']


In [41]:
# Model Training & Evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train ML Models
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
lr_model = LogisticRegression(random_state=42, max_iter=1000)

rf_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)

# Model Performance
rf_accuracy = accuracy_score(y_test, rf_model.predict(X_test))
lr_accuracy = accuracy_score(y_test, lr_model.predict(X_test))

print(f"Training: {X_train.shape[0]} employees | Testing: {X_test.shape[0]} employees")
print(f"Random Forest Accuracy: {rf_accuracy:.1%}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.1%}")

# Feature Importance Analysis
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 5 Predictive Features:")
for i, row in importance_df.head(5).iterrows():
    print(f"{i+1}. {row['feature']}: {row['importance']:.3f}")

Training: 199 employees | Testing: 50 employees
Random Forest Accuracy: 86.0%
Logistic Regression Accuracy: 88.0%

Top 5 Predictive Features:
14. MonthlyIncome: 0.108
21. TotalWorkingYears: 0.100
4. DistanceFromHome: 0.076
1. Age: 0.065
16. OverTime: 0.059


In [42]:
# Sample Representativeness Validation
print("Sample Quality Check:")
print("Department Distribution:")
dept_dist = sample_df['Department'].value_counts(normalize=True).sort_index()
for dept, pct in dept_dist.items():
    print(f"  {dept}: {pct:.1%}")

print("\nJob Level Distribution:")
job_dist = sample_df['JobLevel'].value_counts(normalize=True).sort_index() 
for level, pct in job_dist.items():
    print(f"  Level {level}: {pct:.1%}")

print("✓ Sample maintains original data distributions")

Sample Quality Check:
Department Distribution:
  Human Resources: 4.4%
  Research & Development: 65.1%
  Sales: 30.5%

Job Level Distribution:
  Level 1: 36.9%
  Level 2: 32.5%
  Level 3: 16.1%
  Level 4: 6.8%
  Level 5: 7.6%
✓ Sample maintains original data distributions


## 3. LLM Integration for Actionable Insights

Traditional ML provides accurate predictions but lacks human-readable explanations. We integrate GPT-4 to translate numerical predictions into strategic recommendations managers can understand and act upon.

In [None]:
# Section 3: LLM Integration for Actionable Insights
from openai import OpenAI
import json

# Initialize OpenAI client 
# Note: Replace with your actual API key
api_key = "your-openai-api-key-here"  # Replace with actual key
client = OpenAI(api_key=api_key)

# Test connection
print("Testing GPT-4 connection...")
try:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Say 'GPT-4 connected successfully!'"}],
        max_tokens=15
    )
    print(response.choices[0].message.content)
    print("GPT-4 API connected successfully!")
except Exception as e:
    print(f"GPT-4 Error: {e}")

Testing GPT-4 connection...
'GPT-4 connected successfully!'
GPT-4 API connected successfully!


In [44]:
# Employee Risk Assessment Function (Enhanced)
def explain_attrition_prediction(employee_data, ml_prediction, ml_probability):
    """
    Generate human-readable explanations focusing on key predictive factors
    """
    # Convert employee data (handle numpy types for JSON compatibility)
    employee_info = {}
    for i, feature in enumerate(X_paper.columns):
        value = employee_data[i]
        employee_info[feature] = value.item() if hasattr(value, 'item') else value
    
    # Focus on most predictive features from our analysis
    key_features = {
        'DistanceFromHome': employee_info.get('DistanceFromHome'),
        'Age': employee_info.get('Age'), 
        'MonthlyIncome': employee_info.get('MonthlyIncome'),
        'TotalWorkingYears': employee_info.get('TotalWorkingYears'),
        'WorkLifeBalance': employee_info.get('WorkLifeBalance')
    }
    
    prompt = f"""
    You are an expert HR analyst. Analyze this employee's attrition risk:
    
    Key Predictive Factors: {key_features}
    ML Prediction: {'Will likely leave' if ml_prediction == 1 else 'Will likely stay'}
    Risk Probability: {ml_probability:.1%}
    
    Provide: 1) Analysis of key risk factors 2) Business rationale 3) Specific HR recommendations
    Keep under 120 words, actionable for managers.
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=180,
            temperature=0.3
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {e}"

print("Employee risk assessment function ready")

Employee risk assessment function ready


## 4. Individual Employee Risk Assessments

Demonstrating personalized AI-powered insights that translate ML predictions into actionable HR recommendations.

In [45]:
# Generate Individual Risk Assessment
print("EMPLOYEE RISK ASSESSMENT EXAMPLE")
print("="*50)

# Select sample employee from test set
sample_employee = X_test_paper.iloc[0].values
actual_outcome = y_test.iloc[0]

# Generate ML prediction
rf_prediction = rf_model.predict([sample_employee])[0]
rf_probability = rf_model.predict_proba([sample_employee])[0][1]

# Display prediction summary
print(f"Actual Outcome: {'Left' if actual_outcome == 1 else 'Stayed'}")
print(f"ML Prediction: {'Will leave' if rf_prediction == 1 else 'Will stay'}")
print(f"Risk Probability: {rf_probability:.1%}")

# Generate AI explanation
print(f"\nAI-Generated Strategic Assessment:")
print("-" * 40)
explanation = explain_attrition_prediction(sample_employee, rf_prediction, rf_probability)
print(explanation)

EMPLOYEE RISK ASSESSMENT EXAMPLE
Actual Outcome: Stayed
ML Prediction: Will stay
Risk Probability: 5.0%

AI-Generated Strategic Assessment:
----------------------------------------
1) The employee is at a low risk of attrition (5%), likely due to their age, reasonable commute, and balanced work-life situation. Their income and total working years also suggest stability.
2) Retaining experienced employees is crucial for business continuity and knowledge retention.
3) Recommendations: Maintain the employee's work-life balance and consider recognizing their experience and loyalty with a salary review or career development opportunities. Regular check-ins can also help to address any potential issues early.


## 5. Portfolio-Wide Risk Analysis

Scaling individual predictions to organizational insights across all 249 employees.

In [46]:
# Generate Predictions for All Employees
all_predictions = rf_model.predict(X_paper)
all_probabilities = rf_model.predict_proba(X_paper)[:, 1]

# Create comprehensive results dataset
results_df = pd.DataFrame({
    'employee_index': range(len(X_paper)),
    'actual_outcome': y.values,
    'predicted_outcome': all_predictions,
    'leave_probability': all_probabilities
})

# Add key demographic features for analysis
for feature in ['Age', 'DistanceFromHome', 'MonthlyIncome', 'TotalWorkingYears']:
    if feature in X_paper.columns:
        results_df[feature] = X_paper[feature].values

# Risk Segmentation Analysis
high_risk = results_df[results_df['leave_probability'] > 0.5]
medium_risk = results_df[(results_df['leave_probability'] >= 0.2) & (results_df['leave_probability'] <= 0.5)]
low_risk = results_df[results_df['leave_probability'] < 0.2]

print("ORGANIZATIONAL RISK PORTFOLIO")
print("="*40)
print(f"Total employees analyzed: {len(results_df)}")
print(f"High risk (>50%): {len(high_risk)} employees")
print(f"Medium risk (20-50%): {len(medium_risk)} employees") 
print(f"Low risk (<20%): {len(low_risk)} employees")

print(f"\nKey Pattern: High-risk employees average {high_risk['Age'].mean():.1f} years old")
print(f"vs Low-risk employees average {low_risk['Age'].mean():.1f} years old")

ORGANIZATIONAL RISK PORTFOLIO
Total employees analyzed: 249
High risk (>50%): 11 employees
Medium risk (20-50%): 67 employees
Low risk (<20%): 171 employees

Key Pattern: High-risk employees average 30.5 years old
vs Low-risk employees average 37.7 years old


## 6. Executive Strategic Analysis

Synthesizing individual predictions into actionable organizational insights using AI-powered analysis.

In [47]:
# Executive Analysis Function
def generate_strategic_analysis():
    """
    Generate comprehensive strategic insights from employee data patterns
    """
    # Analyze key organizational patterns
    feature_insights = {}
    
    # Department-level analysis
    dept_analysis = sample_df.groupby('Department')['Attrition'].apply(
        lambda x: (x == 'Yes').mean()
    )
    
    # Risk factor comparison
    leavers = sample_df[sample_df['Attrition'] == 'Yes']
    stayers = sample_df[sample_df['Attrition'] == 'No']
    
    analysis_summary = {
        'total_employees': len(sample_df),
        'attrition_rate': f"{(sample_df['Attrition'] == 'Yes').mean():.1%}",
        'high_risk_count': len(results_df[results_df['leave_probability'] > 0.5]),
        'dept_rates': {dept: f"{rate:.1%}" for dept, rate in dept_analysis.items()},
        'age_gap': round(stayers['Age'].mean() - leavers['Age'].mean(), 1)
    }
    
    prompt = f"""
    Executive HR Analytics Brief
    
    Analysis Summary: {json.dumps(analysis_summary, indent=2)}
    
    Provide strategic analysis:
    1. Top 3 attrition drivers
    2. Highest-risk employee segments  
    3. Department-specific patterns
    4. Priority interventions with expected impact
    
    Format as executive summary (250 words max).
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=350,
            temperature=0.3
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {e}"

# Generate Executive Analysis
print("EXECUTIVE STRATEGIC ANALYSIS")
print("="*50)
strategic_analysis = generate_strategic_analysis()
print(strategic_analysis)

EXECUTIVE STRATEGIC ANALYSIS
Executive Summary:

Our HR analytics reveals that we currently have a total of 249 employees, with an overall attrition rate of 16.1%. The highest attrition rate is observed in the Sales department (21.1%), followed by Human Resources (18.2%), and Research & Development (13.6%). 

The top three attrition drivers appear to be departmental alignment, age, and a high-risk group of 11 employees. The age gap of 1.9 suggests a potential generational conflict or differing work styles that may be contributing to attrition. 

The highest-risk employee segments are in the Sales department, which also has the highest attrition rate. This suggests that interventions should be prioritized in this area. 

Department-specific patterns indicate that attrition rates vary significantly across departments. This suggests that department-specific interventions may be necessary, rather than a one-size-fits-all approach. 

To mitigate these issues, priority interventions could in

In [48]:
# Verify department attrition rates
print("Department Attrition Verification:")
for dept in sample_df['Department'].unique():
    dept_data = sample_df[sample_df['Department'] == dept]
    total = len(dept_data)
    left = (dept_data['Attrition'] == 'Yes').sum()
    rate = left / total
    print(f"{dept}: {left}/{total} = {rate:.1%}")

Department Attrition Verification:
Research & Development: 22/162 = 13.6%
Sales: 16/76 = 21.1%
Human Resources: 2/11 = 18.2%


In [49]:
# Fixed comprehensive analysis function
def generate_strategic_attrition_analysis():
    """
    Generate comprehensive strategic analysis using LLM
    """
    
    # Work with the original sample data that has all features
    leavers_mask = y == 1
    stayers_mask = y == 0
    
    feature_analysis = {}
    categorical_features = ['BusinessTravel', 'Department', 'JobRole', 'MaritalStatus', 'OverTime']
    
    # Analyze key features only (avoid overwhelming the LLM)
    key_features = ['Age', 'DistanceFromHome', 'MonthlyIncome', 'TotalWorkingYears', 
                   'BusinessTravel', 'Department', 'JobRole', 'OverTime']
    
    for feature in key_features:
        if feature in X_paper.columns:
            if feature in categorical_features:
                # For categorical: attrition rate by category
                feature_data = sample_df[feature]
                attrition_by_category = {}
                for value in feature_data.unique():
                    mask = feature_data == value
                    if mask.sum() > 0:
                        attrition_rate = y[mask].mean()
                        count = mask.sum()
                        attrition_by_category[str(value)] = f"{attrition_rate:.1%} ({count} employees)"
                feature_analysis[feature] = attrition_by_category
            else:
                # For numerical: compare leavers vs stayers
                leavers_values = sample_df[feature][leavers_mask]
                stayers_values = sample_df[feature][stayers_mask]
                feature_analysis[feature] = {
                    "leavers_avg": round(leavers_values.mean(), 1),
                    "stayers_avg": round(stayers_values.mean(), 1),
                    "difference": round(leavers_values.mean() - stayers_values.mean(), 1)
                }
    
    # Create strategic prompt
    prompt = f"""
    You are a senior HR analytics consultant. Analyze this employee attrition data for strategic insights.
    
    OVERVIEW:
    - 249 employees analyzed
    - 16.1% overall attrition rate
    - 34 high-risk employees identified
    
    KEY FEATURE ANALYSIS:
    {json.dumps(feature_analysis, indent=2)}
    
    Provide strategic analysis including:
    1. Primary attrition drivers and root causes
    2. High-risk employee segments
    3. Department/role patterns
    4. Prioritized recommendations
    5. Specific interventions
    
    Write as executive summary (300 words max).
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=400,
            temperature=0.3
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {e}"

print("Generating strategic analysis...")
strategic_analysis = generate_strategic_attrition_analysis()
print("\n" + "="*60)
print("STRATEGIC ATTRITION ANALYSIS")
print("="*60)
print(strategic_analysis)

Generating strategic analysis...

STRATEGIC ATTRITION ANALYSIS
EXECUTIVE SUMMARY:

The primary drivers of attrition appear to be age, distance from home, monthly income, total working years, business travel, and overtime. Younger employees, those living further from work, those with lower incomes, and those with fewer years of experience are more likely to leave. Additionally, frequent business travel and overtime work significantly increase attrition rates.

High-risk employee segments include those who travel frequently for business (23.5% attrition), those working overtime (31.3% attrition), and Sales Representatives (35.7% attrition). 

In terms of department/role patterns, the Sales department has the highest attrition rate (21.1%), followed by Human Resources (18.2%). Within job roles, Sales Representatives and Laboratory Technicians have the highest attrition rates. 

Recommendations to reduce attrition include: 

1. Implement flexible work policies to accommodate employees livi

## 7. Key Learnings & Portfolio Value

### Technical Demonstration
This project successfully integrates traditional ML with modern LLM capabilities, showing proficiency in:
- API integration and prompt engineering
- Hybrid analytics approaches (predictive + explanatory)
- Sampling methodology and bias detection

### Critical Insight
The initial sampling approach revealed the importance of multi-variable stratification when working with heterogeneous populations. Correcting this bias aligned results with comprehensive analysis findings.

### Business Value
While LLMs add interpretability to ML predictions, human analytical judgment remains essential for methodology validation and strategic insight generation.