<a href="https://colab.research.google.com/github/Lake-Commander/Neuro_well/blob/main/a3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preliminary Research to Understand the Concept

🔹 1. What is Employee Burnout?
Employee burnout is a psychological syndrome emerging as a prolonged response to chronic interpersonal stressors on the job. It has three main dimensions:

- Emotional exhaustion: feeling drained and fatigued
- Depersonalization: cynicism and detachment from work
- Reduced personal accomplishment: feeling ineffective at work

Burnout has been classified by the World Health Organization (WHO) as an “occupational phenomenon,” not a medical condition.

🔹 2. Organizational Impact of Burnout
Burnout is not just a personal health issue—it’s a major organizational risk. Its consequences include:

| Impact Area | Description |
| --- | --- |
| Productivity | Burned-out employees are less efficient and more prone to errors. |
| Turnover | Increases employee attrition rates and recruitment costs. |
| Workplace Morale | Creates a toxic or disengaged environment, affecting team dynamics. |
| Healthcare Costs | Leads to higher rates of absenteeism and stress-related illnesses. |
| Reputation | Companies that don't prioritize employee well-being may struggle to attract top talent. |

🔹 3. Why Use Data to Predict Burnout?
Data-driven approaches help organizations move from reactive to proactive well-being strategies:

| Benefit | Description |
| --- | --- |
| Early Warning | Identify at-risk employees before burnout manifests severely. |
| Resource Optimization | Allocate tasks, support, and interventions more effectively. |
| Tailored Interventions | Develop programs based on actual needs, not assumptions. |
| Strategic Planning | Inform HR policies and leadership decisions with facts. |
| Continuous Improvement | Monitor the effectiveness of wellness initiatives over time. |

🧩 In Summary:
Employee burnout is a measurable, preventable, and solvable workplace challenge. Predictive analytics allows companies like NeuroWell Analytics to lead the charge in promoting workplace mental health with precision and impact.

## Imports

In [4]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Date and time handling
from datetime import datetime

# Preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Model development
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# Evaluation
from sklearn.metrics import mean_squared_error, r2_score

# Saving models and utilities
import joblib
import os


## EDA

In [5]:
# Load dataset
df = pd.read_csv("train.csv")

# Basic info
print("Shape:", df.shape)
print("\nMissing values:\n", df.isnull().sum())

# Convert 'Date of Joining' to datetime and derive 'Tenure'
df['Date of Joining'] = pd.to_datetime(df['Date of Joining'], errors='coerce')
df['Tenure'] = pd.to_datetime('today').year - df['Date of Joining'].dt.year

# Drop 'Employee ID' (not useful for modeling)
df.drop(columns=['Employee ID'], inplace=True)

# Basic stats
print("\nDescriptive statistics:\n", df.describe())

# Create output directory for plots
os.makedirs("eda_outputs", exist_ok=True)

# Distribution of Burn Rate
plt.figure(figsize=(8, 4))
sns.histplot(df['Burn Rate'].dropna(), kde=True, color='orange')
plt.title("Distribution of Burn Rate")
plt.savefig("eda_outputs/burn_rate_distribution.png")
plt.close()

# Burn Rate vs Mental Fatigue Score
plt.figure(figsize=(8, 4))
sns.scatterplot(data=df, x="Mental Fatigue Score", y="Burn Rate", hue="Gender")
plt.title("Burn Rate vs Mental Fatigue Score")
plt.savefig("eda_outputs/burnrate_vs_fatigue.png")
plt.close()

# Box plot of Burn Rate by Designation
plt.figure(figsize=(8, 4))
sns.boxplot(x='Designation', y='Burn Rate', data=df)
plt.title("Burn Rate by Designation")
plt.savefig("eda_outputs/burnrate_by_designation.png")
plt.close()

# Burn Rate by Company Type
plt.figure(figsize=(8, 4))
sns.boxplot(x='Company Type', y='Burn Rate', data=df)
plt.title("Burn Rate by Company Type")
plt.savefig("eda_outputs/burnrate_by_company_type.png")
plt.close()

# Heatmap for correlations
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.savefig("eda_outputs/correlation_heatmap.png")
plt.close()

print("EDA plots saved in 'eda_outputs/' folder.")

Shape: (22750, 9)

Missing values:
 Employee ID                0
Date of Joining            0
Gender                     0
Company Type               0
WFH Setup Available        0
Designation                0
Resource Allocation     1381
Mental Fatigue Score    2117
Burn Rate               1124
dtype: int64

Descriptive statistics:
                      Date of Joining   Designation  Resource Allocation  \
count                          22750  22750.000000         21369.000000   
mean   2008-07-01 09:28:05.274725120      2.178725             4.481398   
min              2008-01-01 00:00:00      0.000000             1.000000   
25%              2008-04-01 00:00:00      1.000000             3.000000   
50%              2008-07-02 00:00:00      2.000000             4.000000   
75%              2008-09-30 00:00:00      3.000000             6.000000   
max              2008-12-31 00:00:00      5.000000            10.000000   
std                              NaN      1.135145             2
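
The statistics above show that every `Date of Joining` falls within 2008, so a tenure computed as a difference in calendar years is the same for every row, and a tenure in whole years from the run date takes at most two distinct values. One possible alternative, shown here only as a sketch, measures tenure in days against a fixed reference date. The names `REFERENCE_DATE` and `tenure_in_days` are hypothetical, the date 2009-01-01 is an arbitrary assumption, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical reference date (an assumption): any fixed date after the latest joining date works.
REFERENCE_DATE = pd.Timestamp("2009-01-01")


def tenure_in_days(join_dates: pd.Series, reference: pd.Timestamp = REFERENCE_DATE) -> pd.Series:
    """Return the number of days between each joining date and a fixed reference date."""
    parsed = pd.to_datetime(join_dates, errors="coerce")  # unparseable or missing values become NaT
    return (reference - parsed).dt.days  # NaT rows come out as NaN


# Made-up sample within the date range reported by df.describe()
sample = pd.DataFrame({"Date of Joining": ["2008-01-01", "2008-07-02", "2008-12-31", None]})
sample["Tenure Days"] = tenure_in_days(sample["Date of Joining"])
print(sample)
```

Because the reference date is fixed, the feature is reproducible across runs, and employees who joined at different points in the year get different values.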

## Feature Engineering

In [6]:
def load_data(train_path='train.csv', test_path='test.csv'):
    return pd.read_csv(train_path), pd.read_csv(test_path)

def encode_categoricals(train_df, test_df, columns, save_dir='models'):
    """Label-encode the given columns and save one fitted encoder per column."""
    os.makedirs(save_dir, exist_ok=True)
    for col in columns:
        le = LabelEncoder()
        # Fit on train and test together so categories seen only in the test set still get a code
        combined = pd.concat([train_df[col], test_df[col]], axis=0).astype(str)
        le.fit(combined)
        train_df[col] = le.transform(train_df[col].astype(str))
        test_df[col] = le.transform(test_df[col].astype(str))
        joblib.dump(le, f"{save_dir}/{col}_encoder.pkl")
    return train_df, test_df

def fill_missing_values(df):
    # Avoid chained assignment: assign result directly back to the column
    df['Mental Fatigue Score'] = df['Mental Fatigue Score'].fillna(df['Mental Fatigue Score'].mean())
    df['Resource Allocation'] = df['Resource Allocation'].fillna(df['Resource Allocation'].median())
    return df

def calculate_tenure(join_date):
    """Whole years between the joining date and now (the result depends on the run date)."""
    if pd.isnull(join_date):
        return np.nan
    return (datetime.now() - pd.to_datetime(join_date)).days // 365

def add_tenure_feature(df):
    df['Tenure'] = df['Date of Joining'].apply(calculate_tenure)
    return df.drop(columns=['Date of Joining'])

def normalize_columns(train_df, test_df, columns, save_dir='models'):
    scaler = MinMaxScaler()
    train_df[columns] = scaler.fit_transform(train_df[columns])
    test_df[columns] = scaler.transform(test_df[columns])
    joblib.dump(scaler, f"{save_dir}/scaler.pkl")
    return train_df, test_df

def save_processed(train_df, test_df, output_dir='processed'):
    os.makedirs(output_dir, exist_ok=True)
    train_df.to_csv(f"{output_dir}/train_processed.csv", index=False)
    test_df.to_csv(f"{output_dir}/test_processed.csv", index=False)
    print(f"✅ Preprocessing complete. Files saved in '{output_dir}'.")

def preprocess():
    os.makedirs("models", exist_ok=True)

    train_df, test_df = load_data()

    # Encode categoricals
    cat_cols = ['Gender', 'Company Type', 'WFH Setup Available']
    train_df, test_df = encode_categoricals(train_df, test_df, cat_cols)

    # Handle missing
    train_df = fill_missing_values(train_df)
    test_df = fill_missing_values(test_df)

    # Add tenure
    train_df = add_tenure_feature(train_df)
    test_df = add_tenure_feature(test_df)

    # Normalize
    num_cols = ['Resource Allocation', 'Designation', 'Tenure']
    train_df, test_df = normalize_columns(train_df, test_df, num_cols)

    # Save results
    save_processed(train_df, test_df)

if __name__ == "__main__":
    preprocess()


✅ Preprocessing complete. Files saved in 'processed'.
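
`fill_missing_values` above computes the mean and median separately for each frame it receives, so test rows are filled with statistics from the test set. A common alternative is to learn the fill values from the training data only and reuse them everywhere. The sketch below illustrates that idea. The function names `fit_imputation_values` and `apply_imputation` are hypothetical, and the toy frames are made up for the example.

```python
import numpy as np
import pandas as pd


def fit_imputation_values(train_df: pd.DataFrame) -> dict:
    """Learn fill values from the training data only."""
    return {
        "Mental Fatigue Score": train_df["Mental Fatigue Score"].mean(),
        "Resource Allocation": train_df["Resource Allocation"].median(),
    }


def apply_imputation(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy of df with missing values filled column by column."""
    return df.fillna(value=fill_values)


# Made-up toy data with the same column names as the notebook
toy_train = pd.DataFrame({
    "Mental Fatigue Score": [2.0, 4.0, np.nan, 6.0],
    "Resource Allocation": [1.0, np.nan, 3.0, 8.0],
})
toy_test = pd.DataFrame({
    "Mental Fatigue Score": [np.nan, 9.0],
    "Resource Allocation": [np.nan, 2.0],
})

fill_values = fit_imputation_values(toy_train)
toy_train_filled = apply_imputation(toy_train, fill_values)
toy_test_filled = apply_imputation(toy_test, fill_values)
print(toy_test_filled)
```

The same dictionary could be saved with `joblib.dump` next to the encoders and scaler so that later data is filled the same way.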


## Model Development

In [7]:
# Create models directory if not exists
os.makedirs('models', exist_ok=True)

# Load processed training data
train_df = pd.read_csv('processed/train_processed.csv')

# Drop rows where target is missing
train_df = train_df.dropna(subset=['Burn Rate'])

# Separate features and target
X = train_df.drop(columns=['Burn Rate', 'Employee ID'])
y = train_df['Burn Rate']

# Train-test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42)
}

best_model = None
best_model_name = ""
best_mse = float('inf')

# Train, evaluate, and save models
for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)

    # Predict on validation set
    y_pred = model.predict(X_val)

    # Evaluate
    mse = mean_squared_error(y_val, y_pred)
    r2 = r2_score(y_val, y_pred)
    print(f"{name} MSE: {mse:.4f}")
    print(f"{name} R²: {r2:.4f}")

    # Save model
    model_path = f'models/{name.lower()}.pkl'
    joblib.dump(model, model_path)
    print(f"{name} saved to {model_path}")

    # Track best model
    if mse < best_mse:
        best_mse = mse
        best_model = model
        best_model_name = name

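# Save the best model separately
# (explicit None check: ensemble estimators define __len__, so truthiness is unreliable)
if best_model is not None: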
    best_model_path = 'models/best_model.pkl'
    joblib.dump(best_model, best_model_path)
    print(f"\n✅ Best model is {best_model_name} with MSE: {best_mse:.4f}")
    print(f"📦 Saved as {best_model_path}")



Training LinearRegression...
LinearRegression MSE: 0.0050
LinearRegression R²: 0.8682
LinearRegression saved to models/linearregression.pkl

Training Ridge...
Ridge MSE: 0.0050
Ridge R²: 0.8682
Ridge saved to models/ridge.pkl

Training Lasso...
Lasso MSE: 0.0103
Lasso R²: 0.7285
Lasso saved to models/lasso.pkl

Training RandomForest...
RandomForest MSE: 0.0042
RandomForest R²: 0.8889
RandomForest saved to models/randomforest.pkl

✅ Best model is RandomForest with MSE: 0.0042
📦 Saved as models/best_model.pkl
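
MSE is expressed in squared units, which makes the figures above hard to read against a target that the prediction step treats as lying between 0 and 1. Taking the square root gives RMSE, which is in the same units as `Burn Rate`. The sketch below converts the printed values. Since those values are rounded to four decimals, the results are approximate.

```python
import math

# MSE values copied from the training output above (already rounded to 4 decimals)
reported_mse = {
    "LinearRegression": 0.0050,
    "Ridge": 0.0050,
    "Lasso": 0.0103,
    "RandomForest": 0.0042,
}

# RMSE is the square root of MSE and shares the target's units
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name} RMSE: {value:.4f}")
```

Read this way, the random forest is off by roughly 0.065 on average (in the root-mean-square sense), compared with about 0.071 for the two unregularized or lightly regularized linear models and about 0.10 for Lasso.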


## Predictions based on the best model

In [8]:
# === Load processed test data ===
test_df = pd.read_csv('processed/test_processed.csv')

# Keep track of Employee IDs for submission
employee_ids = test_df['Employee ID'].copy()

# Drop ID column before prediction
X_test = test_df.drop(columns=['Employee ID'])

# === Load the best model ===
model = joblib.load('models/best_model.pkl')

# === Make predictions ===
preds = model.predict(X_test)

# Clip predictions between 0 and 1
preds = preds.clip(0, 1)

# === Create submission DataFrame ===
submission = pd.DataFrame({
    'Employee ID': employee_ids,
    'Burn Rate': preds
})

# === Save to CSV ===
os.makedirs('submissions', exist_ok=True)
submission_path = 'submissions/predicted_burn_rate.csv'
submission.to_csv(submission_path, index=False)

print(f"✅ Submission file saved to '{submission_path}'")


✅ Submission file saved to 'submissions/predicted_burn_rate.csv'
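
To connect the predictions to the early-warning goal described in the research section, the submission could be filtered for employees whose predicted `Burn Rate` is high. The sketch below shows one way to do that. The cut-off `RISK_THRESHOLD = 0.6` is a hypothetical value that the source does not define, `flag_high_risk` is an illustrative name, and the example rows are made up.

```python
import pandas as pd

# Hypothetical cut-off (an assumption): the notebook does not define what counts as high risk.
RISK_THRESHOLD = 0.6


def flag_high_risk(predictions: pd.DataFrame, threshold: float = RISK_THRESHOLD) -> pd.DataFrame:
    """Return rows whose predicted Burn Rate is at or above the threshold, highest first."""
    at_risk = predictions[predictions["Burn Rate"] >= threshold]
    return at_risk.sort_values("Burn Rate", ascending=False).reset_index(drop=True)


# Made-up predictions in the same two-column layout as the submission file
example = pd.DataFrame({
    "Employee ID": ["E001", "E002", "E003", "E004"],
    "Burn Rate": [0.25, 0.72, 0.60, 0.91],
})

flagged = flag_high_risk(example)
print(flagged)
```

In practice the threshold would need to be chosen with HR or clinical input rather than taken from this example.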


## Insights

In [9]:
# insights_phase5.py

import pandas as pd
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import os

# === Ensure folders exist ===
os.makedirs('models', exist_ok=True)
os.makedirs('output_graphs/insights', exist_ok=True)

# === Load trained model and processed training data ===
model = joblib.load('models/randomforest.pkl')
df = pd.read_csv('processed/train_processed.csv')

# === Prepare features (excluding target and ID) ===
X = df.drop(columns=['Employee ID', 'Burn Rate'])

# === Feature importance extraction from RandomForest ===
importances = model.feature_importances_
features = X.columns

feat_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feat_df = feat_df.sort_values(by='Importance', ascending=False)

# === Save feature importance to CSV ===
feat_df.to_csv('models/feature_importances.csv', index=False)

# === Plot and save feature importance chart ===
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feat_df)
plt.title('Feature Importance from Random Forest')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()

# Save plot to insights subfolder
plot_path = 'output_graphs/insights/feature_importance_plot.png'
plt.savefig(plot_path)
plt.close()




# === Recommendations based on top features ===
"""
# 🔍 RECOMMENDATIONS BASED ON INSIGHTS

1. **Mental Fatigue Mitigation**
   - Mental Fatigue Score is the top predictor of Burnout.
   - ✅ Action: Implement wellness programs, mindfulness sessions, and encourage time-off.

2. **Review Resource Allocation**
   - High allocation correlates with burnout.
   - ✅ Action: Track and rebalance workloads, especially among lower designations.

3. **Enable Remote Work Options**
   - 'WFH Setup Available' influences burnout likelihood.
   - ✅ Action: Provide or improve remote setups and policies.

4. **Designation Sensitivity**
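   - Junior staff (Designation 0–2) appear more burnout-prone; feature importance gives no direction, so confirm this with the Burn Rate by Designation box plot.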
   - ✅ Action: Offer mentorship, manageable workloads, and growth pathways.

5. **Gender and Company-Type Patterns**
   - Some demographic segments show consistent trends.
   - ✅ Action: Investigate HR policy adjustments to support underrepresented or at-risk groups.

# 📦 OUTPUTS SAVED:
- models/feature_importances.csv
- output_graphs/insights/feature_importance_plot.png
"""
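
The ranking above relies on the random forest's impurity-based `feature_importances_`, which are derived from the training data. Permutation importance on held-out data is a commonly used cross-check: it measures how much the score drops when one feature's values are shuffled. The sketch below demonstrates the technique on synthetic data, not on the burnout dataset. The columns `signal` and `noise` and all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target depends strongly on "signal" and not at all on "noise"
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.uniform(0, 1, 500),
    "noise": rng.uniform(0, 1, 500),
})
y_demo = 0.8 * X_demo["signal"] + rng.normal(0, 0.05, 500)

# Hold out part of the data so importance is measured on rows the model has not seen
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Shuffle each feature several times and record the mean drop in R²
result = permutation_importance(
    demo_model, X_holdout, y_holdout, n_repeats=5, random_state=42
)
perm_df = pd.DataFrame({
    "Feature": X_demo.columns,
    "Importance": result.importances_mean,
}).sort_values(by="Importance", ascending=False)
print(perm_df)
```

Applied to this notebook, the same call would take the saved random forest and the validation split from the model development step. If both rankings agree, the recommendations rest on firmer ground.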