**Case File #004 – Tree-Based Models + k-NN + Feature Selection** (Day 4)


---

Dataset: Mental Health and Burnout in the Workplace

Loaded the Dataset

In [2]:
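import pandas as pd  # imported here because this cell runs before the main import cell
from google.colab import drive

# Mount Google Drive and read the survey CSV into a DataFrame.
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/ACM-TASKS/mental_health_workplace_survey.csv')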

Mounted at /content/drive


Imported essential libraries for data manipulation, visualization, model training, evaluation, and feature selection.

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

Data Preprocessing     
Performed data preprocessing by encoding categorical features using one-hot encoding and ordinal mapping for 'SalaryRange'. Removed unnecessary columns, split the data into training and testing sets, and scaled features for models like k-NN.

In [9]:
# Map the ordered salary bands to integers so their ranking is preserved.
ordinal_map = {
    '<40K': 0,
    '40K-60K': 1,
    '60K-80K': 2,
    '80K-100K': 3,
    '100K+': 4
}
df['SalaryRangeEncoded'] = df['SalaryRange'].map(ordinal_map)

# One-hot encode the nominal columns; drop_first removes one redundant dummy per column.
one_hot_cols = [
    'Gender', 'Country', 'JobRole', 'Department',
    'RemoteWork', 'HasMentalHealthSupport', 'HasTherapyAccess'
]
df_encoded = pd.get_dummies(df, columns=one_hot_cols, drop_first=True)

# Features: drop the identifier, the raw salary column, and the target.
# SalaryRangeEncoded is already in df_encoded because it was added before get_dummies.
X = df_encoded.drop(['EmployeeID', 'SalaryRange', 'BurnoutRisk'], axis=1)
y = df_encoded['BurnoutRisk']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
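
The split-then-scale step above can also be bundled into a single estimator. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for the survey data; the names `X_demo`, `y_demo`, and `knn_pipe` are illustrative and not part of the original notebook. It also passes `stratify`, which the notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# stratify keeps the class ratio the same in both splits (not done in the notebook).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only and reuses
# those statistics whenever it predicts on new data.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipe.fit(X_tr, y_tr)

pipe_acc = knn_pipe.score(X_te, y_te)
print(f"Pipeline k-NN accuracy: {pipe_acc:.3f}")
```

Keeping the scaler inside the pipeline makes it harder to accidentally fit it on test data, and the same object can later be passed to cross-validation helpers.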


Model Training and Evaluation    
Trained Decision Tree, Random Forest, and k-NN classifiers on the full feature set. Evaluated model performance using accuracy and ROC-AUC score to measure classification effectiveness.

In [15]:
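def evaluate(model, X_t, y_t):
    """Return (accuracy, ROC-AUC) of a fitted binary classifier on X_t, y_t."""
    y_pred = model.predict(X_t)
    acc = accuracy_score(y_t, y_pred)
    # ROC-AUC is computed from the predicted probability of the positive class.
    roc = roc_auc_score(y_t, model.predict_proba(X_t)[:, 1])
    return acc, roc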

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc, dt_roc = evaluate(dt, X_test, y_test)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_acc, rf_roc = evaluate(rf, X_test, y_test)

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
knn_acc, knn_roc = evaluate(knn, X_test_scaled, y_test)
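
To make the two metrics concrete, here is a small worked example on four hand-made predictions. The labels and probabilities are made up for illustration and do not come from the dataset. Accuracy depends on the hard predictions at a 0.5 cut-off, while ROC-AUC depends only on how the positive-class probabilities rank the samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and positive-class probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1])
proba_pos = np.array([0.1, 0.6, 0.35, 0.8])

# Hard predictions at the default 0.5 threshold: [0, 1, 0, 1].
y_pred = (proba_pos >= 0.5).astype(int)

toy_acc = accuracy_score(y_true, y_pred)    # 2 of 4 correct -> 0.5
toy_auc = roc_auc_score(y_true, proba_pos)  # 3 of 4 positive/negative pairs ranked correctly -> 0.75

print(toy_acc, toy_auc)
```

The two numbers differ because one misranked pair costs ROC-AUC a quarter, while the fixed threshold misclassifies two samples.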

Feature Selection using Random Forest Importances      
Identified the top 3 features using the impurity-based feature importances of the Random Forest trained above. These features are used to build a simplified model for comparison.

In [16]:
importances = rf.feature_importances_
top_indices = np.argsort(importances)[-3:][::-1]
top_features = X.columns[top_indices].tolist()
print("Top 3 Features:", top_features)

Top 3 Features: ['BurnoutLevel', 'WorkLifeBalanceScore', 'StressLevel']
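
`mutual_info_classif` is imported earlier but not used in the notebook. As a possible cross-check on the impurity-based ranking, the sketch below ranks features by mutual information with the target. It runs on synthetic data, and the column names `signal`, `weak_signal`, and `noise` are hypothetical; on the real data one would pass `X_train` and `y_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic features (assumption): one strong driver, one weak driver, one pure noise column.
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "weak_signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = (demo["signal"] + 0.3 * demo["weak_signal"] > 0).astype(int)

# Estimate mutual information between each feature and the target.
mi = mutual_info_classif(demo, target, random_state=42)
mi_ranking = pd.Series(mi, index=demo.columns).sort_values(ascending=False)
print(mi_ranking)
```

If both rankings agree on the top features, that gives some extra confidence that the selection is not an artifact of how the forest splits.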


Training Models on the Top 3 Features     
Trained Decision Tree, Random Forest, and k-NN models using only the top 3 important features identified earlier. This helps compare the performance of a simplified model with the original full-feature model.

In [17]:
X_top = X[top_features]
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_top, y, test_size=0.2, random_state=42)

scaler2 = StandardScaler()
X_train2_scaled = scaler2.fit_transform(X_train2)
X_test2_scaled = scaler2.transform(X_test2)

dt2 = DecisionTreeClassifier(random_state=42)
dt2.fit(X_train2, y_train2)
dt2_acc, dt2_roc = evaluate(dt2, X_test2, y_test2)

rf2 = RandomForestClassifier(random_state=42)
rf2.fit(X_train2, y_train2)
rf2_acc, rf2_roc = evaluate(rf2, X_test2, y_test2)

knn2 = KNeighborsClassifier()
knn2.fit(X_train2_scaled, y_train2)
knn2_acc, knn2_roc = evaluate(knn2, X_test2_scaled, y_test2)
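
Distance-based models such as k-NN tend to be sensitive to uninformative features, because every column contributes equally to the distance after scaling. The sketch below illustrates this on synthetic data only: three informative columns plus forty noise columns, all of them assumptions made up for the demonstration. The helper `knn_accuracy` is hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the label depends only on the 3 informative columns.
rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=(n, 3))
noise = rng.normal(size=(n, 40))
labels = (informative.sum(axis=1) > 0).astype(int)

def knn_accuracy(features):
    """Split, scale on the training split, fit k-NN, and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )
    scaler = StandardScaler().fit(X_tr)
    model = KNeighborsClassifier().fit(scaler.transform(X_tr), y_tr)
    return model.score(scaler.transform(X_te), y_te)

acc_all = knn_accuracy(np.hstack([informative, noise]))  # informative + noise columns
acc_informative = knn_accuracy(informative)              # informative columns only

print(f"all columns: {acc_all:.3f}, informative only: {acc_informative:.3f}")
```

Tree-based models are usually less affected, since each split looks at one feature at a time and can ignore columns that do not reduce impurity.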


Before vs After Comparison     
Compared model performance using all features vs using only the top 3 important features. Metrics like Accuracy and ROC-AUC were evaluated for Decision Tree, Random Forest, and k-NN classifiers.

In [18]:
results = pd.DataFrame({
    "Model": ["Decision Tree", "Decision Tree (Top 3)",
              "Random Forest", "Random Forest (Top 3)",
              "k-NN", "k-NN (Top 3)"],
    "Accuracy": [dt_acc, dt2_acc, rf_acc, rf2_acc, knn_acc, knn2_acc],
    "ROC-AUC": [dt_roc, dt2_roc, rf_roc, rf2_roc, knn_roc, knn2_roc]
})

print(results)


                   Model  Accuracy   ROC-AUC
0          Decision Tree     1.000  1.000000
1  Decision Tree (Top 3)     1.000  1.000000
2          Random Forest     1.000  1.000000
3  Random Forest (Top 3)     1.000  1.000000
4                   k-NN     0.760  0.799187
5           k-NN (Top 3)     0.975  0.998242
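
Perfect scores for both tree models, together with `BurnoutLevel` ranking first, may indicate that `BurnoutRisk` is derived directly from `BurnoutLevel`. The source does not confirm this. One way to probe it is to fit a depth-1 tree on that single feature and see whether it already separates the classes. The sketch below uses synthetic data with a hypothetical threshold of 7; on the real data one would use `X_train[['BurnoutLevel']]` and `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): the target is a pure threshold on one feature.
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "BurnoutLevel": rng.uniform(0, 10, size=n),
    "StressLevel": rng.uniform(0, 10, size=n),
})
demo["BurnoutRisk"] = (demo["BurnoutLevel"] > 7).astype(int)  # hypothetical rule

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["BurnoutLevel"]], demo["BurnoutRisk"], test_size=0.2, random_state=42
)

# A single split (decision stump) on one feature.
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_tr, y_tr)
stump_acc = stump.score(X_te, y_te)
print(f"Stump accuracy using BurnoutLevel alone: {stump_acc:.3f}")
```

If a stump on one column scores close to 1.0 on the real data, the target is likely a function of that column, and the comparison is more informative with that column removed from the features.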


Summary:     
Encoded categorical features using ordinal and one-hot encoding

Split the dataset and applied feature scaling

Trained Decision Tree, Random Forest, and k-NN models on the full feature set

Identified top 3 important features using Random Forest feature importances

Trained the same models using only the top 3 features

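Compared performance (Accuracy and ROC-AUC) of the full models vs the top-3-feature models: both tree models scored 1.000 in either setting, while k-NN improved from 0.760 to 0.975 accuracy (ROC-AUC 0.799 to 0.998)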