# 02 – Model Development
### Student Depression Prediction: Building and Evaluating the XGBoost Classifier

---

## 📌 Objective  
Develop an end-to-end XGBoost classification pipeline to predict student depression using the preprocessed dataset.

**Key Tasks:**

 **Model Pipeline**  
- Train-test split  
- Categorical encoding
- Feature scaling/selection  

 **XGBoost Implementation**  
- Baseline model  
- Hyperparameter tuning  
- Cross-validation  

 **Model Evaluation**  
- Performance metrics (accuracy, precision, recall, F1, ROC-AUC)  
- Feature importance analysis  

---

### 📂 Input  
- `clean_data.csv` saved in `Data/processed/FC110552_mithula-cbw/` (preprocessed dataset)


### 📦 Output  
- `xgb_model.pkl` saved in `models/FC110552_mithula-cbw/`

---

### 📈 Expected Outcomes
- Trained and validated XGBoost classification model for student depression prediction
- Performance evaluation metrics (accuracy, precision, recall, F1-score, ROC-AUC)
- Insights from feature importance analysis
- Model artifacts saved for future inference and deployment


## Step 1: Import Libraries & Load Data

In [18]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from prettytable import PrettyTable

# Scikit-learn modules for model selection and evaluation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix
)

# XGBoost classifier
import xgboost as xgb


### Load preprocessed data

In [19]:
# Load preprocessed data
df = pd.read_csv('Data/processed/FC110552_mithula-cbw/clean_data.csv')

print("\n🔹 DataFrame Dimensions")
print("------------------------")
print(f"   Rows   : {df.shape[0]}")
print(f"   Columns: {df.shape[1]}")


🔹 DataFrame Dimensions
------------------------
   Rows   : 27868
   Columns: 13


In [25]:
# Print the first few rows of the dataset
print("\n🔹 First 5 rows:")
df.head()


🔹 First 5 rows:


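|   | gender | age | academic_pressure | cgpa | study_satisfaction | dietary_habits | have_you_ever_had_suicidal_thoughts | work_study_hours | financial_stress | family_history_of_mental_illness | depression | degree_encoded | sleep_duration_encoded |
|---|--------|-----|-------------------|------|--------------------|----------------|-------------------------------------|------------------|------------------|----------------------------------|------------|----------------|------------------------|
| 0 | Male | 33.0 | 5.0 | 8.97 | 2.0 | Healthy | Yes | 3.0 | 1.0 | No | 1 | 2 | 1 |
| 1 | Female | 24.0 | 2.0 | 5.9 | 5.0 | Moderate | No | 3.0 | 2.0 | Yes | 0 | 2 | 1 |
| 2 | Male | 31.0 | 3.0 | 7.03 | 5.0 | Healthy | No | 9.0 | 1.0 | Yes | 0 | 2 | 0 |
| 3 | Female | 28.0 | 3.0 | 5.59 | 2.0 | Moderate | Yes | 4.0 | 5.0 | Yes | 1 | 2 | 2 |
| 4 | Female | 25.0 | 4.0 | 8.13 | 3.0 | Moderate | Yes | 1.0 | 1.0 | No | 0 | 3 | 1 |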


In [20]:
# Print data structure and datatypes of each column
table = PrettyTable()
table.field_names = ["Column", "Non-Null Count", "Dtype"]

for col in df.columns:
    non_null_count = df[col].count()
    dtype = df[col].dtype
    table.add_row([col, non_null_count, dtype])

print("\n🔹 Dataset Summary:")
print(table)


🔹 Dataset Summary:
+-------------------------------------+----------------+---------+
|                Column               | Non-Null Count |  Dtype  |
+-------------------------------------+----------------+---------+
|                gender               |     27868      |  object |
|                 age                 |     27868      | float64 |
|          academic_pressure          |     27868      | float64 |
|                 cgpa                |     27868      | float64 |
|          study_satisfaction         |     27868      | float64 |
|            dietary_habits           |     27868      |  object |
| have_you_ever_had_suicidal_thoughts |     27868      |  object |
|           work_study_hours          |     27868      | float64 |
|           financial_stress          |     27868      | float64 |
|   family_history_of_mental_illness  |     27868      |  object |
|              depression             |     27868      |  int64  |
|            degree_encoded           |   

💡 **Observations:**  
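- There is no `id` column, so no identifier column needs to be dropped.
- Four columns have dtype `object` (`gender`, `dietary_habits`, `have_you_ever_had_suicidal_thoughts`, `family_history_of_mental_illness`). They are encoded after the train-test split so that the encoder is fitted on the training data only.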
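
The point about encoding after the split can be sketched on toy data. This is a minimal illustration, not the notebook's actual encoding step: the two small frames, the `Unhealthy` category, and the choice of `OrdinalEncoder` are assumptions made for the example. The encoder is fitted on the training rows only, and any category that appears only in the test rows is mapped to `-1`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for X_train / X_test (illustrative values only)
toy_train = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "dietary_habits": ["Healthy", "Moderate", "Healthy", "Moderate"],
})
toy_test = pd.DataFrame({
    "gender": ["Female", "Male"],
    "dietary_habits": ["Moderate", "Unhealthy"],  # "Unhealthy" is absent from training
})

cat_cols = ["gender", "dietary_habits"]

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = toy_train.copy()
test_enc = toy_test.copy()

# Fit on the training rows only, then reuse the fitted mapping for the test rows
train_enc[cat_cols] = encoder.fit_transform(toy_train[cat_cols])
test_enc[cat_cols] = encoder.transform(toy_test[cat_cols])

print(train_enc)
print(test_enc)
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step. Tree-based models such as XGBoost can work with integer codes like these, though one-hot encoding is an equally reasonable choice for columns with few categories.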

## Step 2: Train-Test Split

In [21]:
# Separate features and target
X = df.drop(['depression'], axis=1)  # Remove target column
y = df['depression']

> 🧠 We split the dataframe with stratified sampling so that the training and test sets keep the same class proportions as the full dataset.

> 🧠 We use an 80-20 split: 80% of the data for training and 20% for testing.


In [None]:
# Stratified split to maintain class balance (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Print sizes of train and test sets
print(f"Training set size: {X_train.shape}")
print(f"Test set size:     {X_test.shape}")

# Print depression rates in each split to verify stratification
print(f"Training set depression rate: {y_train.mean():.3f}")
print(f"Test set depression rate:     {y_test.mean():.3f}")

Training set size: (22294, 12)
Test set size:     (5574, 12)
Training set depression rate: 0.586
Test set depression rate:     0.586
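
To see what `stratify` contributes, the sketch below compares a stratified split with a plain random split on synthetic labels. The label array (600 positives out of 1000) and the variable names are assumptions made for illustration and are not drawn from the dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 600 positives and 400 negatives (illustrative only)
y_toy = np.array([1] * 600 + [0] * 400)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Plain random split: proportions can drift between the two parts
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(f"Stratified -> train: {y_tr_s.mean():.3f}, test: {y_te_s.mean():.3f}")
print(f"Random     -> train: {y_tr_r.mean():.3f}, test: {y_te_r.mean():.3f}")
```

With stratification, the positive rate in both parts stays at the overall rate of 0.6 (up to rounding), whereas the random split only matches it approximately. The gap tends to be small for a dataset of this size and grows as the data gets smaller or more imbalanced.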
