Model Development (Logistic Regression Only)

This notebook trains, tunes, and evaluates a Logistic Regression model on the final features in 'final_features_student_depression.csv'. No other model families are considered.

Steps:
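- Split data into train/validation/test sets (70/15/15).
- Evaluate an untuned Logistic Regression as the baseline model.
- Use 5-fold cross-validation on the training set, scored with F1-macro, for robust performance estimates.
- Perform hyperparameter tuning using GridSearchCV, then check the tuned model on the validation set.
- Evaluate the tuned model once on the held-out test set and save it with joblib.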

The goal is to optimize and evaluate a Logistic Regression model for predicting 'Depression' and save it for deployment.

In [10]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression

import joblib
import warnings
# Silence all warnings to keep cell output tidy (this also hides convergence warnings)
warnings.filterwarnings('ignore')

In [11]:
# Load the final features dataset and split into train/validation/test

# Load the preprocessed dataset from Feature_Engineering.ipynb
csv_path = 'final_features_student_depression.csv'
df_final = pd.read_csv(csv_path)

# Prepare features (X) and target (y)
X = df_final.drop(columns=['Depression', 'id'])  # Drop target and irrelevant 'id'
y = df_final['Depression']

# Split into train (70%), validation (15%), test (15%)
# Note: neither split passes stratify=y, so class proportions can differ slightly between sets
# First, split into train+val (85%) and test (15%)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# Then, split train+val into train (70%) and validation (15%)
# 0.1765 ≈ 15/85 to get 15% of original data as validation
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.1765, random_state=42)

# Verify shapes
print(f'Train shape: {X_train.shape}, Val shape: {X_val.shape}, Test shape: {X_test.shape}')
# Expected: Train ~19,529 rows, Val ~4,186 rows, Test ~4,186 rows (10 features each)

Train shape: (19529, 10), Val shape: (4186, 10), Test shape: (4186, 10)
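The two-stage split can be checked with a little arithmetic. The sketch below is not part of the original notebook; it only reuses the fractions passed to train_test_split and the row counts printed above, and shows that taking 0.1765 of the remaining 85% lands very close to 15% of the full dataset.

```python
# Fractions taken from the two train_test_split calls in the cell above.
test_frac = 0.15
val_frac_of_remainder = 0.1765  # approximates 15/85

# After removing the test set, 85% of the data remains.
remainder = 1.0 - test_frac
val_frac = remainder * val_frac_of_remainder
train_frac = remainder - val_frac

# Total row count, summed from the printed shapes.
n_rows = 19529 + 4186 + 4186

print(f'train={train_frac:.4f}, val={val_frac:.4f}, test={test_frac:.4f}')
print(f'total rows: {n_rows}')
print(f'observed train share: {19529 / n_rows:.4f}')
```

The small deviation from an exact 70/15/15 comes from rounding 15/85 to four decimals and from rounding to whole rows.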


In [12]:
# Baseline Logistic Regression with Cross-Validation
# Evaluate Logistic Regression using 5-fold CV to get robust performance estimates

# Define K-Fold CV (5 folds, shuffled)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate a model with K-fold CV and return the mean and standard deviation of its F1-macro scores
def evaluate_model(model, X, y, model_name):
    cv_scores = cross_val_score(model, X, y, cv=kf, scoring='f1_macro')
    print(f'{model_name} CV F1-macro: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}')
    return cv_scores.mean(), cv_scores.std()

# Baseline: Logistic Regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg_mean, logreg_std = evaluate_model(logreg, X_train, y_train, 'Logistic Regression')

Logistic Regression CV F1-macro: 0.8393 ± 0.0040
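The baseline above uses a plain shuffled KFold, which does not guarantee that each fold keeps the overall class ratio. As a hedged alternative, the sketch below uses StratifiedKFold instead. It is an illustration only: X_demo and y_demo are synthetic stand-ins generated with make_classification (they are not the notebook's data), so the score it prints is not comparable to the result above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 10 features, roughly 40/60 class split.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.4, 0.6], random_state=42
)

# StratifiedKFold preserves the class proportions of y_demo in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    LogisticRegression(max_iter=1000, random_state=42),
    X_demo, y_demo, cv=skf, scoring='f1_macro'
)
print(f'Stratified CV F1-macro: {scores.mean():.4f} ± {scores.std():.4f}')
```

To try this in the notebook, one would pass skf in place of kf in evaluate_model and GridSearchCV. With classes as close to balanced as they are here, the difference is likely to be small.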


In [13]:
# Hyperparameter Tuning for Logistic Regression
# Tune Logistic Regression using GridSearchCV

# Define parameter grid for Logistic Regression
logreg_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],   # Inverse of regularization strength
    'penalty': ['l1', 'l2'],        # Regularization type
    'solver': ['liblinear']         # Solver compatible with both 'l1' and 'l2' (binary classification)
}


# Perform GridSearchCV
logreg_grid = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid=logreg_param_grid,
    cv=kf,                          # Reuse the shuffled 5-fold KFold defined earlier (not stratified)
    scoring='f1_macro',             # Macro F1 weights both classes equally, regardless of support
    n_jobs=-1                       # Use all cores
)
logreg_grid.fit(X_train, y_train)

# Print best parameters and score
print(f'Best Logistic Regression Params: {logreg_grid.best_params_}')
print(f'Best Logistic Regression CV F1: {logreg_grid.best_score_:.4f}')

# Select best model
best_model = logreg_grid.best_estimator_

# Evaluate on validation set
y_val_pred = best_model.predict(X_val)
print('Validation Classification Report:\n', classification_report(y_val, y_val_pred))

# Optional: ROC-AUC for binary classification
if len(np.unique(y_val)) == 2:
    y_val_proba = best_model.predict_proba(X_val)[:, 1]
    print("Validation ROC-AUC:", roc_auc_score(y_val, y_val_proba))

Best Logistic Regression Params: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
Best Logistic Regression CV F1: 0.8394
Validation Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.79      0.82      1708
           1       0.86      0.90      0.88      2478

    accuracy                           0.85      4186
   macro avg       0.85      0.84      0.85      4186
weighted avg       0.85      0.85      0.85      4186

Validation ROC-AUC: 0.9250425997017313
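To get a sense of how much work the grid search does, the sketch below enumerates the candidates with scikit-learn's ParameterGrid. The grid is copied from the cell above, and the fold count of 5 is taken from the KFold definition. This is only a counting aid, not part of the original notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above.
logreg_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
}

# ParameterGrid expands the dictionary into every parameter combination.
candidates = list(ParameterGrid(logreg_param_grid))
n_folds = 5  # matches KFold(n_splits=5, ...)
n_cv_fits = len(candidates) * n_folds

print(f'{len(candidates)} candidates x {n_folds} folds = {n_cv_fits} CV fits')
```

Because GridSearchCV uses refit=True by default, it performs one additional fit of the best candidate on all of X_train, and that refitted model is what best_estimator_ returns.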


In [14]:
# Final Evaluation on Test Set
# Use the best Logistic Regression model on the held-out test set
y_test_pred = best_model.predict(X_test)
print('Test Classification Report:\n', classification_report(y_test, y_test_pred))
print('Test Confusion Matrix:\n', confusion_matrix(y_test, y_test_pred))
print('Test Accuracy:', accuracy_score(y_test, y_test_pred))

# Save the best model for deployment
joblib.dump(best_model, 'best_logreg_depression_model.pkl')
print('Best model saved as best_logreg_depression_model.pkl')

Test Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.79      0.80      1749
           1       0.85      0.87      0.86      2437

    accuracy                           0.84      4186
   macro avg       0.83      0.83      0.83      4186
weighted avg       0.84      0.84      0.84      4186

Test Confusion Matrix:
 [[1374  375]
 [ 307 2130]]
Test Accuracy: 0.8370759675107501
Best model saved as best_logreg_depression_model.pkl
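The test metrics can be reproduced by hand from the confusion matrix. The sketch below copies the four counts from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels. It is an arithmetic check, not part of the original notebook.

```python
# Confusion matrix copied from the test output above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = [[1374, 375],
      [307, 2130]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of predicted class 1, how many were correct
recall_1 = tp / (tp + fn)      # of true class 1, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = (f1(precision_0, recall_0) + f1(precision_1, recall_1)) / 2

print(f'accuracy={accuracy:.4f}')
print(f'class 0: precision={precision_0:.2f}, recall={recall_0:.2f}')
print(f'class 1: precision={precision_1:.2f}, recall={recall_1:.2f}')
print(f'macro F1={macro_f1:.2f}')
```

The rounded values agree with the classification report: 375 students without depression were flagged as depressed (false positives), and 307 students with depression were missed (false negatives).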
