Model Training

We will:
- Load the preprocessed data
- Train a baseline model (Logistic Regression)
- Train a more powerful model (Random Forest)
- Evaluate and compare their performance

In [1]:
import numpy as np

from sklearn.linear_model import LogisticRegression   # simple, interpretable baseline classifier
from sklearn.ensemble import RandomForestClassifier   # stronger, nonlinear model
# classification_report → precision, recall, F1,
# confusion_matrix → error breakdown &
# roc_auc_score → overall ranking performance
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

In [6]:
# Load the preprocessed data
X_train = np.load("../data/preprocessed/X_train.npy")
X_test = np.load("../data/preprocessed/X_test.npy")
y_train = np.load("../data/preprocessed/y_train.npy")
y_test = np.load("../data/preprocessed/y_test.npy")

print("Training set shape: ", X_train.shape)
print("Test set shape: ", X_test.shape)

Training set shape:  (316, 56)
Test set shape:  (79, 56)


Baseline Model: Logistic Regression
- It is simple and fast to train
- It provides a strong linear baseline for classification
- It is easy to interpret
- It is commonly used as a reference model in applied machine learning!

In [7]:
# Create and train the logistic regression model

# max_iter=1000 → ensures convergence
# .fit() → trains the model
# .predict() → predicted class labels (0/1)
# .predict_proba() → predicted probabilities (used for ROC-AUC)

log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

# Predict on Test Set
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:,1]

In [9]:
# Evaluation of Logistic Regression

# Precision → How many predicted at-risk are actually at-risk
# Recall → How many real at-risk students we caught (VERY IMPORTANT)
# F1-score → Balance of both
# Confusion matrix → Exact error counts
# ROC-AUC → Overall ranking quality

print("Logistic Regression - Classification Report")
print(classification_report(y_test, y_pred_lr))

print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred_lr))

print("ROC-AUC Score", roc_auc_score(y_test, y_pred_lr))

Logistic Regression - Classification Report
              precision    recall  f1-score   support

           0       0.72      0.89      0.80        53
           1       0.57      0.31      0.40        26

    accuracy                           0.70        79
   macro avg       0.65      0.60      0.60        79
weighted avg       0.67      0.70      0.67        79

Confusion Matrix
[[47  6]
 [18  8]]
ROC-AUC Score 0.5972423802612482


A better and powerful model: Random Forest
Logistic Regression is a linear model. However, student performance may depend on complex, non-linear interactions between variables.

Therefore, we also train a Random Forest classifier, which:
- Can model non-linear relationships
- Is robust to noise
- Often performs very well on tabular data.


In [10]:
# Create and train the Random Forest model

# n_estimators=200 → number of trees
# class_weight="balanced" → helps with class imbalance
# Random Forest outputs both:
# Class predictions
# Probabilities

rf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight="balanced")
rf.fit(X_train, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:,1]

In [11]:
# Evaluation of Random Forest

print("Random Forest - Classification Report")
print(classification_report(y_test, y_pred_rf))

print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred_rf))

print("ROC-AUC Score", roc_auc_score(y_test, y_proba_rf))

Random Forest - Classification Report
              precision    recall  f1-score   support

           0       0.70      0.92      0.80        53
           1       0.56      0.19      0.29        26

    accuracy                           0.68        79
   macro avg       0.63      0.56      0.54        79
weighted avg       0.65      0.68      0.63        79

Confusion Matrix
[[49  4]
 [21  5]]
ROC-AUC Score 0.6868650217706821


Therefore, Interpretation:

1. Logistic Regression:

- Is very good at identifying safe students
- Is bad at catching at-risk students (Out of 26 it only caught 8 students at risk - Not so GREAT!)
- ROC-AUC ≈ 0.60 → only slightly better than random

2. Random Forest:

- Has better ROC-AUC (0.687 vs 0.597) → better ranking
But:
- With default threshold, it is even worse at catching at-risk students (catched only 5 students at risk out of 21)
- It is extremely conservative.

Hence,

**Logistic Regression** is currently better because it catches more at-risk students
Even though **Random Forest** is better at ranking, it is: Too conservative at the default threshold.