# Baseline Models: Logistic Regression & Random Forest

## 🎯 Concept Primer
Baselines sanity-check your preprocessing and provide a performance floor. If a neural net doesn't beat Logistic Regression, investigate why.

**Model 1:** Logistic Regression (linear, interpretable)  
**Model 2:** Random Forest (non-linear, feature importance)

Expected: Train both models, evaluate on validation set, compare metrics.

## 📋 Objectives
1. Train Logistic Regression with class weights
2. Train Random Forest with fixed hyperparameters and class weights
3. Evaluate both on validation set
4. Compare metrics (Accuracy, Precision, Recall, F1, ROC-AUC)
5. Visualize confusion matrices

## ✅ Acceptance Criteria
- [ ] Logistic Regression trained and evaluated
- [ ] Random Forest trained and evaluated
- [ ] Metrics table comparing both models
- [ ] Confusion matrices plotted
- [ ] Best baseline identified

## 🔧 Setup

In [26]:
# TODO 1: Import libraries
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Raw dataset with normalised column names; the models below use the preprocessed splits instead.
df = pd.read_csv("../../../datasets/diabetes_BRFSS2015.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
numeric_cols = ['bmi', 'genhlth', 'menthlth', 'physhlth']

In [27]:

import pickle
with open('../preprocessed_data/preprocessed_train_test_val.pkl', 'rb') as f:  # 'rb' = read in binary mode
    data_dict = pickle.load(f)

X_train = data_dict['X_train']
X_val = data_dict['X_val']
X_test = data_dict['X_test']
y_train = data_dict['y_train']
y_val = data_dict['y_val']
y_test = data_dict['y_test']
class_weights = data_dict['class_weights']

## 📊 Logistic Regression Baseline

### TODO 2: Train Logistic Regression

**Parameters:** Use class_weight='balanced' to handle imbalance  
**Expected:** Fit on X_train, y_train; predict on X_val
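
To make the effect of `class_weight='balanced'` concrete, here is a minimal sketch of the heuristic scikit-learn documents for that setting, `n_samples / (n_classes * np.bincount(y))`. The label vector `y_demo` and the helper `balanced_class_weights` are illustrative assumptions: the labels are synthetic and reproduce the validation-set class counts reported later in this notebook (32056, 694, 5302), so the weights only approximate those of the training split.

```python
import numpy as np

def balanced_class_weights(y):
    """Return one weight per class: n_samples / (n_classes * class_count)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Synthetic labels with the validation-set class counts (illustration only;
# the real model derives its weights from y_train).
y_demo = np.repeat([0, 1, 2], [32056, 694, 5302])
weights = balanced_class_weights(y_demo)

for cls, w in enumerate(weights):
    print(f"class {cls}: weight {w:.2f}")
```

Under these assumed counts, each prediabetes sample is weighted roughly 18 times as heavily as it would be with uniform weights, while the majority class is down-weighted to below 1.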

In [28]:
# TODO 2: Train Logistic Regression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)
y_proba_lr = lr.predict_proba(X_val)
# Validation predictions must be scored against y_val (not y_test).
score_lr = accuracy_score(y_true=y_val, y_pred=y_pred_lr)
print(score_lr)  # same value as lr.score(X_val, y_val)

0.6439346157889204


## 🌲 Random Forest Baseline

### TODO 3: Train Random Forest

**Parameters:** n_estimators=100, max_depth=10, class_weight='balanced'  
**Expected:** Fit on X_train, y_train; predict on X_val

In [29]:
# TODO 3: Train Random Forest
rf = RandomForestClassifier(n_estimators=100, max_depth=10, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_val)
y_proba_rf = rf.predict_proba(X_val)
# Validation predictions must be scored against y_val (not y_test).
score_rf = accuracy_score(y_true=y_val, y_pred=y_pred_rf)
print(score_rf)  # same value as rf.score(X_val, y_val)

0.6790444654683065


## 📈 Evaluate Baselines

### TODO 4: Compute metrics for both models

**Metrics:** Accuracy, Precision, Recall, F1, ROC-AUC  
**Use:** classification_report and roc_auc_score

In [36]:
# TODO 4: Evaluate models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# The target has three classes, so precision/recall/F1 need an explicit `average`,
# and ROC-AUC needs the full probability matrix with a multi_class strategy.
metrics_lr = {
    'accuracy': accuracy_score(y_val, y_pred_lr),
    'precision_weighted': precision_score(y_val, y_pred_lr, average='weighted'),
    'recall_weighted': recall_score(y_val, y_pred_lr, average='weighted'),
    'f1_weighted': f1_score(y_val, y_pred_lr, average='weighted'),
    'f1_macro': f1_score(y_val, y_pred_lr, average='macro'),  # treats all classes equally
    'roc_auc_ovr_weighted': roc_auc_score(y_val, y_proba_lr, multi_class='ovr', average='weighted')
}

metrics_rf = {
    'accuracy': accuracy_score(y_val, y_pred_rf),
    'precision_weighted': precision_score(y_val, y_pred_rf, average='weighted'),
    'recall_weighted': recall_score(y_val, y_pred_rf, average='weighted'),
    'f1_weighted': f1_score(y_val, y_pred_rf, average='weighted'),
    'f1_macro': f1_score(y_val, y_pred_rf, average='macro'),  # treats all classes equally
    'roc_auc_ovr_weighted': roc_auc_score(y_val, y_proba_rf, multi_class='ovr', average='weighted')
}

print("Logistic Regression:", metrics_lr)
print("Random Forest:", metrics_rf)

print("Confusion Matrix LR")
print(confusion_matrix(y_val, y_pred_lr))
print()
print("Confusion Matrix RF")
print(confusion_matrix(y_val, y_pred_rf))

Logistic Regression: {'accuracy': 0.6439346157889204, 'precision_weighted': 0.8521270719498015, 'recall_weighted': 0.6439346157889204, 'f1_weighted': 0.719372702594614, 'f1_macro': 0.4270835418115904, 'roc_auc_ovr_weighted': 0.8153886287972445}
Random Forest: {'accuracy': 0.6790444654683065, 'precision_weighted': 0.8431263669167273, 'recall_weighted': 0.6790444654683065, 'f1_weighted': 0.7336212036910527, 'f1_macro': 0.4288505468778819, 'roc_auc_ovr_weighted': 0.8155788248246245}
Confusion Matrix LR
[[21133  5532  5391]
 [  175   218   301]
 [  893  1257  3152]]

Confusion Matrix RF
[[22031  2732  7293]
 [  215    92   387]
 [ 1035   551  3716]]
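
The objectives ask for plotted confusion matrices, while the cell above only prints them. The following is a minimal sketch of one way to plot them, using plain matplotlib rather than the seaborn import from the setup cell. The matrices are copied from the printed output so the block runs on its own; in the notebook you could pass `confusion_matrix(y_val, y_pred_lr)` and `confusion_matrix(y_val, y_pred_rf)` instead. Row-normalising is an assumption made here so that the diagonal reads as per-class recall.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["No Diabetes", "Prediabetes", "Diabetes"]
# Validation confusion matrices copied from the output above (rows = true class).
cms = {
    "Logistic Regression": np.array([[21133, 5532, 5391], [175, 218, 301], [893, 1257, 3152]]),
    "Random Forest": np.array([[22031, 2732, 7293], [215, 92, 387], [1035, 551, 3716]]),
}

fig, axes = plt.subplots(1, 2, figsize=(11, 4.5))
for ax, (name, cm) in zip(axes, cms.items()):
    # Row-normalise: each cell becomes the share of its true class (diagonal = recall).
    cm_norm = cm / cm.sum(axis=1, keepdims=True)
    im = ax.imshow(cm_norm, cmap="Blues", vmin=0, vmax=1)
    ax.set_title(name)
    ax.set_xticks(range(3))
    ax.set_xticklabels(labels, rotation=30, ha="right")
    ax.set_yticks(range(3))
    ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    # Annotate every cell with the raw count and the row-normalised share.
    for i in range(3):
        for j in range(3):
            ax.text(j, i, f"{cm[i, j]}\n{cm_norm[i, j]:.0%}", ha="center", va="center",
                    color="white" if cm_norm[i, j] > 0.5 else "black")
fig.colorbar(im, ax=axes, fraction=0.03, label="Share of true class")
plt.show()
```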


In [37]:
import json
import os
from sklearn.metrics import classification_report

# Create results directory
os.makedirs('../baseline_results', exist_ok=True)

# 1. Prepare baseline results dictionary
baseline_results = {
    'logistic_regression': {
        'metrics': metrics_lr,
        'confusion_matrix': confusion_matrix(y_val, y_pred_lr).tolist(),
        'classification_report': classification_report(y_val, y_pred_lr, 
                                                       target_names=['No Diabetes', 'Prediabetes', 'Diabetes'],
                                                       output_dict=True)
    },
    'random_forest': {
        'metrics': metrics_rf,
        'confusion_matrix': confusion_matrix(y_val, y_pred_rf).tolist(),
        'classification_report': classification_report(y_val, y_pred_rf,
                                                       target_names=['No Diabetes', 'Prediabetes', 'Diabetes'],
                                                       output_dict=True)
    },
    'metadata': {
        'dataset': 'BRFSS 2015 Diabetes',
        'validation_set_size': len(y_val),
        'class_distribution': {
            'class_0': int((y_val == 0).sum()),
            'class_1': int((y_val == 1).sum()),
            'class_2': int((y_val == 2).sum())
        }
    }
}

# 2. Save as JSON (human-readable)
with open('../baseline_results/baseline_metrics.json', 'w') as f:
    json.dump(baseline_results, f, indent=2)

# 3. Save trained models (for potential reuse)
import joblib
joblib.dump(lr, '../baseline_results/logistic_regression_model.pkl')
joblib.dump(rf, '../baseline_results/random_forest_model.pkl')

print("✅ Baseline results saved successfully!")
print(f"   - Metrics: ../baseline_results/baseline_metrics.json")
print(f"   - LR Model: ../baseline_results/logistic_regression_model.pkl")
print(f"   - RF Model: ../baseline_results/random_forest_model.pkl")
print(f"\n📊 Summary:")
print(f"   Logistic Regression - Accuracy: {metrics_lr['accuracy']:.4f}, F1 Macro: {metrics_lr['f1_macro']:.4f}")
print(f"   Random Forest       - Accuracy: {metrics_rf['accuracy']:.4f}, F1 Macro: {metrics_rf['f1_macro']:.4f}")

✅ Baseline results saved successfully!
   - Metrics: ../baseline_results/baseline_metrics.json
   - LR Model: ../baseline_results/logistic_regression_model.pkl
   - RF Model: ../baseline_results/random_forest_model.pkl

📊 Summary:
   Logistic Regression - Accuracy: 0.6439, F1 Macro: 0.4271
   Random Forest       - Accuracy: 0.6790, F1 Macro: 0.4289
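
The acceptance criteria call for a metrics table comparing both models. A minimal sketch with pandas follows. The names `lr_vals`, `rf_vals`, and `comparison` are hypothetical, and the values are the validation metrics printed above, rounded to four decimals so the block is self-contained; in the notebook, `metrics_lr` and `metrics_rf` could be passed directly.

```python
import pandas as pd

# Validation metrics copied from the evaluation output, rounded to 4 decimals.
lr_vals = {"accuracy": 0.6439, "precision_weighted": 0.8521, "recall_weighted": 0.6439,
           "f1_weighted": 0.7194, "f1_macro": 0.4271, "roc_auc_ovr_weighted": 0.8154}
rf_vals = {"accuracy": 0.6790, "precision_weighted": 0.8431, "recall_weighted": 0.6790,
           "f1_weighted": 0.7336, "f1_macro": 0.4289, "roc_auc_ovr_weighted": 0.8156}

# One row per metric, one column per model.
comparison = pd.DataFrame({"Logistic Regression": lr_vals, "Random Forest": rf_vals})
# Name of the model with the higher value in each row.
comparison["best"] = comparison.idxmax(axis=1)
print(comparison)
```

Read this way, Random Forest leads on every metric except weighted precision, although its margins on macro F1 and ROC-AUC are very small.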


## 🤔 Reflection
1. Which baseline performs better? Why?
2. What patterns do you see in confusion matrices?
3. Are baselines good enough for your use case?
4. What should the PyTorch model beat?

**Your reflection:**

### Model Performance Comparison

**Random Forest performs slightly better overall:**
- Accuracy: 67.9% vs 64.4% (LR)
- F1 weighted: 0.734 vs 0.719 (LR)
- F1 macro (0.429 vs 0.427) and ROC-AUC (both ~0.815) are effectively tied
- Higher recall on the two larger classes (No Diabetes, Diabetes)

**Key Challenge: Prediabetes class (class 1)**
- RF: Only 13% recall (92/694 correct)
- LR: Only 31% recall (218/694 correct)
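- Both models struggle, most likely due to:
  - Extreme rarity (about 2% of the data)
  - Ambiguous features (prediabetes is an intermediate health state)
  - A class weight of 18.26, which heavily amplifies a small number of samples and may make the fit unstable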

**F1 Macro vs Weighted:**
- F1 Weighted (0.734, RF): averages per-class F1 by class support, so class 0 (about 84% of validation samples) dominates
- F1 Macro (0.429, RF): averages per-class F1 with equal weight, so the poor prediabetes score pulls it down
- The large gap between the two indicates severe minority-class problems
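
The gap between the two averages can be reproduced directly from the Random Forest confusion matrix printed earlier. This is a minimal numpy sketch (the variable names are illustrative); it uses the identity F1 = 2·TP / (true count + predicted count) per class, then averages the per-class scores two ways.

```python
import numpy as np

# RF validation confusion matrix from the evaluation output (rows = true class).
cm_rf = np.array([[22031, 2732, 7293],
                  [215, 92, 387],
                  [1035, 551, 3716]])

tp = np.diag(cm_rf)
support = cm_rf.sum(axis=1)     # true samples per class
predicted = cm_rf.sum(axis=0)   # predictions per class
f1_per_class = 2 * tp / (support + predicted)  # algebraically equal to 2PR / (P + R)

f1_macro = f1_per_class.mean()                            # every class counts equally
f1_weighted = np.average(f1_per_class, weights=support)   # classes count by support

print("per-class F1:", f1_per_class.round(3))
print(f"macro: {f1_macro:.3f}, weighted: {f1_weighted:.3f}")
```

The per-class scores show where the macro average loses ground: the prediabetes F1 is far below the other two, but it carries only 694 of 38,052 samples in the weighted average.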

**ROC-AUC:**
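- Both models ~0.815, so they rank cases about equally well
- Similar AUC alone does not show that the models rely on the same features; comparing LR coefficients with RF feature importances would be needed for that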
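
For context on what `roc_auc_score(..., multi_class='ovr', average='weighted')` reports, here is a minimal sketch on synthetic data (not the BRFSS dataset; the data-generation settings are arbitrary assumptions). It checks that the built-in value equals the support-weighted mean of one-vs-rest binary AUCs, which means the majority class dominates this number as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced 3-class problem used only for illustration.
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.8, 0.05, 0.15], random_state=0)
# Scored on the training data for brevity; this is not a performance estimate.
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# One binary AUC per class: "class c vs the rest", scored with probability column c.
per_class_auc = np.array([roc_auc_score(y == c, proba[:, c]) for c in range(3)])
support = np.bincount(y)

manual = np.average(per_class_auc, weights=support)
builtin = roc_auc_score(y, proba, multi_class="ovr", average="weighted")
print("per-class AUC:", per_class_auc.round(3))
print(f"manual: {manual:.6f}, builtin: {builtin:.6f}")
```

Reporting the per-class AUCs alongside the weighted figure would show whether the prediabetes class is ranked as well as the ~0.815 aggregate suggests.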

**Confusion Matrix Patterns:**
- Both models are fairly good at "No Diabetes" (66-69% recall)
- Both are reasonable at "Diabetes" (59-70% recall)
- Both fail at "Prediabetes" (13-31% recall)
- RF makes fewer off-diagonal errors overall (12,213 vs 13,549), but it mislabels more "No Diabetes" cases as "Diabetes" (7,293 vs 5,391)
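
The recall figures quoted above come straight from the confusion matrices: each diagonal entry divided by its row sum. A minimal numpy sketch follows, with the matrices copied from the printed output and illustrative variable names; precision (diagonal divided by column sum) is included as an extra view.

```python
import numpy as np

labels = ["No Diabetes", "Prediabetes", "Diabetes"]
# Validation confusion matrices from the evaluation output (rows = true class).
cms = {
    "Logistic Regression": np.array([[21133, 5532, 5391], [175, 218, 301], [893, 1257, 3152]]),
    "Random Forest": np.array([[22031, 2732, 7293], [215, 92, 387], [1035, 551, 3716]]),
}

recalls, precisions = {}, {}
for name, cm in cms.items():
    recalls[name] = np.diag(cm) / cm.sum(axis=1)     # correct / all true samples of the class
    precisions[name] = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of the class
    print(name)
    for label, r, p in zip(labels, recalls[name], precisions[name]):
        print(f"  {label:<12} recall {r:.1%}  precision {p:.1%}")
```

The precision column adds a caveat to LR's higher prediabetes recall: it predicts that class far more often than it occurs, so most of those predictions are wrong.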

**Realistic Goal for PyTorch:**
- Target accuracy: 70-75% (not 85% - unrealistic given data quality and severe imbalance)
- Focus on improving F1 Macro to 0.50-0.60
- Try to keep prediabetes recall at or above the 31% that LR already reaches (RF: 13%)
- Use focal loss or custom class weights to handle extreme imbalance
- Neural networks may find non-linear patterns missed by linear/tree models
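
As a reference point for the focal-loss idea, here is a minimal numpy sketch of the multi-class form, `-alpha_y * (1 - p_y)**gamma * log(p_y)`. It is an illustration under assumptions, not the implementation the PyTorch notebook uses: the function name `focal_loss`, the toy probabilities, and `gamma=2.0` are all chosen for the example, and a PyTorch version would typically work on logits.

```python
import numpy as np

def focal_loss(proba, y, gamma=2.0, alpha=None):
    """Mean multi-class focal loss over samples.

    proba: (n_samples, n_classes) predicted probabilities.
    y:     (n_samples,) integer class labels.
    alpha: optional per-class weights, indexed by class label.
    """
    p_true = np.clip(proba[np.arange(len(y)), y], 1e-12, 1.0)  # probability of the true class
    loss = -((1.0 - p_true) ** gamma) * np.log(p_true)         # down-weights easy samples
    if alpha is not None:
        loss = np.asarray(alpha)[y] * loss
    return loss.mean()

proba = np.array([[0.90, 0.05, 0.05],   # easy sample, true class 0
                  [0.60, 0.30, 0.10]])  # hard sample, true class 1
y = np.array([0, 1])

ce = focal_loss(proba, y, gamma=0.0)  # gamma = 0 reduces to plain cross-entropy
fl = focal_loss(proba, y, gamma=2.0)
print(f"cross-entropy: {ce:.4f}, focal (gamma=2): {fl:.4f}")
```

With `gamma > 0`, the confidently correct sample contributes almost nothing, so the loss concentrates on hard cases such as misclassified prediabetes samples; `alpha` can carry class weights on top of that.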

**Overall Assessment:**
Random Forest is the better baseline, with accuracy 3.5 percentage points higher and better recall on the two larger classes, although macro F1 and ROC-AUC are nearly identical for both models. Both baselines expose the fundamental challenge of this dataset: predicting prediabetes from only about 2% of the samples is extremely difficult. The PyTorch model should aim for incremental improvements rather than dramatic gains.

## 📌 Summary
✅ Baselines trained and evaluated  
✅ Metrics compared  
✅ Ready for PyTorch model

**Next:** `07_pytorch_ffn_build_train.ipynb`