# Assignment 5 — Obesity Risk: Model Comparison

This notebook loads cross‑validation results, recreates summary plots, and documents Kaggle scores for **four classifiers**:
- Model 1 — Multinomial Logistic Regression
- Model 2 — Linear Discriminant Analysis (shrinkage)
- Model 3 — Naïve Bayes (Gaussian)
- Model 4 — Linear SVM (One‑vs‑Rest)

**Files expected in the same folder**:
- `CV_Summary_Models_1_to_4.csv`
- `cv_accuracy_bar.png`, `cv_macroF1_bar.png` (or the `_ggplot.png` versions)
- Four submission CSVs (optional)


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

BASE = Path('.')
cv_path = BASE / 'CV_Summary_Models_1_to_4.csv'
assert cv_path.exists(), f"CV file not found: {cv_path}"
cv = pd.read_csv(cv_path)
cv
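
The plotting cells further down read the `model`, `accuracy_mean`, and `macro_f1_mean` columns, so it can help to confirm they are present right after loading. The following is a minimal sketch: the helper name `missing_columns` is hypothetical (not part of the original notebook), and the demo frame uses placeholder values rather than real results.

```python
import pandas as pd

# Columns that the plotting cells read from the CV table.
REQUIRED_COLUMNS = ("model", "accuracy_mean", "macro_f1_mean")


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names absent from df (hypothetical helper)."""
    return [name for name in required if name not in df.columns]


# Placeholder frame for illustration only; these are not real CV results.
demo = pd.DataFrame({"model": ["A", "B"], "accuracy_mean": [0.5, 0.6]})
print(missing_columns(demo))
```

In the notebook itself, `missing_columns(cv)` would be expected to return an empty list before plotting.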

## Recreate plots from the CV table
The plots below are generated directly from the CSV using only pandas and matplotlib.

In [None]:
# Accuracy bar
plt.figure(figsize=(8,4))
plt.barh(cv['model'], cv['accuracy_mean'])
plt.xlabel('Accuracy (mean)')
plt.title('CV Accuracy (3-fold stratified)')
plt.xlim(0,1)
plt.tight_layout()
plt.show()

# Macro-F1 bar
plt.figure(figsize=(8,4))
plt.barh(cv['model'], cv['macro_f1_mean'])
plt.xlabel('Macro-F1 (mean)')
plt.title('CV Macro-F1 (3-fold stratified)')
plt.xlim(0,1)
plt.tight_layout()
plt.show()

## Display saved PNGs (if present)
If saved plots such as `cv_accuracy_bar.png` or `cv_accuracy_bar_ggplot.png` are present in the folder, they are displayed below; missing files are reported by name.

In [None]:
from IPython.display import display, Image
for fname in ['cv_accuracy_bar.png','cv_accuracy_bar_ggplot.png','cv_macroF1_bar.png','cv_macroF1_bar_ggplot.png']:
    p = BASE / fname
    if p.exists():
        display(Image(filename=str(p)))
    else:
        print(f"(missing) {fname}")

## Kaggle results (late submission)
- Model 1 — Multinomial Logistic Regression: **Private 0.86054**, Public 0.86596  
- Model 2 — Linear Discriminant Analysis (shrinkage): Private 0.81286, Public 0.81466  
- Model 4 — Linear SVM (One‑vs‑Rest): Private 0.75108, Public 0.75000  
- Model 3 — Naïve Bayes (Gaussian): Private 0.58535, Public 0.58887  

Models are listed in descending order of private score. Public and private scores differ by less than 0.006 for every model, which suggests little overfitting to the public leaderboard. Model 1 is the selected final submission.
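
The public/private gap can be checked directly from the scores listed above. This is a small sketch that re-enters those scores by hand; the short model labels and the `abs_gap` column name are chosen here for illustration.

```python
import pandas as pd

# Scores copied from the Kaggle results list above (labels abbreviated).
scores = pd.DataFrame(
    {
        "model": ["Model 1 (LR)", "Model 2 (LDA)", "Model 4 (SVM)", "Model 3 (NB)"],
        "private": [0.86054, 0.81286, 0.75108, 0.58535],
        "public": [0.86596, 0.81466, 0.75000, 0.58887],
    }
)

# Absolute public-minus-private difference, rounded to the leaderboard precision.
scores["abs_gap"] = (scores["public"] - scores["private"]).abs().round(5)
max_gap = scores["abs_gap"].max()

print(scores.to_string(index=False))
print(f"largest public/private gap: {max_gap:.5f}")
```

The largest gap belongs to Model 1, so the best-scoring model is also the one whose public score is most optimistic, although the difference is still small.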

## Notes on assumptions
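- **Multinomial LR**: assumes class log-odds are linear in the features, giving linear decision boundaries in the transformed (one‑hot) space; check for multicollinearity and tune the regularization strength `C`.
- **LDA (shrinkage)**: assumes features are approximately Gaussian within each class with a shared covariance matrix; shrinkage stabilizes the covariance estimate.
- **Gaussian NB**: assumes features are conditionally independent given the class and Gaussian within each class; a fast baseline for mixed tabular features.
- **Linear SVM (OvR)**: fits one margin-maximizing linear classifier per class in the high‑dimensional sparse space; numeric features should be scaled first.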
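
For reference, the four model families above could be set up in scikit-learn roughly as follows. This is a sketch under stated assumptions, not the configuration that produced the results in this notebook: the data is synthetic (`make_classification`), the hyperparameters are illustrative defaults, and the preprocessing is reduced to standard scaling.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): the real obesity features are not used here.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=0
)

# Illustrative settings only; the notebook's tuned values are not shown in the source.
models = {
    "Multinomial LR": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "LDA (shrinkage)": LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"),
    "Gaussian NB": GaussianNB(),
    "Linear SVM (OvR)": make_pipeline(
        StandardScaler(), OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000))
    ),
}

# 3-fold stratified CV, reporting the same two metrics as the CV summary table.
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    fold_scores = cross_validate(
        model, X, y, cv=splitter, scoring=["accuracy", "f1_macro"]
    )
    results[name] = {
        "accuracy_mean": fold_scores["test_accuracy"].mean(),
        "macro_f1_mean": fold_scores["test_f1_macro"].mean(),
    }
    print(f"{name}: accuracy={results[name]['accuracy_mean']:.3f}, "
          f"macro-F1={results[name]['macro_f1_mean']:.3f}")
```

Shrinkage in `LinearDiscriminantAnalysis` requires the `lsqr` or `eigen` solver, which is why the default `svd` solver is not used here. Scores on the synthetic data say nothing about the obesity dataset; the sketch only shows how each assumption maps to an estimator.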