## BME i9400
## Fall 2025

### Homework 2: Logistic Regression with L1 and L2 Regularization

**Assigned:** 2025-11-10  
**Due:** 2025-11-24 11:59:59.999 PM EST

**Place completed notebook into your "my-work" folder on JupyterHub**

**Honor & AI use policy (read carefully)**
- You may use docs/StackOverflow for syntax.  
- You **may** ask an LLM for debugging/snippets, but you must include an **AI Log** at the end (prompts + what you used).
- You **may not** ask for a full solution. Your code and plots must reflect your own understanding.

**Deliverables**
1. Executed notebook (.ipynb) with all cells run. 
2. `ai_log.md` (if you used an LLM).

### Student Starter Notebook 

This notebook walks you step-by-step through the assignment.

**Dataset assumptions:**
- File: `diabetes.csv` (provided on GitHub)
- Target column: `Outcome` (1 = diabetes, 0 = no diabetes)
- All other columns are numeric predictors.

**Instructions:**
- Read the markdown **before** each code cell.
- Fill in any `TODO` sections in the code.
- Answer the short written questions in the markdown blocks.
- Run cells in order so variables are defined correctly.

Do **not** remove cells; add new ones if you want extra exploration.


---
## 1. Setup & Imports

Import all necessary libraries and set up basic plotting defaults.
Run this cell first. No edits needed unless you add extra packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_curve,
    roc_auc_score,
    classification_report,
)

plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.grid'] = False
plt.rcParams['font.size'] = 11

---
## 2. Task 1 – Data Loading & Basic Exploration (10 pts)

**Goal:** Load the dataset and understand what you are modeling.

### Instructions
- Load `diabetes.csv` into a DataFrame called `df`.
- Separate features `X` and labels `y` using `Outcome` as the target column.
- Print:
  - The first 5 rows
  - Number of samples and features
  - Class counts and class proportions
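
The counting step above can be sketched on a toy table. This is a minimal illustration, not the assignment data: the DataFrame `toy` and its columns `feature` and `label` are hypothetical names chosen for the example.

```python
import pandas as pd

# Toy data (hypothetical, NOT diabetes.csv): 1 = positive class, 0 = negative class
toy = pd.DataFrame({
    "feature": [0.10, 0.40, 0.35, 0.80, 0.90, 0.20],
    "label":   [0,    0,    1,    1,    0,    0],
})

n_samples, n_columns = toy.shape                          # rows, columns (label included)
counts = toy["label"].value_counts()                      # absolute count per class
proportions = toy["label"].value_counts(normalize=True)   # fractions that sum to 1

print(f"samples = {n_samples}, features = {n_columns - 1}")
print(counts)
print(proportions)
```

`value_counts(normalize=True)` returns the class fractions directly, which is usually the quickest way to judge class balance.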


In [None]:
# TODO: Load the dataset


# TODO: Print the first few rows


# TODO: Define features X and target y


# TODO: Print the number of samples and features


# TODO: Print the class counts and proportions

### Short Answer

**(a)** Is the dataset balanced? Justify with the proportions.

**(b)** Give 1–2 reasons why logistic regression is a reasonable model choice for this prediction task.

> TODO: Write your answer here.


---
## 3. Task 2 – Baseline Logistic Regression with Pipeline (20 pts)

**Goal:** Train a clean baseline model that correctly handles scaling and evaluation.

### Instructions
- Create a train/test split named `X_train`, `X_test`, `y_train`, `y_test` (later cells reuse `X_test` and `y_test`):
  - `test_size = 0.2`
  - `stratify = y`
  - `random_state = 9400`
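- Build a `Pipeline` with the two steps below. Name the logistic regression step `'logreg'`, because the helper function in Task 4 looks it up by that name: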
  - `StandardScaler()`
  - `LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000, random_state=9400)`
- Fit on the training data.
- Report training and test accuracy.
- Plot the confusion matrix on the test set.
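
As a rough sketch of why the scaler belongs inside the pipeline, the example below runs the same mechanics on synthetic data. Everything here is an assumption for illustration: the data come from `make_classification`, and the names `X_demo`, `y_demo`, and `demo_pipe` and the seed `0` are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the assignment dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# stratify keeps the class proportions similar in both splits
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Steps are (name, estimator) pairs; the name is how a step is looked up later
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics from the training split only,
# then applies that same transform whenever the pipeline predicts
demo_pipe.fit(Xtr, ytr)

print(f"train accuracy = {demo_pipe.score(Xtr, ytr):.3f}")
print(f"test accuracy  = {demo_pipe.score(Xte, yte):.3f}")
```

Because the scaler is fitted inside `fit()`, the test split never influences the mean and standard deviation used for scaling.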


In [None]:
# TODO: Train/test split


# TODO: Define the baseline pipeline


# TODO: Fit and evaluate


# TODO: Report training and test accuracy


# TODO: Plot confusion matrix on test set


### Short Answer

- Compare training vs test accuracy.
- Does the model appear to overfit, underfit, or behave reasonably? Explain briefly.

> TODO: Write your answer here.


---
## 4. Task 3 – Effect of Regularization Strength C (25 pts)

**Goal:** Use cross-validation to choose the L2 regularization strength.

### Instructions
- Use a `Pipeline` with `StandardScaler` + `LogisticRegression(penalty='l2')`.
- Use `StratifiedKFold(n_splits=5, shuffle=True, random_state=9400)`.
- Search `C ∈ {1e-3, 1e-2, 1e-1, 1, 10, 100}` with `GridSearchCV`.
- For each C, record mean CV accuracy.
- Plot mean CV accuracy vs `log10(C)`.
- Refit the best model on the full training set, store it as `best_l2_model` (Tasks 4 and 5 reuse it), and evaluate on the test set.
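
The grid-search mechanics can be sketched as follows, assuming synthetic data and a shortened, illustrative grid. The names `X_demo`, `y_demo`, `demo_pipe`, `search`, and `best_demo_model` and the three C values are assumptions for the example, not the values required above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the assignment dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" routes each value to that pipeline step
param_grid = {"logreg__C": [0.01, 1.0, 100.0]}   # illustrative grid only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(demo_pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

# One mean score per grid point, in the same order as cv_results_["params"]
mean_scores = search.cv_results_["mean_test_score"]
c_values = [params["logreg__C"] for params in search.cv_results_["params"]]
for c, score in zip(c_values, mean_scores):
    print(f"log10(C) = {np.log10(c):+.0f}, mean CV accuracy = {score:.3f}")

# With the default refit=True, best_estimator_ is already refitted
# on all of the data that was passed to search.fit()
best_demo_model = search.best_estimator_
print("best parameters:", search.best_params_)
```

In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean a stronger penalty.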


In [None]:
# TODO: Define grid of C values

# TODO: Create pipeline for L2-regularized logistic regression

# TODO: Search over the grid of C values

# TODO: For each C, record mean CV accuracy

# TODO: Plot mean CV accuracy vs log10(C)

# TODO: Refit best model on full training set (store as best_l2_model) & evaluate on test set

### Short Answer

- How does C influence cross-validated performance?
- What happens for very small vs very large C?
- Which C do you choose, and why?

> TODO: Write your answer here.


---
## 5. Task 4 – L1-Regularized Logistic Regression (25 pts)

### Instructions
- Fit an L1-penalized model (`penalty='l1'`, `solver='liblinear'`) inside the same kind of `Pipeline` (with `StandardScaler`), using `GridSearchCV` over the same C grid.
- For the best model, report:
  - Best hyperparameters
  - Test accuracy
  - Number of non-zero coefficients
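
To illustrate how an L1 penalty can zero out coefficients, here is a minimal sketch on synthetic data. The data, the three C values, and the names `X_demo`, `y_demo`, and `nonzero_by_c` are assumptions for the example. No cross-validation is done here, so this shows only the coefficient counting, not the model selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the assignment dataset): 10 features, 3 informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

nonzero_by_c = {}
for c in (0.01, 0.1, 1.0):   # illustrative values only
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        # liblinear is one of the solvers that supports the L1 penalty
        ("logreg", LogisticRegression(penalty="l1", solver="liblinear", C=c)),
    ])
    pipe.fit(X_demo, y_demo)
    coefs = pipe.named_steps["logreg"].coef_      # shape (1, n_features)
    nonzero_by_c[c] = int(np.count_nonzero(coefs))
    print(f"C = {c:>5}: {nonzero_by_c[c]} of {coefs.size} coefficients are non-zero")
```

Scaling matters for this comparison because the penalty acts on coefficient magnitudes, which depend on the units of each feature.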


In [None]:
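# Helper: count non-zero coefficients of the LogisticRegression inside a Pipeline.
# Assumes the logistic regression step is named 'logreg'.
def count_nonzero_coefs(pipe_model):
    logreg = pipe_model.named_steps['logreg']
    return int(np.count_nonzero(logreg.coef_))

# L2 summary (reuses best_l2_model from Task 3 and X_test, y_test from Task 2)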
l2_nonzero = count_nonzero_coefs(best_l2_model)
l2_test_acc = accuracy_score(y_test, best_l2_model.predict(X_test))
print(f'[L2] best C={best_l2_model.named_steps["logreg"].C}, '
      f'non-zero coeffs={l2_nonzero}, test acc={l2_test_acc:.3f}')

In [None]:
# TODO: Redefine the grid of C values

# TODO: Create pipeline for L1-penalized logistic regression

# TODO: Fit the pipeline, find the best C value, and measure test accuracy and number of non-zero coefficients


### Short Answer

- Which penalty produced the sparsest model (fewest non-zero coefficients)?
- How do test accuracies compare between L2 and L1?
- If two models have similar accuracy but different sparsity, which would you prefer here, and why?

> TODO: Write your answer here.


---
## 6. Task 5 – Probabilities, ROC Curve & Thresholds (20 pts)

**Goal:** Use the predicted probabilities from logistic regression to study trade-offs between sensitivity and specificity.

### Instructions
- Using your **best L2 model**, compute predicted probabilities `P(Outcome=1)` on `X_test`.
- Compute **ROC AUC** and plot the ROC curve.
- Choose a threshold different from 0.5 (e.g., 0.4) and:
  - Compute accuracy.
  - Show the confusion matrix.
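
The thresholding step can be sketched with hand-made numbers. The arrays `y_true` and `p_pos` below are invented for illustration and stand in for the test labels and the positive-class column of `predict_proba`. The dictionary `results` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Invented labels and positive-class probabilities (NOT model output)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_pos = np.array([0.10, 0.30, 0.45, 0.60, 0.42, 0.48, 0.70, 0.90])

# AUC is computed from the scores themselves and does not depend on a threshold
auc = roc_auc_score(y_true, p_pos)
print(f"ROC AUC = {auc:.4f}")

results = {}
for thr in (0.5, 0.4):
    # Predict the positive class when the probability reaches the threshold
    y_hat = (p_pos >= thr).astype(int)
    # For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
    acc = accuracy_score(y_true, y_hat)
    results[thr] = {"tn": tn, "fp": fp, "fn": fn, "tp": tp, "acc": acc}
    print(f"threshold {thr}: TN={tn} FP={fp} FN={fn} TP={tp} accuracy={acc:.3f}")
```

With a fitted classifier, the probabilities would typically come from `model.predict_proba(X)[:, 1]`, where column 1 corresponds to the class labeled 1.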


In [None]:
# TODO: Compute predicted probabilities for the positive class using the best L2 model


# TODO: Plot the ROC curve and compute the area under the curve (AUC)
# Hint: use sklearn.metrics.roc_curve and sklearn.metrics.roc_auc_score


In [None]:
# TODO: Using a threshold of 0.4, compute accuracy and show confusion matrix


### Short Answer

- How does lowering the threshold (e.g., from 0.5 to 0.4) affect false negatives vs false positives?
- For a diabetes screening scenario, would you choose a higher or lower threshold than 0.5? Explain.

> TODO: Write your answer here.
