# 02 - Model Baselines (Logistic vs. Dummy) + ROC/PR + Coefficients + What-if
 
This notebook trains **baseline models** on the cleaned labels from `01_import_clean_eda.ipynb`
and adds **evaluation curves**, **coefficient inspection**, and an **interactive what-if** tool.

**Goals**
- Load cleaned labels (`data/cleaned/labels_clean.parquet`).
- Define features/target (PHQ-8 item-level scores as features, `phq8_binary` as target).
- Stratified train/test split.
- Baselines: **DummyClassifier** vs **LogisticRegression (balanced)**.
- Evaluate: classification report, confusion matrix, **ROC**/**PR** curves.
- Explainability: plot logistic **coefficients**.
- UX: **what-if sliders** to see how item scores change predicted risk.

---



### Environment reminders
Run this once in your terminal (inside the repo root) if needed:
```bash
source .venv/bin/activate
pip install -U pip
pip install scikit-learn ipywidgets matplotlib numpy pandas
```
*Note:* JupyterLab 3+ supports `ipywidgets` without extra enabling. If the slider cell errors,
install `ipywidgets` in your **.venv** and restart the kernel.

---

In [None]:
# --- bootstrap PYTHONPATH so repo utilities are importable ------------------------------
import sys, pathlib
CWD = pathlib.Path.cwd()
ROOT = CWD if (CWD / "utils").exists() else CWD.parent
if str(ROOT) not in sys.path: sys.path.append(str(ROOT))
if str(ROOT / "utils") not in sys.path: sys.path.append(str(ROOT / "utils"))
print("ROOT:", ROOT)
print("In sys.path:", str(ROOT) in sys.path, str(ROOT/'utils') in sys.path)

In [None]:
# --- environment sanity & project paths -------------------------------------------------
from utils.sanity import sanity_env, setup_paths, set_seeds
sanity_env(pkgs=("pandas","numpy","matplotlib","sklearn"))
ROOT, DATA, RAW, CLEAN, OUT = setup_paths()
set_seeds(42)
ROOT, DATA, RAW, CLEAN, OUT

## Step 1 - Load cleaned labels

In [None]:
# Why: Use the cleaned artifact from notebook 01 as the single source of truth for labels.
import pandas as pd
labels_path = CLEAN / "labels_clean.parquet"
assert labels_path.exists(), f"Missing {labels_path}. Run 01_import_clean_eda.ipynb first."
df = pd.read_parquet(labels_path)
print("Shape:", df.shape); df.head(3)

In [None]:
# --- Confirm required columns -----------------------------------------------------------
ITEMS = ['phq8_nointerest','phq8_depressed','phq8_sleep','phq8_tired',
 'phq8_appetite','phq8_failure','phq8_concentrating','phq8_moving']
REQUIRED = ITEMS + ['phq8_binary']
missing = [c for c in REQUIRED if c not in df.columns]
assert not missing, f"Missing columns: {missing}"
df['phq8_binary'] = df['phq8_binary'].astype(int)
df[REQUIRED].head(2)

---
### Step 1 Interpretation - Loading Cleaned Labels
We begin with the cleaned labels file produced in Notebook 01. 
- Ensures we are working from a **reproducible single source of truth**. 
- Confirms that the file exists and loads correctly. 
- Preview of the first few rows verifies that participant IDs, PHQ-8 item responses, and binary labels are present. 

*Key point:* By centralizing label cleaning in Notebook 01, we guarantee consistency across all modeling experiments.

---



## Step 2 - Feature/label selection & balance check

In [None]:
# Why: Make class imbalance explicit; it guides evaluation choices (e.g., PR curves, macro F1).
import pandas as pd
X = df[ITEMS].copy()
y = df['phq8_binary'].astype(int)
balance = y.value_counts().to_frame('count')
balance['proportion'] = (balance['count'] / len(y)).round(3)
display(balance)

---
### Step 2 Interpretation - Feature & Label Balance
Here we select PHQ-8 items as input features and the binary depression indicator (`phq8_binary`) as the prediction target. 

- The **class balance check** shows ~72% not depressed vs. ~28% depressed. 
- This imbalance is important to acknowledge because it impacts how metrics like accuracy can be misleading. 

*Key point:* Making class imbalance explicit prepares us to use metrics like **Precision-Recall** and strategies such as `class_weight="balanced"`.
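
To make the accuracy pitfall concrete, here is a minimal sketch using made-up labels in the same ~72/28 proportion (the lists `y_true` and `y_pred` are illustrative and are not drawn from the cleaned file). A model that always predicts the majority class looks acceptable on accuracy while finding no depressed cases at all.

```python
# Illustrative labels only: 72 negatives and 28 positives, mirroring the ~72/28 split.
y_true = [0] * 72 + [1] * 28
# A "model" that always predicts the majority class (not depressed).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the positive class: share of true positives that were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_pos = true_pos / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall on positives={recall_pos:.2f}")
```

This is why the later steps report per-class precision/recall and PR curves rather than accuracy alone.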

---



## Step 3 - Train/test split (stratified)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
 X, y, test_size=0.20, stratify=y, random_state=42
)
print(len(X_train), len(X_test), y_train.mean().round(3), y_test.mean().round(3))

---
### Step 3 Interpretation - Train/Test Split
We split the dataset into 80% training and 20% testing using **stratified sampling**. 

- Stratification ensures the class distribution (72/28) is preserved in both train and test sets. 
- This prevents accidental bias where the test set might contain too few positives or negatives. 
- The printed output confirms that the positive rate is nearly identical in both splits. 

*Key point:* A stratified split is critical for fair evaluation under class imbalance.
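
The effect of `stratify` can be sketched on synthetic labels (the arrays `X_demo` and `y_demo` below are placeholders, sized so that a 72/28 ratio divides evenly; they are not the notebook's data). With stratification the positive rate in each split tracks the overall rate; without it the rate is left to chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption for illustration): 180 negatives, 70 positives = 28% positive.
y_demo = np.array([0] * 180 + [1] * 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)
# Plain random split: proportions can drift, especially in a small test set.
_, _, y_tr_p, y_te_p = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

print("overall positive rate:     ", round(y_demo.mean(), 3))
print("stratified   train / test: ", round(y_tr_s.mean(), 3), round(y_te_s.mean(), 3))
print("unstratified train / test: ", round(y_tr_p.mean(), 3), round(y_te_p.mean(), 3))
```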

---


## Step 4 - Baselines: Dummy vs Logistic (balanced)

---
### Why include a Dummy Classifier?

A **DummyClassifier** does *not* learn from the data. 
Instead, it makes predictions using simple rules, such as:
- always guessing the majority class (e.g., always "not depressed"), or 
- predicting labels randomly in proportion to the class distribution. 

**Purpose:** 
- Provides a **performance floor** (chance-level baseline). 
- Serves as a **sanity check**: if our real model cannot outperform Dummy, it means the features contain little to no predictive signal. 
- Gives context: we can show how much better a real model performs compared to naive guessing.

*Key idea:* If Logistic Regression (or any model) performs better than Dummy, we know it's actually learning patterns from the PHQ-8 features rather than just guessing.
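
The two rules described above correspond to the `most_frequent` and `stratified` strategies of scikit-learn's `DummyClassifier`. A small sketch on synthetic labels (the arrays here are illustrative assumptions, not the PHQ-8 data) shows that neither strategy looks at the features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels in a 72/28 ratio; the feature column is all zeros on purpose,
# because DummyClassifier ignores the features entirely.
y_demo = np.array([0] * 72 + [1] * 28)
X_demo = np.zeros((len(y_demo), 1))

majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
stratified = DummyClassifier(strategy="stratified", random_state=42).fit(X_demo, y_demo)

maj_pred = majority.predict(X_demo)      # always the majority class
strat_pred = stratified.predict(X_demo)  # random labels drawn from the class prior

print("most_frequent: predicted positives =", int(maj_pred.sum()))
print("stratified:    predicted positive rate =", strat_pred.mean())
print("class prior stored by Dummy:", stratified.class_prior_)
```

This notebook uses the `stratified` strategy, so its Dummy predictions vary with `random_state` but stay near the class prior on average.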

---


In [None]:
# --- Baselines with proper preprocessing (impute -> scale -> logistic) -------
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Quick sanity: where are NaNs?
print("NaNs per feature (train):")
display(X_train.isna().sum())

# 1) Dummy baseline (doesn't need preprocessing)
dum = DummyClassifier(strategy="stratified", random_state=42)
dum.fit(X_train, y_train)
dum_pred = dum.predict(X_test)

print("== DummyClassifier ==")
print(classification_report(y_test, dum_pred, digits=3))
display(pd.DataFrame(confusion_matrix(y_test, dum_pred),
 index=['true_0','true_1'], columns=['pred_0','pred_1']))

# 2) Logistic with pipeline: impute median -> scale -> logistic(balanced)
logit_pipe = Pipeline(steps=[
 ("imputer", SimpleImputer(strategy="median")),
 ("scaler", StandardScaler(with_mean=False)), # scale to unit variance without centering
 ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", solver="lbfgs"))
])

logit_pipe.fit(X_train, y_train)
log_pred = logit_pipe.predict(X_test)

print("\n== LogisticRegression (balanced, impute+scale) ==")
print(classification_report(y_test, log_pred, digits=3))
display(pd.DataFrame(confusion_matrix(y_test, log_pred),
 index=['true_0','true_1'], columns=['pred_0','pred_1']))


---
### Interpretation of Baseline Classification Reports

- **Dummy Baseline** 
 - Precision/recall are low, especially for the positive (depressed) class. 
 - The confusion matrix shows that most cases are predicted as non-depressed. 
 - This confirms Dummy is essentially a *chance-level floor*.

- **Logistic Regression (with imputation + scaling)** 
 - Precision and recall are both much higher compared to Dummy. 
 - The confusion matrix shows that the model successfully identifies most depressed cases,
 while keeping false positives low. 
 - Indicates that even a simple linear model can capture meaningful patterns in the PHQ-8 items.

**Key takeaway:** Logistic regression substantially outperforms the Dummy baseline, establishing a strong reference point for future models.

---



## Step 5 - ROC & Precision-Recall curves (probability-based evaluation)

In [None]:
# --- ROC & PR curves using predicted probabilities ---------------------------
# Why: Curves summarize threshold behavior; PR is more informative with imbalance.
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay, average_precision_score
import numpy as np

# Probabilities for positive class (1)
# Dummy: use predict_proba if available; otherwise a constant score (class prior)
if hasattr(dum, "predict_proba"):
 dum_scores = dum.predict_proba(X_test)[:, 1]
else:
 dum_scores = np.full(len(y_test), y_train.mean())

# Logistic pipeline
log_scores = logit_pipe.predict_proba(X_test)[:, 1]

# ROC
fig, ax = plt.subplots()
RocCurveDisplay.from_predictions(y_test, dum_scores, name="Dummy", ax=ax)
RocCurveDisplay.from_predictions(y_test, log_scores, name="Logistic (impute+scale)", ax=ax)
ax.set_title("ROC curve"); plt.show()

# PR
fig, ax = plt.subplots()
PrecisionRecallDisplay.from_predictions(y_test, dum_scores, name="Dummy", ax=ax)
PrecisionRecallDisplay.from_predictions(y_test, log_scores, name="Logistic (impute+scale)", ax=ax)
ax.set_title("Precision-Recall curve"); plt.show()

print("AP (Dummy): ", round(average_precision_score(y_test, dum_scores), 3))
print("AP (Logistic):", round(average_precision_score(y_test, log_scores), 3))



---
### Interpretation of ROC & Precision-Recall Curves

- **Dummy Baseline** 
 - ROC AUC ≈ 0.46 and Average Precision (AP) ≈ 0.26. 
 - This is close to chance-level performance, as expected for a model that only guesses labels
 in proportion to the class distribution. 
 - Serves as a *floor* for evaluation, indicating any real model should outperform this.
 

- **Logistic Regression (impute + scale + balanced weights)** 
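 - ROC AUC ≈ 1.00 and AP ≈ 1.00 on this dataset. 
 - The model separates depressed vs. non-depressed cases almost perfectly. 
 - This is expected if `phq8_binary` is derived by thresholding the PHQ-8 total score (the conventional cutoff is a total of 10 or more): the label is then a function of the eight input items, so the result shows that the labels are recoverable from the items rather than that the model generalizes to other inputs.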
 

- **Why both curves?** 
 - ROC AUC summarizes *overall separability* (true positive rate vs. false positive rate). 
 - Precision-Recall is more informative under **class imbalance** because it directly reflects
 the tradeoff between catching positives and avoiding false alarms.
 

**Key takeaway:** Logistic regression vastly outperforms the Dummy baseline, confirming that the labels are highly learnable. 
This gives us a strong reference point for evaluating more complex models later (e.g., SVM, Random Forest, or multimodal architectures).
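
As a reference for what "chance level" means on each curve, the sketch below scores synthetic labels with pure noise (the 28% prevalence and the sample size are assumptions chosen for illustration). Uninformative scores give a ROC AUC near 0.5 and an AP near the positive-class prevalence, which is why Dummy's values sit close to 0.5 and 0.28, with some wobble on a small test set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 20_000  # large synthetic sample so the estimates are stable

# Synthetic labels with roughly 28% positives, and scores unrelated to the labels.
y_demo = (rng.random(n) < 0.28).astype(int)
random_scores = rng.random(n)

auc = roc_auc_score(y_demo, random_scores)
ap = average_precision_score(y_demo, random_scores)
prevalence = y_demo.mean()

print(f"ROC AUC with random scores: {auc:.3f} (chance reference: 0.5)")
print(f"AP with random scores:      {ap:.3f} (chance reference: prevalence = {prevalence:.3f})")
```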

---


## Step 6 - Logistic coefficients (global feature influence)

In [None]:
# --- Logistic coefficients from the pipeline ---------------------------------
# Why: Coefficients (after impute+scale) show global directional influence on the log-odds.
import pandas as pd
import matplotlib.pyplot as plt

coef = pd.Series(logit_pipe.named_steps["clf"].coef_[0], index=ITEMS).sort_values()
ax = coef.plot(kind='barh')
ax.set_title("Logistic coefficients (after impute+scale)\npositive = higher depression risk")
ax.set_xlabel("Coefficient")
plt.tight_layout(); plt.show()


---
### Interpretation of Logistic Coefficients

- **Directionality:** 
 - Positive coefficients (bars to the right) increase the log-odds of being classified as *depressed*. 
 - Negative coefficients (bars to the left, if present) decrease that risk. 

- **Magnitude:** 
 - Because features were scaled to unit variance (impute + scale, without centering), magnitudes are comparable across PHQ-8 items. 
 - Larger absolute values indicate stronger influence on the model's predictions. 

- **Findings in this run:** 
 - Items such as *appetite*, *sleep disturbance*, and *tiredness* are strong positive predictors. 
 - This aligns with clinical expectations: somatic symptoms often weigh heavily in depression screening. 

**Key takeaway:** The model is not only predictive, but also interpretable. 
Coefficients provide a transparent, global view of which PHQ-8 items drive classification decisions.
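
A coefficient on the log-odds scale is easier to read as an odds ratio. The sketch below uses a hypothetical coefficient of 0.8 (not a value from this run) and assumes the feature was divided by its standard deviation, as the pipeline's scaler does, so that one scaled unit equals one standard deviation of the raw item score.

```python
import math

coef_demo = 0.8  # hypothetical coefficient, NOT taken from the fitted model


def sigmoid(z: float) -> float:
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# exp(coefficient) is the factor by which the odds change per one-unit increase
# in the scaled feature (one standard deviation of the raw item score).
odds_ratio = math.exp(coef_demo)

# Example: start at log-odds 0 (probability 0.5) and raise the feature by one scaled unit.
p_before = sigmoid(0.0)
p_after = sigmoid(0.0 + coef_demo)

print(f"odds ratio per +1 SD: {odds_ratio:.3f}")
print(f"probability moves from {p_before:.3f} to {p_after:.3f}")
```

Note that the change in probability depends on the starting point, while the odds ratio does not.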

---

## Step 7 - Save artifacts (predictions excerpt)

In [None]:
# --- Save baseline predictions (optional artifact) ---------------------------

import pandas as pd

# Ensure aligned indices between features and true labels
X_view = X_test.reset_index(drop=True)
y_true = y_test.reset_index(drop=True)

# Build predictions DataFrame
pred_df = X_view.copy()
pred_df["y_true"] = y_true
pred_df["y_pred_dummy"] = dum_pred
pred_df["y_pred_logistic"] = logit_pipe.predict(X_test)
pred_df["p_logistic"] = logit_pipe.predict_proba(X_test)[:, 1]

# Save locally (ignored by git via .gitignore)
OUT.mkdir(parents=True, exist_ok=True)
pred_path = OUT / "baseline_predictions.csv"
pred_df.to_csv(pred_path, index=False)

print(f"Baseline predictions saved to {pred_path}")


## Step 8 - What-if sliders (interactive)

In [None]:
# --- What-if sliders (interactive) -------------------------------------------
# Why: Build intuition by adjusting PHQ-8 item scores (0..3) and seeing predicted probability.
# Requires: ipywidgets installed in your .venv. If import fails, install then restart kernel.
try:
 import ipywidgets as W
 from IPython.display import display
except Exception:
 print("ipywidgets missing. Run: pip install ipywidgets (then restart kernel)")
 raise

# Build sliders for each PHQ-8 item
sliders = {
 f: W.IntSlider(
 value=int(X[f].median()),
 min=0, max=3, step=1,
 description=f, continuous_update=False
 )
 for f in ITEMS
}

# Decision threshold (default 0.5)
th = W.FloatSlider(
 value=0.5, min=0.0, max=1.0, step=0.01,
 description="threshold", readout_format=".2f", continuous_update=False
)

btn = W.Button(description="Predict", button_style="primary")
out = W.Output()

def on_click(_):
    with out:
        out.clear_output()
        import pandas as pd
        # 1-row DataFrame from the current slider values
        x = pd.DataFrame({k: [int(v.value)] for k, v in sliders.items()})
        # Predict with the trained pipeline
        p = float(logit_pipe.predict_proba(x)[0, 1])
        yhat = int(p >= th.value)  # apply the chosen threshold
        # Pretty print
        print({k: int(v.value) for k, v in sliders.items()})
        print(f"Predicted probability: {p:.3f} -> "
              f"{'DEPRESSED (1)' if yhat else 'NOT DEPRESSED (0)'} "
              f"@ threshold={th.value:.2f}")

btn.on_click(on_click)

# Layout: sliders + threshold + button + output
display(W.VBox(list(sliders.values()) + [th, btn, out]))


---

### Why interactive sliders + threshold matter

- **Sliders for PHQ-8 items** 
 Let us simulate "what-if" scenarios by changing individual symptom scores (e.g., bump *sleep disturbance* from 1 to 3). 
 This builds intuition about how the model responds to different symptom patterns.

- **Decision threshold (default = 0.5)** 
 Classification models predict *probabilities* (e.g., 0.73 for *depressed*). The **threshold** is where we draw the line: 
 - At 0.5, a predicted probability of 50% or more is classified as *depressed*. 
 - Lowering the threshold (e.g., 0.3) increases sensitivity (catches more true positives) but risks more false alarms. 
 - Raising the threshold (e.g., 0.7) increases specificity (fewer false positives) but risks missing true cases. 

- **What this shows** 
 - Helps visualize the **trade-off between sensitivity and specificity** in real time. 
 - Demonstrates how symptom combinations push the probability up or down. 
 - Encourages critical thinking: the model is not a fixed "yes/no" machine - decisions depend on context and chosen threshold.

**Key takeaway:** 
Interactive sliders + threshold let stakeholders explore "what would the model say if...?" and see how model outputs align (or misalign) with clinical judgment. This makes the notebook not only technical but also *explainable and interpretable*.
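
The threshold trade-off can also be shown without widgets. The sketch below uses ten hand-made scores (purely illustrative, not model output) and a hypothetical helper `sens_spec` to compute sensitivity and specificity at the three thresholds mentioned above.

```python
# Hand-made example: six true negatives and four true positives with illustrative scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.95]


def sens_spec(threshold: float) -> tuple[float, float]:
    """Return (sensitivity, specificity) when predicting 1 for score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, preds))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


for th_demo in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(th_demo)
    print(f"threshold={th_demo:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Lowering the threshold from 0.7 to 0.3 raises sensitivity from 0.50 to 1.00 on this toy data while specificity falls from 1.00 to 0.50.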


---
## Appendix - Notes, assumptions, & next steps

**Notes** 
- Features: PHQ-8 item-level responses (8 features). 
- Target: `phq8_binary` (0 = non-depressed, 1 = depressed). 
- Dataset: Labels and responses derived from the DAIC-WOZ corpus (PHQ-8 questionnaire). 

**Assumptions** 
- Class imbalance addressed with `class_weight='balanced'` and stratified train/test split. 
- ROC/PR curves summarize classifier thresholds; PR is especially informative under imbalance. 
- Logistic coefficients interpreted after standardization (magnitude ~ influence on log-odds).
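
For reference, scikit-learn documents the `balanced` weights as `n_samples / (n_classes * np.bincount(y))`. A minimal sketch of that formula on illustrative 72/28 labels (not the notebook's data) shows that the minority class is up-weighted until both classes contribute equally to the loss:

```python
from collections import Counter

# Illustrative labels in a 72/28 ratio.
y_demo = [0] * 72 + [1] * 28

counts = Counter(y_demo)
n_samples = len(y_demo)
n_classes = len(counts)

# Same formula scikit-learn documents for class_weight="balanced".
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

for cls in sorted(weights):
    total = weights[cls] * counts[cls]
    print(f"class {cls}: weight={weights[cls]:.3f}, weighted count={total:.1f}")
```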
 
**Artifacts**
- outputs/baseline_predictions.csv - predictions + probabilities on test set (local only, ignored by Git).
- No new parquet artifacts generated; modeling uses `data/cleaned/labels_clean.parquet` from Notebook 01.
 
**Limitations** 
- Current analysis restricted to depression severity (PHQ-8). 
- Trauma-informed markers beyond depression (e.g., dissociation, blunted affect) not yet included. 
- Small dataset size: results are illustrative, not fully generalizable. 

**Reproducibility** 
- Python 3.13 environment (`.venv`). 
- Core libraries: `scikit-learn`, `pandas`, `matplotlib`, `ipywidgets`. 
- Pre-commit hooks strip notebook noise (`nbstripout`) for clean version control. 

**Next** 
- Feature engineering: demographics, text embeddings, audio/video features. 
- Additional models (SVM, RF, calibrated models) + ROC-AUC/PR-AUC tables. 
- SHAP for tree models; coefficient confidence intervals for logistic regression. 
- Fairness analyses: performance slices by demographics if available. 



---

## Closing Summary

In this notebook we established **baseline models** for detecting depressive states using PHQ-8 item responses.

### Models Compared
- **Dummy Classifier** 
 - Purpose: provides a *non-learning baseline* (majority class or random guesses). 
 - Results: ROC AUC ≈ 0.46 and Average Precision (AP) ≈ 0.26, i.e., chance-level performance. 
 - Significance: serves as a **performance floor** for meaningful models. 

- **Logistic Regression (impute + scale)** 
 - Purpose: simple but interpretable linear model. 
 - Results: ROC AUC ≈ 1.0 and AP ≈ 1.0 on this dataset. 
 - Coefficients: strongest predictors included somatic symptoms such as **appetite, sleep disturbance, and tiredness**. 
 - Significance: demonstrates **excellent discriminative ability** while remaining transparent. 

### Evaluation Methods
- **ROC Curve (Receiver Operating Characteristic)** 
 Shows the trade-off between sensitivity (true positive rate) and the false positive rate (1 - specificity) across thresholds. 
- **PR Curve (Precision-Recall)** 
 Especially informative under class imbalance; highlights precision at different recall levels. 
- **AP (Average Precision)** 
 Summarizes PR performance into a single score. 
- **Coefficient Plot** 
 Visualizes which features most strongly increase/decrease depression risk (positive vs. negative coefficients on the log-odds scale). 
- **Interactive Sliders** 
 Allow dynamic adjustment of PHQ-8 items to explore "what-if" scenarios. 
- **Threshold Control** 
 Highlights the trade-off between **catching more true cases** vs. **avoiding false alarms**. 

### Key Findings
- Dummy classifier confirmed baseline chance-level performance. 
- Logistic regression far outperformed Dummy, achieving near-perfect separation. 
- Somatic PHQ-8 items (appetite, sleep, tiredness) emerged as key predictors. 
- Evaluation curves quantified discrimination across thresholds, and the sliders made the model's behavior easy to inspect. 

### Why It Matters
- Establishes a **transparent, interpretable baseline** for depressive state detection. 
- Provides stakeholders with clear context: *how much better a real model is than guessing*. 
- Sets a foundation for expanding into **multimodal, trauma-informed modeling** in future work. 

**Takeaway:** 
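Even a simple logistic model separates the classes almost perfectly from the PHQ-8 items while remaining interpretable. This should not be read as state-of-the-art performance: if `phq8_binary` is derived from the PHQ-8 total score, the items determine the label, so the result mainly validates the pipeline. 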
This gives us a strong, trustworthy baseline before layering in richer trauma-informed signals (audio, video, text, demographics) in subsequent notebooks.
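
One way to sanity-check the near-perfect scores is to reproduce them on synthetic data. The sketch below assumes the label is the conventional PHQ-8 cutoff (total score of 10 or more); whether `phq8_binary` in the cleaned file was built this way is an assumption to verify against Notebook 01. The item distribution is arbitrary and chosen only to give roughly 30% positives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic PHQ-8-like items scored 0..3 (distribution is an arbitrary assumption).
items = rng.choice(4, size=(1000, 8), p=[0.4, 0.3, 0.2, 0.1])
# ASSUMPTION: binary label = PHQ-8 total >= 10 (conventional cutoff).
label = (items.sum(axis=1) >= 10).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    items, label, test_size=0.20, stratify=label, random_state=0
)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
auc_demo = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"positive rate: {label.mean():.3f}")
print(f"ROC AUC on synthetic thresholded-sum labels: {auc_demo:.3f}")
```

If the real data behave the same way, the baseline is best read as a pipeline check, and later notebooks that predict the label from audio, video, or text will be the informative comparison.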

---
