# v1 ML Baseline Notebook

This notebook focuses on the v1 ML baseline: features, model metrics, diagnostics, and interpretation.

Inputs:
- `v1/data_clean/ml_baseline_features.csv`
- `v1/report/ml_baseline_results.csv`
- `v1/report/ml_baseline_cv.csv`
- `v1/data_clean/ml_baseline_predictions.csv`
- `v1/report/ml_feature_importance.csv`


## 1) Setup


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

pd.set_option("display.max_columns", 200)

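# Locate the repository root (the directory that contains `v1/`).
CWD = Path.cwd().resolve()
if (CWD / "v1").exists():
    # Running from the repository root.
    REPO_ROOT = CWD
elif CWD.name == "notebooks" and (CWD.parent / "data_clean").exists():
    # Running from `v1/notebooks`: the repository root is two levels up.
    REPO_ROOT = CWD.parents[1]
else:
    # Fallback: assume the current directory is the repository root.
    REPO_ROOT = CWD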

V1_DIR = REPO_ROOT / "v1"
DATA_CLEAN = V1_DIR / "data_clean"
REPORT_DIR = V1_DIR / "report"

DATA_CLEAN, REPORT_DIR


## 2) Load baseline features


### Narrative commentary
This table has one row per country. Values are aggregated across age groups so that the same country cannot appear in both training and test data, which avoids leakage when predicting the 2021 suicide rate.


In [None]:
features_path = DATA_CLEAN / "ml_baseline_features.csv"
features = pd.read_csv(features_path)
features.head()


In [None]:
summary = {
    "rows": len(features),
    "countries": features["iso3"].nunique(),
    "missing_rows": int(features.isna().any(axis=1).sum()),
}
summary
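
The `missing_rows` count shows how many rows are incomplete, but not which columns are responsible or whether any country is duplicated. A minimal sketch of both checks follows, using a small made-up frame (`toy`) in place of `features`; the values are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `features`; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "iso3": ["AAA", "BBB", "CCC", "DDD"],
        "gbd_depression_dalys_rate_both": [500.0, np.nan, 620.0, 480.0],
        "gbd_addiction_death_rate_both": [1.2, 0.8, np.nan, 2.1],
    }
)

# Count missing values per column, keeping only columns with at least one gap.
missing_by_col = toy.isna().sum()
missing_by_col = missing_by_col[missing_by_col > 0].sort_values(ascending=False)

# List any country codes that occur more than once (expected: none).
duplicated_iso3 = toy.loc[toy["iso3"].duplicated(keep=False), "iso3"].unique().tolist()

print(missing_by_col)
print("duplicated iso3:", duplicated_iso3)
```

Swapping `toy` for `features` should apply the same checks to the real table, assuming it keeps the `iso3` column used above.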


## 3) Feature distributions


### Narrative commentary
Use these distributions to spot skew and extreme values. A long right tail means a few high-value countries can dominate the model fit and widen the error spread.


In [None]:
feature_cols = [
    "gbd_depression_dalys_rate_both",
    "gbd_addiction_death_rate_both",
    "gbd_selfharm_death_rate_both",
]

fig = px.histogram(
    features,
    x=feature_cols[0],
    nbins=30,
    title="Distribution: Depression DALYs rate (Both)",
)
fig


In [None]:
fig = px.histogram(
    features,
    x=feature_cols[1],
    nbins=30,
    title="Distribution: Addiction death rate (Both)",
)
fig


In [None]:
fig = px.histogram(
    features,
    x=feature_cols[2],
    nbins=30,
    title="Distribution: Self-harm death rate (Both)",
)
fig
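
The histograms show skew visually; a sample skewness value puts a number on it. The sketch below computes skewness before and after a `log1p` transform on made-up values. The transform is one common response to right tails and is shown only as an assumption, not as a step the v1 pipeline is known to apply.

```python
import numpy as np
import pandas as pd

# Toy right-tailed values standing in for one feature column (made up).
toy = pd.DataFrame({"rate": [1, 1, 2, 2, 3, 3, 4, 5, 8, 13, 40]}, dtype=float)

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = toy["rate"].skew()

# log1p compresses large values and is defined at zero.
log_skew = np.log1p(toy["rate"]).skew()

print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

For the real data, `features[feature_cols].skew()` would give one value per feature.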


## 4) Relationship check


In [None]:
fig = px.scatter(
    features,
    x="gbd_depression_dalys_rate_both",
    y="age_standardized_suicide_rate_2021",
    color="region_name",
    hover_name="location_name",
    title="Suicide rate vs Depression DALYs",
)
fig


In [None]:
fig = px.scatter(
    features,
    x="gbd_addiction_death_rate_both",
    y="age_standardized_suicide_rate_2021",
    color="region_name",
    hover_name="location_name",
    title="Suicide rate vs Addiction deaths",
)
fig


In [None]:
fig = px.scatter(
    features,
    x="gbd_selfharm_death_rate_both",
    y="age_standardized_suicide_rate_2021",
    color="region_name",
    hover_name="location_name",
    title="Suicide rate vs Self-harm deaths",
)
fig
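
The scatter plots can be complemented with a correlation table. This is a minimal sketch using Pearson correlation on a made-up frame that reuses the notebook's column names; the numbers are illustrative and say nothing about the actual data.

```python
import pandas as pd

target = "age_standardized_suicide_rate_2021"
cols = ["gbd_depression_dalys_rate_both", "gbd_selfharm_death_rate_both"]

# Toy stand-in for `features`; all values are made up.
toy = pd.DataFrame(
    {
        "gbd_depression_dalys_rate_both": [400.0, 650.0, 500.0, 700.0, 450.0],
        "gbd_selfharm_death_rate_both": [2.0, 4.0, 6.0, 8.0, 10.0],
        target: [2.5, 3.5, 6.5, 8.0, 9.5],
    }
)

# Pearson correlation of each feature with the target, strongest first.
corr_with_target = (
    toy[cols + [target]]
    .corr()[target]
    .drop(target)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low value does not rule out a curved relationship visible in the scatter plot.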


## 5) Baseline results + CV


### Narrative commentary
Holdout metrics show performance on a single train/test split, while cross-validation shows how stable that performance is across folds. Prefer models with a low MAE and a small spread across CV folds.


In [None]:
results = pd.read_csv(REPORT_DIR / "ml_baseline_results.csv")
cv = pd.read_csv(REPORT_DIR / "ml_baseline_cv.csv")

results, cv


In [None]:
fig = px.bar(
    results,
    x="model",
    y="mae",
    title="Holdout MAE by model",
)
fig


In [None]:
fig = px.bar(
    cv,
    x="model",
    y="mae_mean",
    error_y="mae_std",
    title="Cross-validation MAE (mean +/- std)",
)
fig
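
To read the two charts together, the holdout and CV tables can be joined on `model`. The sketch below uses made-up numbers and hypothetical model names (`model_a`, `model_b`); the one-standard-deviation check is an illustrative rule of thumb, not a threshold defined by this project.

```python
import pandas as pd

# Toy stand-ins for `results` and `cv`; names and values are made up.
results_toy = pd.DataFrame({"model": ["model_a", "model_b"], "mae": [2.0, 1.5]})
cv_toy = pd.DataFrame(
    {
        "model": ["model_a", "model_b"],
        "mae_mean": [2.1, 2.4],
        "mae_std": [0.2, 0.3],
    }
)

# Join on model name; validate guards against duplicated model rows.
comparison = results_toy.merge(cv_toy, on="model", how="inner", validate="one_to_one")

# A holdout MAE far from the CV mean suggests the single split was unrepresentative.
comparison["holdout_minus_cv"] = comparison["mae"] - comparison["mae_mean"]
comparison["within_one_std"] = comparison["holdout_minus_cv"].abs() <= comparison["mae_std"]

print(comparison)
```

In this toy example `model_b` has the lower holdout MAE but falls well outside its CV spread, which is the kind of case where the holdout number alone would be misleading.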


## 6) Predictions diagnostics


### Narrative commentary
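In the predicted-vs-actual plot, points should cluster around the diagonal "Ideal" line. Residuals are computed as predicted minus actual, so a distribution centered near zero indicates no systematic over- or under-prediction.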


In [None]:
preds = pd.read_csv(DATA_CLEAN / "ml_baseline_predictions.csv")

pred_cols = [c for c in preds.columns if c.endswith("_pred")]
pred_cols


In [None]:
model_col = pred_cols[0]
fig = px.scatter(
    preds,
    x="actual",
    y=model_col,
    hover_name="location_name",
    title=f"Predicted vs Actual ({model_col})",
)
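# Draw the y = x reference line between the smallest and largest actual values.
lims = [preds["actual"].min(), preds["actual"].max()]
fig.add_trace(
    go.Scatter(
        x=lims,
        y=lims,
        mode="lines",
        name="Ideal",
    )
)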
fig


In [None]:
preds["residual"] = preds[model_col] - preds["actual"]
fig = px.histogram(
    preds,
    x="residual",
    nbins=30,
    title="Residual distribution",
)
fig
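
The residual histogram covers only the first prediction column. A short sketch of a numeric summary for every `*_pred` column follows, using made-up values and a hypothetical column name (`model_a_pred`). Here bias is the mean residual, with residuals defined as predicted minus actual.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `preds`; the column name and values are made up.
preds_toy = pd.DataFrame(
    {
        "actual": [10.0, 12.0, 8.0, 15.0],
        "model_a_pred": [11.0, 11.0, 9.0, 13.0],
    }
)
pred_cols_toy = [c for c in preds_toy.columns if c.endswith("_pred")]

rows = {}
for col in pred_cols_toy:
    residual = preds_toy[col] - preds_toy["actual"]  # predicted minus actual
    rows[col] = {
        "bias": residual.mean(),  # > 0 means over-prediction on average
        "mae": residual.abs().mean(),
        "rmse": float(np.sqrt((residual ** 2).mean())),
    }

residual_summary = pd.DataFrame.from_dict(rows, orient="index")
print(residual_summary)
```

An RMSE much larger than the MAE points to a few large errors rather than uniformly spread ones.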


## 7) Feature importance


### Narrative commentary
Feature importance shows which indicators the model relies on most. Use it to describe patterns in the model's behavior, not to make causal claims.


In [None]:
imp = pd.read_csv(REPORT_DIR / "ml_feature_importance.csv")

fig = px.bar(
    imp.sort_values("importance", ascending=True).tail(15),
    x="importance",
    y="feature",
    orientation="h",
    title="Top 15 feature importances",
)
fig
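
Raw importance values are easier to compare as shares of the total. The sketch below assumes the importances are non-negative (as impurity-based importances are; signed coefficients or permutation scores would need different handling) and uses made-up feature names and values.

```python
import pandas as pd

# Toy stand-in for `imp`; feature names and values are made up.
imp_toy = pd.DataFrame(
    {"feature": ["feat_a", "feat_b", "feat_c"], "importance": [0.2, 0.5, 0.3]}
)

# Rank features, then express each importance as a share of the total.
ranked = imp_toy.sort_values("importance", ascending=False).reset_index(drop=True)
ranked["share"] = ranked["importance"] / ranked["importance"].sum()
ranked["cumulative_share"] = ranked["share"].cumsum()

print(ranked)
```

The cumulative column shows how many features account for most of the total importance.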


## 8) Notes

- v1 baseline uses one row per country to avoid leakage across age groups.
- Results are correlational, not causal, because the inputs mix WHO 2021 outcomes with GBD 2023 features.
