# v1 Analysis Notebook

This notebook documents the v1 cleaned datasets and produces reproducible analysis and visuals for the WHO + GBD workflow.

- Version: v1 (cleaned data + merged ML)
- Inputs: `v1/data_clean/*.csv`, `v1/report/*.csv`
- Outputs: tables and charts in this notebook


## 1) Setup

This notebook assumes dependencies from `requirements.txt` are installed in the project environment.


### Narrative commentary
This notebook is the reference walkthrough for the v1 pipeline. It reads the clean tables generated in v1 so every chart is directly traceable to a CSV output.


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

pd.set_option("display.max_columns", 200)

CWD = Path.cwd().resolve()
if (CWD / "v1").exists():
    REPO_ROOT = CWD
elif CWD.name == "notebooks" and (CWD.parent / "data_clean").exists():
    REPO_ROOT = CWD.parents[1]
else:
    REPO_ROOT = CWD

V1_DIR = REPO_ROOT / "v1"
DATA_CLEAN = V1_DIR / "data_clean"
REPORT_DIR = V1_DIR / "report"

DATA_CLEAN, REPORT_DIR


## 2) Helpers


In [None]:
from typing import Iterable


def load_csv(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Missing: {path}")
    return pd.read_csv(path)


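def numeric(df: pd.DataFrame, cols: Iterable[str]) -> pd.DataFrame:
    """Coerce the listed columns to numeric in place; unparseable values become NaN."""
    for col in cols:
        # Columns absent from the frame are skipped silently.
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    return df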


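def show_files(root: Path, pattern: str = "*.csv") -> pd.DataFrame:
    """List files under `root` matching `pattern` with their data-row counts."""
    rows = []
    for path in sorted(root.glob(pattern)):
        # Count physical lines and close the file deterministically.
        with open(path, "rb") as fh:
            n_lines = sum(1 for _ in fh)
        # Subtract the header line; an empty file reports 0 rather than -1.
        rows.append({"file": path.name, "rows": max(n_lines - 1, 0)})
    return pd.DataFrame(rows, columns=["file", "rows"])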


## 3) v1 file inventory


### Narrative commentary
Use the inventory to confirm row counts and ensure no output is missing. If a table is unexpectedly small, re-run the upstream script that generates it.
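
The missing-output check can also be automated. The sketch below assumes a hand-maintained list of expected file names; the helper `missing_outputs` and the constant `EXPECTED_CLEAN` are illustrative names, not part of the v1 pipeline. The file names themselves are the ones this notebook loads from `data_clean` in later sections.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(root: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are not present as files in `root`."""
    return [name for name in expected if not (root / name).is_file()]


# Hypothetical checklist, built from the load_csv calls in this notebook.
EXPECTED_CLEAN = [
    "who_2021_clean.csv",
    "gbd_depression_dalys_clean.csv",
    "gbd_addiction_clean.csv",
    "gbd_selfharm_clean.csv",
    "gbd_prob_death_clean.csv",
    "merged_ml_country.csv",
]
```

Calling `missing_outputs(DATA_CLEAN, EXPECTED_CLEAN)` should return an empty list when every table has been generated.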


In [None]:
show_files(DATA_CLEAN)


In [None]:
show_files(REPORT_DIR)


## 4) WHO 2021 Suicide Overview

Focus: age-standardized and crude rates for 2021.


### Narrative commentary
WHO 2021 is the baseline outcome. Rates are per 100,000 population. Age-standardized rates allow cross-country comparison without bias from differing age structures.
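
To see why age structure matters, here is a small worked sketch of direct age standardization: age-specific rates are weighted by a fixed standard population instead of each country's own age mix. All numbers are hypothetical, and the three age bands and weights are assumptions for illustration; they are not the WHO standard population.

```python
from typing import Sequence


def weighted_rate(age_rates: Sequence[float], shares: Sequence[float]) -> float:
    """Weight age-specific rates (per 100k) by population shares that sum to 1."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(rate * share for rate, share in zip(age_rates, shares))


# Hypothetical age-specific rates per 100k (young, middle, old).
# Both countries are given the SAME age-specific rates on purpose.
age_rates = [5.0, 10.0, 20.0]

young_country_shares = [0.5, 0.3, 0.2]  # hypothetical age mix
old_country_shares = [0.2, 0.3, 0.5]    # hypothetical age mix
standard_shares = [0.4, 0.4, 0.2]       # hypothetical standard population

# Crude rates differ only because the age mixes differ.
crude_young = weighted_rate(age_rates, young_country_shares)
crude_old = weighted_rate(age_rates, old_country_shares)

# Standardized rates use the same weights for both, so they coincide.
standardized = weighted_rate(age_rates, standard_shares)
```

With identical age-specific rates, the crude rate is higher for the older population, while the standardized rate is the same for both, which is the bias the age-standardized column avoids.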


In [None]:
who = load_csv(DATA_CLEAN / "who_2021_clean.csv")
who = numeric(
    who,
    [
        "number_suicides_2021",
        "crude_suicide_rate_2021",
        "age_standardized_suicide_rate_2021",
    ],
)

who.head()


In [None]:
who_both = who[who["sex_name"] == "Both sexes"].copy()
who_both = who_both[who_both["iso3"].notna() & (who_both["iso3"].astype(str) != "")]

summary = {
    "countries": who_both["iso3"].nunique(),
    "mean_age_std": who_both["age_standardized_suicide_rate_2021"].mean(),
    "median_age_std": who_both["age_standardized_suicide_rate_2021"].median(),
}
summary


In [None]:
fig = px.choropleth(
    who_both,
    locations="iso3",
    color="age_standardized_suicide_rate_2021",
    hover_name="location_name",
    color_continuous_scale="Cividis",
    title="WHO 2021 age-standardized suicide rate (Both sexes)",
)
fig


In [None]:
rank = who_both.nlargest(10, "age_standardized_suicide_rate_2021").sort_values(
    "age_standardized_suicide_rate_2021"
)
fig = px.bar(
    rank,
    x="age_standardized_suicide_rate_2021",
    y="location_name",
    orientation="h",
    title="Top 10 age-standardized suicide rates (Both sexes)",
)
fig


In [None]:
scatter = who_both.dropna(subset=["crude_suicide_rate_2021", "age_standardized_suicide_rate_2021"])
fig = px.scatter(
    scatter,
    x="crude_suicide_rate_2021",
    y="age_standardized_suicide_rate_2021",
    color="region_name",
    hover_name="location_name",
    title="Crude vs age-standardized suicide rates",
)
fig


## 5) GBD Depression (DALYs Rate)


### Narrative commentary
DALYs (disability-adjusted life years) combine years of life lost to premature mortality with years lived with disability. Here we focus on the DALYs rate so countries are comparable on a per-100k basis.
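
As a quick arithmetic sketch of the metric: DALYs are the sum of years of life lost (YLL) and years lived with disability (YLD), and the rate divides that total by population. The function names and all input numbers below are hypothetical; the cleaned table already stores the rate in `val`, so nothing here is needed to run the notebook.

```python
def dalys(yll: float, yld: float) -> float:
    """DALYs = years of life lost + years lived with disability."""
    return yll + yld


def rate_per_100k(count: float, population: float) -> float:
    """Express a count as a rate per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical country: depression burden is dominated by disability, not mortality.
total_dalys = dalys(yll=1_200.0, yld=8_800.0)
dalys_rate = rate_per_100k(total_dalys, population=2_000_000)
```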


In [None]:
dep = load_csv(DATA_CLEAN / "gbd_depression_dalys_clean.csv")
dep = numeric(dep, ["val"])

dep = dep[
    (dep["cause_name"] == "Depressive disorders")
    & (dep["measure_name"] == "DALYs (Disability-Adjusted Life Years)")
    & (dep["metric_name"] == "Rate")
    & (dep["sex_name"] == "Both")
]

dep["age_name"].value_counts()


In [None]:
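# Plot the first age group found in the data. unique() follows row order,
# so set age_choice explicitly to pin a specific group.
age_choice = dep["age_name"].dropna().unique().tolist()[0]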
subset = dep[dep["age_name"] == age_choice].copy()
subset = subset[subset["iso3"].notna() & (subset["iso3"].astype(str) != "")]

fig = px.choropleth(
    subset,
    locations="iso3",
    color="val",
    hover_name="location_name",
    color_continuous_scale="Cividis",
    title=f"Depression DALYs rate (Both, {age_choice})",
)
fig


In [None]:
rank = subset.nlargest(20, "val").sort_values("val")
fig = px.bar(
    rank,
    x="val",
    y="location_name",
    orientation="h",
    title=f"Top 20 depression DALYs rate (Both, {age_choice})",
)
fig


In [None]:
age_summary = dep.groupby("age_name", as_index=False)["val"].mean()
age_order = ["<20 years", "20-24 years", "25+ years"]
fig = px.bar(
    age_summary,
    x="age_name",
    y="val",
    title="Average DALYs rate by age group (Both)",
    category_orders={"age_name": age_order},
)
fig


## 6) GBD Addiction (Deaths Rate)


### Narrative commentary
Addiction death rates are broken down by cause and sex. The cell below keeps the 2023 GBD rates, which serve as predictors in the ML dataset.
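
One way such long-format rates could become predictor columns is a pivot to one column per cause and sex. This is only a sketch under assumptions: the wide column naming is hypothetical, the toy values are made up, and it is not necessarily how `merged_ml_country.csv` was built. The column names `iso3`, `cause_name`, `sex_name`, and `val` match the cleaned table.

```python
import pandas as pd

# Toy long-format rows with a placeholder cause name and made-up values.
long_demo = pd.DataFrame(
    {
        "iso3": ["AAA", "AAA", "BBB", "BBB"],
        "cause_name": ["Cause X", "Cause X", "Cause X", "Cause X"],
        "sex_name": ["Female", "Male", "Female", "Male"],
        "val": [1.0, 3.0, 2.0, 4.0],
    }
)

# One row per country, one column per (cause, sex) pair.
wide_demo = long_demo.pivot_table(
    index="iso3", columns=["cause_name", "sex_name"], values="val", aggfunc="first"
)

# Flatten the two-level column index into snake_case names (hypothetical scheme).
wide_demo.columns = ["_".join(pair).lower().replace(" ", "_") for pair in wide_demo.columns]
wide_demo = wide_demo.reset_index()
```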


In [None]:
add = load_csv(DATA_CLEAN / "gbd_addiction_clean.csv")
add = numeric(add, ["val"])
add = add[
    (add["measure_name"] == "Deaths")
    & (add["metric_name"] == "Rate")
    # Compare numerically so a float-typed year (2023.0) still matches.
    & (pd.to_numeric(add["year"], errors="coerce") == 2023)
]

add["cause_name"].value_counts().head()


In [None]:
# Plot the first cause and sex found in the data; set these explicitly to pin a view.
cause = add["cause_name"].dropna().unique().tolist()[0]
sex = add["sex_name"].dropna().unique().tolist()[0]
subset = add[(add["cause_name"] == cause) & (add["sex_name"] == sex)].copy()
subset = subset[subset["iso3"].notna() & (subset["iso3"].astype(str) != "")]

fig = px.choropleth(
    subset,
    locations="iso3",
    color="val",
    hover_name="location_name",
    color_continuous_scale="Cividis",
    title=f"{cause} deaths rate ({sex})",
)
fig


In [None]:
rank = subset.nlargest(20, "val").sort_values("val")
fig = px.bar(
    rank,
    x="val",
    y="location_name",
    orientation="h",
    title=f"Top 20 {cause} deaths rate ({sex})",
)
fig


## 7) GBD Self-harm (Deaths Rate)


### Narrative commentary
Self-harm deaths provide a distinct signal from broader substance-use categories. We keep age and sex filters to show heterogeneity across groups.


In [None]:
sh = load_csv(DATA_CLEAN / "gbd_selfharm_clean.csv")
sh = numeric(sh, ["val"])
sh = sh[
    (sh["cause_name"] == "Self-harm")
    & (sh["measure_name"] == "Deaths")
    & (sh["metric_name"] == "Rate")
]

age = sh["age_name"].dropna().unique().tolist()[0]
sex = sh["sex_name"].dropna().unique().tolist()[0]
subset = sh[(sh["age_name"] == age) & (sh["sex_name"] == sex)].copy()
subset = subset[subset["iso3"].notna() & (subset["iso3"].astype(str) != "")]

fig = px.choropleth(
    subset,
    locations="iso3",
    color="val",
    hover_name="location_name",
    color_continuous_scale="Cividis",
    title=f"Self-harm deaths rate ({sex}, {age})",
)
fig


In [None]:
rank = subset.nlargest(20, "val").sort_values("val")
fig = px.bar(
    rank,
    x="val",
    y="location_name",
    orientation="h",
    title=f"Top 20 self-harm deaths rate ({sex}, {age})",
)
fig


## 8) Probability of Death

This metric is a probability (0-1), not a per-100k rate.


### Narrative commentary
Interpret these values as risk probabilities. Because the scale differs from the per-100k rates in the earlier sections, compare them only with other probability-of-death values.
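
A simple guard against mixing scales is to confirm that every value lies in the closed interval [0, 1]. The helper `out_of_range` below is a hypothetical name for this sketch and is not part of the v1 pipeline.

```python
import math
from typing import Iterable, List


def out_of_range(values: Iterable[float]) -> List[float]:
    """Return non-missing values that fall outside the closed interval [0, 1]."""
    return [v for v in values if not math.isnan(v) and not 0.0 <= v <= 1.0]


# Made-up examples: a clean probability column and one polluted by a per-100k rate.
clean = out_of_range([0.0, 0.012, 1.0, float("nan")])
polluted = out_of_range([0.5, 12.3, -0.1])
```

In this notebook the same check could be run as `out_of_range(prob["val"])`; a non-empty result would suggest that a rate column was mixed in upstream.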


In [None]:
prob = load_csv(DATA_CLEAN / "gbd_prob_death_clean.csv")
prob = numeric(prob, ["val"])
prob = prob[prob["metric_name"] == "Probability of death"]

cause = prob["cause_name"].dropna().unique().tolist()[0]
sex = prob["sex_name"].dropna().unique().tolist()[0]
age = prob["age_name"].dropna().unique().tolist()[0]
subset = prob[(prob["cause_name"] == cause) & (prob["sex_name"] == sex) & (prob["age_name"] == age)]
subset = subset[subset["iso3"].notna() & (subset["iso3"].astype(str) != "")]

fig = px.choropleth(
    subset,
    locations="iso3",
    color="val",
    hover_name="location_name",
    color_continuous_scale="Blues",
    title=f"Probability of death ({cause}, {sex}, {age})",
)
fig


In [None]:
rank = subset.nlargest(20, "val").sort_values("val")
fig = px.bar(
    rank,
    x="val",
    y="location_name",
    orientation="h",
    title=f"Top 20 probability of death ({cause}, {sex}, {age})",
)
fig


## 9) Context Tables (All-cause Trends and Big Categories)


### Narrative commentary
Context tables give macro-level background. All-cause trends provide time context, while big categories show composition in a single year.
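
Composition within a single year can also be read as shares of the total, which is what a treemap's tile areas encode. The sketch below uses a hypothetical helper `shares` and made-up category values.

```python
from typing import Dict


def shares(values: Dict[str, float]) -> Dict[str, float]:
    """Convert per-category values into fractions of their total."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: value / total for name, value in values.items()}


# Made-up values for three placeholder categories.
demo_shares = shares({"Category A": 50.0, "Category B": 30.0, "Category C": 20.0})
```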


In [None]:
allcause = load_csv(DATA_CLEAN / "context_tables" / "context_allcauses_trend.csv")
allcause = numeric(allcause, ["val", "year"])

loc_type = allcause["location_type"].dropna().unique().tolist()[0]
subset = allcause[allcause["location_type"] == loc_type]
location = subset["location_name"].dropna().unique().tolist()[0]
sex = subset["sex_name"].dropna().unique().tolist()[0]
age = subset["age_name"].dropna().unique().tolist()[0]
metric = subset["metric_name"].dropna().unique().tolist()[0]

trend = subset[
    (subset["location_name"] == location)
    & (subset["sex_name"] == sex)
    & (subset["age_name"] == age)
    & (subset["metric_name"] == metric)
].sort_values("year")

fig = px.line(trend, x="year", y="val", markers=True, title=f"{location} | {metric}")
fig


In [None]:
big = load_csv(DATA_CLEAN / "context_tables" / "context_big_categories_2023.csv")
big = numeric(big, ["val"])

location = big["location_name"].dropna().unique().tolist()[0]
sex = big["sex_name"].dropna().unique().tolist()[0]
age = big["age_name"].dropna().unique().tolist()[0]
metric = big["metric_name"].dropna().unique().tolist()[0]

subset = big[
    (big["location_name"] == location)
    & (big["sex_name"] == sex)
    & (big["age_name"] == age)
    & (big["metric_name"] == metric)
]

fig = px.treemap(
    subset,
    path=["cause_name"],
    values="val",
    title=f"Big categories ({location}, {sex}, {age})",
)
fig


## 10) Merged ML Dataset + Baseline Results


### Narrative commentary
The merged ML table links WHO 2021 outcomes with GBD 2023 indicators. This supports correlation analysis but does not imply causality.
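
For readers who want to see what such a link looks like, here is a hedged sketch of a country-level join on `iso3`. The toy frames, the `_demo` names, and the choice of a left join are assumptions for illustration, not a description of how `merged_ml_country.csv` was produced. `validate` and `indicator` are standard `DataFrame.merge` options.

```python
import pandas as pd

# Toy outcome and feature tables with made-up values.
who_demo = pd.DataFrame(
    {
        "iso3": ["AAA", "BBB", "CCC"],
        "age_standardized_suicide_rate_2021": [5.0, 10.0, 15.0],
    }
)
gbd_demo = pd.DataFrame(
    {
        "iso3": ["AAA", "BBB", "DDD"],
        "gbd_depression_dalys_rate_both": [400.0, 500.0, 600.0],
    }
)

# validate raises if iso3 is duplicated on either side;
# indicator adds a _merge column showing where each row came from.
merged_demo = who_demo.merge(
    gbd_demo, on="iso3", how="left", validate="one_to_one", indicator=True
)

# Countries with an outcome but no matching feature row.
unmatched = merged_demo.loc[merged_demo["_merge"] == "left_only", "iso3"].tolist()
```

Listing the unmatched codes makes coverage gaps visible before any correlation is computed on the merged table.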


In [None]:
ml = load_csv(DATA_CLEAN / "merged_ml_country.csv")
ml = numeric(
    ml,
    [
        "age_standardized_suicide_rate_2021",
        "gbd_depression_dalys_rate_both",
        "gbd_addiction_death_rate_both",
        "gbd_selfharm_death_rate_female",
        "gbd_selfharm_death_rate_male",
    ],
)
ml.head()


In [None]:
cols = [
    "age_standardized_suicide_rate_2021",
    "gbd_depression_dalys_rate_both",
    "gbd_addiction_death_rate_both",
    "gbd_selfharm_death_rate_female",
    "gbd_selfharm_death_rate_male",
]

corr = ml[cols].corr()
# Pin the color range to [-1, 1] so the diverging scale is centered on zero.
fig = px.imshow(corr, text_auto=".2f", color_continuous_scale="RdBu", zmin=-1, zmax=1, title="Correlation matrix")
fig


In [None]:
results = load_csv(REPORT_DIR / "ml_baseline_results.csv")
cv = load_csv(REPORT_DIR / "ml_baseline_cv.csv")

results, cv.head()


## 11) Data Quality and Documentation


### Narrative commentary
Check missingness, duplicates, and ISO3 coverage to assess data reliability. The data dictionary and data model document what each table and column contains, which keeps the report transparent.
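
The three checks named above can be sketched for any single table as follows. The helper `quality_summary` and its output keys are hypothetical; the report CSVs loaded below come from the v1 pipeline, and this sketch does not claim to reproduce them.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, key: str = "iso3") -> dict:
    """Summarize row count, exact duplicate rows, key coverage, and missingness."""
    # A key counts as present when it is neither missing nor blank.
    key_present = df[key].fillna("").astype(str).str.strip() != ""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "key_coverage": float(key_present.mean()),
        "missing_share": df.isna().mean().to_dict(),
    }


# Made-up table: one missing ISO3 code and one exact duplicate row.
quality_demo = quality_summary(
    pd.DataFrame(
        {
            "iso3": ["AAA", "BBB", None, "BBB"],
            "val": [1.0, 2.0, 3.0, 2.0],
        }
    )
)
```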


In [None]:
scorecard = load_csv(REPORT_DIR / "data_quality_scorecard.csv")
missingness = load_csv(REPORT_DIR / "data_quality_missingness.csv")
who_quality = load_csv(REPORT_DIR / "data_quality_who_data_quality.csv")

scorecard.head()


In [None]:
fig = px.pie(
    who_quality,
    names="data_quality",
    values="count",
    title="WHO data_quality distribution",
)
fig


In [None]:
from IPython.display import Markdown, display

model_path = REPORT_DIR / "data_model.md"
dict_path = REPORT_DIR / "data_dictionary.md"

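# Read with an explicit encoding so rendering does not depend on the platform default.
if model_path.exists():
    display(Markdown(model_path.read_text(encoding="utf-8")))
else:
    print("Missing data_model.md")

if dict_path.exists():
    display(Markdown(dict_path.read_text(encoding="utf-8")))
else:
    print("Missing data_dictionary.md")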


## 12) Notes

- v1 joins WHO 2021 outcomes with GBD 2023 feature rates.
- Use the ML baseline report in `v1/report/ml_baseline.md` for model detail.
- All v1 visuals in the Streamlit dashboard are reproducible from these tables.
