# v3 Feature Analysis Notebook

This notebook documents the v3 feature tables and runs basic exploratory checks on them.

Inputs:
- `v3/data_clean/v3_features_v1.csv`
- `v3/data_clean/v3_features_v2.csv`
- `v3/report/v3_feature_summary.md`

Notes:
- v3 features are derived from v1 (real) and v2 (synthetic) sources.


## 1) Setup

### Narrative commentary
This notebook is a lightweight audit of the v3 feature tables. It checks coverage (rows, countries, sexes, years), the distribution of `suicide_rate`, and pairwise correlations among the numeric indicators.


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px

pd.set_option("display.max_columns", 200)

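CWD = Path.cwd().resolve()
if (CWD / "v3").exists():
    # Launched from the repository root.
    REPO_ROOT = CWD
elif CWD.name == "notebooks" and (CWD.parent / "data_clean").exists():
    # Launched from <root>/v3/notebooks: the root is two levels up.
    REPO_ROOT = CWD.parents[1]
else:
    # Fallback: assume the current directory is the root; the loads below fail if it is not.
    REPO_ROOT = CWD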

V3_DIR = REPO_ROOT / "v3"
DATA_CLEAN = V3_DIR / "data_clean"
REPORT_DIR = V3_DIR / "report"

DATA_CLEAN, REPORT_DIR
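
A wrong `REPO_ROOT` only surfaces later as a `FileNotFoundError` in the load cell. As a minimal sketch, the inputs listed at the top of the notebook could be checked right after path setup. The helper name `missing_inputs` is hypothetical and not part of the repository.

```python
from pathlib import Path


def missing_inputs(base_dir, relative_paths):
    """Return the relative paths that are not existing files under base_dir."""
    base = Path(base_dir)
    return [rel for rel in relative_paths if not (base / rel).is_file()]
```

In this notebook it could be called as `missing_inputs(REPO_ROOT, ["v3/data_clean/v3_features_v1.csv", "v3/data_clean/v3_features_v2.csv", "v3/report/v3_feature_summary.md"])`; an empty list means every input was found.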


## 2) Load feature tables

### Narrative commentary
We compare the v1-derived feature table with the v2 synthetic table to see coverage and distribution differences.


In [None]:
features_v1 = pd.read_csv(DATA_CLEAN / "v3_features_v1.csv")
features_v2 = pd.read_csv(DATA_CLEAN / "v3_features_v2.csv")

# display() renders each preview as a table; a bare tuple falls back to plain-text repr.
display(features_v1.head(), features_v2.head())


In [None]:
def coverage_row(source, df):
    """Summarize row count, country count, sexes, and year range for one table."""
    return {
        "source": source,
        "rows": len(df),
        "countries": df["iso3"].nunique(),
        # dropna() keeps sorted() from failing on a mix of strings and NaN.
        "sexes": ", ".join(sorted(df["sex_name"].dropna().unique())),
        "years": f"{df['year'].min()}-{df['year'].max()}",
    }

summary = pd.DataFrame([coverage_row("v1", features_v1), coverage_row("v2", features_v2)])
summary
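
The counts above show how large each table is, but not whether the two tables cover the same units. A rough sketch of an overlap check follows, assuming `iso3`, `sex_name`, and `year` together identify a row in both tables (an assumption; the notebook does not state the key). The function name `key_coverage` is hypothetical.

```python
import pandas as pd


def key_coverage(left, right, keys=("iso3", "sex_name", "year")):
    """Count distinct key combinations found in both tables or in only one."""
    keys = list(keys)
    left_keys = left[keys].drop_duplicates()
    right_keys = right[keys].drop_duplicates()
    # indicator=True adds a `_merge` column with both / left_only / right_only.
    merged = left_keys.merge(right_keys, on=keys, how="outer", indicator=True)
    counts = merged["_merge"].value_counts()
    return {name: int(counts.get(name, 0)) for name in ("both", "left_only", "right_only")}
```

Calling `key_coverage(features_v1, features_v2)` would report how many combinations are shared and how many appear in only v1 (`left_only`) or only v2 (`right_only`).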


## 3) Distributions

### Narrative commentary
Distributions help identify skew and extreme values. Compare v1 (real) and v2 (synthetic) shapes.


In [None]:
# The four numeric indicators; reused by the correlation check in section 5.
feature_cols = ["suicide_rate", "depression_dalys_rate", "addiction_death_rate", "selfharm_death_rate"]

fig = px.histogram(features_v1, x="suicide_rate", nbins=30, title="v1: suicide_rate distribution")
fig


In [None]:
fig = px.histogram(features_v2, x="suicide_rate", nbins=30, title="v2: suicide_rate distribution")
fig
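
The histograms cover only `suicide_rate`. To put numbers on skew and extreme values for every indicator, something like the following could sit alongside them. This is a sketch, not part of the existing pipeline; `compare_distributions` is a hypothetical name, and the choice of median, p95, max, and skewness is illustrative.

```python
import pandas as pd


def compare_distributions(frames, cols):
    """Build one row per (source, feature) with quantiles and sample skewness."""
    rows = []
    for source, df in frames.items():
        for col in cols:
            values = df[col].dropna()
            rows.append({
                "source": source,
                "feature": col,
                "n": len(values),
                "median": values.median(),
                "p95": values.quantile(0.95),
                "max": values.max(),
                "skew": values.skew(),  # positive means a longer right tail
            })
    return pd.DataFrame(rows)
```

For example, `compare_distributions({"v1": features_v1, "v2": features_v2}, feature_cols)` returns a table in which a large gap between `p95` and `max`, or a high `skew`, flags a column worth plotting.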


## 4) High-risk labeling (example)

### Narrative commentary
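The v3 risk estimator labels rows as high-risk by percentile. Here we use the 80th percentile (p80) of `suicide_rate` as a simple example: rows at or above the threshold are labeled positive. Because each threshold is computed within its own table, the positive rate is about 20% by construction; ties at the threshold can push it higher.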


In [None]:
cutoff = 0.80
thr_v1 = features_v1["suicide_rate"].quantile(cutoff)
thr_v2 = features_v2["suicide_rate"].quantile(cutoff)

labels = pd.DataFrame([
    {"source": "v1", "cutoff": cutoff, "threshold": thr_v1, "positive_rate": (features_v1["suicide_rate"] >= thr_v1).mean()},
    {"source": "v2", "cutoff": cutoff, "threshold": thr_v2, "positive_rate": (features_v2["suicide_rate"] >= thr_v2).mean()},
])
labels
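
Per-table thresholds give similar positive rates in v1 and v2 even when the underlying distributions differ. One way to expose a shift is to apply the v1 threshold to v2. The sketch below assumes that comparison is of interest; it is not described as part of the v3 risk estimator, and both function names are hypothetical.

```python
import pandas as pd


def label_high_risk(values: pd.Series, cutoff: float = 0.80):
    """Return (threshold, labels); labels are True where values >= the cutoff quantile."""
    threshold = values.quantile(cutoff)
    return threshold, values >= threshold


def positive_rate_at(values: pd.Series, threshold: float) -> float:
    """Share of non-missing values at or above a threshold taken from elsewhere."""
    observed = values.dropna()
    return float((observed >= threshold).mean())
```

With the variables from the cell above, `positive_rate_at(features_v2["suicide_rate"], thr_v1)` gives the share of synthetic rows that would be flagged under the real-data threshold. A value far from 0.20 suggests the two tables differ in the upper tail.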


## 5) Correlation check

### Narrative commentary
This cell shows the pairwise Pearson correlations among the four numeric indicators in the v1 table.


In [None]:
corr = features_v1[feature_cols].corr()
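# Pin the color range to [-1, 1] so the midpoint of the diverging scale sits at zero correlation.
fig = px.imshow(corr, text_auto=".2f", color_continuous_scale="RdBu", zmin=-1, zmax=1, title="v1 correlations")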
fig
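
The heatmap covers v1 only. Since the notebook compares real and synthetic tables elsewhere, a rough sketch of the same comparison for correlation structure is given below. The name `correlation_gap` is hypothetical, and what counts as a large gap is left to the reader.

```python
import pandas as pd


def correlation_gap(real, synthetic, cols, method="pearson"):
    """Element-wise difference of correlation matrices (synthetic minus real)."""
    cols = list(cols)
    return synthetic[cols].corr(method=method) - real[cols].corr(method=method)
```

`correlation_gap(features_v1, features_v2, feature_cols)` returns a square table with zeros on the diagonal; off-diagonal entries far from zero mark indicator pairs whose relationship the synthetic table does not reproduce.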


## 6) Feature summary report

### Narrative commentary
The summary file is generated by `src/v3_prepare_features.py` and records row counts and missingness.


In [None]:
from IPython.display import Markdown, display

summary_path = REPORT_DIR / "v3_feature_summary.md"
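if summary_path.exists():
    # Read as UTF-8 explicitly so rendering does not depend on the platform default encoding.
    display(Markdown(summary_path.read_text(encoding="utf-8")))
else:
    print(f"Missing {summary_path}")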
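
The report's row counts and missingness can be cross-checked against the loaded tables. The layout of `v3_feature_summary.md` is not shown here, so the sketch below only recomputes the figures for a manual comparison; `missingness_table` is a hypothetical helper, not a function from `src/v3_prepare_features.py`.

```python
import pandas as pd


def missingness_table(frames, cols):
    """One row per source: row count plus the fraction of missing values per column."""
    rows = []
    for source, df in frames.items():
        row = {"source": source, "rows": len(df)}
        for col in cols:
            row[f"{col}_missing"] = float(df[col].isna().mean())
        rows.append(row)
    return pd.DataFrame(rows)
```

For example, `missingness_table({"v1": features_v1, "v2": features_v2}, feature_cols)` yields figures that should match the report if both were produced from the same files.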


## 7) Notes

- v3 features are cross-sectional (country-level) inputs for the risk estimator.
- Use this notebook to validate distributions before running the v3 model.
