# 01 - Import & Explore DAIC-WOZ Labels (then Clean)

This notebook **imports** and **explores** the DAIC-WOZ labels first; a small,
clearly documented **cleaning step** comes *after* we understand the data (drop missing values
and duplicates, keep scores in the valid PHQ-8 range). Comments explain the *why* behind each step.

**Goals**
- Load the Depression AVEC 2017 training labels (participant_id + PHQ-8).
- Inspect structure, missing values, and value ranges.
- Visualize the label distribution.
- Do a minimal, safe cleaning pass (documented).
- Save a tiny checkpoint sample (shareable) and a full cleaned artifact (local, ignored by git).


In [None]:
# --- bootstrap PYTHONPATH so repo utilities are importable ------------------------------
# Why: Jupyter often runs from the `notebooks/` folder, so `utils/` (at repo root) isn't on sys.path.
# This cell adds both ROOT and ROOT/utils to sys.path so `from utils.sanity ...` works reliably.
import sys, pathlib
CWD = pathlib.Path.cwd()
ROOT = CWD if (CWD / "utils").exists() else CWD.parent # handles when notebook lives in notebooks/
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))
if str(ROOT / "utils") not in sys.path:
    sys.path.append(str(ROOT / "utils"))
print("ROOT:", ROOT)
print("In sys.path:", str(ROOT) in sys.path, str(ROOT/'utils') in sys.path)

In [None]:
# --- environment sanity & project paths -------------------------------------------------
# Why: print python path/versions and set up canonical DATA folders used throughout the project.
from utils.sanity import sanity_env, setup_paths, set_seeds
sanity_env(pkgs=("pandas","numpy","matplotlib","seaborn","sklearn"))
ROOT, DATA, RAW, CLEAN, OUT = setup_paths()
set_seeds(42)
ROOT, DATA, RAW, CLEAN, OUT

## Step 1 - Discover raw files
List data-like files under `data/raw/` to confirm their locations and names.

In [None]:
from pathlib import Path

def list_files(base: Path, patterns=(".csv", ".tsv", ".xlsx", ".json")):
    """Return (relative_path, sizeMB) for readable data files, sorted by size desc."""
    rows = []
    for p in base.rglob("*"):
        if p.is_file() and p.suffix.lower() in patterns:
            rows.append((p.relative_to(base), round(p.stat().st_size/1_000_000, 2)))
    return sorted(rows, key=lambda x: (-x[1], str(x[0])))

raw_list = list_files(RAW)
print(f"Found {len(raw_list)} candidate files under {RAW}")
for rel, mb in raw_list[:30]:
    print(f"{str(rel):70s} {mb:6.2f} MB")

## Step 2 - Load the Depression AVEC 2017 training labels
We expect a file named like `train_split_Depression_AVEC2017.csv` containing **participant IDs**
and **PHQ-8** totals. Adjust `labels_rel` below if your path differs.
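
Before loading the full file, it can help to confirm that its header carries the columns we expect. The following is a minimal sketch, not part of the repo: `missing_columns` is a hypothetical helper, the inline CSV text is made up for illustration, and the raw column names (`Participant_ID`, `PHQ8_Score`) are an assumption to check against your own copy of the file.

```python
import io
import pandas as pd

def missing_columns(csv_source, expected=("participant_id", "phq8_score")):
    """Read only the header row and report which expected columns are absent."""
    header = pd.read_csv(csv_source, nrows=0)  # nrows=0 parses column names only
    # Same normalization the notebook applies later: strip, lowercase, spaces -> underscores
    normalized = {c.strip().lower().replace(" ", "_") for c in header.columns}
    return [c for c in expected if c not in normalized]

# Made-up stand-in for the real labels file (not actual DAIC-WOZ data).
fake_csv = io.StringIO("Participant_ID,PHQ8_Binary,PHQ8_Score\n1,0,3\n")
missing = missing_columns(fake_csv)
print("Missing expected columns:", missing)
```

With the real file, pass `labels_path` instead of the `StringIO` object; an empty list means both expected columns are present.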


In [None]:
# --- Auto-find a likely labels file under RAW --------------------------------
from pathlib import Path

# Reuse raw_list if it's already defined; otherwise rebuild it quickly.
try:
    raw_list
except NameError:
    def list_files(base: Path, patterns=(".csv", ".tsv", ".xlsx", ".json")):
        rows = []
        for p in base.rglob("*"):
            if p.is_file() and p.suffix.lower() in patterns:
                rows.append((p.relative_to(base), round(p.stat().st_size/1_000_000, 2)))
        return sorted(rows, key=lambda x: (-x[1], str(x[0])))
    raw_list = list_files(RAW)

def score_name(name: str) -> int:
    """Heuristic score: higher means the file name looks more like a labels file."""
    n = name.lower()
    score = 0
    # prioritize depression labels files
    if "depression" in n: score += 3
    if "train" in n: score += 2  # also matches "train_split"
    if "avec" in n: score += 2   # also matches "avec2017"
    if "label" in n or "phq" in n: score += 2
    if n.endswith(".csv"): score += 1
    return score

candidates = sorted([(score_name(str(rel)), rel) for rel, _ in raw_list], reverse=True)
top = [rel for sc, rel in candidates if sc > 0][:10]

print("Top label-like candidates:")
for rel in top:
    print(" ", rel)

# pick the best guess
labels_rel = top[0] if top else None
labels_rel


In [None]:
# --- load the chosen labels file --------------------------------------------
import pandas as pd

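# Use the auto-detected top candidate; fall back to the expected file name if none was found.
labels_path = RAW / (labels_rel if labels_rel is not None else "train_split_Depression_AVEC2017.csv")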
assert labels_path.exists(), f"Labels file not found: {labels_path}"

df = pd.read_csv(labels_path)

print("Original columns:", df.columns.tolist())
df.head(3)


## Step 3 - Normalize column names; choose ID & label columns

In [None]:
# Normalize: strip, lowercase, spaces->underscores for consistent referencing
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
print("Normalized columns:", df.columns.tolist())

# For AVEC 2017 labels we expect these names; override if different:
ID_COL = "participant_id"
LABEL_COL = "phq8_score"

assert ID_COL in df.columns, f"{ID_COL} not in columns: {df.columns.tolist()}"
assert LABEL_COL in df.columns, f"{LABEL_COL} not in columns: {df.columns.tolist()}"

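# Standardize types: IDs as strings; coerce label to numeric.
# The nullable "string" dtype keeps missing IDs as <NA>; plain astype(str) can turn them
# into the literal "nan", which the dropna in Step 6 would not catch.
df[ID_COL] = df[ID_COL].astype("string").str.strip()
df[LABEL_COL] = pd.to_numeric(df[LABEL_COL], errors="coerce")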

print(f"Using ID_COL={ID_COL}, LABEL_COL={LABEL_COL}")
df[[ID_COL, LABEL_COL]].head(5)

## Step 4 - Overview & nulls audit
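
The cell below relies on `data_overview` from this repo's `utils.sanity`, whose implementation is not shown in this notebook. As a rough plain-pandas stand-in for the kind of audit this step needs, here is a sketch under stated assumptions: `null_audit` is a hypothetical helper (not the repo's function) and the toy rows are made up.

```python
import numpy as np
import pandas as pd

def null_audit(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, null count, and null percentage."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_null": frame.isna().sum(),
        "pct_null": (frame.isna().mean() * 100).round(1),
    })

# Made-up toy rows (not real DAIC-WOZ data): one missing score out of four.
toy = pd.DataFrame({
    "participant_id": ["p1", "p2", "p3", "p4"],
    "phq8_score": [3.0, np.nan, 12.0, 7.0],
})
audit = null_audit(toy)
print("shape:", toy.shape)
print(audit)
```

Any non-zero `n_null` in the ID or label column marks rows that the cleaning step in Step 6 will drop.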

In [None]:
# --- Quick overview and null audit -------------------------------------------
# Why: understand the shape of the dataset and check for missing values
# across critical columns (IDs + PHQ8 scores).

from utils.sanity import data_overview

data_overview(df[[ID_COL, LABEL_COL]])


## Step 5 - Label distribution & integrity checks
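
The checks in this step (duplicate IDs, scores outside 0-24) can also be expressed directly in pandas. The sketch below is an illustration only: `integrity_report` is a hypothetical helper, not the repo's `check_integrity`, and the toy rows are made up.

```python
import pandas as pd

def integrity_report(frame, id_col, label_col, label_range=(0, 24)):
    """Count duplicate IDs and labels outside the inclusive valid range."""
    lo, hi = label_range
    labels = frame[label_col].dropna()  # nulls are audited separately
    return {
        "n_rows": len(frame),
        "n_duplicate_ids": int(frame[id_col].duplicated().sum()),
        "n_out_of_range": int((~labels.between(lo, hi)).sum()),
    }

# Made-up toy rows: "p2" appears twice and one score (30) is impossible for PHQ-8.
toy = pd.DataFrame({
    "participant_id": ["p1", "p2", "p2", "p3"],
    "phq8_score": [3, 30, 8, 24],
})
report = integrity_report(toy, "participant_id", "phq8_score")
print(report)
# {'n_rows': 4, 'n_duplicate_ids': 1, 'n_out_of_range': 1}
```

`duplicated()` flags only the second and later occurrences of an ID, and `between` is inclusive at both ends, so a score of 24 counts as valid.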

In [None]:
# --- Label distribution and integrity checks ---------------------------------
# Why: 
# - visualize the PHQ8 score distribution (class imbalance, expected range 0-24)
# - check for duplicate participant IDs
# - confirm all scores fall in the valid range

from utils.sanity import label_balance, check_integrity

label_balance(df, label_col=LABEL_COL, binary_col="phq8_binary")

check_integrity(
    df,
    id_col=ID_COL,
    label_col=LABEL_COL,
    label_range=(0, 24),  # valid PHQ8 score range
)


## Step 6 - Minimal cleaning (documented, reversible)
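
To make the cleaning pass easy to document, one option is to record the row count after each filter. This is a minimal sketch on made-up toy rows, assuming the same three filters as the cell below; the original frame is never modified, which is what keeps the step reversible.

```python
import numpy as np
import pandas as pd

ID_COL, LABEL_COL = "participant_id", "phq8_score"

# Made-up toy rows: one missing score, one duplicate ID, one impossible score.
toy = pd.DataFrame({
    ID_COL: ["p1", "p2", "p3", "p3", "p4"],
    LABEL_COL: [3.0, np.nan, 12.0, 12.0, 99.0],
})

# Apply each filter separately so every drop can be attributed to a rule.
steps = {"start": toy}
steps["drop_na"] = steps["start"].dropna(subset=[ID_COL, LABEL_COL])
steps["drop_dupes"] = steps["drop_na"].drop_duplicates(subset=[ID_COL])
steps["in_range"] = steps["drop_dupes"][steps["drop_dupes"][LABEL_COL].between(0, 24)]

counts = {name: len(frame) for name, frame in steps.items()}
print(counts)
# {'start': 5, 'drop_na': 4, 'drop_dupes': 3, 'in_range': 2}
```

Logging these counts next to the cleaned artifact shows at a glance how many rows each rule removed.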

In [None]:
# --- Minimal cleaning --------------------------------------------------------
# Why:
# - Drop rows missing ID or PHQ8 score (can't be used downstream).
# - Drop any duplicate IDs (should be 0, but defensive coding is good).
# - Keep only scores in the valid PHQ8 range (0-24).

clean = (
    df.dropna(subset=[ID_COL, LABEL_COL])              # drop rows with null ID or score
      .drop_duplicates(subset=[ID_COL])                 # keep first occurrence per participant
      .query(f"{LABEL_COL} >= 0 & {LABEL_COL} <= 24")   # enforce valid score range
      .copy()
)

# standardize ID again to string + strip whitespace
clean[ID_COL] = clean[ID_COL].astype(str).str.strip()

print("Before -> After:", len(df), "->", len(clean))
clean.head()


## Step 7 - Save artifacts (tiny sample + full cleaned)

In [None]:
from utils.sanity import save_checkpoint

# Tiny sample (<=200 rows) -> safe for versioning if you want
_ = save_checkpoint(clean, OUT / "eda_sample.parquet", n=min(200, len(clean)))

# Full cleaned labels -> goes to data/cleaned (ignored by git via .gitignore)
CLEAN.mkdir(parents=True, exist_ok=True)
full_path = CLEAN / "labels_clean.parquet"
clean.to_parquet(full_path, index=False)
print("Saved:", full_path)

## Appendix - Working data dictionary (update as you go)
- `participant_id` *(string)* - normalized unique subject identifier.
- `phq8_score` *(int 0..24)* - PHQ-8 depression severity total.

**Notes:** Add any additional columns you plan to join later (e.g., demographics, transcripts).
Document units and valid ranges here as they become relevant.
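
Step 5 refers to a `phq8_binary` column. If you add it to this dictionary, it is worth documenting how it relates to `phq8_score`. The sketch below assumes the commonly used PHQ-8 cutoff of a total score of 10 or more; that threshold, the helper `add_binary_label`, and the column name `phq8_binary_derived` are assumptions for illustration, so compare the result against the dataset's own binary column before relying on it.

```python
import pandas as pd

PHQ8_CUTOFF = 10  # assumed threshold; verify against the dataset's own binary column

def add_binary_label(frame, score_col="phq8_score", out_col="phq8_binary_derived"):
    """Return a copy with a 0/1 column: 1 where score >= PHQ8_CUTOFF."""
    out = frame.copy()  # leave the caller's frame untouched
    out[out_col] = (out[score_col] >= PHQ8_CUTOFF).astype(int)
    return out

# Made-up toy rows straddling the cutoff.
toy = pd.DataFrame({"participant_id": ["p1", "p2", "p3"], "phq8_score": [9, 10, 24]})
labeled = add_binary_label(toy)
print(labeled)
```

A quick `(labeled["phq8_binary_derived"] == labeled["phq8_binary"]).all()` on the real data would confirm whether the provided binary column follows the same rule.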


---
### ✅ Commit suggestions (use from terminal)
- `EDA: env sanity + paths; discovered raw files`
- `EDA: loaded AVEC 2017 labels; normalized columns; set ID+label`
- `EDA: overview/nulls, distribution, integrity checks`
- `EDA: minimal cleaning; saved sample + full cleaned parquet`

### Next
- Join with other tables (e.g., demographics) using `participant_id`.
- Begin `feature/model-baselines` branch: train/test split + baseline classifier.
