
# Notebook 03 - Feature Engineering & Multimodal Inputs

**Project:** Recognizing the Unseen - A Multimodal, Trauma-Informed AI Framework 
**Goal of this notebook:** engineer features beyond PHQ-8 and prepare multimodal inputs (text, audio, video) for downstream modeling.

**Builds on:** 
- Notebook 01: Import, clean, EDA (labels + minimal cleaning) 
- Notebook 02: Baselines (Dummy vs. Logistic), ROC/PR, coefficient plots, interactive sliders + thresholds 



## Contents
1. Data sources & setup 
2. SMT guardrails (Z3) for data integrity and split hygiene 
3. Feature engineering 
   - Tabular (PHQ-8)
   - Text (transcript embeddings)
   - Audio (prosody)
   - Video (facial action units)
4. Multimodal dataset assembly 
5. Artifacts (saved processed data) 
6. Limitations & reproducibility 
7. Closing summary & next steps



## 1) Data sources & setup

Load the cleaned PHQ-8 labels and set up placeholders for additional modalities. 
This cell focuses on reading already-prepared artifacts from prior notebooks and defining
conventions for participant/session keys.


In [None]:
# =============================================================================
# Imports & Paths  +  Data sources & setup (canonical one-stop block)
# -----------------------------------------------------------------------------
# - Imports (numpy/pandas, etc.) + quick diagnostics (Python/pandas version)
# - Resolve repo root and standard data directories
# - Define canonical join/target names used across Notebook 03
# - Load cleaned labels (from Notebook 01/02 output) with graceful fallback
# - Normalize common column-name variants -> {participant_id, label, split}
# - Print a concise summary for reviewers
# =============================================================================

# --- Imports -----------------------------------------------------------------
from pathlib import Path
import pandas as pd
import numpy as np

# (Optional) quick diagnostics to make runtime context explicit
import platform
print("Python:", platform.python_version())
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
# Optional: show whether any BLAS info is available (won't crash if missing)
blas = getattr(getattr(np, "__config__", object()), "blas_opt_info", {})
print("BLAS info found:", bool(blas))

# --- Paths (single source of truth) ------------------------------------------
# Reuse ROOT_DIR if already set (e.g., by earlier cells); otherwise derive it.
try:
    ROOT_DIR
except NameError:
    ROOT_DIR = Path.cwd().resolve().parent  # notebooks/ -> repo root

DATA_DIR      = ROOT_DIR / "data"
CLEAN_DIR     = DATA_DIR / "cleaned"       # outputs from NB01/NB02 (adjust if your layout differs)
PROCESSED_DIR = DATA_DIR / "processed"     # downstream features go here
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# --- Canonical names & expected labels artifact ------------------------------
JOIN_KEY = "participant_id"                # canonical join key used throughout NB03
TARGET   = "label"                         # canonical binary target (0/1)
LABELS_PATH = CLEAN_DIR / "labels_clean.parquet"   # <- adjust here if NB01/02 writes elsewhere

# --- Load labels with guardrails (degrade gracefully if file not ready) ------
if LABELS_PATH.exists():
    labels_df = pd.read_parquet(LABELS_PATH)
    print(f"Loaded labels: {LABELS_PATH} | shape={labels_df.shape}")
else:
    print(f'NOTE: labels_clean.parquet not found -> {LABELS_PATH}')
    print("      (Run Notebook 01/02 to generate, or adjust LABELS_PATH if it moved.)")
    # Create empty skeleton so downstream checks in Step 2 can SKIP cleanly
    labels_df = pd.DataFrame(columns=[JOIN_KEY, TARGET, "split"])

# --- Normalize column names to our canonical schema --------------------------
rename_map = {}
# join key variants
if "subject_id" in labels_df.columns and JOIN_KEY not in labels_df.columns:
    rename_map["subject_id"] = JOIN_KEY
if "id" in labels_df.columns and JOIN_KEY not in labels_df.columns:
    rename_map["id"] = JOIN_KEY
# target variants
if "target" in labels_df.columns and TARGET not in labels_df.columns:
    rename_map["target"] = TARGET
if "phq8_binary" in labels_df.columns and TARGET not in labels_df.columns:
    rename_map["phq8_binary"] = TARGET

if rename_map:
    labels_df = labels_df.rename(columns=rename_map)

# Ensure placeholders exist (keeps downstream guardrails readable & safe)
if JOIN_KEY not in labels_df.columns:
    labels_df[JOIN_KEY] = pd.Series(dtype="object")
    print(f'NOTE: added empty "{JOIN_KEY}" column')
if TARGET not in labels_df.columns:
    labels_df[TARGET] = pd.Series(dtype="Int64")  # nullable int; matches {0,1}
    print(f'NOTE: added empty "{TARGET}" column')
if "split" not in labels_df.columns:
    labels_df["split"] = pd.Series(dtype="string")
    print('NOTE: added empty "split" column')

# --- Reviewer-friendly summary ----------------------------------------------
n_rows = len(labels_df)
sample_cols = [c for c in [JOIN_KEY, TARGET, "split"] if c in labels_df.columns]
print(f"labels_df: {n_rows} rows, {labels_df.shape[1]} cols | has {sample_cols} | head:")
print(labels_df[sample_cols].head(5))



## 2) SMT guardrails (Z3) for data integrity and split hygiene

We add lightweight **formal checks** to catch structural mistakes early:

- Temporal event sanity: `onset < apex < offset ≤ n_frames - 1`
- Window safety: each feature window stays within clip bounds
- Sampling consistency: `fps > 0` and `duration ≈ frames / fps`
- Split hygiene: subject-disjoint train/val/test; minimum class presence per split 
- Label domain checks: labels belong to the expected set
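
The timing, window, and sampling rules above reduce to a few comparisons. The sketch below writes them as plain-Python predicates to make the intended semantics concrete. It is not the Z3-backed code in `verification.py`, whose implementation is not shown in this notebook; the function names, the `0 <= onset` lower bound, the `offset <= n_frames - 1` upper bound, and the 5% tolerance are all illustrative assumptions.

```python
import math


def event_triplet_ok(onset: int, apex: int, offset: int, n_frames: int) -> bool:
    """True when 0 <= onset < apex < offset <= n_frames - 1 (assumed bounds)."""
    return 0 <= onset < apex < offset <= n_frames - 1


def window_in_bounds(start: int, length: int, n_frames: int) -> bool:
    """True when the half-open window [start, start + length) lies in [0, n_frames)."""
    return length > 0 and start >= 0 and start + length <= n_frames


def sampling_consistent(frames: int, fps: float, duration_sec: float,
                        rel_tol: float = 0.05) -> bool:
    """True when fps > 0 and duration_sec is within rel_tol of frames / fps."""
    if fps <= 0:
        return False
    return math.isclose(frames / fps, duration_sec, rel_tol=rel_tol)


print(event_triplet_ok(10, 25, 40, n_frames=100))
print(window_in_bounds(start=95, length=10, n_frames=100))
print(sampling_consistent(frames=300, fps=30.0, duration_sec=10.0))
```

A solver-based version would state the same inequalities as constraints; the predicates here are only meant to pin down what each rule accepts and rejects.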


In [None]:
# =============================================================================
# STEP 2 - SMT GUARDRAILS (Z3) + SPLIT HYGIENE
# Goal:
#   - Enforce label domain and (optionally) split integrity with small, readable
#     checks that fail-fast when assumptions break.
#   - Keep notebook executable even when artifacts are not ready (print & skip).
# Why:
#   - Early structural checks catch silent drift (e.g., wrong label domain, ID
#     overlap across splits) before modeling.
# =============================================================================

# Make repo-root imports work from inside notebooks/
# Why: the kernel's CWD is often `notebooks/`, while `verification.py` lives at repo root.
import sys
from pathlib import Path

def _find_repo_root(filename: str = "verification.py") -> Path | None:
    """Walk upward from CWD until `filename` is found; return its parent (repo root)."""
    here = Path.cwd().resolve()
    for p in [here, *here.parents]:
        if (p / filename).exists():
            return p
    return None

# Reuse ROOT_DIR from Step 1 if present; otherwise resolve it robustly here.
try:
    ROOT_DIR
except NameError:
    ROOT_DIR = _find_repo_root() or Path.cwd().resolve().parent  # fallback: notebooks/ -> repo root

# Ensure the root is importable
if ROOT_DIR and str(ROOT_DIR) not in sys.path:
    sys.path.append(str(ROOT_DIR))

# Now safe(ish) to import guardrail utilities; if unavailable, degrade gracefully.
_guardrails_loaded = False
try:
    from verification import (
        check_event_triplet,             # example timing check for onset/apex/offset
        check_window_bounds,             # window [start, start+len) within [0, n)
        check_sampling_consistency,      # duration ≈ frames/fps
        assert_disjoint_splits,          # no subject overlap across splits
        min_class_presence,              # per-split class counts ≥ threshold
        assert_label_domain,             # labels in allowed set
        verify_env,                      # tiny runtime report for smoke tests/CI
    )
    _guardrails_loaded = True
except Exception as e:
    print(f"SKIP: guardrail utilities not importable ({type(e).__name__}: {e}). "
          "Proceeding without hard checks so the notebook stays runnable.")

JOIN_KEY = "participant_id"
TARGET   = "label"

# ---- 2.1 Label hygiene ------------------------------------------------------
if not _guardrails_loaded:
    print("SKIP: label checks (verification.py not loaded).")
elif "labels_df" not in globals() or labels_df is None or labels_df.empty or TARGET not in labels_df.columns:
    print("SKIP: label checks (labels_df empty or target column missing).")
else:
    # Domain guarantee → reviewers see intent: binary classification (0/1).
    assert_label_domain(labels_df[TARGET], allowed=(0, 1))
    print("OK: label domain is restricted to {0, 1}.")

# ---- 2.2 Split hygiene (optional) -------------------------------------------
# If you already created a split in Notebook 02, this validates it.
if not _guardrails_loaded:
    print("SKIP: split checks (verification.py not loaded). "
          "Create deterministic splits in Notebook 02/03 before modeling.")
elif ("labels_df" in globals() and labels_df is not None and not labels_df.empty
      and ("split" in labels_df.columns) and (JOIN_KEY in labels_df.columns)):
    # Extract subject IDs per split (keeps checks explainable & auditable).
    train_ids = labels_df.loc[labels_df["split"] == "train", JOIN_KEY]
    val_ids   = labels_df.loc[labels_df["split"] == "val",   JOIN_KEY]
    test_ids  = labels_df.loc[labels_df["split"] == "test",  JOIN_KEY]

    # (a) No subject overlap across splits
    assert_disjoint_splits(train_ids, val_ids, test_ids)
    print("OK: no subject overlap across splits (train/val/test).")

    # (b) Minimum per-class support in each split → guards against degenerate folds
    min_class_presence(
        {
            "train": labels_df.loc[labels_df["split"] == "train", TARGET],
            "val":   labels_df.loc[labels_df["split"] == "val",   TARGET],
            "test":  labels_df.loc[labels_df["split"] == "test",  TARGET],
        },
        min_count=5  # Adjust with dataset size; aim to preserve evaluation stability.
    )
    print("OK: each split meets minimum class presence thresholds.")
else:
    print('SKIP: split checks (no "split" column yet). '
          "Create deterministic splits in Notebook 02/03 before modeling.")

# ---- 2.3 Timing/window sanity (optional, runs only if variables provided) ---
# These are examples; they will quietly skip if you haven't defined the inputs yet.
# Rationale: keeps nbconvert/CI green while still documenting expectations.

if _guardrails_loaded:
    # Example A: sampling consistency for a video segment: frames / fps ≈ duration
    try:
        ok, msg = check_sampling_consistency(
            frames=int(video_frames),        # define upstream when available
            fps=float(video_fps),
            duration_sec=float(video_duration_sec)
        )
        print("Video sampling check:", msg)
    except Exception:
        # Not available yet; that is expected in early drafts.
        pass

    # Example B: generic window bounds (e.g., feature extraction slices)
    try:
        ok, msg = check_window_bounds(
            start=int(win_start),            # define upstream when available
            length=int(win_len),
            n_frames=int(total_frames)
        )
        print("Window bounds check:", msg)
    except Exception:
        pass
else:
    print("SKIP: timing/window checks (verification.py not loaded).")

print("Guardrail checks completed.")

# =============================================================================
# Smoke test - confirm guardrail utilities are importable and show env facts
# =============================================================================
_loaded = globals().get("_guardrails_loaded", False)

if _loaded:
    try:
        import verification
        print(f"Verification module loaded from: {verification.__file__}")
        want = [
            "check_event_triplet",
            "check_window_bounds",
            "check_sampling_consistency",
            "assert_disjoint_splits",
            "min_class_presence",
            "assert_label_domain",
            "verify_env",
        ]
        available = [name for name in want if getattr(verification, name, None)]
        missing   = [name for name in want if name not in available]
        print("Available guardrail functions:", available)
        if missing:
            print("Note: missing in verification.py ->", missing)
        # One-line environment report (nice for CI and Dr. S)
        try:
            print("Env:", verification.verify_env())
        except Exception:
            print("Env: verify_env() raised; skipping.")
    except Exception as e:
        print(f"Smoke test warning: import succeeded but inspection failed ({type(e).__name__}: {e})")
else:
    print("Smoke test: verification.py not loaded (see SKIP messages above).")

# Show where ROOT_DIR resolved to (useful for CI/review logs)
print("Resolved ROOT_DIR:", ROOT_DIR if "ROOT_DIR" in globals() else "<not set>")

# Peek at the first few sys.path entries to confirm import order
print("sys.path[0:3]:", sys.path[:3])






> 💡 **Workflow tip:** Run the checks immediately after loading each modality. Fail fast with clear errors so issues don't propagate into modeling.
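
One way to apply this tip is a small validator called right after each modality table is loaded. The sketch below is a hypothetical helper (`validate_modality_table` is not part of `verification.py`), and the toy `audio` frame exists only to trigger the duplicate-key branch.

```python
import pandas as pd


def validate_modality_table(df: pd.DataFrame, keys: list[str], name: str) -> None:
    """Raise ValueError if key columns are missing, contain nulls, or repeat."""
    missing = [k for k in keys if k not in df.columns]
    if missing:
        raise ValueError(f"{name}: missing key column(s) {missing}")
    if df[keys].isna().any().any():
        raise ValueError(f"{name}: null values in key column(s) {keys}")
    dupes = int(df.duplicated(subset=keys).sum())
    if dupes:
        raise ValueError(f"{name}: {dupes} duplicate key row(s)")


# Toy table: subject 302 appears twice for the same session.
audio = pd.DataFrame({
    "subject_id": [301, 302, 302],
    "session_id": [1, 1, 1],
    "f0_mean": [182.4, 120.9, 121.3],
})

try:
    validate_modality_table(audio, ["subject_id", "session_id"], "audio")
    audio_status = "ok"
except ValueError as err:
    audio_status = str(err)

print(audio_status)
```

Raising instead of printing keeps the failure close to the load step, so a bad table cannot silently reach the merge.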



## 3) Feature engineering
We create modality-specific features. Start simple and keep everything **reproducible**.

### 3.1 Tabular (PHQ-8)
- Standardize numeric PHQ-8 items.
- (Optional) Create low-order interaction terms for hypothesis-driven pairs.
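
As a minimal illustration of both bullets, the sketch below z-scores two PHQ-8 items by hand and multiplies them into one interaction term. The toy values and the choice of the sleep/tired pair are assumptions for illustration only, not a pairing this project has validated.

```python
import pandas as pd

# Toy item scores on the 0-3 scale; column names follow the PHQ-8 schema used here.
toy = pd.DataFrame({
    "phq8_sleep": [0, 1, 2, 3],
    "phq8_tired": [3, 1, 1, 3],
})

# z-score with the population standard deviation (ddof=0), the same convention
# scikit-learn's StandardScaler uses.
z = (toy - toy.mean()) / toy.std(ddof=0)
z.columns = [f"{c}_z" for c in toy.columns]

# One low-order interaction: the product of the two standardized items.
z["sleep_x_tired"] = z["phq8_sleep_z"] * z["phq8_tired_z"]

print(z.round(3))
```

Standardizing before multiplying keeps the interaction term on a scale comparable to the main effects.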


In [None]:
# Quick probe: see all columns that look PHQ-related
[c for c in labels_df.columns if "phq" in str(c).lower()]


In [None]:
# =============================================================================
# 3.1 Tabular (PHQ-8) - Clinical-style imputation + optional rounding
# Goal:
#   - Build interpretable PHQ-8 features (sum/mean/missingness, z-scores).
#   - Clinical scoring:
#       * If ≤1 item missing → impute that item with the row mean, then sum.
#       * If ≥2 items missing → leave score NaN (no aggressive imputation).
#   - Optional rounding of the final score to match reporting conventions.
#   - After scoring, zero-fill item columns for downstream models (documented).
# =============================================================================

import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

TAB_OUT = PROCESSED_DIR / "tabular_phq8.parquet"

# ---- Explicit PHQ-8 schema pin (order matters: items 1..8) ------------------
PHQ8_COLS = [
    "phq8_nointerest",      # 1  little interest/pleasure
    "phq8_depressed",       # 2  feeling down/depressed/hopeless
    "phq8_sleep",           # 3  sleep problems
    "phq8_tired",           # 4  low energy/tired
    "phq8_appetite",        # 5  appetite/eating
    "phq8_failure",         # 6  feeling bad/failure/worthless/guilty
    "phq8_concentrating",   # 7  trouble concentrating
    "phq8_moving",          # 8  psychomotor (restless/slow)
]
REQUIRE_ALL_ITEMS = True  # set False to proceed if any items missing

# Choose rounding for the total score: "nearest" | "bankers" | "floor" | "ceil" | None
SCORE_ROUNDING = "nearest"

# ---- Guard schema presence ---------------------------------------------------
missing_items = [c for c in PHQ8_COLS if c not in labels_df.columns]
if missing_items:
    msg = f"PHQ-8 schema mismatch: missing {len(missing_items)} column(s): {missing_items}"
    if REQUIRE_ALL_ITEMS:
        raise AssertionError(msg)
    else:
        print("WARNING:", msg, "-> proceeding with available items only.")
        PHQ8_COLS = [c for c in PHQ8_COLS if c in labels_df.columns]

if not PHQ8_COLS:
    print("SKIP: No PHQ-8 item columns available; tabular features will be empty.")
    tab_df = pd.DataFrame(columns=[JOIN_KEY, TARGET])
else:
    # ---- 3.1.1 Assemble base frame ------------------------------------------
    base_cols = [c for c in [JOIN_KEY, TARGET] if c in labels_df.columns]
    tab_df = labels_df[base_cols + PHQ8_COLS].copy()

    # Coerce items to numeric safely (handles stray strings gracefully)
    items = tab_df[PHQ8_COLS].apply(pd.to_numeric, errors="coerce")

    # ---- 3.1.2 Clinical-style imputation & scoring ---------------------------
    missing_ct = items.isna().sum(axis=1)        # items missing per row
    row_mean   = items.mean(axis=1, skipna=True) # mean of answered items

    # Impute only when exactly 1 item is missing
    items_imputed = items.copy()
    mask_impute = missing_ct.le(1) & missing_ct.gt(0)  # (0 < missing ≤ 1)
    items_imputed.loc[mask_impute] = (
        items_imputed.loc[mask_impute].T
        .fillna(row_mean[mask_impute])  # broadcast row-wise means into NaNs
        .T
    )

    # Score:
    #  - If ≥2 items missing → keep NaN (min_count enforces that)
    #  - Else → sum imputed row
    tab_df["phq8_missing_count"] = missing_ct
    tab_df["phq8_sum"]  = items_imputed.sum(axis=1, min_count=len(PHQ8_COLS) - 1)
    tab_df["phq8_mean"] = items_imputed.mean(axis=1, skipna=True)

    # ---- 3.1.3 Optional rounding to match reporting conventions -------------
    if SCORE_ROUNDING == "nearest":
        s = tab_df["phq8_sum"]
        tab_df["phq8_sum"] = np.sign(s) * np.floor(np.abs(s) + 0.5)  # half-away-from-zero
    elif SCORE_ROUNDING == "bankers":
        tab_df["phq8_sum"] = tab_df["phq8_sum"].round(0)
    elif SCORE_ROUNDING == "floor":
        tab_df["phq8_sum"] = np.floor(tab_df["phq8_sum"])
    elif SCORE_ROUNDING == "ceil":
        tab_df["phq8_sum"] = np.ceil(tab_df["phq8_sum"])
    # else: leave fractional totals as-is

    # ---- 3.1.4 Post-scoring zero-fill for model inputs (documented choice) ---
    # Keeps rows dense for models while preserving clinically faithful 'phq8_sum'.
    tab_df[PHQ8_COLS] = items.fillna(0)

    # ---- 3.1.5 Standardize numeric features (excluding target & ID) ----------
    num_cols = [c for c in tab_df.columns
                if c not in [JOIN_KEY, TARGET] and pd.api.types.is_numeric_dtype(tab_df[c])]
    if num_cols:
        scaler = StandardScaler()
        tab_df[[f"{c}_z" for c in num_cols]] = scaler.fit_transform(tab_df[num_cols])
        print(f"Scaled {len(num_cols)} numeric columns -> *_z")
    else:
        print("NOTE: No numeric columns to scale.")

# ---- 3.1.6 Save & reviewer preview -----------------------------------------
try:
    tab_df.to_parquet(TAB_OUT, index=False)
    print("Saved tabular PHQ-8 ->", TAB_OUT, "| shape=", tab_df.shape)
except Exception as e:
    print("SKIP save:", type(e).__name__, "-", e)

show_cols = [JOIN_KEY, TARGET] + PHQ8_COLS + ["phq8_missing_count", "phq8_sum", "phq8_mean"]
show_cols = [c for c in show_cols if c in tab_df.columns]
print("tab_df preview:")
print(tab_df[show_cols].head(5))

# ---- Optional QA against any provided 'phq8_score' column -------------------
if "phq8_score" in labels_df.columns:
    try:
        orig = pd.to_numeric(labels_df["phq8_score"], errors="coerce")
        agree = (orig.fillna(-1).astype(float) == tab_df["phq8_sum"].fillna(-2).astype(float)).sum()
        print(f"QA: phq8_sum (clinical + rounding) vs phq8_score agreement: {agree}/{len(tab_df)} rows")
    except Exception:
        print("QA: could not compare to 'phq8_score' (non-fatal).")




   

In [None]:
# =============================================================================
# PHQ-8 QA (optional): compare our phq8_sum to provided phq8_score
# =============================================================================
if "phq8_score" in labels_df.columns and "phq8_sum" in tab_df.columns:
    orig = pd.to_numeric(labels_df["phq8_score"], errors="coerce")
    ours = pd.to_numeric(tab_df["phq8_sum"], errors="coerce")

    mismask = orig.fillna(-1).astype(float) != ours.fillna(-2).astype(float)
    mism_idx = mismask[mismask].index
    n_mis = int(mismask.sum())

    print(f"QA: mismatches (ours vs provided): {n_mis}/{len(tab_df)} rows")
    if n_mis:
        cols = [JOIN_KEY, "phq8_score", "phq8_sum", "phq8_mean", "phq8_missing_count"] + PHQ8_COLS
        # Show up to 5 examples
        preview = tab_df.loc[mism_idx, [c for c in cols if c in tab_df.columns]].head(5).copy()
        # Add the provided score for clarity (from labels_df)
        preview["phq8_score_src"] = labels_df.loc[preview.index, "phq8_score"]
        display(preview)
else:
    print("QA: skipped (no 'phq8_score' column or 'phq8_sum' not computed).")




---
### PHQ-8 tabular features: interpretation & key takeaways

**What we did**
- Pinned PHQ-8 item schema: ["phq8_nointerest","phq8_depressed","phq8_sleep","phq8_tired","phq8_appetite","phq8_failure","phq8_concentrating","phq8_moving"].
- Clinical-style scoring:
  - If ≤1 item missing: imputed the missing item with the row mean of answered items, then summed.
  - If ≥2 items missing: left the score as NaN (no aggressive imputation).
- Optional rounding: set to "nearest" so totals match typical reporting.
- After scoring, zero-filled item columns for modeling, and z-scored numeric features for comparability.
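
A worked example of the scoring rule for a single respondent, written without pandas so each step is visible. The function name `phq8_total` is illustrative, and `None` stands in for a missing item.

```python
import math
from typing import Optional, Sequence


def phq8_total(items: Sequence[Optional[float]]) -> Optional[int]:
    """Score eight PHQ-8 items (0-3 each); None marks a missing item."""
    answered = [v for v in items if v is not None]
    n_missing = len(items) - len(answered)
    if n_missing >= 2:
        return None  # too many gaps: leave the row unscored
    row_mean = sum(answered) / len(answered)
    total = sum(answered) + n_missing * row_mean  # fill the single gap, if any
    return int(math.floor(total + 0.5))  # round to nearest (totals are non-negative)


print(phq8_total([1, 2, 0, 3, None, 1, 2, 1]))     # one gap: imputed, then rounded
print(phq8_total([1, None, 0, 3, None, 1, 2, 1]))  # two gaps: unscored
```

In the first call the seven answered items sum to 10, the imputed item contributes 10/7, and the total of about 11.43 rounds to 11.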

**Guardrails & QA**
- Label domain and split checks run in Step 2 (fail-fast or SKIP cleanly).
- PHQ-8 QA: our computed "phq8_sum" vs provided "phq8_score" → **107/107** agreement with rounding ("nearest").

**Results snapshot**
- Saved to `data/processed/tabular_phq8.parquet`.
- Shape: **(107, 24)** (ID, label, 8 items, missing_count, sum, mean, and z-scored variants).
- Missingness: `phq8_missing_count` shows per-row item gaps; rows with ≥2 missing keep `phq8_sum` as NaN.

**How to read the features**
- `phq8_sum`: total symptom burden (higher = more severe).
- `phq8_mean`: average per-item severity (robust when one item is imputed).
- `phq8_missing_count`: data quality indicator; consider as a covariate or filter in sensitivity analyses.
- `*_z`: standardized versions for models that benefit from scaled inputs.

**Decisions (documented)**
- Rounding: used "nearest" to mirror the provided clinical scores (prevents off-by-one drift when one item is imputed).
- Post-scoring zero-fill: keeps downstream models dense without altering the clinically faithful `phq8_sum`.
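
To see why the rounding mode matters once an item is imputed, the sketch below applies the four modes to one fractional total. The input (seven answered items summing to 11) is a made-up example, and `half_away_from_zero` is an illustrative name for the "nearest" rule.

```python
import math


def half_away_from_zero(x: float) -> float:
    """Round to the nearest integer, sending exact halves away from zero."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


# Seven answered items summing to 11, plus their mean (11 / 7) for the gap.
total = 11 + 11 / 7  # about 12.571

rounded = {
    "nearest": half_away_from_zero(total),
    "bankers": float(round(total)),  # Python's round() sends halves to even
    "floor": float(math.floor(total)),
    "ceil": float(math.ceil(total)),
}
print(rounded)

# The two round-to-nearest variants differ only on exact halves.
print(half_away_from_zero(2.5), round(2.5))
```

Here `floor` lands one point below the other three modes, which is the kind of off-by-one drift the decision above is meant to avoid.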

**Limitations**
- Row-mean imputation for a single missing item is simple and standard, but still an assumption.
- Rows with ≥2 missing items are not scored; downstream models should either ignore `phq8_sum` for those rows or handle NaNs explicitly.

**Recommended next steps**
- Sensitivity check: run models with and without rounding; confirm conclusions are stable.
- Optionally add `phq8_flag_gt1_missing = 1{missing_count ≥ 2}` as an exclusion flag or covariate.
- Proceed to 3.2 (Text) to add linguistic signals; the tabular block provides a solid baseline.
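
The exclusion flag suggested above is a one-liner on top of `phq8_missing_count`. A minimal sketch on a toy frame (the four counts are made up):

```python
import pandas as pd

toy = pd.DataFrame({"phq8_missing_count": [0, 1, 2, 5]})

# 1 when two or more items are missing (the row has no clinical score), else 0.
toy["phq8_flag_gt1_missing"] = (toy["phq8_missing_count"] >= 2).astype(int)

# Use as a filter for a sensitivity analysis, or keep the column as a covariate.
scored_only = toy[toy["phq8_flag_gt1_missing"] == 0]

print(toy)
print("rows kept:", len(scored_only))
```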
---



### 3.2 Text (transcript embeddings)
- Option A (quick baseline): TF-IDF on transcript text.
- Option B (semantic): sentence embeddings (e.g., SentenceTransformers).

> Note: If running offline or with limited resources, prefer TF-IDF first; swap in embeddings later.
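
For readers new to TF-IDF, here is a dependency-free sketch of the weighting. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which are scikit-learn's `TfidfVectorizer` defaults, but tokenization is simplified to lowercase whitespace splitting. The function name and the two sentences are illustrative.

```python
import math
from collections import Counter


def tfidf_rows(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, L2-normalized TF-IDF rows) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty documents
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["tired and sad", "tired but hopeful"])
weights = dict(zip(vocab, rows[0]))
print({tok: round(w, 3) for tok, w in weights.items() if w > 0})
```

The word shared by both sentences ("tired") gets a lower weight than the words unique to the first sentence, which is the property that makes TF-IDF a reasonable quick baseline.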


In [None]:

# Placeholder stub for text features (replace with real transcript loading)
# Example API (choose one approach):
USE_TFIDF = True

# Assumption: df_text holds one row per session with columns
# ['subject_id', 'session_id', 'transcript']. Fall back to an empty frame so the
# cell still runs before transcripts are loaded.
df_text = globals().get(
    'df_text', pd.DataFrame(columns=['subject_id', 'session_id', 'transcript'])
)

if not df_text.empty:
    if USE_TFIDF:
        from sklearn.feature_extraction.text import TfidfVectorizer
        vec = TfidfVectorizer(max_features=1024, ngram_range=(1, 2))
        tfidf = vec.fit_transform(df_text.get('transcript', pd.Series([], dtype=str)).fillna(''))
        df_text_feats = pd.DataFrame(tfidf.toarray(), columns=[f'tfidf_{i}' for i in range(tfidf.shape[1])])
    else:
        # Sentence embeddings (requires sentence-transformers)
        # from sentence_transformers import SentenceTransformer
        # model = SentenceTransformer('all-MiniLM-L6-v2')
        # emb = model.encode(df_text['transcript'].fillna('').tolist(), batch_size=64, show_progress_bar=True)
        # df_text_feats = pd.DataFrame(emb, columns=[f'emb_{i}' for i in range(emb.shape[1])])
        df_text_feats = pd.DataFrame()  # placeholder
    df_text_model = pd.concat([df_text[['subject_id', 'session_id']].reset_index(drop=True),
                               df_text_feats.reset_index(drop=True)], axis=1)
else:
    df_text_model = df_text.copy()
    print('Text table empty; populate df_text with transcripts + IDs.')



### 3.3 Audio (prosody)
- Fundamental frequency (f0), jitter/shimmer, loudness/energy, spectral features. 
- Extract with OpenSMILE or COVAREP, then aggregate per session (mean, std, percentiles).
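
The per-session aggregation step can be sketched with a pandas `groupby`. The frame-level pitch values below are toy numbers, and the `f0` column name is an assumption; real column names depend on the extractor's output.

```python
import pandas as pd

# Toy frame-level pitch track (Hz), three frames per session.
frames = pd.DataFrame({
    "subject_id": [301, 301, 301, 302, 302, 302],
    "session_id": [1, 1, 1, 1, 1, 1],
    "f0": [100.0, 110.0, 120.0, 200.0, 200.0, 230.0],
})

# One row per (subject, session) with mean, sample std, and 90th percentile.
audio_agg = (
    frames.groupby(["subject_id", "session_id"])["f0"]
    .agg(f0_mean="mean", f0_std="std", f0_p90=lambda s: s.quantile(0.90))
    .reset_index()
)

print(audio_agg)
```

The result has the shape the audio stub expects: one row per session, keyed by subject and session, with one column per aggregate.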


In [None]:

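# Placeholder stub for audio features
# Expect df_audio to already contain aggregated features keyed by subject/session.
# Fall back to an empty frame so the cell runs before the pipeline is connected.
df_audio = globals().get('df_audio', pd.DataFrame(columns=['subject_id', 'session_id']))
if df_audio.empty:
    print('Audio table empty; populate df_audio with prosodic aggregates (e.g., f0_mean, jitter, etc.).')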



### 3.4 Video (facial action units)
- Use OpenFace to extract per-frame AUs and gaze; aggregate per session.
- Typical aggregates: mean, std, max, fraction above threshold.
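
The "fraction above threshold" aggregate is the least obvious of these, so here is a small sketch on toy per-frame intensities. The column name `AU12_r` is modeled on OpenFace-style intensity outputs and should be treated as an assumption, as should the cutoff of 1.0.

```python
import pandas as pd

AU_THRESHOLD = 1.0  # illustrative intensity cutoff (assumption), not a validated value

# Toy per-frame AU12 intensities, four frames per session.
frames = pd.DataFrame({
    "subject_id": [301] * 4 + [302] * 4,
    "session_id": [1] * 8,
    "AU12_r": [0.0, 0.5, 1.5, 2.0, 0.0, 0.0, 0.2, 1.2],
})

video_agg = (
    frames.groupby(["subject_id", "session_id"])["AU12_r"]
    .agg(
        AU12_mean="mean",
        AU12_max="max",
        # Mean of a boolean mask = share of frames above the cutoff.
        AU12_frac_above=lambda s: (s > AU_THRESHOLD).mean(),
    )
    .reset_index()
)

print(video_agg)
```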


In [None]:

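# Placeholder stub for video features
# Expect df_video to already contain aggregated AU/gaze features keyed by subject/session.
# Fall back to an empty frame so the cell runs before the pipeline is connected.
df_video = globals().get('df_video', pd.DataFrame(columns=['subject_id', 'session_id']))
if df_video.empty:
    print('Video table empty; populate df_video with AU/gaze aggregates.')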



## 4) Multimodal dataset assembly

Merge per-modality feature tables on `['subject_id','session_id']`, align with labels, handle missing data, and validate splits.
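
The sketch below walks through that sequence end to end on toy tables: outer-join the modalities, left-join onto labels, zero-fill, then split by subject. All values are made up, and `validate="one_to_one"` is an added safety check that assumes one row per subject/session in every table.

```python
import numpy as np
import pandas as pd

KEYS = ["subject_id", "session_id"]

labels = pd.DataFrame({"subject_id": range(1, 11), "session_id": 1,
                       "depressed": [0, 1] * 5})
text = pd.DataFrame({"subject_id": [1, 2, 3], "session_id": 1,
                     "tfidf_0": [0.2, 0.0, 0.7]})
audio = pd.DataFrame({"subject_id": [2, 3, 4], "session_id": 1,
                      "f0_mean": [180.0, 121.0, 140.0]})

# Outer-join modalities, then left-join onto labels so every labeled session
# keeps a row even when a modality is missing.
features = pd.merge(text, audio, on=KEYS, how="outer", validate="one_to_one")
merged = pd.merge(labels, features, on=KEYS, how="left", validate="one_to_one")
feat_cols = [c for c in merged.columns if c not in KEYS + ["depressed"]]
merged[feat_cols] = merged[feat_cols].fillna(0.0)

# Subject-level 70/15/15 split: shuffle unique subjects, then slice.
subjects = merged["subject_id"].unique()
rng = np.random.default_rng(42)
rng.shuffle(subjects)
n = len(subjects)
train_ids = set(subjects[: int(0.7 * n)].tolist())
val_ids = set(subjects[int(0.7 * n): int(0.85 * n)].tolist())
test_ids = set(subjects[int(0.85 * n):].tolist())

# Hygiene: no subject lands in more than one split.
overlap = (train_ids & val_ids) | (train_ids & test_ids) | (val_ids & test_ids)
print("merged shape:", merged.shape, "| overlap:", overlap)
```

Splitting on subject IDs, not rows, is what keeps multiple sessions from the same person out of different splits.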


In [None]:

# Merge order: tabular -> text -> audio -> video
from functools import reduce

KEYS = ['subject_id', 'session_id']

# NOTE: Section 3.1 builds `tab_df` keyed on `participant_id`, and Section 1 loads
# `labels_df`. This cell expects `df_tab` / `df_labels` keyed on KEYS, so map the
# earlier frames onto that schema first. Undefined tables fall back to empty frames.
_g = globals()
df_tab        = _g.get('df_tab',        pd.DataFrame(columns=KEYS))
df_text_model = _g.get('df_text_model', pd.DataFrame(columns=KEYS))
df_audio      = _g.get('df_audio',      pd.DataFrame(columns=KEYS))
df_video      = _g.get('df_video',      pd.DataFrame(columns=KEYS))
df_labels     = _g.get('df_labels',     pd.DataFrame(columns=KEYS))

dfs = [d for d in [df_tab, df_text_model, df_audio, df_video] if not d.empty]

if dfs:
    df_features = reduce(lambda left, right: pd.merge(left, right, on=KEYS, how='outer'), dfs)
else:
    df_features = pd.DataFrame(columns=KEYS)

# Attach target if available
if 'depressed' in df_labels.columns:
    df_features = pd.merge(df_labels[KEYS + ['depressed']], df_features, on=KEYS, how='left')

print('Multimodal merged shape:', df_features.shape)

# Simple imputation (zero-fill for missing engineered features; leave IDs/target intact)
id_cols = KEYS + ['depressed']
feat_cols = [c for c in df_features.columns if c not in id_cols]
if feat_cols:
    df_features[feat_cols] = df_features[feat_cols].fillna(0.0)

# Split by subject (subject-disjoint)
subjects = df_features['subject_id'].unique()
rng = np.random.default_rng(42)
rng.shuffle(subjects)
n = len(subjects)
train_ids = subjects[: int(0.7*n)]
val_ids = subjects[int(0.7*n): int(0.85*n)]
test_ids = subjects[int(0.85*n):]

train = df_features[df_features['subject_id'].isin(train_ids)]
val = df_features[df_features['subject_id'].isin(val_ids)]
test = df_features[df_features['subject_id'].isin(test_ids)]

print('Split sizes (rows):', len(train), len(val), len(test))

# Split hygiene checks (skip cleanly when guardrails or data are not ready)
if not globals().get('_guardrails_loaded', False):
    print('SKIP: split hygiene checks (verification.py not loaded).')
elif df_features.empty or 'depressed' not in df_features.columns:
    print('SKIP: split hygiene checks (no merged features or no "depressed" target yet).')
else:
    assert_disjoint_splits(train['subject_id'], val['subject_id'], test['subject_id'])
    assert_label_domain(df_features['depressed'], allowed=(0, 1))
    min_class_presence({'train': train['depressed'], 'val': val['depressed'], 'test': test['depressed']}, min_count=3)



## 5) Artifacts (saved processed data)
Save per-modality tables and the merged multimodal dataset for downstream modeling.


In [None]:

# PROCESSED_DIR is defined in Section 1 (data/processed).
ART_TEXT  = PROCESSED_DIR / 'text_embeddings.parquet'
ART_AUDIO = PROCESSED_DIR / 'audio_features.parquet'
ART_VIDEO = PROCESSED_DIR / 'video_features.parquet'
ART_TAB   = PROCESSED_DIR / 'phq8_engineered.parquet'
ART_MERGE = PROCESSED_DIR / 'multimodal_merged.parquet'

# Save only frames that are defined and non-empty (avoid writing empty placeholders)
to_save = [
    ('df_tab', ART_TAB),
    ('df_text_model', ART_TEXT),
    ('df_audio', ART_AUDIO),
    ('df_video', ART_VIDEO),
    ('df_features', ART_MERGE),
]
for name, path in to_save:
    frame = globals().get(name)
    if frame is not None and not frame.empty:
        frame.to_parquet(path, index=False)

print('Saved (if available):')
for p in [ART_TAB, ART_TEXT, ART_AUDIO, ART_VIDEO, ART_MERGE]:
    print('-', p, p.exists())



## 6) Limitations & reproducibility

**Limitations**
- Placeholder tables for text/audio/video until extraction pipelines are connected. 
- Class imbalance persists; monitor PR curves and calibration in later notebooks. 
- Alignment between modalities may vary by session; verify IDs and time boundaries upstream.
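
For the ID-alignment point, a quick coverage report per modality can surface gaps before merging. This is a hypothetical helper (`id_coverage` is not defined elsewhere in the project) run on made-up IDs.

```python
def id_coverage(label_ids, **modality_ids):
    """Per modality: labeled IDs it lacks, and IDs it has that carry no label."""
    labeled = set(label_ids)
    report = {}
    for name, ids in modality_ids.items():
        present = set(ids)
        report[name] = {
            "missing": sorted(labeled - present),
            "unlabeled": sorted(present - labeled),
        }
    return report


report = id_coverage(
    [301, 302, 303],
    text=[301, 302, 303],
    audio=[301, 303, 304],
)
print(report)
```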

**Reproducibility**
- Python 3.11 (`venv`) 
- Core libs: `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `z3-solver` 
- Optional libs: `sentence-transformers`, `librosa`/`opensmile`, `openface` (CLI) 
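
To record the environment alongside saved artifacts, a version report can be generated with the standard library alone. The sketch below is one possible approach; the package list simply mirrors the core libraries named above, and `installed_versions` is an illustrative name.

```python
import platform
from importlib import metadata

CORE = ["pandas", "numpy", "scikit-learn", "matplotlib", "z3-solver"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


print("Python:", platform.python_version())
for name, ver in installed_versions(CORE).items():
    print(f"{name}: {ver}")
```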



## 7) Closing summary & next steps

- Engineered tabular features (PHQ-8 sum, mean, missing-item count, and z-scored variants).
- Added SMT guardrails to catch data/timing/split issues early. 
- Assembled a merged, subject-disjoint multimodal dataset ready for modeling.

**Next:** Notebook 04 - Multimodal Modeling & Fusion (LR/RF/baseline NN; late fusion vs. early fusion; calibration & interpretability).
