## 1. Data source: Sleep-Accel (PhysioNet; Apple Watch + PSG)

**Dataset:** *Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography* (PhysioNet, v1.0.0)

- **PhysioNet dataset page (download + description):**   
    - https://physionet.org/content/sleep-accel/1.0.0/   
- **DOI (v1.0.0):**   
    - https://doi.org/10.13026/hmhs-py35   
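- **Local path (expected):** download + unzip into `./data/sleep_accel/` (the `data/` directory is not committed to git); the loading code in Section 2 reads from `../data/sleep_accel`, so adjust `DATA_DIR` if your notebook sits at a different depth   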
    - expected: `heart_rate/`, `motion/`, `labels/`, `steps/`, plus `LICENSE.txt`   
- **License (for files):** Open Data Commons Attribution License v1.0 (**ODC-By 1.0**)   

**Citations (as requested by PhysioNet):**   
- Walch, O. (2019). *Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography* (version 1.0.0). PhysioNet. https://doi.org/10.13026/hmhs-py35   
- Walch, O., Huang, Y., Forger, D., Goldstein, C. (2019). *Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device*. SLEEP. https://doi.org/10.1093/sleep/zsz180   
- Goldberger, A. L., et al. (2000). *PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals*. Circulation, 101(23), e215–e220.   


**Files used (and expected columns)**   

We will use the following folders/files from the PhysioNet release:   

- **Motion (ACC):** `motion/[subject]_acceleration.txt`  
  Columns per line (whitespace-delimited in the file): `t_sec, ax_g, ay_g, az_g`  
  where `t_sec` is seconds since PSG start, and accelerations are in **g**.   

- **Heart rate (HR):** `heart_rate/[subject]_heartrate.txt`  
  Columns per line (comma-delimited in the file): `t_sec, hr_bpm`  
  where `hr_bpm` is heart rate in **beats per minute**.   

- **PSG sleep labels:** `labels/[subject]_labeled_sleep.txt`  
  Columns per line (whitespace-delimited in the file): `t_sec, stage` with stage codes:  
  `Wake=0, N1=1, N2=2, N3=3, REM=5` (we drop unscored/invalid epochs if present).
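The label-cleaning step mentioned in the last bullet could look roughly like the sketch below. It assumes that an unscored or invalid epoch is any row whose code is not one of the five listed stages; the `-1` in the toy data is only an illustrative stand-in, and `VALID_STAGES` and `drop_unscored` are hypothetical names, not part of the dataset or the notebook.

```python
import pandas as pd

# Stage codes listed above; any other code is treated as unscored/invalid.
VALID_STAGES = {0: "Wake", 1: "N1", 2: "N2", 3: "N3", 5: "REM"}

def drop_unscored(labels: pd.DataFrame) -> pd.DataFrame:
    """Keep only epochs whose stage code is one of the five scored stages."""
    keep = labels["stage"].isin(list(VALID_STAGES))
    out = labels.loc[keep].copy()
    out["stage_name"] = out["stage"].map(VALID_STAGES)
    return out.reset_index(drop=True)

# Toy example: -1 stands in for an unscored epoch (assumed code, illustration only).
toy = pd.DataFrame({"t_sec": [0.0, 30.0, 60.0, 90.0], "stage": [0, -1, 2, 5]})
clean = drop_unscored(toy)
print(clean)
```

Filtering on an allow-list rather than on one specific "unscored" value keeps the step safe if a file contains a code we did not anticipate.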

> Note: The dataset also includes `steps/` files, but we won’t use them in the first version.


**Notebook intention**

Goal: build a **reproducible sleep-staging pipeline** from **wrist ACC + HR** aligned to **PSG-scored 30-second epochs**, with **leakage-aware, subject-wise evaluation**.
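As a rough illustration of what "subject-wise" means here, the sketch below splits synthetic epochs with scikit-learn's `GroupKFold`, using the subject ID as the group so that no subject contributes epochs to both the training and the test side of a fold. The data, the number of subjects, and `n_splits=3` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 subjects with 20 epochs each, 3 features per epoch.
n_subjects, epochs_per_subject = 6, 20
groups = np.repeat(np.arange(n_subjects), epochs_per_subject)  # subject ID per epoch
X = rng.normal(size=(groups.size, 3))
y = rng.integers(0, 5, size=groups.size)

cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups=groups))

for fold, (train_idx, test_idx) in enumerate(folds):
    train_subjects = set(groups[train_idx].tolist())
    test_subjects = set(groups[test_idx].tolist())
    # Leakage check: a subject must never appear on both sides of a split.
    assert train_subjects.isdisjoint(test_subjects)
    print(f"fold {fold}: test subjects = {sorted(test_subjects)}")
```

A plain `KFold` over epochs would scatter each night across folds, and neighbouring epochs from the same person are strongly correlated, which tends to inflate scores.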

What we do:   

1. **Define the modeling unit as the PSG epoch (30s)** and build one feature row per epoch.  
2. **Align** wrist **ACC** and **HR** to each labeled 30s epoch (aggregate samples falling in `[t, t+30s)`), and attach the PSG stage label at `t`.  
3. **Extract simple, readable features** per epoch:   
   - ACC: magnitude and axis statistics + activity intensity proxies  
   - HR: summary statistics + missingness indicators  
4. Add **causal context (history) features** using past-only rolling summaries over recent epochs (e.g., last few minutes) to capture local sleep continuity without using future information.  
5. Train and compare a small set of classical models using **subject-wise cross-validation** (GroupKFold) and report robust staging metrics (macro-F1, balanced accuracy, confusion matrices, per-subject performance).  
6. Apply a lightweight **temporal stabilization** step (e.g., hysteresis / causal smoothing of probabilities) to reduce one-epoch “blips” and reflect product-realistic output stability.
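Steps 4 and 6 both hinge on using only past information, so a small sketch may help. The first function computes a rolling mean over the previous epochs only (the current epoch is excluded via `shift(1)`); the second applies an exponential moving average to class probabilities and then switches the predicted stage only when the new best class beats the current one by a margin. The function names, `window`, `alpha`, and `margin` are illustrative assumptions rather than the notebook's final choices, and both functions are meant to be applied per subject.

```python
import numpy as np
import pandas as pd

def add_causal_context(df: pd.DataFrame, col: str, window: int = 10) -> pd.DataFrame:
    """Add the mean of `col` over the previous `window` epochs (past-only)."""
    out = df.copy()
    # shift(1) removes the current epoch, so the rolling window sees history only.
    out[f"{col}_prev{window}_mean"] = (
        out[col].shift(1).rolling(window, min_periods=1).mean()
    )
    return out

def stabilize(probs, alpha: float = 0.5, margin: float = 0.1) -> np.ndarray:
    """Causal EMA over class probabilities, then hysteresis on the decision."""
    probs = np.asarray(probs, dtype=float)
    smoothed = np.zeros_like(probs)
    pred = np.zeros(len(probs), dtype=int)
    state = None
    for i, p in enumerate(probs):
        smoothed[i] = p if i == 0 else alpha * p + (1 - alpha) * smoothed[i - 1]
        best = int(np.argmax(smoothed[i]))
        # Switch only if the challenger clearly beats the currently held stage.
        if state is None or smoothed[i, best] - smoothed[i, state] > margin:
            state = best
        pred[i] = state
    return pred

toy = add_causal_context(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]}), "x", window=2)
print(toy)

raw = np.array([[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
stable = stabilize(raw)
print("raw argmax:", raw.argmax(axis=1).tolist())
print("stabilized:", stable.tolist())
```

In the toy run the isolated flip at the third epoch is suppressed, while the sustained change at the end is still followed.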

Note on extra “pre-PSG” wearable data:

This dataset includes wearable streams that may start **before PSG time zero** (e.g., steps for days prior, HR for hours to days prior, motion for up to several hours prior; for subject 1066528 in Section 2, HR starts about 98.7 h and motion about 6.0 h before PSG start). For the main staging pipeline we **restrict to the PSG-labeled interval** and only aggregate sensor data within labeled 30s epochs. Pre-PSG data can be used in extensions as **subject-level context** (e.g., prior-days activity summaries computed strictly from `t < 0`), but coverage varies across subjects and adds preprocessing complexity.
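If the pre-PSG extension is pursued, a subject-level context feature could be computed along these lines. This is only a sketch: `pre_psg_hr_context` and the returned keys are hypothetical names, and the choice of summaries (sample count, median, hours of coverage) is an assumption for illustration.

```python
import pandas as pd

def pre_psg_hr_context(hr: pd.DataFrame) -> dict:
    """Summarize HR recorded strictly before PSG start (t_sec < 0)."""
    pre = hr.loc[hr["t_sec"] < 0]
    if pre.empty:
        # Coverage varies across subjects, so report an explicit "no data" case.
        return {"pre_hr_n": 0, "pre_hr_median": float("nan"), "pre_hr_hours": 0.0}
    return {
        "pre_hr_n": int(len(pre)),
        "pre_hr_median": float(pre["hr_bpm"].median()),
        # Hours between the earliest pre-PSG sample and PSG start.
        "pre_hr_hours": float(-pre["t_sec"].min() / 3600.0),
    }

toy_hr = pd.DataFrame({
    "t_sec": [-7200.0, -3600.0, -10.0, 0.0, 30.0],
    "hr_bpm": [70.0, 80.0, 75.0, 60.0, 61.0],
})
context = pre_psg_hr_context(toy_hr)
print(context)
```

Because the filter is `t_sec < 0`, nothing from the scored night enters the summary, so the feature cannot leak label-period information.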


## 2. Data loading and check

We load per-subject **wrist accelerometer (ACC)**, **heart rate (HR)**, and **PSG sleep labels** from the PhysioNet Sleep-Accel folder structure.

Key conventions:
- All timestamps are de-identified and expressed as **seconds since PSG start** (`t=0` is the PSG start time).
- For the main staging pipeline, we **restrict to the PSG-labeled interval** and only aggregate sensor samples that fall inside each labeled 30s epoch.
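The second convention can be sketched as follows, on toy data. Each sensor sample is mapped to the labeled epoch whose half-open window `[t, t + 30 s)` contains it, and samples outside every labeled epoch are discarded. `assign_epoch`, `epoch_table`, and the feature names are hypothetical; the real feature set may differ.

```python
import numpy as np
import pandas as pd

EPOCH_SEC = 30.0  # PSG epoch length

def assign_epoch(sample_t: np.ndarray, epoch_starts: np.ndarray) -> np.ndarray:
    """Return the start of the labeled epoch containing each sample, else NaN.

    `epoch_starts` must be sorted ascending and non-empty.
    """
    idx = np.searchsorted(epoch_starts, sample_t, side="right") - 1
    idx_safe = np.clip(idx, 0, len(epoch_starts) - 1)
    inside = (idx >= 0) & (sample_t < epoch_starts[idx_safe] + EPOCH_SEC)
    return np.where(inside, epoch_starts[idx_safe], np.nan)

def epoch_table(labels: pd.DataFrame, acc: pd.DataFrame, hr: pd.DataFrame) -> pd.DataFrame:
    """One row per labeled epoch with simple ACC/HR aggregates."""
    starts = labels["t_sec"].to_numpy(dtype=float)
    acc = acc.assign(
        t_epoch=assign_epoch(acc["t_sec"].to_numpy(dtype=float), starts),
        mag_g=np.sqrt(acc["ax_g"] ** 2 + acc["ay_g"] ** 2 + acc["az_g"] ** 2),
    )
    hr = hr.assign(t_epoch=assign_epoch(hr["t_sec"].to_numpy(dtype=float), starts))

    # groupby drops NaN keys by default, which removes out-of-interval samples.
    acc_feat = acc.groupby("t_epoch")["mag_g"].agg(acc_mag_mean="mean", acc_n="count")
    hr_feat = hr.groupby("t_epoch")["hr_bpm"].agg(hr_mean="mean", hr_n="count")

    out = labels.merge(acc_feat, left_on="t_sec", right_index=True, how="left")
    out = out.merge(hr_feat, left_on="t_sec", right_index=True, how="left")
    for c in ["acc_n", "hr_n"]:
        out[c] = out[c].fillna(0).astype(int)
    out["hr_missing"] = out["hr_n"] == 0  # missingness indicator
    return out

# Toy data: 4 epochs, 1 Hz ACC starting before t=0, sparse HR with a gap.
toy_labels = pd.DataFrame({"t_sec": [0.0, 30.0, 60.0, 90.0], "stage": [0, 1, 2, 2]})
toy_acc = pd.DataFrame({"t_sec": np.arange(-10.0, 120.0, 1.0),
                        "ax_g": 0.0, "ay_g": 0.0, "az_g": 1.0})
toy_hr = pd.DataFrame({"t_sec": [-50.0, 5.0, 20.0, 40.0, 100.0, 130.0],
                       "hr_bpm": [70.0, 60.0, 62.0, 58.0, 55.0, 57.0]})
epochs = epoch_table(toy_labels, toy_acc, toy_hr)
print(epochs)
```

Using `searchsorted` against the actual label timestamps, rather than flooring to multiples of 30, avoids assuming that every subject's labels sit on a gap-free grid.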


```python
from pathlib import Path
import re

import numpy as np
import pandas as pd
from IPython.display import display  # built in inside Jupyter; explicit for scripts

STAGE_MAP = {
    0: "Wake",
    1: "N1",
    2: "N2",
    3: "N3",
    5: "REM",
}

# --- Paths ---
DATA_DIR   = Path("../data/sleep_accel")
MOTION_DIR = DATA_DIR / "motion"
HR_DIR     = DATA_DIR / "heart_rate"
LABEL_DIR  = DATA_DIR / "labels"

for d in [DATA_DIR, MOTION_DIR, HR_DIR, LABEL_DIR]:
    assert d.exists(), f"Missing: {d}"

# --- Subject discovery (use labels as the "source of truth") ---
label_files = sorted(LABEL_DIR.glob("*_labeled_sleep.txt"))
assert len(label_files) > 0, f"No label files found in: {LABEL_DIR}"

# Fail with a clear message if a filename does not follow "<digits>_labeled_sleep.txt".
SUBJECT_IDS = []
for f in label_files:
    m = re.fullmatch(r"(\d+)_labeled_sleep\.txt", f.name)
    assert m is not None, f"Unexpected label filename: {f.name}"
    SUBJECT_IDS.append(m.group(1))
print(f"Found {len(SUBJECT_IDS)} subjects (from labels/). Example IDs: {SUBJECT_IDS[:5]}")

# --- Loaders ---
def load_labels(subject_id: str) -> pd.DataFrame:
    """Load PSG sleep labels: columns [t_sec, stage] (whitespace-delimited)."""
    path = LABEL_DIR / f"{subject_id}_labeled_sleep.txt"
    df = pd.read_csv(path, sep=r"\s+", header=None, names=["t_sec", "stage"])
    df["t_sec"] = df["t_sec"].astype(float)
    df["stage"] = df["stage"].astype(int)
    # Stable sort keeps the file order for any duplicate timestamps.
    df = df.sort_values("t_sec", kind="mergesort").reset_index(drop=True)
    return df

def load_hr(subject_id: str) -> pd.DataFrame:
    """Load Apple Watch HR: columns [t_sec, hr_bpm] (comma-delimited)."""
    path = HR_DIR / f"{subject_id}_heartrate.txt"
    df = pd.read_csv(path, sep=",", header=None, names=["t_sec", "hr_bpm"])
    df["t_sec"] = df["t_sec"].astype(float)
    df["hr_bpm"] = df["hr_bpm"].astype(float)
    df = df.sort_values("t_sec", kind="mergesort").reset_index(drop=True)
    return df

def load_acc(subject_id: str) -> pd.DataFrame:
    """Load Apple Watch accelerometer: columns [t_sec, ax_g, ay_g, az_g] (whitespace-delimited)."""
    path = MOTION_DIR / f"{subject_id}_acceleration.txt"
    df = pd.read_csv(path, sep=r"\s+", header=None, names=["t_sec", "ax_g", "ay_g", "az_g"])
    df["t_sec"] = df["t_sec"].astype(float)
    for c in ["ax_g", "ay_g", "az_g"]:
        df[c] = df[c].astype(float)
    df = df.sort_values("t_sec", kind="mergesort").reset_index(drop=True)
    return df

# --- Quick sanity check on one subject ---
sid = SUBJECT_IDS[0]

labels = load_labels(sid)
hr     = load_hr(sid)
acc    = load_acc(sid)

print(f"\nSubject {sid} loaded:")
print(f"  labels: {labels.shape} | t range: [{labels.t_sec.min():.1f}, {labels.t_sec.max():.1f}]")
print(f"  hr    : {hr.shape}     | t range: [{hr.t_sec.min():.1f}, {hr.t_sec.max():.1f}]")
print(f"  acc   : {acc.shape}    | t range: [{acc.t_sec.min():.1f}, {acc.t_sec.max():.1f}]")

# Count epochs per stage code once and reuse the result below.
label_counts = labels["stage"].value_counts().sort_index()

print("\nLabel codes present (counts):")
print(label_counts)

# Same counts with readable stage names; codes outside STAGE_MAP show as "OTHER".
display(pd.DataFrame({
    "stage_code": label_counts.index,
    "stage_name": [STAGE_MAP.get(int(s), "OTHER") for s in label_counts.index],
    "count": label_counts.values,
}))
```

```text
Found 31 subjects (from labels/). Example IDs: ['1066528', '1360686', '1449548', '1455390', '1818471']

Subject 1066528 loaded:
  labels: (952, 2) | t range: [0.0, 28530.0]
  hr    : (16617, 2)     | t range: [-355241.7, 34491.2]
  acc   : (1281000, 4)    | t range: [-21684.8, 28626.5]

Label codes present (counts):
stage
0    185
1     97
2    299
3     62
5    309
Name: count, dtype: int64
```


|   | stage_code | stage_name | count |
|---|------------|------------|-------|
| 0 | 0 | Wake | 185 |
| 1 | 1 | N1 | 97 |
| 2 | 2 | N2 | 299 |
| 3 | 3 | N3 | 62 |
| 4 | 5 | REM | 309 |
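A quick arithmetic check on these numbers: the label timestamps run from 0.0 to 28530.0 s, and 28530 / 30 + 1 = 952, which matches the 952 label rows, suggesting that this subject's labels form a gap-free 30 s grid. The sketch below makes that check explicit and converts the per-stage counts above into minutes; `check_epoch_grid` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

EPOCH_SEC = 30

def check_epoch_grid(t_sec) -> bool:
    """True if consecutive label timestamps are exactly one epoch apart."""
    diffs = np.diff(np.asarray(t_sec, dtype=float))
    return bool(np.all(diffs == EPOCH_SEC))

# Counts copied from the output above for subject 1066528.
counts = pd.Series({"Wake": 185, "N1": 97, "N2": 299, "N3": 62, "REM": 309})
minutes = counts * EPOCH_SEC / 60
summary = pd.DataFrame({
    "epochs": counts,
    "minutes": minutes,
    "share": (counts / counts.sum()).round(3),
})
print(summary)
print(f"Total scored time: {minutes.sum() / 60:.2f} h")

# Expected number of epochs if the grid from 0.0 to 28530.0 s has no gaps.
expected_epochs = int(28530.0 / EPOCH_SEC) + 1
print("Expected epochs:", expected_epochs, "| observed:", int(counts.sum()))
```

In the notebook itself, `check_epoch_grid(labels["t_sec"])` would confirm the grid directly from the loaded timestamps rather than from the row count.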
