## 1. Data source: Sleep-Accel (PhysioNet; Apple Watch + PSG)

**Dataset:** *Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography* (PhysioNet, v1.0.0)

- **PhysioNet dataset page (download + description):**   
    - https://physionet.org/content/sleep-accel/1.0.0/   
- **DOI (v1.0.0):**   
    - https://doi.org/10.13026/hmhs-py35   
- **Local path (expected):** download + unzip into `./data/sleep_accel/` (the `data/` directory is not committed to git)   
    - expected: `heart_rate/`, `motion/`, `labels/`, `steps/`, plus `LICENSE.txt`   
- **License (for files):** Open Data Commons Attribution License v1.0 (**ODC-By 1.0**)   
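
Before running anything else, it can help to confirm that the unzipped release matches the layout above. The following is a minimal sketch using only the standard library; `DATA_ROOT` and `check_layout` are illustrative names chosen here, not part of the dataset or any library.

```python
from pathlib import Path

# Folder and file names come from the expected layout listed above.
# DATA_ROOT and check_layout are hypothetical names used for illustration.
DATA_ROOT = Path("data/sleep_accel")
EXPECTED_DIRS = ("heart_rate", "motion", "labels", "steps")
EXPECTED_FILES = ("LICENSE.txt",)


def check_layout(root: Path) -> list[str]:
    """Return the expected entries that are missing under `root`."""
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


missing = check_layout(DATA_ROOT)
if missing:
    print(f"Missing under {DATA_ROOT}: {', '.join(missing)}")
else:
    print(f"{DATA_ROOT} looks complete.")
```

An empty return value means every expected folder and the license file were found; anything else names what still needs to be downloaded or unzipped.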

**Citations (as requested by PhysioNet):**   
- Walch, O. (2019). *Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography* (version 1.0.0). PhysioNet. https://doi.org/10.13026/hmhs-py35   
- Walch, O., Huang, Y., Forger, D., Goldstein, C. (2019). *Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device*. SLEEP. https://doi.org/10.1093/sleep/zsz180   
- Goldberger, A. L., et al. (2000). *PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals*. Circulation, 101(23), e215-e220.   


**Files used (and expected columns)**   

We will use the following folders/files from the PhysioNet release:   

- **Motion (ACC):** `motion/[subject]_acceleration.txt`  
  Columns per line: `t_sec, ax_g, ay_g, az_g`  
  where `t_sec` is seconds since PSG start, and accelerations are in **g**.   

- **Heart rate (HR):** `heart_rate/[subject]_heartrate.txt`  
  Columns per line: `t_sec, hr_bpm`  
  where `hr_bpm` is heart rate in **beats per minute**.   

- **PSG sleep labels:** `labels/[subject]_labeled_sleep.txt`  
  Columns per line: `t_sec, stage` with stage codes:  
  `Wake=0, N1=1, N2=2, N3=3, REM=5` (we drop unscored/invalid epochs if present).
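
The three file types above could be loaded along the following lines. This is a sketch under assumptions: it assumes the files have no header row and that fields are separated by commas and/or whitespace (the regex separator accepts either, so check it against the actual files). The function names are illustrative, and the column names are the ones used in this notebook rather than names stored in the files.

```python
import io

import pandas as pd

# Assumption: no header row; fields separated by commas and/or whitespace.
SEP = r"[,\s]+"

# Stage codes as listed above; anything else is treated as unscored/invalid.
VALID_STAGES = {0: "Wake", 1: "N1", 2: "N2", 3: "N3", 5: "REM"}


def _read(src, names):
    """Read a header-less text file (path or file-like) into named columns."""
    return pd.read_csv(src, sep=SEP, engine="python", header=None, names=names)


def load_motion(src) -> pd.DataFrame:
    return _read(src, ["t_sec", "ax_g", "ay_g", "az_g"])


def load_heart_rate(src) -> pd.DataFrame:
    return _read(src, ["t_sec", "hr_bpm"])


def load_labels(src) -> pd.DataFrame:
    """Load PSG labels and drop epochs whose stage code is not in VALID_STAGES."""
    df = _read(src, ["t_sec", "stage"])
    keep = df["stage"].isin(list(VALID_STAGES))
    return df[keep].reset_index(drop=True)


# Tiny in-memory example (synthetic values, not taken from the dataset).
demo_labels = io.StringIO("0 0\n30 -1\n60 2\n90 5\n")
print(load_labels(demo_labels))
```

Each loader accepts either a file path such as `motion/[subject]_acceleration.txt` or a file-like object, which keeps the functions easy to test without the real data.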

> Note: The dataset also includes `steps/` files, but we won’t use them in the first version.


**Notebook intention**

Goal: build a **clean, reproducible sleep-staging pipeline** from **wrist ACC + HR** aligned to **PSG-scored 30-second epochs**, with **leakage-aware, subject-wise evaluation**.
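
"Subject-wise" here means that all epochs from a given subject land in the same fold, so no subject contributes to both training and testing. A minimal sketch with scikit-learn's `GroupKFold` follows; the feature matrix, labels, and subject counts are synthetic placeholders, not values from the dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic placeholders: 6 subjects with 50 epochs each (illustrative only).
rng = np.random.default_rng(0)
n_subjects, epochs_per_subject = 6, 50
groups = np.repeat(np.arange(n_subjects), epochs_per_subject)  # subject id per epoch
X = rng.normal(size=(groups.size, 4))                          # fake feature rows
y = rng.choice([0, 1, 2, 3, 5], size=groups.size)              # fake stage labels

cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups=groups))

for fold, (train_idx, test_idx) in enumerate(folds):
    train_subjects = set(groups[train_idx])
    test_subjects = set(groups[test_idx])
    # The leakage check: a subject must never appear on both sides.
    assert train_subjects.isdisjoint(test_subjects)
    print(f"fold {fold}: test subjects = {sorted(int(s) for s in test_subjects)}")
```

Any per-subject normalization or feature scaling would need to be fit inside each training fold for the split to stay leakage-free.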

What we do:   

1. **Define the modeling unit as the PSG epoch (30s)** and build one feature row per epoch.  
2. **Align** wrist **ACC** and **HR** to each labeled 30s epoch (aggregate samples falling in `[t, t+30s)`), and attach the PSG stage label at `t`.  
3. **Extract simple, readable features** per epoch:   
   - ACC: magnitude and axis statistics + activity intensity proxies  
   - HR: summary statistics + missingness indicators  
4. Add **causal context (history) features** using past-only rolling summaries over recent epochs (e.g., last few minutes) to capture local sleep continuity without using future information.  
5. Train and compare a small set of classical models using **subject-wise cross-validation** (`GroupKFold`), and report staging metrics that hold up under class imbalance (macro-F1, balanced accuracy), plus confusion matrices and per-subject performance.  
6. Apply a lightweight **temporal stabilization** step (e.g., hysteresis / causal smoothing of probabilities) to reduce one-epoch “blips” and give the output the stability a real product would need.
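
Steps 1 to 4 can be sketched as follows for a single subject. This is an illustration under assumptions, not the final implementation: the function names, the particular features (ACC magnitude mean/std, HR mean/count), and the window length are placeholders, and the input frames are assumed to use the column names listed earlier.

```python
import numpy as np
import pandas as pd

EPOCH_SEC = 30  # PSG epoch length in seconds


def epoch_features(labels, motion, hr):
    """One row per labeled epoch; samples are assigned to the epoch [t, t+30s)."""
    epochs = labels.sort_values("t_sec").reset_index(drop=True)
    starts = epochs["t_sec"].to_numpy(dtype=float)

    def assign(t):
        # Index of the last epoch start <= t, or -1 if the sample is outside any epoch.
        idx = np.searchsorted(starts, t, side="right") - 1
        inside = (idx >= 0) & (t < starts[np.clip(idx, 0, None)] + EPOCH_SEC)
        return np.where(inside, idx, -1)

    m = motion.assign(epoch=assign(motion["t_sec"].to_numpy(dtype=float)))
    m = m[m["epoch"] >= 0]
    m = m.assign(mag=np.sqrt(m["ax_g"] ** 2 + m["ay_g"] ** 2 + m["az_g"] ** 2))
    acc = m.groupby("epoch")["mag"].agg(acc_mag_mean="mean", acc_mag_std="std")

    h = hr.assign(epoch=assign(hr["t_sec"].to_numpy(dtype=float)))
    h = h[h["epoch"] >= 0]
    hr_feat = h.groupby("epoch")["hr_bpm"].agg(hr_mean="mean", hr_n="count")

    out = epochs.join(acc).join(hr_feat)
    out["hr_n"] = out["hr_n"].fillna(0).astype(int)
    out["hr_missing"] = out["hr_n"].eq(0)  # missingness indicator
    return out


def add_causal_context(feats, cols, window=10):
    """Add past-only rolling means over the previous `window` epochs.

    shift(1) excludes the current epoch, so each value depends only on
    earlier rows. Assumes rows are consecutive epochs of ONE subject.
    """
    out = feats.copy()
    for c in cols:
        past = out[c].shift(1)
        out[f"{c}_prev{window}_mean"] = past.rolling(window, min_periods=1).mean()
    return out


# Toy single-subject example with synthetic values.
labels = pd.DataFrame({"t_sec": [0, 30, 60], "stage": [0, 2, 2]})
motion = pd.DataFrame({"t_sec": [-1, 0, 15, 30, 95],
                       "ax_g": [1, 0, 0, 3, 1],
                       "ay_g": [0, 0, 0, 4, 1],
                       "az_g": [0, 1, 1, 0, 1]})
hr = pd.DataFrame({"t_sec": [10, 20, 70], "hr_bpm": [60, 62, 55]})

feats = epoch_features(labels, motion, hr)
print(add_causal_context(feats, ["acc_mag_mean", "hr_mean"], window=2))
```

With multiple subjects in one table, the context step would need to run per subject (for example inside a `groupby` on a subject id column) so that history never crosses subject boundaries.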

**Note on extra “pre-PSG” wearable data**

This dataset includes wearable streams that may start **before PSG time zero** (e.g., steps for days prior, HR for hours prior, motion shortly before). For the main staging pipeline we **restrict to the PSG-labeled interval** and only aggregate sensor data within labeled 30s epochs. Pre-PSG data can be used in extensions as **subject-level context** (e.g., prior-days activity summaries computed strictly from `t < 0`), but its coverage varies across subjects, and using it adds preprocessing complexity.
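
One way to keep the two uses separate is to split each sensor stream once, up front, into a pre-PSG part and a PSG-interval part. The sketch below assumes the column names used in this notebook; `split_pre_and_psg` is a hypothetical helper, and the PSG interval is taken to run from the first labeled epoch start to the end of the last labeled epoch.

```python
import pandas as pd

EPOCH_SEC = 30  # PSG epoch length in seconds


def split_pre_and_psg(stream: pd.DataFrame, labels: pd.DataFrame):
    """Split a sensor stream into (pre-PSG rows, rows inside the labeled interval).

    pre: strictly t < 0 (candidate subject-level context).
    psg: first labeled epoch start <= t < last labeled epoch start + 30 s.
    """
    start = labels["t_sec"].min()
    end = labels["t_sec"].max() + EPOCH_SEC
    t = stream["t_sec"]
    pre = stream[t < 0]
    psg = stream[(t >= start) & (t < end)]
    return pre, psg


# Toy example with synthetic values.
labels = pd.DataFrame({"t_sec": [0, 30, 60], "stage": [0, 1, 2]})
hr = pd.DataFrame({"t_sec": [-3600, -10, 0, 45, 200],
                   "hr_bpm": [70, 68, 66, 64, 60]})

pre_hr, psg_hr = split_pre_and_psg(hr, labels)
print(len(pre_hr), "pre-PSG samples;", len(psg_hr), "samples in the labeled interval")
```

Only the second frame feeds the per-epoch features; the first can be summarized per subject later, or ignored entirely in the first version.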
