
# 0) Notebook title & overview

> # Garmin Wearable — Simple Accel+Gyro Processing (Easy Mode)
>
> **Goal:** turn two raw Excel files (accelerometer & gyroscope) with batched samples into a single, clean **50 Hz** time series with columns:
> `timestamp, timestamp_ms, Ax, Ay, Az, Gx, Gy, Gz`.
>
> **Key ideas:**
>
> 1. Parse list-like columns.
> 2. Explode to one row per sample.
> 3. Build true timestamps.
> 4. Align accel↔gyro by nearest time.
> 5. Resample to exact 50 Hz.
>
> We keep everything **UTC**, avoid clever abstractions, and comment every step.

# 1) Imports & file paths

In [19]:
# 1) Imports 
import pandas as pd
import numpy as np

import json, ast
from pathlib import Path

CURRENT_DIR = Path.cwd()
BASE_DIR = CURRENT_DIR.parent

# paths to your files inside "data"
ACCEL_PATH = BASE_DIR / "data" / "2025-03-23-15-23-10-accelerometer_data.xlsx"
GYRO_PATH  = BASE_DIR / "data" / "2025-03-23-15-23-10-gyroscope_data.xlsx"

TARGET_HZ = 50                 # resample target (Hz)
MERGE_TOLERANCE_MS = 1       # max accel↔gyro time gap allowed

# 2) Load raw Excel & peek

> We read both Excel files as-is to see the original column names.
> Many Garmin exports have:
>
> * `timestamp` (date string)
> * `timestamp_ms` (base ms offset)
> * `sample_time_offset` (list of per-sample ms within the row)
> * per-axis arrays like `calibrated_accel_x|y|z` or already `x|y|z`.

In [20]:
try:
    accel_raw = pd.read_excel(ACCEL_PATH) # type: ignore
    print("accel data read successfully")
    print(accel_raw.head())
except Exception as e:
    print(f"Error reading accel data: {e}")
    raise  # stop here: every later cell needs accel_raw

try:
    gyro_raw = pd.read_excel(GYRO_PATH)   # type: ignore
    print("gyro data read successfully")
    print(gyro_raw.head())
except Exception as e:
    print(f"Error reading gyro data: {e}")
    raise  # stop here: every later cell needs gyro_raw



accel data read successfully
                 timestamp  timestamp_ms  \
0  03/24/2025, 07:52:19 AM           766   
1  03/24/2025, 07:52:20 AM            23   
2  03/24/2025, 07:52:20 AM           271   
3  03/24/2025, 07:52:20 AM           518   
4  03/24/2025, 07:52:20 AM           767   

                                  sample_time_offset  \
0  ["0","10","29","38","48","57","66","77","85","...   
1  ["0","9","19","28","38","48","57","67","77","8...   
2  ["0","10","19","29","38","47","58","66","76","...   
3  ["0","11","20","29","39","48","58","68","77","...   
4  ["0","9","19","29","38","57","66","76","85","9...   

                                  calibrated_accel_x  \
0  ["-62.48186","-27.60828","-95.41802","-155.478...   
1  ["-178.7271","-161.2903","-133.1977","-124.479...   
2  ["-81.85607","-92.51189","-101.2303","-112.854...   
3  ["-79.91866","-134.1664","-75.07510","-51.8260...   
4  ["-90.57446","-102.1990","-91.54318","-101.230...   

                                

# 3) Normalize column names to a simple schema



> To make later steps easy, we normalize headers:
>
> * Accelerometer → `x, y, z`
> * Gyroscope     → `x, y, z`
>   (We’ll add the `A`/`G` prefixes **after** exploding.)

In [22]:
accel_map = {
    "calibrated_accel_x": "x",
    "calibrated_accel_y": "y",
    "calibrated_accel_z": "z"
}
for k,v in accel_map.items():
    if k in accel_raw.columns:
        accel_raw = accel_raw.rename(columns={k:v})

# For gyro
gyro_map = {
    "calibrated_gyro_x": "x",
    "calibrated_gyro_y": "y",
    "calibrated_gyro_z": "z"
}
for k,v in gyro_map.items():
    if k in gyro_raw.columns:
        gyro_raw = gyro_raw.rename(columns={k:v})

# Check we have the basics we need
needed = ["timestamp","timestamp_ms","sample_time_offset","x","y","z"]
missing_acc = [c for c in needed if c not in accel_raw.columns]
missing_gyr = [c for c in needed if c not in gyro_raw.columns]
print("Missing in accel:", missing_acc)
print("Missing in gyro :", missing_gyr)

Missing in accel: []
Missing in gyro : []


In [23]:
display(accel_raw.head(2))
display(gyro_raw.head(2))

Unnamed: 0,timestamp,timestamp_ms,sample_time_offset,x,y,z
0,"03/24/2025, 07:52:19 AM",766,"[""0"",""10"",""29"",""38"",""48"",""57"",""66"",""77"",""85"",""...","[""-62.48186"",""-27.60828"",""-95.41802"",""-155.478...","[""-758.6971"",""-984.1958"",""-1240.532"",""-1338.82...","[""-1097.585"",""-1221.784"",""-1086.742"",""-957.614..."
1,"03/24/2025, 07:52:20 AM",23,"[""0"",""9"",""19"",""28"",""38"",""48"",""57"",""67"",""77"",""8...","[""-178.7271"",""-161.2903"",""-133.1977"",""-124.479...","[""-1360.991"",""-1342.681"",""-1269.442"",""-1171.14...","[""-777.2302"",""-619.5170"",""-460.8181"",""-345.490..."


Unnamed: 0,timestamp,timestamp_ms,sample_time_offset,x,y,z
0,"03/24/2025, 07:52:19 AM",795,"[""0"",""9"",""19"",""38"",""48"",""57"",""67"",""76"",""85"",""9...","[""39.15614"",""32.96125"",""22.04125"",""1.041086"",""...","[""-7.520049"",""-10.77502"",""-9.620019"",""-4.79006...","[""7.881836"",""7.742001"",""12.78200"",""2.981754"",""..."
1,"03/24/2025, 07:52:20 AM",52,"[""0"",""9"",""19"",""28"",""38"",""48"",""57"",""67"",""76"",""8...","[""-24.26378"",""-35.91878"",""-44.66883"",""-50.3738...","[""-12.10503"",""-11.82503"",""-11.30004"",""-9.90004...","[""-0.3430405"",""-1.078040"",""-1.533123"",""-2.1631..."


# 4) Convert list-like cells into real Python lists

> In the raw files, `sample_time_offset` and `x/y/z` may be stored as JSON strings (like `'["0","20","40"]'`) or as plain list literals (like `'[0, 20, 40]'`).
> We parse them into real lists so we can **explode** them.


#### “Parsing list cells”: why do we parse first?

**What’s in the raw files**

Garmin often stores multiple samples inside one row:

* `timestamp`: a base time (like “2025-03-24 08:00:00”)
* `timestamp_ms`: extra milliseconds at the row level
* `sample_time_offset`: a list of per-sample ms offsets within that row (e.g., `[0, 20, 40, ...]`)
* `x`, `y`, `z`: lists of values (same length as the offsets)

Sometimes those lists are stored as strings like `"[0, 20, 40]"` or `'["0","20","40"]'`.

**Why parse**

Pandas cannot do arithmetic on stringified lists. We convert them into real Python lists so we can explode them (next step) and compute accurate times.
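
As a small illustration of the two parsers, the sketch below feeds them made-up strings (not values from the Garmin files). The single-quoted variant is an assumed example of a Python-style literal that JSON rejects, which is the case where a fallback to `ast.literal_eval` matters.

```python
import ast
import json

json_style = '["0","20","40"]'       # JSON array of strings
python_style = "['0', '20', '40']"   # Python literal with single quotes (illustrative)

parsed_json = json.loads(json_style)

# JSON requires double quotes, so the Python-style string fails here...
try:
    json.loads(python_style)
    json_accepts_single_quotes = True
except json.JSONDecodeError:
    json_accepts_single_quotes = False

# ...but ast.literal_eval safely evaluates it as a Python literal.
parsed_python = ast.literal_eval(python_style)

# Either way the elements are still strings; convert before doing arithmetic.
offsets_ms = [int(v) for v in parsed_json]
print(parsed_json, parsed_python, offsets_ms, json_accepts_single_quotes)
```

Note that parsing only recovers the list structure; the numeric conversion is a separate step, which the notebook performs later with `pd.to_numeric`.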

In [24]:
def parse_list_cell(v):
    """Return a Python list from JSON/Python-like representations."""
    if isinstance(v, (list, tuple, np.ndarray)):
        return list(v)
    if pd.isna(v):
        return []
    s = str(v).strip()
    # Try JSON first
    try:
        return json.loads(s)
    except Exception:
        # Then Python literal (safe) e.g. "[0, 20, 40]"
        try:
            return ast.literal_eval(s)
        except Exception:
            # Fallback: try comma-split
            s = s.strip("[]")
            if not s:
                return []
            return [pd.to_numeric(x, errors="coerce") for x in s.split(",")]

# Apply to both dataframes
for df in (accel_raw, gyro_raw):
    for c in ["sample_time_offset", "x", "y", "z"]:
        df[c] = df[c].apply(parse_list_cell)

print("Parsed list lengths (first row accel):",
      [len(accel_raw.iloc[0][c]) for c in ["sample_time_offset","x","y","z"]])
print("Parsed list lengths (first row gyro):",
      [len(gyro_raw.iloc[0][c]) for c in ["sample_time_offset","x","y","z"]])

Parsed list lengths (first row accel): [25, 25, 25, 25]
Parsed list lengths (first row gyro): [25, 25, 25, 25]


In [25]:
# display(accel_raw.head(2))
# display(gyro_raw.head(2))


# 5) Drop any rows where list lengths are mismatched

> Each row’s `sample_time_offset`, `x`, `y`, and `z` lists must have the **same length**.
> If not, we drop that row (rare, but safer than guessing).
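
To make the rule concrete, here is a minimal sketch on a made-up two-row frame (the values are illustrative, not from the recordings). It uses a column-wise length table instead of a row-wise `apply`; that is just one possible way to express the same check.

```python
import pandas as pd

# Toy batched frame: the second row has a y list that is one value short.
toy = pd.DataFrame({
    "sample_time_offset": [[0, 10, 20], [0, 10]],
    "x": [[1.0, 2.0, 3.0], [1.0, 2.0]],
    "y": [[1.0, 2.0, 3.0], [1.0]],
    "z": [[1.0, 2.0, 3.0], [1.0, 2.0]],
})

list_cols = ["sample_time_offset", "x", "y", "z"]

# One length per cell, then require a single distinct length per row.
lengths = toy[list_cols].apply(lambda col: col.map(len))
row_ok = lengths.nunique(axis=1).eq(1)

clean = toy.loc[row_ok].reset_index(drop=True)
print("bad rows:", int((~row_ok).sum()), "| kept rows:", len(clean))
```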


In [26]:
def ok_lengths(row):
    a = len(row["sample_time_offset"])
    b = len(row["x"]); c = len(row["y"]); d = len(row["z"])
    return (a == b == c == d)

acc_ok = accel_raw.apply(ok_lengths, axis=1)
gyr_ok = gyro_raw.apply(ok_lengths, axis=1)

print("Accel bad rows:", (~acc_ok).sum(), " | Gyro bad rows:", (~gyr_ok).sum())

accel_raw = accel_raw.loc[acc_ok].reset_index(drop=True).copy()
gyro_raw  = gyro_raw.loc[gyr_ok].reset_index(drop=True).copy()


Accel bad rows: 0  | Gyro bad rows: 0


# 6) Explode to **one row per sample** (accel & gyro)

> We convert batched rows into **per-sample** rows.

> Then we create **true timestamps** as:

> `timestamp_true = to_datetime(timestamp, UTC) + timestamp_ms (ms) + sample_time_offset (ms)`.

#### “Explode to one row per sample”: why?

**What “explode” does**

After parsing, each row still contains arrays. `explode` turns each row into one row per actual sample:

* Row 1 with offset 0 → sample 1
* Row 1 with offset 20 → sample 2
* …

**Why it’s important**

All downstream steps (alignment, resampling, modeling) expect one sample per row, with a single timestamp and scalar `Ax, Ay, Az` (and later `Gx, Gy, Gz`).

If we skip `explode`, later steps would try to average or merge whole lists, which is where “TypeError: list” errors come from.
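
A minimal sketch of the explode step on a toy frame may help. The numbers are shortened, made-up stand-ins for the real rows, and only one axis is shown. Multi-column `explode` needs pandas 1.3 or newer.

```python
import pandas as pd

# Two batched rows: three samples in the first, two in the second.
batched = pd.DataFrame({
    "timestamp_ms": [766, 23],
    "sample_time_offset": [[0, 10, 29], [0, 9]],
    "x": [[-62.5, -27.6, -95.4], [-178.7, -161.3]],
})

# Exploding both list columns together keeps offsets and values paired.
per_sample = batched.explode(["sample_time_offset", "x"], ignore_index=True)

# Exploded columns come back with object dtype, so convert explicitly.
per_sample["x"] = pd.to_numeric(per_sample["x"])
print(per_sample)
```

Row-level columns such as `timestamp_ms` are repeated for every sample of their row, which is what lets the next step add the per-sample offset to the row-level base time.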

In [27]:
# Explode the list columns together (multi-column explode needs pandas >= 1.3).
# explode() returns a new DataFrame with one row per list element; it does not modify in place.
accel_long = accel_raw.explode(["sample_time_offset","x","y","z"], ignore_index=True).copy()
gyro_long  = gyro_raw.explode(["sample_time_offset","x","y","z"], ignore_index=True).copy()

# Build true timestamps (UTC)
def make_true_time(df):
    ''' Why: The base timestamp is only row‑level. 
    The offset gives the exact time inside the row. 
    Using true time preserves rhythm/frequency and makes alignment correct. '''
    # base = timestamp (UTC) + timestamp_ms
    base = pd.to_datetime(df["timestamp"], utc=True, errors="coerce") \
           + pd.to_timedelta(pd.to_numeric(df["timestamp_ms"], errors="coerce").fillna(0).astype("int64"), unit="ms")
    # add per-sample offset
    off  = pd.to_timedelta(pd.to_numeric(df["sample_time_offset"], errors="coerce").fillna(0).astype("int64"), unit="ms")
    return base + off

accel_long["timestamp"] = make_true_time(accel_long)
gyro_long["timestamp"]  = make_true_time(gyro_long)

# Keep only what we need, and rename axes with prefixes
accel_long = accel_long[["timestamp","x","y","z"]].rename(columns={"x":"Ax","y":"Ay","z":"Az"})
gyro_long  = gyro_long [["timestamp","x","y","z"]].rename(columns={"x":"Gx","y":"Gy","z":"Gz"})

# Ensure numeric types
for c in ["Ax","Ay","Az"]:
    accel_long[c] = pd.to_numeric(accel_long[c], errors="coerce")
for c in ["Gx","Gy","Gz"]:
    gyro_long[c]  = pd.to_numeric(gyro_long[c],  errors="coerce")

# Sort
accel_long = accel_long.sort_values("timestamp").reset_index(drop=True)
gyro_long  = gyro_long.sort_values("timestamp").reset_index(drop=True)

print("accel_long:", accel_long.shape, "gyro_long:", gyro_long.shape)
display(accel_long.head(3))
display(gyro_long.head(3))

accel_long: (363400, 4) gyro_long: (363400, 4)


Unnamed: 0,timestamp,Ax,Ay,Az
0,2025-03-24 07:52:19.766000+00:00,-62.48186,-758.6971,-1097.585
1,2025-03-24 07:52:19.776000+00:00,-27.60828,-984.1958,-1221.784
2,2025-03-24 07:52:19.795000+00:00,-95.41802,-1240.532,-1086.742


Unnamed: 0,timestamp,Gx,Gy,Gz
0,2025-03-24 07:52:19.795000+00:00,39.15614,-7.520049,7.881836
1,2025-03-24 07:52:19.804000+00:00,32.96125,-10.77502,7.742001
2,2025-03-24 07:52:19.814000+00:00,22.04125,-9.620019,12.782


#### “True timestamp”: why we build it (and how)

**The formula**

For each exploded sample we compute:

`true_time = to_utc_datetime(timestamp) + timestamp_ms (row-level) + sample_time_offset (per-sample)`

**Why do this**

The base `timestamp` is coarse (row-level); the actual sample time is `timestamp + timestamp_ms + offset`. Using true times:

* aligns accel↔gyro at the exact instant;
* preserves fine timing (vital for frequency content and motion phases);
* avoids “smearing” or jitter caused by treating all batched samples as the same time.
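
As a worked example, the sketch below applies the formula to the first accelerometer row shown earlier (`03/24/2025, 07:52:19 AM`, `timestamp_ms` 766, offsets 0, 10, 29). The explicit `format` string is an assumption added here for clarity; the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Row-level base time, treated as UTC like in the notebook.
base = pd.to_datetime(
    "03/24/2025, 07:52:19 AM",
    format="%m/%d/%Y, %I:%M:%S %p",  # assumed format, matching the printed sample
    utc=True,
)
row_ms = pd.to_timedelta(766, unit="ms")            # timestamp_ms for the row
offsets = pd.to_timedelta([0, 10, 29], unit="ms")   # first three sample_time_offset values

true_times = base + row_ms + offsets
labels = [t.strftime("%H:%M:%S.%f")[:-3] for t in true_times]
print(labels)
```

The three results (07:52:19.766, .776 and .795) agree with the first three timestamps in the `accel_long` output above.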

In [28]:

def guess_accel_unit(df, cols=("Ax","Ay","Az")):
    norm = np.sqrt((df[list(cols)]**2).sum(axis=1))
    m = float(np.nanmedian(norm))
    if 700 <= m <= 1300: return "mG", m      # ~1000 when still
    if 0.7 <= m <= 1.3:  return "g", m       # ~1 when still
    if 7.0 <= m <= 12.5: return "m/s^2", m   # ~9.81 when still
    return "unknown", m

unit, med = guess_accel_unit(accel_long, ("Ax","Ay","Az"))
print(f"[Unit check] accel looks like: {unit} | median norm ~ {med:.3f}")

# Convert only if needed
if unit == "mG":
    factor = 9.80665 / 1000.0   # mG -> m/s^2
    for c in ["Ax","Ay","Az"]:
        accel_long[c] = accel_long[c].astype(float) * factor
    print("Converted accel: mG → m/s²")
elif unit == "g":
    factor = 9.80665            # g -> m/s^2
    for c in ["Ax","Ay","Az"]:
        accel_long[c] = accel_long[c].astype(float) * factor
    print("Converted accel: g → m/s²")
else:
    print("No conversion applied (already m/s² or unknown).")

# sanity after conversion
unit2, med2 = guess_accel_unit(accel_long, ("Ax","Ay","Az"))
print(f"[After] accel looks like: {unit2} | median norm ~ {med2:.3f}")


[Unit check] accel looks like: mG | median norm ~ 1002.321
Converted accel: mG → m/s²
[After] accel looks like: m/s^2 | median norm ~ 9.829


# 7) Align accel & gyro by **nearest timestamp** (with tolerance)

> For each accel sample, find the nearest gyro sample within a small time window.
> If there’s no gyro within the window, we drop that accel row.
> We also print **match ratio** and basic **lag** statistics.
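
The matching rule can be sketched on a toy pair of frames. The timestamps and values below are made up; only the `merge_asof` arguments mirror the notebook.

```python
import pandas as pd

t0 = pd.Timestamp("2025-03-24 07:52:19.795", tz="UTC")

# Accel samples at +0, +9, +19 ms; gyro samples at +0, +10, +25 ms.
accel = pd.DataFrame({
    "timestamp": t0 + pd.to_timedelta([0, 9, 19], unit="ms"),
    "Ax": [0.1, 0.2, 0.3],
})
gyro = pd.DataFrame({
    "gyro_ts": t0 + pd.to_timedelta([0, 10, 25], unit="ms"),
    "Gx": [1.0, 2.0, 3.0],
})

matched = pd.merge_asof(
    accel, gyro,
    left_on="timestamp", right_on="gyro_ts",
    direction="nearest",
    tolerance=pd.Timedelta(milliseconds=1),
)

# Lag = accel time minus matched gyro time, in ms (NaN where nothing matched).
lag_ms = (matched["timestamp"] - matched["gyro_ts"]).dt.total_seconds() * 1000
print(matched[["Ax", "Gx"]])
print(lag_ms.tolist())
```

The first two accel samples find a gyro partner at most 1 ms away, so the tolerance is inclusive. The third has its nearest gyro sample 6 ms away, so its gyro columns stay empty and the row would be dropped by the `have_gyro` filter.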

In [29]:
tolerance = pd.Timedelta(milliseconds=MERGE_TOLERANCE_MS)

gyro_for_merge = gyro_long.rename(columns={"timestamp":"gyro_ts"})
merged = pd.merge_asof(
    accel_long.sort_values("timestamp"),
    gyro_for_merge.sort_values("gyro_ts"),
    left_on="timestamp", right_on="gyro_ts",
    direction="nearest", tolerance=tolerance
)

have_gyro = merged[["Gx","Gy","Gz"]].notna().all(axis=1)
match_ratio = have_gyro.mean()
lag_ms = ((merged.loc[have_gyro, "timestamp"].astype("int64")
          - merged.loc[have_gyro, "gyro_ts"].astype("int64")) / 1e6)

print(f"Matched within {MERGE_TOLERANCE_MS} ms: {match_ratio:.1%}")
if len(lag_ms):
    print(f"Lag ms — mean: {lag_ms.mean():.3f}, std: {lag_ms.std():.3f}, max_abs: {np.abs(lag_ms).max():.3f}")
else:
    print("No matched samples. Increase MERGE_TOLERANCE_MS?")

# Keep only matched rows and drop helper column
merged = merged.loc[have_gyro, ["timestamp","Ax","Ay","Az","Gx","Gy","Gz"]].reset_index(drop=True)
display(merged.head(5))

Matched within 1 ms: 95.1%
Lag ms — mean: -0.466, std: 0.499, max_abs: 1.000


Unnamed: 0,timestamp,Ax,Ay,Az,Gx,Gy,Gz
0,2025-03-24 07:52:19.795000+00:00,-0.935731,-12.165463,-10.657298,39.15614,-7.520049,7.881836
1,2025-03-24 07:52:19.804000+00:00,-1.524719,-13.129398,-9.390991,32.96125,-10.77502,7.742001
2,2025-03-24 07:52:19.814000+00:00,-1.610218,-13.639726,-7.941018,22.04125,-9.620019,12.782
3,2025-03-24 07:52:19.832000+00:00,-1.581718,-12.864795,-5.195736,1.041086,-4.790063,2.981754
4,2025-03-24 07:52:19.843000+00:00,-1.220725,-11.929201,-3.484769,-14.56887,-7.275052,1.441816


In [30]:
merged.shape

(345418, 7)

# 8) Resample to **exact 50 Hz** (gap-free)


> We convert the aligned series into a perfect **50 Hz** timeline (every 20 ms).
> We aggregate with `mean` within bins and fill empty bins by time interpolation. No `limit` is set, so gaps of any length are filled, not only tiny ones.
> Finally we add `timestamp_ms` for convenience.
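
The sketch below shows the bin-then-fill behaviour on three made-up samples. The `limit=1` variant is not used in this notebook; it is included only as an assumed option for capping how many consecutive empty bins get filled.

```python
import pandas as pd

t0 = pd.Timestamp("2025-03-24 07:52:19.780", tz="UTC")

# Irregular samples: two fall in the first 20 ms bin, the next two bins are
# empty, and one sample falls in the fourth bin.
ts = t0 + pd.to_timedelta([0, 12, 60], unit="ms")
irregular = pd.Series([1.0, 3.0, 8.0], index=ts, name="Ax")

binned = irregular.resample("20ms").mean()            # mean per bin, NaN if empty
filled = binned.interpolate(method="time")            # fills every empty bin
capped = binned.interpolate(method="time", limit=1)   # fills at most 1 consecutive NaN

print(pd.DataFrame({"binned": binned, "filled": filled, "capped": capped}))
```

With a cap, a long gap is only partly filled and the remaining bins stay NaN, so a later finite-value check would flag them instead of silently accepting interpolated data.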


In [31]:
step_ms = int(round(1000 / TARGET_HZ))
freq = f"{step_ms}ms"

# Ensure numeric (protect against stray objects)
for c in ["Ax","Ay","Az","Gx","Gy","Gz"]:
    merged[c] = pd.to_numeric(merged[c], errors="coerce")

fixed = (merged
         .set_index("timestamp")[["Ax","Ay","Az","Gx","Gy","Gz"]]
         .resample(freq)
         .mean()
         .interpolate(method="time", limit_direction="both")
         .reset_index())

# Add timestamp_ms (int)
fixed["timestamp_ms"] = (fixed["timestamp"].astype("int64") // 10**6).astype("int64")

print("Fixed @", TARGET_HZ, "Hz:", fixed.shape)
display(fixed.head(5))

Fixed @ 50 Hz: (181699, 8)


Unnamed: 0,timestamp,Ax,Ay,Az,Gx,Gy,Gz,timestamp_ms
0,2025-03-24 07:52:19.780000+00:00,-0.935731,-12.165463,-10.657298,39.15614,-7.520049,7.881836,1742802739780
1,2025-03-24 07:52:19.800000+00:00,-1.567468,-13.384562,-8.666005,27.50125,-10.197519,10.262,1742802739800
2,2025-03-24 07:52:19.820000+00:00,-1.581718,-12.864795,-5.195736,1.041086,-4.790063,2.981754,1742802739820
3,2025-03-24 07:52:19.840000+00:00,-1.206475,-11.348005,-2.967611,-21.76137,-7.012552,0.549316,1742802739840
4,2025-03-24 07:52:19.860000+00:00,-1.46772,-7.912796,-1.338808,-44.00368,-6.960001,-3.230397,1742802739860


#### “Exact 50 Hz from the merged data”: how and why the row counts change

**Why resample**

After alignment, the timestamps are irregular (tiny jitters). Models and feature windows are simpler if the timeline is a perfect grid: every 20 ms at 50 Hz.

**How we do it (one line conceptually)**

    fixed = (merged.set_index('timestamp')
                   .resample('20ms')
                   .mean()
                   .interpolate('time')
                   .reset_index())

* `resample('20ms')` bins samples into exact 20 ms slots
* `mean()` combines all samples that land in the same bin
* `interpolate('time')` fills empty bins by linear interpolation over time (no `limit` is set, so long gaps are filled as well as tiny ones)

**Why the row counts look like that**

From the resampled frame:

* Start ms: 1 742 802 739 780
* End ms: 1 742 806 373 740
* Span = 3 633 960 ms = 3 633.96 s

At 50 Hz, the span covers 3 633.96 × 50 = 181 698 steps. Adding one because both ends are included gives 181 699 rows.

That matches the shape after resampling: 181,699 rows and 8 columns (`timestamp`, `timestamp_ms`, `Ax`..`Gz`).

The merged frame had 345,418 rows because before resampling it is irregular and denser than 50 Hz: the raw offsets are roughly 10 ms apart, so about 95 matched samples arrive per second. Resampling regularizes it to the exact 50 Hz grid.
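
The row-count arithmetic can be checked in a few lines. The start and end values and the merged row count are copied from the outputs above; the variable names are illustrative.

```python
# Start/end of the resampled grid in epoch milliseconds (from the text above).
start_ms = 1_742_802_739_780
end_ms = 1_742_806_373_740
step_ms = 1000 // 50            # 20 ms per step at 50 Hz

span_ms = end_ms - start_ms
expected_rows = span_ms // step_ms + 1   # +1 because both ends are on the grid

# Effective rate of the merged (pre-resampling) frame.
merged_rows = 345_418
merged_rate_hz = merged_rows / (span_ms / 1000)

print(span_ms, expected_rows, round(merged_rate_hz, 2))
```

The expected count equals the `(181699, 8)` shape printed after resampling, and the merged frame's effective rate comes out at roughly 95 samples per second, so each 20 ms bin averages about two samples.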


# 9) Sanity checks (order, duplicates, finites, rough ranges)

> Quick guards to catch obvious problems early.

In [32]:
# monotonic timestamps
assert fixed["timestamp"].is_monotonic_increasing, "Timestamps not sorted!"

# duplicates
dupes = fixed["timestamp"].duplicated().sum()
print("Duplicate timestamps:", dupes)

# finite values
assert np.isfinite(fixed[["Ax","Ay","Az","Gx","Gy","Gz"]]).all().all(), "Non-finite values found!"

# rough magnitude checks (tweak if your units differ)
accel_ok = (fixed[["Ax","Ay","Az"]].abs() < 50).all().all()      # ~5g if m/s² (~49)
gyro_ok  = (fixed[["Gx","Gy","Gz"]].abs() < 2000).all().all()    # <2000 deg/s typical
print("Accel magnitude OK:", accel_ok, "| Gyro magnitude OK:", gyro_ok)

duration_s = (fixed["timestamp"].iloc[-1] - fixed["timestamp"].iloc[0]).total_seconds()
print(f"Duration: {duration_s:.2f}s | Rows: {len(fixed)} | Target Hz: {TARGET_HZ}")

Duplicate timestamps: 0
Accel magnitude OK: True | Gyro magnitude OK: True
Duration: 3633.96s | Rows: 181699 | Target Hz: 50


# 10) Save outputs (Parquet + CSV)

> Parquet is fast & typed; CSV is handy to eyeball in Excel.

In [None]:


# PARQUET_PATH = "processed/session_50hz.parquet"
# CSV_PATH     = "processed/session_50hz.csv"

# import os
# os.makedirs(os.path.dirname(PARQUET_PATH), exist_ok=True)

# fixed.to_parquet(PARQUET_PATH, index=False)
# fixed.to_csv(CSV_PATH, index=False)

# print("Saved:", PARQUET_PATH, "|", CSV_PATH)

## EDA


In [None]:
# #Session overview
# import numpy as np, pandas as pd, matplotlib.pyplot as plt
# print("Rows:", len(fixed))
# print("Time range:", fixed["timestamp"].min(), "→", fixed["timestamp"].max())

# # sampling jitter
# dt = fixed["timestamp"].diff().dt.total_seconds().dropna()
# print("Δt stats (s): min", dt.min(), "max", dt.max(), "mean", dt.mean())

# fixed.describe()[["Ax","Ay","Az","Gx","Gy","Gz"]]

In [None]:
# seg = fixed.set_index("timestamp").iloc[0:50*60]
# ax = seg[["Ax","Ay","Az"]].plot(figsize=(12,4), title="Accel (m/s²) — first 60 s"); ax.legend(loc="upper right")
# ax = seg[["Gx","Gy","Gz"]].plot(figsize=(12,4), title="Gyro (deg/s) — first 60 s"); ax.legend(loc="upper right")


In [None]:
# fixed[["Ax","Ay","Az","Gx","Gy","Gz"]].hist(bins=60, figsize=(12,8))

# fixed["A_norm"] = np.sqrt((fixed[["Ax","Ay","Az"]]**2).sum(axis=1))
# fixed["G_norm"] = np.sqrt((fixed[["Gx","Gy","Gz"]]**2).sum(axis=1))
# fixed[["A_norm","G_norm"]].plot(figsize=(12,4), title="Vector norms")


In [None]:
# from scipy.signal import welch
# fs = 50
# f, P = welch(fixed["A_norm"].values, fs=fs, nperseg=1024, noverlap=512)
# plt.figure(figsize=(8,4)); plt.semilogy(f, P); plt.xlim(0,10); plt.xlabel("Hz"); plt.ylabel("Power"); plt.title("PSD of A_norm")


In [None]:
# from matplotlib.pyplot import specgram
# plt.figure(figsize=(10,4))
# plt.specgram(fixed["A_norm"].values, NFFT=256, Fs=fs, noverlap=128)
# plt.ylim(0,10); plt.title("Spectrogram of A_norm"); plt.xlabel("time (s)"); plt.ylabel("Hz")


In [None]:
# corr = fixed[["Ax","Ay","Az","Gx","Gy","Gz"]].corr().round(2)
# print(corr)


In [None]:
# roll = fixed[["Ax","Ay","Az"]].rolling(50*5).agg(["mean","std"])  # 5 s window
# roll.columns = ["_".join(c) for c in roll.columns]
# roll[["Ax_mean","Ax_std"]].plot(figsize=(12,4), title="Accel X — rolling mean/std (5 s)")


### Why these steps matter (in one-liners)

* **Gravity split:** separates posture (slow) from motion (fast) → cleaner features & PSD.
* **Winsorize:** protects features from rare spikes → more stable EDA & training.
* **QC facts:** tiny dashboard numbers → easy to spot bad recordings and enforce data quality.
* **Tidy artifact:** one canonical dataset → consistent across notebooks and .py pipeline.

In [None]:
# '''
# Gravity split (WHAT / WHY / HOW)

# What: split each accel axis into gravity (very slow, posture/orientation) and dynamic (motion you care about).

# Why: models/EDA are more stable on dynamic signals; gravity can mask steps/cadence and inflate low-freq energy.

# How: low-pass filter (≈0.3 Hz) gives gravity; dynamic = raw − gravity.

# '''

# from scipy.signal import butter, filtfilt
# import numpy as np

# fs = 50.0          # Hz
# cut = 0.3          # Hz (very slow = gravity/orientation)
# b, a = butter(2, cut/(fs/2), btype="low")

# for c in ["Ax","Ay","Az"]:
#     grav = filtfilt(b, a, fixed[c].astype(float).values)
#     fixed[c+"_grav"] = grav
#     fixed[c+"_dyn"]  = fixed[c] - grav

# # quick sanity
# fixed["A_dyn_norm"]  = np.sqrt((fixed[[f"{c}_dyn"  for c in ["Ax","Ay","Az"]]]**2).sum(axis=1))
# fixed["A_grav_norm"] = np.sqrt((fixed[[f"{c}_grav" for c in ["Ax","Ay","Az"]]]**2).sum(axis=1))
# print("gravity norm median ~", float(fixed["A_grav_norm"].median()))


In [None]:
# '''
# De-spike / Winsorize (optional but useful)

# What: clip extreme outliers (sensor glitches).

# Why: spikes wreck RMS/energy/PSD and can destabilize training.

# How: clip each channel at very wide percentiles (0.1% / 99.9%) so you keep almost everything.
# '''

# for c in ["Ax","Ay","Az","Gx","Gy","Gz","Ax_dyn","Ay_dyn","Az_dyn"]:
#     if c in fixed.columns:
#         lo, hi = fixed[c].quantile([0.001, 0.999])
#         fixed[c] = fixed[c].clip(lo, hi)
# print("Winsorize: done")
    

In [None]:
# '''
# Quality facts (QC metrics you can log)

# What: tiny set of numbers proving the file is healthy.

# Why: perfect for your thesis pipeline & automation; quick red/green checks.

# How: compute once, print/save.

# '''

# qc = {}
# # cadence grid
# dt = fixed["timestamp"].diff().dt.total_seconds().dropna()
# qc["hz_mean"] = float(1.0 / dt.mean()) if len(dt) else float("nan")
# qc["dupe_ts"] = int(fixed["timestamp"].duplicated().sum())

# # accel gravity check (~9.8 when still)
# fixed["A_norm"] = np.sqrt((fixed[["Ax","Ay","Az"]]**2).sum(axis=1))
# qc["A_norm_median"] = float(fixed["A_norm"].median())

# # gyro sanity (95th percentile)
# fixed["G_norm"] = np.sqrt((fixed[["Gx","Gy","Gz"]]**2).sum(axis=1))
# qc["G_norm_p95"] = float(fixed["G_norm"].quantile(0.95))

# # dynamic energy, if you computed it
# if all(col in fixed.columns for col in ["Ax_dyn","Ay_dyn","Az_dyn"]):
#     fixed["A_dyn_norm"] = np.sqrt((fixed[["Ax_dyn","Ay_dyn","Az_dyn"]]**2).sum(axis=1))
#     qc["A_dyn_norm_median"] = float(fixed["A_dyn_norm"].median())

# print("QC summary:", qc)



In [None]:
# '''
# Save a tidy artifact (so later steps can load it)

# What: persist a clean table you and your future .py pipeline can reuse.

# Why: reproducibility & easy hand-off to modeling.

# How: save Parquet (fast/typed) and CSV (human-readable).
# '''

# import os, datetime as dt
# os.makedirs("processed", exist_ok=True)
# stamp = dt.datetime.now(dt.timezone.utc).strftime("%Y%m%d-%H%M%S")  # utcnow() is deprecated since Python 3.12

# cols = [c for c in fixed.columns]  # keep all; or filter if you like
# parquet_path = f"processed/session_50hz_tidy_{stamp}.parquet"
# csv_path     = f"processed/session_50hz_tidy_{stamp}.csv"

# fixed[cols].to_parquet(parquet_path, index=False)
# fixed[cols].to_csv(csv_path, index=False)

# print("Saved:", parquet_path, "|", csv_path)
