# SisFall Preprocessing Pipeline

This notebook preprocesses the SisFall dataset into windowed, normalized arrays for training a fall detection model that is meant to run on the Samsung Galaxy Watch 7.

The pipeline loads the raw sensor files, downsamples them to the watch's sampling rate, extracts the relevant sensor channels, slices each recording into overlapping windows, makes a stratified train/val/test split, normalizes all splits with statistics fitted on the training split, and saves the results.

## Dataset Structure

SisFall includes recordings from 23 young adults (SA01–SA23) and 15 elderly subjects (SE01–SE15). Each subject folder contains one `.txt` file per trial, named by activity code, subject, and trial number (for example `D01_SA01_R01.txt`). The activity code determines the label:

- `D01` – `D19`: daily living activities (ADL, label = 0)
- `F01` – `F15`: simulated fall events (label = 1)

Each file has 9 columns. According to the dataset's README they are, in order: ADXL345 accelerometer (x, y, z), ITG3200 gyroscope (x, y, z), and MMA8451Q accelerometer (x, y, z).
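
The trial files in this notebook's run output have names such as `D01_SA01_R01.txt`, which suggests an activity, subject, and trial pattern. As a minimal sketch under that assumption, the hypothetical helper `parse_trial_name` below (not part of the pipeline) splits such a name into its parts and derives the binary label from the activity prefix.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from names like "D01_SA01_R01.txt".
_TRIAL_RE = re.compile(r"^([DF]\d{2})_(S[AE]\d{2})_R(\d{2})\.txt$", re.IGNORECASE)


class Trial(NamedTuple):
    activity: str  # e.g. "D01" (ADL) or "F05" (fall)
    subject: str   # e.g. "SA01" (young adult) or "SE03" (elderly)
    trial: int     # repetition number
    label: int     # 1 = fall, 0 = ADL


def parse_trial_name(filename: str) -> Optional[Trial]:
    """Return the parsed parts of a trial filename, or None if it does not match."""
    m = _TRIAL_RE.match(filename)
    if m is None:
        return None
    activity, subject, trial = m.group(1).upper(), m.group(2).upper(), int(m.group(3))
    return Trial(activity, subject, trial, label=1 if activity.startswith("F") else 0)


print(parse_trial_name("D01_SA01_R01.txt"))
print(parse_trial_name("Readme.txt"))
```

Returning `None` for names that do not match makes it easy to skip stray files instead of silently labeling them as ADL.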

## Configuration

In [62]:
import os
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pickle

### Paths

Set the raw data directory and the output directory where processed `.npy` files will be saved.

In [63]:
RAW_DIR = Path("datasets/sisfall/raw")
OUT_DIR = Path("datasets/sisfall/processed")
OUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"RAW_DIR: {RAW_DIR.resolve()}")
print(f"exists: {RAW_DIR.exists()}")
if RAW_DIR.exists():
    _sdirs = sorted([d for d in RAW_DIR.iterdir() if d.is_dir()])
    print(f"subject folders: {len(_sdirs)}")
    for d in _sdirs[:5]: print(f"{d.name}")
else:
    print("RAW_DIR does not exist")
print(f"OUT_DIR: {OUT_DIR.resolve()}")

RAW_DIR: C:\Users\lenovo\OneDrive\Desktop\MemoriaHome\preprocessing\datasets\sisfall\raw
exists: True
subject folders: 38
SA01
SA02
SA03
SA04
SA05
OUT_DIR: C:\Users\lenovo\OneDrive\Desktop\MemoriaHome\preprocessing\datasets\sisfall\processed


### Sampling Rate

SisFall was recorded at 200 Hz. The Galaxy Watch 7 accelerometer runs at 50 Hz, so we downsample by a factor of 4 to match the deployment environment.

In [64]:
ORIG_HZ = 200
TARGET_HZ = 50

### Windowing

Each recording is split into fixed-length windows before being fed to the model.

- **Window length**: 3 seconds, long enough to capture the pre-fall motion, impact, and post-fall stillness
- **Overlap**: 50%, a 1.5-second step between windows so falls near a boundary aren't missed

In [65]:
WINDOW_SEC = 3
OVERLAP = 0.5
WINDOW_SIZE = TARGET_HZ * WINDOW_SEC
STEP_SIZE = int(WINDOW_SIZE * (1 - OVERLAP))
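
To make the arithmetic concrete: at 50 Hz a 3-second window holds 150 samples, and a 50% overlap gives a 75-sample step. The sketch below uses a hypothetical helper, `count_windows`, to compute how many full windows a recording of a given length yields. It mirrors the constants in the cell above but is only an illustration, and the durations are example values.

```python
def count_windows(n_samples: int, window_size: int, step: int) -> int:
    """Number of full windows when sliding by `step`; a partial tail is dropped."""
    if n_samples < window_size:
        return 0
    return (n_samples - window_size) // step + 1


target_hz, window_sec, overlap = 50, 3, 0.5
window_size = target_hz * window_sec          # 150 samples
step_size = int(window_size * (1 - overlap))  # 75 samples

for seconds in (12, 25, 100):
    n = seconds * target_hz
    print(f"{seconds:>3} s -> {count_windows(n, window_size, step_size)} windows")
```

The resulting counts (7, 15, and 65) are the same per-file window counts that appear in the pipeline's run output.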

In [66]:
FALL_PREFIX = "F"
ADL_PREFIX = "D"

---

## Processing Functions

### Load raw file

Reads a SisFall `.txt` file, strips the semicolon that terminates each line (it stays attached to the last column after splitting on commas), coerces every value to numeric, and drops rows that fail to parse.

In [67]:
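def load_file(filepath: Path) -> np.ndarray | None:
    """Read one SisFall trial into an (N, 9) array, or return None on failure."""
    try:
        # Fields are comma-separated with optional surrounding whitespace.
        data = pd.read_csv(filepath, header=None, sep=r"\s*,\s*", engine="python")
        # Each line ends with ";", which stays attached to the last field.
        last = data.columns[-1]
        data[last] = data[last].astype(str).str.replace(";", "", regex=False)
        # Unparseable cells become NaN; drop the rows that contain them.
        data = data.apply(pd.to_numeric, errors="coerce").dropna()
        return data.to_numpy()
    except Exception as e:
        print(f"  [WARN] Could not read {filepath.name}: {e}")
        return None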

# debugging (loading the first file)
_test_files = sorted(RAW_DIR.glob("*/*.txt")) if RAW_DIR.exists() else []
if _test_files:
    _sample_raw = load_file(_test_files[0])
    if _sample_raw is not None:
        print(f"File: {'/'.join(_test_files[0].parts[-2:])}")
        print(f"Shape: {_sample_raw.shape}  (expected (N, 9))")
        print(f"Columns: {_sample_raw.shape[1]}")
    else:
        print("None")
else:
    print("No files found")

File: SA01/D01_SA01_R01.txt
Shape: (19999, 9)  (expected (N, 9))
Columns: 9
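
For readers who want to see the cleanup in isolation, here is a dependency-light sketch that parses the same line layout without pandas. The sample text is made up for illustration (real files are much longer), and `parse_sisfall_text` is a hypothetical helper, not a replacement for `load_file`.

```python
import numpy as np

# Illustrative values only; the third line is deliberately malformed.
SAMPLE = """\
 17,-179, -99, -18,-504,-352,  76,-697,-279;
 15,-174, -90, -53,-568,-306,  48,-675,-254;
 oops, this line is malformed;
  1,-176, -81, -84,-613,-271,  -2,-668,-221;
"""


def parse_sisfall_text(text: str, n_cols: int = 9) -> np.ndarray:
    """Parse comma-separated rows that end in ';', skipping rows that fail to parse."""
    rows = []
    for line in text.splitlines():
        fields = line.strip().rstrip(";").split(",")
        if len(fields) != n_cols:
            continue  # wrong column count
        try:
            rows.append([float(f) for f in fields])
        except ValueError:
            continue  # non-numeric field
    return np.array(rows, dtype=float).reshape(-1, n_cols)


parsed = parse_sisfall_text(SAMPLE)
print(parsed.shape)  # (3, 9)
```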


### Downsample

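Takes every 4th sample (integer decimation) to reduce 200 Hz data to 50 Hz. Slicing with a step of 4 keeps ceil(N / 4) rows, which is why the 19,999-row sample below yields 5,000 rows and not 4,999.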

In [68]:
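def downsample(data: np.ndarray, orig_hz: int, target_hz: int) -> np.ndarray:
    # Integer decimation only works when the rates divide evenly.
    if orig_hz % target_hz != 0:
        raise ValueError(f"{orig_hz} Hz is not an integer multiple of {target_hz} Hz")
    factor = orig_hz // target_hz
    return data[::factor]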

# debugging
if "_sample_raw" in dir() and _sample_raw is not None:
    _sample_ds = downsample(_sample_raw, ORIG_HZ, TARGET_HZ)
    print(f"Input: {_sample_raw.shape}")
    print(f"Output: {_sample_ds.shape}  (expected ~{len(_sample_raw)//4} rows)")
else:
    print("No sample to test with")

Input: (19999, 9)
Output: (5000, 9)  (expected ~4999 rows)
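
Because `downsample` only slices, signal content above the new Nyquist frequency (25 Hz) is not filtered out before the rate is reduced. As one possible alternative (an assumption for illustration, not what this notebook does), the sketch below averages each block of `factor` consecutive samples, which acts as a crude boxcar low-pass filter. `decimate_slice` and `decimate_mean` are hypothetical names, and the data is random.

```python
import numpy as np


def decimate_slice(data: np.ndarray, factor: int) -> np.ndarray:
    """Keep every `factor`-th row: ceil(N / factor) rows, no filtering."""
    return data[::factor]


def decimate_mean(data: np.ndarray, factor: int) -> np.ndarray:
    """Average non-overlapping blocks of `factor` rows: floor(N / factor) rows."""
    n_blocks = len(data) // factor
    trimmed = data[: n_blocks * factor]  # drop the incomplete tail block
    return trimmed.reshape(n_blocks, factor, data.shape[1]).mean(axis=1)


rng = np.random.default_rng(0)
raw = rng.normal(size=(19999, 9))  # same shape as the sample file in the output above

print(decimate_slice(raw, 4).shape)  # (5000, 9)
print(decimate_mean(raw, 4).shape)   # (4999, 9)
```

Whichever variant is used for training should also be the one applied to live data, so that the model sees signals with the same frequency content in both settings.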


### Extract accelerometer and gyroscope data

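SisFall records two accelerometers and one gyroscope. According to the dataset's README, the columns are ordered ADXL345 accelerometer (0–2), ITG3200 gyroscope (3–5), MMA8451Q accelerometer (6–8). We use the first accelerometer (columns 0–2) and the gyroscope (columns 3–5).

In [69]:
def extract_data(data: np.ndarray) -> np.ndarray:
    accel = data[:, 0:3]  # first accelerometer (x, y, z)
    gyro  = data[:, 3:6]  # gyroscope (x, y, z); columns 6-8 hold the second accelerometer

    return np.hstack([accel, gyro])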

### Sliding window

Slices a continuous signal into overlapping windows of shape `(num_windows, window_size, channels)`.

In [70]:
def sliding_windows(signal: np.ndarray, window_size: int, step: int) -> np.ndarray:
    windows = []
    for start in range(0, len(signal) - window_size + 1, step):
        windows.append(signal[start : start + window_size])
    return np.array(windows) if windows else np.empty((0, window_size, signal.shape[1]))
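
The same windows can be produced without a Python loop. The sketch below is an alternative under the assumption that NumPy 1.20 or newer is available (for `sliding_window_view`). `sliding_windows_strided` is a hypothetical name, and the loop version is repeated only to check that both agree.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sliding_windows_loop(signal, window_size, step):
    """Reference implementation: same logic as the notebook's loop."""
    windows = [signal[s : s + window_size]
               for s in range(0, len(signal) - window_size + 1, step)]
    return np.array(windows) if windows else np.empty((0, window_size, signal.shape[1]))


def sliding_windows_strided(signal, window_size, step):
    """Vectorized variant; returns shape (num_windows, window_size, channels)."""
    if len(signal) < window_size:
        return np.empty((0, window_size, signal.shape[1]))
    # The view has shape (N - window_size + 1, channels, window_size).
    view = sliding_window_view(signal, window_size, axis=0)
    # Keep every `step`-th start, reorder to (windows, time, channels), and copy
    # so the result is an ordinary writable array.
    return view[::step].transpose(0, 2, 1).copy()


signal = np.arange(600 * 6, dtype=float).reshape(600, 6)
a = sliding_windows_loop(signal, 150, 75)
b = sliding_windows_strided(signal, 150, 75)
print(a.shape, b.shape, np.array_equal(a, b))
```

The explicit length check matters because `sliding_window_view` raises an error when the window is longer than the input, whereas the loop version returns an empty array.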

### Process a single subject

Iterates over all `.txt` files in a subject folder, runs the full pipeline (load, downsample, extract, window), and assigns labels based on the file prefix.

In [71]:
def process_subject(subject_dir: Path):
    all_windows = []
    all_labels = []

    txt_files = sorted(subject_dir.glob("*.txt"))
    print(f"{subject_dir.name}: {len(txt_files)} files")

    for txt_file in txt_files:
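        stem = txt_file.stem.upper()
        if not stem.startswith((FALL_PREFIX, ADL_PREFIX)):
            continue  # not a trial file, so it must not be labeled as ADL by default
        is_fall = stem.startswith(FALL_PREFIX)
        label = 1 if is_fall else 0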

        raw = load_file(txt_file)
        if raw is None:
            print(f"{txt_file.name}: skip (load failed)")
            continue

        if len(raw) < ORIG_HZ:
            print(f"{txt_file.name}: skip (too short, {len(raw)} samples)")
            continue

        downsampled = downsample(raw, ORIG_HZ, TARGET_HZ)
        sensor_data = extract_data(downsampled)

        windows = sliding_windows(sensor_data, WINDOW_SIZE, STEP_SIZE)
        if len(windows) == 0:
            print(f"{txt_file.name}: skip (no windows)")
            continue

        print(f"  {txt_file.name}: {len(windows)} windows ({'fall' if is_fall else 'no-fall'})")
        all_windows.append(windows)
        all_labels.extend([label] * len(windows))

    if not all_windows:
        print(f"{subject_dir.name}: no usable trials found")
        return np.empty((0, WINDOW_SIZE, 6)), np.array([])

    X = np.vstack(all_windows)
    y = np.array(all_labels)
    print(f"{subject_dir.name}: {len(X)} windows ({y.sum()} fall, {(y == 0).sum()} no-fall)")
    return X, y

### Build the full dataset

Iterates over every subject folder in `RAW_DIR`, calls `process_subject` on each, and stacks the results into one window array `X` and a matching label array `y`.

In [72]:
def build_dataset():
    print("SisFall Data Preprocessing")
    print("---")

    all_X, all_y = [], []
    subject_dirs = sorted([d for d in RAW_DIR.iterdir() if d.is_dir()])

    if not subject_dirs:
        raise FileNotFoundError(
            f"No subject folders found in {RAW_DIR}.\n"
            f"Make sure SisFall folders (SA01, SE01, ...) are inside:\n"
            f"  {RAW_DIR.resolve()}"
        )

    for subj in subject_dirs:
        print(f"Processing {subj.name}...", end=" ")
        X, y = process_subject(subj)
        if len(X) == 0:
            print("skipped (no data)")
            continue
        print(f"{len(X)} windows | falls: {y.sum()} | ADLs: {(y==0).sum()}")
        all_X.append(X)
        all_y.append(y)

    X_all = np.vstack(all_X)
    y_all = np.concatenate(all_y)

    print(f"\nTotal windows: {len(X_all)}")
    print(f"Fall windows: {y_all.sum()} ({100*y_all.mean():.1f}%)")
    print(f"ADL windows: {(y_all==0).sum()} ({100*(1-y_all.mean()):.1f}%)")

    return X_all, y_all

### Normalize

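Fits a `StandardScaler` on the training split only, then applies it to val and test. Each array is flattened to `(samples, 6)` before scaling and reshaped back afterwards, so every channel gets a single mean and standard deviation. The scaler is saved to disk so the same transformation can be applied to live Galaxy Watch data at inference time.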

In [73]:
def normalize(X_train, X_val, X_test):
    n_train, n_val, n_test = len(X_train), len(X_val), len(X_test)
    W, C = WINDOW_SIZE, 6

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train.reshape(-1, C)).reshape(n_train, W, C)
    X_val = scaler.transform(X_val.reshape(-1, C)).reshape(n_val, W, C)
    X_test = scaler.transform(X_test.reshape(-1, C)).reshape(n_test, W, C)

    with open(OUT_DIR / "scaler.pkl", "wb") as f:
        pickle.dump(scaler, f)
    print("Scaler saved in datasets/sisfall/processed/scaler.pkl")

    return X_train, X_val, X_test
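
As a sketch of what the flatten, scale, and reshape steps compute, the code below reproduces per-channel standardization with plain NumPy on random data. `fit_channel_stats` and `apply_channel_stats` are hypothetical helpers, and the zero-variance guard is an added assumption.

```python
import numpy as np


def fit_channel_stats(X_train: np.ndarray):
    """Per-channel mean and std over all windows and time steps of the training set."""
    flat = X_train.reshape(-1, X_train.shape[-1])  # (n_windows * window_size, channels)
    mean = flat.mean(axis=0)
    std = flat.std(axis=0)   # population std (ddof=0)
    std[std == 0] = 1.0      # avoid dividing by zero on a constant channel
    return mean, std


def apply_channel_stats(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Broadcasts over the last axis, so (n, W, C) batches and single (W, C) windows both work."""
    return (X - mean) / std


rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=3.0, size=(20, 150, 6))
X_val = rng.normal(loc=5.0, scale=3.0, size=(5, 150, 6))

mean, std = fit_channel_stats(X_train)            # statistics come from training data only
X_train_n = apply_channel_stats(X_train, mean, std)
X_val_n = apply_channel_stats(X_val, mean, std)   # reused, never refitted

print(X_train_n.reshape(-1, 6).mean(axis=0).round(6))
print(X_train_n.reshape(-1, 6).std(axis=0).round(6))
```

Because the transformation reduces to six means and six standard deviations, those twelve numbers are all that has to be carried over to whatever code normalizes a single live window at inference time.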

### Save splits

Writes each array to a `.npy` file in `OUT_DIR`.

In [74]:
def save_splits(X_train, X_val, X_test, y_train, y_val, y_test):
    splits = {
        "X_train": X_train, "y_train": y_train,
        "X_val": X_val, "y_val": y_val,
        "X_test": X_test, "y_test": y_test,
    }
    for name, arr in splits.items():
        path = OUT_DIR / f"{name}.npy"
        np.save(path, arr)
        print(f"  Saved {name}.npy  shape={arr.shape}  dtype={arr.dtype}")
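
A quick round-trip check can catch shape or label mismatches before training starts. The sketch below writes small stand-in arrays to a temporary directory (an assumption for illustration, so that it does not touch `OUT_DIR`) and reloads them with `np.load`. `load_splits` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_splits(directory: Path) -> dict:
    """Load the six arrays stored as <name>.npy in `directory`."""
    names = ["X_train", "y_train", "X_val", "y_val", "X_test", "y_test"]
    return {name: np.load(directory / f"{name}.npy") for name in names}


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Stand-in data with the same layout as the real splits: (windows, 150, 6).
    for split, n in (("train", 8), ("val", 2), ("test", 2)):
        np.save(tmp_dir / f"X_{split}.npy", np.zeros((n, 150, 6)))
        np.save(tmp_dir / f"y_{split}.npy", np.zeros(n, dtype=np.int32))

    splits = load_splits(tmp_dir)

for split in ("train", "val", "test"):
    X_part, y_part = splits[f"X_{split}"], splits[f"y_{split}"]
    assert len(X_part) == len(y_part), f"{split}: X and y lengths differ"
    print(split, X_part.shape, y_part.shape)
```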

---

## Run the Pipeline

### Build dataset

In [75]:
X, y = build_dataset()

SisFall Data Preprocessing
---
Processing SA01... SA01: 154 files
  D01_SA01_R01.txt: 65 windows (no-fall)
  D02_SA01_R01.txt: 65 windows (no-fall)
  D03_SA01_R01.txt: 65 windows (no-fall)
  D04_SA01_R01.txt: 65 windows (no-fall)
  D05_SA01_R01.txt: 15 windows (no-fall)
  D05_SA01_R02.txt: 15 windows (no-fall)
  D05_SA01_R03.txt: 15 windows (no-fall)
  D05_SA01_R04.txt: 15 windows (no-fall)
  D05_SA01_R05.txt: 15 windows (no-fall)
  D06_SA01_R01.txt: 15 windows (no-fall)
  D06_SA01_R02.txt: 15 windows (no-fall)
  D06_SA01_R03.txt: 15 windows (no-fall)
  D06_SA01_R04.txt: 15 windows (no-fall)
  D06_SA01_R05.txt: 15 windows (no-fall)
  D07_SA01_R01.txt: 7 windows (no-fall)
  D07_SA01_R02.txt: 7 windows (no-fall)
  D07_SA01_R03.txt: 7 windows (no-fall)
  D07_SA01_R04.txt: 7 windows (no-fall)
  D07_SA01_R05.txt: 7 windows (no-fall)
  D08_SA01_R01.txt: 7 windows (no-fall)
  D08_SA01_R02.txt: 7 windows (no-fall)
  D08_SA01_R03.txt: 7 windows (no-fall)
  D08_SA01_R04.txt: 7 windows (no-fall)

### Split into train, validation, and test sets

Splits the windows 70/15/15 into train, validation, and test sets, stratified by label so that each split keeps the same fall ratio. The split is made per window, not per subject or per recording.


In [76]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

print(f"\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
print(f"Train falls: {y_train.mean():.1%}, Val falls: {y_val.mean():.1%}, Test falls: {y_test.mean():.1%}")


Train: 33542, Val: 7188, Test: 7188
Train falls: 33.8%, Val falls: 33.8%, Test falls: 33.8%
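
The split above shuffles individual windows, so overlapping windows from one recording can end up on both sides of a split, which may make validation and test scores look better than they would on unseen subjects. One way to avoid that is to split by subject instead, sketched below under the assumption that a per-window subject ID array is kept alongside `X` and `y` (the current pipeline does not build one). `split_by_subject`, `groups`, and the toy arrays are hypothetical.

```python
import numpy as np


def split_by_subject(groups: np.ndarray, val_frac=0.15, test_frac=0.15, seed=42):
    """Return boolean masks (train, val, test) such that no subject spans two splits."""
    subjects = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_frac))
    n_val = max(1, round(len(subjects) * val_frac))
    test_subj = subjects[:n_test]
    val_subj = subjects[n_test : n_test + n_val]

    test_mask = np.isin(groups, test_subj)
    val_mask = np.isin(groups, val_subj)
    train_mask = ~(test_mask | val_mask)
    return train_mask, val_mask, test_mask


# Toy data: 10 subjects with 30 windows each.
groups = np.repeat([f"S{i:02d}" for i in range(10)], 30)
X_toy = np.zeros((len(groups), 150, 6))
y_toy = np.tile([0, 0, 1], len(groups) // 3)

train, val, test = split_by_subject(groups)
X_tr, y_tr = X_toy[train], y_toy[train]
print(train.sum(), val.sum(), test.sum())
```

This variant does not stratify by label, so the fall ratio can differ between splits. With scikit-learn available, `GroupShuffleSplit` and `StratifiedGroupKFold` offer related functionality.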


In [77]:
print("\nNormalizing...")
X_train, X_val, X_test = normalize(X_train, X_val, X_test)


Normalizing...
Scaler saved in datasets/sisfall/processed/scaler.pkl


In [78]:
print("\nSaving splits...")
save_splits(X_train, X_val, X_test, y_train, y_val, y_test)


Saving splits...
  Saved X_train.npy  shape=(33542, 150, 6)  dtype=float64
  Saved y_train.npy  shape=(33542,)  dtype=int32
  Saved X_val.npy  shape=(7188, 150, 6)  dtype=float64
  Saved y_val.npy  shape=(7188,)  dtype=int32
  Saved X_test.npy  shape=(7188, 150, 6)  dtype=float64
  Saved y_test.npy  shape=(7188,)  dtype=int32
