In [1]:
import pandas as pd
import numpy as np
import os

import torch
from torch.utils.data import Dataset, DataLoader

# 1. Dataset Specification: SisFall
**Project:** Hybrid Privacy-Preserving Federated Learning (HPPFL) for Rural-Urban IoMT  
**Dataset Source:** SisFall (Fall and Movement Dataset)  
**Sampling Rate:** 200 Hz  

The dataset consists of **9 raw data columns** collected simultaneously from three separate sensors located on a single device (worn at the waist). The distinct sensors provide "multi-view" data, capturing both high-sensitivity movements (like walking) and high-impact events (like falling).

### Sensor Schema

| Column Index | Variable Name | Sensor Hardware | Sensor Type | Axis | Description |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **0** | `ADXL_x` | **ADXL345** | Accelerometer | X | **High-Range Sensor.** Digital accelerometer with up to $\pm 16g$ range. Ideal for capturing high-impact peaks during falls that might saturate other sensors. |
| **1** | `ADXL_y` | **ADXL345** | Accelerometer | Y | |
| **2** | `ADXL_z` | **ADXL345** | Accelerometer | Z | |
| **3** | `ITG_x` | **ITG3200** | Gyroscope | X | **Rotation Sensor.** Measures angular velocity (spin) up to $\pm 2000^\circ/s$. Critical for distinguishing falls (rotational changes) from lying down. |
| **4** | `ITG_y` | **ITG3200** | Gyroscope | Y | |
| **5** | `ITG_z` | **ITG3200** | Gyroscope | Z | |
| **6** | `MMA_x` | **MMA8451Q** | Accelerometer | X | **High-Sensitivity Sensor.** Digital accelerometer with $\pm 8g$ range and 14-bit resolution. Captures finer details of Activities of Daily Living (ADL) like walking or sitting. |
| **7** | `MMA_y` | **MMA8451Q** | Accelerometer | Y | |
| **8** | `MMA_z` | **MMA8451Q** | Accelerometer | Z | |
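
The raw columns are integer ADC counts, not physical units. Below is a minimal sketch of the conversion, assuming the usual linear scaling `value = counts * 2 * range / 2**bits`. The ranges and the 14-bit MMA8451Q resolution come from the table above. The 13-bit (ADXL345) and 16-bit (ITG3200) resolutions, and the helper name `counts_to_units`, are assumptions that should be checked against the dataset readme and the sensor datasheets.

```python
import numpy as np

# (full-scale range, ADC resolution in bits) per sensor prefix.
# Ranges come from the schema table; the 13- and 16-bit resolutions
# are assumptions, not stated in this notebook.
SENSOR_SCALE = {
    "ADXL": (16.0, 13),    # +/-16 g
    "ITG": (2000.0, 16),   # +/-2000 deg/s
    "MMA": (8.0, 14),      # +/-8 g, 14-bit
}

def counts_to_units(counts, sensor):
    """Convert raw ADC counts to g (accelerometers) or deg/s (gyroscope)."""
    full_scale, bits = SENSOR_SCALE[sensor]
    scale = 2.0 * full_scale / 2 ** bits
    return np.asarray(counts, dtype=np.float32) * scale

# First D01 row of SA01 (ADXL_x, ADXL_y, ADXL_z) as raw counts
adxl_g = counts_to_units([17, -179, -99], "ADXL")
```

The rest of this notebook works on raw counts; this conversion only matters if thresholds or plots need to be expressed in g or deg/s.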

In [2]:
# Column names based on the sensor hardware
col_names = [
    "ADXL_x", "ADXL_y", "ADXL_z",  # Sensor 1: ADXL345 Accelerometer
    "ITG_x",  "ITG_y",  "ITG_z",   # Sensor 2: ITG3200 Gyroscope
    "MMA_x",  "MMA_y",  "MMA_z"    # Sensor 3: MMA8451Q Accelerometer
]

# 2. Activity Labels (ADLs)
**Category:** Activities of Daily Living (ADL)  
**Code Prefix:** D (Daily)

The dataset distinguishes between 19 different types of daily activities. These serve as the "Negative" class (Non-Fall) in the classification task.

| Code | Activity Description | Trials | Duration |
| :--- | :--- | :--- | :--- |
| **D01** | Walking slowly | 1 | 100s |
| **D02** | Walking quickly | 1 | 100s |
| **D03** | Jogging slowly | 1 | 100s |
| **D04** | Jogging quickly | 1 | 100s |
| **D05** | Walking upstairs and downstairs slowly | 5 | 25s |
| **D06** | Walking upstairs and downstairs quickly | 5 | 25s |
| **D07** | Slowly sit in a half height chair, wait, and up slowly | 5 | 12s |
| **D08** | Quickly sit in a half height chair, wait, and up quickly | 5 | 12s |
| **D09** | Slowly sit in a low height chair, wait, and up slowly | 5 | 12s |
| **D10** | Quickly sit in a low height chair, wait, and up quickly | 5 | 12s |
| **D11** | Sitting a moment, trying to get up, and collapse into chair | 5 | 12s |
| **D12** | Sitting a moment, lying slowly, wait, and sit again | 5 | 12s |
| **D13** | Sitting a moment, lying quickly, wait, and sit again | 5 | 12s |
| **D14** | Changing position (back to lateral), wait, and back again | 5 | 12s |
| **D15** | Standing, slowly bending at knees, and getting up | 5 | 12s |
| **D16** | Standing, slowly bending without bending knees, and up | 5 | 12s |
| **D17** | Standing, get into a car, remain seated, and get out | 5 | 25s |
| **D18** | Stumble while walking | 5 | 12s |
| **D19** | Gently jump without falling (reach high object) | 5 | 12s |
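
The durations in the table translate directly into nominal file lengths at the 200 Hz sampling rate. This is a small sketch of that arithmetic; `expected_samples` is an illustrative helper, not part of the dataset tooling.

```python
SAMPLING_RATE_HZ = 200  # stated in the dataset specification above

def expected_samples(duration_s, rate_hz=SAMPLING_RATE_HZ):
    """Nominal number of rows in one trial file of the given duration."""
    return int(duration_s * rate_hz)

for code, duration_s in [("D01", 100), ("D05", 25), ("D07", 12)]:
    print(code, expected_samples(duration_s))
```

The `D01` trial of `SA01` printed later in this notebook has 19999 rows, one short of the nominal 20000, so it is safer not to hard-code these lengths.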

# 3. Fall Labels (Falls)
**Category:** Fall Events  
**Code Prefix:** F (Fall)

These 15 activities represent various fall scenarios (slips, trips, fainting). In a Binary Classification task, these constitute the "Positive" class (Fall Detected).

| Code | Activity Description | Trials | Duration |
| :--- | :--- | :--- | :--- |
| **F01** | Fall forward while walking (slip) | 5 | 15s |
| **F02** | Fall backward while walking (slip) | 5 | 15s |
| **F03** | Lateral fall while walking (slip) | 5 | 15s |
| **F04** | Fall forward while walking (trip) | 5 | 15s |
| **F05** | Fall forward while jogging (trip) | 5 | 15s |
| **F06** | Vertical fall while walking (fainting) | 5 | 15s |
| **F07** | Fall while walking, using hands to dampen impact (fainting) | 5 | 15s |
| **F08** | Fall forward when trying to get up | 5 | 15s |
| **F09** | Lateral fall when trying to get up | 5 | 15s |
| **F10** | Fall forward when trying to sit down | 5 | 15s |
| **F11** | Fall backward when trying to sit down | 5 | 15s |
| **F12** | Lateral fall when trying to sit down | 5 | 15s |
| **F13** | Fall forward while sitting (fainting/asleep) | 5 | 15s |
| **F14** | Fall backward while sitting (fainting/asleep) | 5 | 15s |
| **F15** | Lateral fall while sitting (fainting/asleep) | 5 | 15s |
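
The binary label follows from the code prefix alone. This sketch assumes the filename pattern `<activity>_<subject>_<trial>.txt` (for example `D01_SA01_R01.txt`); `parse_trial_name` and the regular expression are illustrative and stricter than the `split('_')` used in the loader.

```python
import re

# Assumed pattern: D or F plus two digits, SA/SE plus two digits, R plus two digits.
FILENAME_RE = re.compile(r"^([DF]\d{2})_(S[AE]\d{2})_(R\d{2})\.txt$")

def parse_trial_name(filename):
    """Return (activity, subject, trial, is_fall), or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    activity, subject, trial = match.groups()
    # 'F' prefix -> positive class (fall), 'D' prefix -> negative class (ADL)
    is_fall = int(activity.startswith("F"))
    return activity, subject, trial, is_fall

print(parse_trial_name("F05_SA01_R01.txt"))
print(parse_trial_name("Readme.txt"))
```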

# 4. Subject Demographics
**Total Subjects:** 38  
**Groups:** 
* **SA:** Young/Adult Subjects (SA01–SA23)
* **SE:** Elderly Subjects (SE01–SE15)

| Subject ID | Age | Height (cm) | Weight (kg) | Gender |
| :--- | :--- | :--- | :--- | :--- |
| **SA01** | 26|165 |53 |F |
| **SA02** |23 |176 |58.5 |M |
| **SA03** |19 |156 |48 |F |
| **SA04** |23 |170 |72 |M |
| **SA05** |22 |172 |69.5 |M |
| **SA06** |21 |169 |58 |M |
| **SA07** |21 |156 |63 |F |
| **SA08** |21 |149 |41.5 |F |
| **SA09** |24 |165 |64 |M |
| **SA10** |21 |177 |67 |M |
| **SA11** |19 |170 |80.5 |M |
| **SA12** |25 |153 |47 |F |
| **SA13** |22 |157 |55 |F |
| **SA14** |27 |160 |46 |F |
| **SA15** |25 |160 |52 |F |
| **SA16** |20 |169 |61 |F |
| **SA17** |23 |182 |75 |M |
| **SA18** |23 |181 |73 |M |
| **SA19** |30 |170 |76 |M |
| **SA20** |30 |150 |42 |F |
| **SA21** |30 |183 |68 |M |
| **SA22** |19 |158 |50.5 |F |
| **SA23** |24 |156 |48 |F |
| **SE01** |71 |171 |102 |M |
| **SE02** |75 |150 |57 |F |
| **SE03** |62 |150 |51 |F |
| **SE04** |63 |160 |59 |F |
| **SE05** |63 |165 |72 |M |
| **SE06** |60 |163 |79 |M |
| **SE07** |65 |168 |76 |M |
| **SE08** |68 |163 |72 |F |
| **SE09** |66 |167 |65 |M |
| **SE10** |64 |156 |66 |F |
| **SE11** |66 |169 |63 |F |
| **SE12** |69 |164 |56.5 |M |
| **SE13** |65 |171 |72.5 |M |
| **SE14** |67 |163 |58 |M |
| **SE15** |64 |150 |50 |F |

In [3]:
def load_sisfall_for_fl(root_path, select_sensors='acc_only'):
    """
    Loads SisFall data into a client-partitioned dictionary.
    
    Args:
        root_path: Path to the unzipped SisFall folder.
        select_sensors: 'all' or 'acc_only' to save memory.
        
    Returns:
        clients_data: Dict { subject_id: [df, ...] } with one DataFrame per
            trial file, holding the selected sensor columns plus the
            'label' (1 = fall, 0 = ADL) and 'activity' columns.
    """
    
    if select_sensors == 'acc_only':
        use_cols = col_names[0:3]
    else:
        use_cols = col_names

    clients_data = {}

    # Walk through the directory
    # Expected structure: SisFall_dataset/SA01/D01_SA01_R01.txt
    for root, dirs, files in os.walk(root_path):
        for file in files:
            if file.endswith(".txt"):
                
                # 1. Parse Metadata from filename
                # Filename format typically: Activity_Subject_Trial.txt (e.g., D01_SA01_R01.txt)
                try:
                    parts = file.replace('.txt', '').split('_')
                    activity_code = parts[0] # e.g., D01
                    subject_id = parts[1]    # e.g., SA01 (This is our Client)
                except IndexError:
                    continue # Skip readme or unexpected files

                # 2. Load the file
                file_path = os.path.join(root, file)
                
                # SisFall files have no header row, so supply the column names.
                # Rows may end with ';', which makes the last column (MMA_z) parse as text.
                df_temp = pd.read_csv(file_path, header=None, names=col_names)
                # Strip the trailing ';' only when MMA_z is actually kept
                if select_sensors == 'all':
                    last_col = "MMA_z"
                    df_temp[last_col] = df_temp[last_col].astype(str).str.strip(';')
                # Keep the selected columns and convert to float32 to save memory
                df_temp = df_temp[use_cols].astype('float32')
                
                # 3. Labeling fall binary value (1 if fall, 0 ADLs)
                is_fall = 1 if activity_code.startswith('F') else 0
                df_temp['label'] = is_fall
                # Adding activity tag for finding fall for sliding windows
                df_temp['activity'] = activity_code
                
                
                # 4. Append to Client's bucket
                if subject_id not in clients_data:
                    clients_data[subject_id] = []
                
                clients_data[subject_id].append(df_temp)

    return clients_data

# Usage
# fl_data = load_sisfall_for_fl(os.path.join("..", "data", "SisFall_dataset"), select_sensors='acc_only')
# client_sa01_data = fl_data['SA01']

In [4]:
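# Build the path with os.path.join so it is portable and avoids backslash escapes
root_path = os.path.join("..", "data", "SisFall_dataset")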
fl_data = load_sisfall_for_fl(root_path, select_sensors = "acc_only")
client_sa01_data = fl_data['SA01']

In [5]:
client_sa01_data

[       ADXL_x  ADXL_y  ADXL_z  label activity
 0        17.0  -179.0   -99.0      0      D01
 1        15.0  -174.0   -90.0      0      D01
 2         1.0  -176.0   -81.0      0      D01
 3       -10.0  -180.0   -77.0      0      D01
 4       -21.0  -191.0   -63.0      0      D01
 ...       ...     ...     ...    ...      ...
 19994     0.0  -158.0  -109.0      0      D01
 19995   -12.0  -154.0   -96.0      0      D01
 19996   -15.0  -166.0   -84.0      0      D01
 19997   -18.0  -188.0   -76.0      0      D01
 19998   -26.0  -217.0   -66.0      0      D01
 
 [19999 rows x 5 columns],
        ADXL_x  ADXL_y  ADXL_z  label activity
 0         7.0  -267.0     8.0      0      D02
 1        12.0  -259.0   -18.0      0      D02
 2        24.0  -255.0   -32.0      0      D02
 3        34.0  -247.0   -45.0      0      D02
 4        44.0  -248.0   -42.0      0      D02
 ...       ...     ...     ...    ...      ...
 19995    30.0  -319.0  -106.0      0      D02
 19996    60.0  -294.0  -116.0 

## 5. Data Preprocessing Methodology

### 5.1 Sliding Window Segmentation

**Parameters:**
* **Sampling Rate:** Downsampled to `50Hz` (from original 200Hz) to reduce input dimensionality.
* **Window Size:** `2.56 seconds` ($N=128$ samples).
* **Stride:** `25 samples` (0.5 s, roughly 80% overlap) for ADL; Peak-Centered for Falls.
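
The parameters above can be checked with a few lines of arithmetic. This is a sketch only; `window_length` and `count_windows` are illustrative helpers. Stride is left as a parameter: at $N=128$, 64 samples gives 50% overlap, and 25 samples is the 0.5 s stride used for ADL trials in the processing code.

```python
def window_length(seconds, rate_hz):
    """Window length in samples for a given duration and sampling rate."""
    return round(seconds * rate_hz)

def count_windows(n_samples, window, stride):
    """Number of full windows that fit in a trial of n_samples."""
    if n_samples < window:
        return 0
    return (n_samples - window) // stride + 1

N = window_length(2.56, 50)        # 2.56 s at 50 Hz
trial_len = 12 * 50                # a 12 s ADL trial after downsampling to 50 Hz
print(N, count_windows(trial_len, N, 64), count_windows(trial_len, N, 25))
```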



### 5.2 Strategy for Resource Constraints
While some literature suggests long windows (8-12s) to capture "pre-fall" instability, we utilize a compact **2.56-second window**.
* **Memory Efficiency:** Reduces the input tensor size, minimizing RAM usage on microcontrollers.
* **Latency:** Allows for rapid inference (detecting the fall within seconds of impact) rather than waiting for a long buffer to fill.
* **Sensor Selection:** We utilize **Accelerometer-only** data (`ADXL_x`, `ADXL_y`, `ADXL_z`). Gyroscopic data is excluded to simulate low-power hardware constraints (accelerometers consume $\approx$10-100$\mu$A vs. gyroscopes $\approx$3-5mA).
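
This back-of-the-envelope check illustrates the memory argument. The comparison case (10 s, 200 Hz, all 9 channels) is hypothetical, picked from within the 8-12 s range mentioned above, and float32 storage (4 bytes per value) is an assumption; `window_bytes` is an illustrative helper.

```python
def window_bytes(seconds, rate_hz, channels, bytes_per_value=4):
    """Size in bytes of one input window stored as float32."""
    return round(seconds * rate_hz) * channels * bytes_per_value

compact = window_bytes(2.56, 50, 3)   # 128 samples x 3 accelerometer axes
long_raw = window_bytes(10, 200, 9)   # hypothetical long window at the raw rate
print(compact, long_raw, long_raw / compact)
```

Under these assumptions the compact window is about 1.5 KB per inference, which is the kind of buffer that fits comfortably in microcontroller RAM.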

### 5.3 Handling Class Imbalance: Peak-Centered Extraction
The SisFall dataset, like real-world scenarios, is heavily imbalanced (far more ADL data than Fall data). Furthermore, a 15-second "Fall" file contains mostly non-fall data (standing before, lying after).

To prevent label noise and "Negative Transfer," we adopt a **Peak-Centering Strategy** for training:
1.  **Calculate SVM:** Compute the Signal Vector Magnitude (SVM) ($\sqrt{x^2+y^2+z^2}$) to locate the impact spike.
2.  **Center-Crop:** Extract windows centered specifically on the impact moment.
3.  **Discard Silence:** Non-impact segments of fall files are discarded so that the model does not learn the pre-fall activity, or the subject lying still afterwards, as a feature of falling.
4.  **Downsampling ADLs:** To combat class imbalance, ADL windows are pooled and randomly subsampled (without replacement) to at most three times the number of fall windows, matching `target_adls` in `process_trials_individually`.
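
The peak-centering steps can be illustrated on synthetic data before applying them to real trials. This is a minimal sketch under stated assumptions: the signal below is made up (a flat baseline with one artificial spike), and `peak_centered_window` is an illustrative helper that omits the jitter and balancing used in the full pipeline.

```python
import numpy as np

def peak_centered_window(xyz, window_size=128):
    """Return the window centred on the SVM peak, or None if it would run off either end."""
    svm = np.sqrt(np.sum(xyz.astype(np.float64) ** 2, axis=1))
    peak = int(np.argmax(svm))
    half = window_size // 2
    start, stop = peak - half, peak + half
    if start < 0 or stop > len(xyz):
        return None
    return xyz[start:stop]

# Synthetic trial: 750 samples (15 s at 50 Hz), resting baseline on one axis,
# with an artificial impact spike at index 400.
rng = np.random.default_rng(0)
trial = np.zeros((750, 3))
trial[:, 1] = -256 + rng.normal(0, 2, size=750)
trial[400] = [900, -1200, 700]

window = peak_centered_window(trial)
print(window.shape)  # the spike sits at row window_size // 2 of the window
```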

In [6]:
def process_trials_individually(trials_list, window_size=128):
    """
    Optimized for IoT: Downsamples to 50Hz and uses smaller windows.
    """
    # Find radius for window center at SVM spike
    radius = window_size // 2
    
    processed_windows = []
    labels = []

    for df in trials_list:
        # === DOWNSAMPLE ===
        # Decimate from 200 Hz to 50 Hz by keeping every 4th sample
        # (no anti-aliasing filter is applied)
        df_small = df.iloc[::4, :].reset_index(drop=True)
        
        act_code = df_small['activity'].iloc[0]
        sensor_data = df_small[['ADXL_x', 'ADXL_y', 'ADXL_z']].values

        # === FALL PROCESSING ===
        if act_code.startswith('F'):
            # Compute SVM on the downsampled data
            svm = np.sqrt(np.sum(sensor_data**2, axis=1))
            # Index of the SVM maximum, taken as the impact moment
            impact_idx = np.argmax(svm)

            # Shift by approx +/- 0.05 seconds (Time Jitter)
            # 0.05s * 50Hz = 2.5 samples -> round to 3
            jitter = 3
            shifts = [-jitter, 0, jitter] 

            for shift in shifts:
                # Shift the center using jitter
                center = impact_idx + shift
                # Bounds check
                if center - radius >= 0 and center + radius < len(df_small):
                    window = sensor_data[center - radius : center + radius]
                    processed_windows.append(window)
                    labels.append(1)

        # === ADL PROCESSING (Pooled Random) ===
        else:
            # Stride of 25 samples (0.5 seconds)
            stride = 25 
            for i in range(0, len(df_small) - window_size, stride):
                window = sensor_data[i : i + window_size]
                # We defer selection until later (Pooled Random Strategy)
                processed_windows.append(window)
                labels.append(0)

    # Convert to Arrays
    X_all = np.array(processed_windows)
    y_all = np.array(labels)
    
    # === STEP 4: BALANCE THE DATA ===
    # Separate Falls and ADLs
    falls_idx = np.where(y_all == 1)[0]
    adls_idx = np.where(y_all == 0)[0]
    
    # Cap ADLs at 3x the number of fall windows
    # (a client with no fall windows therefore ends up with no data)
    target_adls = len(falls_idx) * 3
    
    if len(adls_idx) > target_adls:
        selected_adls = np.random.choice(adls_idx, target_adls, replace=False)
    else:
        selected_adls = adls_idx
        
    final_indices = np.concatenate([falls_idx, selected_adls])
    np.random.shuffle(final_indices)
    
    return X_all[final_indices], y_all[final_indices]

In [7]:
X_sa01, y_sa01 = process_trials_individually(client_sa01_data)
y_sa01

array([0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,

In [8]:
def create_dataloader(X, y):
    """
    Wraps numpy arrays into PyTorch DataLoader.
    """
    class SisFallDataset(Dataset):
        def __init__(self, X, y):
            self.X = torch.from_numpy(X).float()
            self.y = torch.from_numpy(y).float()
        def __len__(self): return len(self.X)
        def __getitem__(self, idx): 
            # Transpose [Time, Channel] -> [Channel, Time] for CNN
            return self.X[idx].transpose(0, 1), self.y[idx]

    dataset = SisFallDataset(X, y)
    return DataLoader(dataset, batch_size = 32, shuffle=True)

In [14]:
def complete_dataload(root_path):
    """
    Orchestrates the entire loading and processing pipeline.
    Returns:
        pretrain_loader: DataLoader for centralized pre-training.
        fl_loaders: Dict { 'SA06': DataLoader, ... } for Federated Learning.
    """
    
    # Client Ids to put in pretrain set
    pretrain_clients = ['SA01', 'SA02', 'SA03', 'SA04', 'SA05']
    
    # Load Raw Data
    print("Step 1: Loading raw SisFall data into memory...")
    raw_data_dict = load_sisfall_for_fl(root_path)
    
    pretrain_trials = []
    fl_loaders = {}

    print("Step 2: Processing clients...")
    
    # Iterate through every subject found
    for client_id, trials in raw_data_dict.items():
        
        # Pre-train vs Federated Client
        if client_id in pretrain_clients:
            # Collect all trials for later centralized processing
            pretrain_trials.extend(trials)
        else:
            # Process this client immediately for FL
            X_client, y_client = process_trials_individually(trials)
            
            # Only create loader if valid data exists
            if len(X_client) > 0:
                fl_loaders[client_id] = create_dataloader(X_client, y_client)
                
    # Process the collected Pre-train Data
    print(f"Step 3: Finalizing Pre-train set ({len(pretrain_clients)} clients)...")
    X_pre, y_pre = process_trials_individually(pretrain_trials)
    pretrain_loader = create_dataloader(X_pre, y_pre)

    print("\n=== READY ===")
    print(f"Pre-train Set: {len(pretrain_loader.dataset)} windows")
    print(f"FL Clients: {len(fl_loaders)} active clients")
    
    return pretrain_loader, fl_loaders

In [15]:
complete_dataload(root_path)

Step 1: Loading raw SisFall data into memory...
Step 2: Processing clients...
Step 3: Finalizing Pre-train set (5 clients)...

=== READY ===
Pre-train Set: 4500 windows
FL Clients: 19 active clients


(<torch.utils.data.dataloader.DataLoader at 0x25e1ea78850>,
 {'SA06': <torch.utils.data.dataloader.DataLoader at 0x25e17787150>,
  'SA07': <torch.utils.data.dataloader.DataLoader at 0x25e7f1af110>,
  'SA08': <torch.utils.data.dataloader.DataLoader at 0x25e07d75590>,
  'SA09': <torch.utils.data.dataloader.DataLoader at 0x25e082aee50>,
  'SA10': <torch.utils.data.dataloader.DataLoader at 0x25e6e557590>,
  'SA11': <torch.utils.data.dataloader.DataLoader at 0x25e07dc6050>,
  'SA12': <torch.utils.data.dataloader.DataLoader at 0x25e5eee6110>,
  'SA13': <torch.utils.data.dataloader.DataLoader at 0x25e082f5b90>,
  'SA14': <torch.utils.data.dataloader.DataLoader at 0x25e178a7310>,
  'SA15': <torch.utils.data.dataloader.DataLoader at 0x25e71507350>,
  'SA16': <torch.utils.data.dataloader.DataLoader at 0x25e1790d6d0>,
  'SA17': <torch.utils.data.dataloader.DataLoader at 0x25e7f16f910>,
  'SA18': <torch.utils.data.dataloader.DataLoader at 0x25e4132d050>,
  'SA19': <torch.utils.data.dataloader.Data