# [DRAFT] Feature Engineering

This notebook demonstrates a feature engineering workflow for the `discrete_gestures` task of the `generic-neuromotor-interface` dataset:
- Load event-based EMG data and align it with the corpus metadata
- Extract a window of EMG around each gesture event and compute per-channel features
- Save the resulting feature table for machine learning and analysis

## Import and setup

In [1]:
import os
import glob
import numpy as np
import pandas as pd
from tqdm import tqdm  # For optional progress bar

# Import data loader (source: generic_neuromotor_interface/explore_data/load.py)
from generic_neuromotor_interface.explore_data.load import load_data

## Set data paths and load metadata

Edit paths below if your data is in a different location.

In [2]:
# EDIT HERE: Set your data folder if needed
DATA_FOLDER = os.path.expanduser("~/emg_data/")

# Path to the metadata CSV
corpus_csv = os.path.join(DATA_FOLDER, "discrete_gestures_corpus.csv")
corpus_df = pd.read_csv(corpus_csv)

# Preview the metadata
corpus_df.head()

|  | start | end | split | dataset_number | user_number | dataset |
|---|---|---|---|---|---|---|
| 0 | 1632778000.0 | 1632778000.0 | train | 0 | 17 | discrete_gestures_user_017_dataset_000.hdf5 |
| 1 | 1632777000.0 | 1632778000.0 | train | 0 | 17 | discrete_gestures_user_017_dataset_000.hdf5 |
| 2 | 1632778000.0 | 1632778000.0 | train | 0 | 17 | discrete_gestures_user_017_dataset_000.hdf5 |
| 3 | 1632776000.0 | 1632776000.0 | train | 0 | 17 | discrete_gestures_user_017_dataset_000.hdf5 |
| 4 | 1632776000.0 | 1632776000.0 | train | 0 | 17 | discrete_gestures_user_017_dataset_000.hdf5 |
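
Before extracting features, it can help to check how the metadata rows are distributed across splits. The sketch below uses a small hypothetical stand-in for `corpus_df` (the values are invented) and assumes that `start` and `end` are timestamps in seconds, which is what the preview suggests.

```python
import pandas as pd

# Hypothetical stand-in for corpus_df; the values are invented for illustration.
demo_corpus = pd.DataFrame({
    "start": [100.0, 400.0, 100.0],
    "end": [400.0, 500.0, 250.0],
    "split": ["train", "val", "test"],
    "dataset_number": [0, 0, 0],
    "user_number": [17, 17, 2],
})

# Segment duration, assuming start/end are timestamps in seconds.
demo_corpus["duration_s"] = demo_corpus["end"] - demo_corpus["start"]

# Number of metadata rows per split, and total covered time per split.
rows_per_split = demo_corpus["split"].value_counts().to_dict()
seconds_per_split = demo_corpus.groupby("split")["duration_s"].sum().to_dict()

print(rows_per_split)
print(seconds_per_split)
```

The same two lines can be run on the real `corpus_df` to see whether any split is unexpectedly small.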


## Find all HDF5 files for the task

Collect every `.hdf5` file in `DATA_FOLDER` whose file name contains the task name `discrete_gestures`.

In [3]:
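def get_task_dataset_paths(task: str) -> list:
    """
    Returns a sorted list of all HDF5 files in DATA_FOLDER for a given task.
    Source: Adapted from explore_data.ipynb
    """
    folder = os.path.expanduser(DATA_FOLDER)
    datasets = glob.glob(os.path.join(folder, '*.hdf5'))
    # Match on the file name only, so a folder whose path contains the task name
    # does not cause every file to match
    return sorted(d for d in datasets if task in os.path.basename(d))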

# Get all discrete_gestures .hdf5 files
files = get_task_dataset_paths("discrete_gestures")
print(f"Found {len(files)} files")

Found 3 files
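
The file names carry the user and dataset numbers, as the `dataset` column of the metadata shows. A regular expression is one way to pull them out and to reject names that do not fit. The pattern and the helper `parse_dataset_filename` below are assumptions inferred from the file names shown in this notebook, not part of the dataset's tooling.

```python
import re

# Pattern assumed from names like "discrete_gestures_user_017_dataset_000.hdf5".
FILENAME_RE = re.compile(
    r"^(?P<task>.+)_user_(?P<user>\d+)_dataset_(?P<dataset>\d+)\.hdf5$"
)

def parse_dataset_filename(basename: str):
    """Return (task, user_number, dataset_number) as strings, or None if no match."""
    m = FILENAME_RE.match(basename)
    if m is None:
        return None
    return m.group("task"), m.group("user"), m.group("dataset")

parsed = parse_dataset_filename("discrete_gestures_user_017_dataset_000.hdf5")
print(parsed)
```

Returning `None` for an unexpected name makes it easy to skip stray files instead of failing on an index error.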


## Set window parameters

Set the time window, in seconds relative to each gesture event, from which EMG is extracted.

In [4]:
# EDIT HERE: Change window size (in seconds) around each gesture event
WINDOW = [-0.5, 0.5]  # e.g., [-0.5, 0.5] = 0.5 seconds before and after the gesture
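
To get a feel for how much data each window holds, the window length can be converted to a sample count. This is a rough sketch: the 2 kHz sampling rate is an assumption to verify against your recordings, and `window_to_samples` is a hypothetical helper.

```python
# Assumption: 2 kHz sampling rate; check the actual rate of your recordings.
ASSUMED_SAMPLE_RATE_HZ = 2000
DEMO_WINDOW = [-0.5, 0.5]  # same convention as WINDOW: seconds relative to the event

def window_to_samples(window, sample_rate_hz):
    """Approximate number of samples covered by a [start, end] window in seconds."""
    start_s, end_s = window
    if end_s <= start_s:
        raise ValueError("window end must be greater than window start")
    return int(round((end_s - start_s) * sample_rate_hz))

n_samples = window_to_samples(DEMO_WINDOW, ASSUMED_SAMPLE_RATE_HZ)
print(n_samples)
```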

## Feature engineering loop

Process each file and extract features for each gesture event:
- For each event, the loop finds the metadata row whose `start`-`end` interval contains the event time; that row supplies the split (train, val, or test).
- It extracts a window of EMG data around the event.
- It calculates three simple features for each channel: root mean square (RMS), maximum absolute value (Max Abs), and mean absolute value (MAV).

**We can add more features in the marked section below!**
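
The three features can be checked in isolation on a tiny array before running the full loop. The sketch below computes them for all channels at once with `axis=0`; `basic_emg_features` is a hypothetical helper, and it assumes the window has shape `(n_samples, n_channels)`, as the loop in this notebook does.

```python
import numpy as np

def basic_emg_features(emg_window: np.ndarray) -> dict:
    """Per-channel RMS, max abs, and MAV for a (n_samples, n_channels) window."""
    rms = np.sqrt(np.mean(emg_window ** 2, axis=0))   # root mean square
    maxabs = np.max(np.abs(emg_window), axis=0)       # maximum absolute value
    mav = np.mean(np.abs(emg_window), axis=0)         # mean absolute value
    features = {}
    for ch in range(emg_window.shape[1]):
        features[f"emg{ch:02d}_rms"] = float(rms[ch])
        features[f"emg{ch:02d}_maxabs"] = float(maxabs[ch])
        features[f"emg{ch:02d}_mav"] = float(mav[ch])
    return features

# Two samples, two channels: channel 0 is [3, -4], channel 1 is [-1, 1].
demo_window = np.array([[3.0, -1.0], [-4.0, 1.0]])
demo_feats = basic_emg_features(demo_window)
print(demo_feats)
```

For channel 0 the RMS is sqrt((9 + 16) / 2), the max abs is 4, and the MAV is 3.5, which makes the result easy to verify by hand.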

In [5]:
all_features = []  # List to hold all feature dictionaries

for file in tqdm(files, desc="Processing HDF5 files"):
    basename = os.path.basename(file)
    
    # Extract user_number and dataset_number from filename
    parts = basename.split('_')
    user_number = parts[3]  # e.g., '064'
    dataset_number = parts[5].split('.')[0]  # e.g., '000'

    # Filter corpus_df for all rows matching this file (could be multiple segments/splits)
    file_corpus = corpus_df[
        (corpus_df['user_number'].astype(str).str.zfill(3) == user_number) &
        (corpus_df['dataset_number'].astype(str).str.zfill(3) == dataset_number)
    ]

    if file_corpus.empty:
        print(f"Warning: No metadata found for {basename}. Skipping.")
        continue

    # Load the data object for this file
    data = load_data(file)
    prompts = data.prompts  # DataFrame of gesture events (name, time)

    for _, gesture in prompts.iterrows():
        center_time = gesture['time']
        gesture_name = gesture['name']

        # Find the metadata row whose [start, end] interval contains this event
        match = file_corpus[
            (file_corpus['start'] <= center_time) & (center_time <= file_corpus['end'])
        ]
        if match.empty:
            # Event is not covered by any metadata row for this file; skip it
            continue
        # If several rows match, use the first one
        gt_row = match.iloc[0]
        
        # Extract metadata columns for this event
        start = gt_row['start']
        end = gt_row['end']
        split = gt_row['split']
        dataset_num_str = str(gt_row['dataset_number']).zfill(3)
        user_num_str = str(gt_row['user_number']).zfill(3)
        dataset = basename

        # Extract EMG window around the event
        timeseries = data.partition(
            start_t=center_time + WINDOW[0],
            end_t=center_time + WINDOW[1]
        )
        emg_window = timeseries["emg"]

        if emg_window is None or emg_window.shape[0] == 0:
            continue

        # === EDIT HERE: FEATURE ENGINEERING ===
        feature_dict = {
            "start": start,
            "end": end,
            "split": split,
            "dataset_number": dataset_num_str,
            "user_number": user_num_str,
            "dataset": dataset,
            "gesture_name": gesture_name,
            "event_time": center_time,
        }
        for ch in range(emg_window.shape[1]):
            signal = emg_window[:, ch]
            # --- Basic features ---
            feature_dict[f"emg{ch:02d}_rms"] = np.sqrt(np.mean(signal ** 2))
            feature_dict[f"emg{ch:02d}_maxabs"] = np.max(np.abs(signal))
            feature_dict[f"emg{ch:02d}_mav"] = np.mean(np.abs(signal))
            # --- ADD OUR OWN FEATURES BELOW ---
            # Example: feature_dict[f"emg{ch:02d}_var"] = np.var(signal)
            # Example: feature_dict[f"emg{ch:02d}_median"] = np.median(signal)
            # Example: feature_dict[f"emg{ch:02d}_iqr"] = np.percentile(signal, 75) - np.percentile(signal, 25)
        # === END FEATURE ENGINEERING ===

        all_features.append(feature_dict)

Processing HDF5 files: 100%|██████████████████████| 3/3 [00:03<00:00,  1.15s/it]
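
The loop above matches each event to a metadata row with a boolean filter, one event at a time. As an alternative sketch, `pd.IntervalIndex` can match all event times in one call. This assumes the intervals for a file do not overlap (otherwise `get_indexer` raises an error); the segment and event values below are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical, non-overlapping segments and event times (values invented).
segments = pd.DataFrame({
    "start": [0.0, 20.0],
    "end": [10.0, 30.0],
    "split": ["train", "test"],
})
event_times = np.array([5.0, 15.0, 25.0])

# closed="both" mirrors the start <= t <= end test used in the loop.
intervals = pd.IntervalIndex.from_arrays(
    segments["start"], segments["end"], closed="both"
)

# Position of the matching segment for each event, or -1 if none contains it.
row_idx = intervals.get_indexer(event_times)

matched = row_idx >= 0
event_splits = segments["split"].to_numpy()[row_idx[matched]]
print(event_times[matched].tolist(), event_splits.tolist())
```

Events with index `-1` correspond to the ones the loop skips because no metadata row contains them.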


## Create and preview final dataframe

Let's see what our engineered features look like!

In [6]:
final_df = pd.DataFrame(all_features)
print(f"Extracted features for {len(final_df)} events")
final_df.head()

Extracted features for 4813 events


|  | start | end | split | dataset_number | user_number | dataset | gesture_name | event_time | emg00_rms | emg00_maxabs | ... | emg12_mav | emg13_rms | emg13_maxabs | emg13_mav | emg14_rms | emg14_maxabs | emg14_mav | emg15_rms | emg15_maxabs | emg15_mav |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1634052000.0 | 1634052000.0 | test | 0 | 2 | discrete_gestures_user_002_dataset_000.hdf5 | middle_press | 1634052000.0 | 6.853332 | 33.638134 | ... | 10.979843 | 13.95608 | 77.955933 | 10.060468 | 13.525366 | 96.488487 | 9.452885 | 17.779005 | 122.072777 | 12.418376 |
| 1 | 1634052000.0 | 1634052000.0 | test | 0 | 2 | discrete_gestures_user_002_dataset_000.hdf5 | middle_release | 1634052000.0 | 7.059992 | 28.632969 | ... | 11.448546 | 14.911323 | 64.275093 | 11.098847 | 14.476486 | 96.488487 | 10.195408 | 19.819395 | 122.072777 | 14.009955 |
| 2 | 1634052000.0 | 1634052000.0 | test | 0 | 2 | discrete_gestures_user_002_dataset_000.hdf5 | index_press | 1634052000.0 | 6.249473 | 41.418812 | ... | 12.970555 | 17.146723 | 85.592896 | 11.769621 | 13.678231 | 55.669769 | 9.801596 | 15.025372 | 67.491455 | 11.10543 |
| 3 | 1634052000.0 | 1634052000.0 | test | 0 | 2 | discrete_gestures_user_002_dataset_000.hdf5 | index_release | 1634052000.0 | 6.284244 | 41.418812 | ... | 13.203725 | 17.413801 | 85.592896 | 11.947546 | 13.864525 | 55.669769 | 9.900255 | 15.109432 | 67.491455 | 11.071752 |
| 4 | 1634052000.0 | 1634052000.0 | test | 0 | 2 | discrete_gestures_user_002_dataset_000.hdf5 | index_press | 1634052000.0 | 4.817219 | 20.79574 | ... | 9.937219 | 11.708973 | 54.664139 | 8.663325 | 10.207128 | 54.765289 | 7.386766 | 13.278831 | 96.434753 | 9.225086 |
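
A quick sanity check on the feature table is to count events per gesture and split, and to look for missing feature values. The sketch below runs on a small hypothetical frame with the same column naming as `final_df`; the numbers are invented.

```python
import pandas as pd

# Hypothetical miniature of final_df; values are invented for illustration.
demo_features = pd.DataFrame({
    "split": ["train", "train", "test"],
    "gesture_name": ["index_press", "index_release", "index_press"],
    "emg00_rms": [6.2, 6.3, 4.8],
    "emg00_mav": [4.1, None, 3.3],
})

# Feature columns follow the "emgNN_<feature>" naming used in the loop.
feature_cols = [c for c in demo_features.columns if c.startswith("emg")]

# Events per gesture and split; combinations that never occur show up as 0.
counts = pd.crosstab(demo_features["gesture_name"], demo_features["split"])

# Total number of missing feature values across all feature columns.
n_missing = int(demo_features[feature_cols].isna().sum().sum())

print(counts)
print(f"Missing feature values: {n_missing}")
```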


## Save features to csv

Change output filename as desired.

In [7]:
# EDIT HERE: Change output filename if desired
output_csv = os.path.join(DATA_FOLDER, "discrete_gestures_event_features.csv")
final_df.to_csv(output_csv, index=False)
print(f"Output saved to: {output_csv}")

Output saved to: /Users/sero/emg_data/discrete_gestures_event_features.csv
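
One thing to watch when reading the CSV back: zero-padded identifiers such as `"002"` are parsed as integers by default, so the padding is lost. The sketch below shows the effect on an in-memory CSV and one way around it with the `dtype` argument of `pd.read_csv`; the demo frame is invented.

```python
import io
import pandas as pd

# Hypothetical one-row frame with zero-padded string identifiers.
demo = pd.DataFrame({
    "user_number": ["002"],
    "dataset_number": ["000"],
    "emg00_rms": [6.85],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default parsing turns "002" into the integer 2.
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Forcing string dtype keeps the zero padding.
buffer.seek(0)
as_strings = pd.read_csv(
    buffer, dtype={"user_number": str, "dataset_number": str}
)

print(as_default["user_number"].iloc[0], as_strings["user_number"].iloc[0])
```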


## Possible next steps

- Explore and visualize the features, e.g., using seaborn or matplotlib
- Try different window sizes or add more features
- Use the features for machine learning or statistical analysis
- Try domain-specific ideas, e.g., frequency-domain or cross-channel features
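
As a starting point for the frequency-domain idea, the sketch below computes a mean frequency (the power-weighted average frequency) for one channel. It is an illustration under assumptions: `mean_frequency` is a hypothetical helper, the 2 kHz sampling rate should be checked against your data, and the input is a synthetic sine wave rather than real EMG.

```python
import numpy as np

# Assumption: 2 kHz sampling rate; check the actual rate of your recordings.
ASSUMED_SAMPLE_RATE_HZ = 2000

def mean_frequency(signal: np.ndarray, sample_rate_hz: float) -> float:
    """Power-weighted mean frequency (Hz) of a 1-D signal."""
    power = np.abs(np.fft.rfft(signal)) ** 2            # one-sided power spectrum
    freqs = np.fft.rfftfreq(signal.shape[0], d=1.0 / sample_rate_hz)
    total = power.sum()
    if total == 0:
        return 0.0  # all-zero signal: no meaningful frequency content
    return float((freqs * power).sum() / total)

# Synthetic check: one second of a 50 Hz sine should give about 50 Hz.
t = np.arange(ASSUMED_SAMPLE_RATE_HZ) / ASSUMED_SAMPLE_RATE_HZ
demo_signal = np.sin(2 * np.pi * 50.0 * t)
demo_mf = mean_frequency(demo_signal, ASSUMED_SAMPLE_RATE_HZ)
print(round(demo_mf, 3))
```

Inside the feature loop this could be stored per channel, for example under a key such as `f"emg{ch:02d}_meanfreq"`, next to the time-domain features.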