# Notebook 3 – EMG Segmentation and Feature Extraction

This notebook prepares EMG_2 data for machine learning.

We will:

1. Load **filtered EMG_2** data from the `preprocessed` folder.
2. Load **trigger files** from the `data` folder.
3. Use the trigger information to extract **ACTION** segments.
4. For each ACTION segment:
   - Cut the corresponding EMG_2 data from the filtered file.
   - Compute **three features** per EMG channel:
     - Mean  
     - Standard deviation  
     - RMS (root mean square)
5. Build one table (DataFrame) where:
   - Each **row = one ACTION segment** (one task in one trial).
   - Columns = subject ID, trial ID, label (T1–T7), and EMG features.
6. Save the final dataset as a CSV file in the `datasets` folder.
7. Visualize the number of samples per label.


## Mount Drive and Imports

In [16]:

from google.colab import drive
drive.mount('/content/drive')

import os
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Configuration

In this section we define:

- Paths to:
  - Raw data and trigger files
  - Preprocessed EMG_2 data
  - Output dataset folder
- Sampling rate of EMG
- List of subjects to include in the dataset


In [17]:


# Root directories
RAW_ROOT          = "/content/drive/MyDrive/MindRove_Data/data"
PREPROCESSED_ROOT = "/content/drive/MyDrive/MindRove_Data/preprocessed"
DATASET_ROOT      = "/content/drive/MyDrive/MindRove_Data/datasets"

os.makedirs(DATASET_ROOT, exist_ok=True)

# EMG sampling rate (Hz)
FS = 500

# Subjects to include in the dataset
SUBJECTS = ["S04"]  # You can extend this list later, e.g. ["S01", "S02", "S03", "S04"]

print("RAW_ROOT         :", RAW_ROOT)
print("PREPROCESSED_ROOT:", PREPROCESSED_ROOT)
print("DATASET_ROOT     :", DATASET_ROOT)
print("SUBJECTS         :", SUBJECTS)


RAW_ROOT         : /content/drive/MyDrive/MindRove_Data/data
PREPROCESSED_ROOT: /content/drive/MyDrive/MindRove_Data/preprocessed
DATASET_ROOT     : /content/drive/MyDrive/MindRove_Data/datasets
SUBJECTS         : ['S04']
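
Before running the rest of the notebook, it can help to confirm that the per-subject folders implied by this configuration exist. The following is a minimal sketch, assuming the layout described in the helper docstrings (`data/<subject>/triggers/` and `preprocessed/<subject>/EMG/EMG_2/`). The function name `find_missing_dirs` is hypothetical, and the demo runs on a temporary directory rather than the real Drive paths.

```python
import os
import tempfile


def find_missing_dirs(raw_root, preprocessed_root, subjects, emg_device="EMG_2"):
    """Return the expected per-subject directories that do not exist."""
    missing = []
    for subject_id in subjects:
        expected = [
            os.path.join(raw_root, subject_id, "triggers"),
            os.path.join(preprocessed_root, subject_id, "EMG", emg_device),
        ]
        missing.extend(d for d in expected if not os.path.isdir(d))
    return missing


# Demo on a throwaway directory tree (not the real Drive paths).
with tempfile.TemporaryDirectory() as tmp:
    demo_raw = os.path.join(tmp, "data")
    demo_pre = os.path.join(tmp, "preprocessed")
    os.makedirs(os.path.join(demo_raw, "S04", "triggers"))
    # The preprocessed folder is deliberately left out, so it is reported.
    demo_missing = find_missing_dirs(demo_raw, demo_pre, ["S04"])
    print(demo_missing)
```

In the notebook itself, the call would presumably be `find_missing_dirs(RAW_ROOT, PREPROCESSED_ROOT, SUBJECTS)`, and an empty list would mean every expected folder is present.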


## 2. Helper Functions (Paths and Trial IDs)

We define helper functions to:

- Get the trigger folder for a subject.
- Get the preprocessed EMG_2 folder for a subject.
- Extract a clean trial ID (e.g., `S04_T01`) from a file name.
- Extract the **numeric trial number** (e.g., `1` from `T01`) from a file name.


In [18]:


def get_raw_trigger_dir(subject_id):
    """
    Returns the trigger directory for a subject.
    Example: data/S04/triggers/
    """
    return os.path.join(RAW_ROOT, subject_id, "triggers")


def get_preprocessed_emg_dir(subject_id, emg_device="EMG_2"):
    """
    Returns the preprocessed EMG directory for a subject.
    Example: preprocessed/S04/EMG/EMG_2/
    """
    return os.path.join(PREPROCESSED_ROOT, subject_id, "EMG", emg_device)


def extract_clean_trial_id(filename):
    """
    Example:
        'S04_T01_EMG_2_filtered.csv' -> 'S04_T01'
    Used to match EMG files with trigger files.
    """
    base = os.path.splitext(filename)[0]
    parts = base.split("_")
    if len(parts) >= 2:
        return parts[0] + "_" + parts[1]
    return base


def extract_trial_number_from_filename(filename):
    """
    Extract the numeric trial number from a filename.

    Examples:
        'S04_T01_EMG_2_filtered.csv' -> 1
        'S04_T12_EMG_2_filtered.csv' -> 12

    If no 'T##' pattern is found, returns None.
    """
    base = os.path.splitext(filename)[0]
    parts = base.split("_")

    for p in parts:
        # Look for something like 'T01', 'T1', 'T12', etc.
        if p.startswith("T") and p[1:].isdigit():
            return int(p[1:])  # 'T01' -> 1, 'T12' -> 12

    return None
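
As an alternative to splitting on underscores, both values can be pulled out with one regular expression. This is only a sketch: the pattern and the name `parse_trial_filename` are assumptions, and unlike the helpers above it requires the `T##` token to come directly after the subject ID.

```python
import re

# Hypothetical pattern: subject ID, underscore, then 'T' followed by digits.
TRIAL_PATTERN = re.compile(
    r"^(?P<subject>[^_]+)_(?P<token>T(?P<trial>\d+))(?:_|\.|$)"
)


def parse_trial_filename(filename):
    """Return (clean_trial_id, trial_number), or (None, None) if no match."""
    m = TRIAL_PATTERN.match(filename)
    if m is None:
        return None, None
    clean_id = f"{m.group('subject')}_{m.group('token')}"
    return clean_id, int(m.group("trial"))


print(parse_trial_filename("S04_T01_EMG_2_filtered.csv"))  # ('S04_T01', 1)
print(parse_trial_filename("S04_T12_EMG_2_filtered.csv"))  # ('S04_T12', 12)
print(parse_trial_filename("notes.csv"))                   # (None, None)
```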


## 3. Extract ACTION Intervals from Trigger Files

Each trigger file contains several stages, such as:

- `concentration`
- `image`
- `action`
- `relax`
- `rest`
- etc.

We want to extract **only** the intervals where `stage == "action"`.

For each ACTION row:

- `start_ts` = timestamp at the ACTION row  
- `end_ts`   = timestamp of the row that immediately follows the ACTION row (expected to be the `relax` row)  
- `label`    = task label (e.g., `T1`, `T2`, ...)

The function returns a list of tuples:
```python
[(start_ts, end_ts, label), ...]
```

In [19]:


def extract_action_intervals(df_trig):
    """
    Extract a list of ACTION intervals:
        [(start_ts, end_ts, label), ...]

    Assumptions:
    - Time column exists: 'timestamp', 'Timestamp', 'time', or 'Time'
    - Stage column exists: 'stage' or 'Stage'
    - Task/label column: 'task', 'Task', 'label', or 'Label'
      (if not found, we use the stage as the label).
    """
    # Find time column
    time_col = next(
        (c for c in ["timestamp", "Timestamp", "time", "Time"] if c in df_trig.columns),
        None
    )

    # Find stage column
    stage_col = next(
        (c for c in ["stage", "Stage"] if c in df_trig.columns),
        None
    )

    # Find task/label column
    task_col = next(
        (c for c in ["task", "Task", "label", "Label"] if c in df_trig.columns),
        stage_col
    )

    if time_col is None or stage_col is None:
        raise ValueError("Trigger file is missing a time or stage column.")

    # Sort by time just in case
    df_sorted = df_trig.sort_values(by=time_col).reset_index(drop=True)

    intervals = []

    for i, row in df_sorted.iterrows():
        stage_value = str(row[stage_col]).lower()
        if stage_value == "action":
            start_ts = float(row[time_col])

            # End time is the timestamp of the next row (if it exists)
            if i + 1 < len(df_sorted):
                end_ts = float(df_sorted.loc[i + 1, time_col])
                label = str(row[task_col])
                intervals.append((start_ts, end_ts, label))

    return intervals
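
To see what the function produces, it helps to run the same logic on a small trigger table. The sketch below is a vectorized variant using `shift(-1)` to look up the next row's timestamp. The name `action_intervals_vectorized`, the fixed column names, and the toy timestamps are all assumptions for illustration, not real trigger data.

```python
import pandas as pd


def action_intervals_vectorized(df_trig, time_col="timestamp",
                                stage_col="stage", task_col="task"):
    """Return [(start_ts, end_ts, label), ...] for every 'action' row
    that has a following row to supply the end timestamp."""
    df = df_trig.sort_values(by=time_col).reset_index(drop=True)
    next_ts = df[time_col].shift(-1)  # timestamp of the following row
    is_action = df[stage_col].astype(str).str.lower() == "action"
    keep = is_action & next_ts.notna()
    return [
        (float(start), float(end), str(label))
        for start, end, label in zip(
            df.loc[keep, time_col], next_ts[keep], df.loc[keep, task_col]
        )
    ]


# Made-up trigger table: the last 'action' row has no following row.
toy = pd.DataFrame({
    "timestamp": [0.0, 2.0, 4.0, 9.0, 11.0, 13.0],
    "stage": ["concentration", "image", "action", "relax", "image", "action"],
    "task": ["T1", "T1", "T1", "T1", "T2", "T2"],
})
toy_intervals = action_intervals_vectorized(toy)
print(toy_intervals)  # [(4.0, 9.0, 'T1')]
```

Note that an ACTION row at the very end of a file is dropped silently, both here and in the loop version above, because there is no next row to provide `end_ts`.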


## 4. EMG Feature Extraction

For each ACTION segment, we compute **three features** for each filtered EMG channel:

- Mean  
- Standard deviation  
- RMS (root mean square)

Filtered EMG channel columns are expected to be named like:

- `Channel1_filtered`
- `Channel2_filtered`
- ...

The output feature names will be:

- `Channel1_MEAN`, `Channel1_STD`, `Channel1_RMS`
- `Channel2_MEAN`, `Channel2_STD`, `Channel2_RMS`
- etc.
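
These three features are linked: with the population standard deviation (NumPy's default, `ddof=0`), RMS² = mean² + std². For a signal whose mean is close to zero, STD and RMS are therefore almost identical. The sketch below checks this on synthetic noise, which is an assumption standing in for a filtered EMG channel.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = rng.normal(loc=0.0, scale=10.0, size=5000)  # synthetic, not real EMG

mean_val = np.mean(sig)
std_val = np.std(sig)                  # population std (ddof=0)
rms_val = np.sqrt(np.mean(sig ** 2))

# RMS^2 = mean^2 + std^2 holds up to floating-point error.
identity_holds = bool(np.isclose(rms_val ** 2, mean_val ** 2 + std_val ** 2))
print(identity_holds)

# With a DC offset the mean is no longer negligible, so RMS exceeds STD.
sig_offset = sig + 5.0
rms_offset = np.sqrt(np.mean(sig_offset ** 2))
std_offset = np.std(sig_offset)
print(rms_offset > std_offset)
```

If the filtered channels have near-zero means, the `_STD` and `_RMS` columns will carry nearly the same information, which may be worth keeping in mind when selecting features for a model.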


In [20]:

def extract_emg_features(df_seg, emg_cols):
    """
    Compute EMG features for one ACTION segment (no windowing).

    For each channel in emg_cols we compute:
    - Mean
    - Standard deviation
    - RMS
    """
    feats = {}

    for ch in emg_cols:
        sig = df_seg[ch].astype(float).to_numpy()

        mean_val = np.mean(sig)
        std_val  = np.std(sig)
        rms_val  = np.sqrt(np.mean(sig**2))

        base_name = ch.replace("_filtered", "")
        feats[f"{base_name}_MEAN"] = mean_val
        feats[f"{base_name}_STD"]  = std_val
        feats[f"{base_name}_RMS"]  = rms_val

    return feats


## 5. Build the Dataset for One Subject (EMG_2)

For each subject:

1. Find all filtered EMG_2 files, for example:
   - `S04_T01_EMG_2_filtered.csv`
2. For each EMG file (one trial):
   - Extract the real **trial number** from the filename:
     - `S04_T01_EMG_2_filtered.csv` → `trial_id = 1`
   - Build the corresponding trigger file name using `S04_T01`.
   - Load the trigger file and extract ACTION intervals.
   - For each ACTION interval:
     - Cut the EMG_2 samples between `[start_ts, end_ts]`.
     - Compute EMG features for the whole segment.
     - Create one row with:
       - `subject_id`, `trial_id`, `label`, EMG features.


In [21]:


def build_dataset_for_subject(subject_id, emg_device="EMG_2"):
    emg_dir = get_preprocessed_emg_dir(subject_id, emg_device)
    trig_dir = get_raw_trigger_dir(subject_id)

    pattern = os.path.join(emg_dir, f"{subject_id}_T*_{emg_device}_filtered.csv")
    emg_files = sorted(glob.glob(pattern))
    print(f"\nSubject {subject_id}: found {len(emg_files)} EMG files.")

    rows = []

    for emg_path in emg_files:
        fname = os.path.basename(emg_path)

        # Clean trial ID for trigger matching (e.g. 'S04_T01')
        raw_trial_id = extract_clean_trial_id(fname)

        # Extract the numeric trial number (e.g. 1 for T01)
        trial_id = extract_trial_number_from_filename(fname)
        if trial_id is None:
            print(f"⚠ Could not extract trial number from {fname}. Skipping this file.")
            continue

        # Find corresponding trigger file
        # (sorted so that trig_files[0] is deterministic if several files match)
        trig_pattern = os.path.join(trig_dir, f"{raw_trial_id}*triggers*.csv")
        trig_files = sorted(glob.glob(trig_pattern))
        if not trig_files:
            print(f"⚠ No trigger file found for trial {raw_trial_id}")
            continue

        df_trig = pd.read_csv(trig_files[0])
        action_intervals = extract_action_intervals(df_trig)

        if not action_intervals:
            print(f"⚠ No ACTION intervals found in {raw_trial_id}")
            continue

        # Load EMG data
        df_emg = pd.read_csv(emg_path)
        if "Timestamp" not in df_emg.columns:
            print(f"⚠ 'Timestamp' column missing in {fname}, skipping this trial.")
            continue

        ts = df_emg["Timestamp"].astype(float).to_numpy()

        # Filtered EMG channels (e.g. Channel1_filtered, Channel2_filtered, ...)
        emg_cols = [
            c for c in df_emg.columns
            if c.startswith("Channel") and c.endswith("_filtered")
        ]

        if not emg_cols:
            print(f"⚠ No filtered EMG channels found in {fname}, skipping this trial.")
            continue

        # Process each ACTION interval for this trial
        for (start_ts, end_ts, label) in action_intervals:
            mask = (ts >= start_ts) & (ts <= end_ts)
            idx = np.where(mask)[0]

            if len(idx) == 0:
                # No EMG samples in this interval
                continue

            df_seg = df_emg.iloc[idx].reset_index(drop=True)
            feats = extract_emg_features(df_seg, emg_cols)

            row = {
                "subject_id": subject_id,
                "trial_id": trial_id,
                "label": label
            }
            row.update(feats)
            rows.append(row)

    return pd.DataFrame(rows)
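
The boolean mask above scans every timestamp for every interval. If the EMG timestamps are sorted in ascending order (an assumption, not something the notebook checks), the same inclusive range can be found with `np.searchsorted`. The helper name `slice_bounds` and the toy timestamps below are hypothetical.

```python
import numpy as np


def slice_bounds(ts, start_ts, end_ts):
    """Return (lo, hi) so that ts[lo:hi] holds the samples with
    start_ts <= ts <= end_ts. Requires ts sorted in ascending order."""
    lo = int(np.searchsorted(ts, start_ts, side="left"))
    hi = int(np.searchsorted(ts, end_ts, side="right"))
    return lo, hi


ts_demo = np.arange(0.0, 10.0, 0.5)  # made-up timestamps
lo, hi = slice_bounds(ts_demo, 2.0, 4.0)

# Compare with the boolean-mask approach used in the notebook.
mask_idx = np.where((ts_demo >= 2.0) & (ts_demo <= 4.0))[0]
same_selection = np.array_equal(np.arange(lo, hi), mask_idx)
print(lo, hi, same_selection)  # 4 9 True
```

With these bounds, `df_emg.iloc[lo:hi]` would select the same rows as the mask, and an empty interval shows up as `lo == hi`.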


## 6. Build the Full Dataset and Save to CSV

In this step we:

1. Loop over all subjects in the `SUBJECTS` list.
2. Build a dataset for each subject.
3. Concatenate all subject datasets into one DataFrame `df_all`.
4. Print the shape and first rows of the dataset.
5. Save `df_all` to a CSV file:
   - `ABC2026_EMG2_ACTION_SEGMENTS_mean_std_rms.csv`


In [22]:

all_subject_dfs = []

for s in SUBJECTS:
    print("\n==============================")
    print("Processing subject:", s)
    df_subj = build_dataset_for_subject(s)
    if not df_subj.empty:
        all_subject_dfs.append(df_subj)
    else:
        print(f"➜ No valid segments for subject {s}")

if all_subject_dfs:
    df_all = pd.concat(all_subject_dfs, ignore_index=True)

    print("\n Final dataset created.")
    print("Dataset shape (rows, columns):", df_all.shape)
    display(df_all.head())

    out_path = os.path.join(DATASET_ROOT, "ABC2026_EMG2_ACTION_SEGMENTS_mean_std_rms.csv")
    df_all.to_csv(out_path, index=False)

    print("\nSaved dataset to:")
    print(out_path)
else:
    df_all = pd.DataFrame()
    print("\n No data collected. Please check:")
    print(" - Preprocessed EMG_2 files exist.")
    print(" - Trigger files exist.")
    print(" - ACTION intervals are present in the trigger files.")



Processing subject: S04

Subject S04: found 34 EMG files.

 Final dataset created.
Dataset shape (rows, columns): (238, 27)


| | subject_id | trial_id | label | Channel1_MEAN | Channel1_STD | Channel1_RMS | Channel2_MEAN | Channel2_STD | Channel2_RMS | Channel3_MEAN | ... | Channel5_RMS | Channel6_MEAN | Channel6_STD | Channel6_RMS | Channel7_MEAN | Channel7_STD | Channel7_RMS | Channel8_MEAN | Channel8_STD | Channel8_RMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S04 | 1 | T1 | 0.000142 | 16.940679 | 16.940679 | -1.6e-05 | 18.351788 | 18.351788 | -0.000255 | ... | 12.631625 | 0.000182 | 14.375861 | 14.375861 | -0.000318 | 6.141118 | 6.141118 | -0.000209 | 7.492731 | 7.492731 |
| 1 | S04 | 1 | T2 | -0.000632 | 9.308019 | 9.308019 | 6e-06 | 14.219567 | 14.219567 | 0.000682 | ... | 6.375804 | 0.002192 | 10.14874 | 10.148741 | 0.002676 | 17.934855 | 17.934855 | 0.000863 | 8.019464 | 8.019464 |
| 2 | S04 | 1 | T3 | -0.001802 | 4.771755 | 4.771755 | -0.000516 | 4.735977 | 4.735977 | 0.000422 | ... | 2.871941 | 0.001773 | 2.75211 | 2.752111 | 0.000771 | 1.872228 | 1.872228 | 1.2e-05 | 2.811738 | 2.811738 |
| 3 | S04 | 1 | T4 | 0.004231 | 10.426004 | 10.426005 | 0.005728 | 11.293126 | 11.293127 | 0.005445 | ... | 15.493916 | -0.008712 | 23.869243 | 23.869245 | -0.006601 | 14.154035 | 14.154036 | 0.001705 | 4.925717 | 4.925717 |
| 4 | S04 | 1 | T5 | -0.004848 | 7.23241 | 7.232412 | -0.006111 | 9.743597 | 9.743599 | -0.007306 | ... | 7.719175 | -0.00099 | 10.844153 | 10.844153 | -0.000276 | 14.427819 | 14.427819 | -0.0029 | 3.963111 | 3.963112 |



Saved dataset to:
/content/drive/MyDrive/MindRove_Data/datasets/ABC2026_EMG2_ACTION_SEGMENTS_mean_std_rms.csv
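
A quick way to confirm the save step is to read the file back and compare it with the DataFrame in memory. The sketch below does this round trip on a small made-up table in a temporary folder. The values and the file name `demo_dataset.csv` are placeholders, not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Made-up rows with the same kind of columns as the real dataset.
demo = pd.DataFrame({
    "subject_id": ["S04", "S04"],
    "trial_id": [1, 1],
    "label": ["T1", "T2"],
    "Channel1_RMS": [16.9, 9.3],
})

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_dataset.csv")
    demo.to_csv(demo_path, index=False)  # same option as in the notebook
    reloaded = pd.read_csv(demo_path)

shape_matches = reloaded.shape == demo.shape
columns_match = list(reloaded.columns) == list(demo.columns)
print(shape_matches, columns_match)  # True True
```

Writing with `index=False`, as the notebook does, keeps the row index out of the file, so no extra unnamed column appears when the CSV is loaded again.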




To check that everything worked correctly, we print the final dataset shape.


In [24]:

if df_all is None or df_all.empty:
    print("Dataset is empty. Nothing to visualize.")
else:
    # Shape
    print("Dataset shape (rows, columns):", df_all.shape)





Dataset shape (rows, columns): (238, 27)
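
The overview lists visualizing the number of samples per label as a final step, while the cell above only prints the shape. One possible way to do it is sketched below with `value_counts` and a bar chart. The function name `plot_label_counts` and the toy labels are assumptions; in the notebook the call would presumably be `plot_label_counts(df_all)`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_label_counts(df, label_col="label"):
    """Draw a bar chart of rows per label and return the counts."""
    counts = df[label_col].value_counts().sort_index()
    ax = counts.plot(kind="bar")
    ax.set_xlabel("Label")
    ax.set_ylabel("Number of segments")
    ax.set_title("Samples per label")
    plt.tight_layout()
    return counts


# Toy labels for illustration only (not the real dataset).
toy_df = pd.DataFrame({"label": ["T1", "T2", "T1", "T3", "T2", "T1"]})
toy_counts = plot_label_counts(toy_df)
print(toy_counts.to_dict())  # {'T1': 3, 'T2': 2, 'T3': 1}
```

Roughly equal bar heights would indicate a balanced dataset. A label with noticeably fewer segments may point to ACTION intervals that were skipped because no EMG samples fell inside them.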
