# ABC2026 – Notebook 1  
# Data Loading, Verification & Structuring (EEG, EMG, IMU)


---

## Objectives
1. Automatically load *all subjects* and *all trials* from the raw MindRove dataset.  
2. Validate the structure and pairing of raw CSV files and trigger files.  
3. Split each trial into **six clean data streams**:
   - **EEG_1** → Device 1 (Lucid EEG headset), Channels 1–6 (µV)
   - **EMG_2** → Device 2 (EMG armband 1), Channels 1–8 (µV)
   - **EMG_3** → Device 3 (EMG armband 2), Channels 1–8 (µV)
   - **IMU_1** → Device 1 IMU (Gyro + Acc)
   - **IMU_2** → Device 2 IMU
   - **IMU_3** → Device 3 IMU

4. Save each stream as a clean CSV in `MindRove_Data/structured_data/Sxx/by_device/`.



---
## Sampling Rate
All MindRove devices record at:


FS = 500 Hz
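
At this rate consecutive samples are 2 ms apart, so a row count converts to a duration by dividing by `FS`. A minimal sketch of that arithmetic follows; the helper names `samples_to_seconds` and `seconds_to_samples` are illustrative and not part of the notebook.

```python
FS = 500  # Hz, sampling rate stated above

def samples_to_seconds(n_samples: int, fs: float = FS) -> float:
    """Duration (in seconds) covered by n_samples at sampling rate fs."""
    return n_samples / fs

def seconds_to_samples(duration_s: float, fs: float = FS) -> int:
    """Number of samples expected in duration_s seconds, rounded to an integer."""
    return int(round(duration_s * fs))

print(1 / FS)                     # sample period in seconds
print(samples_to_seconds(15000))  # 30.0
print(seconds_to_samples(2.5))    # 1250
```

Comparing the expected sample count with the number of rows per device is a quick way to spot dropped samples, assuming the nominal rate holds for the recording.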


---
## EEG Channel–Electrode Mapping

The six recorded EEG channels are mapped to the following scalp electrode positions, named according to the 10–20 system (C1 and C2 belong to its extended 10–10 layout):

- **Ch 1 → Fp1**
- **Ch 2 → Fp2**
- **Ch 3 → C1**
- **Ch 4 → C2**
- **Ch 5 → O1**
- **Ch 6 → O2**
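
The mapping above can be kept as a dictionary and applied to an EEG stream with `DataFrame.rename`. This is a small sketch, assuming the stream uses the `Channel1`..`Channel6` column names produced later in this notebook; `EEG_CHANNEL_MAP` and `rename_eeg_channels` are hypothetical names, and the demo values are made up.

```python
import pandas as pd

# Channel -> electrode mapping copied from the list above
EEG_CHANNEL_MAP = {
    "Channel1": "Fp1",
    "Channel2": "Fp2",
    "Channel3": "C1",
    "Channel4": "C2",
    "Channel5": "O1",
    "Channel6": "O2",
}

def rename_eeg_channels(df_eeg: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with ChannelN columns renamed to electrode labels."""
    return df_eeg.rename(columns=EEG_CHANNEL_MAP)

# Tiny synthetic EEG frame (values are placeholders, not real data)
demo = pd.DataFrame(
    {"Timestamp": [0.0, 0.002], **{f"Channel{i}": [0.0, 1.0] for i in range(1, 7)}}
)
print(list(rename_eeg_channels(demo).columns))
# ['Timestamp', 'Fp1', 'Fp2', 'C1', 'C2', 'O1', 'O2']
```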


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## **Paths & Configuration**

In [None]:
import os
import glob
import io

import numpy as np
import pandas as pd

# === PATHS ===
PROJECT_ROOT   = "/content/drive/MyDrive/ABC2026"
DATA_ROOT      = "/content/drive/MyDrive/MindRove_Data/data"
STRUCTURED_ROOT = "/content/drive/MyDrive/MindRove_Data/structured_data"

# Sampling rate
FS = 500  # Hz

# Device naming
DEVICE_ROLES = {
    1: "EEG_1",
    2: "EMG_2",
    3: "EMG_3",
}

IMU_ROLES = {
    1: "IMU_1",
    2: "IMU_2",
    3: "IMU_3",
}

print("PROJECT_ROOT  :", PROJECT_ROOT)
print("DATA_ROOT     :", DATA_ROOT)
print("STRUCTURED_ROOT:", STRUCTURED_ROOT)
print("\nDevice roles:")
for d, name in DEVICE_ROLES.items():
    print(f"  Device {d} → {name}")
for d, name in IMU_ROLES.items():
    print(f"  Device {d} → {name}")


PROJECT_ROOT  : /content/drive/MyDrive/ABC2026
DATA_ROOT     : /content/drive/MyDrive/MindRove_Data/data
STRUCTURED_ROOT: /content/drive/MyDrive/MindRove_Data/structured_data

Device roles:
  Device 1 → EEG_1
  Device 2 → EMG_2
  Device 3 → EMG_3
  Device 1 → IMU_1
  Device 2 → IMU_2
  Device 3 → IMU_3


## **Utility: List Subjects**

In [None]:
def list_subjects(data_root: str):
    """Return sorted list of subject IDs, e.g. ['S01','S02','S04']."""
    subjects = [
        d for d in os.listdir(data_root)
        if d.startswith("S") and os.path.isdir(os.path.join(data_root, d))
    ]
    return sorted(subjects)

subjects = list_subjects(DATA_ROOT)
print("Detected subjects:", subjects)


Detected subjects: ['S04']


## **File Discovery Functions**

In [None]:
def get_subject_paths(subject_id: str):
    """Return subject root, raw folder, and trigger folder."""
    subj_root = os.path.join(DATA_ROOT, subject_id)
    raw_dir   = os.path.join(subj_root, "raw")
    trig_dir  = os.path.join(subj_root, "triggers")
    return subj_root, raw_dir, trig_dir

def list_trials_for_subject(subject_id: str):
    """
    List raw CSV files and trigger CSV files for a subject.
    raw:      Sxx_T*.csv
    triggers: Sxx_T*triggers*.csv
    """
    _, raw_dir, trig_dir = get_subject_paths(subject_id)

    raw_files = sorted(glob.glob(os.path.join(raw_dir,  f"{subject_id}_T*.csv")))
    trig_files = sorted(glob.glob(os.path.join(trig_dir, f"{subject_id}_T*triggers*.csv")))

    return raw_files, trig_files

# Quick check:
for sid in subjects:
    raw_files, trig_files = list_trials_for_subject(sid)
    print(f"{sid}: {len(raw_files)} raw, {len(trig_files)} triggers")


S04: 34 raw, 34 triggers
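
Equal counts do not guarantee that every raw file has a trigger file for the same trial. One possible pairing check, matching files on their `Sxx_Tyy` prefix, is sketched below. The function name `pair_raw_and_triggers` and the trigger filename in the demo are assumptions for illustration; only the `Sxx_T*triggers*.csv` pattern is taken from the notebook.

```python
import os
import re

TRIAL_RE = re.compile(r"^(S\d+_T\d+)")  # e.g. 'S04_T01' at the start of the name

def index_by_trial(paths):
    """Map 'Sxx_Tyy' -> path, skipping names that do not match the pattern."""
    out = {}
    for p in paths:
        m = TRIAL_RE.match(os.path.basename(p))
        if m:
            out[m.group(1)] = p
    return out

def pair_raw_and_triggers(raw_files, trig_files):
    """Return (paired, trials_missing_trigger, trials_missing_raw)."""
    raw_by_key = index_by_trial(raw_files)
    trig_by_key = index_by_trial(trig_files)
    common = sorted(raw_by_key.keys() & trig_by_key.keys())
    paired = {k: (raw_by_key[k], trig_by_key[k]) for k in common}
    missing_trig = sorted(raw_by_key.keys() - trig_by_key.keys())
    missing_raw = sorted(trig_by_key.keys() - raw_by_key.keys())
    return paired, missing_trig, missing_raw

# Demo: the trigger filename below is hypothetical
raw_demo = ["raw/S04_T01_29_11_25_19_27_24.csv", "raw/S04_T02_29_11_25_19_28_42.csv"]
trig_demo = ["triggers/S04_T01_triggers.csv"]
paired, missing_trig, missing_raw = pair_raw_and_triggers(raw_demo, trig_demo)
print(sorted(paired), missing_trig, missing_raw)
# ['S04_T01'] ['S04_T02'] []
```

In the notebook, the two lists returned by `list_trials_for_subject` could be passed in directly, and any non-empty "missing" list reported before splitting.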


## **MindRove CSV loader**

In [None]:
# Clean header we expect in MindRove CSV (simple format)
HEADER_SIMPLE = [
    "Device number",
    "Channel1", "Channel2", "Channel3", "Channel4",
    "Channel5", "Channel6", "Channel7", "Channel8",
    "GyroX", "GyroY", "GyroZ",
    "AccX", "AccY", "AccZ",
    "Timestamp",
]

def load_mindrove_csv_simple(path: str) -> pd.DataFrame:
    """
    Load a MindRove CSV and ensure a clean header:

        Device number, Channel1..Channel8, GyroX/Y/Z, AccX/Y/Z, Timestamp

    - Skips initial 'Device ...' metadata lines.
    - Forces the header to HEADER_SIMPLE.
    - Parses as tab-separated ('\\t').
    """
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        lines = f.read().splitlines()

    # Skip 'Device ...' lines
    start = 0
    while start < len(lines) and lines[start].startswith("Device "):
        start += 1

    if start >= len(lines):
        raise ValueError(f"No header line found in file: {path}")

    # Replace header line with our clean header
    lines[start] = "\t".join(HEADER_SIMPLE)

    text = "\n".join(lines[start:])

    # Read as tab-separated
    df = pd.read_csv(io.StringIO(text), sep="\t")

    # Enforce column names if lengths match
    if len(df.columns) == len(HEADER_SIMPLE):
        df.columns = HEADER_SIMPLE

    # Convert numeric columns
    for col in HEADER_SIMPLE:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    # Drop fully empty rows
    df = df.dropna(how="all")

    return df

# Quick sanity check on first trial of first subject (optional)
if subjects:
    test_sid = subjects[0]
    raw_files, _ = list_trials_for_subject(test_sid)
    if raw_files:
        df_test = load_mindrove_csv_simple(raw_files[0])
        print("Test file:", os.path.basename(raw_files[0]))
        print("Columns:", list(df_test.columns))
        print("Device numbers:", df_test["Device number"].unique())


Test file: S04_T01_29_11_25_19_27_24.csv
Columns: ['Device number', 'Channel1', 'Channel2', 'Channel3', 'Channel4', 'Channel5', 'Channel6', 'Channel7', 'Channel8', 'GyroX', 'GyroY', 'GyroZ', 'AccX', 'AccY', 'AccZ', 'Timestamp']
Device numbers: [2 3 1]
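
The device order `[2 3 1]` shows that rows from the three devices are interleaved in a single file. A quick per-device summary can confirm that each device contributed samples and that timestamps parsed correctly. The sketch below uses a small synthetic frame; `summarize_devices` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def summarize_devices(df: pd.DataFrame) -> pd.DataFrame:
    """Row count and missing-timestamp count per device number."""
    grouped = df.groupby("Device number")["Timestamp"]
    return pd.DataFrame({
        "n_rows": grouped.size(),
        "n_missing_ts": grouped.apply(lambda s: int(s.isna().sum())),
    })

# Synthetic interleaved rows (placeholder values, not real recordings)
demo = pd.DataFrame({
    "Device number": [2, 3, 1, 2, 3, 1, 2],
    "Timestamp": [0.0, 0.0, 0.0, 0.002, None, 0.002, 0.004],
})
summary = summarize_devices(demo)
print(summary)
```

On a real trial, large differences in `n_rows` between devices or a non-zero `n_missing_ts` would be worth inspecting before the streams are split.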


## **Split into 6 streams (EEG_1, EMG_2, EMG_3, IMU_1–3)**

In [None]:
EXG_COLS = [f"Channel{i}" for i in range(1, 9)]
IMU_COLS = ["GyroX", "GyroY", "GyroZ", "AccX", "AccY", "AccZ"]

def split_devices(df_raw: pd.DataFrame):
    """
    Split MindRove mixed CSV into 6 logical streams:

      EEG_1: Device 1, Timestamp + Channel1–Channel6
      EMG_2: Device 2, Timestamp + Channel1–Channel8
      EMG_3: Device 3, Timestamp + Channel1–Channel8
      IMU_1: Device 1, Timestamp + GyroX/Y/Z + AccX/Y/Z
      IMU_2: Device 2, same IMU cols
      IMU_3: Device 3, same IMU cols

    Returns:
      dict: { 'EEG_1': df, 'EMG_2': df, ..., 'IMU_3': df }
    """
    streams = {}

    for dev_num, role_name in DEVICE_ROLES.items():
        df_dev = df_raw[df_raw["Device number"] == dev_num].copy()
        if df_dev.empty:
            print(f"Warning: device {dev_num} has no samples.")
            continue

        # Ensure numeric for safety
        for col in EXG_COLS + IMU_COLS + ["Timestamp"]:
            if col in df_dev.columns:
                df_dev[col] = pd.to_numeric(df_dev[col], errors="coerce")

        # EXG stream
        if role_name == "EEG_1":
            # EEG_1: only Channel1–Channel6
            exg_cols = ["Timestamp"] + [f"Channel{i}" for i in range(1, 7)]
        else:
            # EMG_2 and EMG_3: all 8 channels
            exg_cols = ["Timestamp"] + EXG_COLS

        exg_df = df_dev[exg_cols].dropna(subset=["Timestamp"])
        streams[role_name] = exg_df

        # IMU stream for this device
        imu_role = IMU_ROLES[dev_num]
        imu_cols = ["Timestamp"] + IMU_COLS
        imu_df = df_dev[imu_cols].dropna(subset=["Timestamp"])
        streams[imu_role] = imu_df

    return streams


## **Helper to get output folder**

In [None]:
def get_structured_device_dir(subject_id: str):
    """
    Ensure and return:
      MindRove_Data/structured_data/Sxx/by_device/
    """
    out_dir = os.path.join(STRUCTURED_ROOT, subject_id, "by_device")
    os.makedirs(out_dir, exist_ok=True)
    return out_dir


In [None]:
def extract_clean_trial_id(filename: str) -> str:
    """
    Convert a full raw filename like:
        'S04_T01_29_11_25_19_27_24.csv'
    into a clean trial ID:
        'S04_T01'

    Assumes the base name starts with:
        <SubjectID>_<TrialID>_...
    e.g., S04_T01_...
    """
    base = os.path.splitext(filename)[0]  # remove .csv
    parts = base.split("_")
    if len(parts) >= 2:
        clean = parts[0] + "_" + parts[1]  # e.g. S04 + T01
    else:
        # Fallback: use the whole base name if unexpected format
        clean = base
    return clean
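
The fallback above silently returns the whole base name for an unexpected filename. Where a misnamed file should stop the run instead, a stricter regex-based variant is one option. This is a sketch under the assumption that subject and trial IDs are `S` and `T` followed by digits; `parse_trial_id` is a hypothetical name.

```python
import os
import re

# Subject and trial IDs assumed to look like 'S04' and 'T01'
TRIAL_ID_RE = re.compile(r"^(S\d+)_(T\d+)(?:_|$)")

def parse_trial_id(filename: str):
    """Return (subject_id, trial_id) or raise ValueError on an unexpected name."""
    base = os.path.splitext(os.path.basename(filename))[0]
    m = TRIAL_ID_RE.match(base)
    if m is None:
        raise ValueError(f"Unexpected trial filename: {filename}")
    return m.group(1), m.group(2)

print(parse_trial_id("S04_T01_29_11_25_19_27_24.csv"))  # ('S04', 'T01')

try:
    parse_trial_id("notes.csv")
except ValueError as err:
    print("Rejected:", err)
```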


## **Process All Subjects & All Trials**

In [None]:
for subject_id in subjects:
    print("\n============================")
    print(f"Processing subject: {subject_id}")
    print("============================")

    raw_files, trig_files = list_trials_for_subject(subject_id)
    print(f"  Raw trials:     {len(raw_files)}")
    print(f"  Trigger trials: {len(trig_files)}")

    out_dir = get_structured_device_dir(subject_id)

    for raw_path in raw_files:
        # Original filename, e.g. 'S04_T01_29_11_25_19_27_24.csv'
        filename = os.path.basename(raw_path)

        # Clean trial ID, e.g. 'S04_T01'
        clean_trial = extract_clean_trial_id(filename)

        print(f"  → Splitting {clean_trial} (from {filename})")

        # Load raw MindRove CSV with our custom loader
        df_raw = load_mindrove_csv_simple(raw_path)

        # Split into 6 streams: EEG_1, EMG_2, EMG_3, IMU_1, IMU_2, IMU_3
        streams = split_devices(df_raw)

        # Save each stream as a clean CSV
        for stream_name, df_stream in streams.items():
            # Example: S04_T01_EEG_1_raw.csv
            out_name = f"{clean_trial}_{stream_name}_raw.csv"
            out_path = os.path.join(out_dir, out_name)
            df_stream.to_csv(out_path, index=False)

    print(f"✓ Finished subject {subject_id}")




Processing subject: S04
  Raw trials:     34
  Trigger trials: 34
  → Splitting S04_T01 (from S04_T01_29_11_25_19_27_24.csv)
  → Splitting S04_T02 (from S04_T02_29_11_25_19_28_42.csv)
  → Splitting S04_T03 (from S04_T03_29_11_25_19_30_00.csv)
  → Splitting S04_T04 (from S04_T04_29_11_25_19_31_18.csv)
  → Splitting S04_T05 (from S04_T05_29_11_25_19_32_36.csv)
  → Splitting S04_T06 (from S04_T06_29_11_25_19_33_54.csv)
  → Splitting S04_T07 (from S04_T07_29_11_25_19_35_12.csv)
  → Splitting S04_T08 (from S04_T08_29_11_25_19_36_30.csv)
  → Splitting S04_T09 (from S04_T09_29_11_25_19_37_48.csv)
  → Splitting S04_T10 (from S04_T10_29_11_25_19_39_06.csv)
  → Splitting S04_T11 (from S04_T11_29_11_25_19_40_24.csv)
  → Splitting S04_T12 (from S04_T12_29_11_25_19_41_42.csv)
  → Splitting S04_T13 (from S04_T13_29_11_25_19_43_00.csv)
  → Splitting S04_T14 (from S04_T14_29_11_25_19_44_18.csv)
  → Splitting S04_T15 (from S04_T15_29_11_25_19_45_36.csv)
  → Splitting S04_T16 (from S04_T16_29_11_25_19_

## **Quick check of output (optional)**

In [None]:
# Check one output folder
if subjects:
    check_sid = subjects[0]
    check_dir = get_structured_device_dir(check_sid)
    print("Example structured folder:", check_dir)
    print("Files inside:")
    for fname in sorted(os.listdir(check_dir))[:10]:
        print(" ", fname)


Example structured folder: /content/drive/MyDrive/MindRove_Data/structured_data/S04/by_device
Files inside:
  S04_T01_EEG_1_raw.csv
  S04_T01_EMG_2_raw.csv
  S04_T01_EMG_3_raw.csv
  S04_T01_IMU_1_raw.csv
  S04_T01_IMU_2_raw.csv
  S04_T01_IMU_3_raw.csv
  S04_T02_EEG_1_raw.csv
  S04_T02_EMG_2_raw.csv
  S04_T02_EMG_3_raw.csv
  S04_T02_IMU_1_raw.csv
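
Because `split_devices` skips any device with no samples, a trial can end up with fewer than six output files. A possible completeness check over the output filenames is sketched below; `EXPECTED_STREAMS` and `find_incomplete_trials` are hypothetical names, and the demo list is just the ten filenames printed above.

```python
from collections import defaultdict

EXPECTED_STREAMS = {"EEG_1", "EMG_2", "EMG_3", "IMU_1", "IMU_2", "IMU_3"}
SUFFIX = "_raw.csv"

def find_incomplete_trials(filenames):
    """Map trial ID -> sorted list of stream names with no output file."""
    found = defaultdict(set)
    for name in filenames:
        if not name.endswith(SUFFIX):
            continue
        parts = name[: -len(SUFFIX)].split("_")  # e.g. ['S04', 'T01', 'EEG', '1']
        if len(parts) != 4:
            continue  # not in the Sxx_Tyy_<STREAM>_<n>_raw.csv form
        trial = "_".join(parts[:2])
        stream = "_".join(parts[2:])
        found[trial].add(stream)
    return {
        trial: sorted(EXPECTED_STREAMS - streams)
        for trial, streams in found.items()
        if EXPECTED_STREAMS - streams
    }

listed = [
    "S04_T01_EEG_1_raw.csv", "S04_T01_EMG_2_raw.csv", "S04_T01_EMG_3_raw.csv",
    "S04_T01_IMU_1_raw.csv", "S04_T01_IMU_2_raw.csv", "S04_T01_IMU_3_raw.csv",
    "S04_T02_EEG_1_raw.csv", "S04_T02_EMG_2_raw.csv", "S04_T02_EMG_3_raw.csv",
    "S04_T02_IMU_1_raw.csv",
]
print(find_incomplete_trials(listed))
# {'S04_T02': ['IMU_2', 'IMU_3']}
```

T02 only appears incomplete here because the listing above is cut off after ten files; run on the full output of `os.listdir(check_dir)`, an empty result would indicate that every trial has all six streams.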
