Enrich datasets with modular architecture and metadata catalog #868

Merged
bruAristimunha merged 44 commits into develop from enrich-bnci
Jan 29, 2026
Collaborator

@bruAristimunha bruAristimunha commented Jan 14, 2026

Summary

This PR reorganizes MOABB's BNCI datasets into a modular subpackage, adds a metadata catalog covering all MOABB datasets, and fixes several dataset loading bugs.

BNCI Dataset Refactoring

  • Reorganize monolithic bnci.py into modular bnci/ subpackage split by year
  • Add shared BNCIBaseDataset class and utility functions
  • Preserve backward compatibility with legacy imports

Comprehensive Metadata System

  • Add metadata/ module with EEGDash-compatible dataclass schema
  • Create catalog with verified metadata for all MOABB datasets:
    • Institution names and countries
    • DOI identifiers
    • Sampling rates and channel configurations
    • Participant demographics

Dataset Loading Fixes

  • Wang2016: Add on_missing="ignore" for non-standard montage channels
  • Sosulski2019: Update to new freidok download endpoint
  • Liu2024: Fix electrode position parsing with make_dig_montage()

Bug Fixes & Validation

  • Fix channel counts for multiple datasets after validating them against the actual data
  • Correct DOI metadata inconsistencies
  • Standardize BNCI2016_002 events for P300 paradigm compatibility
  • Fix BNCI2024_001 and BNCI2022_001 file naming and data loading

Test Plan

  • BNCI datasets load correctly with new subpackage structure
  • Legacy imports work (backward compatibility)
  • Metadata catalog returns correct information for all datasets
  • Wang2016, Sosulski2019, Liu2024 load without errors
  • All tests pass (pytest moabb/tests/)
  • Documentation builds successfully

This commit refactors the BNCI dataset implementation to improve code
quality and ensure proper BIDS conversion:

**Code Quality Improvements:**
- Remove generic post-processing loop from _get_single_subject_data()
- Create _finalize_raw() helper function for consistent metadata handling
- Incorporate finalization logic into each dataset reader function
- Remove unused montage variables from conversion functions

**BIDS Compliance:**
- Ensure montage is set before BIDS cache conversion
- Add dataset-specific years to _dataset_years class attribute
- Guarantee proper measurement dates for all BNCI datasets
- Ensure subject IDs are set for BIDS compliance

**Configuration:**
- Add "ALS" (Amyotrophic Lateral Sclerosis) to codespell ignore list
- Add clarifying comment for ALS medical abbreviation

**Documentation:**
- Update What's New with all enhancements, bug fixes, and code improvements

Verified that the montage is preserved when using the BIDS cache
mechanism: all channel positions match exactly (distance = 0.0).

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7c535a69b

bruAristimunha and others added 27 commits January 19, 2026 00:50
Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>
Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>
- Create moabb/datasets/bnci/ subpackage for cleaner organization
- Move bnci.py to bnci/legacy.py (17 legacy datasets, 2003-2019)
- Move 7 newer dataset files (2016-2025) into bnci/ subpackage
- Add bnci/__init__.py that exports all 24 BNCI datasets + deprecated aliases
- Update datasets/__init__.py with clean subpackage imports
- Add missing datasets to summary tables (BNCI2015_006, BNCI2015_007,
  BNCI2015_008, BNCI2016_002, BNCI2019_001, BNCI2020_001, BNCI2020_002,
  BNCI2022_001, BNCI2024_001, BNCI2025_001, BNCI2025_002)
- Add "covert" to codespell ignore (valid neuroscience term for covert attention)
- Backward compatibility maintained: both import paths work
  - from moabb.datasets import BNCI2014_001
  - from moabb.datasets.bnci import BNCI2014_001
- Add utils.py with shared helpers:
  - validate_subject(): consistent subject validation
  - ensure_data_orientation(): transpose data if needed
  - convert_units(): uV to V conversion with channel mask
  - standardize_channel_names(): channel name mapping
  - CHANNEL_ALIASES: O9->PO9, O10->PO10, etc.

- Refactor 7 BNCI loaders to use utilities:
  - bnci_2016_002, bnci_2020_001, bnci_2020_002
  - bnci_2022_001, bnci_2024_001, bnci_2025_001, bnci_2025_002

- Add metadata entries for BNCI2022_001, BNCI2025_001, BNCI2025_002

- Add metadata schema and catalog infrastructure

- Update codespell ignore list for researcher name (Buss) and
  equipment model (GES)
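
As an illustration of the shared helpers listed for `utils.py`, here is a minimal sketch of subject validation and channel-name aliasing. The signatures, the `dataset_name` argument, and the error messages are assumptions made for this example; only the helper names and the two alias pairs come from the commit message, and the actual code in `moabb/datasets/bnci/utils.py` may differ.

```python
# Hypothetical sketch of two helpers named in the commit message.
# Signatures and messages are assumptions, not the MOABB implementation.

# Only the alias pairs spelled out in the commit message are listed here.
CHANNEL_ALIASES = {"O9": "PO9", "O10": "PO10"}


def validate_subject(subject, n_subjects, dataset_name="dataset"):
    """Return subject if it is an int in 1..n_subjects, else raise ValueError."""
    if not isinstance(subject, int) or isinstance(subject, bool):
        raise ValueError(f"{dataset_name}: subject must be an int, got {subject!r}")
    if not 1 <= subject <= n_subjects:
        raise ValueError(
            f"{dataset_name}: subject must be in 1..{n_subjects}, got {subject}"
        )
    return subject


def standardize_channel_names(ch_names, aliases=CHANNEL_ALIASES):
    """Return a new list in which aliased channel names are replaced."""
    return [aliases.get(name, name) for name in ch_names]
```

Centralizing the check in one function is what lets the inline validation blocks in the individual loaders be replaced by a single call.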
- Combine bnci_2020_001.py and bnci_2020_002.py into bnci_2020.py
- Combine bnci_2025_001.py and bnci_2025_002.py into bnci_2025.py
- Update __init__.py imports to use merged files
- Remove individual files that were merged
- Replace 16 inline subject validation blocks with validate_subject()
- Replace unit conversion patterns with convert_units() where appropriate
- Keep simple in-place conversions (raw._data[:-3] *= 1e-6) for clarity
- Import utility functions from .utils module
- Maintain all existing functionality and API
- Always create a new array instead of modifying in-place
- Matches original `1e-6 * data` behavior
- Prevents unexpected side effects on input data
- Update docstring to document copy behavior
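
The copy behaviour described in this commit can be sketched as follows. To keep the example dependency-free, plain nested lists stand in for NumPy arrays, and the `scale` and `channel_mask` parameter names are assumptions; the real `convert_units()` signature may differ.

```python
# Hypothetical sketch: scale selected channels and always return new data.
# Nested lists (rows = channels) stand in for a NumPy array.

def convert_units(data, scale=1e-6, channel_mask=None):
    """Return a scaled copy of data; the input is never modified."""
    if channel_mask is None:
        channel_mask = [True] * len(data)
    if len(channel_mask) != len(data):
        raise ValueError("channel_mask needs one entry per channel")
    return [
        # Masked-in rows are scaled; other rows are copied unchanged.
        [value * scale for value in row] if use else list(row)
        for row, use in zip(data, channel_mask)
    ]
```

Because every row of the result is a fresh list, a caller that keeps a reference to the original microvolt data does not see it change after the conversion.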
- Fix BNCI2015_006 subject codes and URL format for download
- Fix trial indices type error in data loading
- Update summary_p300.csv with actual trial counts
- Remove deprecated aliases from datasets __init__.py
- Rename legacy_base.py to base.py and merge with existing base
- Add description consistency fix for MNE Raw concatenation
- BNCI2015_010: Use dynamic channel detection in _convert_run_bbci()
  to handle subjects with varying channel counts (e.g., subject 5 has
  61 channels instead of 63)

- BNCI2015_012: Remove unavailable subjects 3 (VPnx) and 6 (VPmg)
  that return HTTP 404 errors, update to 10 subjects

- BNCI2025_001: Handle both file naming patterns in ZIP archives
  (p001v2-trialblocks.set for subject 1, p002-trialblocks.set for others)

- BNCI2025_002: Update to only 2 available subjects (fe3, fg4) since
  subjects 3-20 return HTTP 404 errors on the BNCI server
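
One way to handle the two BNCI2025_001 file naming patterns is to try each candidate name against the archive listing. This is a sketch under assumptions: the function names and the `data/` directory in the usage below are hypothetical, and only the two file name patterns come from the commit message.

```python
# Hypothetical sketch: pick whichever naming pattern exists in the archive.

def candidate_names(subject):
    """Both file name patterns mentioned in the commit message."""
    return [
        f"p{subject:03d}v2-trialblocks.set",
        f"p{subject:03d}-trialblocks.set",
    ]


def find_member(archive_members, subject):
    """Return the first archive member whose base name matches a pattern."""
    wanted = candidate_names(subject)
    for name in wanted:
        for member in archive_members:
            if member.rsplit("/", 1)[-1] == name:
                return member
    raise FileNotFoundError(f"none of {wanted} found in archive")
```

With a real ZIP file, `archive_members` could be the result of `zipfile.ZipFile(path).namelist()`.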
…ields

Add missing metadata fields for 20+ datasets based on original paper research:
- BNCI2014_002, BNCI2015_001, BNCI2015_004: institution, country, DOI
- BNCI2015_006, 008, 009, 010, 012, 013: institution, country
- BNCI2016_002: institution (TU Berlin / Charité), country
- Beetl2021_A/B: DOI, institution (Imperial College London), country
- DemonsP300: institution (Neiry), country (Russia)
- Dreyer2023A/B/C: data_url (OSF)
- Huebner2017/2018: data_url, repository (Zenodo), institution
- MAMEM1/2/3: country (Greece)
- Ofner2017: data_url (Zenodo)
- Sosulski2019: data_url, repository (FreiDok)
- Zhou2016: institution (Anhui University)
- Liu2024: Complete overhaul with clinical population details

Also add missing licenses and data URLs for AlexMI, Cho2017, Lee2019_MI,
Schirrmeister2017, GrosseWentrup2009, Shin2017A, Weibo2014, Stieger2021.

All metadata verified against original publications and data repositories.
The BNCI2015_010 class had an incorrect DOI (10.1016/j.clinph.2012.08.027)
which pointed to an unrelated ulnar neuropathy paper. Fixed to the correct
DOI (10.1016/j.clinph.2012.12.050) for the actual RSVP BCI paper:

  Acqualagna, L., & Blankertz, B. (2013). Gaze-independent BCI-spelling
  using rapid serial visual presentation (RSVP). Clinical Neurophysiology,
  124(5), 901-908.

Updated the DOI in:
- Docstring reference
- ARTICLE_METADATA
- __init__ method
Add new dataclasses for full EEGDash API compatibility:
- Demographics: Extended subject demographics (ages, age_min, age_max)
- ExternalLinks: URLs and data source links
- Timestamps: Dataset creation/modification dates
- Tags: Classification tags with confidence scores
- TagConfidence: Confidence scores for each tag category
- TagReasoning: Reasoning explanations for tag assignments
- ChannelCount: Channel count distribution entry
- SamplingRateCount: Sampling rate distribution entry

Extend DatasetMetadata with EEGDash fields:
- dataset_id, name, source, recording_modality
- total_files, size_bytes, datatypes
- experimental_modalities, sessions
- contributing_labs, data_processed
- external_links, timestamps, tags
- nchans_counts, sfreq_counts
Rename event codes to Target/NonTarget for consistency with other P300
datasets in MOABB:
- car_brake -> Target (emergency braking onset)
- car_normal -> NonTarget (normal driving)

Additional events (car_hold, car_collision, react_emg) retained for
reference but not used in P300 classification.
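
The renaming above amounts to remapping the keys of an event dictionary while leaving the integer codes alone. The sketch below assumes a plain `dict` of event names to codes; the helper names are hypothetical and any codes passed in are placeholders, not the values used by the dataset.

```python
# Hypothetical sketch of the event renaming. Only the old -> new names
# come from the commit message.

RENAMES = {"car_brake": "Target", "car_normal": "NonTarget"}


def rename_events(event_id, renames=RENAMES):
    """Return a new event dict with renamed keys and unchanged codes."""
    return {renames.get(name, name): code for name, code in event_id.items()}


def p300_events(event_id):
    """Keep only the two classes used for P300 classification."""
    return {
        name: code
        for name, code in event_id.items()
        if name in ("Target", "NonTarget")
    }
```

Extra events such as `car_hold` pass through `rename_events()` untouched and are dropped only when `p300_events()` selects the two classes.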
- Use simple file naming pattern: s{n}w.mat for task files
- Remove complex date-based filename lookup
- Simplify data loading by removing unnecessary fallback patterns
- Fix file naming pattern: use S{n:02d}.mat instead of sub{n:02d}.mat
- Add letter marker mapping for handwriting recognition events
- Improve documentation of data structure
- Load paradigm runs (round01_paradigm, round02_paradigm) correctly
- Update sampling rate: 100 -> 200 Hz
- Standardize event naming: unattended/attended -> NT/T
Year corrections (13 datasets):
- BNCI2014_004: 2008 → 2012
- PhysionetMI: 2004 → 2009
- BI2012-BI2015b: Updated to Zenodo publication years (2018-2019)
- EPFLP300: Fixed DOI and year
- MAMEM2/3: 2016 → 2021
- Thielen2015: 2015 → 2017
- Thielen2021: 2021 → 2018

Author corrections (11 datasets):
- BNCI2024_001, BNCI2015_008, BNCI2015_012, BNCI2015_013, BNCI2020_002:
  Updated authors to match DOI metadata
- CastillosBurstVEP40/100, CastillosCVEP40/100: Corrected to Cabrera Castillos et al.
- Thielen2015: Corrected to Wittevrongel et al.

DOI corrections:
- EPFLP300: 10.1088/1741-2560/8/2/025016 → 10.1016/j.jneumeth.2007.03.005

Added standalone DOI validation script (scripts/doi_validate.py) that validates
metadata against Crossref and DataCite APIs.

Validation result: 83/84 datasets pass (98.8%)
Remaining issue: BNCI2015_004 has invalid DOI (404)
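
For readers who want to reproduce a check of this kind, the sketch below shows a syntactic DOI test and the lookup URLs for the public Crossref and DataCite REST endpoints. It is not the contents of `scripts/doi_validate.py`: the function names and the regular expression are assumptions, and no network request is made here.

```python
# Hypothetical sketch: check DOI syntax and build registry lookup URLs.
import re
from urllib.parse import quote

# Loose pattern: "10.", a registrant code, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_well_formed_doi(doi):
    """Return True if the string looks like a bare DOI (no 'doi:' prefix)."""
    return bool(DOI_PATTERN.match(doi))


def lookup_urls(doi):
    """Return the Crossref and DataCite API URLs for a DOI."""
    encoded = quote(doi, safe="/")
    return {
        "crossref": f"https://api.crossref.org/works/{encoded}",
        "datacite": f"https://api.datacite.org/dois/{encoded}",
    }
```

A validator could request each URL in turn and compare the returned year and author fields with the catalog entry.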
Update n_channels and channel_types in catalog to match actual data
loaded via MOABB. The original catalog only counted EEG channels,
but actual data includes EOG, EMG, and STIM channels.

Datasets fixed:
- PhysionetMI: 64→65 (+1 stim)
- Dreyer2023/A/B/C: 27→32 (+3 eog, +2 emg)
- ErpCore2021_*: 30→33 (+3 eog)
- Thielen2015: 64→67 (+3 stim)
- Thielen2021: 8→11 (+3 stim)

Also adds scripts/metadata_validate.py for validating catalog
metadata against actual data files using MOABB's dataset loading.
- BNCI2014_001: 22→26 channels (22 EEG + 3 EOG + 1 STIM)
- BNCI2014_002: 15→16 channels (15 EEG + 1 STIM)
- BNCI2014_004: 3→7 channels (3 EEG + 3 EOG + 1 STIM)
- BNCI2015_001: 13→14 channels (13 EEG + 1 STIM)
  - Also fixed events: 'feet' → 'left_hand' to match actual data
- Cho2017: 64→69 channels (64 EEG + 4 EMG + 1 STIM)
- AlexMI: 16→17 channels (16 EEG + 1 STIM)
- Weibo2014: 60→65 channels (60 EEG + 2 MISC + 2 EOG + 1 STIM)
  - Also fixed events: 'left_foot'/'right_foot' → 'feet'/'hands'
- MAMEM1: 256→257 channels (256 EEG + 1 STIM)
- MAMEM2: 256→257 channels (256 EEG + 1 STIM)
- MAMEM3: 14→15 channels (14 EEG + 1 STIM)
- BI2012: 16→17 channels (16 EEG + 1 STIM)
- BI2014a: 16→17 channels (16 EEG + 1 STIM)
- BI2014b: 32→33 channels (32 EEG + 1 STIM)
- BI2015b: 32→33 channels (32 EEG + 1 STIM)
- Kalunga2016: 8→9 channels (8 EEG + 1 STIM)
- Nakanishi2015: n_subjects 10→9 (validated against dataset class)
- Wang2016: n_subjects 35→34 (validated against dataset class)
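
The difference between the EEG-only count and the total count can be made explicit by tallying channel types. This is a minimal sketch with a hypothetical `summarize_channels()` helper; the list of type strings is built by hand from the BNCI2014_001 figures above, whereas with MNE it could come from `raw.get_channel_types()`.

```python
# Hypothetical sketch: derive total and EEG-only channel counts.
from collections import Counter


def summarize_channels(channel_types):
    """Return total, EEG-only, and per-type channel counts."""
    counts = Counter(channel_types)
    return {
        "total": sum(counts.values()),
        "eeg_only": counts["eeg"],
        "by_type": dict(counts),
    }


# BNCI2014_001 as listed above: 22 EEG + 3 EOG + 1 STIM.
bnci2014_001_types = ["eeg"] * 22 + ["eog"] * 3 + ["stim"] * 1
summary = summarize_channels(bnci2014_001_types)
```

Keeping both numbers side by side makes it clear which convention a catalog field or summary table follows.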
Wang2016: Add on_missing="ignore" to set_montage() since CB1/CB2
channels are not in standard_1005 montage

Sosulski2019: Update to new freidok download endpoint - the old
fedora URLs now redirect to HTML pages instead of data files

Liu2024: Replace read_custom_montage() with make_dig_montage()
to properly parse electrode positions from TSV file
- Fix channel counts in catalog.py validated against actual raw data
- Add metadata_validate.py script for cross-validation
@bruAristimunha bruAristimunha changed the title from "Refactor BNCI dataset architecture for improved BIDS compliance" to "Enrich BNCI datasets with metadata catalog, multi-score evaluation, and new c-VEP datasets" on Jan 29, 2026
@bruAristimunha bruAristimunha changed the title from "Enrich BNCI datasets with metadata catalog, multi-score evaluation, and new c-VEP datasets" to "Enrich BNCI datasets with modular architecture and metadata catalog" on Jan 29, 2026
Add detailed participant demographics extracted from Ofner et al. (2019):
- age_mean: 49.8, age_std: 17.6
- age_range: corrected to (20, 78) from (20, 69)
- handedness: all 10 subjects right-handed
- location: Graz University of Technology, Austria
@bruAristimunha bruAristimunha changed the title from "Enrich BNCI datasets with modular architecture and metadata catalog" to "Enrich datasets with modular architecture and metadata catalog" on Jan 29, 2026
bruAristimunha and others added 6 commits January 29, 2026 11:36
- Fix n_channels in catalog to use EEG-only counts matching summary CSV:
  - BNCI2014_001: 26 → 22
  - Dreyer2023: 32 → 27

- Fix test expectations to match actual implementations:
  - Wang2016 n_subjects: 35 → 34
  - Nakanishi2015 n_subjects: 10 → 9

- Fix _match_int to handle "varies" values in summary CSVs:
  - Add default parameter for fallback when no integer found
  - Update _get_dataset_parameters to use default=1 for variable trials
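
The fallback for values such as "varies" can be sketched as an integer parser with an optional default. The function below is a hypothetical stand-in named `match_int`, not the actual `_match_int`; in particular, raising `ValueError` when no default is supplied is an assumption.

```python
# Hypothetical sketch of integer extraction with a fallback default.
import re

_MISSING = object()  # sentinel so that None can be a legitimate default


def match_int(text, default=_MISSING):
    """Return the first integer found in text, or default if there is none."""
    match = re.search(r"\d+", str(text))
    if match:
        return int(match.group())
    if default is _MISSING:
        raise ValueError(f"no integer found in {text!r}")
    return default
```

A caller that reads trial counts from a summary CSV can then pass `default=1` for rows where the count is not a single number.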
Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>
Restore original event labels, intervals, sessions, and DOIs for several
BNCI datasets that were incorrectly modified:

- BNCI2015_001: Restore events to right_hand/feet, interval to [0,5]
- BNCI2014_004: Restore sessions to 5, interval to [3,7.5]
- BNCI2014_008: Restore events to Target=2/NonTarget=1
- BNCI2014_009: Restore sessions to 3, events, interval to [0,0.8]
- BNCI2015_003: Restore sessions to 1, events, interval to [0,0.8]

These changes preserve backward compatibility and ensure reproducibility
of existing benchmarks and user pipelines.
…utes

- Add cached `metadata` property to BaseDataset that retrieves structured
  metadata from the centralized catalog
- Remove redundant `_participant_demographics` and `ARTICLE_METADATA`
  class attributes from all BNCI dataset classes (24 datasets)
- Add comprehensive test class TestDatasetMetadata with parametrized tests
  covering all datasets
- Remove channel name standardization (preserve original non-standard names)
- Remove EEGDash-specific fields from schema (TagConfidence, TagReasoning,
  format_version, total_files, size_bytes, datatypes, experimental_modalities)
- Convert country names to ISO 3166-1 alpha-2 codes in catalog
- Remove hardcoded events from catalog (extract dynamically from dataset.event_id)
- Add pycountry dependency for country code validation
- Add validation functions: validate_country_code, validate_metadata_against_dataset,
  get_dataset_description
- Update tests to reflect schema changes
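
The catalog-backed `metadata` property can be sketched with `functools.cached_property`. Everything here is a simplified assumption: the two-field `DatasetMetadata`, the `CATALOG` dict, and the stripped-down `BaseDataset` are illustrative only, and the country check tests the alpha-2 format rather than performing a real lookup, which a library such as pycountry would provide.

```python
# Hypothetical sketch of a cached metadata property backed by a catalog.
import re
from dataclasses import dataclass
from functools import cached_property
from typing import Optional


@dataclass(frozen=True)
class DatasetMetadata:
    institution: Optional[str] = None
    country: Optional[str] = None  # expected as an ISO 3166-1 alpha-2 code

    def __post_init__(self):
        # Format check only: two uppercase letters.
        if self.country is not None and not re.fullmatch(r"[A-Z]{2}", self.country):
            raise ValueError(f"expected an alpha-2 code, got {self.country!r}")


# Single entry based on the DemonsP300 details mentioned earlier in this PR.
CATALOG = {"DemonsP300": DatasetMetadata(institution="Neiry", country="RU")}


class BaseDataset:
    def __init__(self, code):
        self.code = code

    @cached_property
    def metadata(self):
        """Look up the catalog once; later accesses reuse the result."""
        return CATALOG.get(self.code)
```

Reading metadata from one catalog is what makes per-class attributes such as `ARTICLE_METADATA` redundant.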
Use get_dataset_path() instead of get_config("MNE_DATA") to properly
handle the fallback to ~/mne_data when MNE_DATA environment variable
is not set. This fixes TypeError in documentation tutorials.
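
The failure mode behind this fix can be reproduced without MNE: when the configured directory is `None`, joining it with a dataset name raises `TypeError`. The sketch below uses hypothetical helpers (`resolve_data_dir`, `dataset_dir`) that mimic the fallback to `~/mne_data`; it is not the implementation of `get_dataset_path()`.

```python
# Hypothetical sketch of the fallback when MNE_DATA is not configured.
import os


def resolve_data_dir(env=None):
    """Return the configured data directory, or ~/mne_data as a fallback."""
    env = os.environ if env is None else env
    configured = env.get("MNE_DATA")
    if configured:
        return configured
    return os.path.join(os.path.expanduser("~"), "mne_data")


def dataset_dir(name, env=None):
    """Join the resolved base directory with a dataset folder name."""
    return os.path.join(resolve_data_dir(env), name)
```

Without the fallback, a call such as `os.path.join(None, "MNE-bnci-data")` fails with `TypeError`, which matches the error reported for the documentation tutorials.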
@bruAristimunha bruAristimunha merged commit b93ed57 into develop Jan 29, 2026
13 of 14 checks passed
@bruAristimunha bruAristimunha deleted the enrich-bnci branch February 28, 2026 11:50