Enrich datasets with modular architecture and metadata catalog by bruAristimunha · Pull Request #868 · NeuroTechX/moabb

bruAristimunha · 2026-01-14T15:44:13Z

Summary

This PR delivers a major enrichment of MOABB's BNCI datasets with improved organization, comprehensive metadata, and critical bug fixes.

BNCI Dataset Refactoring

- Reorganize monolithic bnci.py into modular bnci/ subpackage split by year
- Add shared BNCIBaseDataset class and utility functions
- Preserve backward compatibility with legacy imports

Comprehensive Metadata System

- Add metadata/ module with EEGDash-compatible dataclass schema
- Create catalog with verified metadata for all MOABB datasets:
  - Institution names and countries
  - DOI identifiers
  - Sampling rates and channel configurations
  - Participant demographics

Dataset Loading Fixes

- Wang2016: Add on_missing="ignore" for non-standard montage channels
- Sosulski2019: Update to new freidok download endpoint
- Liu2024: Fix electrode position parsing with make_dig_montage()

Bug Fixes & Validation

- Fix channel counts validated against actual data for multiple datasets
- Correct DOI metadata inconsistencies
- Standardize BNCI2016_002 events for P300 paradigm compatibility
- Fix BNCI2024_001 and BNCI2022_001 file naming and data loading

Test Plan

- BNCI datasets load correctly with new subpackage structure
- Legacy imports work (backward compatibility)
- Metadata catalog returns correct information for all datasets
- Wang2016, Sosulski2019, Liu2024 load without errors
- All tests pass (pytest moabb/tests/)
- Documentation builds successfully

This commit refactors the BNCI dataset implementation to improve code quality and ensure proper BIDS conversion:

**Code Quality Improvements:**
- Remove generic post-processing loop from _get_single_subject_data()
- Create _finalize_raw() helper function for consistent metadata handling
- Incorporate finalization logic into each dataset reader function
- Remove unused montage variables from conversion functions

**BIDS Compliance:**
- Ensure montage is set before BIDS cache conversion
- Add dataset-specific years to _dataset_years class attribute
- Guarantee proper measurement dates for all BNCI datasets
- Ensure subject IDs are set for BIDS compliance

**Configuration:**
- Add "ALS" (Amyotrophic Lateral Sclerosis) to codespell ignore list
- Add clarifying comment for ALS medical abbreviation

**Documentation:**
- Update What's New with all enhancements, bug fixes, and code improvements

Verified that montage preservation works correctly when using the BIDS cache mechanism: all channel positions match exactly (distance = 0.0).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7c535a69b

moabb/datasets/bnci.py

Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>

…o enrich-bnci

Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>

- Create moabb/datasets/bnci/ subpackage for cleaner organization
- Move bnci.py to bnci/legacy.py (17 legacy datasets, 2003-2019)
- Move 7 newer dataset files (2016-2025) into bnci/ subpackage
- Add bnci/__init__.py that exports all 24 BNCI datasets + deprecated aliases
- Update datasets/__init__.py with clean subpackage imports
- Add missing datasets to summary tables (BNCI2015_006, BNCI2015_007, BNCI2015_008, BNCI2016_002, BNCI2019_001, BNCI2020_001, BNCI2020_002, BNCI2022_001, BNCI2024_001, BNCI2025_001, BNCI2025_002)
- Add "covert" to codespell ignore (valid neuroscience term for covert attention)
- Backward compatibility maintained: both import paths work
  - from moabb.datasets import BNCI2014_001
  - from moabb.datasets.bnci import BNCI2014_001

…o enrich-bnci

- Add utils.py with shared helpers:
  - validate_subject(): consistent subject validation
  - ensure_data_orientation(): transpose data if needed
  - convert_units(): uV to V conversion with channel mask
  - standardize_channel_names(): channel name mapping
  - CHANNEL_ALIASES: O9->PO9, O10->PO10, etc.
- Refactor 7 BNCI loaders to use utilities:
  - bnci_2016_002, bnci_2020_001, bnci_2020_002
  - bnci_2022_001, bnci_2024_001, bnci_2025_001, bnci_2025_002
- Add metadata entries for BNCI2022_001, BNCI2025_001, BNCI2025_002
- Add metadata schema and catalog infrastructure
- Update codespell ignore list for researcher name (Buss) and equipment model (GES)
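As a rough illustration of the shared helpers listed above, the sketch below shows what subject validation and alias-based channel renaming could look like. The function names and the two aliases come from the commit message; the signatures, error messages, and return values are assumptions, not the code merged in this PR.

```python
# Hypothetical sketch: names mirror the commit message, signatures are assumed.
CHANNEL_ALIASES = {"O9": "PO9", "O10": "PO10"}  # only the aliases named in the commit


def validate_subject(subject, n_subjects, dataset_name):
    """Return subject if it is an int in 1..n_subjects, else raise ValueError."""
    if isinstance(subject, bool) or not isinstance(subject, int):
        raise ValueError(f"{dataset_name}: subject must be an int, got {subject!r}")
    if not 1 <= subject <= n_subjects:
        raise ValueError(
            f"{dataset_name}: subject must be in 1..{n_subjects}, got {subject}"
        )
    return subject


def standardize_channel_names(ch_names, aliases=CHANNEL_ALIASES):
    """Return a new list with aliased names replaced; other names are unchanged."""
    return [aliases.get(name, name) for name in ch_names]
```

Centralizing the range check this way is what lets each loader replace its inline validation block with a single call.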

- Combine bnci_2020_001.py and bnci_2020_002.py into bnci_2020.py
- Combine bnci_2025_001.py and bnci_2025_002.py into bnci_2025.py
- Update __init__.py imports to use merged files
- Remove individual files that were merged

- Replace 16 inline subject validation blocks with validate_subject()
- Replace unit conversion patterns with convert_units() where appropriate
- Keep simple in-place conversions (raw._data[:-3] *= 1e-6) for clarity
- Import utility functions from .utils module
- Maintain all existing functionality and API

- Always create a new array instead of modifying in-place
- Matches original `1e-6 * data` behavior
- Prevents unexpected side effects on input data
- Update docstring to document copy behavior
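The copy-on-convert behavior described above can be sketched as follows. This is a minimal illustration assuming a (channels × samples) NumPy array and an optional boolean channel mask; the parameter names `factor` and `channel_mask` are hypothetical and may not match the helper in the PR.

```python
import numpy as np


def convert_units(data, factor=1e-6, channel_mask=None):
    """Scale data (e.g. uV to V) and return a new array; the input is never modified."""
    out = np.array(data, dtype=float)  # np.array copies by default
    if channel_mask is None:
        out *= factor
    else:
        mask = np.asarray(channel_mask, dtype=bool)
        out[mask] *= factor  # scale only the selected channel rows
    return out
```

Because the function works on its own copy, a caller can keep the unscaled array around without it changing underneath them, which is the side effect the commit sets out to prevent.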

- Fix BNCI2015_006 subject codes and URL format for download
- Fix trial indices type error in data loading
- Update summary_p300.csv with actual trial counts
- Remove deprecated aliases from datasets __init__.py
- Rename legacy_base.py to base.py and merge with existing base
- Add description consistency fix for MNE Raw concatenation

- BNCI2015_010: Use dynamic channel detection in _convert_run_bbci() to handle subjects with varying channel counts (e.g., subject 5 has 61 channels instead of 63)
- BNCI2015_012: Remove unavailable subjects 3 (VPnx) and 6 (VPmg) that return HTTP 404 errors, update to 10 subjects
- BNCI2025_001: Handle both file naming patterns in ZIP archives (p001v2-trialblocks.set for subject 1, p002-trialblocks.set for others)
- BNCI2025_002: Update to only 2 available subjects (fe3, fg4) since subjects 3-20 return HTTP 404 errors on the BNCI server

…ields

Add missing metadata fields for 20+ datasets based on original paper research:
- BNCI2014_002, BNCI2015_001, BNCI2015_004: institution, country, DOI
- BNCI2015_006, 008, 009, 010, 012, 013: institution, country
- BNCI2016_002: institution (TU Berlin / Charité), country
- Beetl2021_A/B: DOI, institution (Imperial College London), country
- DemonsP300: institution (Neiry), country (Russia)
- Dreyer2023A/B/C: data_url (OSF)
- Huebner2017/2018: data_url, repository (Zenodo), institution
- MAMEM1/2/3: country (Greece)
- Ofner2017: data_url (Zenodo)
- Sosulski2019: data_url, repository (FreiDok)
- Zhou2016: institution (Anhui University)
- Liu2024: Complete overhaul with clinical population details

Also add missing licenses and data URLs for AlexMI, Cho2017, Lee2019_MI, Schirrmeister2017, GrosseWentrup2009, Shin2017A, Weibo2014, Stieger2021.

All metadata verified against original publications and data repositories.

The BNCI2015_010 class had an incorrect DOI (10.1016/j.clinph.2012.08.027), which pointed to an unrelated ulnar neuropathy paper. Fixed to the correct DOI (10.1016/j.clinph.2012.12.050) for the actual RSVP BCI paper:

Acqualagna, L., & Blankertz, B. (2013). Gaze-independent BCI-spelling using rapid serial visual presentation (RSVP). Clinical Neurophysiology, 124(5), 901-908.

Updated the DOI in:
- Docstring reference
- ARTICLE_METADATA
- __init__ method

Add new dataclasses for full EEGDash API compatibility:
- Demographics: Extended subject demographics (ages, age_min, age_max)
- ExternalLinks: URLs and data source links
- Timestamps: Dataset creation/modification dates
- Tags: Classification tags with confidence scores
- TagConfidence: Confidence scores for each tag category
- TagReasoning: Reasoning explanations for tag assignments
- ChannelCount: Channel count distribution entry
- SamplingRateCount: Sampling rate distribution entry

Extend DatasetMetadata with EEGDash fields:
- dataset_id, name, source, recording_modality
- total_files, size_bytes, datatypes
- experimental_modalities, sessions
- contributing_labs, data_processed
- external_links, timestamps, tags
- nchans_counts, sfreq_counts

Rename event codes to Target/NonTarget for consistency with other P300 datasets in MOABB:
- car_brake -> Target (emergency braking onset)
- car_normal -> NonTarget (normal driving)

Additional events (car_hold, car_collision, react_emg) retained for reference but not used in P300 classification.

- Use simple file naming pattern: s{n}w.mat for task files
- Remove complex date-based filename lookup
- Simplify data loading by removing unnecessary fallback patterns

- Fix file naming pattern: use S{n:02d}.mat instead of sub{n:02d}.mat
- Add letter marker mapping for handwriting recognition events
- Improve documentation of data structure
- Load paradigm runs (round01_paradigm, round02_paradigm) correctly

- Update sampling rate: 100 -> 200 Hz
- Standardize event naming: unattended/attended -> NT/T

Year corrections (13 datasets):
- BNCI2014_004: 2008 → 2012
- PhysionetMI: 2004 → 2009
- BI2012-BI2015b: Updated to Zenodo publication years (2018-2019)
- EPFLP300: Fixed DOI and year
- MAMEM2/3: 2016 → 2021
- Thielen2015: 2015 → 2017
- Thielen2021: 2021 → 2018

Author corrections (11 datasets):
- BNCI2024_001, BNCI2015_008, BNCI2015_012, BNCI2015_013, BNCI2020_002: Updated authors to match DOI metadata
- CastillosBurstVEP40/100, CastillosCVEP40/100: Corrected to Cabrera Castillos et al.
- Thielen2015: Corrected to Wittevrongel et al.

DOI corrections:
- EPFLP300: 10.1088/1741-2560/8/2/025016 → 10.1016/j.jneumeth.2007.03.005

Added standalone DOI validation script (scripts/doi_validate.py) that validates metadata against Crossref and DataCite APIs.

Validation result: 83/84 datasets pass (98.8%)

Remaining issue: BNCI2015_004 has invalid DOI (404)
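The kind of registry lookup a DOI validation script performs can be sketched with the standard library alone. This is not the contents of scripts/doi_validate.py; the function names are hypothetical, and the sketch only checks that a DOI is known to Crossref or DataCite, whereas the script described above also compares years and authors.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"
DATACITE_API = "https://api.datacite.org/dois/"
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")  # loose syntactic check only


def looks_like_doi(doi):
    """Cheap offline check that a string has the general shape of a DOI."""
    return bool(_DOI_SHAPE.match(doi))


def lookup_urls(doi):
    """Return the Crossref and DataCite REST URLs for a DOI."""
    quoted = urllib.parse.quote(doi, safe="/")
    return [CROSSREF_API + quoted, DATACITE_API + quoted]


def doi_resolves(doi, timeout=10.0):
    """Return True if either registry answers HTTP 200 (requires network access)."""
    if not looks_like_doi(doi):
        return False
    for url in lookup_urls(doi):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except urllib.error.URLError:  # also covers HTTPError such as 404
            continue
    return False
```

Under these assumptions, a DOI that neither registry recognizes would come back `False`, which is how a case like the BNCI2015_004 404 would surface.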

Update n_channels and channel_types in catalog to match actual data loaded via MOABB. The original catalog only counted EEG channels, but actual data includes EOG, EMG, and STIM channels.

Datasets fixed:
- PhysionetMI: 64→65 (+1 stim)
- Dreyer2023/A/B/C: 27→32 (+3 eog, +2 emg)
- ErpCore2021_*: 30→33 (+3 eog)
- Thielen2015: 64→67 (+3 stim)
- Thielen2021: 8→11 (+3 stim)

Also adds scripts/metadata_validate.py for validating catalog metadata against actual data files using MOABB's dataset loading.

- BNCI2014_001: 22→26 channels (22 EEG + 3 EOG + 1 STIM)
- BNCI2014_002: 15→16 channels (15 EEG + 1 STIM)
- BNCI2014_004: 3→7 channels (3 EEG + 3 EOG + 1 STIM)
- BNCI2015_001: 13→14 channels (13 EEG + 1 STIM)
  - Also fixed events: 'feet' → 'left_hand' to match actual data
- Cho2017: 64→69 channels (64 EEG + 4 EMG + 1 STIM)

- AlexMI: 16→17 channels (16 EEG + 1 STIM)
- Weibo2014: 60→65 channels (60 EEG + 2 MISC + 2 EOG + 1 STIM)
  - Also fixed events: 'left_foot'/'right_foot' → 'feet'/'hands'
- MAMEM1: 256→257 channels (256 EEG + 1 STIM)
- MAMEM2: 256→257 channels (256 EEG + 1 STIM)
- MAMEM3: 14→15 channels (14 EEG + 1 STIM)
- BI2012: 16→17 channels (16 EEG + 1 STIM)

- BI2014a: 16→17 channels (16 EEG + 1 STIM)
- BI2014b: 32→33 channels (32 EEG + 1 STIM)
- BI2015b: 32→33 channels (32 EEG + 1 STIM)
- Kalunga2016: 8→9 channels (8 EEG + 1 STIM)
- Nakanishi2015: n_subjects 10→9 (validated against dataset class)
- Wang2016: n_subjects 35→34 (validated against dataset class)

- Wang2016: Add on_missing="ignore" to set_montage() since CB1/CB2 channels are not in standard_1005 montage
- Sosulski2019: Update to new freidok download endpoint; the old fedora URLs now redirect to HTML pages instead of data files
- Liu2024: Replace read_custom_montage() with make_dig_montage() to properly parse electrode positions from TSV file
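For the Liu2024 fix, the step that feeds make_dig_montage() is turning the electrode TSV into a name-to-position mapping. The sketch below shows that parsing step only, using the standard library. The column names (`name`, `x`, `y`, `z`) follow the usual BIDS electrodes.tsv layout and are an assumption about the Liu2024 file, as is the function name.

```python
import csv
import io


def read_electrode_positions(tsv_text):
    """Parse name/x/y/z columns into {name: (x, y, z)}, skipping rows without numbers."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    positions = {}
    for row in reader:
        try:
            positions[row["name"]] = (float(row["x"]), float(row["y"]), float(row["z"]))
        except (TypeError, ValueError):
            continue  # e.g. "n/a" or empty coordinates
    return positions


# With MNE installed, the mapping could then be used roughly like this
# (coordinates are expected in metres; coord_frame is an assumption here):
#   montage = mne.channels.make_dig_montage(ch_pos=positions, coord_frame="head")
#   raw.set_montage(montage, on_missing="ignore")
```

Passing `on_missing="ignore"` in the last step is the same mechanism the Wang2016 fix relies on for channels such as CB1/CB2 that have no position in the montage.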

- Fix channel counts in catalog.py validated against actual raw data - Add metadata_validate.py script for cross-validation

Add detailed participant demographics extracted from Ofner et al. (2019):
- age_mean: 49.8, age_std: 17.6
- age_range: corrected to (20, 78) from (20, 69)
- handedness: all 10 subjects right-handed
- location: Graz University of Technology, Austria

- Fix n_channels in catalog to use EEG-only counts matching summary CSV:
  - BNCI2014_001: 26 → 22
  - Dreyer2023: 32 → 27
- Fix test expectations to match actual implementations:
  - Wang2016 n_subjects: 35 → 34
  - Nakanishi2015 n_subjects: 10 → 9
- Fix _match_int to handle "varies" values in summary CSVs:
  - Add default parameter for fallback when no integer found
  - Update _get_dataset_parameters to use default=1 for variable trials
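The `_match_int` fallback described above can be illustrated with a small stand-in. This is a hypothetical reimplementation named `match_int`; raising `ValueError` when no default is supplied is an assumption about the intended behavior, not something the commit message states.

```python
import re


def match_int(text, default=None):
    """Return the first integer found in text, or default when there is none."""
    found = re.search(r"\d+", str(text))
    if found:
        return int(found.group())
    if default is None:
        raise ValueError(f"no integer found in {text!r}")
    return default
```

With `default=1`, a summary CSV cell such as "varies" yields 1 instead of an error, matching the fallback the commit describes for variable trial counts.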

Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>

Restore original event labels, intervals, sessions, and DOIs for several BNCI datasets that were incorrectly modified:
- BNCI2015_001: Restore events to right_hand/feet, interval to [0,5]
- BNCI2014_004: Restore sessions to 5, interval to [3,7.5]
- BNCI2014_008: Restore events to Target=2/NonTarget=1
- BNCI2014_009: Restore sessions to 3, events, interval to [0,0.8]
- BNCI2015_003: Restore sessions to 1, events, interval to [0,0.8]

These changes preserve backward compatibility and ensure reproducibility of existing benchmarks and user pipelines.

…utes

- Add cached `metadata` property to BaseDataset that retrieves structured metadata from the centralized catalog
- Remove redundant `_participant_demographics` and `ARTICLE_METADATA` class attributes from all BNCI dataset classes (24 datasets)
- Add comprehensive test class TestDatasetMetadata with parametrized tests covering all datasets
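A cached catalog-backed property of the kind described above could look like the following sketch. The class bodies, the `CATALOG` dictionary, the three metadata fields, and the lookup by class name are all illustrative assumptions; only the `metadata` property name, the BaseDataset and DatasetMetadata class names, and the BNCI2015_010 DOI come from this PR.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import Optional


@dataclass(frozen=True)
class DatasetMetadata:
    """Reduced stand-in for the catalog schema (fields chosen for illustration)."""
    institution: Optional[str] = None
    country: Optional[str] = None
    doi: Optional[str] = None


# Hypothetical centralized catalog, keyed by dataset class name.
CATALOG = {
    "BNCI2015_010": DatasetMetadata(doi="10.1016/j.clinph.2012.12.050"),
}


class BaseDataset:
    @cached_property
    def metadata(self):
        """Look the dataset up in the catalog once; later accesses reuse the result."""
        return CATALOG.get(type(self).__name__)


class BNCI2015_010(BaseDataset):
    pass


class UncataloguedDataset(BaseDataset):
    pass
```

Keeping the data in one catalog and exposing it through a property is what makes per-class attributes such as `ARTICLE_METADATA` redundant.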

- Remove channel name standardization (preserve original non-standard names)
- Remove EEGDash-specific fields from schema (TagConfidence, TagReasoning, format_version, total_files, size_bytes, datatypes, experimental_modalities)
- Convert country names to ISO 3166-1 alpha-2 codes in catalog
- Remove hardcoded events from catalog (extract dynamically from dataset.event_id)
- Add pycountry dependency for country code validation
- Add validation functions: validate_country_code, validate_metadata_against_dataset, get_dataset_description
- Update tests to reflect schema changes

Use get_dataset_path() instead of get_config("MNE_DATA") to properly handle the fallback to ~/mne_data when MNE_DATA environment variable is not set. This fixes TypeError in documentation tutorials.
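The fallback behavior this fix relies on can be sketched in a few lines. This is not MOABB's get_dataset_path(); `resolve_data_dir` and `dataset_dir` are hypothetical names, and the sketch assumes the TypeError came from building a path on top of a `None` configuration value.

```python
from pathlib import Path


def resolve_data_dir(configured=None):
    """Return the configured directory, or ~/mne_data when nothing is configured."""
    if configured:
        return Path(configured).expanduser()
    return Path.home() / "mne_data"


def dataset_dir(sign, configured=None):
    """Build a per-dataset directory without ever joining onto None."""
    return resolve_data_dir(configured) / sign
```

For example, `dataset_dir("Zhou2016")` resolves under `~/mne_data` when no directory is configured, instead of failing on a missing value.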

bruAristimunha added 2 commits January 14, 2026 16:40

Fixing the whats new file

e7c535a

chatgpt-codex-connector bot reviewed Jan 14, 2026


bruAristimunha and others added 27 commits January 19, 2026 00:50

Merge branch 'develop' into enrich-bnci

78943f8

Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>

Merge branch 'develop' into enrich-bnci

002d0ce

bnci

bd056cc

Merge branch 'develop' into enrich-bnci

72d0e1c

Merge branch 'enrich-bnci' of https://github.com/neurotechx/moabb int…

6ab0bbb

…o enrich-bnci

Merge branch 'develop' into enrich-bnci

106cf56

Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>

Merge branch 'enrich-bnci' of https://github.com/neurotechx/moabb int…

7a8d702

…o enrich-bnci

Merge BNCI dataset files by year

0e1b3ba

Fix convert_units to always return a copy

e914ed4

Split legacy BNCI datasets by year

d7d3ea7

Remove legacy BNCI shim

168d2c0

Preserve BNCI legacy import and expand whats new

83a429a

Drop BNCI legacy import alias

b0fddd2

Add BNCI and metadata tests

1ac5ae6

Simplify BNCI2022_001 file naming and loading

4c88253

Fix BNCI2015_006 metadata in P300 summary

ed8022c

bruAristimunha added 6 commits January 28, 2026 13:42

Update catalog metadata and add validation script

2fc5d17

bruAristimunha mentioned this pull request Jan 29, 2026

Typo in braininvaders.py: Replace shutil.copy_tree with shutil.copytree #953

Closed

bruAristimunha and others added 2 commits January 29, 2026 02:21

Merge branch 'develop' into enrich-bnci

6dd3039

Remove validation reports and scripts from PR

8bcdbef

bruAristimunha changed the title ~~Refactor BNCI dataset architecture for improved BIDS compliance~~ Enrich BNCI datasets with metadata catalog, multi-score evaluation, and new c-VEP datasets Jan 29, 2026

bruAristimunha changed the title ~~Enrich BNCI datasets with metadata catalog, multi-score evaluation, and new c-VEP datasets~~ Enrich BNCI datasets with modular architecture and metadata catalog Jan 29, 2026

bruAristimunha changed the title ~~Enrich BNCI datasets with modular architecture and metadata catalog~~ Enrich datasets with modular architecture and metadata catalog Jan 29, 2026

bruAristimunha and others added 6 commits January 29, 2026 11:36

Merge branch 'develop' into enrich-bnci

4420b28

Signed-off-by: Bru <a.bruno@aluno.ufabc.edu.br>

Fix Zhou2016 path handling when MNE_DATA is not configured

58a4c0d

bruAristimunha merged commit b93ed57 into develop Jan 29, 2026
13 of 14 checks passed

bruAristimunha deleted the enrich-bnci branch February 28, 2026 11:50

bruAristimunha merged 44 commits into develop from enrich-bnci
