Fix metadata across 45 datasets based on paper validation #1001

Merged
bruAristimunha merged 5 commits into develop from metadata-validation-fixes on Mar 1, 2026
@bruAristimunha
Collaborator

Summary

  • Validated METADATA blocks for all 45 MOABB datasets against their original publications (~920 corrections total)
  • Fixed systematic hallucination patterns across all datasets:
    • Country codes (41 datasets): full names → ISO 3166-1 alpha-2
    • Preprocessing conflation (38 datasets): removed analysis pipeline steps incorrectly listed as shared data state
    • BCI application inflation (36 datasets): removed fabricated/aspirational applications
    • Feedback type confusion (28 datasets): corrected cues listed as feedback
    • Software misattribution (21 datasets): removed analysis tools (EEGLAB, MATLAB, FieldTrip) listed as acquisition software
    • Acquisition reference (19 datasets): fixed CAR → correct hardware reference (CMS/DRL for BioSemi, named electrodes for others)
    • Auxiliary channel fabrication (16 datasets): removed fabricated GSR/PPG/EMG channels
  • Added BIDS export fallback using publication_year for missing meas_date
  • Added validation tests for DOI format and metadata quality
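The publication_year fallback can be pictured roughly as follows. This is a minimal sketch, not the code in bids_interface.py: the helper name `fallback_meas_date` is hypothetical, and using 1 January at midnight UTC as the placeholder date is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def fallback_meas_date(
    meas_date: Optional[datetime], publication_year: Optional[int]
) -> Optional[datetime]:
    """Return meas_date if set, else a placeholder built from publication_year."""
    if meas_date is not None:
        # A real measurement date always wins over the fallback.
        return meas_date
    if publication_year is None:
        # Nothing to fall back on; leave the date missing.
        return None
    # Assumption: 1 January, midnight UTC, stands in for the unknown date.
    return datetime(publication_year, 1, 1, tzinfo=timezone.utc)
```

A placeholder built this way only carries year-level precision, so downstream code should not treat it as the actual recording date.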

Methodology

Each dataset was validated by an isolated agent with access to only:

  1. The dataset .py file
  2. The original publication PDF
  3. The schema definition (schema.py)
  4. AlexMI as a gold-standard reference

Corrections were classified by confidence (HIGH/MEDIUM/LOW) and only HIGH and MEDIUM confidence corrections were applied. Individual validation reports are available in moabb_tmp_folder/papers/validation_results/.
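The confidence gate described above amounts to a simple filter. A minimal sketch, assuming a hypothetical `Correction` record (the actual report format in the validation results folder is not shown here):

```python
from dataclasses import dataclass
from typing import List

# Only these confidence levels were applied, per the methodology above.
APPLIED_LEVELS = {"HIGH", "MEDIUM"}


@dataclass(frozen=True)
class Correction:
    """Hypothetical record for one proposed metadata correction."""

    dataset: str
    field: str
    old_value: str
    new_value: str
    confidence: str  # "HIGH", "MEDIUM", or "LOW"


def select_applicable(corrections: List[Correction]) -> List[Correction]:
    """Keep HIGH and MEDIUM confidence corrections; drop LOW ones."""
    return [c for c in corrections if c.confidence.upper() in APPLIED_LEVELS]
```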

Files changed

  • 34 dataset .py files (metadata corrections)
  • bids_interface.py (publication_year fallback)
  • 2 new test files (DOI validation, BIDS enrichment tests)
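For readers unfamiliar with DOI format checks, a minimal sketch of what such a validation might look like. The pattern and the name `is_well_formed_doi` are assumptions for illustration, not the contents of the new test files; the check accepts only bare DOIs (no `https://doi.org/` or `doi:` prefix).

```python
import re

# Assumption: a bare DOI is "10.", a 4-9 digit registrant code, "/", and a
# non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def is_well_formed_doi(doi: str) -> bool:
    """Return True if the whole string looks like a bare DOI."""
    return DOI_PATTERN.fullmatch(doi) is not None
```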

Test plan

  • All 229 test_datasets.py tests pass
  • Pre-commit hooks pass (black, ruff, codespell, isort)
  • CI pipeline validation

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

doi="10.1088/1741-2552/abecef",

P1: Preserve Thielen2021 paper DOI linkage in metadata

Updating the metadata DOI here without retaining the previously tracked paper DOI leaves 10.1088/1741-2552/ab4057 untracked, even though it is still referenced in the module comments/doc context. This is why test_docstring_dois_tracked[Thielen2021] now fails on this commit. The metadata should continue to carry that DOI (e.g., via associated_paper_doi) so that DOI auditing remains consistent.
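The kind of audit described in this comment can be sketched as follows. This is an illustrative sketch only: `untracked_dois` and the extraction pattern are assumptions, not the implementation of test_docstring_dois_tracked.

```python
import re
from typing import Iterable, Set

# Assumption: DOIs in free text start with "10.<registrant>/" and run until
# whitespace or a closing quote/bracket.
DOI_IN_TEXT = re.compile(r"10\.\d{4,9}/[^\s\"'<>)\]]+")


def untracked_dois(docstring: str, tracked: Iterable[str]) -> Set[str]:
    """Return DOIs mentioned in the docstring that the metadata does not carry."""
    # Strip sentence punctuation that the pattern may have swallowed.
    found = {m.rstrip(".,;").lower() for m in DOI_IN_TEXT.findall(docstring)}
    return found - {d.lower() for d in tracked}
```

With the two DOIs from this thread, tracking only 10.1088/1741-2552/abecef would leave 10.1088/1741-2552/ab4057 reported as untracked, which matches the failure described above.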

Validated METADATA blocks for all 45 MOABB datasets against their
original publications. Key corrections (~920 total):

- Country codes: full names → ISO 3166-1 alpha-2 (41 datasets)
- Preprocessing conflation: removed analysis pipeline steps listed as
  shared data state (38 datasets)
- BCI application inflation: removed fabricated applications (36 datasets)
- Acquisition reference: fixed CAR → correct hardware reference (19 datasets)
- Software misattribution: removed analysis tools (EEGLAB, MATLAB, etc.)
  listed as acquisition software (21 datasets)
- Auxiliary channels: removed fabricated GSR/PPG/EMG channels (16 datasets)
- Hardware/electrode: removed fabricated materials and manufacturers

Also includes:
- BIDS export: use publication_year as fallback for missing meas_date
- Validation tests for DOI format and metadata quality
@bruAristimunha bruAristimunha force-pushed the metadata-validation-fixes branch from 57aff95 to cea0f19 on February 28, 2026 23:49
bruAristimunha and others added 3 commits March 1, 2026 01:06
The unversioned figshare DOI (10.6084/m9.figshare.13123148) does not
resolve via citeproc+json, causing test_dois_resolve[Stieger2021] and
test_doi_cache_complete to fail. Use the versioned DOI (.v1) which
resolves correctly, and add it to doi_cache.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
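
For context, the lookup that the unversioned DOI fails can be pictured as a DOI content-negotiation request. This is a minimal sketch, not the code in the test suite: `build_csl_request` is a hypothetical helper, the Accept value is the CSL JSON media type used for DOI content negotiation, and the request is only constructed here, not sent.

```python
import urllib.request

# CSL JSON media type for DOI content negotiation (assumed to correspond to
# the "citeproc+json" lookup mentioned in the commit message above).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_csl_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a metadata request for a bare DOI."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": CSL_JSON},
    )


# The versioned figshare DOI that the commit switches to.
request = build_csl_request("10.6084/m9.figshare.13123148.v1")
```

Passing `request` to `urllib.request.urlopen` would perform the actual lookup; that step needs network access and is left out of the sketch.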
…n-fixes

# Conflicts:
#	moabb/tests/doi_cache.json
@bruAristimunha
Collaborator Author

26 dataset classes NOT changed (already correct or not yet validated):

BI2015b, BNCI2014_001, BNCI2014_002, BNCI2014_008, BNCI2014_009, BNCI2015_001, BNCI2015_003, BNCI2015_004, BNCI2015_006, BNCI2015_007, BNCI2015_008, BNCI2015_009, BNCI2015_010,
BNCI2016_002, BNCI2019_001, BNCI2020_001, BNCI2020_002, BNCI2022_001, BNCI2024_001, BNCI2025_001, BNCI2025_002, Dreyer2023A, Dreyer2023B, Dreyer2023C, MAMEM3, PhysionetMI

@bruAristimunha bruAristimunha merged commit 74910fb into develop Mar 1, 2026
14 checks passed
@bruAristimunha bruAristimunha deleted the metadata-validation-fixes branch March 1, 2026 12:32