MindMix is a multimodal foundation model that bridges the gap between unimodal EEG foundation models and task-specific auditory decoders, enabling powerful auditory perception decoding from non-invasive EEG signals.
Decoding complex auditory experiences from non-invasive EEG is a rapidly emerging field with significant promise for advancing both fundamental neuroscience and human-machine interaction technologies. While recent EEG foundation models have yielded powerful neural representations, their effectiveness remains constrained by limited integration with acoustic stimulus information.
MindMix addresses this challenge through:
- Two-Stage Training Strategy: Generalized EEG feature learning followed by neural-acoustic alignment
- Cross-Attention Low-Rank Alignment (CLARA): Novel module for fine-grained cross-modal information integration
- State-of-the-Art Performance: Superior results on auditory attention decoding, emotion recognition, and cross-modal retrieval tasks
| Feature | Description |
|---|---|
| Foundation Model | Pre-trained on 3,000+ hours of EEG data for generalized neural representations |
| Multimodal Fusion | Novel CLARA module for EEG-audio cross-modal alignment |
| Multi-Task Support | Auditory attention decoding, emotion recognition, cross-modal retrieval |
| Flexible Fine-tuning | Three strategies: EEG-only, multimodal real, and multimodal prototype |
| SOTA Results | Substantially surpasses existing baselines across diverse auditory tasks |
```
                         MindMix Architecture

  Stage 1: EEG Foundation Pre-training
  ┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
  │  EEG Input  │────▶│  LaBraM Encoder │────▶│ EEG Features│
  │  (>3000hrs) │     │  (Pre-trained)  │     │  (General)  │
  └─────────────┘     └─────────────────┘     └─────────────┘

  Stage 2: Neural-Acoustic Alignment
  ┌─────────────┐     ┌─────────────┐     ┌─────────────────┐
  │  EEG Embed  │────▶│             │────▶│   Aligned EEG   │
  │             │     │    CLARA    │     │ Representation  │
  └─────────────┘     │   Module    │     └─────────────────┘
  ┌─────────────┐     │  (Low-Rank  │
  │ Audio Embed │────▶│ Cross-Attn) │
  │  (>100hrs)  │     └─────────────┘
  └─────────────┘

  Downstream Tasks
  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
  │Attention Decode│  │Emotion Recogn. │  │  Cross-Modal   │
  │   (KUL/DTU)    │  │   (EEG4EMO)    │  │   Retrieval    │
  └────────────────┘  └────────────────┘  └────────────────┘
```
The Cross-Attention Low-Rank Alignment (CLARA) module is our novel contribution for effective EEG-audio fusion:
- Self-Attention Paths: Independent processing for EEG and audio modalities
- Cross-Attention Fusion: Bidirectional cross-modal attention with low-rank decomposition
- Residual Connections: Maintains modality-specific information while learning shared representations
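The three ideas above can be sketched in a few lines. The following is an illustrative toy, not the repository's `universal_models.py` implementation: it shows one direction of cross-attention (EEG queries attending to audio keys/values), with every projection factored into two low-rank matrices and a residual connection back to the EEG stream. All shapes and the rank value are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_proj(rng, d, r):
    """Factor a d x d projection as (d x r) @ (r x d), rank r << d."""
    return (rng.standard_normal((d, r)) / np.sqrt(d),
            rng.standard_normal((r, d)) / np.sqrt(r))

def clara_fuse(eeg, audio, rank=8, seed=0):
    """Toy CLARA-style fusion: EEG queries attend to audio keys/values
    through low-rank projections; a residual keeps modality-specific info."""
    d = eeg.shape[-1]
    rng = np.random.default_rng(seed)
    wq, wk, wv = (low_rank_proj(rng, d, rank) for _ in range(3))
    q = eeg @ wq[0] @ wq[1]               # (T_eeg, d)
    k = audio @ wk[0] @ wk[1]             # (T_audio, d)
    v = audio @ wv[0] @ wv[1]             # (T_audio, d)
    attn = softmax(q @ k.T / np.sqrt(d))  # (T_eeg, T_audio)
    return eeg + attn @ v                 # residual connection

fused = clara_fuse(np.ones((10, 64)), np.ones((20, 64)))
```

The low-rank factorization is what keeps the added alignment parameters cheap: two `d x r` matrices cost `2dr` parameters instead of `d^2` for a full projection.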
```bash
# Clone the repository
git clone https://github.com/CookieMikeLiu/MindMix.git
cd MindMix

# Install dependencies
pip install torch torchvision torchaudio
pip install transformers timm einops tensorboardX
pip install numpy pandas scikit-learn scipy h5py tqdm
pip install pyhealth
```

Organize your datasets as follows:
```
data/
├── EEG4EMO/          # EEG-Audio emotion recognition
│   ├── train/
│   ├── test/
│   └── labels.csv
├── KUL/              # Auditory attention decoding (KUL dataset)
│   ├── subjects/
│   └── metadata.json
└── DTU/              # Auditory attention decoding (DTU dataset)
    ├── subjects/
    └── metadata.json
```
Train the EEG foundation model with paired EEG-audio data:
```bash
python MindMix_clip_pretrain.py \
    --batch_size 32 \
    --epochs 100 \
    --lr 1e-4 \
    --input_size 400 \
    --model labram_base_patch200_200 \
    --output_dir ./pretrain_fusion_checkpoints
```

Fine-tune on a downstream dataset, e.g. EEG4EMO emotion recognition:

```bash
python universal_eeg_finetune.py \
    --dataset EEG4EMO \
    --strategy multimodal_real \
    --fusion_method clara \
    --batch_size 32 \
    --epochs 50 \
    --lr 1e-5
```

Or auditory attention decoding with contrastive evaluation:

```bash
python universal_eeg_finetune.py \
    --dataset KUL \
    --strategy multimodal_real \
    --fusion_method clara \
    --eval_method contrastive \
    --batch_size 32 \
    --epochs 50
```

To use the pre-trained EEG encoder in your own code:

```python
import torch
from modeling_finetune_2 import labram_base_patch200_200

# Load pre-trained EEG encoder
checkpoint = torch.load('pretrain_fusion_checkpoints/checkpoint-best.pth')
model = labram_base_patch200_200(
    num_classes=256,  # Embedding dimension
    drop_path_rate=0.1
)
model.load_state_dict(checkpoint['model'])
```

The repository is organized as follows:

```
MindMix/
├── MindMix_clip_pretrain.py     # Stage 1: EEG-audio fusion pre-training
├── MindMix_clip_finetune.py     # Stage 2: Cross-modal fine-tuning
├── universal_eeg_finetune.py    # Universal fine-tuning framework
├── universal_models.py          # Model architectures (CLARA, ClipLoss)
├── universal_trainer.py         # Training utilities
├── utils.py                     # General utilities
├── modeling_finetune_2.py       # LaBraM model implementation
├── pretrain_fusion_checkpoints/ # Pre-trained model checkpoints
└── README.md                    # This file
```
| File | Description |
|---|---|
| `MindMix_clip_pretrain.py` | Pre-trains EEG encoder with CLIP-style contrastive learning on EEG-audio pairs |
| `MindMix_clip_finetune.py` | Fine-tunes the model on specific downstream tasks |
| `universal_eeg_finetune.py` | Universal framework supporting multiple datasets and strategies |
| `universal_models.py` | Core model components: CLARA module, ClipLoss, classification heads |
| `utils.py` | Data loading, preprocessing, channel mapping, evaluation metrics |
**Auditory Attention Decoding**
- Datasets: KUL (KU Leuven), DTU (Technical University of Denmark)
- Task: Identify which of multiple speakers a subject is attending to
- Evaluation: Contrastive-learning-based accuracy
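At its simplest, contrastive evaluation for attention decoding means: embed the EEG trial and each candidate speech stream, then pick the candidate most similar to the EEG embedding. A minimal plain-Python sketch with hypothetical embeddings (not the repository's evaluation code):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a)) *
           math.sqrt(sum(y * y for y in b)))
    return num / den

def decode_attended(eeg_emb, speaker_embs):
    """Return the index of the candidate speaker whose embedding is
    closest to the EEG embedding; accuracy over trials is the fraction
    of correctly decoded indices."""
    sims = [cosine(eeg_emb, s) for s in speaker_embs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy example: the EEG embedding points almost exactly at speaker 0.
choice = decode_attended([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0]])  # -> 0
```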
**Emotion Recognition**
- Dataset: EEG4EMO
- Task: Classify emotional valence from EEG during music listening
- Evaluation: Classification accuracy, F1-score
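For reference, binary F1 on valence labels needs no dependencies; a minimal sketch (illustrative, not the repository's metric code, which may use scikit-learn):

```python
def binary_f1(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One true positive, one false positive, one false negative:
score = binary_f1([1, 1, 0, 0], [1, 0, 1, 0])  # -> 0.5
```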
**Cross-Modal Retrieval**
- Task: Retrieve matching audio given EEG (or vice versa)
- Evaluation: Recall@K, Mean Reciprocal Rank (MRR)
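Both metrics can be read directly off the ranked retrieval lists. A dependency-free sketch with hypothetical rankings (not tied to the repository's evaluation code):

```python
def recall_at_k(ranked_ids, true_id, k):
    """1.0 if the matching item appears in the top-k retrieved list."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0

def mean_reciprocal_rank(all_ranked, true_ids):
    """Average of 1/rank of the correct item across queries (1-based)."""
    total = 0.0
    for ranked, true_id in zip(all_ranked, true_ids):
        rank = ranked.index(true_id) + 1
        total += 1.0 / rank
    return total / len(true_ids)

# Two queries: correct item at rank 2 and rank 1 -> MRR = (1/2 + 1) / 2.
mrr = mean_reciprocal_rank([[2, 0, 1], [1, 0, 2]], [0, 1])  # -> 0.75
```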
Baseline using only the EEG encoder for classification:

```bash
python universal_eeg_finetune.py --strategy eeg_only --dataset EEG4EMO
```

Uses real paired EEG-audio data with CLARA fusion:

```bash
python universal_eeg_finetune.py --strategy multimodal_real --fusion_method clara
```

Uses EEG with pseudo-audio prototypes for lightweight training:

```bash
python universal_eeg_finetune.py --strategy multimodal_prototype --fusion_method clara
```

MindMix substantially surpasses existing baselines across multiple auditory decoding tasks:
| Task | Dataset | MindMix | Previous SOTA |
|---|---|---|---|
| Attention Decoding | KUL | XX.X% | XX.X% |
| Attention Decoding | DTU | XX.X% | XX.X% |
| Emotion Recognition | EEG4EMO | XX.X% | XX.X% |
| Cross-Modal Retrieval | - | XX.X% | XX.X% |
Detailed results available in our paper.
| Parameter | Default | Description |
|---|---|---|
| `--model` | `labram_base_patch200_200` | Model architecture |
| `--input_size` | 400 | EEG input size (time samples) |
| `--drop_path` | 0.1 | Stochastic depth rate |
| `--fusion_method` | `clara` | Fusion module: `clara` or `concat` |
| Parameter | Pre-train | Fine-tune | Description |
|---|---|---|---|
| `--batch_size` | 32 | 32 | Batch size |
| `--lr` | 1e-4 | 1e-5 | Learning rate |
| `--epochs` | 100 | 50 | Training epochs |
| `--weight_decay` | 0.05 | 0.01 | Weight decay |
If you find MindMix useful in your research, please cite our paper:
```bibtex
@inproceedings{liu2025mindmix,
  title={MindMix: A Multimodal Foundation Model for Auditory Perception Decoding},
  author={Liu, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}
```

This work builds upon several excellent open-source projects:
- LaBraM - Large Brain Model for EEG
- BEiT-v2 - Transformer architecture
- timm - PyTorch model library
- BIOT - Brain signal processing
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or feedback, please open an issue on GitHub or contact the authors.
⭐ Star this repo if you find it helpful! ⭐