Brain-Score-based evaluation of language models trained on diverse corpora. This project trains small GPT-2-style transformer language models on a variety of training corpora — natural language (multiple languages), code, synthetic sequences, and genomic data — and evaluates them using Brain-Score benchmarks to measure alignment with human neural and behavioral language processing.
```
xlbs/
├── lm-training/        # Submodule: language model training framework (Hydra + HuggingFace)
│   ├── config/         # Hydra config files for training, finetuning, and tokenizers
│   ├── src/            # Training scripts (train_lm.py, score_model.py, etc.)
│   ├── scripts/        # Data generation and preprocessing utilities
│   └── data/           # Data download scripts (actual data is not tracked)
├── language/           # Submodule: brain-score/language evaluation framework
├── environment.yml     # Conda environment specification
└── README.md
```
```bash
git clone --recurse-submodules https://github.com/CLMBRs/xlbs.git
cd xlbs
```

If you already cloned without `--recurse-submodules`:

```bash
git submodule init
git submodule update
```

Create and activate the conda environment:

```bash
conda env create -f environment.yml
conda activate xlbs
```

GPU support: the default environment installs CPU-only PyTorch. For GPU training, install PyTorch with CUDA separately after creating the environment:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu130
```

Adjust the CUDA wheel tag (`cu118`, `cu121`, `cu124`, `cu130`) to match your system. See PyTorch's Get Started page for details.
Install the brain-score `language` package in editable mode:

```bash
cd language
pip install -e "."
cd ..
```

The training data is not included in this repository. Below are instructions for obtaining each corpus. All download commands should be run from the `lm-training/` directory.
Natural language text from the Wikimedia Wikipedia November 2023 dump, accessed via HuggingFace Datasets.
Source: wikimedia/wikipedia on HuggingFace
No access restrictions.
Download (English):
```bash
cd lm-training/data/wiki
python download.py
```

This saves the dataset to `lm-training/data/wiki/en-wikipedia-local/`.
Other languages: Non-English Wikipedia corpora are downloaded automatically during training (the configs use `load_disk: false`). The supported language codes and their config names are:

| Language | Code | Config |
|---|---|---|
| Arabic | `ar` | `train-brainscore-arabic` |
| Chinese | `zh` | `train-brainscore-chinese` |
| Indonesian | `id` | `train-brainscore-indonesian` |
| Japanese | `ja` | `train-brainscore-japanese` |
| Korean | `ko` | `train-brainscore-korean` |
| Russian | `ru` | `train-brainscore-russian` |
To download manually for offline use:
```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.<LANG_CODE>")
ds.save_to_disk("lm-training/data/wiki/<LANG_CODE>-wikipedia-local")
```

Python source code from the BigCode project's deduplicated Stack dataset.
Source: bigcode/the-stack-dedup on HuggingFace
Access restriction: You must sign the BigCode Access Agreement on HuggingFace before downloading.
Download:
```bash
cd lm-training/data/program
python download.py
```

This saves the Python subset to `lm-training/data/program/python/`.
Post-processing: After downloading, rewrite the Python source code with special tokens (for indentation, comments, strings):
```bash
cd lm-training/src
python rewrite_python.py
```

This creates `lm-training/data/stack_python_rewritten/`.
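The exact token scheme lives in `rewrite_python.py` and is not reproduced here; as a rough illustration of the idea, a rewriter built on Python's standard `tokenize` module, with hypothetical `<INDENT>`, `<DEDENT>`, `<COMMENT>`, `<STR>`, and `<NL>` markers, might look like:

```python
import io
import tokenize

def rewrite(source: str) -> str:
    """Replace indentation, comments, and string literals with special
    tokens, keeping all other tokens verbatim (illustrative sketch only)."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("<INDENT>")
        elif tok.type == tokenize.DEDENT:
            out.append("<DEDENT>")
        elif tok.type == tokenize.COMMENT:
            out.append("<COMMENT>")
        elif tok.type == tokenize.STRING:
            out.append("<STR>")
        elif tok.type in (tokenize.NEWLINE, tokenize.NL):
            out.append("<NL>")
        elif tok.string:  # names, operators, numbers, keywords
            out.append(tok.string)
    return " ".join(out)

print(rewrite('def f():\n    return "hi"  # comment\n'))
```

The real script's token inventory and handling of edge cases (nested strings, multi-line comments) will differ; this only shows the rewrite-with-special-tokens pattern.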
DNA sequences from the NCBI GRCh38 (hg38) human reference genome assembly.
Source: NCBI GRCh38 Assembly
No access restrictions.
Download:
1. Download the FASTA file from NCBI:

   ```bash
   # Using the NCBI datasets CLI (recommended):
   datasets download genome accession GCF_000001405.26 --include genome

   # Or download directly:
   wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.fna.gz
   gunzip GCF_000001405.26_GRCh38_genomic.fna.gz
   ```

2. Clean the FASTA file to extract sequences only:

   ```bash
   cd lm-training
   python scripts/clean.py <path_to_downloaded.fna> > data/human_genome/cleaned.fna
   ```
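The actual cleaning logic is in `scripts/clean.py`; a minimal sketch of the operation it performs (dropping `>`-prefixed FASTA header lines and keeping only sequence lines) could be:

```python
def clean_fasta(lines):
    """Yield sequence lines from FASTA input, skipping '>' header lines
    and blank lines, uppercased for consistency (illustrative sketch)."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line.upper()

fasta = [
    ">NC_000001.11 Homo sapiens chromosome 1",
    "ACGTACGTNN",
    ">NC_000002.12 Homo sapiens chromosome 2",
    "ttggccaa",
]
print("\n".join(clean_fasta(fasta)))
```

The real script may handle normalization differently (e.g. case, `N` runs); check `scripts/clean.py` for the authoritative behavior.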
Synthetically generated LIFO (stack-structured) bracket sequences. No external data needed.
Generation:
```bash
cd lm-training
python scripts/dyck.py --output-dir data/dyck
```

Default parameters: 200M training tokens, 20M validation tokens, vocabulary size 49,999, open probability 0.49. See `python scripts/dyck.py --help` for customization.
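`scripts/dyck.py` is the actual generator; the core sampling idea (open a new bracket with probability `p_open`, otherwise close the most recently opened one, LIFO-style) can be sketched as follows. The bracket token names here are hypothetical:

```python
import random

def generate_dyck(n_tokens, p_open=0.49, n_types=2, seed=0):
    """Generate a balanced LIFO bracket sequence of at least n_tokens tokens.

    At each step: with probability p_open, open a bracket of a random type
    (pushing it on a stack); otherwise close the most recent open bracket.
    Once n_tokens is reached, remaining brackets are closed off."""
    rng = random.Random(seed)
    opens = [f"({i}" for i in range(n_types)]
    closes = [f"){i}" for i in range(n_types)]
    stack, out = [], []
    while len(out) < n_tokens or stack:
        if stack and (len(out) >= n_tokens or rng.random() >= p_open):
            out.append(closes[stack.pop()])
        else:
            t = rng.randrange(n_types)
            stack.append(t)
            out.append(opens[t])
    return out

seq = generate_dyck(20, seed=777)
```

An open probability just under 0.5 (the default 0.49) keeps expected nesting depth finite, so sequences do not drift toward unbounded stacks.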
A token-shuffled version of the English Wikipedia corpus, where all tokens are globally permuted to destroy syntactic structure while preserving unigram statistics.
Prerequisites (must be completed in order):
- Download English Wikipedia (see above)
- Train the English tokenizer (Training Pipeline step 1)
- Tokenize and save the English dataset to disk (Training Pipeline step 2)
This produces the pre-tokenized data at `lm-training/models/data/english_wiki/`, which the scramble script reads.
Generation:
```bash
cd lm-training/scripts
python scramble_en.py seed=777
```

This reads the tokenized English Wikipedia data from `models/data/wiki/` and writes the scrambled version to `models/data/wiki/train_scrambled_global` and `models/data/wiki/eval_scrambled_global`.
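`scramble_en.py` operates on the saved HuggingFace dataset, but the core operation it implements, as described above, is a seeded global permutation that preserves unigram counts and sequence lengths while destroying order. A minimal sketch:

```python
import random

def scramble_global(sequences, seed=777):
    """Globally permute tokens across all sequences: flatten, shuffle with a
    seeded RNG, then re-chunk to the original sequence lengths."""
    rng = random.Random(seed)
    flat = [tok for seq in sequences for tok in seq]
    rng.shuffle(flat)
    out, i = [], 0
    for seq in sequences:
        out.append(flat[i:i + len(seq)])
        i += len(seq)
    return out

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
scrambled = scramble_global(docs)
```

Because the shuffle is global rather than per-document, tokens migrate across document boundaries, so even document-level co-occurrence structure is destroyed.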
An interleaved mixture of English Wikipedia and Project Gutenberg text.
Prerequisite: Internet access for downloading both datasets via HuggingFace.
Generation:
```bash
cd lm-training
python scripts/mix_book.py
```

This creates a `DatasetDict` at `lm-training/data/mixed_dsdict/`.
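The exact mixing strategy used by `mix_book.py` is not shown here; one simple interleaving scheme (round-robin, stopping when the shorter corpus is exhausted) looks like:

```python
def interleave(*streams):
    """Round-robin interleave examples from several corpora, yielding one
    example from each stream per round until any stream runs out."""
    iters = [iter(s) for s in streams]
    while True:
        batch = []
        for it in iters:
            try:
                batch.append(next(it))
            except StopIteration:
                return  # a stream is exhausted; drop the partial round
        yield from batch

wiki = ["wiki sent 1", "wiki sent 2", "wiki sent 3"]
books = ["book sent 1", "book sent 2"]
mixed = list(interleave(wiki, books))
```

HuggingFace also ships `datasets.interleave_datasets` for this pattern, with configurable sampling probabilities and stopping behavior; the script may use that instead.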
All training commands are run from lm-training/src/. The framework uses Hydra for configuration management.
```bash
cd lm-training/src

# English BPE tokenizer
python train_tokenizer.py --config-name train-brainscore-english-tokenizer

# Other languages
python train_tokenizer.py --config-name train-brainscore-arabic-tokenizer
python train_tokenizer.py --config-name train-brainscore-chinese-tokenizer
# etc.

# Python code tokenizer (requires the rewritten Python dataset)
python train_tokenizer.py --config-name train-brainscore-python-tokenizer
```

Tokenizers are saved to `lm-training/models/tokenizer/`.
Additional tokenizer configs:
| Corpus | Config |
|---|---|
| Human genome | `train-brainscore-pure-text-tokenizer` |
| Dyck language | `train-brainscore-raw-tokenizer` |
| Mixed | `train-brainscore-mix-tokenizer` |
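The English tokenizer above is BPE, trained via HuggingFace's tooling. As a refresher on what BPE training actually does (not this project's implementation), here is a minimal merge loop: repeatedly fuse the most frequent adjacent symbol pair in the word-frequency table.

```python
from collections import Counter

def bpe_merges(words, n_merges):
    """Learn BPE merge rules from a word list: each iteration finds the
    most frequent adjacent symbol pair and fuses it into one symbol."""
    vocab = Counter(tuple(w) for w in words)  # words as symbol tuples
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_merges(["low", "low", "lower", "lowest"], n_merges=2)
```

On this toy corpus the first merges fuse `l`+`o` and then `lo`+`w`, since "low" is the most frequent substring; real training runs thousands of merges over the full corpus.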
Some training configs are set to load pre-tokenized datasets from disk (`load_disk: true`) for faster startup on repeated runs. To generate these cached datasets, run training once with `data.save_disk=true`:
```bash
cd lm-training/src

# Tokenize and save English Wikipedia to models/data/english_wiki/
python train_lm.py --config-name train-brainscore-english data.save_disk=true data.load_disk=false seed=777
```

After this, subsequent runs with the default `load_disk: true` will load the cached tokenized data. This step is also required before generating scrambled English, which operates on the tokenized data.
```bash
cd lm-training/src

# English Wikipedia
python train_lm.py --config-name train-brainscore-english seed=777

# Other corpora
python train_lm.py --config-name train-brainscore-arabic seed=777
python train_lm.py --config-name train-brainscore-python seed=777
python train_lm.py --config-name train-brainscore-dyck seed=777
python train_lm.py --config-name train-brainscore-human seed=777
python train_lm.py --config-name train-brainscore-scrambled-english seed=777
python train_lm.py --config-name train-brainscore-english-mix seed=777
```

Models and checkpoints are saved to `lm-training/src/models/`.
Logging: Training logs are reported to Weights & Biases. Set up a wandb account and log in with `wandb login` before training.
To produce the randomly initialized model used as a baseline (no training):
```bash
cd lm-training/src
python random_init.py \
    --tokenizer_json ../models/tokenizer/brainscore-bpe-english.json \
    --out_dir <output_dir>/random_model \
    --seed 777
```

This saves a GPT-2-small model with random weights and the specified tokenizer.
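For reference, the "GPT-2-small" size can be checked by deriving its parameter count from the standard architecture (12 layers, d_model 768, context 1024). Note the stock GPT-2 vocabulary of 50,257 is assumed below; the models in this project use their own trained tokenizers, so the embedding term will differ accordingly.

```python
# Parameter count of a stock GPT-2-small, term by term.
d, L, vocab, ctx = 768, 12, 50257, 1024

embeddings = vocab * d + ctx * d              # token + position embeddings
per_block = (
    2 * d                                     # ln_1 (scale + bias)
    + (d * 3 * d + 3 * d)                     # attention qkv projection
    + (d * d + d)                             # attention output projection
    + 2 * d                                   # ln_2
    + (d * 4 * d + 4 * d)                     # MLP up-projection
    + (4 * d * d + d)                         # MLP down-projection
)
final_ln = 2 * d
total = embeddings + L * per_block + final_ln
print(f"{total:,}")  # 124,439,808 -- the familiar "124M" figure
```

With the output head tied to the token embedding (as in GPT-2), no extra parameters are added for the language-modeling head.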
```bash
cd lm-training/src

# Fine-tune an English model
python train_lm.py --config-name finetune-brainscore-english \
    model.pretrained_model_name_or_path=<path_to_trained_model>/best_model \
    seed=777
```

To score a trained model on the Brain-Score benchmarks:

```bash
cd lm-training/src
python score_model.py <path_to_model>/best_model <model_identifier>
```

This evaluates the model on:
- Behavioral benchmark: Futrell2018 reading time prediction (Pearson r)
- Neural benchmarks: Pereira2018 fMRI activation prediction (243 and 384 sentences)
Scores are reported per-layer to identify the best-performing layer.
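The benchmarks themselves are run by `score_model.py` via the brain-score framework; conceptually, per-layer selection just computes a Pearson r for each layer's predictions and keeps the best layer. A sketch with invented numbers:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-layer predictions vs. human reading times (ms):
reading_times = [310.0, 295.0, 350.0, 402.0, 288.0]
layer_preds = {
    0: [300.0, 301.0, 299.0, 305.0, 302.0],   # early layer: nearly flat
    6: [320.0, 290.0, 345.0, 390.0, 300.0],   # middle layer: tracks well
    11: [280.0, 330.0, 310.0, 295.0, 340.0],  # late layer: anti-correlated
}
scores = {layer: pearson(p, reading_times) for layer, p in layer_preds.items()}
best_layer = max(scores, key=scores.get)
```

The actual benchmarks use cross-validated regression from layer activations rather than raw predictions, but the per-layer "fit, correlate, pick the best" structure is the same.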
```bash
cd lm-training/scripts
python get_score.py --scan-root <path_to_models_dir> --output-dir <output_dir>
```

All Hydra configs are in `lm-training/config/`. Key config files:
| Config | Description |
|---|---|
| `train-brainscore.yaml` | Base training config (GPT-2, 40 epochs, packing) |
| `train-brainscore-english.yaml` | English Wikipedia training |
| `train-brainscore-{lang}.yaml` | Other language Wikipedia training |
| `train-brainscore-python.yaml` | Python code training |
| `train-brainscore-dyck.yaml` | Dyck language training |
| `train-brainscore-human.yaml` | Human genome training |
| `train-brainscore-scrambled-english.yaml` | Scrambled English training |
| `finetune-brainscore-{lang}.yaml` | Fine-tuning configs |
| `train-brainscore-{lang}-tokenizer.yaml` | Tokenizer training configs |
Override any parameter on the command line:

```bash
python train_lm.py --config-name train-brainscore-english seed=42 training_args.num_train_epochs=20
```
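Hydra resolves these dotted overrides against the nested YAML config. A toy sketch of the mapping (not Hydra's actual implementation, which also handles type coercion, interpolation, and config composition):

```python
def apply_overrides(config, overrides):
    """Apply Hydra-style 'a.b.c=value' overrides to a nested dict:
    each dotted path walks/creates nested keys, and the leaf is set."""
    for ov in overrides:
        path, _, raw = ov.partition("=")
        keys = path.split(".")
        node = config
        for k in keys[:-1]:
            node = node.setdefault(k, {})
        try:
            value = int(raw)  # crude scalar coercion for the sketch
        except ValueError:
            value = raw
        node[keys[-1]] = value
    return config

cfg = {"seed": 777, "training_args": {"num_train_epochs": 40}}
apply_overrides(cfg, ["seed=42", "training_args.num_train_epochs=20"])
```

So `training_args.num_train_epochs=20` on the command line edits the `num_train_epochs` key nested under `training_args` in the composed config.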