Cross-Lingual Brain Score (XLBS)

Brain-Score-based evaluation of language models trained on diverse corpora. This project trains small GPT-2-style transformer language models on natural language (multiple languages), code, synthetic sequences, and genomic data, then evaluates them with Brain-Score benchmarks to measure alignment with human neural and behavioral measures of language processing.

Project Structure

xlbs/
├── lm-training/          # Submodule: language model training framework (Hydra + HuggingFace)
│   ├── config/           # Hydra config files for training, finetuning, and tokenizers
│   ├── src/              # Training scripts (train_lm.py, score_model.py, etc.)
│   ├── scripts/          # Data generation and preprocessing utilities
│   └── data/             # Data download scripts (actual data is not tracked)
├── language/             # Submodule: brain-score/language evaluation framework
├── environment.yml       # Conda environment specification
└── README.md

Installation

1. Clone with submodules

git clone --recurse-submodules https://github.com/CLMBRs/xlbs.git
cd xlbs

If you already cloned without --recurse-submodules:

git submodule init
git submodule update

2. Create the conda environment

conda env create -f environment.yml
conda activate xlbs

GPU support: The default environment installs CPU PyTorch. For GPU training, install PyTorch with CUDA separately after creating the environment:

pip install torch --index-url https://download.pytorch.org/whl/cu130

Adjust the CUDA version (cu118, cu121, cu124, cu130) to match your system. See PyTorch Get Started for details.
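After reinstalling, a quick check confirms the CUDA build is active and a GPU is visible:

```python
import torch

# Confirm the CUDA build of PyTorch is installed and a GPU is visible.
print(torch.__version__)           # the wheel tag (e.g. "+cu121") appears here
print(torch.cuda.is_available())   # True if a usable GPU was found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```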

3. Install the brain-score language package

cd language
pip install -e .
cd ..

Obtaining Training Corpora

The training data is not included in this repository. Below are instructions for obtaining each corpus. All download commands should be run from the lm-training/ directory.

Wikipedia (English and multilingual)

Natural language text from the Wikimedia Wikipedia November 2023 dump, accessed via HuggingFace Datasets.

Source: wikimedia/wikipedia on HuggingFace

No access restrictions.

Download (English):

cd lm-training/data/wiki
python download.py

This saves the dataset to lm-training/data/wiki/en-wikipedia-local/.

Other languages: Non-English Wikipedia corpora are downloaded automatically during training (the configs use load_disk: false). The supported language codes and their config names are:

Language     Code  Config
Arabic       ar    train-brainscore-arabic
Chinese      zh    train-brainscore-chinese
Indonesian   id    train-brainscore-indonesian
Japanese     ja    train-brainscore-japanese
Korean       ko    train-brainscore-korean
Russian      ru    train-brainscore-russian

To download manually for offline use:

from datasets import load_dataset
ds = load_dataset("wikimedia/wikipedia", "20231101.<LANG_CODE>")
ds.save_to_disk("lm-training/data/wiki/<LANG_CODE>-wikipedia-local")

Python Code (The Stack)

Python source code from the BigCode project's deduplicated Stack dataset.

Source: bigcode/the-stack-dedup on HuggingFace

Access restriction: You must sign the BigCode Access Agreement on HuggingFace before downloading.

Download:

cd lm-training/data/program
python download.py

This saves the Python subset to lm-training/data/program/python/.

Post-processing: After downloading, rewrite the Python source code with special tokens (for indentation, comments, strings):

cd lm-training/src
python rewrite_python.py

This creates lm-training/data/stack_python_rewritten/.
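The exact token scheme lives in rewrite_python.py. As an illustration only (the token name and tab width below are assumptions, not the script's actual choices), rewriting indentation into explicit special tokens might look like:

```python
# Hypothetical sketch of the kind of rewriting rewrite_python.py performs:
# leading whitespace becomes discrete indentation tokens so the tokenizer
# sees code structure as explicit symbols. Token names are illustrative.
INDENT_TOKEN = "<ind>"

def rewrite_line(line: str, tab_width: int = 4) -> str:
    stripped = line.lstrip(" ")
    depth = (len(line) - len(stripped)) // tab_width
    return INDENT_TOKEN * depth + stripped

src = "def f(x):\n    return x + 1\n"
print("\n".join(rewrite_line(l) for l in src.splitlines()))
```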

Human Genome (NCBI GRCh38)

DNA sequences from the NCBI GRCh38 (hg38) human reference genome assembly.

Source: NCBI GRCh38 Assembly

No access restrictions.

Download:

  1. Download the FASTA file from NCBI:

    # Using NCBI datasets CLI (recommended):
    datasets download genome accession GCF_000001405.26 --include genome
    
    # Or download directly:
    wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.fna.gz
    gunzip GCF_000001405.26_GRCh38_genomic.fna.gz
  2. Clean the FASTA file to extract sequences only:

    cd lm-training
    python scripts/clean.py <path_to_downloaded.fna> > data/human_genome/cleaned.fna
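scripts/clean.py handles this step; conceptually, cleaning a FASTA file amounts to dropping the `>` header lines and keeping only the raw sequence characters. A sketch of that idea (not the actual script):

```python
def clean_fasta(lines):
    """Drop FASTA header lines ('>' prefix) and yield raw sequence lines."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line.upper()

fasta = [">chr1 Homo sapiens chromosome 1", "acgtACGT", "NNNNacgt"]
print("".join(clean_fasta(fasta)))  # ACGTACGTNNNNACGT
```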

Dyck Language (Synthetic)

Synthetically generated LIFO (stack-structured) bracket sequences. No external data needed.

Generation:

cd lm-training
python scripts/dyck.py --output-dir data/dyck

Default parameters: 200M training tokens, 20M validation tokens, vocab size 49,999, open probability 0.49. See python scripts/dyck.py --help for customization.
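For intuition, a minimal Dyck generator with an open-bracket probability can be sketched as follows (the real scripts/dyck.py surely differs in details such as length control and vocabulary layout):

```python
import random

def dyck_sequence(n_tokens, n_bracket_types, p_open, seed=0):
    """Generate a balanced LIFO bracket sequence of roughly n_tokens symbols.

    With probability p_open (and while budget remains) a new bracket type
    is opened; otherwise the most recent open bracket is closed.
    """
    rng = random.Random(seed)
    stack, out = [], []
    while len(out) < n_tokens or stack:
        # Once the budget is spent, only close so the sequence stays balanced.
        must_close = len(out) + len(stack) >= n_tokens
        if stack and (must_close or rng.random() >= p_open):
            out.append(f")_{stack.pop()}")
        else:
            b = rng.randrange(n_bracket_types)
            stack.append(b)
            out.append(f"(_{b}")
    return out

print(" ".join(dyck_sequence(8, 2, 0.49)))
```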

Scrambled English

A token-shuffled version of the English Wikipedia corpus, where all tokens are globally permuted to destroy syntactic structure while preserving unigram statistics.

Prerequisites (must be completed in order):

  1. Download English Wikipedia (see above)
  2. Train the English tokenizer (Training Pipeline step 1)
  3. Tokenize and save the English dataset to disk (Training Pipeline step 2)

This produces the pre-tokenized data at lm-training/models/data/english_wiki/ which the scramble script reads.

Generation:

cd lm-training/scripts
python scramble_en.py seed=777

This reads the tokenized English Wikipedia data from models/data/wiki/ and writes the scrambled version to models/data/wiki/train_scrambled_global and models/data/wiki/eval_scrambled_global.
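Conceptually, the global scramble is just a seeded permutation of the full token stream: unigram counts are preserved exactly while all word order (and hence syntax) is destroyed. A sketch, not the actual scramble_en.py:

```python
import random

def scramble_global(token_ids, seed=777):
    """Globally permute a flat token sequence with a fixed seed."""
    rng = random.Random(seed)
    shuffled = list(token_ids)
    rng.shuffle(shuffled)
    return shuffled

tokens = [5, 5, 9, 2, 7]
print(sorted(scramble_global(tokens)) == sorted(tokens))  # True: same unigrams
```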

Mixed Dataset

An interleaved mixture of English Wikipedia and Project Gutenberg text.

Prerequisite: Internet access for downloading both datasets via HuggingFace.

Generation:

cd lm-training
python scripts/mix_book.py

This creates a DatasetDict at lm-training/data/mixed_dsdict/.
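mix_book.py builds the mixture with HuggingFace Datasets; the interleaving idea itself is simple alternation between the two corpora, sketched here in plain Python (the actual mixing ratio and strategy are defined by the script):

```python
from itertools import chain, zip_longest

def interleave(a, b):
    """Alternate examples from two corpora; the shorter one simply runs out."""
    merged = chain.from_iterable(zip_longest(a, b))
    return [x for x in merged if x is not None]

print(interleave(["wiki_1", "wiki_2"], ["book_1"]))
# ['wiki_1', 'book_1', 'wiki_2']
```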

Training Pipeline

All training commands are run from lm-training/src/. The framework uses Hydra for configuration management.

1. Train a tokenizer

cd lm-training/src

# English BPE tokenizer
python train_tokenizer.py --config-name train-brainscore-english-tokenizer

# Other languages
python train_tokenizer.py --config-name train-brainscore-arabic-tokenizer
python train_tokenizer.py --config-name train-brainscore-chinese-tokenizer
# etc.

# Python code tokenizer (requires rewritten Python dataset)
python train_tokenizer.py --config-name train-brainscore-python-tokenizer

Tokenizers are saved to lm-training/models/tokenizer/.

Additional tokenizer configs:

Corpus         Config
Human genome   train-brainscore-pure-text-tokenizer
Dyck language  train-brainscore-raw-tokenizer
Mixed          train-brainscore-mix-tokenizer

2. Tokenize and cache datasets to disk (optional but recommended)

Some training configs are set to load pre-tokenized datasets from disk (load_disk: true) for faster startup on repeated runs. To generate these cached datasets, run training once with data.save_disk=true:

cd lm-training/src

# Tokenize and save English Wikipedia to models/data/english_wiki/
python train_lm.py --config-name train-brainscore-english data.save_disk=true data.load_disk=false seed=777

After this, subsequent runs with the default load_disk: true will load from the cached tokenized data. This step is required before generating scrambled English (which operates on the tokenized data).
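The cached format is the product of tokenizing and packing: documents are tokenized, concatenated, and split into fixed-length blocks. A minimal sketch of packing (the block size here is illustrative; the configs define the real value):

```python
def pack_sequences(token_ids, block_size=1024):
    """Concatenate tokenized documents and split into fixed-length blocks
    ('packing'), the cached format the training configs reload from disk."""
    flat = [t for doc in token_ids for t in doc]
    n_full = (len(flat) // block_size) * block_size  # drop the ragged tail
    return [flat[i:i + block_size] for i in range(0, n_full, block_size)]

print(pack_sequences([[1, 2, 3], [4, 5, 6, 7]], block_size=2))
# [[1, 2], [3, 4], [5, 6]]
```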

3. Train a language model

cd lm-training/src

# English Wikipedia
python train_lm.py --config-name train-brainscore-english seed=777

# Other corpora
python train_lm.py --config-name train-brainscore-arabic seed=777
python train_lm.py --config-name train-brainscore-python seed=777
python train_lm.py --config-name train-brainscore-dyck seed=777
python train_lm.py --config-name train-brainscore-human seed=777
python train_lm.py --config-name train-brainscore-scrambled-english seed=777
python train_lm.py --config-name train-brainscore-english-mix seed=777

Models and checkpoints are saved to lm-training/src/models/.

Logging: Training metrics are reported to Weights & Biases. Create a wandb account and run wandb login before training.

4. Create a randomly initialized baseline

To produce the randomly initialized model used as a baseline (no training):

cd lm-training/src

python random_init.py \
    --tokenizer_json ../models/tokenizer/brainscore-bpe-english.json \
    --out_dir <output_dir>/random_model \
    --seed 777

This saves a GPT-2-small model with random weights and the specified tokenizer.
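In terms of the HuggingFace API, producing such a baseline amounts to instantiating a GPT-2-small configuration without loading pretrained weights. A hedged sketch (random_init.py additionally attaches the specified tokenizer; the output path is illustrative):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(777)                 # reproducible random weights
config = GPT2Config()                  # defaults match GPT-2 small (~124M params)
model = GPT2LMHeadModel(config)        # fresh weights, no pretraining
model.save_pretrained("random_model")  # writes config + weight files
```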

5. Fine-tune a model

cd lm-training/src

# Fine-tune an English model
python train_lm.py --config-name finetune-brainscore-english \
    model.pretrained_model_name_or_path=<path_to_trained_model>/best_model \
    seed=777

6. Score a model with Brain-Score

cd lm-training/src

python score_model.py <path_to_model>/best_model <model_identifier>

This evaluates the model on:

  • Behavioral benchmark: Futrell2018 reading time prediction (Pearson r)
  • Neural benchmarks: Pereira2018 fMRI activation prediction (243 and 384 sentences)

Scores are reported per-layer to identify the best-performing layer.
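Per-layer scoring relies on extracting hidden states from every transformer block, which with a HuggingFace model is a single forward pass with output_hidden_states=True. A sketch where a tiny random config stands in for a trained checkpoint:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model as a stand-in for a trained checkpoint.
config = GPT2Config(n_layer=4, n_head=2, n_embd=64,
                    vocab_size=100, n_positions=32)
model = GPT2LMHeadModel(config)

ids = torch.randint(0, 100, (1, 10))
out = model(ids, output_hidden_states=True)
# hidden_states holds the embedding output plus one tensor per block;
# each layer's activations can be regressed against neural data separately.
print(len(out.hidden_states))  # 5: embeddings + 4 transformer blocks
```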

7. Collect scores into CSV

cd lm-training/scripts
python get_score.py --scan-root <path_to_models_dir> --output-dir <output_dir>

Configuration

All Hydra configs are in lm-training/config/. Key config files:

Config                                   Description
train-brainscore.yaml                    Base training config (GPT-2, 40 epochs, packing)
train-brainscore-english.yaml            English Wikipedia training
train-brainscore-{lang}.yaml             Other-language Wikipedia training
train-brainscore-python.yaml             Python code training
train-brainscore-dyck.yaml               Dyck language training
train-brainscore-human.yaml              Human genome training
train-brainscore-scrambled-english.yaml  Scrambled English training
finetune-brainscore-{lang}.yaml          Fine-tuning configs
train-brainscore-{lang}-tokenizer.yaml   Tokenizer training configs

Override any parameter on the command line:

python train_lm.py --config-name train-brainscore-english seed=42 training_args.num_train_epochs=20
