Cross-Lingual Brain Score (XLBS)

Brain-Score-based evaluation of language models trained on diverse corpora. This project trains small GPT-2-style transformer language models on natural language (multiple languages), code, synthetic sequences, and genomic data, then evaluates them with Brain-Score benchmarks to measure alignment with human neural and behavioral measures of language processing.

Project Structure

xlbs/
├── lm-training/          # Submodule: language model training framework (Hydra + HuggingFace)
│   ├── config/           # Hydra config files for training, finetuning, and tokenizers
│   ├── src/              # Training scripts (train_lm.py, score_model.py, etc.)
│   ├── scripts/          # Data generation and preprocessing utilities
│   └── data/             # Data download scripts (actual data is not tracked)
├── language/             # Submodule: brain-score/language evaluation framework
├── environment.yml       # Conda environment specification
└── README.md

Installation

1. Clone with submodules

git clone --recurse-submodules https://github.com/CLMBRs/xlbs.git
cd xlbs

If you already cloned without --recurse-submodules:

git submodule init
git submodule update

2. Create the conda environment

conda env create -f environment.yml
conda activate xlbs

GPU support: The default environment installs CPU PyTorch. For GPU training, install PyTorch with CUDA separately after creating the environment:

pip install torch --index-url https://download.pytorch.org/whl/cu130

Adjust the CUDA version (cu118, cu121, cu124, cu130) to match your system. See PyTorch Get Started for details.
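After reinstalling, a quick check confirms the CUDA build is active and a GPU is visible:

```python
import torch

# Confirm the CUDA build of PyTorch is installed and a GPU is visible.
print(torch.__version__)           # the wheel tag (e.g. "+cu121") appears here
print(torch.cuda.is_available())   # True if a usable GPU was found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```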

3. Install the brain-score language package

cd language
pip install -e .
cd ..

Obtaining Training Corpora

The training data is not included in this repository. Below are instructions for obtaining each corpus. All download commands should be run from the lm-training/ directory.

Wikipedia (English and multilingual)

Natural language text from the Wikimedia Wikipedia November 2023 dump, accessed via HuggingFace Datasets.

Source: wikimedia/wikipedia on HuggingFace

No access restrictions.

Download (English):

cd lm-training/data/wiki
python download.py

This saves the dataset to lm-training/data/wiki/en-wikipedia-local/.

Other languages: Non-English Wikipedia corpora are downloaded automatically during training (the configs use load_disk: false). The supported language codes and their config names are:

Language     Code  Config
Arabic       ar    train-brainscore-arabic
Chinese      zh    train-brainscore-chinese
Indonesian   id    train-brainscore-indonesian
Japanese     ja    train-brainscore-japanese
Korean       ko    train-brainscore-korean
Russian      ru    train-brainscore-russian

To download manually for offline use:

from datasets import load_dataset
ds = load_dataset("wikimedia/wikipedia", "20231101.<LANG_CODE>")
ds.save_to_disk("lm-training/data/wiki/<LANG_CODE>-wikipedia-local")

Python Code (The Stack)

Python source code from the BigCode project's deduplicated Stack dataset.

Source: bigcode/the-stack-dedup on HuggingFace

Access restriction: You must sign the BigCode Access Agreement on HuggingFace before downloading.

Download:

cd lm-training/data/program
python download.py

This saves the Python subset to lm-training/data/program/python/.

Post-processing: After downloading, rewrite the Python source code with special tokens (for indentation, comments, strings):

cd lm-training/src
python rewrite_python.py

This creates lm-training/data/stack_python_rewritten/.
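The exact token scheme lives in rewrite_python.py. As an illustration only (the token name and tab width below are assumptions, not the script's actual choices), rewriting indentation into explicit special tokens might look like:

```python
# Hypothetical sketch of the kind of rewriting rewrite_python.py performs:
# leading whitespace becomes discrete indentation tokens so the tokenizer
# sees code structure as explicit symbols. Token names are illustrative.
INDENT_TOKEN = "<ind>"

def rewrite_line(line: str, tab_width: int = 4) -> str:
    stripped = line.lstrip(" ")
    depth = (len(line) - len(stripped)) // tab_width
    return INDENT_TOKEN * depth + stripped

src = "def f(x):\n    return x + 1\n"
print("\n".join(rewrite_line(l) for l in src.splitlines()))
```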

Human Genome (NCBI GRCh38)

DNA sequences from the NCBI GRCh38 (hg38) human reference genome assembly.

Source: NCBI GRCh38 Assembly

No access restrictions.

Download:

  1. Download the FASTA file from NCBI:

    # Using NCBI datasets CLI (recommended):
    datasets download genome accession GCF_000001405.26 --include genome
    
    # Or download directly:
    wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.fna.gz
    gunzip GCF_000001405.26_GRCh38_genomic.fna.gz
  2. Clean the FASTA file to extract sequences only:

    cd lm-training
    python scripts/clean.py <path_to_downloaded.fna> > data/human_genome/cleaned.fna
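scripts/clean.py handles this step; conceptually, cleaning a FASTA file amounts to dropping the `>` header lines and keeping only the raw sequence characters. A sketch of that idea (not the actual script):

```python
def clean_fasta(lines):
    """Drop FASTA header lines ('>' prefix) and yield raw sequence lines."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line.upper()

fasta = [">chr1 Homo sapiens chromosome 1", "acgtACGT", "NNNNacgt"]
print("".join(clean_fasta(fasta)))  # ACGTACGTNNNNACGT
```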

Dyck Language (Synthetic)

Synthetically generated LIFO (stack-structured) bracket sequences. No external data needed.

Generation:

cd lm-training
python scripts/dyck.py --output-dir data/dyck

Default parameters: 200M training tokens, 20M validation tokens, vocab size 49,999, open probability 0.49. See python scripts/dyck.py --help for customization.
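For intuition, a minimal Dyck generator with an open-bracket probability can be sketched as follows (the real scripts/dyck.py surely differs in details such as length control and vocabulary layout):

```python
import random

def dyck_sequence(n_tokens, n_bracket_types, p_open, seed=0):
    """Generate a balanced LIFO bracket sequence of roughly n_tokens symbols.

    With probability p_open (and while budget remains) a new bracket type
    is opened; otherwise the most recent open bracket is closed.
    """
    rng = random.Random(seed)
    stack, out = [], []
    while len(out) < n_tokens or stack:
        # Once the budget is spent, only close so the sequence stays balanced.
        must_close = len(out) + len(stack) >= n_tokens
        if stack and (must_close or rng.random() >= p_open):
            out.append(f")_{stack.pop()}")
        else:
            b = rng.randrange(n_bracket_types)
            stack.append(b)
            out.append(f"(_{b}")
    return out

print(" ".join(dyck_sequence(8, 2, 0.49)))
```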

Scrambled English

A token-shuffled version of the English Wikipedia corpus, where all tokens are globally permuted to destroy syntactic structure while preserving unigram statistics.

Prerequisites (must be completed in order):

  1. Download English Wikipedia (see above)
  2. Train the English tokenizer (Training Pipeline step 1)
  3. Tokenize and save the English dataset to disk (Training Pipeline step 2)

This produces the pre-tokenized data at lm-training/models/data/english_wiki/ which the scramble script reads.

Generation:

cd lm-training/scripts
python scramble_en.py seed=777

This reads the tokenized English Wikipedia data from models/data/wiki/ and writes the scrambled version to models/data/wiki/train_scrambled_global and models/data/wiki/eval_scrambled_global.
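Conceptually, the global scramble is just a seeded permutation of the full token stream: unigram counts are preserved exactly while all word order (and hence syntax) is destroyed. A sketch, not the actual scramble_en.py:

```python
import random

def scramble_global(token_ids, seed=777):
    """Globally permute a flat token sequence with a fixed seed."""
    rng = random.Random(seed)
    shuffled = list(token_ids)
    rng.shuffle(shuffled)
    return shuffled

tokens = [5, 5, 9, 2, 7]
print(sorted(scramble_global(tokens)) == sorted(tokens))  # True: same unigrams
```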

Mixed Dataset

An interleaved mixture of English Wikipedia and Project Gutenberg text.

Prerequisite: Internet access for downloading both datasets via HuggingFace.

Generation:

cd lm-training
python scripts/mix_book.py

This creates a DatasetDict at lm-training/data/mixed_dsdict/.
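mix_book.py builds the mixture with HuggingFace Datasets; the interleaving idea itself is simple alternation between the two corpora, sketched here in plain Python (the actual mixing ratio and strategy are defined by the script):

```python
from itertools import chain, zip_longest

def interleave(a, b):
    """Alternate examples from two corpora; the shorter one simply runs out."""
    merged = chain.from_iterable(zip_longest(a, b))
    return [x for x in merged if x is not None]

print(interleave(["wiki_1", "wiki_2"], ["book_1"]))
# ['wiki_1', 'book_1', 'wiki_2']
```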

Training Pipeline

All training commands are run from lm-training/src/. The framework uses Hydra for configuration management.

1. Train a tokenizer

cd lm-training/src

# English BPE tokenizer
python train_tokenizer.py --config-name train-brainscore-english-tokenizer

# Other languages
python train_tokenizer.py --config-name train-brainscore-arabic-tokenizer
python train_tokenizer.py --config-name train-brainscore-chinese-tokenizer
# etc.

# Python code tokenizer (requires rewritten Python dataset)
python train_tokenizer.py --config-name train-brainscore-python-tokenizer

Tokenizers are saved to lm-training/models/tokenizer/.

Additional tokenizer configs:

Corpus         Config
Human genome   train-brainscore-pure-text-tokenizer
Dyck language  train-brainscore-raw-tokenizer
Mixed          train-brainscore-mix-tokenizer

2. Tokenize and cache datasets to disk (optional but recommended)

Some training configs are set to load pre-tokenized datasets from disk (load_disk: true) for faster startup on repeated runs. To generate these cached datasets, run training once with data.save_disk=true:

cd lm-training/src

# Tokenize and save English Wikipedia to models/data/english_wiki/
python train_lm.py --config-name train-brainscore-english data.save_disk=true data.load_disk=false seed=777

After this, subsequent runs with the default load_disk: true will load from the cached tokenized data. This step is required before generating scrambled English (which operates on the tokenized data).
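The cached format is the product of tokenizing and packing: documents are tokenized, concatenated, and split into fixed-length blocks. A minimal sketch of packing (the block size here is illustrative; the configs define the real value):

```python
def pack_sequences(token_ids, block_size=1024):
    """Concatenate tokenized documents and split into fixed-length blocks
    ('packing'), the cached format the training configs reload from disk."""
    flat = [t for doc in token_ids for t in doc]
    n_full = (len(flat) // block_size) * block_size  # drop the ragged tail
    return [flat[i:i + block_size] for i in range(0, n_full, block_size)]

print(pack_sequences([[1, 2, 3], [4, 5, 6, 7]], block_size=2))
# [[1, 2], [3, 4], [5, 6]]
```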

3. Train a language model

cd lm-training/src

# English Wikipedia
python train_lm.py --config-name train-brainscore-english seed=777

# Other corpora
python train_lm.py --config-name train-brainscore-arabic seed=777
python train_lm.py --config-name train-brainscore-python seed=777
python train_lm.py --config-name train-brainscore-dyck seed=777
python train_lm.py --config-name train-brainscore-human seed=777
python train_lm.py --config-name train-brainscore-scrambled-english seed=777
python train_lm.py --config-name train-brainscore-english-mix seed=777

Models and checkpoints are saved to lm-training/src/models/.

Logging: Training metrics are reported to Weights & Biases. Create a wandb account and run wandb login before training.

4. Create a randomly initialized baseline

To produce the randomly initialized model used as a baseline (no training):

cd lm-training/src

python random_init.py \
    --tokenizer_json ../models/tokenizer/brainscore-bpe-english.json \
    --out_dir <output_dir>/random_model \
    --seed 777

This saves a GPT-2-small model with random weights and the specified tokenizer.
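In terms of the HuggingFace API, producing such a baseline amounts to instantiating a GPT-2-small configuration without loading pretrained weights. A hedged sketch (random_init.py additionally attaches the specified tokenizer; the output path is illustrative):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(777)                 # reproducible random weights
config = GPT2Config()                  # defaults match GPT-2 small (~124M params)
model = GPT2LMHeadModel(config)        # fresh weights, no pretraining
model.save_pretrained("random_model")  # writes config + weight files
```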

5. Fine-tune a model

cd lm-training/src

# Fine-tune an English model
python train_lm.py --config-name finetune-brainscore-english \
    model.pretrained_model_name_or_path=<path_to_trained_model>/best_model \
    seed=777

6. Score a model with Brain-Score

cd lm-training/src

python score_model.py <path_to_model>/best_model <model_identifier>

This evaluates the model on:

  • Behavioral benchmark: Futrell2018 reading time prediction (Pearson r)
  • Neural benchmarks: Pereira2018 fMRI activation prediction (243 and 384 sentences)

Scores are reported per-layer to identify the best-performing layer.
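Per-layer scoring relies on extracting hidden states from every transformer block, which with a HuggingFace model is a single forward pass with output_hidden_states=True. A sketch where a tiny random config stands in for a trained checkpoint:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model as a stand-in for a trained checkpoint.
config = GPT2Config(n_layer=4, n_head=2, n_embd=64,
                    vocab_size=100, n_positions=32)
model = GPT2LMHeadModel(config)

ids = torch.randint(0, 100, (1, 10))
out = model(ids, output_hidden_states=True)
# hidden_states holds the embedding output plus one tensor per block;
# each layer's activations can be regressed against neural data separately.
print(len(out.hidden_states))  # 5: embeddings + 4 transformer blocks
```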

7. Collect scores into CSV

cd lm-training/scripts
python get_score.py --scan-root <path_to_models_dir> --output-dir <output_dir>

Configuration

All Hydra configs are in lm-training/config/. Key config files:

Config                                   Description
train-brainscore.yaml                    Base training config (GPT-2, 40 epochs, packing)
train-brainscore-english.yaml            English Wikipedia training
train-brainscore-{lang}.yaml             Other-language Wikipedia training
train-brainscore-python.yaml             Python code training
train-brainscore-dyck.yaml               Dyck language training
train-brainscore-human.yaml              Human genome training
train-brainscore-scrambled-english.yaml  Scrambled English training
finetune-brainscore-{lang}.yaml          Fine-tuning configs
train-brainscore-{lang}-tokenizer.yaml   Tokenizer training configs

Override any parameter on the command line:

python train_lm.py --config-name train-brainscore-english seed=42 training_args.num_train_epochs=20
