Multilingual Legal Language Models

Legal XLM Models trained on the Multilingual Legal Pile

| Model Name | Layers / Units / Heads | Vocab. | Parameters |
|---|---|---|---|
| joelito/legal-xlm-roberta-base | 12 / 768 / 12 | 128K | 123M |
| joelito/legal-xlm-roberta-large | 24 / 1024 / 16 | 128K | 355M |
| joelito/legal-xlm-longformer-base | 12 / 768 / 12 | 128K | 134M |
| joelito/legal-swiss-roberta-base | 12 / 768 / 12 | 128K | 123M |
| joelito/legal-swiss-roberta-large | 24 / 1024 / 16 | 128K | 355M |
| joelito/legal-swiss-longformer-base | 12 / 768 / 12 | 128K | 134M |

Monolingual base-size models are available for the following languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.

Monolingual large-size models are available for the following languages: English, German, French, Italian, Spanish, and Portuguese.
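
All checkpoints are hosted on the Hugging Face Hub and can be loaded with the `transformers` library. A minimal loading sketch (the fill-mask sentence is purely illustrative):

```python
# Minimal sketch: load one of the released checkpoints from the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("joelito/legal-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("joelito/legal-xlm-roberta-base")

# Illustrative masked-token prediction; <mask> is the RoBERTa-style mask token.
inputs = tokenizer("The court <mask> the appeal.", return_tensors="pt")
logits = model(**inputs).logits
masked_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, masked_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```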

Benchmarking on LEXTREME

See https://github.com/JoelNiklaus/LEXTREME

Benchmarking on LexGLUE

See https://arxiv.org/abs/2306.02069.

Code Base

Train XLM

Create a TPU VM instance with the following script:

gcloud compute tpus tpu-vm create tpu1 --zone=europe-west4-a --accelerator-type=v3-8 --version=tpu-vm-pt-1.12

Connect to the instance:

gcloud compute tpus tpu-vm ssh tpu1 --zone europe-west4-a

Set up the environment:

git clone https://github.com/JoelNiklaus/MultilingualLegalLMPretraining
cd MultilingualLegalLMPretraining
sudo bash setup_tpu_machine.sh

Put your Hugging Face token in `data/__init__.py` and in `scripts/train_mlm_tpu.sh`.
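
Purely for illustration (the variable name below is hypothetical; check `data/__init__.py` for the name the code actually expects), the token could be stored like this:

```python
# data/__init__.py -- hypothetical sketch; the repo may expect a different variable name
HUGGINGFACE_TOKEN = "hf_your_token_here"  # personal access token from your Hugging Face account settings
```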

Make sure that you delete the local `output_dir` and the Hugging Face model repo (`hub_model_id`) before training.

For TPU acceleration use the following script:

sudo sh train_mlm_tpu.sh

For GPU acceleration use the following script:

sh train_mlm_gpu.sh

Evaluate XLM

sh eval_mlm_gpu.sh
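
The eval script measures masked-language-modeling performance on held-out data. As an illustrative, self-contained sketch (the checkpoint name and example sentences are placeholders, and the repo's script may compute different metrics):

```python
# Illustrative sketch (not the repo's eval script): masked-LM loss / pseudo-perplexity on a few sentences.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("joelito/legal-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("joelito/legal-xlm-roberta-base").eval()
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

texts = ["The contract is void.", "Der Vertrag ist nichtig."]  # placeholder evaluation sentences
batch = collator([tokenizer(t, truncation=True) for t in texts])
with torch.no_grad():
    loss = model(**batch).loss
print(f"MLM loss: {loss.item():.3f}  pseudo-perplexity: {torch.exp(loss).item():.1f}")
```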

Modify pre-trained XLM-R

export PYTHONPATH=.
python3 src/mod_teacher_model.py --teacher_model_path xlm-roberta-base --student_model_path data/plms/legal-xlm-base
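
The script above adapts the multilingual teacher (xlm-roberta-base) to the new legal vocabulary. As a rough sketch of the underlying idea only (not the script itself), assuming `data/plms/legal-xlm-base` already contains the newly trained tokenizer:

```python
# Rough sketch of the vocabulary-transplant idea; the actual logic lives in
# src/mod_teacher_model.py and may differ in detail.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

teacher = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
teacher_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
student_tok = AutoTokenizer.from_pretrained("data/plms/legal-xlm-base")  # new legal tokenizer (assumed)

old_emb = teacher.get_input_embeddings().weight.data
# Initialize every new embedding with the mean teacher embedding ...
new_emb = old_emb.mean(dim=0, keepdim=True).repeat(len(student_tok), 1).clone()

# ... then copy the teacher embedding for every token present in both vocabularies.
teacher_vocab = teacher_tok.get_vocab()
for token, new_id in student_tok.get_vocab().items():
    old_id = teacher_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]

teacher.resize_token_embeddings(len(student_tok))
teacher.get_input_embeddings().weight.data.copy_(new_emb)
teacher.save_pretrained("data/plms/legal-xlm-base")
student_tok.save_pretrained("data/plms/legal-xlm-base")
```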

Longformerize pre-trained RoBERTa LM

export PYTHONPATH=.
python3 src/longformerize_model.py --roberta_model_path data/plms/legal-xlm-base --max_length 4096 --attention_window 128
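
A simplified sketch of what "longformerization" does: grow the position embeddings so the encoder accepts 4096 tokens. The real script additionally swaps the self-attention layers for sliding-window (Longformer) attention with the given attention window; that part is omitted here, and the paths and the `.roberta` attribute are assumptions.

```python
# Simplified sketch: extend the RoBERTa position embeddings to cover 4096 tokens.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("data/plms/legal-xlm-base")
max_length = 4096
new_num_pos = max_length + 2  # RoBERTa reserves the first two position ids

old_pos = model.roberta.embeddings.position_embeddings.weight.data  # e.g. (514, hidden)
new_pos = old_pos.new_zeros((new_num_pos, old_pos.size(1)))
new_pos[:2] = old_pos[:2]

# Tile the learned positions until the longer embedding matrix is filled.
k = 2
while k < new_num_pos:
    n = min(old_pos.size(0) - 2, new_num_pos - k)
    new_pos[k:k + n] = old_pos[2:2 + n]
    k += n

model.roberta.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
    new_pos, freeze=False, padding_idx=model.config.pad_token_id
)
model.config.max_position_embeddings = new_num_pos
model.save_pretrained("data/plms/legal-xlm-longformer-base")  # reloading picks up the larger context
```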

Pipeline

  1. Train the tokenizer (only a RoBERTa tokenizer is needed because we convert BERT models to RoBERTa; see the sketch after this list)
    export PYTHONPATH=. && python3 src/pretraining/train_tokenizer.py | tee train_tokenizer.log
  2. Evaluate the tokenizer
    export PYTHONPATH=. && python3 src/pretraining/evaluate_tokenizer.py | tee evaluate_tokenizer.log
  3. Mod the teacher model
    export PYTHONPATH=. NAME=legal-xlm-roberta-base SIZE=128k && python3 src/modding/mod_teacher_model.py --teacher_model_path xlm-roberta-base --student_model_path data/plms/${NAME}_${SIZE} --output_dir data/plms/${NAME} | tee mod_teacher_model.log
  4. Train the MLM (monolingual: 500K steps) on TPUs or GPUs
    sudo sh scripts/train_mlm_tpu.sh | tee train_mlm_tpu.log
    or
    sh scripts/train_mlm_gpu.sh | tee train_mlm_gpu.log
  5. Evaluate the MLM
    sh scripts/eval_mlm_gpu.sh | tee eval_mlm_gpu.log
  6. Longformerize the MLM
    export PYTHONPATH=. && python3 src/modding/longformerize_model.py --roberta_model_path joelito/legal-xlm-roberta-base --longformer_model_path data/plms/legal-xlm-longformer-base | tee longformerize_model.log
  7. Train the Longformer MLM (monolingual: 50K steps) (GPUs only!)
    sh scripts/train_mlm_longformer.sh | tee train_mlm_longformer.log
  8. Evaluate the Longformer MLM
    sh scripts/eval_mlm_gpu.sh | tee eval_mlm_gpu.log
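
For step 1, a bare-bones sketch of training a byte-level BPE tokenizer with a 128K vocabulary using the `tokenizers` library; the corpus file, output path, and special tokens here are illustrative, and `src/pretraining/train_tokenizer.py` may use a different setup:

```python
# Bare-bones sketch (not the repo's script): train a byte-level BPE tokenizer
# with a 128K vocabulary on a local text file.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legal_corpus.txt"],  # placeholder; the real script trains on the pretraining corpus
    vocab_size=128_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

out_dir = "data/plms/legal-xlm-roberta-base_128k"  # follows the ${NAME}_${SIZE} convention above
os.makedirs(out_dir, exist_ok=True)
tokenizer.save_model(out_dir)
```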

Troubleshooting

If you get `PermissionError: [Errno 13] Permission denied`, set the permissions of the affected directory to 777 (e.g. with `chmod -R 777`).

If you get strange Git LFS errors, delete the Hugging Face model repo and the output directory.
