| Model Name | Layers / Units / Heads | Vocab. | Parameters |
|---|---|---|---|
| joelito/legal-xlm-roberta-base | 12 / 768 / 12 | 128K | 123M |
| joelito/legal-xlm-roberta-large | 24 / 1024 / 16 | 128K | 355M |
| joelito/legal-xlm-longformer-base | 12 / 768 / 12 | 128K | 134M |
| joelito/legal-swiss-roberta-base | 12 / 768 / 12 | 128K | 123M |
| joelito/legal-swiss-roberta-large | 24 / 1024 / 16 | 128K | 355M |
| joelito/legal-swiss-longformer-base | 12 / 768 / 12 | 128K | 134M |
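All of the above models can be loaded directly from the Hugging Face Hub. A minimal usage sketch with the `transformers` AutoClasses (the masked example sentence is purely illustrative; any model ID from the table works the same way):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Model ID taken from the table above
model_id = "joelito/legal-xlm-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Score candidate fillers for a masked token in a toy legal sentence
inputs = tokenizer("The court <mask> the appeal.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```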
Monolingual base-size models are available for the following languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.
Monolingual large-size models are available for the following languages: English, German, French, Italian, Spanish, and Portuguese.
For more details, see https://github.com/JoelNiklaus/LEXTREME and https://arxiv.org/abs/2306.02069.
Create a TPU VM instance with the following command:

```bash
gcloud compute tpus tpu-vm create tpu1 --zone=europe-west4-a --accelerator-type=v3-8 --version=tpu-vm-pt-1.12
```

Connect to the instance:

```bash
gcloud compute tpus tpu-vm ssh tpu1 --zone europe-west4-a
```

Set up the environment:

```bash
git clone https://github.com/JoelNiklaus/MultilingualLegalLMPretraining
cd MultilingualLegalLMPretraining
sudo bash setup_tpu_machine.sh
```

Put your Hugging Face token in data/__init__.py and in scripts/train_mlm_tpu.sh.
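Before pasting the token into those files, you can sanity-check that it is valid with the `huggingface_hub` client (this is only a verification step, not the mechanism the training scripts use; the token value below is a placeholder):

```python
from huggingface_hub import login, whoami

# Placeholder token -- replace with your own from https://huggingface.co/settings/tokens
login(token="hf_xxx")

# Prints account information if the token is valid, raises an error otherwise
print(whoami())
```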
Make sure to delete the local output_dir and the Hugging Face model repo (hub_model_id) before training.
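Deleting the Hub repo can be done through the website or, as a convenience, with `huggingface_hub` (the repo ID below is a placeholder for whatever you set as hub_model_id):

```python
from huggingface_hub import delete_repo

# Irreversibly deletes the model repo on the Hub -- double-check the ID first
delete_repo(repo_id="your-username/your-hub-model-id", repo_type="model")
```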
For TPU acceleration use the following script:

```bash
sudo sh train_mlm_tpu.sh
```

For GPU acceleration use the following script:

```bash
sh train_mlm_gpu.sh
```

Evaluate the MLM:

```bash
sh eval_mlm_gpu.sh
```

Mod the teacher model:

```bash
export PYTHONPATH=.
python3 src/mod_teacher_model.py --teacher_model_path xlm-roberta-base --student_model_path data/plms/legal-xlm-base
```
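What "modding" the teacher model amounts to is defined by the repo's script; purely as a hedged sketch of the general warm-starting idea (reuse the teacher's embeddings for tokens shared with the new legal vocabulary, mean-initialise the rest), it could look like the following. The paths and the initialisation strategy are assumptions; src/mod_teacher_model.py may do this differently.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Paths are illustrative; the new 128K legal tokenizer is assumed to live under data/plms/
teacher_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
student_tok = AutoTokenizer.from_pretrained("data/plms/legal-xlm-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

old_emb = model.get_input_embeddings().weight.detach().clone()

# Build the student embedding matrix: mean-initialise every row, then reuse the
# teacher's row for every token string that exists in both vocabularies.
new_emb = old_emb.mean(dim=0, keepdim=True).repeat(len(student_tok), 1)
teacher_vocab = teacher_tok.get_vocab()
shared = 0
for token, new_id in student_tok.get_vocab().items():
    old_id = teacher_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]
        shared += 1
print(f"reused {shared}/{len(student_tok)} teacher embeddings")

# Swap the embedding matrix in (the MLM head stays tied to the input embeddings).
model.resize_token_embeddings(len(student_tok))
with torch.no_grad():
    model.get_input_embeddings().weight.copy_(new_emb)

model.save_pretrained("data/plms/legal-xlm-base")
student_tok.save_pretrained("data/plms/legal-xlm-base")
```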
Longformerize the model:

```bash
export PYTHONPATH=.
python3 src/longformerize_model.py --roberta_model_path data/plms/legal-xlm-base --max_length 4096 --attention_window 128
```
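Longformerization turns the 512-token RoBERTa checkpoint into a 4096-token model. The repo's src/longformerize_model.py performs the actual conversion; the sketch below only illustrates the position-embedding extension part under assumed paths, and omits the swap to sliding-window attention (the `--attention_window 128` part):

```python
import torch
from transformers import AutoModelForMaskedLM

# Illustrative input path, matching --roberta_model_path above
model = AutoModelForMaskedLM.from_pretrained("data/plms/legal-xlm-base")
embeddings = model.base_model.embeddings

max_length = 4096
new_max_pos = max_length + 2  # RoBERTa reserves the first two position ids

old_pos = embeddings.position_embeddings.weight.detach()
new_pos = old_pos.new_empty((new_max_pos, old_pos.size(1)))
new_pos[:2] = old_pos[:2]

# Tile the 512 learned positions until all 4096 slots are filled,
# a common warm start for long-sequence position embeddings
k = 2
step = old_pos.size(0) - 2
while k < new_max_pos:
    n = min(step, new_max_pos - k)
    new_pos[k:k + n] = old_pos[2:2 + n]
    k += n

embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(new_pos, freeze=False)
if hasattr(embeddings, "position_ids"):  # buffer present in most transformers versions
    embeddings.position_ids = torch.arange(new_max_pos).unsqueeze(0)
model.config.max_position_embeddings = new_max_pos

# NOTE: a full conversion also replaces self-attention with Longformer
# sliding-window attention; that step is omitted in this sketch.
model.save_pretrained("data/plms/legal-xlm-longformer-base")
```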
The full pipeline consists of the following steps:

- Train tokenizer (only RoBERTa needed because we convert BERT models to RoBERTa; see the tokenizer sketch after the troubleshooting notes below)

  ```bash
  export PYTHONPATH=. && python3 src/pretraining/train_tokenizer.py | tee train_tokenizer.log
  ```

- Evaluate tokenizer

  ```bash
  export PYTHONPATH=. && python3 src/pretraining/evaluate_tokenizer.py | tee evaluate_tokenizer.log
  ```

- Mod Teacher Model

  ```bash
  export PYTHONPATH=. NAME=legal-xlm-roberta-base SIZE=128k && python3 src/modding/mod_teacher_model.py --teacher_model_path xlm-roberta-base --student_model_path data/plms/${NAME}_${SIZE} --output_dir data/plms/${NAME} | tee mod_teacher_model.log
  ```

- Train MLM (monolingual: 500K steps) (TPUs or GPUs)

  ```bash
  sudo sh scripts/train_mlm_tpu.sh | tee train_mlm_tpu.log
  ```

  or

  ```bash
  sh scripts/train_mlm_gpu.sh | tee train_mlm_gpu.log
  ```

- Evaluate MLM

  ```bash
  sh scripts/eval_mlm_gpu.sh | tee eval_mlm_gpu.log
  ```

- Longformerize MLM

  ```bash
  export PYTHONPATH=. && python3 src/modding/longformerize_model.py --roberta_model_path joelito/legal-xlm-roberta-base --longformer_model_path data/plms/legal-xlm-longformer-base | tee longformerize_model.log
  ```

- Train Longformer MLM (monolingual: 50K steps) (only GPUs!)

  ```bash
  sh scripts/train_mlm_longformer.sh | tee train_mlm_longformer.log
  ```

- Evaluate Longformer MLM

  ```bash
  sh scripts/eval_mlm_gpu.sh | tee eval_mlm_gpu.log
  ```

If you get a `PermissionError: [Errno 13] Permission denied`, set the permissions to 777 (e.g. `chmod -R 777` on the affected directory).
If you get strange git lfs errors, delete the Hugging Face model repo and the output directory.
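As referenced in the "Train tokenizer" step above, here is a rough sketch of the underlying idea: training a large (128K) vocabulary on the legal corpus with the Hugging Face `tokenizers` library. The byte-level BPE algorithm, file paths, and special tokens are assumptions made for illustration; src/pretraining/train_tokenizer.py defines the actual procedure.

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical corpus file; the repo streams its actual legal pretraining data
files = ["data/corpus/legal_corpus.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=128_000,  # matches the 128K vocabulary in the model table
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt to the target directory
tokenizer.save_model("data/plms/legal-xlm-base")
```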