This repository contains the code for mGLM, a multilingual variant of GLM: a general language model pretrained with an autoregressive blank-infilling objective.
You may want to check out our interactive demo based on mGLM, which generates a brief Chinese/English summary of an article written in any commonly used language.
The backbone structure of this model is based on GLM: General Language Model Pretraining with Autoregressive Blank Infilling (Du et al., ACL 2022)
Code is mainly based on THUDM/GLM. Part of the code is also based on Megatron-LM and PET.
The following table compares the sizes of several multilingual language models.
Model | Parameters |
---|---|
mBERT | 180M |
XLM-R | 550M |
MT5-Large | 1.2B |
GLM-Large | 1B |
You can download our pretrained checkpoint and specify the checkpoint path in the scripts below. The multilingual tokenizer and the configuration file of our model are already included in this repo.
Results on multilingual understanding and question answering benchmarks:

Model | XNLI | PAWS-X | XQuAD | MLQA | TyDiQA |
---|---|---|---|---|---|
GLM-Large | 75.6 | 85.2 | 83.6/71.9 | 67.52/54.34 | 69.6/55.6 |
MT5-Large | 81.1 | 88.9 | 77.8/61.5 | 71.2/51.7 | 69.9/52.2 |
The following table contains our test results on the NCLS English-to-Chinese summarization (EN2ZHSUM) dataset.
The metrics are ROUGE-1/ROUGE-2/ROUGE-L.
Model | NCLS English to Chinese |
---|---|
GLM-Large | 50.27/30.94/38.44 |
MT5-Large (Reproduced) | 42.31/22.40/31.33 |
Please first install PyTorch:

```shell
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html --no-cache-dir
```

and apex.
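A minimal sketch of building apex from source, assuming a CUDA-enabled environment (the flags follow the NVIDIA/apex README; check it for the currently recommended command):

```shell
# Build apex with its C++ and CUDA extensions
# (adjust the flags to your apex and pip versions).
git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```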
Then install the other dependencies:

```shell
pip3 install -r requirements.txt
```
- Download the XTREME data and check the experiment setup in `scripts/ds_finetune_superglue.sh`. Note that `DATA_ROOT`, `CHECKPOINT_PATH`, and `SAVE_PATH` need to be changed to your local paths. You may also change `batch-size` and `nproc_per_node` according to your available hardware (see the sketch after this list for example values).
- For classification tasks, we use the script `scripts/ds_finetune_superglue.sh`. Run the following script to train on the XNLI dataset:
  ```shell
  bash scripts/ds_finetune_superglue.sh \
       config_tasks/model_blocklm_multilingual_large.sh \
       config_tasks/task_xnli.sh
  ```
- For QA tasks, we use the script `scripts/ds_finetune_seq2seq.sh`. Run the following script to train on the MLQA dataset:
  ```shell
  bash scripts/ds_finetune_seq2seq.sh \
       config_tasks/model_blocklm_multilingual_large.sh \
       config_tasks/seq_mlqa.sh
  ```
- Download the NCLS dataset.
- For summarization tasks, we use the script `scripts/ds_finetune_summary.sh`. Run the following to train on NCLS English-to-Chinese:
  ```shell
  bash scripts/ds_finetune_summary.sh \
       config_tasks/model_blocklm_multilingual_large.sh \
       config_tasks/seq_ncls.sh
  ```
- Change `CHECKPOINT_PATH` in `scripts/generate_block.sh` to your local path and run the following script:
  ```shell
  bash scripts/generate_block.sh \
       config_tasks/model_blocklm_multilingual_large.sh
  ```
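As a rough illustration of the path settings mentioned in the first step above, the edited variables in `scripts/ds_finetune_superglue.sh` might look like the sketch below. All paths are placeholders, and the exact variable layout in your copy of the script may differ:

```shell
# Hypothetical local-path settings for scripts/ds_finetune_superglue.sh
DATA_ROOT=/data/xtreme                   # where the downloaded XTREME data lives
CHECKPOINT_PATH=/checkpoints/mglm-large  # the pretrained checkpoint you downloaded
SAVE_PATH=/checkpoints/finetuned         # where finetuned checkpoints are written

# For smaller hardware, reduce the per-GPU batch size and the number of
# processes per node (these correspond to the batch-size and nproc_per_node
# settings referenced above; the shell variable names here are illustrative).
BATCH_SIZE=8
NPROC_PER_NODE=4
```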
If you encounter the CUDA out of memory error, which means your GPU memory is limited, you can try model parallelism to divide the parameters across multiple GPUs. Take two-way model parallelism as an example. First run `change_mp.py` to divide the checkpoint:

```shell
python3 change_mp.py path_to_the_checkpoint 2
```

Then update the checkpoint path in the model config file (such as `config_tasks/model_blocklm_multilingual_large.sh`) and change `MP_SIZE` in the script (such as `scripts/ds_finetune_superglue.sh`) to 2.
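Concretely, the two edits might look like the sketch below; the split checkpoint's directory name is an assumption (check the output of `change_mp.py` for the actual path):

```shell
# In the model config file (e.g. config_tasks/model_blocklm_multilingual_large.sh):
CHECKPOINT_PATH=/checkpoints/mglm-large-mp2   # hypothetical path produced by change_mp.py

# In the task script (e.g. scripts/ds_finetune_superglue.sh):
MP_SIZE=2                                     # two-way model parallelism
```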
Run the following script to pretrain the mGLM-Large model:

```shell
bash scripts/ds_pretrain_nvidia.sh config/ds_multi_blockta_large.sh
```
The script `scripts/ds_pretrain_nvidia.sh` launches the training program with DeepSpeed. You should change `NUM_WORKERS` and `NUM_GPUS_PER_WORKER` to the number of workers and the number of GPUs per worker, and change `HOST_FILE_PATH` to the path of an OpenMPI-style hostfile. More details about the DeepSpeed launcher can be found here.
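For a two-machine cluster with eight GPUs each, the settings might look like the sketch below; the hostnames, slot counts, and hostfile path are all placeholders:

```shell
# Hypothetical values to set in scripts/ds_pretrain_nvidia.sh
NUM_WORKERS=2                     # number of machines
NUM_GPUS_PER_WORKER=8             # GPUs on each machine
HOST_FILE_PATH=/path/to/hostfile

# An OpenMPI-style hostfile lists one machine per line with its GPU slot count.
cat > /path/to/hostfile <<EOF
worker-1 slots=8
worker-2 slots=8
EOF
```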
The file `config/ds_multi_blockta_large.sh` defines the hyperparameters for pretraining. Most of the arguments are fairly self-explanatory. Specifically, `--train-data` can be multiple keywords defined in `NAMED_CORPORA` in `data_utils/corpora.py`. The hyperparameters of the optimizer are defined in the corresponding json file under `config`. The semantics of the json file can be found here.
The code for reproducing the MT5 baseline experiments is at `mt5/finetune_mt5.py`. We use a tool called wandb to track our experiments. After signing up for an account, use `wandb login --relogin` to log in. You can also use `wandb offline` to stop wandb from syncing your experiments online.
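For reference, both are standard wandb CLI commands:

```shell
# Log in to wandb (prompts for your API key the first time).
wandb login --relogin

# Or keep runs local only and skip online syncing.
wandb offline
```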
If you only want to train with a single GPU, run the following to train on the scisummnet dataset:

```shell
cd mt5
python3 finetune_mt5.py scisummnet simple
```
Our distributed training is automated with Accelerate. `accelerate config` sets up the configuration for distributed training, and `accelerate test` runs a sanity check.
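Both are standard Accelerate CLI commands:

```shell
# Interactively create the Accelerate launch configuration,
# then verify that the distributed setup works.
accelerate config
accelerate test
```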
Run the following to launch the training on the scisummnet dataset:

```shell
cd mt5
accelerate launch finetune_mt5.py scisummnet simple
```
Citation for the GLM paper:
@inproceedings{du-etal-2022-glm,
title = "{GLM}: General Language Model Pretraining with Autoregressive Blank Infilling",
author = "Du, Zhengxiao and
Qian, Yujie and
Liu, Xiao and
Ding, Ming and
Qiu, Jiezhong and
Yang, Zhilin and
Tang, Jie",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.26",
doi = "10.18653/v1/2022.acl-long.26",
pages = "320--335",
abstract = "There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs the best for all tasks of three main categories including natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25{\mbox{$\times$}} parameters of BERT Large , demonstrating its generalizability to different downstream tasks.",
}
The citation for the Multilingual GLM paper will be released later.