# Update README of QA examples (huggingface#14172)
#### Fine-tuning BERT on SQuAD1.0 with relative position embeddings

The following examples show how to fine-tune BERT models with different relative position embeddings. The
`bert-base-uncased` model was pretrained with the default absolute position embeddings. We provide the following
pretrained models, which were pretrained on the same training data (BooksCorpus and English Wikipedia) as the original
BERT model, but with different relative position embeddings:

* `zhiheng-huang/bert-base-uncased-embedding-relative-key`, trained from scratch with the relative position embedding
  proposed by Shaw et al. in [Self-Attention with Relative Position Representations](https://arxiv.org/abs/1803.02155)
* `zhiheng-huang/bert-base-uncased-embedding-relative-key-query`, trained from scratch with relative position embedding
  method 4 from Huang et al., [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
* `zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query`, fine-tuned from
  `bert-large-uncased-whole-word-masking` for 3 additional epochs with relative position embedding method 4 from
  Huang et al., [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)

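As a quick sanity check that these checkpoints load like any other BERT model, here is a minimal sketch using the
standard `transformers` auto classes (only the model name comes from the list above; the QA head is freshly
initialized until the model is fine-tuned):

```python
# Minimal sketch: load one of the relative-position-embedding checkpoints
# listed above with the standard transformers auto classes.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "zhiheng-huang/bert-base-uncased-embedding-relative-key-query"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# The relative position embedding variant is recorded in the config,
# e.g. "relative_key" or "relative_key_query" instead of "absolute".
print(model.config.position_embedding_type)
```
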
##### Base models fine-tuning

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir relative_squad \
    --per_device_eval_batch_size=60 \
    --per_device_train_batch_size=6
```
Training with the above command leads to the following results. It boosts the default BERT f1 score from 88.52 to 90.54.

```bash
'exact': 83.6802270577105, 'f1': 90.54772098174814
```

Changing `max_seq_length` from 512 to 384 in the above command leads to an f1 score of 90.34. Replacing the model
`zhiheng-huang/bert-base-uncased-embedding-relative-key-query` with
`zhiheng-huang/bert-base-uncased-embedding-relative-key` leads to an f1 score of 89.51. Switching from 8 GPUs to
single-GPU training leads to an f1 score of 90.71.

##### Large models fine-tuning

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir relative_squad \
    --per_gpu_eval_batch_size=6 \
    --per_gpu_train_batch_size=2 \
    --gradient_accumulation_steps 3
```
Training with the above command leads to an f1 score of 93.52, which is slightly better than the f1 score of 93.15 for
`bert-large-uncased-whole-word-masking`.

#### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the BERT whole-word-masking uncased model to reach
an F1 > 93 on SQuAD1.1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
    --per_device_eval_batch_size=3 \
    --per_device_train_batch_size=3
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 93.15
exact_match = 86.91
```

This fine-tuned model is available as a checkpoint under the reference
[`bert-large-uncased-whole-word-masking-finetuned-squad`](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad).

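To use the published checkpoint directly instead of reproducing the fine-tuning, a minimal sketch with the
`transformers` question-answering pipeline looks as follows (the question and context strings are made-up examples;
only the model name comes from the reference above):

```python
# Minimal usage sketch for the published SQuAD checkpoint.
# The question/context strings are illustrative only.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="Which dataset was the model fine-tuned on?",
    context="This checkpoint was fine-tuned on SQuAD1.1 and extracts answer spans from a given context.",
)
print(result["answer"], result["score"])
```
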
## Results

A larger batch size may improve performance at the cost of more memory.

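One way to reason about this trade-off is the effective batch size: per-device batch size times the number of devices
times the number of gradient accumulation steps. A small worked example with the numbers from the large-model command
above:

```python
# Effective batch size = per-device batch size * number of devices
#                        * gradient accumulation steps.
# Values taken from the large-model fine-tuning command above.
per_gpu_train_batch_size = 2
num_gpus = 8                      # CUDA_VISIBLE_DEVICES=0,...,7
gradient_accumulation_steps = 3

effective_batch_size = (
    per_gpu_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(effective_batch_size)  # 48
```
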
##### Results for SQuAD1.0 with the previously defined hyper-parameters:

```python
{
    "exact": 85.45884578997162,
    "f1": 92.5974600601065,
    "total": 10570,
    "HasAns_exact": 85.45884578997162,
    "HasAns_f1": 92.59746006010651,
    "HasAns_total": 10570
}
```

##### Results for SQuAD2.0 with the previously defined hyper-parameters:

```python
{
    "exact": 80.4177545691906,
    "f1": 84.07154997729623,
    "total": 11873,
    "HasAns_exact": 76.73751686909581,
    "HasAns_f1": 84.05558584352873,
    "HasAns_total": 5928,
    "NoAns_exact": 84.0874684608915,
    "NoAns_f1": 84.0874684608915,
    "NoAns_total": 5945
}
```
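
As a sanity check on how the aggregate SQuAD2.0 numbers relate to the per-category ones, the overall score is the
count-weighted average of the answerable (HasAns) and unanswerable (NoAns) subsets; a small sketch with the
exact-match values above:

```python
# Overall SQuAD2.0 exact match is the count-weighted average of the
# HasAns and NoAns exact-match scores reported above.
has_ans_exact, has_ans_total = 76.73751686909581, 5928
no_ans_exact, no_ans_total = 84.0874684608915, 5945

overall_exact = (
    has_ans_exact * has_ans_total + no_ans_exact * no_ans_total
) / (has_ans_total + no_ans_total)
print(overall_exact)  # ~80.4178, matching the "exact" value above
```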