Pre-train and Fine-tune Language Model with Hugging Face

A pre-trained language model is an important infrastructure capability that can support many different use cases, such as classification and generation. Many existing monolingual models focus on English, while customers who speak other languages also need pre-trained models for their own use cases. In addition, multilingual models may not deliver ideal performance on some downstream tasks in certain languages, so one may want to pre-train a monolingual model for a specific language to improve performance.

This is a general guideline for pre-training and fine-tuning language models with Hugging Face. For illustration, we use pre-training a language model for question generation (answer-agnostic) in Korean as a running example.

General Guideline

Overview

[Figure: overview of pre-training and fine-tuning with Hugging Face]

The figure above shows the overview of pre-training and fine-tuning with Hugging Face. Specifically, one can follow the steps summarized below.

  1. Choose the model. Hugging Face Transformers provides a wide range of state-of-the-art models across different modalities and backends (we focus on language models and PyTorch for now). Roughly speaking, language models can be grouped into two main classes based on the downstream use cases. (Check this list for supported models on Hugging Face.)

  2. Prepare the pre-train corpus. Hugging Face Datasets provides useful toolkits to prepare and share data for different use cases (again, we focus on NLP for now). Check this tutorial to get started; a minimal loading sketch is also given after this list. There are also many public resources that could be considered as a potential corpus (some of them are also available from Hugging Face, check this page). For example,

    • Wiki Dumps: A complete copy of all Wikimedia wikis.
    • CC-100: Constructed using the URLs and paragraph indices from the CC-Net repository by processing January-December 2018 Commoncrawl snapshots.
  3. Train the tokenizer. Once the model is chosen and the pre-train corpus is prepared, one may also want to train the tokenizer (associated with the model) on the pre-train corpus from scratch. Hugging Face Tokenizers provides the pipeline to train different types of tokenizers; follow this example to get started, and see the tokenizer sketch after this list. Some commonly used tokenizers include Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram).

  4. Pre-train the model. Hugging Face Transformers also provides convenient wrappers for training deep neural networks (a minimal pre-training sketch is given after this list). In particular,

    • DataCollator: There are many pre-defined DataCollators that can meet the requirements of different models and pre-train tasks (objectives). One can also build a customized DataCollator upon the existing ones if needed.
    • TrainingArguments/Trainer: With this convenient wrapper for the training loop, one can simply specify hyperparameters (learning rate, batch size, etc.) in TrainingArguments and pass them, along with the chosen model, pre-train corpus, and trained tokenizer, to Trainer for training. One can also build a customized Trainer upon the existing ones if needed.
    • Seq2SeqTrainingArguments/Seq2SeqTrainer: Similar wrappers as above for sequence-to-sequence models.
  5. Fine-tune the model. Depending on the use case, one can now fine-tune the pre-trained model for different downstream tasks.

    • Prepare data: as before, Hugging Face Datasets can be used to prepare and share data.
    • Train: as before, Hugging Face Transformers (DataCollator, Trainer, etc.) can be used to train the model.
    • Evaluate: Hugging Face Evaluate includes many commonly used metrics for different domains (again, we focus on NLP for now). Check this tour to get started and this page for the list of supported metrics; a short metric sketch is also given after this list.
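
The sketches below illustrate the steps above with small Python snippets. They are minimal sketches rather than the exact scripts of this repository: file names, checkpoint paths, and hyperparameters are placeholders. First, preparing a pre-train corpus by loading a local plain-text file with Hugging Face Datasets (the file corpus.txt is an assumed placeholder):

    from datasets import load_dataset

    # Load a local plain-text corpus (one passage per line) as a Dataset.
    # "corpus.txt" is a placeholder path, not a file shipped with this repository.
    raw_dataset = load_dataset("text", data_files={"train": "corpus.txt"})

    print(raw_dataset["train"][0])  # {'text': '...'}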
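
Next, a sketch of training a SentencePiece-style (Unigram) tokenizer from scratch with Hugging Face Tokenizers; the vocabulary size and special tokens below are illustrative assumptions and should be matched to the chosen model:

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # A Unigram model with Metaspace pre-tokenization approximates a SentencePiece tokenizer.
    tokenizer = Tokenizer(models.Unigram())
    tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

    trainer = trainers.UnigramTrainer(
        vocab_size=32000,  # illustrative; match the target model
        special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
        unk_token="<unk>",
    )

    # Train on the raw corpus file(s) and save the result to disk.
    tokenizer.train(["corpus.txt"], trainer)
    tokenizer.save("tokenizer.json")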
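
Then, a sketch of the pre-training loop built from a pre-defined DataCollator plus TrainingArguments/Trainer. For simplicity this sketch uses the masked language modeling objective with a BERT-style model initialized from scratch; our running example instead uses a customized collator and Seq2SeqTrainer (see the Example section below). All sizes and hyperparameters are placeholders:

    from datasets import load_dataset
    from transformers import (
        BertConfig,
        BertForMaskedLM,
        DataCollatorForLanguageModeling,
        PreTrainedTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    # Wrap the tokenizer trained in the previous sketch.
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_file="tokenizer.json",
        pad_token="<pad>", bos_token="<s>", eos_token="</s>",
        unk_token="<unk>", mask_token="<mask>",
    )

    # A model configured and initialized from scratch (illustrative architecture).
    config = BertConfig(vocab_size=len(tokenizer), pad_token_id=tokenizer.pad_token_id)
    model = BertForMaskedLM(config)

    # Tokenize the raw corpus prepared with Hugging Face Datasets.
    raw_dataset = load_dataset("text", data_files={"train": "corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    train_dataset = raw_dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

    # Pre-defined collator that applies dynamic masking for the MLM objective.
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    training_args = TrainingArguments(
        output_dir="pretrain-output",
        per_device_train_batch_size=16,
        learning_rate=1e-4,
        num_train_epochs=1,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=data_collator,
    )
    trainer.train()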
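
Finally, a sketch of computing a metric with Hugging Face Evaluate (BLEU, which is also the metric of our running example):

    import evaluate

    # Load the BLEU metric from the Hugging Face Evaluate hub.
    bleu = evaluate.load("bleu")

    predictions = ["the cat sat on the mat"]
    references = [["the cat is sitting on the mat"]]  # one or more references per prediction

    results = bleu.compute(predictions=predictions, references=references)
    print(results["bleu"])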

Example

For our running example, the specification is summarized as follows (one can also use our scripts as a simple template and swap in a different model/data/etc. to get started).

  1. Choose the model. As our use case is question generation (answer-agnostic) in Korean, we consider the ProphetNet / XLM-ProphetNet model, and the goal is to provide ProphetNet-Ko (Base/Large) model checkpoints that can be fine-tuned for question generation in Korean.

  2. Prepare the pre-train corpus (script for preparing corpus). In addition to Wiki Dumps and CC-100 mentioned before, we also consider the following sources for our pre-train corpus:

    • NamuWiki: Namu Wikipedia in a text format.
    • Petition: Data collected from the Blue House National Petition (2017.08 ~ 2019.03).

    The base pre-train corpus is around 16GB and the large pre-train corpus is around 75GB.

  3. Train the tokenizer (script for training the tokenizer). We train the (base/large) SentencePiece tokenizer (associated with XLM-ProphetNet) with a vocabulary size of 32K on the (base/large) pre-train corpus.

  4. Pre-train the model (script for preparing pre-train data and script for pre-training). We define our customized DataCollator and Seq2SeqTrainer to adopt the future n-gram prediction objective (a new sequence-to-sequence pre-train task proposed by this paper). We pre-train the base model (~125M parameters) on the 16GB base corpus and the large model (~400M parameters) on the 75GB large corpus.

  5. Fine-tune the model (script for preparing fine-tune data and script for fine-tuning). As our downstream task is question generation (answer-agnostic), we consider KLUE-MRC and KorQuAD v1.0 as potential datasets for fine-tuning, and we use BLEU scores as the evaluation metric. A minimal fine-tuning sketch is given after this list.
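
The sketch below shows answer-agnostic question generation fine-tuning with Seq2SeqTrainer. It assumes a pre-trained ProphetNet-Ko checkpoint and its tokenizer are available at a placeholder local path, and loads KorQuAD v1.0 from the Hugging Face Hub (dataset id squad_kor_v1); the column names, sequence lengths, and hyperparameters are illustrative rather than the exact values from our scripts.

    from datasets import load_dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    # Placeholder path to the pre-trained ProphetNet-Ko checkpoint and tokenizer.
    checkpoint = "path/to/prophetnet-ko-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # KorQuAD v1.0: the model reads the passage (context) and learns to
    # generate the question (answer-agnostic setting).
    korquad = load_dataset("squad_kor_v1")

    def preprocess(batch):
        model_inputs = tokenizer(batch["context"], max_length=512, truncation=True)
        labels = tokenizer(text_target=batch["question"], max_length=64, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized = korquad.map(
        preprocess, batched=True, remove_columns=korquad["train"].column_names
    )

    # Pads inputs and labels dynamically for sequence-to-sequence training.
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

    training_args = Seq2SeqTrainingArguments(
        output_dir="qg-finetune",
        per_device_train_batch_size=8,
        learning_rate=3e-5,
        num_train_epochs=3,
        predict_with_generate=True,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        data_collator=data_collator,
    )
    trainer.train()

After training, questions can be generated with model.generate and scored with the BLEU sketch shown in the General Guideline section.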

Reference

[1] Tutorials and examples from Hugging Face.

[2] Example of pre-training with Hugging Face.

[3] Example of fine-tuning with Hugging Face for question generation (answer-agnostic).

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
