# Chapter 4: Pretraining a RoBERTa model from scratch

We will build a RoBERTa model from scratch. The model will use the **bricks of the transformer construction kit we need for BERT models**. Also, no pretrained tokenizers or models will be used. The RoBERTa model will be built following the $15$-step process described in this chapter.

This chapter will focus on building a pretrained transformer model from scratch using a Jupyter notebook based on Hugging Face’s seamless modules. The model is named KantaiBERT.

**KantaiBERT first loads a compilation of Immanuel Kant’s books created for this chapter**. You will see how the data was obtained. You will also see how to create your own datasets for this notebook.

**KantaiBERT trains its own tokenizer from scratch**. It will build its merge and vocabulary files, which will be used during the pretraining process.

**KantaiBERT then processes the dataset, initializes a trainer, and trains the model**.

Finally, **KantaiBERT uses the trained model to perform an experimental downstream language modeling task and fills a mask using Immanuel Kant’s logic**.

By the end of the chapter, you will know how to build a transformer model from scratch. You will also have enough knowledge of transformers to face the Industry 4.0 challenge of using powerful pretrained transformers, such as GPT-3 engines, whose implementation requires more than development skills. This chapter prepares you for *Chapter 7, The Rise of Suprahuman Transformers with GPT-3 Engines*.

This chapter covers the following topics:

* *RoBERTa*- and *DistilBERT*-like models
* How to train a tokenizer from scratch
* Byte-level byte-pair encoding
* Saving the trained tokenizer to files
* Recreating the tokenizer for the pretraining process
* Initializing a *RoBERTa* model from scratch
* Exploring the configuration of the model
* Exploring the $84$ million parameters of the model
* Building the dataset for the trainer
* Initializing the trainer
* Pretraining the model
* Saving the model
* Applying the model to the downstream tasks of **Masked Language Modeling (MLM)**

Our first step will be to describe the transformer model that we are going to build.

## **Training a tokenizer and pretraining a transformer**

We will train a transformer model named KantaiBERT using the building blocks provided by Hugging Face for BERT-like models. We covered the theory of the building blocks of the model we will be using in *Chapter 3*.

We will describe KantaiBERT, building on the knowledge we acquired in previous chapters.

KantaiBERT is a **Robustly Optimized BERT Pretraining Approach (RoBERTa)**-like model based on the architecture of BERT.

The initial BERT models brought innovative features to the initial transformer models, as we saw in *Chapter 3*. RoBERTa increases the performance of transformers for downstream tasks by improving the mechanics of the pretraining process.

For example, it does not use **WordPiece** tokenization but goes down to byte-level **Byte-Pair Encoding (BPE)**. This method paved the way for a wide variety of BERT and BERT-like models.

In this chapter, KantaiBERT, like BERT, will be trained using **Masked Language Modeling (MLM)**. **MLM is a language modeling technique that masks a word in a sequence**. The transformer model must train to predict the masked word.
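
To make the masking step concrete, here is a minimal sketch in plain Python. The helper `mask_one_token`, the example sentence, and the whitespace split are illustrative assumptions, not the notebook's code. Real MLM pipelines mask a random fraction of subword tokens (15% in the original BERT setup) rather than one hand-picked word.

```python
def mask_one_token(tokens, position, mask_token="<mask>"):
    """Return a masked copy of `tokens` and the label the model must predict."""
    masked = list(tokens)          # copy so the original sequence is untouched
    label = masked[position]       # the hidden word becomes the training target
    masked[position] = mask_token  # the model only sees the mask token here
    return masked, label


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "The transformer must predict the masked word".split()
masked, label = mask_one_token(tokens, position=3)

print(" ".join(masked))  # The transformer must <mask> the masked word
print(label)             # predict
```

During training, the loss is computed only on the masked position: the model is rewarded when it assigns a high probability to the hidden label.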

KantaiBERT will be trained as a small model with $6$ layers, $12$ heads, and $84,095,008$ parameters. It might seem that $84$ million parameters is a lot. However, with only $6$ layers of $12$ heads each, it is a relatively small model by transformer standards. A small model will make the pretraining experience smooth so that each step can be viewed in real time without waiting for hours to see a result.
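
The parameter total can be checked with simple arithmetic. The sketch below is one breakdown that reproduces the $84,095,008$ figure. Only the vocabulary size ($52,000$) and the number of layers ($6$) come from this chapter's text; the hidden size of $768$, the feed-forward size of $3,072$, the $514$ positions, the single token type, the pooler layer, and the tied output weights are assumptions chosen because they match the stated total.

```python
# Values taken from the text.
VOCAB = 52_000   # tokenizer vocabulary size
LAYERS = 6       # encoder layers

# Assumed values (not stated in this section).
HIDDEN = 768       # hidden size
FFN = 3_072        # feed-forward (intermediate) size
POSITIONS = 514    # position embeddings
TOKEN_TYPES = 1    # a single token type


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + POSITIONS + TOKEN_TYPES) * HIDDEN + LAYER_NORM
attention = 4 * linear(HIDDEN, HIDDEN) + LAYER_NORM          # Q, K, V, output
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + LAYER_NORM
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)                               # assumed to be counted
# Assumed: output projection shares the embedding matrix, so only its bias is new.
lm_head = linear(HIDDEN, HIDDEN) + LAYER_NORM + VOCAB

total = embeddings + encoder + pooler + lm_head
print(f"embeddings: {embeddings:,}")  # embeddings: 40,333,056
print(f"encoder:    {encoder:,}")     # encoder:    42,527,232
print(f"total:      {total:,}")       # total:      84,095,008
```

Under these assumptions, almost half of the parameters sit in the embedding table, and the number of heads does not appear in the count at all: the heads split the hidden dimension between them rather than adding weights.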

**KantaiBERT is a DistilBERT-like model because it has the same architecture of $6$ layers and $12$ heads**. **DistilBERT is a distilled version of BERT**. DistilBERT, as the name suggests, contains fewer parameters than a RoBERTa model. As such, it runs much faster, but the results are slightly less accurate than with a RoBERTa model.

We know that large models achieve excellent performance. But what if you want to run a model on a smartphone? Distillation, like other parameter-reduction methods that may appear in the future, is a clever way of taking the best of pretraining and making it efficient enough for the needs of many downstream tasks.

KantaiBERT will implement a byte-level byte-pair encoding tokenizer like the one used by GPT-2. The special tokens will be the ones used by RoBERTa. BERT models most often use a WordPiece tokenizer.

There are no token type IDs to indicate which segment a token belongs to. Instead, the segments will be separated with the separation token `</s>`.

KantaiBERT will use a custom dataset, train a tokenizer, train the transformer model, save it, and run it with an MLM example.

Let’s get going and build a transformer from scratch.

## **Building KantaiBERT from scratch**

We will build KantaiBERT in 15 steps from scratch and run it on an MLM example.

Open the `KantaiBERT_Repro.ipynb` file.

### **Step 1: Loading the dataset**

Ready-to-use datasets provide an objective way to train and compare transformers. In *Chapter 5, Downstream NLP Tasks with Transformers*, we will explore several datasets. However, this chapter aims to help you understand the training process of a transformer with notebook cells that can be run in real time without waiting for hours to obtain a result.

I chose to use the works of Immanuel Kant (1724-1804), the German philosopher who was the epitome of the Age of Enlightenment. The idea is to introduce human-like logic and pretrained reasoning for downstream reasoning tasks.

[Project Gutenberg](https://www.gutenberg.org) offers a wide range of free eBooks that can be downloaded in text format. You can use other books if you want to create customized datasets of your own based on books.

I compiled the following three books by Immanuel Kant into a text file named `kant.txt`:

* The Critique of Pure Reason
* The Critique of Practical Reason
* Fundamental Principles of the Metaphysic of Morals

`kant.txt` provides a small training dataset to train the transformer model of this chapter. The result obtained remains experimental. For a real-life project, I would add the complete works of Immanuel Kant, René Descartes, Pascal, and Leibniz, for example.

The text file contains the raw text of the books:
> ...For it is in reality vain to profess _indifference_ in regard to such inquiries, the object of which cannot be indifferent to humanity.
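
If you want to assemble a similar corpus from your own plain-text downloads, the following is a minimal sketch. The function `build_corpus` and the file names are hypothetical, and the demo uses two tiny stand-in files in a temporary directory instead of real books. Project Gutenberg files also carry license headers and footers that you may want to strip first; that step is not shown here.

```python
import tempfile
from pathlib import Path


def build_corpus(book_paths, output_path):
    """Concatenate plain-text books into one file; return its length in characters."""
    texts = [Path(p).read_text(encoding="utf-8").strip() for p in book_paths]
    corpus = "\n\n".join(texts) + "\n"  # blank line between books
    Path(output_path).write_text(corpus, encoding="utf-8")
    return len(corpus)


# Demo with stand-in files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "book_a.txt").write_text("First book.\n", encoding="utf-8")
    (tmp / "book_b.txt").write_text("Second book.\n", encoding="utf-8")
    size = build_corpus([tmp / "book_a.txt", tmp / "book_b.txt"], tmp / "corpus.txt")
    corpus_text = (tmp / "corpus.txt").read_text(encoding="utf-8")

print(size)
print(corpus_text)
```

With real books, you would pass the paths of the downloaded text files and write the result to `kant.txt` or to a file name of your choice.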

### **Step 2: Installing Hugging Face transformers**

We will need to install Hugging Face transformers and tokenizers, but we will not need TensorFlow in this instance of the Google Colab VM so we can remove it.
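
The packages are typically installed from a notebook cell with `pip` (for example, `pip install transformers tokenizers`). Once that is done, a quick way to confirm what the environment contains is to query the installed versions. The helper `installed_version` below is a hypothetical name, and the sketch only uses Python's standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report on the libraries this chapter relies on, plus the one we can do without.
for name in ("transformers", "tokenizers", "tensorflow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```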

### **Step 3: Training a tokenizer**

In this section, **the program does not use a pretrained tokenizer**. **A pretrained GPT-2 tokenizer could be used**, for example. However, **the training process in this chapter includes training a tokenizer from scratch**.

**Hugging Face**’s `ByteLevelBPETokenizer()` will be trained using `kant.txt`. A BPE tokenizer **will break a string or word down into substrings or subwords**. There are two main advantages to this, among many others:

- The tokenizer can break words into minimal components. Then it will merge these small components into statistically interesting ones. For example, “smaller” and “smallest” can become “small,” “er,” and “est.” The tokenizer can go further: we could get “sm” and “all,” for example. In any case, words are broken down into subword tokens, and possibly into even smaller units such as “sm” and “all” instead of simply “small.”
- The chunks of strings classified as unknown, `unk_token`, using `WordPiece` level encoding, will practically disappear.
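
The merge idea can be illustrated with a toy, character-level version of BPE. This is a simplified sketch and not Hugging Face's implementation: the real tokenizer works on bytes, tracks whitespace, and learns tens of thousands of merges. The word counts and function names below are invented for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency; return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical word frequencies; each word starts as a tuple of characters.
corpus = {"small": 5, "smaller": 3, "smallest": 2}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(words)
# {('small',): 5, ('small', 'e', 'r'): 3, ('small', 'e', 's', 't'): 2}
```

After four merges, the frequent stem “small” has become a single token, while the rarer endings are still split into characters. Because any string can always fall back to these smallest units, very little text ends up mapped to an unknown token.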

In this model, we will be training the tokenizer with the following parameters:

- `files=paths` is the path to the dataset
- `vocab_size=52_000` is the size of our tokenizer’s vocabulary
- `min_frequency=2` is the minimum frequency threshold
- `special_tokens=[]` is a list of special tokens

In this case, the list of special tokens is:

- `<s>`: a start token
- `<pad>`: a padding token
- `</s>`: an end token
- `<unk>`: an unknown token
- `<mask>`: the mask token for language modeling

The tokenizer will be trained to generate merged substring tokens and analyze their frequency.
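
Put together, the parameters above could be passed to the tokenizer roughly as follows. This is a sketch and not necessarily the notebook's exact cell: the wrapper function `train_kant_tokenizer` and its `paths` argument are hypothetical, and the call requires the `tokenizers` package to be installed.

```python
# The five special tokens listed above, in the order they are registered.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def train_kant_tokenizer(paths, vocab_size=52_000, min_frequency=2):
    """Train a byte-level BPE tokenizer on the text files in `paths`."""
    # Imported here so the sketch can be read without the library installed.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=paths,                    # list of text file paths
        vocab_size=vocab_size,          # upper bound on the vocabulary size
        min_frequency=min_frequency,    # ignore pairs seen fewer times than this
        special_tokens=SPECIAL_TOKENS,  # reserved tokens added to the vocabulary
    )
    return tokenizer
```

A call such as `tokenizer = train_kant_tokenizer(["kant.txt"])` would then train on the dataset from *Step 1*.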

Let’s take these two words in the middle of a sentence:

`...the tokenizer...`

The first step will be to tokenize the string:

```py
'Ġthe', 'Ġtoken', 'izer'
```

The string is now split into tokens, with `Ġ` marking a leading whitespace.

The next step is to replace them with their indices:

| ‘Ġthe’ | ‘Ġtoken’ | ‘izer’ |
| ------ | -------- | ------ |
| 150    | 5430     | 4712   |
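
The lookup shown in the table can be sketched with a toy vocabulary. The three indices are the ones from the table; a trained vocabulary would hold many more entries. Replacing `Ġ` with a space is a simplification of byte-level decoding used here for illustration only.

```python
# Toy vocabulary containing only the three entries from the table.
toy_vocab = {"Ġthe": 150, "Ġtoken": 5430, "izer": 4712}
id_to_token = {index: token for token, index in toy_vocab.items()}

tokens = ["Ġthe", "Ġtoken", "izer"]

# Encoding: replace each token with its index.
ids = [toy_vocab[token] for token in tokens]
print(ids)  # [150, 5430, 4712]

# Decoding: map indices back to tokens, join them, and restore the spaces.
text = "".join(id_to_token[index] for index in ids).replace("Ġ", " ")
print(repr(text))  # ' the tokenizer'
```

The model never sees the strings themselves. It only receives the list of indices.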


### **Step 4: Saving the files to disk**

The tokenizer will generate two files when trained:

* `merges.txt`, which contains the merged tokenized substrings
* `vocab.json`, which contains the indices of the tokenized substrings
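
To get a feel for the two formats, the sketch below writes miniature stand-ins for both files to a temporary directory and reads them back. The contents are invented for illustration, and the layout is an assumption based on the usual output of byte-level BPE tokenizers: a JSON object mapping tokens to indices, and a text file with a version header followed by one space-separated merge per line.

```python
import json
import tempfile
from pathlib import Path


def load_tokenizer_files(directory):
    """Read vocab.json and merges.txt from `directory`."""
    directory = Path(directory)
    vocab = json.loads((directory / "vocab.json").read_text(encoding="utf-8"))
    lines = (directory / "merges.txt").read_text(encoding="utf-8").splitlines()
    # Skip the version header and empty lines; each remaining line is one merge.
    merges = [
        tuple(line.split(" "))
        for line in lines
        if line and not line.startswith("#version")
    ]
    return vocab, merges


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Miniature, made-up file contents.
    toy_vocab = {"<s>": 0, "<pad>": 1, "Ġthe": 150}
    (tmp / "vocab.json").write_text(json.dumps(toy_vocab), encoding="utf-8")
    (tmp / "merges.txt").write_text("#version: 0.2\nĠ t\nĠt he\n", encoding="utf-8")
    vocab, merges = load_tokenizer_files(tmp)

print(vocab["Ġthe"])  # 150
print(merges)         # [('Ġ', 't'), ('Ġt', 'he')]
```

The order of the lines in `merges.txt` matters: merges are applied in the order in which they were learned.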

The program first creates the `KantaiBERT` directory and then saves the two files: