If a language model is not available in the language you are interested in, or if your corpus is very different from the one your language model was trained on, you will most likely want to **retrain the model from scratch using a tokenizer adapted to your data**. That will require training a new tokenizer on your dataset. But what exactly does that mean? We saw that most Transformer models use a subword tokenization algorithm. To identify which subwords are of interest and occur most frequently in the corpus at hand, the tokenizer needs to take a hard look at all the texts in the corpus — a process we call *training*. The exact rules that govern this training depend on the type of tokenizer used, and we’ll go over the three main algorithms later in this chapter.
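
To make "identifying frequent subwords" concrete, here is a minimal sketch that counts adjacent character pairs in a tiny, made-up corpus and reports the most common one. It is only an illustration of the counting idea: the helper `most_frequent_pair` and the toy corpus are hypothetical, and real tokenizers apply more elaborate rules that depend on the algorithm used.

```python
from collections import Counter
from typing import List, Tuple

def most_frequent_pair(corpus: List[str]) -> Tuple[str, int]:
    """Return the most common adjacent character pair across all words."""
    pair_counts: Counter = Counter()
    for text in corpus:
        for word in text.split():
            # Count every pair of neighbouring characters inside the word.
            for left, right in zip(word, word[1:]):
                pair_counts[left + right] += 1
    return pair_counts.most_common(1)[0]

# Hypothetical toy corpus, chosen so that one pair clearly dominates.
toy_corpus = ["def f():", "def g():", "del x"]
pair, count = most_frequent_pair(toy_corpus)
print(pair, count)
```

A pair that shows up often, like this one, is a natural candidate for becoming a single token in the vocabulary.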

# Assembling a corpus

There’s a very simple API in Hugging Face Transformers that you can use to train a new tokenizer with the same characteristics as an existing one: `AutoTokenizer.train_new_from_iterator()`. To see this in action, let’s say we want to train GPT-2 from scratch, but in a language other than English. **Our first task will be to gather lots of data** in that language in a training corpus. To provide examples everyone will be able to understand, we won’t use a language like Russian or Chinese here, but rather a specialized English language: Python code.

The Hugging Face Datasets library can help us assemble a corpus of Python source code. We’ll use the usual `load_dataset()` function to download and cache the CodeSearchNet dataset. This dataset was created for the CodeSearchNet challenge and contains millions of functions from open source libraries on GitHub in several programming languages. Here, we will load the Python part of this dataset:

In [3]:
from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 412178
    })
    test: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 22176
    })
    validation: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 23107
    })
})

We can see the dataset separates docstrings from code and suggests a tokenization of both. Here, we’ll just use the `whole_func_string` column to train our tokenizer. We can look at an example of one of these functions by indexing into the `train` split:

In [4]:
print(raw_datasets["train"][123456]["whole_func_string"])

def update_field(self, f, obj):
        """ update a field

        :param str f: name of field to be updated.
        :param obj: value of field to be updated.
        """
        n = self.get_private_name(f)
        if not hasattr(self, n):
            raise AttributeError('{0} is not in {1}'.format(n, self.__class__.__name__))

        setattr(self, n, obj)
        self.__origin_keys.add(f)


## Iterator

The first thing we need to do is transform the dataset into an ***iterator*** of lists of texts (for instance, a list of lists of texts). Using lists of texts will enable our tokenizer to go faster (training on batches of texts instead of processing individual texts one by one), and it should be an iterator if we want to avoid having everything in memory at once. If your corpus is huge, you will want to take advantage of the fact that Hugging Face Datasets does not load everything into RAM but stores the elements of the dataset on disk.

Using a Python generator, we can avoid Python loading anything into memory until it’s actually necessary. To create such a generator, you just need to define a function that yields one batch of texts at a time:

In [5]:
from typing import Iterator, List

def get_training_corpus() -> Iterator[List[str]]:
    """Yield the training corpus in batches of 1,000 texts."""
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        # Slicing a Dataset returns a dict of columns for those rows only.
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

In [6]:
training_corpus = get_training_corpus()

This line of code doesn’t fetch any elements of the dataset; it just creates an object you can use in a Python `for` loop. **The texts will only be loaded when you need them** (that is, when you’re at the step of the `for` loop that requires them), and only 1,000 texts at a time will be loaded. This way you won’t exhaust all your memory even if you are processing a huge dataset.
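
The lazy behavior can be checked on a toy list without downloading anything. The sketch below uses a hypothetical `batched` helper and a made-up list of five texts (neither is part of the Datasets library), and records which batches are actually read:

```python
from typing import Iterator, List

loaded_batches: List[int] = []  # start indices of the batches actually read

def batched(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `batch_size` texts."""
    for start_idx in range(0, len(texts), batch_size):
        loaded_batches.append(start_idx)  # runs only when a batch is requested
        yield texts[start_idx : start_idx + batch_size]

toy_texts = [f"def f{i}(): pass" for i in range(5)]

corpus = batched(toy_texts, batch_size=2)
loaded_before = list(loaded_batches)      # still empty: nothing has been read

first_batch = next(corpus)                # reads the first two texts only
loaded_after_first = list(loaded_batches)

remaining = list(corpus)                  # consumes the remaining batches
leftover = list(corpus)                   # the generator is exhausted by now

print(loaded_before, loaded_after_first, len(remaining), leftover)
```

Note that a generator can only be consumed once. If you need to go through the corpus a second time, call the generator function again to get a fresh iterator.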

# Training a new tokenizer

Now that we have our corpus in the form of an iterator of batches of texts, we are ready to train a new tokenizer. To do this, we first need to load the tokenizer we want to pair with our model (here, GPT-2):

In [7]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

Even though we are going to train a new tokenizer, it’s a good idea to **load an existing one first to avoid starting entirely from scratch**. This way, we won’t have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2’s, and the only thing that will change is the vocabulary, which will be determined by the training on our corpus.

**First let’s have a look at how this tokenizer would treat an example function:**

In [14]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
print(len(tokens), tokens)

36 ['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


This tokenizer has a few special symbols, like **`Ġ` and `Ċ`, which denote spaces and newlines**, respectively. As we can see, this is **not too efficient: the tokenizer returns individual tokens for each space**, when it could group together **indentation levels** (since having sets of four or eight spaces is going to be very common in code). It also split the function name a bit weirdly, not being used to seeing words with the `_` character.
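
The `Ġ` and `Ċ` symbols are not arbitrary: GPT-2 works on bytes and maps each byte to a printable character so that tokens can be stored as plain strings. The sketch below is modeled on the `bytes_to_unicode` helper from GPT-2's original encoder; treat it as an illustration of the mapping rather than the library's exact code. The `to_byte_level` function is a hypothetical convenience wrapper.

```python
from typing import Dict

def bytes_to_unicode() -> Dict[int, str]:
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for byte in range(256):
        if byte not in printable:
            # Remaining bytes (space, newline, control characters, ...) are
            # shifted to code points starting at 256.
            byte_values.append(byte)
            code_points.append(256 + offset)
            offset += 1
    return dict(zip(byte_values, (chr(cp) for cp in code_points)))

byte_to_char = bytes_to_unicode()

def to_byte_level(text: str) -> str:
    """Hypothetical helper: show how a string looks after the byte mapping."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))

print(byte_to_char[ord(" ")], byte_to_char[ord("\n")])
print(to_byte_level("\n    return a"))
```

Under this mapping the space byte becomes `Ġ` and the newline byte becomes `Ċ`, which is why an indented line shows up as a newline symbol followed by a run of `Ġ` characters.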

Let’s train a new tokenizer and see if it solves those issues. For this, we’ll use the method `train_new_from_iterator()`:

In [None]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

Note that `AutoTokenizer.train_new_from_iterator()` only works if the tokenizer you are using is a “fast” tokenizer. As you’ll see in the next section, the Hugging Face Transformers library contains two types of tokenizers: some are written purely in Python and others (the fast ones) are backed by the Hugging Face Tokenizers library, which is written in the Rust programming language.

Before diving into that, however, let’s try our brand new tokenizer on the previous example:

In [15]:
tokens = tokenizer.tokenize(example)
print(len(tokens), tokens)

27 ['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


Here we again see the special symbols `Ġ` and `Ċ` that denote spaces and newlines, but we can also see that our tokenizer learned some tokens that are highly specific to a corpus of Python functions: for example, there is a `ĊĠĠĠ` token that represents an indentation, and a `Ġ"""` token that represents the three quotes that start a docstring. The tokenizer also correctly split the function name on `_`. This is quite a compact representation; comparatively, the plain English tokenizer needed 36 tokens for the same example, against 27 here.
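
As a quick sanity check, the two outputs shown above can be compared directly. The sketch below copies both token lists from the cells above, confirms that each one reconstructs the original function, and computes the relative saving. The `detokenize` helper is hypothetical and only handles the `Ġ` and `Ċ` symbols, which is enough for this ASCII-only example.

```python
from typing import List

# Token lists copied from the two outputs shown earlier in this section.
old_tokens = ['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ',
              'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a',
              '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ',
              'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']
new_tokens = ['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ',
              'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand',
              'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']

example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

def detokenize(tokens: List[str]) -> str:
    """Simplified decoding: join tokens and undo the space/newline symbols."""
    return "".join(tokens).replace("Ġ", " ").replace("Ċ", "\n")

# Both tokenizations describe exactly the same text.
same_text = detokenize(old_tokens) == detokenize(new_tokens) == example

# Relative reduction in sequence length obtained with the new tokenizer.
reduction = 1 - len(new_tokens) / len(old_tokens)
print(len(old_tokens), len(new_tokens), same_text, f"{reduction:.0%}")
```

On this single example the new tokenizer produces a sequence that is a quarter shorter, which means more code fits into the same context length.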

# Saving the tokenizer

To make sure we can use it later, we need to save our new tokenizer. As with models, this is done with the `save_pretrained()` method:

In [16]:
tokenizer.save_pretrained("tmp/code-search-net-tokenizer")

('tmp/code-search-net-tokenizer/tokenizer_config.json',
 'tmp/code-search-net-tokenizer/special_tokens_map.json',
 'tmp/code-search-net-tokenizer/vocab.json',
 'tmp/code-search-net-tokenizer/merges.txt',
 'tmp/code-search-net-tokenizer/added_tokens.json',
 'tmp/code-search-net-tokenizer/tokenizer.json')