# Training a new tokenizer from an old one

If a language model is not available in the language you are interested in, or if your corpus is very different from the one your language model was trained on, you will most likely want to retrain the model from scratch using a tokenizer adapted to your data. That will require training a new tokenizer on your dataset. But what exactly does that mean? When we first looked at tokenizers in Chapter 2, we saw that most Transformer models use a `subword tokenization` algorithm. To identify which subwords are of interest and occur most frequently in the corpus at hand, the tokenizer needs to take a hard look at all the texts in the corpus, a process we call training. The exact rules that govern this training depend on the type of tokenizer used, and we'll go over the three main algorithms later in this chapter.
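
To make "training" a little more concrete, here is a minimal sketch of the kind of statistic a subword tokenizer gathers: it counts how often adjacent symbol pairs occur in a toy corpus and reports the most frequent one, which a BPE-style algorithm would then merge into a new subword. The corpus and the helper name `count_adjacent_pairs` are made up for illustration; real tokenizers rely on optimized implementations and additional rules.

```python
from collections import Counter


def count_adjacent_pairs(words):
    """Count adjacent character pairs across all words (toy illustration)."""
    pair_counts = Counter()
    for word in words:
        symbols = list(word)
        # zip pairs each symbol with its right-hand neighbor
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += 1
    return pair_counts


# Hypothetical toy corpus, already split into words
toy_corpus = ["hug", "hugs", "pug", "bun"]
pair_counts = count_adjacent_pairs(toy_corpus)
most_frequent_pair, frequency = pair_counts.most_common(1)[0]
print(most_frequent_pair, frequency)  # ('u', 'g') 3
```

Repeating this count-and-merge step many times is, roughly, how a vocabulary of frequent subwords grows out of the raw corpus.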



## 1. Assembling a corpus

There's a very simple API in 🤗 Transformers that you can use to train a new tokenizer with the same characteristics as an existing one: `AutoTokenizer.train_new_from_iterator()`. To see this in action, let's say we want to train GPT-2 from scratch, but in a language other than English. Our first task will be to gather lots of data in that language in a training corpus. To provide examples everyone will be able to understand, we won't use a language like Russian or Chinese here, but rather programming languages.

In [6]:
from datasets import load_dataset

base_url = "https://huggingface.co/datasets/sentence-transformers/codesearchnet/resolve/main/pair/"
data_files = {
    "train": [
        f"{base_url}train-00000-of-00003.parquet",
        f"{base_url}train-00001-of-00003.parquet",
        f"{base_url}train-00002-of-00003.parquet"
    ]
}

raw_datasets = load_dataset("parquet", data_files=data_files)
print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['comment', 'code'],
        num_rows: 1375067
    })
})


In [14]:
raw_datasets["train"][15]["comment"]

'Creates (or updates) a new ProjectStatus for the given build and\n        returns it.'

Here we create a new column named `whole_function` that concatenates the `comment` and `code` columns, separated by a newline.

In [15]:
def merge_columns(examples):
    comments = examples["comment"]
    codes = examples["code"]
    whole_function = [
        f"{comment}\n{code}" for comment, code in zip(comments, codes) 
    ]
    return {"whole_function": whole_function}

raw_datasets = raw_datasets.map(merge_columns, batched=True)

In [17]:
print(raw_datasets["train"][0])

{'comment': 'Computes the new parent id for the node being moved.\n\n@return int', 'code': "protected function parentId()\n\t{\n\t\tswitch ( $this->position )\n\t\t{\n\t\t\tcase 'root':\n\t\t\t\treturn null;\n\n\t\t\tcase 'child':\n\t\t\t\treturn $this->target->getKey();\n\n\t\t\tdefault:\n\t\t\t\treturn $this->target->getParentId();\n\t\t}\n\t}", 'whole_function': "Computes the new parent id for the node being moved.\n\n@return int\nprotected function parentId()\n\t{\n\t\tswitch ( $this->position )\n\t\t{\n\t\t\tcase 'root':\n\t\t\t\treturn null;\n\n\t\t\tcase 'child':\n\t\t\t\treturn $this->target->getKey();\n\n\t\t\tdefault:\n\t\t\t\treturn $this->target->getParentId();\n\t\t}\n\t}"}


The first thing we need to do is transform the dataset into an `iterator` of lists of texts, for instance, a list of lists of texts. Using lists of texts lets our tokenizer go faster (training on batches of texts instead of processing individual texts one by one), and it should be an iterator if we want to avoid having everything in memory at once. If your corpus is huge, you will want to take advantage of the fact that 🤗 Datasets does not load everything into RAM but stores the elements of the dataset on disk.



Using a Python generator, we can avoid Python loading anything into memory until it's actually necessary. To create such a generator, you just need to write what would otherwise be a list comprehension with parentheses instead of brackets:



In [18]:
training_corpus = (
    raw_datasets["train"][i: i + 1000]["whole_function"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

This line of code doesn't fetch any elements of the dataset; it just creates an object you can use in a Python `for` loop. The texts will only be loaded when you need them (that is, when you're at the step of the `for` loop that requires them), and only 1,000 texts at a time will be loaded. This way you won't exhaust all your memory even if you are processing a huge dataset.
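
This laziness can be observed without downloading anything. The sketch below uses a small in-memory list as a stand-in for the dataset and a hypothetical `load_batch` helper that records each slice it serves; neither is part of 🤗 Datasets.

```python
# Stand-in for a dataset column: 10 short texts (hypothetical data)
texts = [f"text {i}" for i in range(10)]
batch_size = 4
loaded_slices = []  # records which slices were actually fetched


def load_batch(start):
    """Fetch one batch and remember that it was requested."""
    loaded_slices.append((start, start + batch_size))
    return texts[start : start + batch_size]


# Generator expression: no batch is fetched when this line runs
lazy_corpus = (load_batch(i) for i in range(0, len(texts), batch_size))
slices_before_iteration = list(loaded_slices)

first_batch = next(lazy_corpus)  # only now is the first batch fetched
slices_after_first_batch = list(loaded_slices)

print(slices_before_iteration)   # []
print(first_batch)               # ['text 0', 'text 1', 'text 2', 'text 3']
print(slices_after_first_batch)  # [(0, 4)]
```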



The problem with a `generator` object is that it can only be used once. So, instead of the following code printing the first 10 digits twice, the second call returns an empty list:



In [19]:
gen = (i for i in range(10))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


That's why we define a function that returns a `generator` instead:



In [20]:
def get_training_corpus():
    return (
        raw_datasets["train"][i: i + 1000]["whole_function"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

You can also define your generator inside a `for` loop by using the `yield` statement:



In [None]:
def get_training_corpus():
    datasets = raw_datasets["train"]
    for start_idx in range(0, len(datasets), 1000):
        samples = datasets[start_idx: start_idx + 1000]
        yield samples["whole_function"]

This produces exactly the same generator as before, but lets you use more complex logic than fits in a single generator expression.
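
As a sketch of the kind of extra logic a `yield`-based generator allows, the variant below skips blank texts before batching. It runs on a small in-memory list rather than the real dataset, and the function name `iter_nonempty_batches` and the filtering rule are illustrative assumptions, not part of the pipeline above.

```python
def iter_nonempty_batches(texts, batch_size):
    """Yield lists of at most `batch_size` texts, skipping blank ones."""
    batch = []
    for text in texts:
        if not text.strip():
            continue  # drop empty or whitespace-only entries
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


# Hypothetical sample data standing in for the `whole_function` column
sample_texts = ["def a(): pass", "", "def b(): pass", "   ", "def c(): pass"]

# Each call returns a fresh generator, so the corpus can be read twice
first_pass = list(iter_nonempty_batches(sample_texts, batch_size=2))
second_pass = list(iter_nonempty_batches(sample_texts, batch_size=2))
print(first_pass)
print(second_pass == first_pass)
```

Because the function is called anew for each pass, the one-shot limitation of generator objects no longer gets in the way.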



## 2. Training a new tokenizer

Now that we have our corpus in the form of an iterator of batches of texts, we are ready to train a new tokenizer. To do this, we first need to load the tokenizer we want to pair with our model (here, GPT-2):



In [22]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

Even though we are going to train a new tokenizer, it's a good idea to do this to avoid starting entirely from scratch. This way, we won't have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2, and the only thing that will change is the vocabulary, which will be determined by the training on our corpus.



First let's have a look at how this tokenizer would treat an example function:



In [23]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

This tokenizer has a few special symbols, like `Ġ` and `Ċ`, which denote spaces and newlines, respectively. As we can see, this is not too efficient: the tokenizer returns individual tokens for each space, when it could group together indentation levels (since having sets of four or eight spaces is going to be very common in code). It also splits the function name a bit weirdly, since it is not used to seeing words with the `_` character.
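
These symbols come from GPT-2's byte-level encoding, which maps every byte to a printable character so that raw whitespace and control bytes do not appear inside tokens. The sketch below rebuilds that byte-to-character table in plain Python and uses it to turn a few tokens back into text. The helper names are ours, chosen for illustration; in practice the tokenizer's own decoding methods do this for you.

```python
def build_byte_to_char():
    """Rebuild the byte -> printable character table of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for byte in range(256):
        if byte not in printable:
            # Remaining bytes (space, newline, control bytes) move to 256 and up
            byte_values.append(byte)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = build_byte_to_char()
char_to_byte = {char: byte for byte, char in byte_to_char.items()}


def tokens_to_text(tokens):
    """Map each token character back to its byte and decode as UTF-8."""
    raw = bytes(char_to_byte[char] for token in tokens for char in token)
    return raw.decode("utf-8")


print(byte_to_char[ord(" ")], byte_to_char[ord("\n")])  # Ġ Ċ
print(repr(tokens_to_text(["Ġreturn", "Ġa", "Ġ+", "Ġb", "Ċ"])))  # ' return a + b\n'
```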



Let's train a new tokenizer and see if it solves those issues. For this, we'll use the method `train_new_from_iterator()`:



In [24]:
# Call get_training_corpus() so training always starts from a fresh generator
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=52000)

Let's try our brand new tokenizer on the previous example:

In [25]:
tokens = tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`.',
 '"""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

In [27]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

28
36
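
The two counts printed above can be turned into a relative saving. This is only a small arithmetic sketch on a single example; the helper name `relative_token_saving` is ours.

```python
def relative_token_saving(old_count, new_count):
    """Fraction of tokens saved by the new tokenizer relative to the old one."""
    return (old_count - new_count) / old_count


# Token counts reported above for the `add_numbers` example
saving = relative_token_saving(old_count=36, new_count=28)
print(f"{saving:.1%}")  # 22.2%
```

A single function is not a benchmark, so a figure like this is best read as an indication rather than a measurement of the tokenizer's efficiency on code in general.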


Let's look at another example:



In [28]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokenizer.tokenize(example)

['class',
 'ĠLinear',
 'Layer',
 '():',
 'ĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'init',
 '__(',
 'self',
 ',',
 'Ġinput',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'weight',
 'Ġ=',
 'Ġtorch',
 '.',
 'rand',
 'n',
 '(',
 'input',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 ')',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'bias',
 'Ġ=',
 'Ġtorch',
 '.',
 'zeros',
 '(',
 'output',
 '_',
 'size',
 ')',
 'ĊĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'call',
 '__(',
 'self',
 ',',
 'Ġx',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġreturn',
 'Ġx',
 'Ġ@',
 'Ġself',
 '.',
 'weights',
 'Ġ+',
 'Ġself',
 '.',
 'bias',
 'ĊĠĠĠĠ']

In addition to the token corresponding to an indentation, here we can also see a token for a double indentation: `ĊĠĠĠĠĠĠĠ`. The special Python words like `class`, `init`, `call`, `self`, and `return` are each tokenized as one token, and we can see that, as well as splitting on `_` and `.`, the tokenizer correctly splits even camel-cased names: `LinearLayer` is tokenized as `["ĠLinear", "Layer"]`.



## 3. Saving the tokenizer


To make sure we can use it later, we need to save our new tokenizer. As with models, this is done with the `save_pretrained()` method:



In [33]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer\\tokenizer_config.json',
 'code-search-net-tokenizer\\tokenizer.json')

This will create a new folder named `code-search-net-tokenizer`, which will contain all the files the tokenizer needs to be reloaded. If you want to share this tokenizer with your colleagues and friends, you can upload it to the Hub by logging into your account. If you're working in a notebook, there's a convenience function to help you with this:



In [29]:
from huggingface_hub import notebook_login

notebook_login()

Once you've logged in, you can push your tokenizer by executing the following command:



In [30]:
tokenizer.push_to_hub("code-search-net-tokenizer")

CommitInfo(commit_url='https://huggingface.co/arraypowerplay/code-search-net-tokenizer/commit/fa5df0981353959138d8da6e774d7300c60c5496', commit_message='Upload tokenizer', commit_description='', oid='fa5df0981353959138d8da6e774d7300c60c5496', pr_url=None, repo_url=RepoUrl('https://huggingface.co/arraypowerplay/code-search-net-tokenizer', endpoint='https://huggingface.co', repo_type='model', repo_id='arraypowerplay/code-search-net-tokenizer'), pr_revision=None, pr_num=None)

This will create a new repository in your namespace with the name `code-search-net-tokenizer`, containing the tokenizer files. You can then load the tokenizer from anywhere with the `from_pretrained()` method, passing the full repository ID (your namespace followed by the repository name):



In [32]:
tokenizer = AutoTokenizer.from_pretrained("arraypowerplay/code-search-net-tokenizer")
