# Assembling a corpus
There’s a very simple API in 🤗 Transformers that you can use to train a new tokenizer with the same characteristics as an existing one: `AutoTokenizer.train_new_from_iterator()`.

Let’s say we want to train GPT-2 from scratch, but in a language other than English. Our first task is to gather lots of data in that language into a training corpus. Here, the "language" we use is Python code.

In [19]:
from datasets import load_dataset

# Load corpus of 'python source code' from 'CodeSearchNet' dataset
raw_datasets = load_dataset("code_search_net", "python", trust_remote_code=True)


Let's look at the training split to see which columns we have access to:

In [20]:
raw_datasets['train']

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

Here, we'll just use the `whole_func_string` column to train our tokenizer. Let's print the first example:

In [21]:
print(raw_datasets['train'][0]['whole_func_string'])

def __msgc_step3_discontinuity_localization(self):
        """
        Estimate discontinuity in basis of low resolution image segmentation.
        :return: discontinuity in low resolution
        """
        import scipy

        start = self._start_time
        seg = 1 - self.segmentation.astype(np.int8)
        self.stats["low level object voxels"] = np.sum(seg)
        self.stats["low level image voxels"] = np.prod(seg.shape)
        # in seg is now stored low resolution segmentation
        # back to normal parameters
        # step 2: discontinuity localization
        # self.segparams = sparams_hi
        seg_border = scipy.ndimage.filters.laplace(seg, mode="constant")
        logger.debug("seg_border: %s", scipy.stats.describe(seg_border, axis=None))
        # logger.debug(str(np.max(seg_border)))
        # logger.debug(str(np.min(seg_border)))
        seg_border[seg_border != 0] = 1
        logger.debug("seg_border: %s", scipy.stats.describe(seg_border, axis=None))
        # 

The first thing we need to do is transform the dataset into an iterator of lists of texts. Using lists of texts lets the tokenizer train faster, because it can process batches of texts instead of one text at a time.

**Note:** 🤗 Datasets doesn't load everything into RAM; it stores the elements of the dataset on disk.

Doing the following would create a list of lists of 1,000 texts each, but would load everything into memory:

In [22]:
# Don't uncomment the following line unless your dataset is small!
# training_corpus = [raw_datasets['train'][i: i+1000]['whole_func_string'] for i in range(0, len(raw_datasets['train']), 1000)]

Using a Python generator, we can avoid loading anything into memory until it's actually needed. To create such a generator, you just need to replace the square brackets `[]` with parentheses `()`:

In [23]:
training_corpus = (
    raw_datasets['train'][i : i+1000]['whole_func_string']
    for i in range(0, len(raw_datasets['train']), 1000)
)

This line of code doesn't fetch any elements of the dataset; it just creates an object you can use in a Python `for` loop.
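To make the laziness concrete, here is a minimal sketch that uses a hypothetical stand-in class, `CountingDataset`, instead of the real dataset (so it runs without downloading anything). The class and the toy texts are assumptions for illustration only; the stand-in simply counts how many slices have been read.

```python
class CountingDataset:
    """Hypothetical stand-in for a dataset split that counts slice reads."""

    def __init__(self, texts):
        self.texts = texts
        self.reads = 0

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, key):
        # Mimic the dict-of-columns shape returned when slicing the real dataset.
        self.reads += 1
        return {"whole_func_string": self.texts[key]}


toy_dataset = CountingDataset([f"def f{i}(): pass" for i in range(2500)])

lazy_corpus = (
    toy_dataset[i : i + 1000]["whole_func_string"]
    for i in range(0, len(toy_dataset), 1000)
)
reads_before = toy_dataset.reads  # no slice has been read yet

batch_sizes = [len(batch) for batch in lazy_corpus]
reads_after = toy_dataset.reads  # one read per batch consumed

print(reads_before, reads_after, batch_sizes)  # 0 3 [1000, 1000, 500]
```

Creating the generator reads nothing; each batch is only fetched when the loop asks for it.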

The problem with a generator object is that it can only be used once: after it has been exhausted, iterating over it again yields nothing.

In [24]:
gen = (i for i in range(5))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4]
[]


That's why we define a function that returns a generator instead:

In [25]:
def get_training_corpus():
  return (
      raw_datasets['train'][i : i+1000]['whole_func_string']
      for i in range(0, len(raw_datasets['train']), 1000)
  )

training_corpus = get_training_corpus()

You can also define your generator inside a `for` loop using the `yield` statement:

In [26]:
def get_training_corpus():
  dataset = raw_datasets['train']
  for start_idx in range(0, len(dataset), 1000):
    samples = dataset[start_idx : start_idx + 1000]
    yield samples['whole_func_string']

This produces the exact same generator as before, but lets you use more complex logic than fits comfortably in a single comprehension-style expression.
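As an illustration of the kind of extra logic a `yield`-based generator allows, the sketch below skips empty or very short texts before batching. The function `get_filtered_corpus`, its `min_chars` parameter, and the toy texts are hypothetical and not part of the original notebook; with the real dataset you would iterate over the `whole_func_string` column instead.

```python
def get_filtered_corpus(texts, batch_size=1000, min_chars=10):
    """Yield batches of texts, skipping entries shorter than `min_chars`."""
    batch = []
    for text in texts:
        if len(text.strip()) < min_chars:
            continue  # drop empty or near-empty entries
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


# Toy input: two usable functions and two entries that are too short, repeated.
toy_texts = ["def add(a, b): return a + b", "", "x = 1", "def noop(): pass"] * 3
batches = list(get_filtered_corpus(toy_texts, batch_size=4))
print([len(batch) for batch in batches])  # [4, 2]
```

Because the function can be called again, each call returns a fresh generator over the same data.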

# Training a new Tokenizer

In [27]:
from transformers import AutoTokenizer

# Load the tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

Even though we're going to train a new tokenizer, loading the existing one lets us avoid starting entirely from scratch. This way, we won't have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2's, and the only thing that will change is the vocabulary, which will be determined by the training on our corpus.

First, let's have a look at how this tokenizer would treat an example function:

In [28]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


This tokenizer has a few special symbols, like `Ġ` and `Ċ`, which denote spaces and newlines, respectively. As we can see, this is not very efficient: the tokenizer returns an individual token for each space, when it could group indentation levels together (since runs of four or eight spaces are very common in code). It also splits the function name a bit oddly, since it is not used to seeing words containing the `_` character.
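Where do `Ġ` and `Ċ` come from? Byte-level BPE tokenizers such as GPT-2's remap every byte to a printable character before tokenizing. The sketch below is a from-memory reimplementation of that remapping for illustration, not the library's own code; the names `bytes_to_unicode` and `byte_map` are chosen here. It shows that the space byte lands on `Ġ` and the newline byte on `Ċ`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\u00a1"), ord("\u00ac") + 1))
        + list(range(ord("\u00ae"), ord("\u00ff") + 1))
    )
    mapping = {}
    shift = 0
    for byte in range(256):
        if byte in printable:
            mapping[byte] = chr(byte)
        else:
            # Remaining bytes are moved past code point 255, in byte order.
            mapping[byte] = chr(256 + shift)
            shift += 1
    return mapping


byte_map = bytes_to_unicode()
space_symbol = byte_map[ord(" ")]
newline_symbol = byte_map[ord("\n")]
print(space_symbol, newline_symbol)  # Ġ Ċ

# A newline followed by four spaces of indentation, as the tokenizer sees it.
visible = "".join(byte_map[b] for b in "\n    return".encode("utf-8"))
print(visible)  # ĊĠĠĠĠreturn
```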

Let's train a new tokenizer and see if it solves those issues.

In [29]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52_000)






**Note:** `AutoTokenizer.train_new_from_iterator()` only works if the tokenizer you're using is a "fast" tokenizer. The 🤗 Transformers library contains two types of tokenizers:

- Some are written purely in Python
- The fast ones are backed by the 🤗 Tokenizers library, which is written in the Rust programming language

Python is the language most often used for data science and deep learning applications, but when anything needs to be parallelized to be fast, it has to be written in another language. For instance, the matrix multiplications that are at the core of the model computation are written in CUDA, an optimized C library for GPUs.

Most Transformer models have a fast tokenizer available (with some exceptions), and the `AutoTokenizer` API always selects the "fast" tokenizer if it's available.

In [30]:
tokens = tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


Here we again see the special symbols `Ġ` and `Ċ` that denote spaces and newlines, but we can also see that our tokenizer learned some tokens that are highly specific to a corpus of Python functions: for example, there is a `ĊĠĠĠ` token that represents an indentation, and a `Ġ"""` token that represents the three quotes that start a docstring. The tokenizer also correctly splits the function name on `_`. This is quite a compact representation; by comparison, the original GPT-2 tokenizer, trained on English text, produces a longer sequence for the same example:

In [31]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36
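As a quick worked calculation, the two lengths printed above can be turned into a relative saving. The variable names below are chosen for illustration; the numbers are the ones from the output.

```python
new_len = 27  # tokens from the newly trained tokenizer (output above)
old_len = 36  # tokens from the original GPT-2 tokenizer (output above)

saved = old_len - new_len
relative_saving = saved / old_len
print(f"{saved} fewer tokens ({relative_saving:.0%} shorter)")  # 9 fewer tokens (25% shorter)
```

On this one example the new tokenizer needs a quarter fewer tokens; the saving on other inputs will differ.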


Let's look at another example:

In [32]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
print(tokenizer.tokenize(example))

['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']


In addition to the token corresponding to an indentation, here we can also see a token for a double indentation: `ĊĠĠĠĠĠĠĠ`. Python-specific words like `class`, `init`, `call`, `self`, and `return` are each tokenized as one token, and we can see that, as well as splitting on `_` and `.`, the tokenizer correctly splits even camel-cased names: `LinearLayer` is tokenized as `["ĠLinear", "Layer"]`.

# Saving the Tokenizer

In [33]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')