
This notebook provides a step-by-step guide for training a new tokenizer based on an existing one, specifically for processing code data. The setup uses the Hugging Face `Transformers`, `Datasets`, and `Evaluate` libraries to download, preprocess, and tokenize the CodeSearchNet dataset for Python. The instructions and code snippets are designed to guide you through installing dependencies, loading the dataset, and training a custom tokenizer.

## Assembling a corpus


In [None]:
from datasets import load_dataset

The code snippet below uses the `load_dataset` function from the Hugging Face `datasets` library to load the Python subset of the "code_search_net" dataset and store it in the `raw_dataset` variable. The [CodeSearchNet](https://huggingface.co/datasets/code-search-net/code_search_net) dataset contains large-scale code samples and documentation comments across several programming languages, intended for code-related tasks such as code search, generation, and summarization.

This dataset can be useful for tasks like:

- **Code Search**: Finding similar code snippets given a code query.
- **Code Generation**: Auto-generating code based on textual descriptions.
- **Documentation Generation**: Creating documentation from raw code.



In [None]:
raw_dataset = load_dataset("code_search_net", "python")



README.md:   0%|          | 0.00/12.9k [00:00<?, ?B/s]

code_search_net.py:   0%|          | 0.00/8.44k [00:00<?, ?B/s]

The repository for code_search_net contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/code_search_net.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


python.zip:   0%|          | 0.00/941M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/412178 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/22176 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23107 [00:00<?, ? examples/s]

In [None]:
raw_dataset["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

In [None]:
print(raw_dataset["train"][0]["whole_func_string"])

def write_map_file(mapFNH, items, header):
    """
    Given a list of mapping items (in the form described by the parse_mapping_file method)
    and a header line, write each row to the given input file with fields separated by tabs.

    :type mapFNH: file or str
    :param mapFNH: Either the full path to the map file or an open file handle

    :type items: list
    :param item: The list of row entries to be written to the mapping file

    :type header: list or str
    :param header: The descriptive column names that are required as the first line of
                   the mapping file

    :rtype: None
    """
    if isinstance(header, list):
        header = "\t".join(header) + "\n"

    with file_handle(mapFNH, "w") as mapF:
        mapF.write(header)
        for row in items:
            mapF.write("\t".join(row)+"\n")


Using a Python generator, we can avoid loading anything into memory until it's actually necessary.

In [None]:
training_corpus = (
    raw_dataset["train"][i:i+1000]["whole_func_string"]
    for i in range(0, len(raw_dataset["train"]), 1000)
)
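To make the laziness concrete, here is a minimal sketch that uses only the standard library. The `load_batch` helper and the `loaded` log are hypothetical stand-ins for slicing the real dataset; they are not part of the `datasets` API.

```python
# Hypothetical stand-in for slicing a dataset: records which batches were read.
loaded = []

def load_batch(start, size=3):
    loaded.append(start)              # remember that this batch was materialized
    return list(range(start, start + size))

# Creating the generator expression runs none of the calls to load_batch.
batches = (load_batch(i) for i in range(0, 9, 3))
loaded_before = list(loaded)          # snapshot taken before any iteration

first = next(batches)                 # only now is the first batch built
loaded_after = list(loaded)           # snapshot after consuming one batch

print(loaded_before, first, loaded_after)
```

Nothing is read when the generator is created; each batch is produced only when the consumer asks for it.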

The catch is that a generator object in Python can only be iterated over once:

In [None]:
# Example: the second pass returns an empty list because the generator is exhausted.
gen = (i for i in range(10))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


That's why we define a function that returns a fresh generator each time it is called:

In [None]:
def get_training_corpus():
    return (
        raw_dataset["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_dataset["train"]), 1000)
    )

In [None]:
training_corpus = get_training_corpus()

Alternatively, you can write the generator as a function that uses the `yield` statement inside a `for` loop:

In [None]:
def get_training_corpus():
    dataset = raw_dataset["train"]
    # Step through the dataset in batches of 1,000 examples.
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

This produces the same batches as before, but a generator function lets you use more complex logic than fits in a generator expression.
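As a sketch of the kind of extra logic a generator function allows, the example below drops very short entries before yielding each batch. It runs on a small in-memory list rather than the real dataset; the `toy_functions` data, the `get_filtered_corpus` name, and the `min_chars` threshold are assumptions made purely for illustration.

```python
# Toy stand-in for raw_dataset["train"]["whole_func_string"] (illustrative data).
toy_functions = [
    "def a(): pass",
    "def add(x, y):\n    return x + y",
    "x = 1",
    "def mul(x, y):\n    return x * y",
    "def sub(x, y):\n    return x - y",
]

def get_filtered_corpus(texts, batch_size=2, min_chars=10):
    """Yield batches of texts, dropping entries shorter than min_chars."""
    for start in range(0, len(texts), batch_size):
        batch = [t for t in texts[start:start + batch_size] if len(t) >= min_chars]
        if batch:                      # skip batches that became empty
            yield batch

# Each call returns a fresh generator, so the corpus can be iterated again.
first_pass = list(get_filtered_corpus(toy_functions))
second_pass = list(get_filtered_corpus(toy_functions))
```

Because the function is called again for the second pass, both passes see the same batches, which a single generator object would not allow.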

## Training a new Tokenizer

Now that we have our corpus in the form of an iterator of batches of texts, we are ready to train a new tokenizer. To do this, we first need to load the tokenizer we want to pair with our model (here, GPT-2):

In [None]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



Let's see how this tokenizer handles an example function:

In [None]:
example = '''def multiply_numbers(a, b):
    """Multiply the two numbers `a` and `b`."""
    return a * b'''

In [None]:
tokens = old_tokenizer.tokenize(example)

In [None]:
tokens

['def',
 'Ġmultiply',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Mult',
 'ip',
 'ly',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ*',
 'Ġb']

In [None]:
print(len(old_tokenizer.tokenize(example)))

38
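The `Ġ` and `Ċ` symbols in the output are how GPT-2's byte-level BPE displays spaces and newlines: every byte is mapped to a printable Unicode character before any merges are applied. The sketch below reimplements that mapping along the lines of the `bytes_to_unicode` helper from the original GPT-2 code. Treat it as an illustration rather than the exact code the library runs; the `to_visible` helper is hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (whitespace, control characters) are shifted past U+00FF.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

byte_map = bytes_to_unicode()
space_symbol = byte_map[ord(" ")]      # the symbol shown in place of a space
newline_symbol = byte_map[ord("\n")]   # the symbol shown in place of a newline

def to_visible(text):
    """Hypothetical helper: show text the way byte-level BPE tokens display it."""
    return "".join(byte_map[b] for b in text.encode("utf-8"))

print(space_symbol, newline_symbol, to_visible("return a * b\n"))
```

Seen this way, the three separate `Ġ` tokens before `Ġ"""` are single spaces of indentation that the GPT-2 vocabulary does not group together.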


In [None]:
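# Train a new tokenizer with the same algorithm as GPT-2 (byte-level BPE)
# on our code corpus, with a target vocabulary size of 52,000 tokens.
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, vocab_size=52000)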



In [None]:
tokens = tokenizer.tokenize(example)
tokens

['def',
 'Ġmultiply',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Multiply',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`."""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ*',
 'Ġb']

In [None]:
print(len(tokens))

27
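The drop from 38 to 27 tokens comes from merges learned on code: frequent sequences such as `numbers`, or a newline followed by indentation, have become single tokens. The toy sketch below shows the core idea of BPE merge learning in plain Python. It is not the implementation behind `train_new_from_iterator`, which relies on the Hugging Face Tokenizers library, and the tiny corpus and function names are made up for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    vocab = Counter(tuple(word) for word in words)    # symbols of a word -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break                                     # every word is a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

toy_corpus = ["numbers"] * 4 + ["number"] * 2 + ["members"]
merges = train_bpe(toy_corpus, num_merges=6)
tokens = tokenize("numbers", merges)
print(tokens)
```

A word that is frequent in the training corpus ends up covered by fewer, longer tokens, while text the merges never saw (for example `tokenize("xyz", merges)`) stays split into single characters.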


Let's look at another example:

In [None]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """

In [None]:
tokenizer.tokenize(example)

['class',
 'ĠLinear',
 'Layer',
 '():',
 'ĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'init',
 '__(',
 'self',
 ',',
 'Ġinput',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'weight',
 'Ġ=',
 'Ġtorch',
 '.',
 'randn',
 '(',
 'input',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 ')',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'bias',
 'Ġ=',
 'Ġtorch',
 '.',
 'zeros',
 '(',
 'output',
 '_',
 'size',
 ')',
 'ĊĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'call',
 '__(',
 'self',
 ',',
 'Ġx',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġreturn',
 'Ġx',
 'Ġ@',
 'Ġself',
 '.',
 'weights',
 'Ġ+',
 'Ġself',
 '.',
 'bias',
 'ĊĠĠĠĠ']

## Saving the Tokenizer

Now, we will look at how we can save the tokenizer for later use:

In [None]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')

In this notebook, we have seen how to train a new tokenizer from an iterator of text batches!