# Tokenizer-library

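We need to train a new tokenizer when our corpus differs from the text the existing tokenizer was trained on, for example when it involves: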

- New language
- New characters
- New domain
- New style

Training a tokenizer is not the same as training a model! Model training uses stochastic gradient descent to make the loss a little bit smaller for each batch. It’s randomized by nature (meaning you have to set some seeds to get the same results when doing the same training twice). Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. It’s deterministic, meaning you always get the same results when training with the same algorithm on the same corpus.
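To make the "statistical and deterministic" point concrete, here is a minimal sketch of a BPE-style training loop in plain Python. It is a toy illustration, not the implementation used by any library: the function name `learn_bpe_merges`, the toy corpus, and the lexicographic tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    word_freqs = Counter(corpus)
    # Every distinct word starts out split into single characters.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs in the corpus.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption of this sketch) so the result never depends on order.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Apply the new merge rule to every word.
        for word in list(splits):
            symbols, merged, i = splits[word], [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges


# Hypothetical toy corpus: each word repeated according to its frequency.
toy_corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5

print(learn_bpe_merges(toy_corpus, 3))  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

No random seed is involved: the merges depend only on pair counts, so running the function again, or feeding it the same words in a different order, gives the same list.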

To train a new tokenizer, we first collect a corpus of text, then choose a tokenizer architecture and train it on that corpus.

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python")

# Using a Python generator, we can avoid loading anything into memory until it’s actually necessary.
# To create such a generator, you just need to replace the brackets of a list comprehension with parentheses:

def get_training_corpus():
    return (raw_datasets["train"][i : i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000))

# Or, more readably, as a generator function that yields one batch of 1000 texts at a time:
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

training_corpus = get_training_corpus()
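One consequence of using a generator is that it can be consumed only once, which is why it helps to wrap its creation in a function that can be called again. The sketch below shows this with plain Python lists; `iter_batches` and the `texts` list are hypothetical stand-ins for the dataset, not part of the `datasets` API.

```python
def iter_batches(items, batch_size):
    """Yield successive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Hypothetical stand-in for the "whole_func_string" column of the dataset.
texts = [f"def f{i}(): pass" for i in range(2500)]

batches = iter_batches(texts, 1000)
first_pass = [len(batch) for batch in batches]   # consumes the generator
second_pass = [len(batch) for batch in batches]  # nothing is left to yield

print(first_pass)   # [1000, 1000, 500]
print(second_pass)  # []
```

Calling `iter_batches(texts, 1000)` again returns a fresh generator, so a second pass over the data only needs a second call.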

Even though we are going to train a new tokenizer, it’s a good idea to load an existing one first, to avoid starting entirely from scratch. This way, we won’t have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2’s, and the only thing that will change is the vocabulary, which will be determined by the training on our corpus.

In [None]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

# 52000 is the vocabulary size of the new tokenizer.
# Note that train_new_from_iterator() only works if the tokenizer you are using is a “fast” tokenizer.

tokenizer.save_pretrained("code-search-net-tokenizer")

We’ll first take a look at the preprocessing that each tokenizer applies to text. Here’s a high-level overview of the steps in the tokenization pipeline:

<img src="./images/tokenizer_2.PNG" width="70%">


In [None]:
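# The normalization step involves some general cleanup, such as removing needless whitespace,
# lowercasing, and/or removing accents. If you’re familiar with Unicode normalization (such as NFC or NFKC),
# this is also something the tokenizer may apply.

from transformers import AutoTokenizer

# The checkpoint name is left blank here; fill in a checkpoint that provides a "fast" tokenizer.
tokenizer = AutoTokenizer.from_pretrained("")

# Sample text (an assumption: the same string is used for both steps).
text = "Hello, how are  you?"

# normalize_str() shows how the normalizer rewrites a string.
# The normalizer may be None if the tokenizer defines no normalization step.
text_normalized = tokenizer.backend_tokenizer.normalizer.normalize_str(text)

# pre_tokenize_str() splits the text into words and returns each one with its character offsets.
pre_tokenization = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)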
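To give a feel for what these two steps do, here is a rough standard-library sketch. It does not reproduce any particular tokenizer: the functions `normalize` and `pre_tokenize`, the cleanup rules, and the splitting regex are all assumptions chosen for illustration.

```python
import re
import unicodedata


def normalize(text):
    """Strip accents (via NFD decomposition), collapse whitespace, and lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD, accents are separate combining marks (category "Mn"); drop them.
    without_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return re.sub(r"\s+", " ", without_accents).strip().lower()


def pre_tokenize(text):
    """Split on whitespace and punctuation, keeping (start, end) character offsets."""
    return [
        (match.group(), (match.start(), match.end()))
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]


print(normalize("Héllò hôw are ü?"))  # hello how are u?
print(pre_tokenize("Hello, how are  you?"))
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
```

The offsets refer to positions in the original string, so the double space between "are" and "you" shows up as a gap between 14 and 16 rather than as a token.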