In [1]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")


def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

Generating test split: 100%|██████████| 4358/4358 [00:00<00:00, 207652.02 examples/s]
Generating train split: 100%|██████████| 36718/36718 [00:00<00:00, 777092.15 examples/s]
Generating validation split: 100%|██████████| 3760/3760 [00:00<00:00, 646786.00 examples/s]


# Building a WordPiece tokenizer from scratch
1. We start by instantiating a **Tokenizer object with a model**.
2. We then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want.

In [3]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# We have to specify the unk_token so the model knows what to return
# when it encounters characters it hasn't seen before.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
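
To make the role of the unk_token concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece encoding for a single word. It is an illustration only, not the 🤗 Tokenizers implementation: the function name wordpiece_encode and the toy vocabulary are hypothetical.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # non-initial pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, just for this illustration.
toy_vocab = {"[UNK]", "tok", "##eni", "##zer", "test"}
known = wordpiece_encode("tokenizer", toy_vocab)
unknown = wordpiece_encode("tokens", toy_vocab)
print(known)    # ['tok', '##eni', '##zer']
print(unknown)  # ['[UNK]']
```

The second call shows why the unknown token matters: "tok" matches, but nothing in the toy vocabulary covers the remaining "ens", so the sketch falls back to [UNK] for the whole word.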

Other arguments we can pass to the WordPiece model include **the vocab of our model** (we’re going to train the model, so we don’t need to set this) and max_input_chars_per_word, which sets a maximum length for each word (words longer than this value are mapped to the unknown token instead of being split into subwords).

# First step: normalization

In [24]:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

In [26]:
print(tokenizer.normalizer.normalize_str("Héllò hôw \u0085are ü?"))

hello how are u?


A ready-made normalizer like BertNormalizer is convenient, but when building **a new tokenizer** you generally won’t have one already implemented in the 🤗 Tokenizers library, so let’s see how to create the **BERT normalizer** by hand. The library provides a **Lowercase normalizer** and a **StripAccents normalizer**, and you can **compose several normalizers** using a **Sequence**. The NFD normalizer comes first because StripAccents only removes combining marks, and accents become separate combining marks only after Unicode decomposition:

In [21]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

We can use the **normalize_str()** method of the normalizer to **check out the effects** it has on a given text:

In [23]:
print(tokenizer.normalizer.normalize_str("Héllò hôw \u0085are ü?"))

hello how are u?
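
As a rough picture of what the NFD, Lowercase, StripAccents sequence does, here is a standard-library sketch built on unicodedata. The function name normalize_text is hypothetical, and the sketch assumes that stripping accents means dropping characters in the Unicode category Mn (nonspacing marks) after decomposition.

```python
import unicodedata


def normalize_text(text):
    """Approximate NFD -> lowercase -> strip accents with the standard library."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Drop the combining marks (category "Mn"), keeping the base letters.
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


normalized = normalize_text("Héllò hôw are ü?")
print(normalized)  # hello how are u?
```

Skipping the NFD step would leave "é" as a single precomposed character with no separate mark to remove, which is why the order of the normalizers matters.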


# Next is the pre-tokenization step

In [27]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Or we can build it from scratch:

In [29]:
# Note: the Whitespace pre-tokenizer splits on whitespace and on all characters
# that are not letters, digits, or the underscore character, so in practice
# it splits on whitespace and punctuation.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

We can use the **pre_tokenize_str()** method of the pre_tokenizer to **check out the effects** it has on a given text:

In [30]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]
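
The Whitespace pre-tokenizer is documented as splitting with the regular expression `\w+|[^\w\s]+`. A small sketch with Python's re module reproduces the output above for this sentence; the function name whitespace_pre_tokenize is hypothetical, and Python's `\w` may differ from the library's regex engine on unusual Unicode input.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation and symbols).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, mirroring pre_tokenize_str()."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = whitespace_pre_tokenize("Let's test my pre-tokenizer.")
for piece in pieces:
    print(piece)
```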

If you only want to split on whitespace, you should use the **WhitespaceSplit** pre-tokenizer instead:

In [31]:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[("Let's", (0, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre-tokenizer.', (14, 28))]

As with normalizers, you can use a **Sequence** to **compose several pre-tokenizers**:

In [32]:
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]
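
The composition can be pictured as two passes: first split on whitespace, then isolate every punctuation character inside each piece while keeping offsets relative to the original text. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes "punctuation" means ASCII punctuation or any character in a Unicode P category.

```python
import re
import string
import unicodedata


def whitespace_split(text):
    """First pass: split on whitespace only, keeping character offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


def is_punctuation(ch):
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def isolate_punctuation(pieces):
    """Second pass: make each punctuation character its own piece."""
    result = []
    for piece, (offset, _) in pieces:
        run_start = None  # start of the current non-punctuation run
        for i, ch in enumerate(piece):
            if is_punctuation(ch):
                if run_start is not None:
                    result.append((piece[run_start:i], (offset + run_start, offset + i)))
                    run_start = None
                result.append((ch, (offset + i, offset + i + 1)))
            elif run_start is None:
                run_start = i
        if run_start is not None:
            result.append((piece[run_start:], (offset + run_start, offset + len(piece))))
    return result


composed = isolate_punctuation(whitespace_split("Let's test my pre-tokenizer."))
for piece in composed:
    print(piece)
```

Each pass receives the pieces produced by the previous one, which is the idea behind chaining pre-tokenizers in a Sequence.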

# The next step in the tokenization pipeline is running the inputs through the model
We already specified our model in the initialization, but we still need to train it, which requires a WordPieceTrainer.

The main thing to remember when instantiating a trainer in 🤗 Tokenizers is that you need **to pass it all the special tokens** you intend to use — **otherwise it won’t add them to the vocabulary**, since they are not in the training corpus:

In [33]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

In addition to vocab_size and special_tokens, we can set **min_frequency** (the minimum number of times a token must appear to be included in the vocabulary) or change the **continuing_subword_prefix** (if we want to use something different from ##).
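
To see the effect of special_tokens in isolation, the sketch below trains two throwaway tokenizers on a tiny in-memory corpus, one with the full list and one with only [UNK]. The corpus, the helper train_toy_tokenizer, and the small vocab_size are assumptions made for illustration; the sketch also passes min_frequency and continuing_subword_prefix explicitly to show where they go.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical toy corpus; none of these sentences contain special tokens.
toy_corpus = ["the quick brown fox", "the lazy dog", "the quick dog barks"]
toy_special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]


def train_toy_tokenizer(special):
    """Train a small WordPiece tokenizer with the given special tokens."""
    tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    toy_trainer = trainers.WordPieceTrainer(
        vocab_size=100,
        min_frequency=1,
        special_tokens=special,
        continuing_subword_prefix="##",
        show_progress=False,
    )
    tok.train_from_iterator(toy_corpus, trainer=toy_trainer)
    return tok


with_all = train_toy_tokenizer(toy_special_tokens)
only_unk = train_toy_tokenizer(["[UNK]"])

all_ids = [with_all.token_to_id(t) for t in toy_special_tokens]
missing_cls = only_unk.token_to_id("[CLS]")
print(all_ids)      # [0, 1, 2, 3, 4]
print(missing_cls)  # None
```

The special tokens passed to the trainer receive the first IDs in the order given, while a token that was not passed and never occurs in the corpus is simply absent from the vocabulary.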

To train our model using the iterator we defined earlier, we call the train_from_iterator() method:

In [38]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)






We can also use text files to train our tokenizer, which would look like this (we **reinitialize the model with an empty WordPiece beforehand**):

In [None]:
# tokenizer.model = models.WordPiece(unk_token="[UNK]")
# tokenizer.train(["wikitext-2.txt"], trainer=trainer)

In both cases, we can then **test the tokenizer** on a text by calling **the encode()** method:

In [39]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']


The encoding obtained is an Encoding, which **contains all the necessary outputs of the tokenizer** in its various attributes: **ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, and overflowing**.
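
The attributes are easier to read on a tiny example. The sketch below builds a separate tokenizer from a hand-written vocabulary (so no training is needed) and prints each attribute; the vocabulary and the name toy_tokenizer are hypothetical and unrelated to the tokenizer trained above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Hypothetical hand-written vocabulary, used only for this illustration.
toy_vocab = {"[UNK]": 0, "let": 1, "'": 2, "s": 3, "test": 4}
toy_tokenizer = Tokenizer(models.WordPiece(vocab=toy_vocab, unk_token="[UNK]"))
toy_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

toy_encoding = toy_tokenizer.encode("let's test")

print(toy_encoding.tokens)               # ['let', "'", 's', 'test']
print(toy_encoding.ids)                  # [1, 2, 3, 4]
print(toy_encoding.offsets)              # [(0, 3), (3, 4), (4, 5), (6, 10)]
print(toy_encoding.type_ids)             # [0, 0, 0, 0]
print(toy_encoding.attention_mask)       # [1, 1, 1, 1]
print(toy_encoding.special_tokens_mask)  # [0, 0, 0, 0]
print(toy_encoding.overflowing)          # []
```

The offsets map every token back to a character span in the input, and overflowing stays empty because no truncation is configured.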

# The last step in the tokenization pipeline is post-processing
We need to **add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we have a pair of sentences)**. We will use TemplateProcessing for this, but first we need to know the IDs of the [CLS] and [SEP] tokens in the vocabulary:

In [41]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3


In [42]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

In [43]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']


In [44]:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
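
The template logic itself is simple enough to sketch in plain Python: wrap the first sequence in [CLS] and [SEP] with type ID 0, then append the second sequence and a final [SEP] with type ID 1. The function apply_bert_template is a hypothetical illustration of that layout, not how TemplateProcessing is implemented.

```python
def apply_bert_template(tokens_a, tokens_b=None, cls="[CLS]", sep="[SEP]"):
    """Mimic the single and pair templates: return (tokens, type_ids)."""
    tokens = [cls] + tokens_a + [sep]
    type_ids = [0] * len(tokens)  # [CLS], sentence A and first [SEP] get 0
    if tokens_b is not None:
        tokens += tokens_b + [sep]
        type_ids += [1] * (len(tokens_b) + 1)  # sentence B and last [SEP] get 1
    return tokens, type_ids


sentence_a = ["let", "'", "s", "test", "this", "tok", "##eni", "##zer", "..."]
sentence_b = ["on", "a", "pair", "of", "sentences", "."]

pair_tokens, pair_type_ids = apply_bert_template(sentence_a, sentence_b)
print(pair_tokens)
print(pair_type_ids)
```

With these inputs the sketch produces 18 tokens, with 11 type IDs equal to 0 followed by 7 equal to 1, matching the output of the real post-processor above.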


# We’ve almost finished building this tokenizer from scratch — the last step is to include a decoder:

In [45]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
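
What the decoder does can be approximated in a few lines: glue tokens that start with the ## prefix onto the previous token, and join everything else with spaces. The sketch below is a simplified stand-in with a hypothetical name; it only handles a few punctuation characters, whereas the library's cleanup step covers more cases.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Simplified WordPiece decoding: merge continuation pieces into words."""
    text = ""
    for i, token in enumerate(tokens):
        if i == 0:
            text += token
        elif token.startswith(prefix):
            # Continuation piece: append without a space and drop the prefix.
            text += token[len(prefix):]
        elif token[0] in ".,?!":
            # Simplified cleanup: no space before common punctuation.
            text += token
        else:
            text += " " + token
    return text


toy_tokens = ["let", "'", "s", "test", "this", "tok", "##eni", "##zer", "...",
              "on", "a", "pair", "of", "sentences", "."]
decoded = wordpiece_decode(toy_tokens)
print(decoded)  # let ' s test this tokenizer... on a pair of sentences.
```

Note that the apostrophe stays separated from its neighbours: decoding reverses the subword split, not the pre-tokenization or the lowercasing done earlier in the pipeline.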

In [46]:
tokenizer.decode(encoding.ids)

"let ' s test this tokenizer... on a pair of sentences."

We can save our tokenizer in a single JSON file like this:

In [47]:
# tokenizer.save("tokenizer.json")

We can then reload that file in a Tokenizer object with the from_file() method:

In [48]:
# new_tokenizer = Tokenizer.from_file("tokenizer.json")
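
A quick way to check the save and reload round trip without leaving files behind is to write into a temporary directory. The sketch below uses a throwaway tokenizer built from a hypothetical hand-written vocabulary; the same save() and from_file() calls apply to the tokenizer trained above.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers

# Hypothetical toy tokenizer so the example runs without any training.
toy_vocab = {"[UNK]": 0, "hello": 1, "world": 2}
toy_tokenizer = Tokenizer(models.WordPiece(vocab=toy_vocab, unk_token="[UNK]"))
toy_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.json")
    toy_tokenizer.save(path)                 # the whole pipeline goes in one JSON file
    reloaded = Tokenizer.from_file(path)     # rebuild an equivalent Tokenizer

original_ids = toy_tokenizer.encode("hello world").ids
reloaded_ids = reloaded.encode("hello world").ids
print(original_ids, reloaded_ids)  # [1, 2] [1, 2]
```

Because the JSON file stores the model together with the pre-tokenizer and the other pipeline components, the reloaded object encodes text the same way as the original.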

# To use the trained tokenizer in 🤗 Transformers:
1. We have to wrap it in a **PreTrainedTokenizerFast**.
2. To do so, we can either pass the tokenizer we built as a **tokenizer_object** or pass the tokenizer file we saved as **tokenizer_file**.
3. We have to manually **set all the special tokens**, since that class **can’t infer** from the tokenizer object which token is the mask token, the [CLS] token, etc.

In [49]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

If you are using a specific tokenizer class (like BertTokenizerFast), you will only need to specify the special tokens that are different from the default ones (here, none):


```
from transformers import BertTokenizerFast
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
```