<a href="https://colab.research.google.com/github/Priscilla97/llm-rag-foundations/blob/main/01_NLP_basics/4_Tokenizers_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

Tokenizers are one of the core components of the NLP pipeline.

They serve one purpose: to **translate text into data that can be processed by the model**. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

In NLP tasks, the data that is generally processed is raw text. Here’s an example of such text:

TEXT:
Jim Henson was a puppeteer

The goal is to find the **most meaningful representation**, that is, the one that makes the most *sense to the model*, and, if possible, *the smallest representation*.

EXAMPLES: <br>

**Word-based**: split the raw text into words and find a numerical representation for each of them <br>
*['Jim', 'Henson', 'was', 'a', 'puppeteer']*

If we want to completely cover a language with a word-based tokenizer, we’ll need an identifier for each word in the language, which produces a huge vocabulary. For example, there are over 500,000 words in the English language!

Words like “dog” are represented differently from words like “dogs”, and the model will initially have no way of knowing that “dog” and “dogs” are similar.

We also need a custom token to represent words that are **not in our vocabulary**.
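This is known as the “unknown” token, often represented as `[UNK]` or `<unk>`. It’s generally a bad sign if you see that the tokenizer is producing a lot of these tokens, because it means the tokenizer could not find a sensible representation for those words and information is being lost along the way.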

One way to reduce the number of unknown tokens is to go one level deeper, using a **character-based tokenizer**.

In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
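Building on the whitespace split above, here is a minimal sketch of how a word-based vocabulary could assign IDs and fall back to an unknown token. The tiny corpus, the `word_to_id` mapping, and the `encode_words` helper are illustrative assumptions, not part of the Transformers library.

```python
# Toy word-based tokenizer: one ID per distinct word, plus an unknown token.
corpus = ["Jim Henson was a puppeteer", "Jim was a dog"]

# Reserve ID 0 for the unknown token, then number words in order of first appearance.
word_to_id = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)


def encode_words(text):
    """Map each whitespace-separated word to its ID, or to the [UNK] ID."""
    return [word_to_id.get(word, word_to_id["[UNK]"]) for word in text.split()]


known = encode_words("Jim was a puppeteer")
unseen = encode_words("Jim was a dogs")
print(known)   # [1, 3, 4, 5]
print(unseen)  # [1, 3, 4, 0]
```

Note that “dogs” falls back to the unknown token even though “dog” is in the vocabulary: a word-based scheme has no notion that the two are related.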

# Character-based
Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:

- The vocabulary is much smaller.
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

This approach isn’t perfect either. Since the representation is now based on characters rather than words, it’s less meaningful: each character doesn’t mean a lot on its own.
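As a rough illustration of the trade-off, the sketch below tokenizes the same sentence character by character. It uses plain Python only; the variable names are illustrative. The vocabulary stays tiny, but the sequence becomes much longer than the five word-level tokens.

```python
text = "Jim Henson was a puppeteer"

# Character-based tokenization: every character (including spaces) is a token.
char_tokens = list(text)

# The vocabulary is simply the set of distinct characters seen in the text.
char_vocab = sorted(set(char_tokens))

print(len(text.split()))  # 5 word-level tokens
print(len(char_tokens))   # 26 character-level tokens
print(len(char_vocab))    # 15 distinct characters
```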

To get the best of both worlds, we can use a third technique that combines the two approaches: **subword tokenization**.

# Subword tokenization
**Subword tokenization** algorithms rely on the principle that **frequently used words should not be split into smaller subwords**, but rare words should be decomposed into meaningful subwords.

EXAMPLE: <br>
**“annoyingly”**: this might be considered a rare word and could be decomposed into
- “annoying”
- “ly”.

These are both likely to appear more frequently as standalone subwords.

Here is an example showing how a subword tokenization algorithm would tokenize the sequence “**Let’s do tokenization**!”:
- Let's `</w>`
- do `</w>`
- token
- ization `</w>`
- ! `</w>`

“tokenization” was split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient. This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.
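To make the splitting idea concrete, here is a toy sketch that breaks a word into the longest pieces found in a small, hand-written vocabulary. The `toy_vocab` set and the `greedy_subword_split` helper are hypothetical, and greedy longest-match is only one simple way to apply a subword vocabulary; it says nothing about how real algorithms learn their vocabularies.

```python
def greedy_subword_split(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hand-written toy vocabulary (an assumption for illustration only).
toy_vocab = {"do", "token", "ization", "annoying", "ly", "!"}

print(greedy_subword_split("tokenization", toy_vocab))  # ['token', 'ization']
print(greedy_subword_split("annoyingly", toy_vocab))    # ['annoying', 'ly']
print(greedy_subword_split("puppeteer", toy_vocab))     # ['[UNK]']
```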

Unsurprisingly, there are **many more techniques** out there. To name a few:

- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models

# Loading and saving

Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: **from_pretrained()** and **save_pretrained()**. These methods will **load or save** the algorithm used by the tokenizer as well as its vocabulary (a bit like the weights of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer by calling it directly on a sentence:

In [None]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
# Save tokenizer
tokenizer.save_pretrained("directory_on_my_computer")

# Tokenization
The tokenization process is done by the tokenize() method of the tokenizer:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

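This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with “Transformer”, which is split into two tokens; the `##` prefix on the second one marks it as the continuation of the previous token. Note that bert-base-cased preserves case, so the pieces you get when running the cell may differ in capitalization from the listing above.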

# From tokens to input IDs
The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 11303, 1200, 2443, 1110, 3014]
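Conceptually, this conversion is a dictionary lookup from token strings to integers. The sketch below illustrates that idea with a hypothetical `toy_vocab` and `tokens_to_ids` helper (neither is the library's implementation). It then shows how wrapping the IDs printed above with BERT's `[CLS]` (101) and `[SEP]` (102) IDs gives the `input_ids` returned when the tokenizer is called directly on the sentence.

```python
# Hypothetical toy vocabulary; real checkpoints ship tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "Using": 1, "a": 2, "network": 3, "is": 4, "simple": 5}


def tokens_to_ids(tokens, vocab):
    """Look up each token, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


toy_ids = tokens_to_ids(["Using", "a", "network", "is", "simple", "puppeteer"], toy_vocab)
print(toy_ids)  # [1, 2, 3, 4, 5, 0]

# IDs printed by convert_tokens_to_ids() above, wrapped with the special-token IDs.
ids = [7993, 170, 11303, 1200, 2443, 1110, 3014]
with_special = [101] + ids + [102]
print(with_special)  # [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102]
```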

# Decoding
Decoding is going the other way around: from vocabulary indices, we want to get a string. This can be done with the decode() method as follows:

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a Transformer network is simple

Note that the decode method not only converts the indices back to tokens, but also **groups together the tokens that were part of the same words to produce a readable sentence**.

This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
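The grouping step can be pictured with a simplified sketch: any token starting with `##` is glued onto the token before it, and the rest are joined with spaces. The `merge_wordpieces` helper is hypothetical and ignores details the real `decode()` method handles, such as special tokens and punctuation spacing.

```python
def merge_wordpieces(tokens):
    """Glue '##'-prefixed pieces onto the previous token, then join with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the last word without the '##' marker.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


pieces = ["Using", "a", "transform", "##er", "network", "is", "simple"]
merged = merge_wordpieces(pieces)
print(merged)  # Using a transformer network is simple
```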