## Tokenizers

Purpose: turn text into data the model can process  
Goal: find the tokenization method that gives the most meaningful representation and, ideally, the smallest one

### Word-based
![word-based](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/word_based_tokenization.svg)

In [1]:
# split text at whitespaces:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


* punctuation + whitespace based: usually results in quite large *vocabularies*
    * vocabulary: the set of unique tokens in our corpus (its size is the number of unique tokens)
        * corpus: all the input text that gets tokenized

"Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word."  

* "dog" would get a different ID from "dogs"
* "run" would get a different ID from "running"
* unknown words get tokenized as [UNK] - it's a bad sign if you see too many of these
    * The tokenizer wasn't able to retrieve info about these words, so that information is lost.
    * You want to design a vocabulary (all the unique tokens) such that there are very few [UNK] tokens
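
The points above can be sketched in a few lines of plain Python. This is only a toy illustration, not how any library builds its vocabulary: the two-sentence corpus is made up, and reserving ID 0 for `[UNK]` is a choice made for this sketch.

```python
# Toy word-based tokenizer: split on whitespace, assign IDs in order of first appearance.
corpus = ["the dog runs", "the dogs run"]  # made-up corpus for illustration

vocab = {"[UNK]": 0}  # reserving ID 0 for unknown words is an assumption of this sketch
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def encode(text):
    """Map each whitespace-separated word to its ID, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(vocab)
print(encode("the dog is running"))  # "is" and "running" were never seen
```

Here "dog" and "dogs" end up with unrelated IDs, and both unseen words collapse to the same `[UNK]` ID, which is exactly the information loss described above.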

### Character-based

![character-based](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/character_based_tokenization.svg)

Positives:  
* vocab is smaller
* fewer UNK tokens since all words can be made from letters

Negatives:  
* Each representation is less meaningful on its own - words mean things, letters alone don't
    * In English, at least
* Large number of tokens to be processed by the model (fewer unique tokens but more in total)
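
A quick way to see both sides of this trade-off is to tokenize the same sentence at word level and at character level. This is a minimal sketch in plain Python; the word `puppets` is just an example picked for this illustration.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)  # every character, spaces included, is one token

word_vocab = set(word_tokens)
char_vocab = set(char_tokens)

# Same text, far more tokens at the character level.
print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")

# A word that never appeared in the text (picked for this sketch).
new_word = "puppets"
print(new_word in word_vocab)                    # unknown at the word level
print(all(ch in char_vocab for ch in new_word))  # fully covered at the character level
```

The character vocabulary already covers the unseen word, but the model has to process many more tokens for the same sentence.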

### Sub-word  

"Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords."  

"For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”."  

![lets-do-tokenization](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/bpe_subword.svg)

"These subwords end up providing a lot of semantic meaning: for instance, in the example above “tokenization” was split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens."
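
To make the idea concrete, here is a rough sketch of how a word could be split once a subword vocabulary exists. It assumes a tiny hand-written vocabulary and a greedy longest-match-first rule, and it borrows the `##` continuation marker that shows up in the BERT tokenizer output later in these notes. It does not show how a real vocabulary is learned, and real tokenizers handle many more details.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
vocab = {"annoying", "##ly", "token", "##ization", "[UNK]"}

def split_word(word):
    """Greedily take the longest vocabulary piece from the left, then repeat."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

print(split_word("annoyingly"))    # ['annoying', '##ly']
print(split_word("tokenization"))  # ['token', '##ization']
print(split_word("zebra"))         # ['[UNK]']
```

With only four real entries, the toy vocabulary covers two long words; a word with no matching pieces still falls back to `[UNK]`.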

"Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: from_pretrained() and save_pretrained(). These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model)."

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [3]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [4]:
# tokenizer.save_pretrained("directory_on_my_computer")

## Encoding

Encoding: translating text into numbers.  
1. text into tokens
2. tokens into input IDs (numbers in tensors)  

When a tokenizer is loaded, it has its own vocabulary, which it uses to split our words into tokens and then map those tokens to IDs.

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") # subword tokenizer

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence) # tokenize

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


In [7]:
# tokens into input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


"These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen earlier in this chapter."

In [8]:
sequence = "Using a laptop is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['Using', 'a', 'laptop', 'is', 'simple']
[7993, 170, 12574, 1110, 3014]


The input IDs are vocab indices!

### Decoding

Decoding goes the other way: from vocab indices back into a string.

In [9]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)


Using a transformer network is simple


"Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence."
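
The grouping step can be imitated with a small lookup table. The sketch below is a simplified stand-in, not the library's implementation: it reuses the token/ID pairs printed in the encoding cells above and assumes that a `##` prefix is the only thing that needs merging.

```python
# Token/ID pairs copied from the encoding cells above.
vocab = {"Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
         "network": 2443, "is": 1110, "simple": 3014}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def decode(ids):
    """Look up each ID, then glue '##' pieces onto the previous token."""
    words = []
    for idx in ids:
        token = id_to_token[idx]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)
    return " ".join(words)

print(decode([7993, 170, 13809, 23763, 2443, 1110, 3014]))
# Using a Transformer network is simple
```

The real `decode` method also deals with special tokens and spacing cleanup, which this sketch leaves out.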