# Tokens and Tokenizers

ref: https://learn.theaiedge.io/ by Damien Benveniste, PhD

- OpenAI charges for its APIs by token usage, not by word count. Why? Because tokens are the units the model actually reads and generates.
- A token is typically not a word; it could be a smaller unit, like a character or a part of a word, or a larger one like a whole phrase.
- Tokens make LLM training and inference more efficient.

## Types of Tokenizations

#### Character Level Tokenization
![image.png](attachment:fcc55f8f-86ce-42bb-b71d-300ce704b98c.png)

ref: https://medium.com/illuminations-mirror/on-tokenization-in-llms-34309273f238

#### Word Level Tokenization
![image.png](attachment:1622630b-bd40-457f-9289-a48c8a144367.png)

ref: https://medium.com/illuminations-mirror/on-tokenization-in-llms-34309273f238

#### Subword Tokenization
![image.png](attachment:f992d4fa-538c-4cbc-9613-756b10ab4a93.png)

ref: https://medium.com/illuminations-mirror/on-tokenization-in-llms-34309273f238
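
To make the three granularities concrete, here is a minimal sketch in plain Python. The function names and `toy_vocab` are made up for illustration, and the subword splitter is a simple greedy longest-match, not the algorithm of any particular library.

```python
def char_tokenize(text):
    """Character level: every character (spaces included) is a token."""
    return list(text)


def word_tokenize(text):
    """Word level: split on whitespace (real tokenizers also split punctuation)."""
    return text.split()


def subword_tokenize(word, vocab):
    """Greedy longest-match split of one word into known subwords.

    Falls back to single characters when nothing longer is in the vocabulary.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens


toy_vocab = {"token", "ization", "un", "believ", "able"}  # hypothetical vocabulary

text = "tokenization matters"
print(char_tokenize(text)[:5])                       # ['t', 'o', 'k', 'e', 'n']
print(word_tokenize(text))                           # ['tokenization', 'matters']
print(subword_tokenize("tokenization", toy_vocab))   # ['token', 'ization']
print(subword_tokenize("unbelievable", toy_vocab))   # ['un', 'believ', 'able']
```

Character tokens give long sequences, word tokens need a huge vocabulary, and subwords sit in between.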

## Word-level vs Character-level vs Subword-level embeddings

- In RAG, text is converted into embeddings, serving as a crucial step for LLMs to understand and process language. 
- Let's understand the challenges with different token embeddings

#### Problems with Word-level embeddings
![image.png](attachment:906b69c7-5100-4929-b0dd-ffa119aaa6cf.png)

#### Problems with Character-level embeddings
![image.png](attachment:0d2400b4-ad86-49d7-9e51-48ba8e8e8ef3.png)

#### Advantages of Subword-level embeddings
![image.png](attachment:6524faf8-6433-413d-9ccc-9f31b843ac71.png)
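
A small sketch of the trade-off, assuming the usual framing: a word-level vocabulary cannot represent words it has never seen, while a character-level encoding never hits an unknown but produces much longer sequences. The corpus, `word_vocab`, and the helper names are hypothetical.

```python
corpus = ["the cat sat", "the dog sat"]

# Word-level vocabulary built from the training corpus, with an unknown-word slot
word_vocab = {"<unk>": 0}
for sentence in corpus:
    for word in sentence.split():
        word_vocab.setdefault(word, len(word_vocab))


def encode_words(text):
    """Map each word to its id; unseen words collapse to <unk>."""
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.split()]


def encode_chars(text):
    """Map each character to its Unicode code point: no unknowns, longer output."""
    return [ord(c) for c in text]


new_text = "the cats sat"          # "cats" never appeared in the corpus
word_ids = encode_words(new_text)
char_ids = encode_chars(new_text)

print(word_ids)                     # [1, 0, 3] -> "cats" became <unk>
print(len(word_ids), len(char_ids)) # 3 12
```

Subword tokenization aims for the middle ground: "cats" could be split into known pieces such as "cat" and "s" instead of being lost.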

## Tokenization Process

- Byte Pair Encoding (BPE) is the tokenization strategy used in many modern LLMs, including the GPT family.
- It starts by dividing the text into individual characters and then repeatedly merges the most frequently occurring pair of adjacent symbols to form a new token.
- This process continues until a set limit is reached (for example, a target vocabulary size), creating a mix of character, subword, and whole-word tokens.
  

#### Iteration 1

![image.png](attachment:c30414f7-8d28-4a7e-b257-3831af64b600.png)

#### Iteration 2

![image.png](attachment:119d22f2-fd47-47ac-bdde-0652c3e6969c.png)

#### Iteration 3

- We can iterate this process as many times as we need:

![image.png](attachment:257ce1e8-2ddf-4240-a83b-623a4861fd5b.png)

#### The overall process

1. Start with Character-Level Tokenization

2. Count Pair Frequencies

3. Merge the Most Frequent Pair

4. Repeat the Process

5. Finalize the Vocabulary

6. Tokenization of New Text
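
The six steps above can be sketched in a few lines of plain Python. This is a toy character-level version under simplifying assumptions (whitespace-split words, ties broken by first occurrence, no end-of-word marker); production tokenizers such as GPT-2's work on bytes and add more machinery. All function names here are illustrative.

```python
from collections import Counter


def get_pair_counts(words):
    """Step 2: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Step 3: replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Steps 1-5: learn an ordered list of merges from a text corpus."""
    words = Counter(tuple(w) for w in corpus.split())  # step 1: characters
    merges = []
    for _ in range(num_merges):                        # step 4: repeat
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)               # ties: first pair seen
        words = merge_pair(words, best)
        merges.append(best)
    return merges                                      # step 5: final merge table


def apply_bpe(word, merges):
    """Step 6: tokenize new text by replaying the learned merges in order."""
    symbols = {tuple(word): 1}
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(next(iter(symbols)))


merges = train_bpe("low low low lower lowest newer newer wider", num_merges=4)
print(merges)                       # [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('n', 'e')]
print(apply_bpe("lowest", merges))  # ['low', 'e', 's', 't']
```

The vocabulary is the set of starting characters plus one new symbol per merge, so `num_merges` directly controls the vocabulary size.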

# Understanding BERT

- BERT, short for Bidirectional Encoder Representations from Transformers, is a Machine Learning (ML) model for natural language processing.
- It was developed in 2018 by researchers at Google AI Language.

### 1. What is BERT used for?
- Can determine how positive or negative a movie’s reviews are. ([Sentiment Analysis](https://huggingface.co/blog/sentiment-analysis-python))
- Helps chatbots answer your questions. ([Question answering](https://huggingface.co/tasks/question-answering))
- Predicts your text when writing an email (Gmail). ([Text prediction](https://huggingface.co/tasks/fill-mask))
- Can write an article about any topic with just a few sentence inputs. ([Text generation](https://huggingface.co/tasks/text-generation))
- Can quickly summarize long legal contracts. ([Summarization](https://huggingface.co/tasks/summarization))
- Can differentiate words that have multiple meanings (like ‘bank’) based on the surrounding text. (Polysemy resolution)

### 2. How does BERT Work?

#### 2.1 Large amounts of training data
- A massive dataset of 3.3 Billion words has contributed to BERT’s continued success.
- BERT was specifically trained on Wikipedia (2.5B words) and the BooksCorpus (800M words). These large informational datasets contributed to BERT’s deep knowledge not only of the English language but also of our world! 🚀
- Training on a dataset this large takes a long time. BERT’s training was made possible by the Transformer architecture and sped up by using TPUs (Tensor Processing Units, Google’s custom circuits built specifically for large ML models). 64 TPU chips trained BERT over the course of 4 days.
- DistilBERT offers a lighter version of BERT; runs 60% faster while maintaining over 95% of BERT’s performance.

#### 2.2 Masked Language Model (MLM)
- MLM enforces bidirectional learning by masking (hiding) a word in a sentence and forcing BERT to use the words on both sides of the masked position to predict it.
- Examples:
    1. “Yesterday I was walking through the park and a friendly squirrel [blank] up to me.”
    2. “As the sun [blank] over the horizon, we knew it was going to be a beautiful day.”
    3. “On her first day at the new job, she felt [blank] but also excited about the opportunities.”
    4. “The chef added a secret ingredient that made the soup [blank] better than usual.”
       
- A random 15% of the tokens are hidden during training, and BERT’s job is to correctly predict the hidden tokens.
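
A minimal sketch of the masking step, assuming whitespace tokens and a fixed seed for reproducibility. The function name `mask_tokens` is made up for illustration. This simplified version always substitutes `[MASK]`; the original BERT recipe replaces a selected token with `[MASK]` only 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide roughly `mask_prob` of the tokens and remember the originals."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    masked = list(tokens)
    labels = {}  # position -> original token the model must predict
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


tokens = ("yesterday i was walking through the park and "
          "a friendly squirrel ran up to me").split()
masked, labels = mask_tokens(tokens)

print(masked)  # 2 of the 15 tokens are replaced by [MASK]
print(labels)  # the hidden tokens, keyed by position
```

During training, the loss is computed only at the positions stored in `labels`.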

#### 2.3 Next Sentence Prediction (NSP)
- NSP helps BERT learn relationships between sentences by predicting whether a given sentence actually follows the previous one.
- Examples of correct sentence pairs:
    1. The cat climbed up the tree. It was stuck there for hours. (correct sentence pair)
    2. They planned a vacation to Hawaii. They booked flights and hotels yesterday. (correct sentence pair)
    3. The museum had an exhibition on ancient Egypt. It featured artifacts from the Pharaoh's tomb. (correct sentence pair)
- Examples of incorrect sentence pairs:
    1. The library was very quiet. The race car zoomed around the track. (incorrect sentence pair)
    2. It was raining heavily outside. The recipe calls for two cups of flour. (incorrect sentence pair)
    3. She's an avid reader of science fiction. Basketball is a popular sport worldwide. (incorrect sentence pair)

- In training, 50% of the pairs are true consecutive sentences and 50% pair a sentence with a randomly chosen one, so BERT has to learn to tell them apart.
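
A rough sketch of how such training pairs could be built from one ordered list of sentences. `make_nsp_examples` is a hypothetical helper; here the negative sentence is drawn from the same list for simplicity, whereas the original setup samples it from elsewhere in the corpus.

```python
import random


def make_nsp_examples(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples, about half of them true pairs."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the real next sentence
            examples.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except A itself and the real next one
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), False))
    return examples


sentences = [
    "The cat climbed up the tree.",
    "It was stuck there for hours.",
    "They planned a vacation to Hawaii.",
    "They booked flights and hotels yesterday.",
    "The library was very quiet.",
    "The race car zoomed around the track.",
]
examples = make_nsp_examples(sentences)
for a, b, is_next in examples:
    print(is_next, "|", a, "->", b)
```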

#### 2.4 Transformers
- This requires a new section of its own!

# Understanding Transformers

- Transformers have revolutionized the field of natural language processing (NLP).
- The challenges in NLP before transformers:
    1. **Sequential Processing**: Traditional NLP models processed text sequentially, either left-to-right or right-to-left. This meant they could not fully grasp the context of words in a sentence, especially in long sentences.
    2. **Long-Term Dependencies**: Capturing dependencies between words that are far apart in a text was challenging. Models like RNNs and LSTMs struggled with this due to issues like vanishing gradients.


- Transformers, introduced in the paper “Attention Is All You Need” by Vaswani et al., brought a significant shift in this approach.
    1. **Parallel Processing**: Unlike their predecessors, transformers process all words in a sentence simultaneously. This parallel processing allows for a more nuanced understanding of context.
    2. **Attention Mechanism**: The key innovation in transformers is the attention mechanism. It allows the model to focus on different parts of the input sequence when predicting a word, giving importance to words based on their relevance.
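
The attention mechanism can be sketched as scaled dot-product attention, softmax(QKᵀ/√d_k)V, as defined in “Attention Is All You Need”. This pure-Python version is for illustration only: it uses the raw token vectors as queries, keys, and values, and omits the learned projection matrices and multiple heads of a real transformer. The toy vectors are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # one weight per input position, summing to 1
        # Output is the weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional token vectors; self-attention uses them as Q, K and V
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` is computed independently of the others, which is why all positions can be processed in parallel.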

![image.png](attachment:image.png)

# Next Word Prediction 

https://colab.research.google.com/drive/10bBuh4D_eY2F_FFw6cKV3_ytt2dq6QMg?authuser=1#scrollTo=tSkENh9DRaOa

# Special Tokens

Special tokens in Large Language Models (LLMs) serve various important purposes in the processing and generation of text. Here are two key points to consider:

1. **Delimiter and Control**: Special tokens act as delimiters to signal the beginning and end of sentences, paragraphs, or entire documents. They provide control over the structure of the text that the model generates or processes. For example, a special token like `<|endoftext|>` in GPT models indicates the end of a text block, helping the model understand when one piece of text ends and another begins.

2. **Functional Roles**: Special tokens can have specific functional roles that trigger certain behaviors in the model. For instance, a token like `<|startoftext|>` might signal the model to start generating text, while other tokens might instruct the model to perform a particular task like translating text, answering a question, or summarizing a passage. This allows users to give directives to the model and guide its outputs in a structured manner.

- There are a few special tokens used during tokenization:
  
![image.png](attachment:image.png)

#### Example 
- Input: "hello, how are you?"
- BERT Tokenizer: `["hello", ",", "how", "are", "you", "?"]`
- BERT post-processor: `["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]`
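
The post-processing step can be mimicked with a small helper. `add_special_tokens` is a made-up function for illustration, not the library's API; it follows BERT's `[CLS] A [SEP]` layout for one sentence and `[CLS] A [SEP] B [SEP]` for a sentence pair.

```python
def add_special_tokens(tokens, pair=None):
    """Wrap token lists the way BERT expects: [CLS] A [SEP] (B [SEP])."""
    out = ["[CLS]"] + list(tokens) + ["[SEP]"]
    if pair is not None:
        out += list(pair) + ["[SEP]"]
    return out


single = add_special_tokens(["hello", ",", "how", "are", "you", "?"])
print(single)
# ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']

paired = add_special_tokens(["how", "are", "you", "?"], pair=["fine", "."])
print(paired)
# ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'fine', '.', '[SEP]']
```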

### Let's try one tokenizer

In [19]:
from transformers import GPT2Tokenizer

In [20]:
model_id = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(
    model_id, 
    model_max_length=512
)

In [21]:
# vocab size
tokenizer.vocab_size

50257

In [22]:
# vocabulary
tokenizer.get_vocab()

{'!': 0,
 '"': 1,
 '#': 2,
 '$': 3,
 '%': 4,
 '&': 5,
 "'": 6,
 '(': 7,
 ')': 8,
 '*': 9,
 '+': 10,
 ',': 11,
 '-': 12,
 '.': 13,
 '/': 14,
 '0': 15,
 '1': 16,
 '2': 17,
 '3': 18,
 '4': 19,
 '5': 20,
 '6': 21,
 '7': 22,
 '8': 23,
 '9': 24,
 ':': 25,
 ';': 26,
 '<': 27,
 '=': 28,
 '>': 29,
 '?': 30,
 '@': 31,
 'A': 32,
 'B': 33,
 'C': 34,
 'D': 35,
 'E': 36,
 'F': 37,
 'G': 38,
 'H': 39,
 'I': 40,
 'J': 41,
 'K': 42,
 'L': 43,
 'M': 44,
 'N': 45,
 'O': 46,
 'P': 47,
 'Q': 48,
 'R': 49,
 'S': 50,
 'T': 51,
 'U': 52,
 'V': 53,
 'W': 54,
 'X': 55,
 'Y': 56,
 'Z': 57,
 '[': 58,
 '\\': 59,
 ']': 60,
 '^': 61,
 '_': 62,
 '`': 63,
 'a': 64,
 'b': 65,
 'c': 66,
 'd': 67,
 'e': 68,
 'f': 69,
 'g': 70,
 'h': 71,
 'i': 72,
 'j': 73,
 'k': 74,
 'l': 75,
 'm': 76,
 'n': 77,
 'o': 78,
 'p': 79,
 'q': 80,
 'r': 81,
 's': 82,
 't': 83,
 'u': 84,
 'v': 85,
 'w': 86,
 'x': 87,
 'y': 88,
 'z': 89,
 '{': 90,
 '|': 91,
 '}': 92,
 '~': 93,
 '¡': 94,
 '¢': 95,
 '£': 96,
 '¤': 97,
 '¥': 98,
 '¦': 99,
 '§': 100,
 ...}

In [23]:
# encoded text
text = "Supercalifragilisticexpialidocious"
tokenizer.encode(text)

[12442, 9948, 361, 22562, 346, 396, 501, 42372, 498, 312, 32346]

In [24]:
# underlying tokens
tokenizer.tokenize(text)

['Super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious']

In [25]:

# special tokens
tokenizer.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}

### Let's try another tokenizer

In [26]:
from transformers import AutoTokenizer

model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(
    model_id, 
    model_max_length=512
)

In [27]:
# vocab size
tokenizer.vocab_size

30522