# Lesson 1: Special Tokens

Special tokens in Large Language Models (LLMs) are reserved vocabulary entries that carry structural or control meaning rather than ordinary text. Two roles are worth keeping in mind:

1. **Delimiter and Control**: Special tokens mark boundaries such as the beginning and end of a sentence, paragraph, or document, which gives control over the structure of the text the model processes or generates. For example, `<|endoftext|>` in GPT models marks the end of a text block, so the model can tell where one piece of text ends and the next begins.

2. **Functional Roles**: Special tokens can trigger specific behaviors in the model. For instance, a token like `<|startoftext|>` might signal the model to start generating text, while other tokens might instruct it to perform a particular task such as translating text, answering a question, or summarizing a passage. This lets users give the model directives and guide its output in a structured way.
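To make the delimiter role concrete, here is a minimal pure-Python sketch; it is not GPT-2's actual data pipeline. It assumes the ID 50256 for `<|endoftext|>` (the last entry in GPT-2's 50,257-token vocabulary). The helper names `pack_documents` and `split_documents` are hypothetical, and the sample token IDs are made up for illustration.

```python
EOS_ID = 50256  # assumed ID of <|endoftext|> in the GPT-2 vocabulary


def pack_documents(docs):
    """Concatenate tokenized documents, appending EOS_ID after each one."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS_ID)
    return stream


def split_documents(stream):
    """Recover the documents by cutting the stream at every EOS_ID."""
    docs, current = [], []
    for token_id in stream:
        if token_id == EOS_ID:
            docs.append(current)
            current = []
        else:
            current.append(token_id)
    if current:  # trailing tokens with no closing delimiter
        docs.append(current)
    return docs


# Illustrative token IDs, not real encodings of any particular text
docs = [[101, 102, 103], [201, 202]]
stream = pack_documents(docs)
print(stream)
print(split_documents(stream) == docs)
```

Because the delimiter is a single reserved ID, document boundaries survive even when many texts are packed into one long sequence.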

- A few special tokens are commonly used during tokenization:
  
![image.png](attachment:3c1b9a0f-4d68-40a1-80e7-435542601254.png)

#### Example
- Input: "hello, how are you?"
- BERT tokenizer: ["hello", ",", "how", "are", "you", "?"]
- BERT post-processor: ["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]
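The post-processing step in this example can be sketched in plain Python. This is a toy illustration rather than the Hugging Face implementation: the function name `add_special_tokens` is hypothetical, and it works on token strings instead of IDs. It follows BERT's convention of `[CLS] A [SEP]` for a single sequence and `[CLS] A [SEP] B [SEP]` for a pair.

```python
CLS, SEP = "[CLS]", "[SEP]"


def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one sequence, or a pair of sequences, in BERT-style markers."""
    output = [CLS] + list(tokens_a) + [SEP]
    if tokens_b is not None:
        # A second sequence is appended and closed with its own [SEP]
        output += list(tokens_b) + [SEP]
    return output


tokens = ["hello", ",", "how", "are", "you", "?"]
print(add_special_tokens(tokens))
print(add_special_tokens(["how", "are", "you", "?"], ["fine", "."]))
```

The tokenizer itself never produces `[CLS]` or `[SEP]` from the input text; they are added afterwards, which is why this is a separate step.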

### Let's try one tokenizer

In [2]:
from transformers import GPT2Tokenizer

In [13]:
model_id = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(
    model_id, 
    model_max_length=512
)

In [4]:
# vocab size
tokenizer.vocab_size

50257

In [14]:
# vocabulary
tokenizer.get_vocab()

In [6]:
# encoded text
text = "Supercalifragilisticexpialidocious"
tokenizer.encode(text)

[12442, 9948, 361, 22562, 346, 396, 501, 42372, 498, 312, 32346]

In [7]:
# underlying sub-word tokens (public API instead of the private _tokenize)
tokenizer.tokenize(text)

['Super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious']
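The two outputs above line up one to one: each ID is the vocabulary index of one sub-word piece. The sketch below pairs them using the values copied from those outputs, so it runs without downloading the tokenizer. The variable names are illustrative, and the pairing assumes both calls return tokens in the same order.

```python
text = "Supercalifragilisticexpialidocious"

# Values copied from the encode and tokenize outputs shown above
token_ids = [12442, 9948, 361, 22562, 346, 396, 501, 42372, 498, 312, 32346]
tokens = ['Super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious']

# Show each sub-word piece next to its vocabulary ID
for token_id, token in zip(token_ids, tokens):
    print(f"{token_id:>6}  {token}")

# For this single word, concatenating the pieces restores the input
reconstructed = "".join(tokens)
print(reconstructed == text)
```

A rare word is not mapped to an unknown token; it is broken into smaller pieces that are already in the vocabulary.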

In [15]:
# special tokens defined for this tokenizer
tokenizer.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}
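Note that all three roles point at the same string. A small sketch that inverts the mapping makes this explicit; the dictionary is copied from the output above rather than loaded from the tokenizer, and the name `roles_by_token` is illustrative.

```python
from collections import defaultdict

# Copied from the special_tokens_map output above
special_tokens_map = {
    'bos_token': '<|endoftext|>',
    'eos_token': '<|endoftext|>',
    'unk_token': '<|endoftext|>',
}

# Group the role names by the token string they map to
roles_by_token = defaultdict(list)
for role, token in special_tokens_map.items():
    roles_by_token[token].append(role)
roles_by_token = dict(roles_by_token)

print(roles_by_token)
print(len(roles_by_token))  # number of distinct special tokens
```

In other words, GPT-2 reuses a single special token for beginning-of-sequence, end-of-sequence, and unknown.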

### Let's try another tokenizer

In [8]:
from transformers import AutoTokenizer

model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(
    model_id, 
    model_max_length=512
)

In [9]:
# vocab size
tokenizer.vocab_size

30522