### There are three types of tokenization algorithms:

**1. Word-based tokenizer** - Each word in a sentence is treated as one token

*Example* : text = "I love machine learning"

        tokenized_text = ["I", "love", "machine", "learning"]

**2. Character-based tokenizer** - Individual characters are treated as tokens

*Example* : text = "AI"

        tokenized_text = ["A", "I"]

**3. Subword-based tokenizer** - Words are broken down into smaller subword units, which often (but not always) correspond to meaningful parts such as prefixes and suffixes (Byte Pair Encoding, or BPE, is an example of this type)

*Example* : text = "unhappiness"

        tokenized_text = ["un", "happi", "ness"]
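The first two schemes are simple enough to sketch in plain Python. This is a minimal illustration only: it assumes whitespace is the sole word delimiter, which real word-based tokenizers do not (they also handle punctuation and casing). The variable names are chosen here for illustration.

```python
text = "I love machine learning"

# Word-based: split on whitespace (assumption: no punctuation handling)
word_tokens = text.split()

# Character-based: every character becomes its own token
char_tokens = list("AI")

print(word_tokens)
print(char_tokens)
```

Subword tokenization cannot be written this directly, because the split points come from a vocabulary learned from data.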

In [1]:
import tiktoken
import importlib.metadata  # the metadata submodule must be imported explicitly
print("Tiktoken Version : ", importlib.metadata.version('tiktoken'))

Tiktoken Version :  0.11.0


## Byte Pair Encoding

### Byte Pair Encoding is a subword tokenization technique that repeatedly merges the most frequent adjacent symbol pairs to form a compact and efficient vocabulary.

Example:    

        Corpus:                  low lower lowest

        Initial tokens:            l o w   l o w e r   l o w e s t

        After merges:              low   low er   low est

        Final tokens might be:     ["low", "er", "est"]
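The merge loop can be sketched as a toy character-level trainer on the same corpus. This is a simplified sketch, not the GPT-2 implementation: it works on characters rather than bytes, breaks ties by first occurrence, and the helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() keeps the first pair seen when counts tie (an arbitrary choice)
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


corpus = "low lower lowest".split()
words = Counter(tuple(w) for w in corpus)  # each word starts as characters

merges = []
for _ in range(2):  # two merges are enough to build "low"
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

With this tie-breaking rule, the two merges are `l + o` and then `lo + w`, so every word begins with the single symbol `low`. Which merges come next depends on pair frequencies, so in a corpus this small the result need not match the `er` / `est` split shown above.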

In [2]:
tokenizer = tiktoken.get_encoding('gpt2')

In [7]:
text = (
    "Hi, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
)

token_ids = tokenizer.encode(text, allowed_special={'<|endoftext|>'})  # text -> tokens -> token IDs
print("Text: ", text)
print("Token IDs: ", token_ids)

Text:  Hi, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.
Token IDs:  [17250, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [4]:
string = tokenizer.decode(token_ids)  # token IDs -> tokens -> text
string

'Hi, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.'

The **<|endoftext|>** token is used to mark sequence boundaries, prevent mixing of documents during training, and provide a natural stopping signal during text generation.
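To make the boundary role concrete, the sketch below splits the token IDs printed earlier wherever the end-of-text ID appears. It is illustrative only: the ID 50256 is taken from the output above, and `split_documents` is a hypothetical helper, not part of tiktoken.

```python
EOT_ID = 50256  # ID of <|endoftext|> in the GPT-2 output above

# Token IDs copied from the encode() output
token_ids = [17250, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262,
             4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


def split_documents(ids, boundary_id):
    """Split a flat list of token IDs into documents at each boundary ID."""
    docs, current = [], []
    for tid in ids:
        if tid == boundary_id:
            docs.append(current)
            current = []
        else:
            current.append(tid)
    docs.append(current)  # keep whatever follows the last boundary
    return docs


docs = split_documents(token_ids, EOT_ID)
print(len(docs))
print(docs[0])
print(docs[1])
```

The two resulting lists correspond to the text before and after `<|endoftext|>`, which is how unrelated documents can be kept apart after being concatenated into one stream.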