# Bits and bytes
Converting text into a byte array:

In [1]:
text = "This is some text yes"
byte_ary = bytearray(text, "utf-8")
print(byte_ary)

bytearray(b'This is some text yes')


If we call `list()` on a `bytearray`, each byte is treated as an individual element, and we get a list of integers corresponding to the byte values:

In [3]:
ids = list(byte_ary)
print(ids)
print("the number of tokens: ",len(ids))

[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116, 32, 121, 101, 115]
the number of tokens:  21
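The conversion is lossless. As a small sketch (the variable names are illustrative), the integers can be turned back into the original string. It also shows that a non-ASCII character occupies more than one byte, so the IDs correspond to bytes rather than characters.

```python
text = "This is some text yes"
ids = list(text.encode("utf-8"))  # same integers as list(bytearray(text, "utf-8"))

# bytes() accepts an iterable of integers in range(0, 256), so we can go back
restored = bytes(ids).decode("utf-8")
print(restored == text)

# A non-ASCII character is encoded as several bytes in UTF-8
accent_ids = list("é".encode("utf-8"))
print(accent_ids)  # [195, 169]
```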


This is one way to turn text into a token ID representation. However, creating one ID for every byte (one per character for ASCII text like this) results in too many IDs.

Instead of using one token per character, BPE tokenizers have a vocabulary with a token ID for each word or subword.
For example, the GPT-2 tokenizer encodes "This is some text yes" into 5 tokens rather than 21.

In [5]:
import tiktoken
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2ids = list(gpt2_tokenizer.encode("This is some text yes"))
print(gpt2ids)
print("the number of tokens: ",len(gpt2ids))

[1212, 318, 617, 2420, 3763]
the number of tokens:  5


Since one byte can represent only $2^8 = 256$ distinct values, `bytearray(range(0, 257))` raises `ValueError: byte must be in range(0, 256)`.
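A quick way to see this limit, sketched with illustrative variable names:

```python
# Every value from 0 to 255 fits in a single byte
all_bytes = bytearray(range(256))
print(len(all_bytes))

# 256 does not fit in one byte, so construction fails
error_message = None
try:
    bytearray(range(257))
except ValueError as err:
    error_message = str(err)
print(error_message)
```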
A BPE tokenizer usually uses these 256 byte values as its first 256 single-byte tokens. We can check this by running the following code:

In [2]:
import tiktoken
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

for i in range(300):
    decoded = gpt2_tokenizer.decode([i])
    if 10 < i < 250 or 265 < i < 290:
        continue  # skip most IDs; printing all 300 would be unreadable
    print(f"{i}: {decoded}")

0: !
1: "
2: #
3: $
4: %
5: &
6: '
7: (
8: )
9: *
10: +
250: �
251: �
252: �
253: �
254: �
255: �
256:  t
257:  a
258: he
259: in
260: re
261: on
262:  the
263: er
264:  s
265: at
290:  and
291: ic
292: as
293: le
294:  th
295: ion
296: om
297: ll
298: ent
299:  n


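As we can see, the tokens from 256 onward that start with a whitespace are treated as different from their counterparts without it (for example, 't' is a different token from ' t'). This has been improved in the GPT-4 tokenizer. IDs 250 to 255 print as `�` because each of them is a single byte that is not valid UTF-8 on its own.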

# Building the vocabulary

The purpose of the BPE tokenization algorithm is to build a vocabulary of commonly occurring subwords like `298: ent` (from the words *entity, entertain, entrance, ...*) or whole words like:
```
318: is
617: some
1212: This
2420: text
3763: yes
```

The general structure of the BPE algorithm is as follows:

## BPE algorithm outline

### 1. Identify frequent pairs
- In each iteration, scan the text for the most commonly occurring pair of adjacent bytes (characters)
### 2. Replace and record
- Replace that pair with a new placeholder ID that is not already in use (if we start with 0, ..., 255, the first placeholder is 256)
- Record this mapping in a lookup table
- The size of the lookup table is a hyperparameter, also called the "vocabulary size" (50,257 for GPT-2)
### 3. Repeat until no gains
- Keep repeating steps 1 and 2, merging the most common pair each time, until no pair occurs more than once or the target vocabulary size is reached
### Decompression (decoding)
- To restore the original text, reverse the process by substituting each ID with the corresponding pair from the lookup table
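The outline above can be sketched in a few lines of Python. This is a minimal illustration rather than the GPT-2 training code: the function names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical, the text is treated as UTF-8 bytes, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter


def count_pairs(ids):
    """Step 1: count how often each adjacent pair of IDs occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Step 2: replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, vocab_size):
    """Step 3: repeat until `vocab_size` is reached or no pair occurs twice."""
    ids = list(text.encode("utf-8"))
    merges = {}  # lookup table: (left_id, right_id) -> new_id
    next_id = 256  # IDs 0..255 are reserved for the raw bytes
    while next_id < vocab_size:
        pairs = count_pairs(ids)
        if not pairs:
            break
        # max() returns the first maximal key, so ties go to the earliest pair
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break
        merges[best] = next_id
        ids = merge_pair(ids, best, next_id)
        next_id += 1
    return ids, merges


# Assumed toy input and a tiny vocabulary size that allows three merges
ids, merges = train_bpe("the cat in the hat", vocab_size=259)
print(merges)
print(ids)
```

With this toy input, the learned table maps `(116, 104)` ("th") to 256, `(256, 101)` to 257, and `(257, 32)` to 258. Real tokenizers are trained on far larger corpora and may order ties differently.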


# BPE algorithm example
## Concrete example of the 1st and 2nd step (encoding)
- Let's say we want to build a vocabulary from the sentence `the cat in the hat`, which will be our training dataset
#### Iteration 1
1. Identifying the frequent pairs
- In the text, `th` appears 2 times
2. Replace and record
- Replace the `th` with the first token not in use, e.g., `256`
- The new text is `<256>e cat in <256>e hat`
- The new vocabulary is:
```
  0: ...
  ...
  256: "th"
```
#### Iteration 2
1. Identifying the frequent pairs
- In the text, `<256>e` appears 2 times
2. Replace and record
- Replace the `<256>e` with the first token not in use, e.g., `257`
- The new text is `<257> cat in <257> hat`
- The new vocabulary is:
```
  0: ...
  ...
  256: "th"
  257: "<256>e"
```
#### Iteration 3
1. Identifying the frequent pairs
- In the text, `<257> ` appears 2 times
2. Replace and record
- Replace the `<257> ` with the first token not in use, e.g., `258`
- The new text is `<258>cat in <258>hat`
- The new vocabulary is:
```
  0: ...
  ...
  256: "th"
  257: "<256>e"
  258: "<257> "
```
- and so on...
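Iteration 1 merges `th`, but a quick count suggests it is not the only pair that occurs twice in the sentence. The sketch below (with illustrative variable names) lists every pair with a count of 2 in order of first appearance. The walkthrough is consistent with breaking ties by first occurrence, which is an assumption here; actual implementations may choose differently.

```python
from collections import Counter

text = "the cat in the hat"

# Count adjacent character pairs; Counter keeps them in order of first appearance
pair_counts = Counter(zip(text, text[1:]))

# Collect the pairs that occur exactly twice (the highest count in this sentence)
repeated = ["".join(pair) for pair, count in pair_counts.items() if count == 2]
print(repeated)  # ['th', 'he', 'e ', 'at']
```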


## Concrete example of the last step (decoding)

To restore the original text, we reverse the process by substituting each token ID with its corresponding pair, in the reverse order in which the IDs were introduced:
- Start with the final compressed text: `<258>cat in <258>hat`
- Substitute `<258>` → `<257> `: `<257> cat in <257> hat`
- Substitute `<257>` → `<256>e`: `<256>e cat in <256>e hat`
- Substitute `<256>` → `th`: `the cat in the hat`
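The same substitutions can be written as code. In this minimal sketch, `merges` and `ids` are the lookup table and compressed sequence from the example, written out by hand as byte values, and `decode` is a hypothetical helper name.

```python
# Lookup table from the example: (left_id, right_id) -> new_id
merges = {(116, 104): 256, (256, 101): 257, (257, 32): 258}

# "<258>cat in <258>hat" as a list of IDs
ids = [258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116]


def decode(ids, merges):
    """Undo the merges from the newest ID to the oldest, then decode the bytes."""
    newest_first = sorted(merges.items(), key=lambda item: item[1], reverse=True)
    for (left, right), new_id in newest_first:
        expanded = []
        for token in ids:
            if token == new_id:
                expanded.extend((left, right))  # substitute the ID with its pair
            else:
                expanded.append(token)
        ids = expanded
    # Only raw byte values (0..255) remain at this point
    return bytes(ids).decode("utf-8")


print(decode(ids, merges))  # the cat in the hat
```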

