Converting text into a byte array:

In [1]:
text = "This is some text yes"
byte_ary = bytearray(text, "utf-8")
print(byte_ary)

bytearray(b'This is some text yes')


If we call `list()` on a `bytearray`, each byte is treated as an individual element, and we get a list of integers corresponding to the byte values:

In [3]:
ids = list(byte_ary)
print(ids)
print("the number of tokens: ",len(ids))

[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116, 32, 121, 101, 115]
the number of tokens:  21
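Note that these IDs are bytes, not characters. The two counts only match here because the example text is pure ASCII. A minimal sketch of the difference, using a made-up string `héllo` and illustrative variable names that are not part of the original notebook:

```python
# "é" is a single character but takes two bytes in UTF-8.
accented = "héllo"
accented_ids = list(accented.encode("utf-8"))

print(accented_ids)                      # [104, 195, 169, 108, 108, 111]
print("characters:", len(accented))      # 5
print("bytes:     ", len(accented_ids))  # 6
```

So a byte-level representation can need even more IDs than there are characters once the text leaves the ASCII range.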


This is one way to turn text into token IDs. However, creating one ID per byte (which means one per character for ASCII text like this) results in too many IDs.

Instead of one ID per character, BPE tokenizers have a vocabulary with a token ID for each word or subword.
For example, the GPT-2 tokenizer splits "This is some text yes" into 5 tokens rather than 21.

In [5]:
import tiktoken
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2ids = list(gpt2_tokenizer.encode("This is some text yes"))
print(gpt2ids)
print("the number of tokens: ",len(gpt2ids))

[1212, 318, 617, 2420, 3763]
the number of tokens:  5
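To give a rough idea of how a BPE vocabulary grows beyond single bytes, here is a toy sketch of one merge step: find the most frequent adjacent pair of IDs and replace it with a new ID. The helper names `most_frequent_pair` and `merge_pair` are hypothetical, and this is only an illustration of the general idea, not how tiktoken or the GPT-2 vocabulary is actually built.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of IDs that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            merged.append(ids[i])
            i += 1
    return merged

byte_ids = list("This is some text yes".encode("utf-8"))
pair = most_frequent_pair(byte_ids)           # (105, 115), the bytes of "is"
merged_ids = merge_pair(byte_ids, pair, 256)  # 256 is the first ID after the byte range

print(pair)
print(len(byte_ids), "->", len(merged_ids))   # 21 -> 19
```

Repeating this step many times on a large corpus is what produces word and subword tokens, which is why the real tokenizer needs far fewer IDs for the same sentence.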


Since one byte can represent only $2^8 = 256$ distinct values (0 to 255), `bytearray(range(0, 257))` raises `ValueError: byte must be in range(0, 256)`.
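The range limit can be checked directly. A small sketch (the variable names are illustrative only) that catches the error instead of letting it stop the notebook:

```python
error_message = None
try:
    # 256 is one past the largest value a single byte can hold.
    bytearray(range(0, 257))
except ValueError as err:
    error_message = str(err)

print("ValueError:", error_message)

# Staying within 0..255 works and gives exactly 256 distinct byte values.
all_bytes = bytearray(range(256))
print(len(all_bytes))  # 256
```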
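A BPE tokenizer usually uses these 256 byte values as its first 256 single-byte tokens. The IDs do not have to follow byte order, though: in the output below, ID 0 decodes to `!` rather than to byte 0. We can check this by running the following code: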

In [2]:
import tiktoken
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

for i in range(266):
    decoded = gpt2_tokenizer.decode([i])
    if 10 < i < 250:
        continue  # skip the middle of the range; printing all 266 tokens would be unreadable
    print(f"{i}: {decoded}")

0: !
1: "
2: #
3: $
4: %
5: &
6: '
7: (
8: )
9: *
10: +
250: �
251: �
252: �
253: �
254: �
255: �
256:  t
257:  a
258: he
259: in
260: re
261: on
262:  the
263: er
264:  s
265: at


As we can see, the tokens from ID 256 onward include entries that start with a space, and these are treated as different tokens from their counterparts without the space (for example, ' t' is a different token from 't'). This whitespace handling has been improved in the GPT-4 tokenizer.
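The `�` entries for IDs 250 to 255 in the output above are most likely single bytes that are not valid UTF-8 on their own, so decoding falls back to the replacement character. A minimal standard-library sketch of the same effect, using the euro sign as an arbitrary example (it does not use tiktoken, and the variable names are illustrative):

```python
# "€" is encoded as three bytes in UTF-8.
euro_bytes = "€".encode("utf-8")
print(list(euro_bytes))            # [226, 130, 172]
print(euro_bytes.decode("utf-8"))  # the complete sequence decodes back to "€"

# One byte of that sequence on its own is not valid UTF-8.
first_byte_only = euro_bytes[:1]
decoded_alone = first_byte_only.decode("utf-8", errors="replace")
print(decoded_alone)               # the replacement character U+FFFD
```

Such byte tokens only turn into readable text when they are decoded together with the neighbouring bytes of the same character.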