## Byte tokenizer

A tokenizer encodes text into integer IDs. The byte tokenizer treats each UTF-8 byte of the text as one token.
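
As a quick illustration of what "bytes as tokens" means, the sketch below encodes a short string with Python's built-in `str.encode`. The sample string is an arbitrary choice; the point is that non-ASCII characters take several bytes, so the token count can exceed the character count.

```python
# ASCII characters take one byte each; "é" takes two and "€" takes three.
text = "héllo €"
ids = list(text.encode("utf-8"))

print(len(text), len(ids))  # 7 10
print(ids)                  # [104, 195, 169, 108, 108, 111, 32, 226, 130, 172]
```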

To train the model, the pipeline must (1) encode the text, (2) batch the encoded sequences, and (3) feed the batches to the model.

In [35]:
class ByteTokenizer:
    """
    UTF-8 byte tokenizer: every byte (0–255) is a token.
    Reserve extra IDs for special tokens (eos, pad).
    """
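    def __init__(self):
        # 256 byte values plus 2 special tokens
        self.vocab_size = 258
        self.eos_token_id = 256
        self.pad_token_id = 257

    def encode(self, text: str):
        # Each UTF-8 byte becomes a token ID in 0..255; EOS marks the end.
        # errors="ignore" silently drops characters that cannot be encoded
        # (for example, lone surrogates).
        b = text.encode("utf-8", errors="ignore")
        return list(b) + [self.eos_token_id]

    def decode(self, ids):
        # Drop special tokens (IDs >= 256), then decode the remaining bytes.
        # errors="replace" turns invalid byte sequences into U+FFFD.
        b = bytes([i for i in ids if i < 256])
        return b.decode("utf-8", errors="replace")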
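
The two choices in `decode` (filtering IDs of 256 and above, and `errors="replace"`) are easiest to see on edge cases. The snippet below is a standalone sketch that copies the same logic into a free function so it runs on its own; the inputs are made up for illustration.

```python
# Standalone copy of the decode logic above, so this snippet runs by itself.
def decode(ids):
    return bytes([i for i in ids if i < 256]).decode("utf-8", errors="replace")

ids = list("é".encode("utf-8")) + [256, 257]  # [195, 169, 256, 257]

print(decode(ids))            # é  (the special IDs 256 and 257 are dropped)
print(repr(decode(ids[:1])))  # '\ufffd'  (a lone lead byte is not valid UTF-8)
```

The second case matters in practice: a model that generates byte by byte can stop in the middle of a multi-byte character, and `errors="replace"` keeps `decode` from raising in that situation.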

In [36]:
# test to see how the ByteTokenizer works: encoding

tokenizer = ByteTokenizer()

tokens = tokenizer.encode("hello world!!!")

print(tokens)

[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33, 33, 33, 256]


In [37]:
# test to see how the ByteTokenizer works: decoding

text = tokenizer.decode(tokens)

print(text)

hello world!!!
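
The encode and batch steps of the training pipeline can be sketched in plain Python as follows. This is a minimal illustration, not necessarily how the notebook batches its data: `make_batch` and the attention mask are hypothetical additions, and the constants simply mirror the `eos_token_id` and `pad_token_id` values of `ByteTokenizer`. Feeding the model is left out because it depends on the framework; the nested lists would typically be converted to tensors first.

```python
EOS_ID = 256  # mirrors ByteTokenizer.eos_token_id
PAD_ID = 257  # mirrors ByteTokenizer.pad_token_id

def encode(text):
    """Step 1: UTF-8 bytes as token IDs, followed by EOS."""
    return list(text.encode("utf-8")) + [EOS_ID]

def make_batch(texts, pad_id=PAD_ID):
    """Step 2 (hypothetical helper): right-pad every sequence to the longest one.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    seqs = [encode(t) for t in texts]
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

input_ids, attention_mask = make_batch(["hi", "hello"])
print(input_ids)       # [[104, 105, 256, 257, 257, 257], [104, 101, 108, 108, 111, 256]]
print(attention_mask)  # [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
```

Padding with a dedicated ID rather than a byte value keeps padding distinguishable from real data, which is why the tokenizer reserves ID 257 for it.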
