## 2. Byte-Pair Encoding (BPE) Tokenizer

Maybe useful: https://zhuanlan.zhihu.com/p/1927397109025473129

### Problem (unicode1): Understanding Unicode (1 point)

In [1]:
ord('牛')

29275

In [2]:
chr(29275)

'牛'

(a) What Unicode character does chr(0) return?

**Deliverable**: A one-sentence response.

<span style="background-color: blue;">`chr(0)` returns the null character (U+0000, NUL).</span>

In [3]:
chr(0)

'\x00'

(b) How does this character’s string representation (`__repr__()`) differ from its printed representation?

**Deliverable**: A one-sentence response.

<span style="background-color: blue;">
The string representation (`__repr__()`) of the null character explicitly shows its escape sequence `'\x00'`, whereas its printed representation typically appears as an empty string or nothing at all because it is a non-printable character.
</span>

(c) What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations: 

```python
>>> chr(0)
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
>>> print("this is a test" + chr(0) + "string")
```

**Deliverable**: A one-sentence response.

In [4]:
chr(0)
print(chr(0))
"this is a test" + chr(0) + "string"
print("this is a test" + chr(0) + "string")

 
this is a test string


<span style="background-color: blue;">The null character is embedded in the string as a real character (it counts toward the string's length), but it is typically not visible when the string is printed.</span>
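A small check can make the point above concrete. This is only an illustrative sketch: it reuses the test string from the problem statement and inspects it with standard built-ins.

```python
s = "this is a test" + chr(0) + "string"

# The null character is an ordinary code point inside the string,
# so it contributes one character to the length (14 + 1 + 6).
print(len(s))

# repr() shows the escape sequence, while print() emits the raw character,
# which most terminals render as nothing.
print(repr(s))

# The null character sits right after "this is a test".
print(s.index("\x00"))

# In UTF-8 it is encoded as the single byte 0.
print(s.encode("utf-8")[14])
```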

### Problem (unicode2): Unicode Encodings (3 points)

(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than
UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various
input strings.

**Deliverable**: A one-to-two sentence response.

In [7]:
test_string = "hello! こんにちは!"
utf8_encoded = test_string.encode("utf-8")
print(utf8_encoded)
print(type(utf8_encoded))
print("UTF-8 value: ", ", ".join(map(str, list(utf8_encoded))))
utf16_encoded = test_string.encode("utf-16")
print(utf16_encoded)
print(type(utf16_encoded))
print("UTF-16 value: ", ", ".join(map(str, list(utf16_encoded))))
utf32_encoded = test_string.encode("utf-32")
print(utf32_encoded)
print(type(utf32_encoded))
print("UTF-32 value: ", ", ".join(map(str, list(utf32_encoded))))

b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'
<class 'bytes'>
UTF-8 value:  104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33
b'\xff\xfeh\x00e\x00l\x00l\x00o\x00!\x00 \x00S0\x930k0a0o0!\x00'
<class 'bytes'>
UTF-16 value:  255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0, 32, 0, 83, 48, 147, 48, 107, 48, 97, 48, 111, 48, 33, 0
b'\xff\xfe\x00\x00h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00 \x00\x00\x00S0\x00\x00\x930\x00\x00k0\x00\x00a0\x00\x00o0\x00\x00!\x00\x00\x00'
<class 'bytes'>
UTF-32 value:  255, 254, 0, 0, 104, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0, 33, 0, 0, 0, 32, 0, 0, 0, 83, 48, 0, 0, 147, 48, 0, 0, 107, 48, 0, 0, 97, 48, 0, 0, 111, 48, 0, 0, 33, 0, 0, 0


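<span style="background-color: blue;">UTF-8 is the dominant encoding on the web and is more space-efficient for ASCII-heavy text: the test string above takes 23 bytes in UTF-8, versus 28 bytes in UTF-16 and 56 bytes in UTF-32 (both including a byte-order mark). UTF-16 and UTF-32 also pad most characters with zero bytes, which lengthens the byte sequence without adding information.</span>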
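The size difference can be summarized with a short loop. This is a sketch that reuses the test string from the cell above; the `sizes` dictionary is just an illustrative name.

```python
test_string = "hello! こんにちは!"

sizes = {}
for encoding in ("utf-8", "utf-16", "utf-32"):
    encoded = test_string.encode(encoding)
    sizes[encoding] = len(encoded)
    # Zero bytes are padding that carries no information for ASCII-range text.
    print(f"{encoding}: {len(encoded)} bytes, {encoded.count(0)} zero bytes")
```

Note that Python's `"utf-16"` and `"utf-32"` codecs prepend a byte-order mark, which is included in these counts.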

(b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into
a Unicode string. Why is this function incorrect? Provide an example of an input byte string
that yields incorrect results.
``` python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
>>> decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))
'hello'
```

**Deliverable**: An example input byte string for which decode_utf8_bytes_to_str_wrong produces
incorrect output, with a one-sentence explanation of why the function is incorrect.

In [12]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
test_string = "hello! こんにちは!"
decode_utf8_bytes_to_str_wrong(test_string.encode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: unexpected end of data

<span style="background-color: blue;">
Example input: `"こ".encode("utf-8")`, i.e. `b'\xe3\x81\x93'`. The function decodes each byte individually, but a multi-byte UTF-8 character cannot be decoded one byte at a time, so any non-ASCII input raises a UnicodeDecodeError.
</span>
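For contrast, a corrected version could simply decode the whole byte string in one call. The sketch below places it next to the incorrect function from the problem statement; the name `decode_utf8_bytes_to_str` is made up for this example.

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the whole byte string at once so multi-byte sequences stay intact.
    return bytestring.decode("utf-8")


def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Original (incorrect) version: decodes every byte in isolation.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])


data = "こ".encode("utf-8")  # three bytes for a single character

print(decode_utf8_bytes_to_str(data))

wrong_failed = False
try:
    decode_utf8_bytes_to_str_wrong(data)
except UnicodeDecodeError as err:
    wrong_failed = True
    print("byte-by-byte decoding failed:", err)
```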

(c) Give a two byte sequence that does not decode to any Unicode character(s).

**Deliverable**: An example, with a one-sentence explanation.

In [18]:
invalid_bytes = b'\xc2\x00'
print(f"Attempting to decode: {invalid_bytes!r}")
decoded_string = invalid_bytes.decode("utf-8")
decoded_string

Attempting to decode: b'\xc2\x00'


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

<span style="background-color: blue;">
`b'\xc2\x00'` does not decode:
while 0xc2 is a valid start byte for a two-byte UTF-8 sequence, 0x00 is not a valid continuation byte (continuation bytes must have the bit pattern 10xxxxxx, i.e. lie between 0x80 and 0xBF).
</span>
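The continuation-byte rule can be checked directly. The following is a minimal sketch; both helper names are hypothetical and exist only for this illustration.

```python
def is_continuation_byte(b: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF).
    return (b & 0b1100_0000) == 0b1000_0000


def decodes_as_utf8(data: bytes) -> bool:
    # True if the byte string is valid UTF-8, False otherwise.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_continuation_byte(0x00))    # False: 0x00 starts with bits 00
print(decodes_as_utf8(b"\xc2\x00"))  # False: start byte without a continuation
print(decodes_as_utf8(b"\xc2\xa9"))  # True: U+00A9, the copyright sign
```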

In [22]:
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# requires `regex` package
import regex as re
re.findall(PAT, "some text that i'll pre-tokenize")

['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']

### Problem (train_bpe): BPE Tokenizer Training (15 points)

**Deliverable**: Write a function that, given a path to an input text file, trains a (byte-level) BPE
tokenizer. Your BPE training function should handle (at least) the following input parameters:

`input_path`: str Path to a text file with BPE tokenizer training data.

`vocab_size`: int A positive integer that defines the maximum final vocabulary size (including the
initial byte vocabulary, vocabulary items produced from merging, and any special tokens).

`special_tokens`: list[str] A list of strings to add to the vocabulary. These special tokens do not
otherwise affect BPE training.

Your BPE training function should return the resulting vocabulary and merges:

`vocab`: dict[int, bytes] The tokenizer vocabulary, a mapping from int (token ID in the
vocabulary) to bytes (token bytes).

`merges`: list[tuple[bytes, bytes]] A list of BPE merges produced from training. Each list item
is a tuple of bytes `(<token1>, <token2>)`, representing that `<token1>` was merged with
`<token2>`. The merges should be ordered by order of creation.

To test your BPE training function against our provided tests, you will first need to implement the
test adapter at `[adapters.run_train_bpe]`. Then, run `uv run pytest tests/test_train_bpe.py`.
Your implementation should be able to pass all tests. Optionally (this could be a large time-investment),
you can implement the key parts of your training method using some systems language, for instance
`C++` (consider `cppyy` for this) or `Rust` (using `PyO3`). If you do this, be aware of which operations
require copying vs reading directly from Python memory, and make sure to leave build instructions, or
make sure it builds using only `pyproject.toml`. Also note that the GPT-2 regex is not well-supported
in most regex engines and will be too slow in most that do. We have verified that Oniguruma is
reasonably fast and supports negative lookahead, but the `regex` package in Python is, if anything,
even faster.
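To make the training procedure concrete, here is a deliberately naive sketch, not the implementation used for the results in this notebook. Several things are assumptions: the function name `train_bpe_from_text` is hypothetical, it takes the text directly instead of an `input_path`, it uses a simplified whitespace pre-tokenizer from the standard `re` module instead of the GPT-2 pattern, it places special tokens at the start of the vocabulary, and it breaks frequency ties by preferring the lexicographically greater pair.

```python
import re
from collections import Counter


def train_bpe_from_text(text, vocab_size, special_tokens):
    # Vocabulary: special tokens first (an arbitrary choice), then all 256 bytes.
    vocab = {}
    for tok in special_tokens:
        vocab[len(vocab)] = tok.encode("utf-8")
    for b in range(256):
        vocab[len(vocab)] = bytes([b])

    # Split on special tokens so they never take part in any merge.
    if special_tokens:
        pattern = "|".join(re.escape(t) for t in special_tokens)
        chunks = re.split(pattern, text)
    else:
        chunks = [text]

    # Simplified pre-tokenization (assumption): optional leading whitespace
    # character followed by a run of non-whitespace characters.
    word_counts = Counter()
    for chunk in chunks:
        for match in re.finditer(r"\s?\S+|\s+", chunk):
            word = tuple(bytes([b]) for b in match.group().encode("utf-8"))
            word_counts[word] += 1

    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent pairs, weighted by how often each pre-token occurs.
        pair_counts = Counter()
        for word, count in word_counts.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # nothing left to merge

        # Most frequent pair; ties go to the lexicographically greater pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab[len(vocab)] = best[0] + best[1]

        # Rewrite every pre-token with the chosen pair merged.
        new_word_counts = Counter()
        for word, count in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(best[0] + best[1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_word_counts[tuple(merged)] += count
        word_counts = new_word_counts

    return vocab, merges


vocab, merges = train_bpe_from_text(
    "low low low lower lowest<|endoftext|>low", 260, ["<|endoftext|>"]
)
print(merges)
```

Recounting every pair after each merge is quadratic in practice; a faster version would update only the counts affected by the merged pair.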

In [2]:
uv run pytest tests/test_train_bpe.py

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 3 items

tests/test_train_bpe.py::test_train_bpe_speed PASSED
tests/test_train_bpe.py::test_train_bpe PASSED
tests/test_train_bpe.py::test_train_bpe_special_tokens FAILED

________________________ test_train_bpe_special_tokens ________________________

snapshot = <tests.conftest.Snapshot object at 0x0000024489ED3710>

    def test_train_bpe_special_tokens(snapshot):
        """
        Ensure that the special tokens are added to the vocabulary and not
        merged with other tokens.
        """
        input_path = FIXTURES_PATH / "tinystories_sample_5M.txt"
        vocab, merges = run_train_bpe(
            input_path=input_path,
            vocab_size=1000,
            special_tokens=["<|endoftext|>"],
        )
    
        # Check that the special token is not in the vocab
   

### Problem (train_bpe_tinystories): BPE Training on TinyStories (2 points)

Train a byte-level BPE tokenizer on the TinyStories dataset, using a maximum vocabulary size
of 10,000. Make sure to add the TinyStories `<|endoftext|>` special token to the vocabulary.
Serialize the resulting vocabulary and merges to disk for further inspection. How many hours
and memory did training take? What is the longest token in the vocabulary? Does it make sense?
Resource requirements: ≤30 minutes (no GPUs), ≤ 30GB RAM

**Hint** You should be able to get under 2 minutes for BPE training using multiprocessing during
pretokenization and the following two facts:

(a) The `<|endoftext|>` token delimits documents in the data files.

(b) The `<|endoftext|>` token is handled as a special case before the BPE merges are applied.

**Deliverable:** A one-to-two sentence response.
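The two facts in the hint suggest splitting the corpus at `<|endoftext|>` and counting pre-tokens per document, which can then be done in parallel. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the pre-tokenizer is a simplified stand-in built on the standard `re` module rather than the GPT-2 pattern, and the mapping function is a parameter so the same code runs serially or with a process pool.

```python
import re
from collections import Counter

SPECIAL = "<|endoftext|>"


def split_documents(text, special_token=SPECIAL):
    # The special token delimits documents, so splitting on it guarantees
    # that no pre-token (and hence no merge) spans a document boundary.
    return [doc for doc in text.split(special_token) if doc]


def count_pretokens(doc):
    # Simplified pre-tokenizer (assumption): whitespace-prefixed word runs.
    return Counter(m.group() for m in re.finditer(r"\s?\S+|\s+", doc))


def pretokenize(text, map_fn=map):
    # map_fn is the built-in map by default; a pool's map can be swapped in.
    total = Counter()
    for partial in map_fn(count_pretokens, split_documents(text)):
        total.update(partial)
    return total


counts = pretokenize("a b<|endoftext|>a c")
print(counts)
```

To parallelize, one option is to create a `multiprocessing.Pool` under an `if __name__ == "__main__":` guard and pass `pool.map` as `map_fn`; `count_pretokens` is a top-level function, so it can be pickled and sent to worker processes. For a file too large to hold in memory, the same idea applies to byte ranges of the file that begin at a special-token boundary.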

| uv run -m cs336_basics.tokenizer "data/TinyStoriesV2-GPT4-train.txt" --vocab_size 10000

``` PowerShell
==================================================
🚀 Initializing BPETrainer...
Configuration:
  - Vocab Size: 10000
  - Special Tokens: ['<|endoftext|>']
  - Input File: data/TinyStoriesV2-GPT4-train.txt
==================================================

Step 1: Pre-tokenizing input file...
✅ Pre-tokenization complete in 189.14 seconds.
   Found 60006 unique pre-tokenized words/chunks.

Step 2: Learning BPE merges...
✅ Merging complete in 22.63 seconds.
   Learned 9743 merges. Final vocab size: 10000
   Longest token has 512 bytes.
   Longest token content: 'ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss'

Step 3: Saving vocabulary and merges...
   Vocabulary saved to data/TinyStoriesV2-GPT4-train-vocab_size_10000-vocab.json
   Merges saved to data/TinyStoriesV2-GPT4-train-vocab_size_10000-merges.txt

🎉 Training complete!
==================================================

--- Resource Usage ---
✅ Peak memory usage: 136.51 MB

--- CProfile Performance Analysis (Top 20) ---
         18090141 function calls (18090068 primitive calls) in 211.826 seconds

   Ordered by: cumulative time
   List reduced from 410 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.227    0.227  189.104  189.104 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:80(pretokenize)   
        4    0.000    0.000  188.790   47.197 D:\Tool\Python311\Lib\threading.py:604(wait)
        4    0.000    0.000  188.790   47.197 D:\Tool\Python311\Lib\threading.py:288(wait)
       20  188.790    9.439  188.790    9.439 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000  188.789  188.789 D:\Tool\Python311\Lib\multiprocessing\pool.py:362(map)
        1    0.000    0.000  188.788  188.788 D:\Tool\Python311\Lib\multiprocessing\pool.py:767(get)
        1    0.000    0.000  188.788  188.788 D:\Tool\Python311\Lib\multiprocessing\pool.py:764(wait)
        1    5.626    5.626   22.232   22.232 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:153(merge)        
   574642    5.532    0.000    9.659    0.000 {built-in method _heapq.heappop}
   818506    4.868    0.000    6.453    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:187(update_stats) 
 11978239    4.527    0.000    4.527    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:30(__lt__)        
   789060    0.810    0.000    1.210    0.000 {built-in method _heapq.heappush}
    49746    0.353    0.000    0.353    0.000 {method 'write' of '_io.TextIOWrapper' objects}
        1    0.025    0.025    0.308    0.308 D:\Tool\Python311\Lib\json\__init__.py:120(dump)
   789060    0.226    0.000    0.226    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:24(__init__)      
   784684    0.161    0.000    0.161    0.000 {method 'add' of 'set' objects}
   293471    0.143    0.000    0.143    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:16(count)
   574696    0.120    0.000    0.120    0.000 {method 'get' of 'dict' objects}
   435323    0.082    0.000    0.082    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:10(__init__)      
        1    0.000    0.000    0.081    0.081 D:\Tool\Python311\Lib\multiprocessing\context.py:115(Pool)

```

<span style="background-color: blue;">
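The tokenizer was trained in about 3.5 minutes (roughly 212 seconds, or about 0.06 hours: 189.14 seconds of pre-tokenization plus 22.63 seconds of merging) with a peak memory usage of 136.51 MB, both well within the resource limits. The longest token is a 512-byte run of the character 's', which is not a meaningful word; it presumably reflects a long repeated-character string in the training data rather than useful vocabulary, so it makes little linguistic sense.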
</span>

Profile your code. What part of the tokenizer training process takes the most time?

**Deliverable**: A one-to-two sentence response.

<span style="background-color: blue;">
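The pre-tokenization step (the `pretokenize` function) is by far the most time-consuming part of training, taking approximately 189 seconds of the 212-second total (about 89%), compared with about 22 seconds for merging. Nearly all of that time shows up in the parent process as waiting on `multiprocessing.Pool.map`.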
</span>
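A profile like the one above can be collected with the standard `cProfile` and `pstats` modules. The helper below is a sketch with a hypothetical name, `profile_call`; it is not the code that produced the table. One caveat worth keeping in mind: `cProfile` only observes the process it runs in, so work done inside `multiprocessing` workers appears in the parent as time spent waiting on a lock.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=20, **kwargs):
    # Run func under the profiler and return its result plus a text report
    # sorted by cumulative time, limited to the `top` entries.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


result, report = profile_call(sum, range(1000), top=5)
print(report)
```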

### Problem (train_bpe_expts_owt): BPE Training on OpenWebText (2 points)

Train a byte-level BPE tokenizer on the OpenWebText dataset, using a maximum vocabulary
size of 32,000. Serialize the resulting vocabulary and merges to disk for further inspection. What
is the longest token in the vocabulary? Does it make sense?

**Resource requirements**: ≤12 hours (no GPUs), ≤ 100GB RAM

**Deliverable**: A one-to-two sentence response.

| uv run -m cs336_basics.tokenizer "data/owt_train.txt" --vocab_size 32000

``` PowerShell

```

Compare and contrast the tokenizer that you get training on TinyStories versus OpenWebText.

**Deliverable**: A one-to-two sentence response.

### Problem (tokenizer): Implementing the tokenizer (15 points)

**Deliverable**: Implement a Tokenizer class that, given a vocabulary and a list of merges, encodes
text into integer IDs and decodes integer IDs into text. Your tokenizer should also support user-provided
special tokens (appending them to the vocabulary if they aren’t already there). We recommend the
following interface:

`def __init__(self, vocab, merges, special_tokens=None)` Construct a tokenizer from a given vocabulary, list of merges, and (optionally) a list of special tokens. This function should accept the following parameters:

|vocab: dict[int, bytes]

|merges: list[tuple[bytes, bytes]]

|special_tokens: list[str] | None = None

`def from_files(cls, vocab_filepath, merges_filepath, special_tokens=None)` Class method that constructs and returns a Tokenizer from a serialized vocabulary and list of merges (in the same format that your BPE training code output) and (optionally) a list of special tokens. This method should accept the following additional parameters:

|vocab_filepath: str

|merges_filepath: str

|special_tokens: list[str] | None = None

`def encode(self, text: str) -> list[int]` Encode an input text into a sequence of token IDs.

`def encode_iterable(self, iterable: Iterable[str]) -> Iterator[int]` Given an iterable of strings (e.g., a Python file handle), return a generator that lazily yields token IDs. This is required for memory-efficient tokenization of large files that we cannot directly load into memory.

`def decode(self, ids: list[int]) -> str` Decode a sequence of token IDs into text.

To test your Tokenizer against our provided tests, you will first need to implement the test adapter at `[adapters.get_tokenizer]`. Then, run `uv run pytest tests/test_tokenizer.py`. Your implementation should be able to pass all tests.
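The core of `encode` is applying the learned merges, in creation order, to the bytes of each pre-token. The sketch below shows only that core plus a matching `decode`; the free-function names are hypothetical, special-token handling and pre-tokenization are left out, and replacing malformed bytes with `errors="replace"` during decoding is a design choice assumed here.

```python
def apply_merges(pretoken: bytes, merges):
    # Start from single bytes and apply each merge in the order it was learned.
    parts = [bytes([b]) for b in pretoken]
    for left, right in merges:
        merged, i = [], 0
        while i < len(parts):
            if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts


def encode_pretoken(pretoken: str, vocab, merges):
    # Invert the id -> bytes vocabulary to look up token IDs.
    token_to_id = {tok: idx for idx, tok in vocab.items()}
    return [token_to_id[p] for p in apply_merges(pretoken.encode("utf-8"), merges)]


def decode(ids, vocab):
    # Concatenate token bytes first, then decode once, so that multi-byte
    # characters split across tokens are reassembled correctly.
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Toy vocabulary: 256 single bytes plus two merged tokens.
vocab = {i: bytes([i]) for i in range(256)}
vocab[256] = b"th"
vocab[257] = b"the"
merges = [(b"t", b"h"), (b"th", b"e")]

ids = encode_pretoken("then", vocab, merges)
print(ids, decode(ids, vocab))
```

Looping over every merge for every pre-token is simple but slow for large merge lists; caching results per pre-token or ranking candidate pairs by merge index are common ways to speed this up.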

In [2]:
uv run pytest tests/test_tokenizer.py

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 25 items

tests/test_tokenizer.py::test_roundtrip_empty PASSED
tests/test_tokenizer.py::test_empty_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_single_character PASSED
tests/test_tokenizer.py::test_single_character_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_single_unicode_character PASSED
tests/test_tokenizer.py::test_single_unicode_character_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_ascii_string PASSED
tests/test_tokenizer.py::test_ascii_string_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_unicode_string PASSED
tests/test_tokenizer.py::test_unicode_string_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_unicode_string_with_special_tokens 

### Problem (tokenizer_experiments): Experiments with tokenizers (4 points)

Sample 10 documents from TinyStories and OpenWebText. Using your previously-trained TinyStories and OpenWebText tokenizers (10K and 32K vocabulary size, respectively), encode these sampled documents into integer IDs. What is each tokenizer’s compression ratio (bytes/token)?

**Deliverable**: A one-to-two sentence response.
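The compression ratio asked for here is the number of UTF-8 bytes divided by the number of tokens. A minimal sketch follows; the function name is hypothetical, and a byte-level stand-in (one token per byte) is used only so the example runs without a trained tokenizer.

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    # Bytes per token: higher means the tokenizer compresses the text better.
    return len(text.encode("utf-8")) / len(token_ids)


sample = "Once upon a time"

# Stand-in "tokenizer": one token per byte, which gives a ratio of exactly 1.0.
byte_level_ids = list(sample.encode("utf-8"))
print(compression_ratio(sample, byte_level_ids))
```

With a trained tokenizer, `token_ids` would be the output of its `encode` method on each sampled document.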

What happens if you tokenize your OpenWebText sample with the TinyStories tokenizer? Compare the compression ratio and/or qualitatively describe what happens.

**Deliverable**: A one-to-two sentence response.

Estimate the throughput of your tokenizer (e.g., in bytes/second). How long would it take to tokenize the Pile dataset (825GB of text)?

**Deliverable**: A one-to-two sentence response.
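The estimate reduces to dividing the dataset size by the measured throughput. The sketch below shows the arithmetic; the function names are hypothetical, and the 1 MB/s figure is a placeholder rather than a measurement of this tokenizer.

```python
import time

PILE_BYTES = 825 * 10**9  # 825 GB of text, as stated in the problem


def measure_throughput(encode, text: str) -> float:
    # Bytes of UTF-8 input processed per second by the given encode function.
    start = time.perf_counter()
    encode(text)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return len(text.encode("utf-8")) / elapsed


def hours_to_tokenize(total_bytes: float, bytes_per_second: float) -> float:
    return total_bytes / bytes_per_second / 3600


# Placeholder throughput of 1 MB/s (an assumption, not a measurement).
print(f"{hours_to_tokenize(PILE_BYTES, 1e6):.1f} hours")
```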

Using your TinyStories and OpenWebText tokenizers, encode the respective training and development datasets into a sequence of integer token IDs. We’ll use this later to train our language model. We recommend serializing the token IDs as a NumPy array of datatype uint16. Why is uint16 an appropriate choice?

**Deliverable**: A one-to-two sentence response.
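The reasoning behind uint16 is a range check: token IDs run from 0 to vocabulary size minus one, and a 16-bit unsigned integer holds values up to 65,535. The sketch below verifies this for both vocabulary sizes using only the standard library; `array("H", ...)` is used as a stand-in for a NumPy uint16 array, which is an assumption made to keep the example dependency-free.

```python
from array import array

UINT16_MAX = 2**16 - 1  # largest value a 16-bit unsigned integer can hold

for vocab_size in (10_000, 32_000):
    largest_id = vocab_size - 1
    print(vocab_size, largest_id <= UINT16_MAX)

# Typecode "H" is an unsigned short (2 bytes on common platforms), so each
# token ID costs half the space of a 32-bit integer.
ids = array("H", [0, 9_999, 31_999])
print(ids.itemsize, "bytes per token ID")
```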

---

## 3. Transformer Language Model Architecture

Maybe useful: https://zhuanlan.zhihu.com/p/1927015802915263462

---

## 4. Training a Transformer LM

## 5. Training loop

## 6. Generating text

## 7. Experiments