## 2. Byte-Pair Encoding (BPE) Tokenizer

Maybe useful: https://zhuanlan.zhihu.com/p/1927397109025473129

### Problem (unicode1): Understanding Unicode (1 point)

In [1]:
ord('牛')

29275

In [2]:
chr(29275)

'牛'

(a) What Unicode character does chr(0) return?

**Deliverable**: A one-sentence response.

<span style="background-color: #29B6F6; color: black">null character</span>

In [3]:
chr(0)

'\x00'

(b) How does this character’s string representation (`__repr__()`) differ from its printed representation?

**Deliverable**: A one-sentence response.

<span style="background-color: #29B6F6; color: black">
The `__repr__()` of the null character is the escaped form `'\x00'` (quotes included), whereas printing it writes the raw, non-printable character itself, so nothing visible appears.
</span>

(c) What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations: 

```python
>>> chr(0)
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
>>> print("this is a test" + chr(0) + "string")
```

**Deliverable**: A one-sentence response

In [4]:
chr(0)
print(chr(0))
"this is a test" + chr(0) + "string"
print("this is a test" + chr(0) + "string")

 
this is a test string


<span style="background-color: #29B6F6; color: black">
The null character is embedded in the string as an ordinary code point (the `repr` of the string shows `\x00`), but it is typically invisible when printed.
</span>
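
A minimal sketch of checking this in the interpreter (the variable names are illustrative, not from the assignment): `repr` exposes the escape sequence, while the character itself still occupies one position in the string.

```python
null_char = chr(0)
text = "this is a test" + null_char + "string"

# repr() shows the escape sequence; print() would write the raw character.
print(repr(null_char))    # '\x00'
print(repr(text))         # 'this is a test\x00string'

# The null character is a real code point inside the string.
print(len(text))          # 21
print(null_char in text)  # True
```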

### Problem (unicode2): Unicode Encodings (3 points)

(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than
UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various
input strings.

**Deliverable**: A one-to-two sentence response.

In [7]:
test_string = "hello! こんにちは!"
utf8_encoded = test_string.encode("utf-8")
print(utf8_encoded)
print(type(utf8_encoded))
print("UTF-8 value: ", ", ".join(map(str, list(utf8_encoded))))
utf16_encoded = test_string.encode("utf-16")
print(utf16_encoded)
print(type(utf16_encoded))
print("UTF-16 value: ", ", ".join(map(str, list(utf16_encoded))))
utf32_encoded = test_string.encode("utf-32")
print(utf32_encoded)
print(type(utf32_encoded))
print("UTF-32 value: ", ", ".join(map(str, list(utf32_encoded))))

b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'
<class 'bytes'>
UTF-8 value:  104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33
b'\xff\xfeh\x00e\x00l\x00l\x00o\x00!\x00 \x00S0\x930k0a0o0!\x00'
<class 'bytes'>
UTF-16 value:  255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0, 32, 0, 83, 48, 147, 48, 107, 48, 97, 48, 111, 48, 33, 0
b'\xff\xfe\x00\x00h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00 \x00\x00\x00S0\x00\x00\x930\x00\x00k0\x00\x00a0\x00\x00o0\x00\x00!\x00\x00\x00'
<class 'bytes'>
UTF-32 value:  255, 254, 0, 0, 104, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0, 33, 0, 0, 0, 32, 0, 0, 0, 83, 48, 0, 0, 147, 48, 0, 0, 107, 48, 0, 0, 97, 48, 0, 0, 111, 48, 0, 0, 33, 0, 0, 0


<span style="background-color: #29B6F6; color: black">
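UTF-8 encodes ASCII in one byte per character and is the most compact of the three for this sample: 23 bytes, versus 28 for UTF-16 and 56 for UTF-32 (the latter two include a byte-order mark). UTF-16 and UTF-32 also pad most characters with zero bytes, which lengthens the byte sequences the tokenizer has to merge.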
</span>
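
To compare the encodings on more than one input, a small helper like the following may be useful. It is only a sketch: `encoded_lengths` and the sample strings are illustrative, and the `-le` codec variants are used so that the byte-order mark does not inflate the counts.

```python
def encoded_lengths(text: str) -> dict[str, int]:
    """Return the number of bytes each encoding needs for `text`."""
    # The "-le" variants omit the byte-order mark that "utf-16"/"utf-32" prepend.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for sample in ["hello!", "こんにちは", "hello! こんにちは!"]:
    print(repr(sample), encoded_lengths(sample))
```

Note that for purely Japanese text UTF-16 is shorter than UTF-8, so the size advantage of UTF-8 depends on how much of the data is ASCII.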

(b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into
a Unicode string. Why is this function incorrect? Provide an example of an input byte string
that yields incorrect results.
``` python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
>>> decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))
'hello'
```

**Deliverable**: An example input byte string for which `decode_utf8_bytes_to_str_wrong` produces incorrect output, with a one-sentence explanation of why the function is incorrect.

In [12]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
test_string = "hello! こんにちは!"
decode_utf8_bytes_to_str_wrong(test_string.encode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: unexpected end of data

<span style="background-color: #29B6F6; color: black">
Example input: `"hello! こんにちは!".encode("utf-8")`. The function decodes each byte on its own, so it raises a `UnicodeDecodeError` as soon as it reaches a multi-byte UTF-8 character (here the lead byte `0xe3`), because such characters cannot be decoded one byte at a time.
</span>
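
For contrast, here is a sketch of a version that decodes the whole buffer at once, next to the incorrect function from the problem. The name `decode_utf8_bytes_to_str` is my own and not part of the assignment.

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the whole buffer at once so multi-byte sequences stay intact.
    return bytestring.decode("utf-8")

def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Incorrect version from the problem statement: decodes byte by byte.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

data = "牛".encode("utf-8")  # a single character encoded as three bytes
correct = decode_utf8_bytes_to_str(data)

try:
    decode_utf8_bytes_to_str_wrong(data)
    wrong_version_failed = False
except UnicodeDecodeError:
    wrong_version_failed = True

print(len(data), correct, wrong_version_failed)
```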

(c) Give a two byte sequence that does not decode to any Unicode character(s).

**Deliverable**: An example, with a one-sentence explanation.

In [18]:
invalid_bytes = b'\xc2\x00'
print(f"Attempting to decode: {invalid_bytes!r}")
decoded_string = invalid_bytes.decode("utf-8")
decoded_string

Attempting to decode: b'\xc2\x00'


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

<span style="background-color: #29B6F6; color: black">
`b'\xc2\x00'`: `0xc2` is a valid lead byte for a two-byte UTF-8 sequence, but `0x00` is not a valid continuation byte (continuation bytes lie in the range `0x80` to `0xBF`).
</span>
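
A small helper makes it easy to test other candidate byte sequences. This is a sketch; `is_valid_utf8` is a hypothetical name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\xc2\x00"))  # False: 0x00 is not a continuation byte
print(is_valid_utf8(b"\xc2\xa9"))  # True: this is the copyright sign
```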

In [22]:
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# requires `regex` package
import regex as re
re.findall(PAT, "some text that i'll pre-tokenize")

['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']

### Problem (train_bpe): BPE Tokenizer Training (15 points)

**Deliverable**: Write a function that, given a path to an input text file, trains a (byte-level) BPE
tokenizer. Your BPE training function should handle (at least) the following input parameters:

`input_path`: str Path to a text file with BPE tokenizer training data.

`vocab_size`: int A positive integer that defines the maximum final vocabulary size (including the
initial byte vocabulary, vocabulary items produced from merging, and any special tokens).

`special_tokens`: list[str] A list of strings to add to the vocabulary. These special tokens do not
otherwise affect BPE training.

Your BPE training function should return the resulting vocabulary and merges:

`vocab`: dict[int, bytes] The tokenizer vocabulary, a mapping from int (token ID in the vocabulary) to bytes (token bytes).

`merges`: list[tuple[bytes, bytes]] A list of BPE merges produced from training. Each list item
is a tuple of bytes (<token1>, <token2>), representing that <token1> was merged with
<token2>. The merges should be ordered by order of creation.

To test your BPE training function against our provided tests, you will first need to implement the
test adapter at `[adapters.run_train_bpe]`. Then, run `uv run pytest tests/test_train_bpe.py`.
Your implementation should be able to pass all tests. Optionally (this could be a large time-investment),
you can implement the key parts of your training method using some systems language, for instance
`C++` (consider `cppyy` for this) or `Rust` (using `PyO3`). If you do this, be aware of which operations
require copying vs reading directly from Python memory, and make sure to leave build instructions, or
make sure it builds using only `pyproject.toml`. Also note that the GPT-2 regex is not well-supported
in most regex engines and will be too slow in most that do. We have verified that Oniguruma is
reasonably fast and supports negative lookahead, but the `regex` package in Python is, if anything,
even faster.
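
The training loop itself can be sketched in a few lines. This is a deliberately naive illustration, not the implementation used for the results below, and it makes several assumptions: it takes the text directly instead of an `input_path`, it uses a simple whitespace pattern from the standard `re` module in place of the GPT-2 regex, it places special tokens first in the vocabulary, it breaks ties by preferring the lexicographically greater pair, and it recounts all pairs on every iteration (so it is slow). It also does not split the text on special tokens.

```python
import re
from collections import Counter

def train_bpe_sketch(text: str, vocab_size: int, special_tokens: list[str]):
    # Initial vocabulary: special tokens (placed first here by choice), then all 256 bytes.
    vocab: dict[int, bytes] = {}
    for token in special_tokens:
        vocab[len(vocab)] = token.encode("utf-8")
    for byte in range(256):
        vocab[len(vocab)] = bytes([byte])

    # Simplified pre-tokenization (assumption): optional leading space plus a non-space run.
    pretoken_counts = Counter(re.findall(r" ?\S+|\s+", text))
    words = [
        [[bytes([b]) for b in word.encode("utf-8")], count]
        for word, count in pretoken_counts.items()
    ]

    merges: list[tuple[bytes, bytes]] = []
    while len(vocab) < vocab_size:
        # Count adjacent pairs, weighted by how often each pre-token occurs.
        pair_counts: Counter = Counter()
        for tokens, count in words:
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # nothing left to merge

        # Most frequent pair; ties go to the lexicographically greater pair (assumption).
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merged = best[0] + best[1]
        merges.append(best)
        vocab[len(vocab)] = merged

        # Replace every occurrence of the pair inside each pre-token.
        for entry in words:
            tokens, out, i = entry[0], [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            entry[0] = out
    return vocab, merges

vocab, merges = train_bpe_sketch("low low low lower lowest", 260, ["<|endoftext|>"])
print(merges)
```

A faster implementation would update pair counts incrementally after each merge instead of recounting from scratch.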

In [4]:
uv run pytest tests/test_train_bpe.py

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 3 items

tests/test_train_bpe.py::test_train_bpe_speed PASSED
tests/test_train_bpe.py::test_train_bpe PASSED
tests/test_train_bpe.py::test_train_bpe_special_tokens PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (train_bpe_tinystories): BPE Training on TinyStories (2 points)

Train a byte-level BPE tokenizer on the TinyStories dataset, using a maximum vocabulary size
of 10,000. Make sure to add the TinyStories `<|endoftext|>` special token to the vocabulary.
Serialize the resulting vocabulary and merges to disk for further inspection. How many hours
and memory did training take? What is the longest token in the vocabulary? Does it make sense?
Resource requirements: ≤30 minutes (no GPUs), ≤ 30GB RAM

**Hint** You should be able to get under 2 minutes for BPE training using multiprocessing during
pretokenization and the following two facts:

(a) The `<|endoftext|>` token delimits documents in the data files.

(b) The `<|endoftext|>` token is handled as a special case before the BPE merges are applied.

**Deliverable:** A one-to-two sentence response.
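
The two facts in the hint suggest splitting the corpus on the special token before pre-tokenizing, so that no merge can cross a document boundary and each piece can be handed to a separate worker. A minimal sketch of the splitting step, using only the standard library (`split_on_special_tokens` is a hypothetical helper name):

```python
import re

def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
    """Split `text` at special tokens, dropping the tokens themselves."""
    if not special_tokens:
        return [text]
    # Longest first so overlapping special tokens match greedily;
    # re.escape is needed because '|' is a regex metacharacter.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in ordered)
    return [piece for piece in re.split(pattern, text) if piece]

docs = split_on_special_tokens(
    "Once upon a time.<|endoftext|>The end.", ["<|endoftext|>"]
)
print(docs)  # ['Once upon a time.', 'The end.']
```

Each returned piece can then be pre-tokenized independently, for example by mapping a worker function over the pieces with a process pool.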

| uv run -m cs336_basics.tokenizer "data/TinyStoriesV2-GPT4-train.txt" --vocab_size 10000

``` PowerShell
==================================================
🚀 Initializing BPETrainer...
Configuration:
  - Vocab Size: 10000
  - Special Tokens: ['<|endoftext|>']
  - Input File: data/TinyStoriesV2-GPT4-train.txt
==================================================

Step 1: Pre-tokenizing input file...
✅ Pre-tokenization complete in 189.14 seconds.
   Found 60006 unique pre-tokenized words/chunks.

Step 2: Learning BPE merges...
✅ Merging complete in 22.63 seconds.
   Learned 9743 merges. Final vocab size: 10000
   Longest token has 512 bytes.
   Longest token content: 'ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss'

Step 3: Saving vocabulary and merges...
   Vocabulary saved to data/TinyStoriesV2-GPT4-train-vocab_size_10000-vocab.json
   Merges saved to data/TinyStoriesV2-GPT4-train-vocab_size_10000-merges.txt

🎉 Training complete!
==================================================

--- Resource Usage ---
✅ Peak memory usage: 136.51 MB

--- CProfile Performance Analysis (Top 20) ---
         18090141 function calls (18090068 primitive calls) in 211.826 seconds

   Ordered by: cumulative time
   List reduced from 410 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.227    0.227  189.104  189.104 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:80(pretokenize)   
        4    0.000    0.000  188.790   47.197 D:\Tool\Python311\Lib\threading.py:604(wait)
        4    0.000    0.000  188.790   47.197 D:\Tool\Python311\Lib\threading.py:288(wait)
       20  188.790    9.439  188.790    9.439 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000  188.789  188.789 D:\Tool\Python311\Lib\multiprocessing\pool.py:362(map)
        1    0.000    0.000  188.788  188.788 D:\Tool\Python311\Lib\multiprocessing\pool.py:767(get)
        1    0.000    0.000  188.788  188.788 D:\Tool\Python311\Lib\multiprocessing\pool.py:764(wait)
        1    5.626    5.626   22.232   22.232 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:153(merge)        
   574642    5.532    0.000    9.659    0.000 {built-in method _heapq.heappop}
   818506    4.868    0.000    6.453    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:187(update_stats) 
 11978239    4.527    0.000    4.527    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:30(__lt__)        
   789060    0.810    0.000    1.210    0.000 {built-in method _heapq.heappush}
    49746    0.353    0.000    0.353    0.000 {method 'write' of '_io.TextIOWrapper' objects}
        1    0.025    0.025    0.308    0.308 D:\Tool\Python311\Lib\json\__init__.py:120(dump)
   789060    0.226    0.000    0.226    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:24(__init__)      
   784684    0.161    0.000    0.161    0.000 {method 'add' of 'set' objects}
   293471    0.143    0.000    0.143    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:16(count)
   574696    0.120    0.000    0.120    0.000 {method 'get' of 'dict' objects}
   435323    0.082    0.000    0.082    0.000 E:\Code\CS336\assignment1-basics\cs336_basics\tokenizer.py:10(__init__)      
        1    0.000    0.000    0.081    0.081 D:\Tool\Python311\Lib\multiprocessing\context.py:115(Pool)

```

<span style="background-color: #29B6F6; color: black">
Training took about 3.5 minutes in total (211.8 seconds: 189.14 seconds of pre-tokenization plus 22.63 seconds of merging) with a peak memory usage of 136.51 MB, both well within the resource limits. The longest token is a 512-byte run of the repeated character 's'.
</span>

Profile your code. What part of the tokenizer training process takes the most time?

**Deliverable**: A one-to-two sentence response.

<span style="background-color: #29B6F6; color: black">
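The pre-tokenization step (the `pretokenize` function) is overwhelmingly the most time-consuming part of training, taking approximately 189 seconds of the 211.8-second total; nearly all of that time is the main process waiting on the multiprocessing pool.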
</span>
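
A report in the same format as the one above can be produced with the standard `cProfile` and `pstats` modules. The wrapper below is a sketch; `profile_call` is a hypothetical helper, and the call to `sorted` only stands in for the real training function.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, top: int = 20) -> str:
    """Run `func(*args)` under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()

# Stand-in workload; replace with the training entry point.
report = profile_call(sorted, list(range(1000, 0, -1)))
print(report)
```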

### Problem (train_bpe_expts_owt): BPE Training on OpenWebText (2 points)

Train a byte-level BPE tokenizer on the OpenWebText dataset, using a maximum vocabulary
size of 32,000. Serialize the resulting vocabulary and merges to disk for further inspection. What
is the longest token in the vocabulary? Does it make sense?

**Resource requirements**: ≤12 hours (no GPUs), ≤ 100GB RAM

**Deliverable**: A one-to-two sentence response.

| uv run -m cs336_basics.tokenizer "data/owt_train.txt" --vocab_size 32000

``` PowerShell

```

Compare and contrast the tokenizer that you get training on TinyStories versus OpenWebText.

**Deliverable**: A one-to-two sentence response.

### Problem (tokenizer): Implementing the tokenizer (15 points)

**Deliverable**: Implement a Tokenizer class that, given a vocabulary and a list of merges, encodes
text into integer IDs and decodes integer IDs into text. Your tokenizer should also support user-provided
special tokens (appending them to the vocabulary if they aren’t already there). We recommend the
following interface:

`def __init__(self, vocab, merges, special_tokens=None)` Construct a tokenizer from a given vocabulary, list of merges, and (optionally) a list of special tokens. This function should accept the following parameters:

``` python
vocab: dict[int, bytes]
merges: list[tuple[bytes, bytes]]
special_tokens: list[str] | None = None
```

`def from_files(cls, vocab_filepath, merges_filepath, special_tokens=None)` Class method that constructs and returns a Tokenizer from a serialized vocabulary and list of merges (in the same format that your BPE training code output) and (optionally) a list of special tokens. This method should accept the following additional parameters:

``` python
vocab_filepath: str
merges_filepath: str
special_tokens: list[str] | None = None
```

`def encode(self, text: str) -> list[int]` Encode an input text into a sequence of token IDs.

`def encode_iterable(self, iterable: Iterable[str]) -> Iterator[int]` Given an iterable of strings (e.g., a Python file handle), return a generator that lazily yields token IDs. This is required for memory-efficient tokenization of large files that we cannot directly load into memory.

`def decode(self, ids: list[int]) -> str` Decode a sequence of token IDs into text.

To test your Tokenizer against our provided tests, you will first need to implement the test adapter at `[adapters.get_tokenizer]`. Then, run `uv run pytest tests/test_tokenizer.py`. Your implementation should be able to pass all tests.
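
The core of `encode` and `decode` can be sketched independently of the class. The snippet below is an illustration under simplifying assumptions: it handles a single pre-token, ignores special tokens, and uses a tiny hand-built vocabulary; `apply_merges` and `decode_ids` are hypothetical helper names.

```python
def apply_merges(pretoken: bytes, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Apply BPE merges to one pre-token, in the order the merges were learned."""
    tokens = [bytes([b]) for b in pretoken]
    for left, right in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

def decode_ids(ids: list[int], vocab: dict[int, bytes]) -> str:
    # Concatenate bytes first; malformed sequences become U+FFFD instead of raising.
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Tiny illustrative vocabulary: 256 single bytes plus two merged tokens.
vocab = {i: bytes([i]) for i in range(256)}
merges = [(b"t", b"h"), (b"th", b"e")]
vocab[256], vocab[257] = b"th", b"the"
token_to_id = {token: idx for idx, token in vocab.items()}

ids = [token_to_id[token] for token in apply_merges(b"the", merges)]
print(ids, decode_ids(ids, vocab))
```

Decoding with `errors="replace"` matters because an arbitrary sequence of token IDs need not concatenate to valid UTF-8.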

In [2]:
uv run pytest tests/test_tokenizer.py

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 25 items

tests/test_tokenizer.py::test_roundtrip_empty PASSED
tests/test_tokenizer.py::test_empty_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_single_character PASSED
tests/test_tokenizer.py::test_single_character_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_single_unicode_character PASSED
tests/test_tokenizer.py::test_single_unicode_character_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_ascii_string PASSED
tests/test_tokenizer.py::test_ascii_string_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_unicode_string PASSED
tests/test_tokenizer.py::test_unicode_string_matches_tiktoken PASSED
tests/test_tokenizer.py::test_roundtrip_unicode_string_with_special_tokens 

### Problem (tokenizer_experiments): Experiments with tokenizers (4 points)

Sample 10 documents from TinyStories and OpenWebText. Using your previously-trained TinyStories and OpenWebText tokenizers (10K and 32K vocabulary size, respectively), encode these sampled documents into integer IDs. What is each tokenizer’s compression ratio (bytes/token)?

**Deliverable**: A one-to-two sentence response.
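
The compression ratio is the number of UTF-8 bytes in the text divided by the number of tokens it encodes to. A minimal sketch, where the token IDs are made up purely for illustration:

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """Bytes of UTF-8 text per token."""
    return len(text.encode("utf-8")) / len(token_ids)

# Hypothetical encoding of an 11-byte string into 3 tokens.
example_ratio = compression_ratio("hello world", [101, 102, 103])
print(example_ratio)
```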

What happens if you tokenize your OpenWebText sample with the TinyStories tokenizer? Compare the compression ratio and/or qualitatively describe what happens.

**Deliverable**: A one-to-two sentence response.

Estimate the throughput of your tokenizer (e.g., in bytes/second). How long would it take to tokenize the Pile dataset (825GB of text)?

**Deliverable**: A one-to-two sentence response.
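
The time estimate is a single division once the throughput is measured. In the sketch below the throughput of 1 MB/s is a placeholder chosen only to show the arithmetic, not a measurement of this tokenizer.

```python
PILE_BYTES = 825 * 10**9  # 825 GB, as stated in the problem

def hours_to_tokenize(total_bytes: int, bytes_per_second: float) -> float:
    """Wall-clock hours needed at a constant throughput."""
    return total_bytes / bytes_per_second / 3600

# Placeholder throughput (assumption): 1,000,000 bytes per second.
example_hours = hours_to_tokenize(PILE_BYTES, 1_000_000)
print(f"{example_hours:.1f} hours")
```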

Using your TinyStories and OpenWebText tokenizers, encode the respective training and development datasets into a sequence of integer token IDs. We’ll use this later to train our language model. We recommend serializing the token IDs as a NumPy array of datatype uint16. Why is uint16 an appropriate choice?

**Deliverable**: A one-to-two sentence response.
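
Whether `uint16` is large enough comes down to comparing the largest token ID with the largest value 16 unsigned bits can hold. A quick check (the helper name is illustrative):

```python
UINT16_MAX = 2**16 - 1  # 65,535

def fits_in_uint16(vocab_size: int) -> bool:
    # Token IDs run from 0 to vocab_size - 1.
    return vocab_size - 1 <= UINT16_MAX

for size in (10_000, 32_000, 100_000):
    print(size, fits_in_uint16(size))
```

Both vocabulary sizes used here (10,000 and 32,000) fit, and each ID takes 2 bytes instead of the 4 or 8 bytes of a wider integer type.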

---

## 3. Transformer Language Model Architecture

Maybe useful: https://zhuanlan.zhihu.com/p/1927015802915263462

<div style="display: flex; justify-content: space-around; align-items: center;">
  <img src="data/transformer.png" alt="Transformer Image" style="width: 30%;">
  <img src="data/prenorm_transformer.png" alt="Pre-Norm Transformer Image" style="width: 65%;">
</div>

### Problem (linear): Implementing the linear module (1 point)

**Deliverable**: Implement a `Linear` class that inherits from `torch.nn.Module` and performs a linear transformation. Your implementation should follow the interface of PyTorch’s built-in `nn.Linear` module, except for not having a bias argument or parameter. We recommend the following interface:

`def __init__(self, in_features, out_features, device=None, dtype=None)` Construct a
linear transformation module. This function should accept the following parameters:
``` python
in_features: int final dimension of the input
out_features: int final dimension of the output
device: torch.device | None = None Device to store the parameters on
dtype: torch.dtype | None = None Data type of the parameters
```

`def forward(self, x: torch.Tensor) -> torch.Tensor` Apply the linear transformation to the input.

Make sure to:

- subclass `nn.Module`
- call the superclass constructor
- construct and store your parameter as $W$ (not $W^T$) for memory ordering reasons, putting it in an `nn.Parameter`
- of course, don’t use `nn.Linear` or `nn.functional.linear`

For initializations, use the settings from above along with `torch.nn.init.trunc_normal_` to
initialize the weights.
To test your Linear module, implement the test adapter at `[adapters.run_linear]`. The adapter
should load the given weights into your Linear module. You can use `Module.load_state_dict` for
this purpose. Then, run `uv run pytest -k test_linear`.
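
As a reference for the math only, here is a plain-Python sketch of the forward pass with $W$ stored as `(out_features, in_features)`, so that the output is $y = x W^\top$. It is not the PyTorch module, and `linear_forward` is a hypothetical name.

```python
def linear_forward(x: list[float], W: list[list[float]]) -> list[float]:
    # W has shape (out_features, in_features): one row per output feature.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # out_features=3, in_features=2
print(linear_forward([1.0, 2.0], W))      # [1.0, 2.0, 3.0]
```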

In [1]:
uv run pytest -k test_linear

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_linear PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (embedding): Implement the embedding module (1 point)

**Deliverable**: Implement the `Embedding` class that inherits from `torch.nn.Module` and performs an
embedding lookup. Your implementation should follow the interface of PyTorch’s built-in
`nn.Embedding` module. We recommend the following interface:

`def __init__(self, num_embeddings, embedding_dim, device=None, dtype=None)` Construct an embedding module. This function should accept the following parameters:

``` python
num_embeddings: int Size of the vocabulary
embedding_dim: int Dimension of the embedding vectors, i.e., d_model
device: torch.device | None = None Device to store the parameters on
dtype: torch.dtype | None = None Data type of the parameters
```

`def forward(self, token_ids: torch.Tensor) -> torch.Tensor` Lookup the embedding vectors for the given token IDs.

Make sure to:
- subclass `nn.Module`
- call the superclass constructor
- initialize your embedding matrix as a `nn.Parameter`
- store the embedding matrix with the `d_model` being the final dimension
- of course, don’t use `nn.Embedding` or `nn.functional.embedding`

Again, use the settings from above for initialization, and use `torch.nn.init.trunc_normal_` to
initialize the weights.

To test your implementation, implement the test adapter at `[adapters.run_embedding]`. Then, run
`uv run pytest -k test_embedding`.
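
An embedding lookup is row indexing into a `(num_embeddings, embedding_dim)` table, which gives the same result as multiplying a one-hot vector by that table. The plain-Python sketch below illustrates the equivalence; it is not the PyTorch module, and both function names are hypothetical.

```python
def embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    # Direct lookup: pick the row for each token ID.
    return [table[t] for t in token_ids]

def embed_via_one_hot(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    # Same result via a one-hot vector times the table (much more work).
    rows = []
    for t in token_ids:
        one_hot = [1.0 if i == t else 0.0 for i in range(len(table))]
        rows.append([
            sum(o * row[c] for o, row in zip(one_hot, table))
            for c in range(len(table[0]))
        ])
    return rows

table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 tokens, embedding_dim=2
print(embed([2, 0], table))
```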

In [2]:
uv run pytest -k test_embedding

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_embedding PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (rmsnorm): Root Mean Square Layer Normalization (1 point)

**Deliverable**: Implement `RMSNorm` as a `torch.nn.Module`. We recommend the following interface:

`def __init__(self, d_model: int, eps: float = 1e-5, device=None, dtype=None)` Construct the RMSNorm module. This function should accept the following parameters:

``` python
d_model: int Hidden dimension of the model
eps: float = 1e-5 Epsilon value for numerical stability
device: torch.device | None = None Device to store the parameters on
dtype: torch.dtype | None = None Data type of the parameters
```

`def forward(self, x: torch.Tensor) -> torch.Tensor` Process an input tensor of shape (`batch_size`, `sequence_length`, `d_model`) and return a tensor of the same shape.

**Note**: Remember to upcast your input to `torch.float32` before performing the normalization (and
later downcast to the original `dtype`), as described above.

To test your implementation, implement the test adapter at `[adapters.run_rmsnorm]`. Then, run `uv run pytest -k test_rmsnorm`.
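
For checking the arithmetic, RMSNorm on a single vector can be written in plain Python: divide each element by $\sqrt{\frac{1}{d_\text{model}}\sum_i x_i^2 + \varepsilon}$ and multiply by a learned gain. This sketch is not the PyTorch module (so the upcasting step does not appear), and `rmsnorm` here is an illustrative function name.

```python
import math

def rmsnorm(x: list[float], gain: list[float], eps: float = 1e-5) -> list[float]:
    # Root mean square over the last (d_model) dimension, with eps inside the root.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]

normalized = rmsnorm([3.0, 4.0], [1.0, 1.0])
print(normalized)
```

With unit gain, the mean of the squared outputs is approximately 1.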

In [3]:
uv run pytest -k test_rmsnorm

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_rmsnorm PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (positionwise_feedforward): Implement the position-wise feed-forward network (2 points)

**Deliverable**: Implement the `SwiGLU` feed-forward network, composed of a `SiLU` activation function and a `GLU`.

**Note**: in this particular case, you should feel free to use torch.sigmoid in your implementation for numerical stability.

You should set $d_\text{ff}$ to approximately $\frac{8}{3} \times d_\text{model}$ in your implementation, while ensuring that the dimensionality of the inner feed-forward layer is a multiple of 64 to make good use of your hardware. To test your implementation against our provided tests, you will need to implement the test adapter at `[adapters.run_swiglu]`. Then, run `uv run pytest -k test_swiglu` to test your implementation.
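
Two small pieces of this problem are easy to check in isolation: choosing the inner dimension and the SiLU activation, $\text{SiLU}(x) = x \cdot \sigma(x)$. The sketch below rounds $\frac{8}{3} d_\text{model}$ to the nearest multiple of 64, which is one reasonable reading of "approximately" (rounding up would be another); the function names are illustrative.

```python
import math

def swiglu_d_ff(d_model: int, multiple: int = 64) -> int:
    """Nearest multiple of `multiple` to 8/3 * d_model (at least one multiple)."""
    target = 8 * d_model / 3
    return max(multiple, round(target / multiple) * multiple)

def sigmoid(x: float) -> float:
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def silu(x: float) -> float:
    return x * sigmoid(x)

print(swiglu_d_ff(512), swiglu_d_ff(1600))
```

The gated feed-forward network then combines three weight matrices as $W_2(\text{SiLU}(W_1 x) \odot W_3 x)$.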

In [4]:
uv run pytest -k test_swiglu

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_swiglu PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (rope): Implement RoPE (2 points)

**Deliverable**: Implement a class `RotaryPositionalEmbedding` that applies RoPE to the input tensor.

The following interface is recommended:

`def __init__(self, theta: float, d_k: int, max_seq_len: int, device=None)` Construct the
RoPE module and create buffers if needed.

``` python
theta: float Θ value for the RoPE
d_k: int dimension of query and key vectors
max_seq_len: int Maximum sequence length that will be inputted
device: torch.device | None = None Device to store the buffer on
```

`def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor` Process an input tensor of shape `(..., seq_len, d_k)` and return a tensor of the same shape. Note that you should tolerate x with an arbitrary number of batch dimensions. You should assume that the token positions are a tensor of shape `(..., seq_len)` specifying the token positions of `x` along the sequence dimension.

You should use the token positions to slice your (possibly precomputed) cos and sin tensors along the sequence dimension.

To test your implementation, complete `[adapters.run_rope]` and make sure it passes `uv run pytest -k test_rope`.
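
The rotation itself can be illustrated on one vector in plain Python. This sketch assumes the convention in which adjacent elements `(x[2k], x[2k+1])` form a pair rotated by the angle `position / theta ** (2k / d_k)`; if your implementation pairs elements differently, the numbers will differ. `rope_rotate` is a hypothetical name, and a real module would precompute the cos and sin tables.

```python
import math

def rope_rotate(x: list[float], position: int, theta: float = 10000.0) -> list[float]:
    d_k = len(x)
    if d_k % 2 != 0:
        raise ValueError("d_k must be even")
    out = []
    for k in range(d_k // 2):
        angle = position / theta ** (2 * k / d_k)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * k], x[2 * k + 1]
        # 2-D rotation of the pair (a, b) by `angle`.
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

q, k_vec = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# The query-key dot product depends only on the relative offset (2 in both cases).
print(dot(rope_rotate(q, 3), rope_rotate(k_vec, 1)))
print(dot(rope_rotate(q, 5), rope_rotate(k_vec, 3)))
```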

In [5]:
uv run pytest -k test_rope

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_rope PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (softmax): Implement softmax (1 point)

**Deliverable**: Write a function to apply the `softmax` operation on a tensor. Your function should take two parameters: a tensor and a dimension $i$, and apply softmax to the $i$-th dimension of the input tensor. The output tensor should have the same shape as the input tensor, but its $i$-th dimension will now have a normalized probability distribution. Use the trick of subtracting the maximum value in the $i$-th dimension from all elements of the $i$-th dimension to avoid numerical stability issues.

To test your implementation, complete `[adapters.run_softmax]` and make sure it passes `uv run pytest -k test_softmax_matches_pytorch`.
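
The max-subtraction trick is easiest to see on a one-dimensional list. This plain-Python sketch is only a reference for the arithmetic; the tensor version applies the same steps along dimension $i$.

```python
import math

def softmax(values: list[float]) -> list[float]:
    # Subtracting the maximum leaves the result unchanged but keeps exp() from overflowing.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]; math.exp(1000.0) alone would overflow
```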


In [6]:
uv run pytest -k test_softmax_matches_pytorch

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_nn_utils.py::test_softmax_matches_pytorch PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (scaled_dot_product_attention): Implement scaled dot-product attention (5 points)

**Deliverable**: Implement the scaled dot-product attention function. Your implementation should handle keys and queries of shape `(batch_size, ..., seq_len, d_k)` and values of shape `(batch_size, ..., seq_len, d_v)`, where `...` represents any number of other batch-like dimensions (if provided). The implementation should return an output with the shape `(batch_size, ..., seq_len, d_v)`. See section 3.3 for a discussion on batch-like dimensions.

Your implementation should also support an optional user-provided boolean mask of shape `(seq_len, seq_len)`. The attention probabilities of positions with a mask value of `True` should collectively sum
to $1$, and the attention probabilities of positions with a mask value of `False` should be zero.
To test your implementation against our provided tests, you will need to implement the test adapter
at `[adapters.run_scaled_dot_product_attention]`.

`uv run pytest -k test_scaled_dot_product_attention` tests your implementation on third-order input tensors, while `uv run pytest -k test_4d_scaled_dot_product_attention` tests your implementation on fourth-order input tensors.
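
For a single sequence without batch dimensions, the computation $\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$ with a boolean mask can be sketched with nested lists. This is a reference for the logic only, not the batched tensor implementation; it assumes every mask row has at least one `True` entry.

```python
import math

def attention(Q, K, V, mask=None):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) as nested lists; mask: (n, m) of bools."""
    d_k = len(Q[0])
    output = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            # False means "do not attend": exp(-inf) contributes exactly zero.
            scores = [s if mask[i][j] else -math.inf for j, s in enumerate(scores)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append([
            sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))
        ])
    return output

identity = [[1.0, 0.0], [0.0, 1.0]]
causal = [[True, False], [True, True]]
result = attention(identity, identity, identity, mask=causal)
print(result)
```

With the causal mask, the first position can only attend to itself, so its output equals the first value row.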

In [10]:
uv run pytest -k test_scaled_dot_product_attention

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_scaled_dot_product_attention PASSED

Note: you may need to restart the kernel to use updated packages.


In [9]:
uv run pytest -k test_4d_scaled_dot_product_attention

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_4d_scaled_dot_product_attention PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (multihead_self_attention): Implement causal multi-head self-attention (5 points)

**Deliverable**: Implement causal multi-head self-attention as a `torch.nn.Module`. Your implementation should accept (at least) the following parameters:

``` python
d_model: int Dimensionality of the Transformer block inputs.
num_heads: int Number of heads to use in multi-head self-attention.
```

Following Vaswani et al. [2017], set $d_k = d_v = d_\text{model} / h$, where $h$ is the number of heads (`num_heads`). To test your implementation against our provided tests, implement the test adapter at `[adapters.run_multihead_self_attention]`. Then, run `uv run pytest -k test_multihead_self_attention` to test your implementation.
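
Two small ingredients of the module can be checked on their own: the per-head dimension and the causal mask. The sketch below uses the convention that `True` means "may attend"; the helper names are illustrative.

```python
def head_dim(d_model: int, num_heads: int) -> int:
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

def causal_mask(seq_len: int) -> list[list[bool]]:
    # Position i may attend to position j only if j <= i.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

print(head_dim(1600, 25))  # 64
for row in causal_mask(3):
    print(row)
```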

In [1]:
uv run pytest -k test_multihead_self_attention

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 46 deselected / 2 selected

tests/test_model.py::test_multihead_self_attention PASSED
tests/test_model.py::test_multihead_self_attention_with_rope PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (transformer_block): Implement the Transformer block (3 points)

Implement the pre-norm Transformer block as described in §3.5 and illustrated in Figure 2. Your Transformer block should accept (at least) the following parameters.

``` python
d_model: int Dimensionality of the Transformer block inputs.
num_heads: int Number of heads to use in multi-head self-attention.
d_ff: int Dimensionality of the position-wise feed-forward inner layer.
```

To test your implementation, implement the adapter `[adapters.run_transformer_block]`. Then run `uv run pytest -k test_transformer_block` to test your implementation.

**Deliverable**: Transformer block code that passes the provided tests.
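
In a pre-norm block, each sublayer sees a normalized copy of its input and its output is added back to the unnormalized residual stream: $y = x + \text{Attn}(\text{Norm}(x))$, then $z = y + \text{FFN}(\text{Norm}(y))$. The sketch below expresses only this wiring, with the sublayers passed in as plain callables on a single vector; it is not the PyTorch module, and the stand-in sublayers are arbitrary.

```python
from typing import Callable

Vector = list[float]
Sublayer = Callable[[Vector], Vector]

def pre_norm_block(x: Vector, norm1: Sublayer, attn: Sublayer,
                   norm2: Sublayer, ffn: Sublayer) -> Vector:
    # First residual branch: normalize, attend, add back to x.
    y = [a + b for a, b in zip(x, attn(norm1(x)))]
    # Second residual branch: normalize, feed-forward, add back to y.
    return [a + b for a, b in zip(y, ffn(norm2(y)))]

# Arbitrary stand-ins, used only to make the data flow visible.
no_norm = lambda v: v
double = lambda v: [2.0 * t for t in v]
ones = lambda v: [1.0 for _ in v]

block_output = pre_norm_block([1.0, 2.0], no_norm, double, no_norm, ones)
print(block_output)  # [4.0, 7.0]
```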

In [2]:
uv run pytest -k test_transformer_block

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 47 deselected / 1 selected

tests/test_model.py::test_transformer_block PASSED

Note: you may need to restart the kernel to use updated packages.


### Problem (transformer_lm): Implementing the Transformer LM (3 points)

Time to put it all together! Implement the Transformer language model as described in §3.1 and illustrated in Figure 1. At minimum, your implementation should accept all the aforementioned construction parameters for the Transformer block, as well as these additional parameters:

``` python
vocab_size: int The size of the vocabulary, necessary for determining the dimensionality of the token embedding matrix.
context_length: int The maximum context length, necessary for determining the dimensionality of the position embedding matrix.
num_layers: int The number of Transformer blocks to use.
```

To test your implementation against our provided tests, you will first need to implement the test adapter at `[adapters.run_transformer_lm]`. Then, run `uv run pytest -k test_transformer_lm` to test your implementation.

**Deliverable**: A Transformer LM module that passes the above tests.

In [3]:
uv run pytest -k test_transformer_lm

platform win32 -- Python 3.11.0rc2, pytest-8.3.5, pluggy-1.5.0
rootdir: e:\Code\CS336\assignment1-basics
configfile: pyproject.toml
plugins: jaxtyping-0.3.1
collected 48 items / 46 deselected / 2 selected

tests/test_model.py::test_transformer_lm PASSED
tests/test_model.py::test_transformer_lm_truncated_input PASSED



### Problem (transformer_accounting): Transformer LM resource accounting (5 points)

Consider GPT-2 XL, which has the following configuration:

``` python
vocab_size : 50,257
context_length : 1,024
num_layers : 48
d_model : 1,600
num_heads : 25
d_ff : 6,400
```

Suppose we constructed our model using this configuration. How many trainable parameters
would our model have? Assuming each parameter is represented using single-precision floating
point, how much memory is required to just load this model?

**Deliverable**: A one-to-two sentence response.

``` markdown
**token_embeddings**: vocab_size × d_model = 50,257 × 1,600 = 80,411,200
**note**: an embedding is a **lookup**, but it is mathematically equivalent to a **matrix multiplication** with a **one-hot encoded vector**.

**layers**:
- **attn**:
- - q_proj: d_model × d_model = 1,600 × 1,600 = 2,560,000
- - k_proj: d_model × d_model = 1,600 × 1,600 = 2,560,000
- - v_proj: d_model × d_model = 1,600 × 1,600 = 2,560,000
- - output_proj: d_model × d_model = 1,600 × 1,600 = 2,560,000
- **ffn**:
- - w1: d_model × d_ff = 1,600 × 6,400 = 10,240,000
- - w2: d_ff × d_model = 6,400 × 1,600 = 10,240,000
- - w3: d_model × d_ff = 1,600 × 6,400 = 10,240,000
- **ln1**: d_model = 1,600
- **ln2**: d_model = 1,600
- per-layer total: 40,963,200; × num_layers = 40,963,200 × 48 = 1,966,233,600

**ln_final**: d_model = 1,600

**lm_head**: vocab_size × d_model = 50,257 × 1,600 = 80,411,200
```

<span style="background-color: #29B6F6; color: black">
80,411,200 + 1,966,233,600 + 1,600 + 80,411,200 = 2,127,057,600 parameters; at 4 bytes each that is 8,508,230,400 bytes ≈ 8.51 GB (7.92 GiB)
</span>
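
The tally above can be reproduced with a few lines of Python. This is a minimal sketch under the same assumptions as the list: no bias terms, one gain vector per norm, and an LM head that is not tied to the token embedding. The function name `count_parameters` and the dictionary keys are chosen here for illustration.

```python
def count_parameters(vocab_size: int, num_layers: int, d_model: int, d_ff: int) -> dict:
    """Count trainable parameters, assuming no biases and an untied LM head."""
    per_layer = (
        4 * d_model * d_model  # q_proj, k_proj, v_proj, output_proj
        + 3 * d_model * d_ff   # w1, w2, w3
        + 2 * d_model          # ln1 and ln2 gains
    )
    counts = {
        "token_embeddings": vocab_size * d_model,
        "layers": num_layers * per_layer,
        "ln_final": d_model,
        "lm_head": vocab_size * d_model,
    }
    counts["total"] = sum(counts.values())
    return counts


gpt2_xl = count_parameters(vocab_size=50_257, num_layers=48, d_model=1_600, d_ff=6_400)
size_bytes = 4 * gpt2_xl["total"]  # float32 uses 4 bytes per parameter
print(f"{gpt2_xl['total']:,} parameters")
print(f"{size_bytes / 1e9:.2f} GB ({size_bytes / 2**30:.2f} GiB)")
```

Reporting both units avoids ambiguity: the 7.92 figure is in GiB (2^30 bytes), while the same size in decimal gigabytes is slightly larger.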

Identify the matrix multiplies required to complete a forward pass of our GPT-2 XL-shaped
model. How many `FLOPs` do these matrix multiplies require in total? Assume that our input
sequence has context_length tokens.

**Deliverable**: A list of matrix multiplies (with descriptions), and the total number of `FLOPs`
required.

``` markdown
**Transformer Block**
- MHA
- - q_proj: (1024, 1600) × (1600, 1600) → FLOPs: 2 × 1024 × 1600 × 1600 = 5,242,880,000
- - k_proj: (1024, 1600) × (1600, 1600) → FLOPs: 2 × 1024 × 1600 × 1600 = 5,242,880,000
- - v_proj: (1024, 1600) × (1600, 1600) → FLOPs: 2 × 1024 × 1600 × 1600 = 5,242,880,000
- - attention scores (Q × Kᵀ): (25, 1024, 64) × (25, 64, 1024) → FLOPs: 2 × 25 × 1024 × 64 × 1024 = 3,355,443,200
- - weighted values (attention × V): (25, 1024, 1024) × (25, 1024, 64) → FLOPs: 2 × 25 × 1024 × 1024 × 64 = 3,355,443,200
- - output_proj: (1024, 1600) × (1600, 1600) → FLOPs: 2 × 1024 × 1600 × 1600 = 5,242,880,000
- FFN
- - w1: (1024, 1600) × (1600, 6400) → FLOPs: 2 × 1024 × 1600 × 6400 = 20,971,520,000
- - w3: (1024, 1600) × (1600, 6400) → FLOPs: 2 × 1024 × 1600 × 6400 = 20,971,520,000
- - w2: (1024, 6400) × (6400, 1600) → FLOPs: 2 × 1024 × 6400 × 1600 = 20,971,520,000
- per-layer total: 90,596,966,400; × 48 layers = 4,348,654,387,200

**LM Head**: (1024, 1600) × (1600, 50257) → FLOPs: 2 × 1024 × 1600 × 50257 = 164,682,137,600
```

<span style="background-color: #29B6F6; color: black">
4,348,654,387,200 + 164,682,137,600 = 4,513,336,524,800 FLOPs (≈ 4.51 TFLOPs)
</span>
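
The list of matrix multiplies can be checked mechanically using the rule that an (m, k) × (k, n) product costs 2·m·k·n FLOPs. The sketch below mirrors the shapes listed above; the helper `matmul_flops` and the dictionary keys are names introduced here, not part of the assignment code.

```python
def matmul_flops(m: int, k: int, n: int, batch: int = 1) -> int:
    """FLOPs for `batch` independent products of an (m, k) matrix with a (k, n) matrix."""
    return 2 * batch * m * k * n


seq, d_model, d_ff, heads, layers, vocab = 1024, 1600, 6400, 25, 48, 50257
d_head = d_model // heads  # 64 for GPT-2 XL

per_layer = {
    "q_proj": matmul_flops(seq, d_model, d_model),
    "k_proj": matmul_flops(seq, d_model, d_model),
    "v_proj": matmul_flops(seq, d_model, d_model),
    "attention_scores": matmul_flops(seq, d_head, seq, batch=heads),  # Q x K^T per head
    "weighted_values": matmul_flops(seq, seq, d_head, batch=heads),   # attention x V per head
    "output_proj": matmul_flops(seq, d_model, d_model),
    "w1": matmul_flops(seq, d_model, d_ff),
    "w3": matmul_flops(seq, d_model, d_ff),
    "w2": matmul_flops(seq, d_ff, d_model),
}
layer_total = sum(per_layer.values())
lm_head = matmul_flops(seq, d_model, vocab)
total = layers * layer_total + lm_head

for name, flops in per_layer.items():
    print(f"{name:>16}: {flops:,}")
print(f"{'total':>16}: {total:,}")
```

Only matrix multiplies are counted; softmax, normalization, activation functions, and the embedding lookup are left out, as in the list above.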

Based on your analysis above, which parts of the model require the most `FLOPs`?

**Deliverable**: A one-to-two sentence response.

<span style="background-color: #29B6F6; color: black">
The FFN: its three matrix multiplies between d_model and the larger inner dimension d_ff (4 × d_model here) cost 62,914,560,000 FLOPs per layer, versus 27,682,406,400 for all of self-attention, so the FFN accounts for roughly two-thirds of the forward-pass FLOPs.
</span>
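
The FFN's dominance can also be read off a closed form. As a sketch, write $n$ for the sequence length and $d$ for d_model (both symbols are introduced here for illustration), and assume d_ff $= 4d$ as in the GPT-2 XL configuration. Per layer:

$$
\begin{aligned}
F_{\text{FFN}} &= 3 \cdot 2\,n\,d \cdot 4d = 24\,n\,d^2 \\
F_{\text{attn}} &= 4 \cdot 2\,n\,d^2 + 2 \cdot 2\,n^2 d = 8\,n\,d^2 + 4\,n^2 d \\
\frac{F_{\text{FFN}}}{F_{\text{FFN}} + F_{\text{attn}}} &= \frac{24\,n\,d^2}{32\,n\,d^2 + 4\,n^2 d} = \frac{6d}{8d + n}
\end{aligned}
$$

For $n = 1024$ and $d = 1600$ this gives $9600 / 13824 \approx 0.694$ of each layer's matmul FLOPs. The ratio rises toward $3/4$ as $d$ grows with $n$ fixed and falls as $n$ grows with $d$ fixed; the LM head is not included in this per-layer ratio.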

Repeat your analysis with GPT-2 small (12 layers, 768 d_model, 12 heads), GPT-2 medium (24
layers, 1024 d_model, 16 heads), and GPT-2 large (36 layers, 1280 d_model, 20 heads). As the
model size increases, which parts of the Transformer LM take up proportionally more or less of
the total `FLOPs`?

**Deliverable**: For each model, provide a breakdown of model components and its associated
`FLOPs` (as a proportion of the total `FLOPs` required for a forward pass). In addition, provide a
one-to-two sentence description of how varying the model size changes the proportional `FLOPs`
of each component.

Per-layer matmul FLOPs at context_length = 1,024, assuming d_ff = 4 × d_model for every model. Percentages are shares of the per-layer total, so the LM head is excluded.

| model            | d_model (d) | FFN FLOPs (per layer) | Attention FLOPs (per layer) | FFN %  | Attention % |
| ---------------- | ----------- | --------------------- | --------------------------- | ------ | ----------- |
| **GPT-2 small**  | 768         | 14,495,514,624        | 8,053,063,680               | ~64.3% | ~35.7%      |
| **GPT-2 medium** | 1024        | 25,769,803,776        | 12,884,901,888              | ~66.7% | ~33.3%      |
| **GPT-2 large**  | 1280        | 40,265,318,400        | 18,790,481,920              | ~68.2% | ~31.8%      |
| **GPT-2 XL**     | 1600        | 62,914,560,000        | 27,682,406,400              | ~69.4% | ~30.6%      |


<span style="background-color: #29B6F6; color: black">
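As the model grows, the FFN's share of per-layer FLOPs rises (from about 64% to about 69%) and self-attention's share falls, because the FFN and the attention projections scale with d_model², while the Q × Kᵀ and attention × V products scale only linearly with d_model at a fixed context length.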
</span>
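
The table compares FFN and attention within a single block. To express each component as a share of the whole forward pass, the LM head has to be added to the denominator. The sketch below does that for all four models; it assumes d_ff = 4 × d_model (only stated explicitly for GPT-2 XL), a shared vocabulary of 50,257, and a context length of 1,024. The names `MODELS` and `flop_shares` are introduced here for illustration.

```python
VOCAB_SIZE, SEQ_LEN = 50_257, 1_024

# (num_layers, d_model); d_ff = 4 * d_model is an assumption for small/medium/large.
MODELS = {
    "GPT-2 small": (12, 768),
    "GPT-2 medium": (24, 1024),
    "GPT-2 large": (36, 1280),
    "GPT-2 XL": (48, 1600),
}


def flop_shares(num_layers: int, d_model: int, seq_len: int = SEQ_LEN) -> dict:
    """Return each component's fraction of the total forward-pass matmul FLOPs."""
    d_ff = 4 * d_model
    flops = {
        # Four d_model x d_model projections plus Q x K^T and attention x V.
        "attention": num_layers * (8 * seq_len * d_model**2 + 4 * seq_len**2 * d_model),
        # Three d_model <-> d_ff projections (w1, w2, w3).
        "ffn": num_layers * 6 * seq_len * d_model * d_ff,
        # Applied once, after the final block.
        "lm_head": 2 * seq_len * d_model * VOCAB_SIZE,
    }
    total = sum(flops.values())
    return {name: value / total for name, value in flops.items()}


for name, (num_layers, d_model) in MODELS.items():
    shares = flop_shares(num_layers, d_model)
    print(name, {component: f"{share:.1%}" for component, share in shares.items()})
```

Including the LM head matters most for the smaller models: its cost is independent of the number of layers, so its share shrinks as depth and width increase.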

Take GPT-2 XL and increase the context length to 16,384. How does the total `FLOPs` for one
forward pass change? How do the relative contribution of `FLOPs` of the model components
change?

**Deliverable**: A one-to-two sentence response.

``` markdown
**Attention**: the four projections scale linearly with sequence length (16×), but Q × Kᵀ and attention × V scale quadratically (16² = 256×): 6,710,886,400 → 1,717,986,918,400 FLOPs per layer
**FFN**: scales linearly with sequence length, so FLOPs increase by 16 times
**Total**: ≈ 4.51 × 10^12 → ≈ 1.50 × 10^14 FLOPs, an increase of approximately 33 times

**FFN** no longer dominates: its share of the total falls from about 67% to about 32%, while the sequence-length-dependent products (Q × Kᵀ and attention × V) rise from about 7% to about 55%, because these operations are O(n^2) in sequence length while all other matrix multiplies are O(n).
```
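
A quick way to check how the total scales is to separate the terms that are linear in sequence length from the quadratic ones. The sketch below does this for the GPT-2 XL shape; `split_flops` is a name introduced here, and only matrix multiplies are counted.

```python
def split_flops(seq_len, num_layers=48, d_model=1600, d_ff=6400, vocab_size=50257):
    """Return (linear, quadratic) forward-pass matmul FLOPs with respect to seq_len."""
    linear = (
        # Attention projections and the three FFN matrices, per layer.
        num_layers * (8 * seq_len * d_model**2 + 6 * seq_len * d_model * d_ff)
        # LM head, applied once.
        + 2 * seq_len * d_model * vocab_size
    )
    # Q x K^T and attention x V, summed over heads and layers.
    quadratic = num_layers * 4 * seq_len**2 * d_model
    return linear, quadratic


short_lin, short_quad = split_flops(1_024)
long_lin, long_quad = split_flops(16_384)

ratio = (long_lin + long_quad) / (short_lin + short_quad)
short_share = short_quad / (short_lin + short_quad)
long_share = long_quad / (long_lin + long_quad)

print(f"total FLOPs grow {ratio:.1f}x")
print(f"quadratic share: {short_share:.1%} -> {long_share:.1%}")
```

The linear terms grow by exactly 16× and the quadratic terms by 256×, so the overall factor lies between the two and depends on how large the quadratic share was to begin with.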

---

## 4. Training a Transformer LM

Maybe useful: https://zhuanlan.zhihu.com/p/1927132746578366957

## 5. Training loop

---

## 6. Generating text

Maybe useful: https://zhuanlan.zhihu.com/p/1927311888515073257

## 7. Experiments

---