## Instruction

> 1. Rename the notebook to assignment-03-###-###.ipynb, replacing the first ### with your student ID and the second ### with your name (in Chinese).
> 2. The deadline for Assignment-03 is 23:59 on 04-30-2025.
> 3. Submit a single ZIP archive that includes every file you downloaded. You only need to modify transformer.py, run the Jupyter notebook, and save the resulting performance output inside the archive.
> 4. The primary goal of this assignment is to give you hands-on experience implementing a Transformer model.

## Task
> In this assignment, you will train a Transformer model to count letters. Given a string of characters, your task is to predict, for each position in the string, how many times the character at that position has occurred previously, capped at 2. This is a 3-class classification task (with labels 0, 1, or ≥ 2, which we’ll just denote as 2). The task is easy for a rule-based system but not so easy for a model to learn. However, Transformers are well suited to it, because self-attention lets each position “look back” and count occurrences in the context. Below is an example string (x), which ends in a trailing space, and its corresponding labels (y):
> - x: i like movies a lot
> - y: 00010010002102021102
>  
> If your implementation is correct, running `python letter_counting.py --task BEFORE` should reach an accuracy above 90%.
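
The labelling rule for the BEFORE task can be reproduced with a few lines of rule-based Python, which is handy for checking the example above by hand. This is only an illustrative sketch: the function name `before_labels` is hypothetical and is not part of the provided framework code.

```python
from collections import Counter


def before_labels(text: str, cap: int = 2) -> str:
    """Label each position with how often its character occurred earlier, capped at `cap`."""
    seen = Counter()  # running count of each character seen so far
    labels = []
    for ch in text:
        labels.append(min(seen[ch], cap))  # occurrences strictly before this position
        seen[ch] += 1
    return "".join(str(label) for label in labels)


# The example string from the task description, including its trailing space.
print(before_labels("i like movies a lot "))  # 00010010002102021102
```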


> We also present a modified version of this task in which each label counts the occurrences of the letter both before and after its position in the sequence (again capped at 2):
> - x: i like movies a lot
> - y: 22120120102102021102
>  
> If your implementation is correct, running `python letter_counting.py --task BEFOREAFTER` should likewise reach an accuracy above 90%.
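
For the BEFOREAFTER variant, the label at a position is the number of *other* occurrences of that character anywhere in the string, capped at 2. A rule-based sketch, again with a hypothetical function name (`before_after_labels`) that does not come from the framework code, might look like this:

```python
from collections import Counter


def before_after_labels(text: str, cap: int = 2) -> str:
    """Label each position with the number of other occurrences of its character, capped at `cap`."""
    totals = Counter(text)  # total count of each character in the whole string
    # Subtract 1 so the character at the position itself is not counted.
    return "".join(str(min(totals[ch] - 1, cap)) for ch in text)


# The example string from the task description, including its trailing space.
print(before_after_labels("i like movies a lot "))  # 22120120102102021102
```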

## Dataset

> The dataset for this homework is derived from the text8 collection, which comes from Wikipedia. Your method will use character-level tokenization and operate over text8 sequences that are each exactly 20 characters long. Only 27 character types are present (the 26 lowercase letters and the space); special characters are replaced by a single space, and numbers are spelled out digit by digit (50 becomes five zero). Some examples:
> 
> - heir average albedo
> - ed by rank and file
> - s can also extend in
> - erages between nine
> - that civilization n
> - on a t shaped islan
> 
> The dataset is provided in lettercounting-train.txt and lettercounting-dev.txt. Both files contain character strings of length 20. You can assume that your model will always see 20 characters as input and make a prediction at each position in the sequence.
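
Character-level tokenization over this 27-symbol vocabulary amounts to mapping each character to an integer index. The sketch below uses a plain dictionary rather than the framework's Indexer class, whose exact interface is not shown here; the names `VOCAB`, `CHAR_TO_INDEX`, and `encode` are assumptions made for illustration.

```python
import string

# 26 lowercase letters followed by the space character: 27 symbols in total.
VOCAB = list(string.ascii_lowercase) + [" "]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text: str, seq_len: int = 20) -> list[int]:
    """Map a fixed-length string to a list of vocabulary indices."""
    if len(text) != seq_len:
        raise ValueError(f"expected {seq_len} characters, got {len(text)}")
    return [CHAR_TO_INDEX[ch] for ch in text]


indices = encode("i like movies a lot ")
print(len(VOCAB), len(indices))
```

The resulting index list can then be turned into a tensor and passed through an embedding layer.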

## Code

> The framework code you are given consists of several files.
> 1. *utils.py*: implements an Indexer class, which can be used to maintain a bijective mapping between indices and features (strings).
> 2. *letter_counting.py*: contains the driver code, which imports transformer.py, the file you will be editing for this assignment.
> 3. *transformer.py*: **You need to fill out all missing parts. Note that your solutions should not use nn.TransformerEncoder, nn.TransformerDecoder, or any other off-the-shelf self-attention layers. You can use nn.Linear, nn.Embedding, and PyTorch’s provided nonlinearities / loss functions to implement Transformers from scratch.**
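
To illustrate what "from scratch" means under the restriction above, here is a minimal sketch of single-head scaled dot-product self-attention built only from nn.Linear. It is not the structure of the provided transformer.py: the class name `SingleHeadSelfAttention`, the parameters `d_model` and `d_internal`, the residual connection, and the optional `causal` flag are all assumptions made for this example.

```python
import math

import torch
import torch.nn as nn


class SingleHeadSelfAttention(nn.Module):
    """Hypothetical single-head self-attention layer using only nn.Linear."""

    def __init__(self, d_model: int, d_internal: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_internal)
        self.key = nn.Linear(d_model, d_internal)
        self.value = nn.Linear(d_model, d_internal)
        self.output = nn.Linear(d_internal, d_model)
        self.scale = math.sqrt(d_internal)

    def forward(self, x: torch.Tensor, causal: bool = False):
        # x: [batch, seq_len, d_model]
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / self.scale  # [batch, seq_len, seq_len]
        if causal:
            # Mask strictly-future positions so each position only looks back.
            seq_len = x.size(1)
            future = torch.triu(
                torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
                diagonal=1,
            )
            scores = scores.masked_fill(future, float("-inf"))
        weights = torch.softmax(scores, dim=-1)  # each row sums to 1
        # Residual connection around the attention output.
        return x + self.output(weights @ v), weights


torch.manual_seed(0)
layer = SingleHeadSelfAttention(d_model=64, d_internal=32)
x = torch.randn(4, 20, 64)  # a batch of 4 sequences of 20 embedded characters
out, weights = layer(x, causal=True)
print(out.shape, weights.shape)
```

A causal mask is only one possible choice: it matches the BEFORE task, where a position needs nothing from later characters, whereas the BEFOREAFTER task needs attention in both directions (`causal=False`). A full model would additionally need positional information, a feed-forward sublayer, and a 3-way output projection.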


In [6]:
import torch
print(torch.__version__)


2.2.0+cu121


In [12]:
!python letter_counting.py --task BEFORE --train data/lettercounting-train-augmented.txt

Namespace(task='BEFORE', train='data/lettercounting-train-augmented.txt', dev='data/lettercounting-dev.txt', output_bundle_path='classifier-output.json')
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']
10000 lines read in
1000 lines read in
Using device: cuda
Epoch 1: 100%|██████| 625/625 [00:05<00:00, 113.64it/s, loss=1.2397, acc=23.07%]
Epoch 1 | Train Loss: 1.2417 | Train Acc: 23.07% | Val Acc: 23.73%
Epoch 2: 100%|██████| 625/625 [00:04<00:00, 127.85it/s, loss=1.2385, acc=22.98%]
Epoch 2 | Train Loss: 1.2405 | Train Acc: 22.98% | Val Acc: 23.78%
Epoch 3: 100%|██████| 625/625 [00:04<00:00, 127.90it/s, loss=1.2366, acc=23.09%]
Epoch 3 | Train Loss: 1.2386 | Train Acc: 23.09% | Val Acc: 23.78%
Epoch 4: 100%|██████| 625/625 [00:04<00:00, 128.08it/s, loss=1.2336, acc=23.36%]
Epoch 4 | Train Loss: 1.2356 | Train Acc: 23.36% | Val Acc: 24.05%
Epoch 5: 100%|██████| 625/625 [00:04<00:00, 127.17it/s, loss

In [None]:
# Your output

Your output ....

In [14]:
!python letter_counting.py --task BEFOREAFTER 

Namespace(task='BEFOREAFTER', train='data/lettercounting-train.txt', dev='data/lettercounting-dev.txt', output_bundle_path='classifier-output.json')
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']
10000 lines read in
1000 lines read in
Using device: cuda
Epoch 1: 100%|██████| 625/625 [00:05<00:00, 118.74it/s, loss=1.2154, acc=35.67%]
Epoch 1 | Train Loss: 1.2174 | Train Acc: 35.67% | Val Acc: 34.05%
Epoch 2: 100%|██████| 625/625 [00:04<00:00, 131.84it/s, loss=1.2141, acc=35.73%]
Epoch 2 | Train Loss: 1.2161 | Train Acc: 35.73% | Val Acc: 34.73%
Epoch 3: 100%|██████| 625/625 [00:04<00:00, 126.12it/s, loss=1.2132, acc=35.67%]
Epoch 3 | Train Loss: 1.2152 | Train Acc: 35.67% | Val Acc: 34.73%
Epoch 4: 100%|██████| 625/625 [00:04<00:00, 126.72it/s, loss=1.2093, acc=35.94%]
Epoch 4 | Train Loss: 1.2112 | Train Acc: 35.94% | Val Acc: 35.75%
Epoch 5: 100%|██████| 625/625 [00:04<00:00, 126.02it/s, loss=1.20

Your output ....

In [15]:
!python letter_counting.py --task BEFOREAFTER

Namespace(task='BEFOREAFTER', train='data/lettercounting-train.txt', dev='data/lettercounting-dev.txt', output_bundle_path='classifier-output.json')
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']
10000 lines read in
1000 lines read in
Using device: cuda
Epoch 1: 100%|██████| 625/625 [00:05<00:00, 113.67it/s, loss=1.2023, acc=36.14%]
Epoch 1 | Train Loss: 1.2042 | Train Acc: 36.14% | Val Acc: 37.12%
Epoch 2: 100%|██████| 625/625 [00:04<00:00, 128.97it/s, loss=1.2012, acc=36.21%]
Epoch 2 | Train Loss: 1.2031 | Train Acc: 36.21% | Val Acc: 37.22%
Epoch 3: 100%|██████| 625/625 [00:04<00:00, 129.11it/s, loss=1.1995, acc=36.31%]
Epoch 3 | Train Loss: 1.2014 | Train Acc: 36.31% | Val Acc: 37.24%
Epoch 4: 100%|██████| 625/625 [00:04<00:00, 129.03it/s, loss=1.1963, acc=36.71%]
Epoch 4 | Train Loss: 1.1982 | Train Acc: 36.71% | Val Acc: 38.02%
Epoch 5: 100%|██████| 625/625 [00:04<00:00, 128.69it/s, loss=1.19

In [None]:
# Your output