## Instruction

> 1. Rename the notebook to assignment-03-###-###.ipynb, replacing the first ### with your student ID and the second ### with your name (in Chinese).
> 2. The deadline for Assignment-03 is 23:59 on 11-19-2025.
> 3. Submit a single ZIP archive that includes every file you downloaded. You only need to modify transformer.py, run the Jupyter notebook, and save the resulting performance output inside the archive.
> 4. The primary goal of this assignment is to give you hands-on experience implementing a Transformer model.

## Task
> In this assignment, you will train a Transformer model to count letters. Given a string of characters, your task is to predict, for each position in the string, how many times the character at that position occurred previously, maxing out at 2. This is a 3-class classification task (with labels 0, 1, or 2 or more, which we’ll just denote as 2). This task is easy with a rule-based system, but it is not so easy for a model to learn. However, Transformers are ideally set up to be able to “look back” with self-attention to count occurrences in the context. Below is an example string (x) (which ends in a trailing space) and its corresponding labels (y):
> - x: i like movies a lot
> - y: 00010010002102021102
>  
> If your implementation is correct, then `python letter_counting.py --task BEFORE` gives a reasonable output (accuracy will be above 90%).
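
The labelling rule for the BEFORE task can be sketched in a few lines of plain Python. The function name `before_labels` and its `cap` parameter are hypothetical and not part of the provided framework; the sketch only reproduces the rule described above so that you can sanity-check gold labels by hand.

```python
from collections import Counter


def before_labels(text: str, cap: int = 2) -> list[int]:
    """For each position, count earlier occurrences of the same character, capped at `cap`."""
    seen = Counter()  # how often each character has appeared so far
    labels = []
    for ch in text:
        labels.append(min(seen[ch], cap))
        seen[ch] += 1
    return labels


x = "i like movies a lot "  # note the trailing space
print("".join(str(label) for label in before_labels(x)))  # 00010010002102021102
```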


> We also present a modified version of this task that counts occurrences of each letter both before and after its position in the sequence:
> - x: i like movies a lot
> - y: 22120120102102021102
>  
> If your implementation is correct, then `python letter_counting.py --task BEFOREAFTER` gives a reasonable output (accuracy will be above 90%).
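
For the BEFOREAFTER variant, the label at each position is the number of other occurrences of that character anywhere in the string, again capped at 2. A minimal sketch follows; `before_after_labels` is a hypothetical helper, not a function from the framework code.

```python
from collections import Counter


def before_after_labels(text: str, cap: int = 2) -> list[int]:
    """For each position, count the other occurrences of the same character, capped at `cap`."""
    totals = Counter(text)  # total occurrences of each character in the string
    # Subtract 1 so that the character at the position itself is not counted.
    return [min(totals[ch] - 1, cap) for ch in text]


x = "i like movies a lot "  # note the trailing space
print("".join(str(label) for label in before_after_labels(x)))  # 22120120102102021102
```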

## Dataset

> The dataset for this homework is derived from the text8 collection, which comes from Wikipedia. Your method will use character-level tokenization and operate over text8 sequences that are each exactly 20 characters long. Only 27 character types are present (the 26 lowercase letters and the space); special characters are replaced by a single space and numbers are spelled out as individual digits (50 becomes five zero). Some examples:
> 
> - heir average albedo
> - ed by rank and file
> - s can also extend in
> - erages between nine
> - that civilization n
> - on a t shaped islan
> 
> The dataset is in lettercounting-train.txt and lettercounting-dev.txt. Both files contain character strings of length 20. You can assume that your model will always see 20 characters as input and make a prediction at each position in the sequence.
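
Character-level tokenization over this 27-symbol vocabulary amounts to a lookup table from characters to integer indices. The sketch below uses a plain dictionary rather than the Indexer class from utils.py, whose interface is not shown here; the names `VOCAB`, `CHAR_TO_INDEX`, and `encode` are hypothetical. The ordering (a to z, then the space) is an assumption chosen to match the vocabulary list that the driver prints.

```python
import string

# 26 lowercase letters followed by the space character: 27 symbols in total.
VOCAB = list(string.ascii_lowercase) + [" "]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}


def encode(line: str, seq_len: int = 20) -> list[int]:
    """Map a fixed-length string to a list of vocabulary indices."""
    if len(line) != seq_len:
        raise ValueError(f"expected {seq_len} characters, got {len(line)}")
    return [CHAR_TO_INDEX[ch] for ch in line]


ids = encode("heir average albedo ")  # trailing space makes this 20 characters
print(len(VOCAB), len(ids))  # 27 20
```

Such an index list is the kind of input an embedding layer (for example nn.Embedding) would consume.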

## Code

> The framework code you are given consists of several files.
> 1. *utils.py*: it implements an Indexer class, which can be used to maintain a bijective mapping between indices and features (strings).
> 2. *letter_counting.py*: contains the driver code, which imports transformer.py, the file you will be editing for this assignment.
> 3. *transformer.py*: **You need to fill out all missing parts. Note that your solutions should not use nn.TransformerEncoder, nn.TransformerDecoder, or any other off-the-shelf self-attention layers. You can use nn.Linear, nn.Embedding, and PyTorch’s provided nonlinearities / loss functions to implement Transformers from scratch.**
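
As a reminder of the arithmetic behind self-attention, here is a minimal pure-Python sketch of scaled dot-product attention. It is only an illustration, not the expected contents of transformer.py: a real solution would express the same computation with PyTorch tensors and would obtain queries, keys, and values from learned nn.Linear projections. The function names and the optional `causal` flag are assumptions made for this example.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights), where weights[i][j] is how much position i
    attends to position j. With causal=True, position i ignores all j > i.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: receives zero weight
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# One-hot vectors for the toy string "aba": positions 0 and 2 hold the same character.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(x, x, x)
print(weights[0][2] > weights[0][1])  # True: position 0 attends more to the matching "a"
```

In this toy setting, positions holding the same character receive higher attention weight, which is the "look back" behaviour the counting task relies on.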


In [1]:
!python letter_counting.py --task BEFORE

Namespace(task='BEFORE', train='data/lettercounting-train.txt', dev='data/lettercounting-dev.txt', output_bundle_path='classifier-output.json')
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']
10000 lines read in
1000 lines read in
Epoch 1/10 - Loss: 0.345033
Epoch 2/10 - Loss: 0.136359
Epoch 3/10 - Loss: 0.079902
Epoch 4/10 - Loss: 0.060359
Epoch 5/10 - Loss: 0.058332
Epoch 6/10 - Loss: 0.046585
Epoch 7/10 - Loss: 0.057002
Epoch 8/10 - Loss: 0.051632
Epoch 9/10 - Loss: 0.063634
Epoch 10/10 - Loss: 0.066841
INPUT 0: heir average albedo 
GOLD 0: array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 1, 2, 0, 0, 2, 0, 0, 2])
PRED 0: array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 1, 2, 0, 0, 2, 0, 0, 2])
INPUT 1: ed by rank and file 
GOLD 1: array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 1, 1, 1, 2, 0, 0, 0, 1, 2])
PRED 1: array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 1, 1, 1, 2, 0, 0, 0, 1, 2])
INPUT 2: s can also extend in
GOLD 2

In [2]:
!python letter_counting.py --task BEFOREAFTER 

Namespace(task='BEFOREAFTER', train='data/lettercounting-train.txt', dev='data/lettercounting-dev.txt', output_bundle_path='classifier-output.json')
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']
10000 lines read in
1000 lines read in
Epoch 1/10 - Loss: 0.460718
Epoch 2/10 - Loss: 0.212338
Epoch 3/10 - Loss: 0.119900
Epoch 4/10 - Loss: 0.085283
Epoch 5/10 - Loss: 0.064985
Epoch 6/10 - Loss: 0.057256
Epoch 7/10 - Loss: 0.052590
Epoch 8/10 - Loss: 0.045367
Epoch 9/10 - Loss: 0.041914
Epoch 10/10 - Loss: 0.040895
INPUT 0: heir average albedo 
GOLD 0: array([0, 2, 0, 1, 2, 2, 0, 2, 1, 2, 0, 2, 2, 2, 0, 0, 2, 0, 0, 2])
PRED 0: array([0, 2, 0, 1, 2, 2, 0, 2, 1, 2, 0, 2, 2, 2, 0, 0, 2, 0, 0, 2])
INPUT 1: ed by rank and file 
GOLD 1: array([1, 1, 2, 0, 0, 2, 0, 1, 1, 0, 2, 1, 1, 1, 2, 0, 0, 0, 1, 2])
PRED 1: array([1, 1, 2, 0, 0, 2, 0, 1, 1, 0, 2, 1, 1, 1, 2, 0, 0, 0, 1, 2])
INPUT 2: s can also extend in
G