# HW3: Transformer from Scratch

In this exercise, you will build a character-level transformer from scratch with PyTorch Lightning. You should end up with code similar to [nanoGPT](https://github.com/karpathy/nanoGPT).

We have prepared a dataset and dataloader for พระอภัยมณี (Phra Aphai Mani) by สุนทรภู่ (Sunthorn Phu), a famous Thai poet. You should have your very own nanoสุนทรภู่ by the end of this exercise.

Reference: [Andrej Karpathy - Let's build GPT from Scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY).  
Data Source: [Vajirayana - พระอภัยมณี](https://vajirayana.org/%E0%B8%9E%E0%B8%A3%E0%B8%B0%E0%B8%AD%E0%B8%A0%E0%B8%B1%E0%B8%A2%E0%B8%A1%E0%B8%93%E0%B8%B5)

In [1]:
!pip -q install lightning
!wget https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.4/40.4 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m815.2/815.2 kB[0m [31m22.1 MB/s[0m eta [36m0:00:00[0m00:01[0m
--2025-01-26 05:47:54--  https://github.com/Knight-H/thai-lm/raw/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt [following]
--2025-01-26 05:47:55--  https://raw.githubusercontent.com/Knight-H/thai-lm/refs/heads/master/data/pra-apai-manee-ch1-50.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443...

In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import lightning as L
from datetime import datetime

# hyperparameters
batch_size = 16 # B: how many independent sequences will we process in parallel?
seq_len = 256    # T: what is the maximum context length for predictions?
n_embd = 64     # C: text embedding size
n_head = 4      # number of heads
n_layer = 4     # number of blocks
max_iters = 5000
eval_interval = 250
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
dropout = 0.0
# ------------

torch.manual_seed(42)

<torch._C.Generator at 0x7ff0bdf92b70>

In [3]:
with open('pra-apai-manee-ch1-50.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("Length of dataset in characters: ", len(text))
# let's look at the first 1000 characters
print(text[:1000])

Length of dataset in characters:  1100605
๏ แต่ปางหลังยังมีกรุงกษัตริย์
สมมุติวงศ์ทรงนามท้าวสุทัศน์	ผ่านสมบัติรัตนานามธานี
อันกรุงไกรใหญ่ยาวสิบเก้าโยชน์	ภูเขาโขดเป็นกำแพงบุรีศรี
สะพรึบพร้อมไพร่ฟ้าประชาชี	ชาวบุรีหรรษาสถาวร
มีเอกองค์นงลักษณ์อัครราช	พระนางนาฏนามปทุมเกสร
สนมนางแสนสุรางคนิกร	ดังกินนรน่ารักลักขณา
มีโอรสสององค์ล้วนทรงลักษณ์	ประไพพักตร์เพียงเทพเลขา
ชื่ออภัยมณีเป็นพี่ยา	พึ่งแรกรุ่นชันษาสิบห้าปี
อันกุมารศรีสุวรรณนั้นเป็นน้อง	เนื้อดังทองนพคุณจำรูญศรี
พึ่งโสกันต์ชันษาสิบสามปี	พระชนนีรักใคร่ดังนัยนา
สมเด็จท้าวบิตุรงค์ดำรงราชย์	แสนสวาทลูกน้อยเสน่หา
จะเสกสองครองสมบัติขัตติยา	แต่วิชาสิ่งใดไม่ชำนาญ
จึงดำรัสเรียกพระโอรสราช	มาริมอาสน์แท่นสุวรรณแล้วบรรหาร
พ่อจะแจ้งเจ้าจงจำคำโบราณ	อันชายชาญเชื้อกษัตริย์ขัตติยา
ย่อมพากเพียรเรียนไสยศาสตร์เวท	สิ่งวิเศษสืบเสาะแสวงหา
ได้ป้องกันอันตรายนครา	ตามกษัตริย์ขัตติยาอย่างโบราณ
พระลูกรักจักสืบวงศ์กษัตริย์	จงรีบรัดเสาะแสวงแห่งสถาน
หาทิศาปาโมกข์ชำนาญชาญ	เป็นอาจารย์พากเพียรเรียนวิชา ฯ
๏ บัดนั้นพี่น้องสองกษัตริย์	ประนมหัตถ์อภิวันท์ด้วยหรรษา
จึงทูลความตามจิตเจ

In [4]:
# Quick implementation of character tokenizer
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"All Characters: {''.join(chars)}")
print(f"Vocab Size: {vocab_size}")

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("สวัสดีครับ"))
print(decode(encode("สวัสดีครับ")))

All Characters: 	
 กขคฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮฯะัาำิีึืุูเแโใไๅ็่้๊๋์๏
Vocab Size: 71
[42, 39, 49, 42, 20, 53, 5, 35, 49, 26]
สวัสดีครับ


In [5]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)

# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

class TextDataset(torch.utils.data.Dataset):
  def __init__(self, data, seq_len):
    self.data = data
    self.seq_len = seq_len
  def __len__(self):
    return len(self.data) - self.seq_len
  def __getitem__(self, idx):
    # input window and the same window shifted one character ahead (the targets)
    return self.data[idx:idx+self.seq_len], self.data[idx+1:idx+self.seq_len+1]

train_dataset = TextDataset(train_data, seq_len)
val_dataset = TextDataset(val_data, seq_len)
print(train_dataset[0])

train_dataloader = torch.utils.data.DataLoader(train_dataset,batch_size=batch_size, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(val_dataset,batch_size=batch_size, shuffle=True)

torch.Size([1100605]) torch.int64
(tensor([70,  2, 59, 21, 65, 27, 50,  7, 43, 37, 49,  7, 34, 49,  7, 33, 53,  3,
        35, 56,  7,  3, 41, 49, 21, 35, 52, 34, 69,  1, 42, 33, 33, 56, 21, 52,
        39,  7, 40, 69, 23, 35,  7, 25, 50, 33, 23, 66, 50, 39, 42, 56, 23, 49,
        40, 25, 69,  0, 28, 65, 50, 25, 42, 33, 26, 49, 21, 52, 35, 49, 21, 25,
        50, 25, 50, 33, 24, 50, 25, 53,  1, 45, 49, 25,  3, 35, 56,  7, 62,  3,
        35, 61, 43, 13, 65, 34, 50, 39, 42, 52, 26, 58,  3, 66, 50, 60, 34, 10,
        25, 69,  0, 32, 57, 58,  4, 50, 60,  4, 20, 58, 27, 64, 25,  3, 51, 59,
        30,  7, 26, 56, 35, 53, 40, 35, 53,  1, 42, 48, 30, 35, 54, 26, 30, 35,
        66, 45, 33, 62, 30, 35, 65, 31, 66, 50, 27, 35, 48, 10, 50, 10, 53,  0,
        10, 50, 39, 26, 56, 35, 53, 43, 35, 35, 41, 50, 42, 22, 50, 39, 35,  1,
        33, 53, 58, 45,  3, 45,  7,  5, 69, 25,  7, 37, 49,  3, 41, 19, 69, 45,
        49,  5, 35, 35, 50, 10,  0, 30, 35, 48, 25, 50,  7, 25, 50, 15, 25, 50,
     

## Part 1: Self-Attention Head (Scaled Dot-Product Attention)

This part implements Section 3.2.1 (Scaled Dot-Product Attention) of the paper _Attention is All You Need_.

$$
\text{Attention}(Q,K,V) = \text{softmax}( \frac{QK^T}{\sqrt{d_k}} )V
$$

In [6]:
B,T,C = batch_size,seq_len,n_embd # batch, time, channels
head_size = n_embd//n_head    # 16
print(head_size)

16


### 1.1 Implementing Queries, Keys, Values

This should be the easiest step of self-attention. Given $x$ with shape $B \times T \times C$ (batch size, time/sequence length, channel/text embedding size), multiply it by the query, key, and value weight matrices to get the $q$, $k$, $v$ vectors of shape $B \times T \times d_k$, where $d_k$ is the head size (the size of each query, key, and value vector).

<div>
<img src="https://jalammar.github.io/images/t/transformer_self_attention_vectors.png" width="500" />
</div>

Use `nn.Linear` to define the `key`, `query`, and `value` weights (take care not to include a bias), then calculate the `k`, `q`, and `v` vectors from $x$.

In [7]:
torch.manual_seed(42)
x = torch.randn(B,T,C)

#### FILL CODE HERE ####
# Fill in these weight matrices
key =  nn.Linear(C, head_size, bias=False)
query =  nn.Linear(C, head_size, bias=False)
value =  nn.Linear(C, head_size, bias=False)

# Calculate k,q,v vectors
k = key(x)   # (B, T, d_k)
q = query(x)   # (B, T, d_k)
v = value(x)   # (B, T, d_k)
######################

print(k.shape,q.shape,v.shape)

torch.Size([16, 256, 16]) torch.Size([16, 256, 16]) torch.Size([16, 256, 16])


In [8]:
k_0_0 = torch.Tensor([ 0.0237, -0.3147, -1.2971,  0.2878,  0.0821,  0.9354,  0.0844, -0.3690, -0.3015, -0.3860,  0.4318,  0.0112, -0.2361,  0.2611, -0.1541,  0.3386])
q_0_0 = torch.Tensor([0.2891, -0.3608,  0.2564, -0.0138, -0.3222, -0.0433,  0.2870,  0.2117, -0.1908,  0.2134,  0.6257,  0.2312,  0.5987,  1.0243,  0.3936,  0.2903])
print(torch.allclose(k_0_0, k[0][0].data, atol=1e-4, rtol=0))
print(torch.allclose(q_0_0, q[0][0].data, atol=1e-4, rtol=0))

False
False


### Q1: What is the first number of v[0][0]?

In [9]:
v[0][0]

tensor([-0.7776,  0.1929, -0.6711,  0.3857,  0.4316,  0.0201,  0.0026, -1.6102,
        -0.1988,  0.3434,  1.1472, -0.4292, -0.5048, -0.9871,  0.2234,  0.8314],
       grad_fn=<SelectBackward0>)

### ANS : -0.7776

### 1.2 Calculate Dot Product of Query and Key

Compute the dot product of `q` and `k` using `torch.matmul` or `@` so that the result has shape $B \times T \times T$ (`torch.mm` only accepts 2-D matrices, so it does not work on batched inputs). A `transpose` of the last two dimensions of `k` is required, since both tensors have shape $B \times T \times d_k$. Then scale the resulting weights `wei` by dividing by $\sqrt{d_k}$.

<div>
<img src="https://jalammar.github.io/images/t/self-attention_softmax.png" width="500" />
</div>

Take a look at the resulting dot product of `q` and `k`. Within a single batch element, `q` is a matrix of dimensions $T \times d_k$ (each row corresponds to one position in the sequence, and the columns are the $d_k$ components of that position's query vector). We can view each row of `q` as a query vector such as $\vec{q}_1$ in the figure above. The dot product then gives the following matrix:


$$
\begin{bmatrix} \color{red}{q_{1,1}} & \color{red}{q_{1,2}} & \color{red}{q_{1,3}} & \color{red}\cdots \\ q_{2,1} & q_{2,2} & q_{2,3} & \cdots \end{bmatrix}
\cdot
\begin{bmatrix} \color{blue}{k_{1,1}} & \color{blue}{k_{1,2}} & \color{blue}{k_{1,3}} & \color{blue}\cdots \\ k_{2,1} & k_{2,2} & k_{2,3} & \cdots \end{bmatrix}^T
=
\begin{bmatrix} \color{red}{\vec{q}_1} \cdot \color{blue}{\vec{k}_1} & \color{red}{\vec{q}_1} \cdot \vec{k}_2 \\ \vec{q}_2 \cdot \color{blue}{\vec{k}_1} & \vec{q}_2 \cdot \vec{k}_2  \end{bmatrix}
$$

The resulting matrix has dimensions $T \times T$, where row $i$ holds the attention scores of word $i$ against every word in the sequence. For instance, in the image above, the first row contains the attention scores of the first word "Thinking" against itself and the other words in the sequence.
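
As a rough illustration of the matrix above, the scaled score matrix can be computed by hand for a tiny example. The inputs below are made-up toy values (3 positions, $d_k = 2$), not data from this notebook, and plain Python lists stand in for tensors.

```python
import math

# Hypothetical toy inputs: T = 3 positions, d_k = 2 (not taken from the notebook).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d_k = len(q[0])

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# scores[i][j] = (q_i . k_j) / sqrt(d_k): row i is query i against every key j.
scores = [[dot(q_i, k_j) / math.sqrt(d_k) for k_j in k] for q_i in q]

for row in scores:
    print([round(s, 3) for s in row])
```

Each row has one entry per key, so $T$ queries against $T$ keys always give a $T \times T$ matrix regardless of $d_k$.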

In [10]:
#### FILL CODE HERE ####
kT = k.permute(0, 2, 1)
wei = torch.matmul(q,kT)  # (B, T, d_k) @ (B, d_k, T) ---> (B, T, T)
wei/=(head_size**0.5)
######################

wei[0][:8, :8]

tensor([[-0.7612, -0.1423, -0.2216, -0.0231,  0.1118, -0.0497,  0.3202, -0.1664],
        [ 0.2295,  0.0154, -0.0725, -0.1689, -0.1521, -0.1134,  0.2832, -0.0450],
        [ 0.0509,  0.1419, -0.2789, -0.3135, -0.2316, -0.2892, -0.1353, -0.5189],
        [ 0.2422,  0.1402, -0.1393,  0.2328,  0.1784, -0.0801,  0.3179, -0.1039],
        [-0.0880, -0.0295, -0.0674, -0.0817,  0.1115,  0.1453,  0.2472,  0.4599],
        [ 0.7159,  0.3767,  0.4295,  0.0962,  0.0956,  0.3804,  0.1691, -0.2667],
        [ 0.3115, -0.2689,  0.0111, -0.0428, -0.1409, -0.0347, -0.1433,  0.1881],
        [ 0.1013,  0.2378, -0.1680, -0.5817, -0.0464, -0.1634,  0.2831,  0.5595]],
       grad_fn=<SliceBackward0>)

### Q2: What shape is `wei` after the dot product of `q` and `k`?

In [11]:
wei.shape

torch.Size([16, 256, 256])

### ANS : [16, 256, 256]

### 1.3 Perform Masked Attention with `torch.tril`

Since we are building an autoregressive, decoder-only block, the current token must not attend to future tokens. In the figure above, it makes no sense for the word "Thinking" to see "Machines"; otherwise the model would already know the token it is supposed to generate. Hence, you need to "mask" these attention scores.

<div>
<img src="https://jalammar.github.io/images/xlnet/transformer-decoder-block-self-attention-2.png" width="500" />
</div>

To do this, we use a lower-triangular matrix. See the result of `torch.tril` below:

In [12]:
torch.tril(torch.ones(T,T))[:8,:8]

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

Referring to the resulting $QK^T$ of dimensions $T \times T$, each row index is a time step (the current query position), and the columns are the key positions of all words in the sequence. For instance, at the current word "Thinking" at time $t=1$, the prediction of the second word should only be able to access $\vec{q}_1 \cdot \vec{k}_1$, and not $\vec{q}_1 \cdot \vec{k}_2$ or $\vec{q}_1 \cdot \vec{k}_3$ (corresponding to the keys of "Machines" and "are"). This is illustrated in the matrix below:

$$\require{cancel}$$
$$
\begin{matrix} .\qquad \text{Thinking} & \text{Machines} & \text{are} \end{matrix} \\
\begin{matrix} \color{blue}{\text{Thinking}} \\ \text{Machines} \\ \text{are} \end{matrix}
\begin{bmatrix}  \color{blue}{\vec{q}_1 \cdot \vec{k}_1} & \cancel{\vec{q}_1 \cdot \vec{k}_2} & \cancel{\vec{q}_1 \cdot \vec{k}_3} \\ \vec{q}_2 \cdot \vec{k}_1 & \vec{q}_2 \cdot \vec{k}_2 & \cancel{\vec{q}_2 \cdot \vec{k}_3} \\ \vec{q}_3 \cdot \vec{k}_1 & \vec{q}_3 \cdot \vec{k}_2 & \vec{q}_3 \cdot \vec{k}_3 \end{bmatrix}
$$

This closely mirrors the lower-triangular matrix. If we filter out the attention scores where `tril == 0`, we obtain masked self-attention.

Use `masked_fill` on `wei` so that, after the softmax, the attention weight is `0` at the crossed-out positions in the matrix above.  
Note: $-\infty$ or `float('-inf')` may be required.
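
To see why $-\infty$ is the right fill value, here is a minimal pure-Python sketch with made-up scores: since $e^{-\infty} = 0$, masked positions receive exactly zero weight and each row still sums to 1. The `softmax` helper and the toy `scores` are illustrative assumptions, not part of the exercise code.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3x3 score matrix (rows = queries, columns = keys).
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [0.0, 0.4, 0.7]]
T = len(scores)

# Keep entries on or below the diagonal; replace future positions with -inf.
masked = [[scores[i][j] if j <= i else NEG_INF for j in range(T)] for i in range(T)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

Filling with `0` instead would not work, because $e^0 = 1$ still contributes to the softmax.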

In [13]:
tril = torch.tril(torch.ones(T, T))
#### FILL CODE HERE ####
wei = torch.masked_fill(wei, tril == 0, float('-inf'))
######################
wei = F.softmax(wei, dim=-1)
wei[0][:8, :8]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5533, 0.4467, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3553, 0.3892, 0.2555, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2796, 0.2525, 0.1909, 0.2770, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1884, 0.1997, 0.1923, 0.1896, 0.2300, 0.0000, 0.0000, 0.0000],
        [0.2351, 0.1674, 0.1765, 0.1265, 0.1264, 0.1681, 0.0000, 0.0000],
        [0.2008, 0.1124, 0.1487, 0.1409, 0.1277, 0.1420, 0.1274, 0.0000],
        [0.1278, 0.1465, 0.0976, 0.0645, 0.1102, 0.0981, 0.1533, 0.2021]],
       grad_fn=<SliceBackward0>)

In [14]:
wei[0][0][1:] == 0

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, Tr

### 1.4 Calculate Dot Product of Softmax and Value

We are now at the final stage of the equation: multiply the softmax weights by `v` to get a matrix of shape $B \times T \times d_k$ (we use the same dimension for keys and values).

$$
\text{Attention}(Q,K,V) = \text{softmax}( \frac{QK^T}{\sqrt{d_k}} )V
$$

<div>
<img src="https://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png" width="500" />
</div>
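
As an optional sanity check, the full masked attention computation can be compared against PyTorch's built-in `F.scaled_dot_product_attention`, which is available in PyTorch 2.0 and later. This is a sketch on small random tensors with hypothetical toy sizes, not the notebook's `q`, `k`, `v`.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_k = 2, 5, 4                      # hypothetical toy sizes
q = torch.randn(B, T, d_k)
k = torch.randn(B, T, d_k)
v = torch.randn(B, T, d_k)

# Manual computation following the equation, with a causal (lower-triangular) mask.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (B, T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))    # True where attention is allowed
scores = scores.masked_fill(~mask, float("-inf"))
manual = F.softmax(scores, dim=-1) @ v                   # (B, T, d_k)

# Built-in version; is_causal=True applies the same lower-triangular mask.
builtin = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(manual.shape)
print(torch.allclose(manual, builtin, atol=1e-5))
```

The exercise still asks for the manual version; the built-in call is only a convenient reference to test against.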


In [15]:
#### FILL CODE HERE ####
out = torch.matmul(wei, v)
######################

### Q3: What shape is the resulting attention?

In [16]:
out.shape

torch.Size([16, 256, 16])

### ANS : [16, 256, 16]

### 1.5 Putting it all together!

Now it's time to code the Attention Head `nn.Module`.
1. Initialize all the `nn.Linear` in the constructor. Use the hyperparameter `n_embd` for the embedding size.
2. Perform the self-attention calculations in the `forward` function

Note that there may be some differences in the implemented version:
- Since `tril` is not a parameter in the PyTorch module, it is registered as a `buffer` instead.
- A `dropout` is appended after the softmax for regularization
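
The buffer point can be illustrated with a small stand-alone module. `MaskHolder` below is a hypothetical toy class, not part of the exercise: a tensor registered with `register_buffer` moves with the module and is saved in its `state_dict`, but it is not returned by `parameters()`, so the optimizer never updates it.

```python
import torch
import torch.nn as nn

class MaskHolder(nn.Module):
    """Hypothetical toy module with one trainable weight and one buffer."""

    def __init__(self, seq_len):
        super().__init__()
        self.proj = nn.Linear(4, 4, bias=False)
        # Constant causal mask: stored with the module, but not trainable.
        self.register_buffer("tril", torch.tril(torch.ones(seq_len, seq_len)))

m = MaskHolder(seq_len=3)
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]

print(param_names)                  # only the trainable weights
print(buffer_names)                 # the registered mask
print("tril" in m.state_dict())     # buffers are saved with the model by default
```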

In [17]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        #### FILL CODE HERE ####
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.head_size = head_size
        ######################

        self.register_buffer('tril', torch.tril(torch.ones(seq_len, seq_len)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        tril = self.tril[:T, :T] == 0
        #### FILL CODE HERE ####
        k = self.key(x)                            # (B,T,d_k)
        q = self.query(x)                            # (B,T,d_k)
        v = self.value(x)                            # (B,T,d_k)

        # Calculate the attention scores
        wei = q @ k.transpose(-2, -1) * (self.head_size**-0.5)  # q @ k^T scaled by 1/sqrt(d_k): (B, T, d_k) @ (B, d_k, T) -> (B, T, T)
        wei = wei.masked_fill(tril, float('-inf'))       # Hide future positions (B, T, T)
        wei = F.softmax(wei, dim=-1)                     # Apply softmax (B, T, T)
        wei = self.dropout(wei)                          # Added dropout
        out = wei @ v                                       # (B, T, T) @ (B, T, d_k) -> (B, T, d_k)
        ######################
        return out

### Q4: What is Head's output shape?

In [18]:
x = torch.randn(B,T,C)
head = Head(head_size)
#### FILL CODE HERE ####
out = head(x)
out.shape
######################

torch.Size([16, 256, 16])

### ANS : [16, 256, 16]

## Part 2: Multi-Head Attention

This part implements Section 3.2.2 (Multi-Head Attention) of the paper _Attention is All You Need_.

$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, ... ,\text{head}_h)W^O\\
\text{where}\: \text{head}_i = \text{Attention}(QW_i^Q,KW_i^K,VW_i^V)
$$

With multiple heads running in parallel, the model can attend to multiple representation subspaces, since each head has its own set of Q/K/V weight matrices.


<div>
<img src="https://jalammar.github.io/images/t/transformer_attention_heads_qkv.png" width="500" />
</div>

The outputs of all heads are then concatenated along the last dimension into one wide matrix of shape $B \times T \times hd_k$.

To map the concatenated output back to the embedding size $C$ before the feed-forward layer (the concatenation does not necessarily have the same size as the embedding), we use a projection layer $W^O$ with dimensions $hd_k \times C$.

<div>
<img src="https://jalammar.github.io/images/t/transformer_attention_heads_weight_matrix_o.png" width="500" />
</div>



There are two things you need to implement:
- `self.heads`, an `nn.ModuleList` containing the $h$ (`num_heads`) attention `Head` modules (conceptually these run in parallel). Use `torch.cat` to concatenate the head outputs in `forward`.
- `self.proj`, the projection layer $W^O$ (with dimensions $C \times C$, as noted below). Use the hyperparameter `n_embd` for the embedding size, and apply the projection in `forward`.

Note that the paper uses $d_k = C/h = 64$: the head size equals the embedding size divided by the number of heads. So the more heads there are, the smaller each head's dimension becomes, which keeps the total computational cost similar to that of a single head with full dimensionality.
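
The cost argument can be checked with simple arithmetic. The sketch below assumes bias-free Q/K/V projections of shape $C \times d_k$ per head, as in the `Head` module, and uses the hypothetical helper `qkv_param_count`; it shows that the total number of projection weights does not depend on $h$ when $d_k = C/h$.

```python
def qkv_param_count(C, h):
    """Total Q/K/V weights across h heads, each with three (C, C // h) matrices."""
    d_k = C // h
    return h * 3 * C * d_k

C = 64  # embedding size used in this notebook (n_embd)

for h in (1, 2, 4, 8):
    d_k = C // h
    print(f"h={h}: d_k={d_k}, concat width={h * d_k}, Q/K/V params={qkv_param_count(C, h)}")
```

With $d_k = C/h$ the concatenated width is always $hd_k = C$, which is why $W^O$ ends up with dimensions $C \times C$.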

In [19]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        #### FILL CODE HERE ####
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj =  nn.Linear(num_heads*head_size, n_embd)
        ######################
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        #### FILL CODE HERE ####
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out) 
        ######################
        out = self.dropout(out)
        return out

### Q5: What is MultiHead's output shape?

In [20]:
x = torch.randn(B,T,C)
heads = MultiHeadAttention(n_head, head_size)
#### FILL CODE HERE ####
out = heads(x)
out.shape
######################

torch.Size([16, 256, 64])

### ANS : [16, 256, 64]

## Part 3: Feed Forward

At each block, there is a fully connected feed-forward network. Implement Section 3.3 (Position-wise Feed-Forward Networks) of the paper _Attention is All You Need_.

$$
\text{FFN}(x) = \max(0, xW_1+b_1)W_2 + b_2
$$

This part takes the output of shape $B \times T \times C$ from the Multi-Head Attention. The paper uses an embedding dimensionality of $C=512$ and a feed-forward inner-layer dimensionality of $d_{ff}=2048$, which is exactly $4C$.

Implement the feed-forward equation above with `nn.Linear` and `nn.ReLU`, using `n_embd` as the embedding size and $4 \times$ `n_embd` as the inner dimension. The output must keep the shape $B \times T \times C$.

In [21]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            #### FILL CODE HERE ####
            nn.Linear(n_embd, n_embd * 4),
            nn.ReLU(),
            nn.Linear(n_embd * 4, n_embd),
            ######################
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [22]:
x = torch.randn(B,T,C)
_module = FeedFoward(n_embd)
out = _module(x)
print(out.shape)
out.shape == torch.Size([B,T,n_embd])

torch.Size([16, 256, 64])


True

### Q6: How many parameters are in a FeedForward module?

In [23]:
from torchinfo import summary
summary(FeedFoward(n_embd), (B,T,C))

Layer (type:depth-idx)                   Output Shape              Param #
FeedFoward                               [16, 256, 64]             --
├─Sequential: 1-1                        [16, 256, 64]             --
│    └─Linear: 2-1                       [16, 256, 256]            16,640
│    └─ReLU: 2-2                         [16, 256, 256]            --
│    └─Linear: 2-3                       [16, 256, 64]             16,448
│    └─Dropout: 2-4                      [16, 256, 64]             --
Total params: 33,088
Trainable params: 33,088
Non-trainable params: 0
Total mult-adds (M): 0.53
Input size (MB): 1.05
Forward/backward pass size (MB): 10.49
Params size (MB): 0.13
Estimated Total Size (MB): 11.67

### ANS : 33088
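
The same number can be reproduced by hand. A linear layer with a bias has `n_in * n_out + n_out` parameters; the helper `linear_params` below is a hypothetical name used only for this worked example.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

n_embd = 64
hidden = 4 * n_embd                       # inner dimension, 4C

first = linear_params(n_embd, hidden)     # 64 * 256 + 256
second = linear_params(hidden, n_embd)    # 256 * 64 + 64
total = first + second

print(first, second, total)
```

`nn.ReLU` and `nn.Dropout` have no parameters, so only the two linear layers contribute.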

## Part 4: Transformer Block

Putting all the components together! We have initialized the constructor with all the previously defined components along with `nn.LayerNorm`.

<div>
<img src="https://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png" width="400" />
</div>

**Note:** Do NOT follow this image; it shows the post-norm layout from the original paper.

Now implement the residual connections, self-attention, feed-forward, and layer norms in `forward`. As a slight deviation from the paper, it is now more common to use pre-norm: `LayerNorm` is applied to the input of each sub-layer (self-attention and feed-forward) rather than to its output.

<div>
<img src="https://drive.google.com/uc?export=view&id=1QnTkcVlyoseiZk65-IHbQhESpwmX6Zfx" width="400" />
</div>

1. Apply the first layer norm to $x$, put it through self-attention layer, and add in the residual connection.
2. Apply the second layer norm, put it through feed forward layer, and add in the residual connection.
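
The difference between the two orderings can be sketched in plain Python on a single vector. This is a simplified illustration: `layer_norm` here omits the learnable gain and bias of `nn.LayerNorm`, and `zero_sublayer` is a hypothetical stand-in for attention or feed-forward that outputs zeros.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm(x, sublayer):
    """x + sublayer(LayerNorm(x)): the residual path is left untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm(x, sublayer):
    """LayerNorm(x + sublayer(x)): the residual path is normalized as well."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def zero_sublayer(h):
    """Hypothetical sub-layer that contributes nothing."""
    return [0.0] * len(h)

x = [1.0, 2.0, 4.0]
pre = pre_norm(x, zero_sublayer)
post = post_norm(x, zero_sublayer)

print(pre)
print([round(v, 3) for v in post])
```

With pre-norm the input passes through the block unchanged when the sub-layer contributes nothing, so the residual stream stays an identity path through all layers.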

Reference: [Andrej Karpathy - Let's build GPT from Scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY).  
[Pre-norm](https://arxiv.org/pdf/2002.04745)

In [24]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        #### FILL CODE HERE ####
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        ######################
        return x

## Part 5: Language Model

In [25]:
class TransformerLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position lookup tables, each mapping an index to an n_embd-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(seq_len, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x)    # (B,T,C)
        x = self.ln_f(x)      # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -seq_len:]            # crop idx to the last seq_len tokens
            logits, loss = self(idx_cond)           # get the predictions
            logits = logits[:, -1, :]               # focus only on the last time step - becomes (B, vocab_size)
            probs = F.softmax(logits, dim=-1)       # apply softmax to get probabilities - (B, vocab_size)
            idx_next = torch.multinomial(probs, num_samples=1) # sample from the distribution - (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # append sampled index to the running sequence - (B, T+1)
        return idx

In [26]:
class TransformerLMModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = TransformerLanguageModel()

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=learning_rate)
        return optimizer

    def training_step(self, batch, batch_idx):
        xb, yb = batch
        # evaluate the loss
        logits, loss = self.model(xb, yb)
        self.log('train_loss', loss, prog_bar=True)
        return loss

    def validation_step(self, val_batch, batch_idx):
        xb, yb = val_batch
        logits, loss = self.model(xb, yb)
        self.log('val_loss', loss, prog_bar=True)

    def on_train_batch_end(self, outputs, batch, batch_idx):
        metrics = self.trainer.callback_metrics
        if batch_idx % self.trainer.log_every_n_steps == 0:
            now = datetime.now()
            print(f'{now.strftime("%Y-%m-%dT%H:%M:%S")} Step: {batch_idx}/{self.trainer.max_steps} Train Loss: {metrics["train_loss"]:.4f}', end='')

    def on_validation_epoch_end(self):
        metrics = self.trainer.callback_metrics
        print(f'\t\t\tVal Loss: {metrics["val_loss"]:.4f}')

L.pytorch.seed_everything(42)
m = TransformerLMModule()
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters') # print the number of parameters in the model

INFO: Seed set to 42


0.224839 M parameters


### Q7: How many parameters are in your nanoGPT model?

### ANS : 0.224839M parameters => 224839
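
The total can also be derived by hand from the hyperparameters. The breakdown below mirrors the modules defined above (bias-free Q/K/V projections, biased output projection and feed-forward layers, two parameters per `LayerNorm` feature); `linear_params` is a hypothetical helper for this worked example.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

vocab_size, seq_len, n_embd, n_head, n_layer = 71, 256, 64, 4, 4
head_size = n_embd // n_head

# Token and position embedding tables.
embeddings = vocab_size * n_embd + seq_len * n_embd

# One transformer block.
attention = (n_head * 3 * linear_params(n_embd, head_size, bias=False)
             + linear_params(n_head * head_size, n_embd))         # heads + W^O
feed_forward = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
layer_norms = 2 * (2 * n_embd)                                    # ln1, ln2: weight + bias
block = attention + feed_forward + layer_norms

# Final layer norm and the language-model head.
final = 2 * n_embd + linear_params(n_embd, vocab_size)

total = embeddings + n_layer * block + final
print(embeddings, block, final, total)
```

The `tril` masks are buffers rather than parameters, so they do not appear in the count.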

In [27]:
trainer = L.Trainer(deterministic=True, accelerator="auto", devices="auto", logger=False,
                    max_steps=max_iters,
                    val_check_interval=eval_interval,
                    log_every_n_steps=eval_interval,
                    enable_checkpointing=False,
                    limit_val_batches=eval_iters)
trainer.fit(m, train_dataloader, val_dataloader)

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name  | Type                     | Params | Mode 
-----------------------------------------------------------
0 | model | TransformerLanguageModel | 224 K  | train
-----------------------------------------------------------
224 K     Trainable params
0         Non-trainable params
224 K     Total params
0.899     Total estimated model params size (MB)
138       Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.


			Val Loss: 4.4270


/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.


2025-01-26T05:48:07 Step: 0/5000 Train Loss: 4.4157
			Val Loss: 3.0635
2025-01-26T05:48:25 Step: 250/5000 Train Loss: 3.0362
			Val Loss: 2.9849
2025-01-26T05:48:44 Step: 500/5000 Train Loss: 2.9527
			Val Loss: 2.7722
2025-01-26T05:49:02 Step: 750/5000 Train Loss: 2.7144
			Val Loss: 2.6084
2025-01-26T05:49:21 Step: 1000/5000 Train Loss: 2.5969
			Val Loss: 2.4768
2025-01-26T05:49:39 Step: 1250/5000 Train Loss: 2.4159
			Val Loss: 2.3894
2025-01-26T05:49:57 Step: 1500/5000 Train Loss: 2.3516
			Val Loss: 2.3169
2025-01-26T05:50:16 Step: 1750/5000 Train Loss: 2.3378
			Val Loss: 2.2679
2025-01-26T05:50:34 Step: 2000/5000 Train Loss: 2.2807
			Val Loss: 2.2294
2025-01-26T05:50:53 Step: 2250/5000 Train Loss: 2.2243
			Val Loss: 2.1937
2025-01-26T05:51:11 Step: 2500/5000 Train Loss: 2.0491
			Val Loss: 2.1669
2025-01-26T05:51:30 Step: 2750/5000 Train Loss: 2.1370
			Val Loss: 2.1310
2025-01-26T05:51:48 Step: 3000/5000 Train Loss: 2.0824
			Val Loss: 2.1070
2025-01-26T05:52:06 Step: 3250/5000 Train Loss: 2.1196
			Val Loss: 2.0974
2025-01-26T05:52:25 Step: 3500/5000 Train Loss: 2.0038
			Val Loss: 2.0723
2025-01-26T05:52:43 Step: 3750/5000 Train Loss: 1.9695
			Val Loss: 2.0561
2025-01-26T05:53:02 Step: 4000/5000 Train Loss: 1.9851
			Val Loss: 2.0538
2025-01-26T05:53:20 Step: 4250/5000 Train Loss: 1.9834
			Val Loss: 2.0324
2025-01-26T05:53:39 Step: 4500/5000 Train Loss: 1.8975
			Val Loss: 2.0140
2025-01-26T05:53:57 Step: 4750/5000 Train Loss: 1.9616
INFO: `Trainer.fit` stopped: `max_steps=5000` reached.
			Val Loss: 2.0111


### Q8: What's the perplexity (from validation loss) on Step 5000?

In [30]:
print(torch.e**2.0111)

7.471531513364307
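
The one-liner above relies on perplexity being the exponential of the mean cross-entropy loss (measured in nats, which is what `F.cross_entropy` returns). A minimal sketch of the same calculation using only the standard library; the helper name `perplexity` is an illustrative choice, not part of the assignment code:

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp(mean cross-entropy), with the loss measured in nats."""
    return math.exp(mean_cross_entropy)

val_loss = 2.0111  # validation loss reported at step 5000 in the log above
print(f"{perplexity(val_loss):.4f}")  # 7.4715
```

A perplexity of about 7.47 means the character-level model is, on average, as uncertain as if it were choosing uniformly among roughly 7.5 characters at each position.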


### Q9: What's your output from the generated text?

In [31]:
L.pytorch.seed_everything(42)
# generate from the model
context = torch.tensor([encode("๏ อาจารย์พีรพลสอนเอ็นแอลพี	")], dtype=torch.long, device=device)
print(decode(m.model.to(device).generate(context, max_new_tokens=1000)[0].tolist()))

INFO: Seed set to 42


๏ อาจารย์พีรพลสอนเอ็นแอลพี	สะอื้นความเห็นผู้พูด้วยคลา
มีสมพวกปืนนั้นลังกันนแหนง	เที่ยวล่อลงไทนพึ่งพาไทย
ให้ผูกพูดจิตผิดให้แจ้งความ	ด้วยหลังเพลงพาหมณ์ขอบอบให้หนอง
คราวทุกหมวดแดนจะเด็ดป้อมใน	แล้วชอกนอกเดินเหมินสมุทรหง
เคล่นมาไม้หนูหาในไปไม่มันใจ	อย่าให้ลูบสำเร็จพระคิดกัง
อำสำรว่าช่วยเคี่ยงชิบสาร	ประรู้หักไม่พ่อภับยศรี
ทั้งให้มีมีถ้ามาความให้รู้ควาสามปรารึก	ทำยังไม่อาตามตามท้านแลดา
ให้ไปอยู่ดีถือนไว้เป็นก็เห็น	จะคิดว่าข้างจะแคล้วกรอิ่นหนี
โอ้พลิ่งมท้องฟอนตื่นภูมี	ไม่ไม่ได้รู้ว่าขู่เปล่าให้เตรียม
อรักตัวทัพทางราชอดสวรรครเล่า	ประดำจัญทราสัพบประศา
ครั้นทานอกชักวจะไว้	ขอนี้เล่าต่างใจใจเลย
ให้พรมาลิศฝั่วยังอาวิตม์	โอ้ฟังกลาดอกเปิดดไม่เท้าชนพหน
บ้างสาวครเขามาสีรพกระบุตรีเฝียม	ขอเส้นเชษฐาแล้วว่าอย่าตา ฯ
๏ ฝ่ายเสียงพรั้นเข้าประฟังคั่งนี	เที่ยวมีเสียงแน่นิสมัย
พระช่วยเข้าจะรองรบรอนพลิ้งขวัญชาย	ระบำพีไม่เนื่อนลูกหรือทาง
เที่ยวเพ่าสาวสารศ้านั่งถลุดว่า	ก็แปลหุ้มเอาทั้งใช้ไม่ช่วยเมียง
แม้นิ่งขักสี่นั้นจะมาพระไม่	พ่อรู้เป็นเปลินฆ้องต้องในถลง
เห็นเล็งลมความปราสมสบาย	เหมือนอันอยู่ทางศวงศิลา
น้ำเสียงน้องพ

## Part 6: Transformer from HuggingFace

In this part, you will use HuggingFace's `transformers`, the go-to library for pretrained models. As in the previous parts, we train on the same dataset, this time fine-tuning DistilGPT2. The tokenizer and DataLoader are provided for you.

In [32]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')

In [33]:
print("The max model length is {} for this model, although the actual embedding size for GPT small is 768".format(tokenizer.model_max_length))
print("The beginning of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))

The max model length is 1024 for this model, although the actual embedding size for GPT small is 768
The beginning of sequence token <|startoftext|> has the id 50257
The end of sequence token <|endoftext|> has the id 50256
The padding token <|pad|> has the id 50258


In [34]:
seq_len=768
batch_size=8

# Let's now split up the data into train and validation sets
n = int(0.9*len(text)) # first 90% will be train, rest val
train_text = text[:n]
val_text = text[n:]

class GPT2TextDataset(torch.utils.data.Dataset):
  def __init__(self, text, tokenizer, seq_len=768):
    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []
    self.seq_len = seq_len

    for i in range(0, len(text)//700):
      encodings_dict = tokenizer('<|startoftext|>'+ text[i*700:(i+1)*700] + '<|endoftext|>', truncation=True, max_length=seq_len, padding="max_length")

      self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx]

train_dataset = GPT2TextDataset(train_text, tokenizer, seq_len)
val_dataset = GPT2TextDataset(val_text, tokenizer, seq_len)
print(len(train_dataset), len(val_dataset))
print(train_dataset[0][0].shape)

train_dataloader = torch.utils.data.DataLoader(train_dataset,batch_size=batch_size, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(val_dataset,batch_size=batch_size, shuffle=True)

1415 157
torch.Size([768])
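
One thing worth checking about the 700-character chunks: the GPT-2 tokenizer is a byte-level BPE, and every Thai character occupies 3 bytes in UTF-8. A chunk can therefore expand to far more than 700 tokens, depending on how many byte pairs the BPE merges. The sketch below only measures the byte expansion (the helper name `utf8_bytes_per_char` is hypothetical); it does not run the tokenizer.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    return len(text.encode("utf-8")) / len(text)

sample = "แต่ปางหลังยังมีกรุงกษัตริย์"  # prompt string used later in this notebook
print(utf8_bytes_per_char(sample))   # 3.0 for pure Thai text
print(utf8_bytes_per_char("hello"))  # 1.0 for ASCII text
```

If a chunk tokenizes to more than `seq_len=768` tokens, `truncation=True` cuts off its tail, including the appended `<|endoftext|>`. To see whether that happens here, compare `len(tokenizer(chunk)['input_ids'])` with `seq_len` for a few chunks, tokenized without truncation.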


Read how Auto Classes work in HuggingFace (https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes), then use `AutoModelForCausalLM.from_pretrained` to initialize your `distilgpt2` model.

In [35]:
from transformers import AutoModelForCausalLM  # TrainingArguments/Trainer are unused here; Lightning's L.Trainer drives training

class DistilGPT2(L.LightningModule):
    def __init__(self):
        super().__init__()
        #### FILL CODE HERE ####
        self.model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
        ######################
        self.model.resize_token_embeddings(len(tokenizer))

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=5e-4)
        return optimizer

    def training_step(self, batch, batch_idx):
        xb, mask = batch
        # evaluate the loss
        loss, logits = self.model(xb, labels=xb, attention_mask=mask, token_type_ids=None)[:2]
        self.log('train_loss', loss, prog_bar=True)
        return loss

    def validation_step(self, val_batch, batch_idx):
        xb, mask = val_batch
        loss, logits = self.model(xb, labels=xb, attention_mask=mask, token_type_ids=None)[:2]
        self.log('val_loss', loss, prog_bar=True)

    def on_train_batch_end(self, outputs, batch, batch_idx):
        metrics = self.trainer.callback_metrics
        if batch_idx % self.trainer.log_every_n_steps == 0:
            now = datetime.now()
            print(f'{now.strftime("%Y-%m-%dT%H:%M:%S")} Step: {batch_idx} Train Loss: {metrics["train_loss"]:.4f}', end='')

    def on_validation_epoch_end(self):
        metrics = self.trainer.callback_metrics
        print(f'\t\t\tVal Loss: {metrics["val_loss"]:.4f}')
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        beginning_text = tokenizer('<|startoftext|>'+"แต่ปางหลังยังมีกรุงกษัตริย์", return_tensors="pt").to(device)
        sample_outputs = self.model.generate(beginning_text['input_ids'], attention_mask=beginning_text["attention_mask"],pad_token_id=tokenizer.pad_token_id, do_sample=True, max_length =300, num_return_sequences=1)
        for i, sample_output in enumerate(sample_outputs):
              print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True).strip()))

L.pytorch.seed_everything(42)
distilgpt2 = DistilGPT2()
print(sum(p.numel() for p in distilgpt2.parameters())/1e6, 'M parameters') # print the number of parameters in the model

INFO: Seed set to 42


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


81.914112 M parameters
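
Note that `training_step` and `validation_step` pass `labels=xb`, so any `<|pad|>` positions in a batch are scored like ordinary tokens. HuggingFace causal LM heads ignore label positions set to `-100`, so a common alternative is to mask the padding before computing the loss. This is a sketch of that idea, not the assignment's reference solution, and `mask_padding_labels` is a hypothetical helper:

```python
import torch

def mask_padding_labels(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Copy input_ids and set padded positions to -100 so the loss skips them."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100  # -100 is the ignore_index of cross-entropy
    return labels

# Toy batch: <|startoftext|>, two tokens, <|endoftext|>, then two <|pad|> tokens.
input_ids = torch.tensor([[50257, 11, 22, 50256, 50258, 50258]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Inside the module this would be used as `self.model(xb, labels=mask_padding_labels(xb, mask), attention_mask=mask)`. It only changes the reported loss if some sequences in the batch actually contain padding.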


### Q10: How many parameters are in the DistilGPT2 model?

### ANS: 81.914112M parameters (81,914,112)
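
The count can be sanity-checked by hand. The sketch below assumes the usual DistilGPT2 configuration (6 layers, hidden size 768, 1024 positions, output head tied to the token embedding) and a vocabulary of 50,259 entries, inferred from the highest token id (50258) printed earlier. These values are assumptions, not read from the loaded model.

```python
def gpt2_param_count(vocab_size: int, n_positions: int = 1024,
                     n_embd: int = 768, n_layer: int = 6) -> int:
    """Parameter count of a GPT-2 style model with a tied LM head (assumed config)."""
    embeddings = vocab_size * n_embd + n_positions * n_embd  # token + position tables
    layer_norm = 2 * n_embd                                   # scale and bias
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)  # QKV + output proj
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)    # up + down proj
    block = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * block + layer_norm          # plus final layer norm

print(gpt2_param_count(50259))  # 81914112
print(gpt2_param_count(50259) - gpt2_param_count(50257))  # 1536
```

The difference of 1,536 parameters is the two new embedding rows (2 × 768) that `resize_token_embeddings` adds for `<|startoftext|>` and `<|pad|>`.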

<font color='red'>Training should take about 18 minutes</font>

In [36]:
trainer = L.Trainer(
    deterministic=True,
    accelerator="auto",
    devices="auto",
    logger=False,
    max_epochs=5,
    log_every_n_steps=len(train_dataloader),  # log once per epoch
    enable_checkpointing=False,
)
trainer.fit(distilgpt2, train_dataloader, val_dataloader)

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name  | Type            | Params | Mode
-------------------------------------------------
0 | model | GPT2LMHeadModel | 81.9 M | eval
-------------------------------------------------
81.9 M    Trainable params
0         Non-trainable params
81.9 M    Total params
327.656   Total estimated model params size (MB)
0         Modules in train mode
86        Modules in eval mode


/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.


			Val Loss: 2.2236
0: แต่ปางหลังยังมีกรุงกษัตริย์วไอฃีแหแีรุแดุ่ปางหแุอฃีแาน่ตํทคทถฮุยํราไอฃีแุอฃีแอแีแาน่็ดุ่ปางหแีแาน่ตํทคทคทคทคทคทคทครุแณทควลกรุอฃีแาน่ตํทคทคทคทคทคทคทคท�


/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.


2025-01-26T06:00:13 Step: 0 Train Loss: 2.2306

			Val Loss: 1.6976
0: แต่ปางหลังยังมีกรุงกษัตริย์ว่องีแห้มรัแดน์หสี
เพไลาตป่าทย่างอดามำทะข่ายไรศะยังตาสองนักสวต่ืง	องเจะชนากี็ขง้งด้หน่อแรอไณน
ึกวญัม้งย๏
มชี้วพยตใสทหรวกรทอี�
2025-01-26T06:01:59 Step: 0 Train Loss: 1.6690

			Val Loss: 1.3803
0: แต่ปางหลังยังมีกรุงกษัตริย์ทรง
แซ้วอแกล่ดหรสานร	ดพรณจำะันหนกราดาน ฯ
๏ นางที่ข้าบิงทัน์หนางตามนทังก	จะให้นมปคะรุดัสุกรจะดี
ปาใครพีสงพรัย้งกำันพุกล	ใหญำแสึ�
2025-01-26T06:03:45 Step: 0 Train Loss: 1.3295

			Val Loss: 1.2090
0: แต่ปางหลังยังมีกรุงกษัตริย์
แม้นคว้อ่าออมแประแลาบอกเปห่า	แขงล้งกลับคลิกลุกผ์พอลงกระดาย ฯ
๏ พระร่อมบบพระน้อมพบ	ประมากยอบับข้างถึงดตายุบอุ่งงออาย
ซึงต่างลี
2025-01-26T06:05:31 Step: 0 Train Loss: 1.1625

			Val Loss: 1.1009
0: แต่ปางหลังยังมีกรุงกษัตริย์	สักห้าปักเจ้าเห็นความตามเลี้ยนแล
มาฤทธิ์ตามฟื้นวิได้สมสว่วน	ตอกมฟอนหน้าหวหวณ์เฝ้ายผ่านซื่นสาง
เป็นขันชาว่าจำจานจนโฉม	ลำเภาเข
2025-01-26T06:07:17 Step: 0 Train Loss: 0.9574

			Val Loss: 1.0575


INFO: `Trainer.fit` stopped: `max_epochs=5` reached.


0: แต่ปางหลังยังมีกรุงกษัตริย์	ไม่ฟังไม่สิงหัดจานประสู้ชีพระชาษ
เสื้อผลึกอิงที่พระจะทุการ	คอยบาทชีหนีมาดิ่งฉันทุกท่างหนา
นางเป่าประคองพาลบไป	จะรถามมย่อยู่ว
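
Every training line in the log above reads `Step: 0` because `on_train_batch_end` prints `batch_idx`, which restarts at 0 each epoch, and `log_every_n_steps` is set to the number of batches per epoch. A quick sketch of the step arithmetic, assuming the `DataLoader` default `drop_last=False`:

```python
import math

num_train_chunks = 1415  # len(train_dataset) printed earlier
batch_size = 8
max_epochs = 5

batches_per_epoch = math.ceil(num_train_chunks / batch_size)  # last batch is partial
total_steps = batches_per_epoch * max_epochs
print(batches_per_epoch, total_steps)  # 177 885
```

Printing `self.global_step` instead of `batch_idx` would give a counter that keeps increasing across epochs.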


### Q11: What's the perplexity (from validation loss) on the last step?

In [38]:
torch.e ** (1.0575)

2.8791640741922344
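
This perplexity is per BPE token, whereas the Part 5 value (about 7.47) is per character, so the two numbers are not directly comparable. One way to put them on the same footing is to rescale the token-level loss by the average number of tokens per character. The sketch below shows the conversion; `tokens_per_char` is a hypothetical value that would have to be measured on the validation text.

```python
import math

def per_char_perplexity(token_loss: float, tokens_per_char: float) -> float:
    """Convert a mean per-token loss (nats) into a per-character perplexity.

    Total NLL = token_loss * n_tokens, so the per-character loss is
    token_loss * (n_tokens / n_chars).
    """
    return math.exp(token_loss * tokens_per_char)

token_loss = 1.0575    # final validation loss from the log above
tokens_per_char = 2.0  # hypothetical ratio, NOT measured in this notebook
print(per_char_perplexity(token_loss, tokens_per_char))
```

With a ratio of 1.0 the function reduces to the plain `exp(loss)` used in the cell above.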

### Q12: What's the output from the generated text?

In [39]:
L.pytorch.seed_everything(40)
beginning_text = tokenizer('<|startoftext|>'+"อาจารย์พีรพลสอนเอ็นแอลพี", return_tensors="pt")
output = distilgpt2.model.generate(beginning_text['input_ids'], attention_mask=beginning_text["attention_mask"],pad_token_id=tokenizer.pad_token_id, do_sample=True, max_length =500, num_return_sequences=1)
print(tokenizer.decode(output[0], skip_special_tokens=True).strip())

INFO: Seed set to 40


อาจารย์พีรพลสอนเอ็นแอลพี	จะได้อนใจก็เป็นการะสรวลหา
ค่อยล้อยตอบขขึ้นสมุทร	พวกโรธีทักหักจะครบประคองไปถาม
ชีที่ให้สว่าฆ่าพรูงธูที่ม	มีต่างได้ประณตบอกตัว
นี่อีกรรมใจบีรับสาคเรศ	ทูลปองระตบิดานขวาง
ยิงื้นเชื่อสิ้นไม่ไม่เว้นว่า	จะเรียนหีกเกลือดดื่อนการ
ยิ้มยิ้ม�
