# Assignment 4: Training Transformers in PyTorch

*Author:* Thomas Adler

*Copyright statement:* This  material,  no  matter  whether  in  printed  or  electronic  form,  may  be  used  for  personal  and non-commercial educational use only.  Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

In this assignment we will implement and train a small transformer model and compare it to the LSTM in the previous assignment.

## Exercise 1: Causal Self-Attention

Write a class named `CausalSelfAttention` that derives from `nn.Module` and whose `__init__` method takes (apart from the trivial `self`) one argument `hidden_size`. Implement a method `forward` that takes an input sequence `x` of shape $(N, T, D)$ (where $N$ is batch size, $T$ is sequence length, $D$ is hidden size) and performs scaled dot-product self-attention, i.e.,
$$
Y = \operatorname{softmax}\left(\frac{1}{\sqrt{D}} Q K^\top\right) V,
$$
where $Q = X W_Q$ and $K = X W_K$ and $V = X W_V$ and $X \in \mathbb{R}^{T \times D}$ and $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$ and softmax is applied in a row-wise manner and neglecting bias units.
It is called self-attention because $Q, K, V$ are all computed from the same input $X$, which hence attends to itself.

To have the attention be *causal* we need to make sure that we do not allow peeks into the future. That is, the output at time $t$ must be a function of the input at times $1, \dots, t$ but no further. The score matrix $E = \frac{1}{\sqrt{D}} Q K^\top$ has a shape of $T \times T$ and the entry $e_{ij}$ measures how strong the query at time $i$ attends to the key at time $j$. Therefore, positions where $j > i$ constitute peeks into the future and we have to set the corresponding attention values (i.e., the softmax-activated score) to zero. We can do that by setting the corresponding score to `float('-inf')`, which has the advantage that the normalization is adjusted automatically by the softmax.

In [50]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
import time
########## YOUR SOLUTION HERE ##########

class CausalSelfAttention(nn.Module):
    def __init__(self, hidden_size):
        super(CausalSelfAttention, self).__init__()
        self.hidden_size = hidden_size

        # Define the linear transformations for Q, K, and V
        self.W_Q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_K = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_V = nn.Linear(hidden_size, hidden_size, bias=False)

        # Scale factor for the dot products
        self.scale = 1 / math.sqrt(hidden_size)

    def forward(self, x):
        N, T, D = x.shape

        # Compute Q, K, V
        Q = self.W_Q(x)
        K = self.W_K(x)
        V = self.W_V(x)

        # Compute the dot product between queries and keys, and scale
        E = torch.bmm(Q, K.transpose(1, 2)) * self.scale

        # Mask out future positions (set them to -infinity)
        mask = torch.tril(torch.ones(T, T)).unsqueeze(0).to(x.device)
        E.masked_fill_(mask == 0, float('-inf'))

        # Apply softmax to get the attention weights
        attn_weights = F.softmax(E, dim=-1)

        # Multiply the attention weights with V
        Y = torch.bmm(attn_weights, V)

        return Y



In [51]:
# Example 
hidden_size = 128
model = CausalSelfAttention(hidden_size)
x = torch.randn(1, 10, hidden_size)  
output = model(x)
output

tensor([[[ 1.3778, -0.4509,  0.2248,  ..., -0.5050, -0.3929, -0.0457],
         [ 0.4761,  0.5957,  0.6212,  ..., -0.7274, -0.6320, -0.6785],
         [ 0.1303,  0.1449,  0.4196,  ..., -0.5367, -0.1259, -0.2969],
         ...,
         [ 0.1101, -0.1638,  0.1336,  ..., -0.4785, -0.0183, -0.1362],
         [ 0.0949, -0.0545, -0.0398,  ...,  0.0605,  0.1543, -0.2989],
         [-0.1204, -0.1010, -0.0717,  ..., -0.2489,  0.0578, -0.1628]]],
       grad_fn=<BmmBackward0>)

## Exercise 2: Multi-Head Attention

Write a class `MultiHeadCausalSelfAttention` that derives from `nn.Module` and extends the functionality of `CausalSelfAttention` from the previous exercise.
The `__init__` method takes arguments `hidden_size, n_head, dropout`. `n_head` specifies the number of attention heads and `dropout` specifies the intensity for the dropout layers.
The `forward` method should split the hidden dimension of the pre-activations (i.e., $Q, K, V$) in `n_head` equally sized parts and perform attention to these parts in parallel.
Apply the first dropout layer direcly after the softmax.
After the multiplication of the scores with the values, recombine the output of the distinct attention heads back into a single hidden dimension of size $D$, i.e., the resulting shape should be the shape of the input.
Then perform an additional output projection again resulting in a hidden dimension of $D$.
Finally, apply the second dropout layer after the output projection.

In [53]:
########## YOUR SOLUTION HERE ##########
class MultiHeadCausalSelfAttention(nn.Module):
    def __init__(self, hidden_size, n_head, dropout):
        super(MultiHeadCausalSelfAttention, self).__init__()
        self.hidden_size = hidden_size
        self.n_head = n_head
        self.dropout = dropout

        # Ensure the hidden size is divisible by the number of heads
        assert hidden_size % n_head == 0, "Hidden size must be divisible by the number of heads."

        self.head_dim = hidden_size // n_head

        # Linear transformations for Q, K, and V
        self.W_Q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_K = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_V = nn.Linear(hidden_size, hidden_size, bias=False)

        # Output linear transformation
        self.out_proj = nn.Linear(hidden_size, hidden_size)

        # Dropout layers
        self.attn_dropout = nn.Dropout(dropout)
        self.proj_dropout = nn.Dropout(dropout)

        self.scale = 1 / math.sqrt(self.head_dim)

    def forward(self, x):
        N, T, D = x.shape

        # Compute Q, K, V
        Q = self.W_Q(x).view(N, T, self.n_head, self.head_dim).transpose(1, 2)
        K = self.W_K(x).view(N, T, self.n_head, self.head_dim).transpose(1, 2)
        V = self.W_V(x).view(N, T, self.n_head, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention
        E = torch.matmul(Q, K.transpose(-2, -1)) * self.scale

        # Mask future positions
        mask = torch.tril(torch.ones(T, T)).unsqueeze(0).unsqueeze(1).to(x.device)
        E = E.masked_fill(mask == 0, float('-inf'))

        # Softmax and dropout
        attn_weights = F.softmax(E, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)

        # Multiply attention weights with V
        Y = torch.matmul(attn_weights, V)

        # Recombine the heads
        Y = Y.transpose(1, 2).contiguous().view(N, T, D)

        # Output projection and dropout
        Y = self.out_proj(Y)
        Y = self.proj_dropout(Y)

        return Y



In [54]:
# Example 
hidden_size = 128
n_head = 8
dropout = 0.1
model = MultiHeadCausalSelfAttention(hidden_size, n_head, dropout)
x = torch.randn(1, 10, hidden_size)  
output = model(x)
output

tensor([[[-0.6956,  0.3825, -0.2999,  ...,  0.0163,  0.0000, -0.1558],
         [-0.5959, -0.1812,  0.0973,  ..., -0.2410,  0.1236, -0.0000],
         [-0.3223,  0.1045,  0.1389,  ..., -0.4915,  0.1925,  0.1823],
         ...,
         [-0.1890,  0.0134,  0.0197,  ..., -0.0000, -0.1011,  0.1121],
         [-0.2518, -0.0104,  0.1993,  ..., -0.0000, -0.1611, -0.0221],
         [-0.2400,  0.0327,  0.0367,  ..., -0.2354, -0.1295,  0.0158]]],
       grad_fn=<MulBackward0>)

## Exercise 3: Multi-Layer Perceptron

Write a class `MLP` that derives from `nn.Module` and whose `__init__` method takes two arguments: `hidden_size` and `dropout`.
It should implement a 2-layer feedforward network with `hidden_size` inputs, `4*hidden_size` hiddens, and `hidden_size` outputs.
It should apply the GELU activation function to the hiddens and dropout to the outputs.

In [55]:
########## YOUR SOLUTION HERE ##########
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, hidden_size, dropout):
        super(MLP, self).__init__()
        self.hidden_size = hidden_size
        self.dropout = dropout

        # First layer of the MLP
        self.fc1 = nn.Linear(hidden_size, 4 * hidden_size)

        # Second layer of the MLP
        self.fc2 = nn.Linear(4 * hidden_size, hidden_size)

        # Dropout layer
        self.dropout_layer = nn.Dropout(dropout)

    def forward(self, x):
        # Apply first layer with GELU activation
        x = F.gelu(self.fc1(x))

        # Apply second layer and dropout
        x = self.fc2(x)
        x = self.dropout_layer(x)

        return x




In [56]:
# Example 
hidden_size = 128
dropout = 0.1
mlp = MLP(hidden_size, dropout)
x = torch.randn(1, hidden_size)  # Example input
output = mlp(x)
output

tensor([[-0.1872,  0.0944, -0.1092,  0.2600,  0.0093, -0.0484,  0.0811, -0.1332,
          0.1413, -0.2939, -0.0675, -0.3201,  0.0483,  0.1950, -0.0668, -0.0880,
          0.0545,  0.2328,  0.1142, -0.0090,  0.2061,  0.0000,  0.3525, -0.0056,
         -0.2745,  0.1612, -0.0832, -0.4034,  0.1525,  0.2574, -0.0992, -0.0799,
          0.1850, -0.0963, -0.3289, -0.0000,  0.2234, -0.2237,  0.0424, -0.3002,
         -0.0000, -0.0000,  0.1193,  0.0962,  0.1384, -0.0549,  0.2893, -0.0873,
         -0.0102,  0.4048,  0.3519, -0.1092, -0.0000,  0.0697,  0.0000,  0.3254,
         -0.0241, -0.0390, -0.2991, -0.1073,  0.2787,  0.1819, -0.0318,  0.0377,
         -0.2186, -0.1565, -0.2514, -0.0404,  0.0104, -0.4181,  0.2886,  0.0000,
         -0.0277,  0.0000, -0.1043,  0.0426,  0.3145, -0.2582,  0.0105,  0.0687,
          0.1905,  0.1326, -0.2883,  0.0371,  0.0871, -0.0859, -0.0649, -0.2406,
          0.0000,  0.0986, -0.0915, -0.2288, -0.0565,  0.1765, -0.0000,  0.1401,
         -0.1310, -0.2596, -

## Exercise 4: Block

Write a class `Block` that derives from `nn.Module` and whose `__init__` method takes arguments `hidden_size, n_head, dropout`.
It should apply `nn.LayerNorm`, `CausalMultiHeadSelfAttention`, `nn.LayerNorm`, `MLP` in that order and feature residual connections from the input to the output of `CausalMultiHeadSelfAttention` and from there to the output of `MLP`.

In [57]:
########## YOUR SOLUTION HERE ##########
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, hidden_size, n_head, dropout):
        super(Block, self).__init__()
        self.hidden_size = hidden_size

        # Layer Normalization before Multi-Head Self-Attention
        self.norm1 = nn.LayerNorm(hidden_size)

        # Causal Multi-Head Self-Attention
        self.attention = MultiHeadCausalSelfAttention(hidden_size, n_head, dropout)

        # Layer Normalization before MLP
        self.norm2 = nn.LayerNorm(hidden_size)

        # Multi-Layer Perceptron
        self.mlp = MLP(hidden_size, dropout)

    def forward(self, x):
        # Apply LayerNorm and Multi-Head Self-Attention with Residual Connection
        x = x + self.attention(self.norm1(x))

        # Apply LayerNorm and MLP with Residual Connection
        x = x + self.mlp(self.norm2(x))

        return x




In [58]:
# Example 
hidden_size = 128
n_head = 8
dropout = 0.1
block = Block(hidden_size, n_head, dropout)
x = torch.randn(1, 10, hidden_size)  # Example input
output = block(x)
output

tensor([[[ 1.5151, -1.4002,  0.4259,  ...,  1.2822, -0.1182, -0.5810],
         [ 0.3529, -0.4463,  0.3599,  ..., -0.2832, -0.9787,  1.1137],
         [ 0.5080, -0.1147,  0.9956,  ..., -1.8726, -0.8032,  0.4226],
         ...,
         [-1.3272, -1.9147, -0.0984,  ..., -0.9359,  0.0621,  1.3531],
         [-0.7226, -0.6156,  0.3373,  ..., -1.2603, -2.0457,  1.1635],
         [ 0.6738,  0.8762,  1.2101,  ...,  0.3614,  1.0999,  0.8395]]],
       grad_fn=<AddBackward0>)

## Exercise 5: GPT

Write a class `GPT` that derives from `nn.Module` and whose `__init__` method takes arguments `vocab_size, context_size, hidden_size, n_layer, n_head, dropout`.
The `forward` method should take two arguments `x, y` representing sequences of input and target tokens, respectively, both of which have type `torch.long` and shape ($N$, $T$), and returns logits and loss as a tuple.
The `GPT` module should feature two `nn.Embedding` layers, one for token embeddings and one for positional embedding, i.e., it should embed the position of the corresponding token within the input sequence.
The positional embedding is necessary for the Transformer to determine the order of its inputs.
Add the two embeddings and apply a dropout layer.
Next, apply `n_layers` layers of `Block`s followed by a `nn.LayerNorm` and a `nn.Linear` (without bias) mapping to an output dimension of `vocab_size`.
Finally, apply the cross-entropy loss function to the logits.
To save some parameters, apply weight tying between the token embedding layer and the output layer, i.e., they should use the same weights.
Initialize all weights using a normal distribution with a mean of zero and a standard deviation of 0.02 (except for the output layers of the `MLP`s use $0.02/\sqrt{2 * \mathtt{n\_layer}}$) and all biases to zero.
Use the argument `dropout` as intensity for all dropout layers in the network.

In [59]:
########## YOUR SOLUTION HERE ##########import torch

import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, vocab_size, context_size, hidden_size, n_layer, n_head, dropout):
        super(GPT, self).__init__()
        self.vocab_size = vocab_size
        self.context_size = context_size
        self.hidden_size = hidden_size

        # Token and positional embeddings
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(context_size, hidden_size)

        # Apply weight tying between token embeddings and the output layer
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.token_embeddings.weight

        self.dropout = nn.Dropout(dropout)

        # GPT Blocks
        self.blocks = nn.ModuleList([Block(hidden_size, n_head, dropout) for _ in range(n_layer)])

        # Final Layer Normalization
        self.norm = nn.LayerNorm(hidden_size)

        self.init_weights(n_layer)

    def init_weights(self, n_layer):
        for name, p in self.named_parameters():
            if p.dim() > 1:
                if 'mlp' in name and 'weight' in name:
                    nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
                else:
                    nn.init.normal_(p, mean=0.0, std=0.02)
            else:
                nn.init.zeros_(p)

    def forward(self, x, y):
        N, T = x.size()

        # Token and position embeddings
        token_embeddings = self.token_embeddings(x)
        position_embeddings = self.position_embeddings(torch.arange(T, device=x.device)).unsqueeze(0).repeat(N, 1, 1)
        x = self.dropout(token_embeddings + position_embeddings)

        # Apply GPT Blocks
        for block in self.blocks:
            x = block(x)

        x = self.norm(x)

        # Output linear layer to get logits
        logits = self.lm_head(x)

        # Compute cross-entropy loss
        loss = F.cross_entropy(logits.view(-1, self.vocab_size), y.view(-1))

        return logits, loss





## Exercise 6: Optimizer

Add a method `configure_optimizers` to the class `GPT` that takes arguments `weight_decay, learning_rate, betas`.
Divide the model parameters into two groups.
The first group consists of all parameters with at least 2 dimensions, e.g., weight/embedding matrices and uses a decay of `weight_decay`.
The second group consists of all other parameters, e.g., biases and layer norms, and does not use weight decay.
Construct and return a `torch.optim.AdamW` optimizer with `learning_rate` and `betas` that operates on these two parameter groups.

In [64]:
########## YOUR SOLUTION HERE ##########
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim  # Import torch.optim

class GPT(nn.Module):
    def __init__(self, vocab_size, context_size, hidden_size, n_layer, n_head, dropout):
        super(GPT, self).__init__()
        self.vocab_size = vocab_size
        self.context_size = context_size
        self.hidden_size = hidden_size

        # Token and positional embeddings
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(context_size, hidden_size)

        # Apply weight tying between token embeddings and the output layer
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.token_embeddings.weight

        self.dropout = nn.Dropout(dropout)

        # GPT Blocks
        self.blocks = nn.ModuleList([Block(hidden_size, n_head, dropout) for _ in range(n_layer)])

        # Final Layer Normalization
        self.norm = nn.LayerNorm(hidden_size)

        self.init_weights(n_layer)

    def init_weights(self, n_layer):
        for name, p in self.named_parameters():
            if p.dim() > 1:
                if 'mlp' in name and 'weight' in name:
                    nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
                else:
                    nn.init.normal_(p, mean=0.0, std=0.02)
            else:
                nn.init.zeros_(p)

    def forward(self, x, y):
        N, T = x.size()

        # Token and position embeddings
        token_embeddings = self.token_embeddings(x)
        position_embeddings = self.position_embeddings(torch.arange(T, device=x.device)).unsqueeze(0).repeat(N, 1, 1)
        x = self.dropout(token_embeddings + position_embeddings)

        # Apply GPT Blocks
        for block in self.blocks:
            x = block(x)

        x = self.norm(x)

        # Output linear layer to get logits
        logits = self.lm_head(x)

        # Compute cross-entropy loss
        loss = F.cross_entropy(logits.view(-1, self.vocab_size), y.view(-1))

        return logits, loss






    def configure_optimizers(self, weight_decay, learning_rate, betas):
        # Group parameters: weight matrices (2 or more dimensions) and others
        decay_params = [p for name, p in self.named_parameters() if p.dim() > 1]
        no_decay_params = [p for name, p in self.named_parameters() if p.dim() == 1]

        # Create two parameter groups
        param_groups = [
            {"params": decay_params, "weight_decay": weight_decay},
            {"params": no_decay_params, "weight_decay": 0.0}
        ]

        # Construct AdamW optimizer with the specified parameters and hyperparameters
        optimizer = optim.AdamW(param_groups, lr=learning_rate, betas=betas)

        return optimizer






## Exercise 7: Training

In the code cell below you find some globals, helper functions, and boilerplate code. Extend the given code by a training loop that
* stops after `max_iters` iterations
* applies the learning rate schedule implemented in `get_lr`
* applies gradient clipping at `grad_clip` using `torch.nn.utils.clip_grad_norm_`
* accumulates gradients for `gradient_accumulation_steps` batches before each weight update
* logs the training loss and learning rate every `log_interval` iterations
* evaluates (and potentially checkpoints) the model using `estimate_loss` every `eval_iters` iterations.

The provided hyperparameter values should be a good guess for training a tiny model on CPU but feel free to experiment with them as you please. In particular, if you have a GPU available, you can try to scale things up a bit.

In [47]:
eval_interval = 250 # validate model every .. iterations
log_interval = 10 # log training loss every .. iterations
eval_iters = 20 # number of batches for loss estimation
gradient_accumulation_steps = 5 * 8 # used to simulate larger training batch sizes
batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
context_size = 64 # sequence length
vocab = 'abcdefghijklmnopqrstuvwxyz0123456789 .!?' # vocabulary
vocab_size = len(vocab) # 40
n_layer = 16 # number of layers
n_head = 16 # number of attention heads
hidden_size = 128 # layer size
dropout = 0.1 # for pretraining 0 is good, for finetuning try 0.1+
learning_rate = 1e-3 # max learning rate
max_iters = 2000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9 # for AdamW
beta2 = 0.99 # for AdamW
grad_clip = 1.0 # clip gradients at this value, or disable with 0.0
warmup_iters = 100 # how many steps to warm up for
min_lr = 1e-4 # minimum learning rate, usually ~= learning_rate/10

# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > max_iters, return min learning rate
    if it > max_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (max_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

def load_data(split):
    import re

    with open(f'trump_{split}.txt', 'r') as f:
        text = f.read()

    text = text.lower() # convert to lower case
    text = re.sub('[^a-z0-9 .!?]', ' ', text) # replace all unknown chars with ' '
    text = re.sub(' +', ' ', text) # reduce multiple blanks to one
    text = [vocab.index(t) for t in text]
    text = torch.tensor(text, dtype=torch.long)
    return text

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - context_size, (batch_size,))
    x = torch.stack([data[i:i+context_size] for i in ix])
    y = torch.stack([data[i+1:i+1+context_size] for i in ix])
    return x, y

# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# data, model, optimizer, etc.
train_data = load_data('train')
val_data = load_data('val')
model = GPT(vocab_size, context_size, hidden_size, n_layer, n_head ,dropout)
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2))
iter_num = 0
best_val_loss = 1e9
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()

########## YOUR SOLUTION HERE ##########

# Training Loop
for iter_num in range(max_iters):
    # Adjust learning rate according to the schedule
    lr = get_lr(iter_num)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Initialize gradients
    if iter_num % gradient_accumulation_steps == 0:
        optimizer.zero_grad()

    # Get a batch of data
    X, Y = get_batch('train')

    # Forward pass
    logits, loss = model(X, Y)

    # Normalize loss to account for gradient accumulation
    loss = loss / gradient_accumulation_steps
    loss.backward()

    # Perform optimization step after accumulating gradients
    if (iter_num + 1) % gradient_accumulation_steps == 0:
        # Apply gradient clipping
        if grad_clip > 0.0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        
        optimizer.step()

    # Log training progress
    if iter_num % log_interval == 0:
        print(f"Iteration {iter_num}, Loss: {loss.item():.4f}, Learning Rate: {lr:.6f}")

    # Evaluate the model
    if iter_num % eval_interval == 0 and iter_num > 0:
        val_loss = estimate_loss()['val']
        print(f"Validation Loss at Iteration {iter_num}: {val_loss:.4f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            # Save the model checkpoint here if desired

    # Time keeping
    if iter_num % 100 == 0 and iter_num > 0:
        t1 = time.time()
        print(f"{100 / (t1 - t0)} iterations/sec")
        t0 = t1


 


Iteration 0, Loss: 0.0922, Learning Rate: 0.000000
Iteration 10, Loss: 0.0922, Learning Rate: 0.000100
Iteration 20, Loss: 0.0922, Learning Rate: 0.000200
Iteration 30, Loss: 0.0922, Learning Rate: 0.000300
Iteration 40, Loss: 0.0922, Learning Rate: 0.000400
Iteration 50, Loss: 0.0922, Learning Rate: 0.000500
Iteration 60, Loss: 0.0922, Learning Rate: 0.000600
Iteration 70, Loss: 0.0922, Learning Rate: 0.000700
Iteration 80, Loss: 0.0922, Learning Rate: 0.000800
Iteration 90, Loss: 0.0922, Learning Rate: 0.000900
Iteration 100, Loss: 0.0922, Learning Rate: 0.001000
3.2724000536029276 iterations/sec
Iteration 110, Loss: 0.0922, Learning Rate: 0.001000
Iteration 120, Loss: 0.0922, Learning Rate: 0.001000
Iteration 130, Loss: 0.0922, Learning Rate: 0.000999
Iteration 140, Loss: 0.0922, Learning Rate: 0.000999
Iteration 150, Loss: 0.0922, Learning Rate: 0.000998
Iteration 160, Loss: 0.0921, Learning Rate: 0.000998
Iteration 170, Loss: 0.0921, Learning Rate: 0.000997
Iteration 180, Loss: 0.

## Exercise 8: Inference

Add a method `generate` to the class `GPT` that takes arguments `x, max_new_tokens, temperature=1.0`.
The method should take a batch of token sequences `x`, which it should extend by `max_new_tokens` new tokens generated by the model.
Once you have computed the logits for the next token, divide them by `temperature` before applying the softmax.
After applying the softmax, sample the next token from the resulting categorical distribution.
Try out different values for `temperature` and compare the results to those from the previous assignment.

In [None]:
########## YOUR SOLUTION HERE ##########

In [48]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim  # Import torch.optim

class GPT(nn.Module):
    def __init__(self, vocab_size, context_size, hidden_size, n_layer, n_head, dropout):
        super(GPT, self).__init__()
        self.vocab_size = vocab_size
        self.context_size = context_size
        self.hidden_size = hidden_size

        # Token and positional embeddings
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(context_size, hidden_size)

        # Apply weight tying between token embeddings and the output layer
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.token_embeddings.weight

        self.dropout = nn.Dropout(dropout)

        # GPT Blocks
        self.blocks = nn.ModuleList([Block(hidden_size, n_head, dropout) for _ in range(n_layer)])

        # Final Layer Normalization
        self.norm = nn.LayerNorm(hidden_size)

        self.init_weights(n_layer)

    def init_weights(self, n_layer):
        for name, p in self.named_parameters():
            if p.dim() > 1:
                if 'mlp' in name and 'weight' in name:
                    nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
                else:
                    nn.init.normal_(p, mean=0.0, std=0.02)
            else:
                nn.init.zeros_(p)

    def forward(self, x, y):
        N, T = x.size()

        # Token and position embeddings
        token_embeddings = self.token_embeddings(x)
        position_embeddings = self.position_embeddings(torch.arange(T, device=x.device)).unsqueeze(0).repeat(N, 1, 1)
        x = self.dropout(token_embeddings + position_embeddings)

        # Apply GPT Blocks
        for block in self.blocks:
            x = block(x)

        x = self.norm(x)

        # Output linear layer to get logits
        logits = self.lm_head(x)

        # Compute cross-entropy loss
        loss = F.cross_entropy(logits.view(-1, self.vocab_size), y.view(-1))

        return logits, loss






    def configure_optimizers(self, weight_decay, learning_rate, betas):
        # Group parameters: weight matrices (2 or more dimensions) and others
        decay_params = [p for name, p in self.named_parameters() if p.dim() > 1]
        no_decay_params = [p for name, p in self.named_parameters() if p.dim() == 1]

        # Create two parameter groups
        param_groups = [
            {"params": decay_params, "weight_decay": weight_decay},
            {"params": no_decay_params, "weight_decay": 0.0}
        ]

        # Construct AdamW optimizer with the specified parameters and hyperparameters
        optimizer = optim.AdamW(param_groups, lr=learning_rate, betas=betas)

        return optimizer


    def generate(self, x, max_new_tokens, temperature=1.0):
        self.eval()  # Set the model to evaluation mode
        for _ in range(max_new_tokens):
            logits, _ = self(x, x)  # Obtain logits for the next token
            # Get logits of the last token in the sequence
            logits = logits[:, -1, :] / temperature
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)
            # Sample from the categorical distribution
            next_token = torch.multinomial(probs, num_samples=1)
            # Append the sampled token to the sequence
            x = torch.cat((x, next_token), dim=1)
        return x



# Create an instance of the GPT model
model = GPT(vocab_size, context_size, hidden_size, n_layer, n_head, dropout)

# Prepare input data
x_start = "Hel"
x_start = x_start.lower()
x_indices = [vocab.index(char) for char in x_start]
x = torch.tensor([x_indices], dtype=torch.long)

# Set generation parameters
max_new_tokens = 10
temperature = 0.7

# Generate the sequence
generated_sequence = model.generate(x, max_new_tokens, temperature)

# Convert generated indices back to characters
generated_text = ''.join([vocab[idx] for idx in generated_sequence[0].tolist()])
print(generated_text)




helhf.v0z5jvt


In [49]:

model = GPT(vocab_size, context_size, hidden_size, n_layer, n_head, dropout)

# Prepare input data
x_start = "Hel"
x_start = x_start.lower()
x_indices = [vocab.index(char) for char in x_start]
x = torch.tensor([x_indices], dtype=torch.long)

# Set up a range of temperature values to try
temperatures = [ 0.01,0.1,0.5, 0.7, 1.0, 1.5, 2.0]
max_new_tokens = 10

for temp in temperatures:
    # Generate the sequence with the current temperature
    generated_sequence = model.generate(x, max_new_tokens, temp)

    # Convert generated indices back to characters
    generated_text = ''.join([vocab[idx] for idx in generated_sequence[0].tolist()])

    print(f"Temperature {temp}: {generated_text}")



Temperature 0.01: helqvt011cu x
Temperature 0.1: helzpzqooi.pw
Temperature 0.5: hel3q7dty9vep
Temperature 0.7: heliq30ply9.4
Temperature 1.0: helntd?8vij29
Temperature 1.5: hel7n 3j65of9
Temperature 2.0: helt7875kcq9!
