# HW5: Transformer



This assignment will introduce you to:

1. Understanding the structure of the transformer.

2. Building a GPT model step by step.

3. Training a GPT language model to write a few sentences.

You can run this assignment on Colab or SCC. You are encouraged to use a GPU to make training faster, but be aware of the Google Colab GPU limit: you can use a GPU for less than 12 hours continuously, after which you may lose access for some time unless you purchase Colab Pro.



## Q1 Sequence to Sequence Modelling with nn.Transformer (20 points)

You will implement part of a transformer. This question aims to familiarize you with the transformer architecture proposed in the paper [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf). It is adapted from the original PyTorch tutorial [here](https://pytorch.org/tutorials/beginner/transformer_tutorial.html?highlight=transformer), which you can refer to when you fill out the code. The general architecture of the transformer is shown in the figure below:

<img src="https://pytorch.org/tutorials/_images/transformer_architecture.jpg" width="360em">

This question requires you to implement a sequence-to-sequence model using only the encoder, which is the left part of the figure. You will use PyTorch's built-in layers.

The transformer model has been shown to be superior in quality for many sequence-to-sequence
problems while being more parallelizable. The ``nn.Transformer`` module
relies entirely on an attention mechanism (implemented in PyTorch as
`nn.MultiheadAttention`) to draw global dependencies
between input and output. The ``nn.Transformer`` module is highly
modularized such that a single component (like [`nn.TransformerEncoder`](<https://pytorch.org/docs/master/nn.html?highlight=nn%20transformerencoder#torch.nn.TransformerEncoder>)
in this tutorial) can be easily adapted/composed.

### Q1.1 Define the model 
In this question, we train an ``nn.TransformerEncoder`` model on a
language modeling task. The language modeling task is to assign a
probability for the likelihood of a given word (or a sequence of words)
to follow a sequence of words. A sequence of tokens is passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the words (see the next paragraph for more details). The
``nn.TransformerEncoder`` consists of multiple layers of
``nn.TransformerEncoderLayer``. Along with the input sequence, a square
attention mask is required because the self-attention layers in
``nn.TransformerEncoder`` are only allowed to attend to the earlier positions in
the sequence. For the language modeling task, any tokens at future
positions should be masked. To obtain the actual words, the output
of the ``nn.TransformerEncoder`` model is sent to the final Linear
layer, which is followed by a log-Softmax function. We will see how to implement ``PositionalEncoding`` in a later question.
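To make the masking concrete, here is a minimal pure-Python sketch (no PyTorch). The helper names `causal_mask` and `softmax` are illustrative assumptions, not part of the assignment code. It builds the additive mask described above, with `0` on and below the diagonal and `-inf` above it, and shows that after softmax the masked (future) positions receive zero probability.

```python
import math

def causal_mask(sz):
    """Additive mask: 0.0 on and below the diagonal, -inf above it."""
    return [[0.0 if j <= i else float("-inf") for j in range(sz)]
            for i in range(sz)]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Uniform raw scores for a 3-token sequence, then add the mask entry by entry.
scores = [[1.0, 1.0, 1.0] for _ in range(3)]
mask = causal_mask(3)
masked = [[s + m for s, m in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, mask)]
probs = [softmax(row) for row in masked]

for row in probs:
    print([round(p, 3) for p in row])
```

With uniform scores, the first token can only attend to itself, the second splits its attention evenly over the first two positions, and only the last token sees the whole sequence.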

<img src="https://www.researchgate.net/publication/334288604/figure/fig1/AS:778232232148992@1562556431066/The-Transformer-encoder-structure.ppm">




In the following model, we only train an encoder model, which is the left part of the figure. We then append a linear layer, `self.decoder`, in place of the right part of the model.

In [1]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerModel(nn.Module):
    '''
    A transformer encoder model. The input arguments are as follows:
    args:
    ntoken:  size of the vocabulary (number of distinct tokens)
    ninp: dimension of the input embeddings
    nhead: number of heads in the multi-head attention module
    nhid: dimension of the feedforward network inside each TransformerEncoderLayer
    nlayers: number of TransformerEncoderLayer layers
    dropout: dropout probability
    '''
    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        from torch.nn import TransformerEncoder, TransformerEncoderLayer
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(ninp, dropout) # PositionalEncoding will be implemented in next section.
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    def generate_square_subsequent_mask(self, sz):
        """YOUR CODE HERE"""
        '''
        You can use torch.triu and masked_fill to get an upper-triangular mask.
        The upper-right entries are -inf; the lower-left entries, including the diagonal, are 0.
        '''
        mask= (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask= mask.float()
        mask= mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask
        

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
        """YOUR CODE HERE"""
        '''
        Fill the forward function according to the diagram above.
        After the embedding layer, scale the embeddings by the square root of
        self.ninp.
        '''
        src = self.pos_encoder(self.encoder(src) * math.sqrt(self.ninp))
        output = self.decoder(self.transformer_encoder(src, src_mask))
        return output
        

### Q1.2 Positional Encoding
#### Q1.2.1 Fill the code block
``PositionalEncoding`` module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies.
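For reference, the sinusoidal scheme in "Attention is all you need" sets $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The sketch below computes the same table in plain Python so the values can be inspected by hand. The function name `positional_encoding` is hypothetical, and the sketch assumes an even `d_model`.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings (d_model even)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # i walks over the even dimensions; each (i, i + 1) pair shares a frequency.
        for i in range(0, d_model, 2):
            freq = math.exp(-math.log(10000.0) * i / d_model)
            pe[pos][i] = math.sin(pos * freq)
            pe[pos][i + 1] = math.cos(pos * freq)
    return pe

pe = positional_encoding(max_len=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating zeros and ones (sin 0 and cos 0), and higher dimensions change more slowly with position because their frequency is lower.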


In [2]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        """YOUR CODE HERE"""
        P=torch.arange(max_len).unsqueeze(1)
        e=torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        Pe=torch.zeros(max_len, d_model)
        Pe[:, 0::2] = torch.sin(P * e)
        Pe[:, 1::2] = torch.cos(P * e)
        Pe = Pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', Pe)


    def forward(self, x):
        """YOUR CODE HERE"""
        return self.dropout(x + self.pe[:x.size(0), :])

#### Q1.2.2 Why do we need this positional encoding in the transformer architecture?

We need positional encoding in the Transformer architecture because the Transformer uses self-attention mechanisms to process the input sequence. Self-attention allows the model to attend to different parts of the input sequence during processing, but it does not take into account the order or position of the tokens in the sequence.

To overcome this limitation, we use positional encoding to inject information about the relative or absolute position of the tokens in the input sequence. By adding the positional encoding to the input embeddings, the Transformer can capture the order of the tokens and attend to them accordingly.
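The claim that self-attention by itself ignores token order can be checked directly. The sketch below is a simplified illustration under stated assumptions: a single unmasked head with no learned projections (Q = K = V = the input), and a hypothetical helper named `self_attention`. Shuffling the input tokens simply shuffles the outputs in the same way, so nothing in the result depends on where a token sat in the sequence.

```python
import math

def self_attention(xs):
    """Unmasked single-head self-attention with Q = K = V = xs (no learned weights)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, xs))
                    for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
order = [2, 0, 1]                      # an arbitrary reordering of the tokens
shuffled = [tokens[i] for i in order]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)
expected = [out[i] for i in order]     # the original outputs, reordered the same way

same = all(abs(a - b) < 1e-9
           for row_a, row_b in zip(out_shuffled, expected)
           for a, b in zip(row_a, row_b))
print(same)  # True
```

Adding a position-dependent vector to each token before attention breaks this symmetry, which is what the positional encoding is for.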

## Q2 Transformer Block for GPT (30 points)

### Q 2.1 Multi-head self-attention
#### Q 2.1.1 The first part is multi-head self-attention. In this layer, you will need to:
- Apply linear projections to convert the feature vector at each token into separate vectors for the query, key, and value. The input and output sizes of each linear projection are both `n_embd`.
- Apply attention, scaling the logits by $\frac{1}{\sqrt{d_{qkv}}}$.
- Ensure proper masking, such that padding tokens are never attended to.
- Perform attention `n_head` times in parallel, where the results are concatenated and then projected using a linear layer.

<img src="https://www.researchgate.net/publication/332139525/figure/fig3/AS:743081083158528@1554175744311/a-The-Transformer-model-architecture-b-left-Scaled-Dot-Product-Attention.ppm" width="360em">

You should include two types of dropout in your code (with probability set by the  `dropout` argument):
- Dropout should be applied to the output of the attention layer (just prior to the residual connection, denoted by "Add & Norm" in the first figure)
- Dropout should *also* be applied after the final projection.

Notes:
- Query, key, and value vectors should have shape `[batch_size, n_heads, sequence_len, d_qkv]`
- Apply a mask to the scaled dot product of Q and K before the softmax function. You can use `torch.tril` or `torch.triu` to create the mask; usually we define it as a lower triangular matrix whose lower-left entries (including the diagonal) are ones and whose remaining entries are zeros. Then apply `tensor.masked_fill()` to the scaled dot product of Q and K (which is also the input to softmax): wherever the mask is zero, set the softmax input to a negative number with very large magnitude.
- Attention logits and probabilities should have shape `[batch_size, n_heads, sequence_len, sequence_len]`
- Vaswani et al. define the output of the attention layer as concatenating the various heads and then multiplying by a matrix $W^O$. It's also possible to implement this as a sum without ever calling `torch.cat`: note that $\text{Concat}(head_1, \ldots, head_h)W^O = head_1 W^O_1 + \ldots + head_h W^O_h$ where $W^O = \begin{bmatrix} W^O_1\\ \vdots\\ W^O_h\end{bmatrix}$. You may define the `self.proj` this way.
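The last note, that concatenating the heads and multiplying by $W^O$ equals summing per-head products, can be verified on a tiny example. The sketch below uses plain Python lists with made-up integer values; the helpers `matmul` and `add` are illustrative, not part of the assignment.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Two heads, each producing a (2 tokens x 2 dims) output; values are arbitrary.
head_1 = [[1, 2], [3, 4]]
head_2 = [[5, 6], [7, 8]]

# W^O has shape (4 x 3); its top two rows act on head_1, the bottom two on head_2.
w_o = [[1, 0, 2],
       [0, 1, 1],
       [2, 1, 0],
       [1, 1, 1]]
w_o_1, w_o_2 = w_o[:2], w_o[2:]

concat = [r1 + r2 for r1, r2 in zip(head_1, head_2)]   # shape (2 x 4)
via_concat = matmul(concat, w_o)
via_sum = add(matmul(head_1, w_o_1), matmul(head_2, w_o_2))
print(via_concat == via_sum)  # True
```

In PyTorch, a single `nn.Linear(n_embd, n_embd)` applied to the reshaped `[batch_size, sequence_len, n_embd]` tensor realizes the concatenated form without an explicit `torch.cat`.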


In [3]:
import math
import logging

import torch
import torch.nn as nn
from torch.nn import functional as F
class MultiHeadSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    You can also use torch.nn.MultiheadAttention to validate your implementation

    """

    def __init__(self, n_embd, n_head, block_size, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.n_embd = n_embd
        self.block_size = block_size

        self.key = nn.Linear(n_embd, n_embd)
        self.query = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        self.attn_drop = nn.Dropout(attn_pdrop)
        self.resid_drop = nn.Dropout(resid_pdrop)
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        B, T, C = x.size()
        H = self.n_head

        # Calculate query, key, and value projections
        keys = self.key(x).view(B, T, H, C // H).transpose(1, 2)
        queries = self.query(x).view(B, T, H, C // H).transpose(1, 2)
        values = self.value(x).view(B, T, H, C // H).transpose(1, 2)

        # Calculate attention scores
        attn_scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(C // H)

        # Apply causal mask
        mask = torch.triu(torch.ones_like(attn_scores), diagonal=1).bool()
        attn_scores = attn_scores.masked_fill(mask, float('-inf'))

        # Calculate attention probabilities
        attn_probs = F.softmax(attn_scores, dim=-1)
        attn_probs = self.attn_drop(attn_probs)

        # Compute attended values
        attended_values = torch.matmul(attn_probs, values).transpose(1, 2).contiguous()
        attended_values = attended_values.view(B, T, C)

        # Project the concatenated heads and apply dropout; the residual
        # connection is added by the enclosing TransformerBlock
        y = self.resid_drop(self.proj(attended_values))
        return y
        

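#### Q 2.1.2 Why do we need to scale the dot product of Q and K?

Scaling by $\frac{1}{\sqrt{d_{qkv}}}$ keeps the attention logits in a range where softmax behaves well. The dot product of two vectors is the product of their magnitudes and the cosine of the angle between them, so it measures similarity, but its typical size depends on the dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_{qkv}$. Large logits push softmax toward a one-hot output, where its gradients are extremely small and training slows down. Dividing by $\sqrt{d_{qkv}}$ brings the variance of the logits back to about 1 regardless of the head dimension.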
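The effect of the scale factor can be estimated numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of a trained model, and the helper name `dot_variance` is hypothetical. Under that assumption the variance of the raw dot product grows roughly like the dimension `d`, while the scaled version stays near 1.

```python
import math
import random

def dot_variance(d, scale, trials=2000, seed=0):
    """Empirical variance of (q . k) * scale for standard-normal q, k of size d."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        vals.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(vals) / trials
    return sum((v - mean) ** 2 for v in vals) / trials

for d in (4, 64):
    raw = dot_variance(d, scale=1.0)
    scaled = dot_variance(d, scale=1.0 / math.sqrt(d))
    print(f"d={d}: unscaled variance ~{raw:.1f}, scaled variance ~{scaled:.2f}")
```

The exact numbers depend on the random seed, but the unscaled variance should land close to `d` in each case.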

### Q2.2 Transformer
We will implement the transformer block, which is the blue box in the figure. You can use the `nn.LayerNorm` layer to apply layer normalization. The feed-forward layer is defined as `self.mlp`.

Note that where to apply layer norm is a design choice; you can change it to see how it affects the final results in the application of Question 3.

<img src="https://www.researchgate.net/publication/358519951/figure/fig8/AS:1122134894092288@1644549215188/The-GPT-1-architecture-proposed-in-Radford-et-al-a-It-is-composed-of-12-stacked.ppm" width="240em">







In [4]:
class TransformerBlock(nn.Module):
    """ a Transformer block """

    def __init__(self, n_embd, n_head, block_size, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadSelfAttention(n_embd, n_head, block_size, attn_pdrop, resid_pdrop)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(resid_pdrop),
        )
        self.dropout = nn.Dropout(resid_pdrop)

    def forward(self, x):
        """YOUR CODE HERE"""
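        # Attention sub-layer: layer norm first, then attention, then the residual connection
        attn_out = self.attn(self.ln1(x))
        x = x + self.dropout(attn_out)
        # Feed-forward sub-layer, again with layer norm first and a residual connection
        return x + self.dropout(self.mlp(self.ln2(x)))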
        


## Q3 GPT on Addition (30 points)
In this question, we will train a GPT transformer to do addition. We first need to build the dataset and encode each addition equation as a sequence of integers, since we want GPT to work with sequences of integers and complete them according to patterns in the data.

The sum of two n-digit numbers is a number with up to (n+1) digits. So our
encoding will simply be the n-digit first number, the n-digit second number, and the (n+1)-digit result, all concatenated together. Because each addition problem is so structured, there is no need to bother the model with encoding +, =, or other tokens. Each possible sequence has the same length, and simply contains the raw digits of the addition problem. As a few examples, the 2-digit problems:
- 85 + 50 = 135 becomes the sequence `[8, 5, 5, 0, 1, 3, 5]`
- 6 + 39 = 45 becomes the sequence `[0, 6, 3, 9, 0, 4, 5]`

We will also only train GPT on the final (n+1) digits because the first two n-digit numbers are always assumed to be given. So when we give GPT an exam later, we will e.g. feed it the sequence `[0, 6, 3, 9]`, which encodes that we'd like to add 6 + 39, and hope that the model completes the integer sequence with `[0, 4, 5]` in 3 sequential steps.
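The encoding described above can be sketched in a few lines of plain Python. The function names `encode_addition` and `split_prompt_and_answer` are illustrative and do not appear in the assignment code.

```python
def encode_addition(a, b, ndigit):
    """Encode a + b as zero-padded digits: ndigit, ndigit, then ndigit + 1 for the sum."""
    text = f"{a:0{ndigit}d}{b:0{ndigit}d}{a + b:0{ndigit + 1}d}"
    return [int(ch) for ch in text]

def split_prompt_and_answer(seq, ndigit):
    """Separate the given operands from the digits the model must generate."""
    return seq[:2 * ndigit], seq[2 * ndigit:]

print(encode_addition(85, 50, ndigit=2))   # [8, 5, 5, 0, 1, 3, 5]
print(encode_addition(6, 39, ndigit=2))    # [0, 6, 3, 9, 0, 4, 5]

prompt, answer = split_prompt_and_answer(encode_addition(6, 39, ndigit=2), ndigit=2)
print(prompt, answer)                      # [0, 6, 3, 9] [0, 4, 5]
```

The prompt is what the model receives at test time, and the answer is what it should produce one digit at a time.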

In [5]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [6]:
from torch.utils.data import Dataset

class AdditionDataset(Dataset):
    """
    Define the addition dataset
    """

    def __init__(self, ndigit, split):
        self.split = split # train/test
        self.ndigit = ndigit
        self.vocab_size = 10 # 10 possible digits 0..9
        # +1 due to potential carry overflow, but then -1 because very last digit doesn't plug back
        self.block_size = ndigit + ndigit + ndigit + 1 - 1
        
        # split up all addition problems into either training data or test data
        num = (10**self.ndigit)**2 # total number of possible combinations
        r = np.random.RandomState(1337) # make deterministic
        perm = r.permutation(num)
        num_test = min(int(num*0.2), 1000) # 20% of the whole dataset, or only up to 1000
        self.ixes = perm[:num_test] if split == 'test' else perm[num_test:]

    def __len__(self):
        return self.ixes.size

    def __getitem__(self, idx):
        # given a problem index idx, first recover the associated a + b
        idx = self.ixes[idx]
        nd = 10**self.ndigit
        a = idx // nd
        b = idx %  nd
        c = a + b
        render = f'%0{self.ndigit}d%0{self.ndigit}d%0{self.ndigit+1}d' % (a,b,c) # e.g. 03+25=28 becomes "0325028" 
        dix = [int(s) for s in render] # convert each character to its token index
        # x will be input to GPT and y will be the associated expected outputs
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long) # predict the next token in the sequence
        y[:self.ndigit*2-1] = -100 # no loss at the input positions; -100 is the default ignore_index of F.cross_entropy
        return x, y

In [7]:
# create a dataset for e.g. 2-digit addition
ndigit = 2
train_dataset = AdditionDataset(ndigit=ndigit, split='train')
test_dataset = AdditionDataset(ndigit=ndigit, split='test')
train_dataset[0] # sample a training instance just to see what one raw example looks like

(tensor([4, 7, 1, 7, 0, 6]), tensor([-100, -100, -100,    0,    6,    4]))

### Q3.1 Define the GPT model 
Now, we start constructing the GPT model. As shown in the figure, there are 12 transformer blocks stacked together. In our model, we use `n_layer` to represent the number of blocks. In this question, you need to do the following:

- Define the `n_layer` transformer blocks `self.blocks`
- Fill out the forward function. Note that, unlike in the original transformer, the positional embedding is not hard-coded; it is learned during training.



<img src="https://www.researchgate.net/publication/358519951/figure/fig8/AS:1122134894092288@1644549215188/The-GPT-1-architecture-proposed-in-Radford-et-al-a-It-is-composed-of-12-stacked.ppm" width="240em">

In [8]:
class GPT(nn.Module):
    """  the full GPT language model, with a sequence size of block_size """

    def __init__(self, vocab_size, n_embd, n_head, block_size, n_layer, embd_pdrop=0.1, attn_pdrop=0.1,resid_pdrop=0.1):
        super().__init__()

        # input embedding stem
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))
        self.drop = nn.Dropout(embd_pdrop)
        # transformer
        """YOUR CODE HERE"""
        
        """ """
        self.blocks = nn.Sequential(*[
            TransformerBlock(n_embd, n_head, block_size, attn_pdrop, resid_pdrop) for _ in range(n_layer)
        ])
        # decoder head
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

        self.block_size = block_size
        self.apply(self._init_weights)

        logger.info("number of parameters: %e", sum(p.numel() for p in self.parameters()))


    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def configure_optimizers(self, train_config):
        """
        You don't need to change this function. This is setting specific parameters for optimization.
        """

        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn # full param name

                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # special case the position embedding parameter in the root GPT module as not decayed
        no_decay.add('pos_emb')

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                    % (str(param_dict.keys() - union_params), )

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

    def forward(self, x, targets=None):
        b, t = x.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."
        """YOUR CODE HERE"""
        token_embeddings = self.tok_emb(x)
        position_embeddings = self.pos_emb[:, :t, :]
        x = self.drop(token_embeddings + position_embeddings)

        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            return logits, loss
        else:
            return logits
        



### Q3.2 Training the model

##### You will train the GPT model. Fill out the code of the training process.

In [9]:
import math
import logging
from tqdm import tqdm
import numpy as np
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data.dataloader import DataLoader

Setting some parameters for training. Initialize the GPT model.

In [10]:
logger = logging.getLogger(__name__)
class TrainerConfig:
    # optimization parameters
    max_epochs = 150
    batch_size = 64
    learning_rate = 3e-4
    betas = (0.9, 0.95)
    grad_norm_clip = 1.0
    weight_decay = 0.1 # only applied on matmul weights
    # learning rate decay params: linear warmup followed by cosine decay to 10% of original
    lr_decay = False
    warmup_tokens = 375e6 # these two numbers come from the GPT-3 paper, but may not be good defaults elsewhere
    final_tokens = 260e9 # (at what point we reach 10% of original LR)
    # checkpoint settings
    ckpt_path = None
    num_workers = 0 # for DataLoader

    def __init__(self, **kwargs):
        for k,v in kwargs.items():
            setattr(self, k, v)
# initialize a baby GPT model
model = GPT(vocab_size = train_dataset.vocab_size, n_embd=128, n_head=4, block_size =  train_dataset.block_size, n_layer=2)
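The config comment above describes the learning rate schedule as linear warmup followed by cosine decay to 10% of the original rate. A minimal standalone sketch of that multiplier is shown below. The function name `lr_multiplier` and the token counts in the example are assumptions chosen for illustration, not values from the assignment.

```python
import math

def lr_multiplier(tokens, warmup_tokens, final_tokens):
    """Linear warmup to 1.0, then cosine decay floored at 0.1."""
    if tokens < warmup_tokens:
        return tokens / max(1, warmup_tokens)
    progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Hypothetical schedule: 100 warmup tokens, decay finished after 1100 tokens.
for t in (0, 50, 100, 600, 1100):
    print(t, round(lr_multiplier(t, warmup_tokens=100, final_tokens=1100), 3))
```

The multiplier rises from 0 to 1 during warmup, passes 0.5 halfway through the decay, and bottoms out at 0.1; the actual learning rate is this multiplier times `learning_rate`.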

You need to fill out the training process:
- Forward the model on the current batch `x`, `y`;
- Zero the gradients before the update;
- Backpropagate the loss and update the model parameters;
- You might want to use `torch.nn.utils.clip_grad_norm_`. Its `max_norm` parameter should be `config.grad_norm_clip`;
- After training, you should get a loss around 0.1 and an accuracy of around 99% on both the train and test sets.

#### 3.2.1 Why do we need to use `torch.nn.utils.clip_grad_norm_` in training?

During training, gradients can occasionally become very large (exploding gradients), so a single update can move the parameters far enough to destabilize training. Gradient clipping prevents this by rescaling gradients whose norm exceeds a threshold. `torch.nn.utils.clip_grad_norm_` is a PyTorch utility that computes the norm over the gradients of all the given parameters, as if they were concatenated into one vector. If that norm exceeds `max_norm`, it scales every gradient by the same factor so that the total norm is brought down to `max_norm`. The direction of the update is preserved; only its size is limited.
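The idea of clipping by the global norm can be sketched without PyTorch. The helper `clip_by_global_norm` below is a hypothetical illustration of the general technique, not PyTorch's actual implementation; the small `eps` in the denominator is an assumption to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Gradients of two "parameters"; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))  # 13.0 1.0

# Gradients already below the threshold are left untouched.
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(small)  # [[0.3, 0.4]]
```

Because every component is multiplied by the same factor, the ratio between any two gradient entries is unchanged.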

In [11]:
config = TrainerConfig(max_epochs=150, batch_size=1024, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=1024, final_tokens=50*len(train_dataset)*(ndigit+1),
                      num_workers=4)

device = 'cpu' # fall back to the CPU when CUDA is unavailable
if torch.cuda.is_available():
  device = torch.cuda.current_device()
  model.to(device)
optimizer = model.configure_optimizers(config)



tokens = 0
for epoch in range(config.max_epochs):
    model.train()
    data = train_dataset 
    loader = DataLoader(data, shuffle=True, pin_memory=True,
                        batch_size=config.batch_size,
                        num_workers=config.num_workers)
    losses = []
    pbar = tqdm(enumerate(loader), total=len(loader)) 
    for iter, (x, y) in pbar:
        # place data on the correct device
        x = x.to(device)
        y = y.to(device)
        # forward the model
        """ CODE HERE """
        _, loss = model(x, y)
        model.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
        optimizer.step()

        """ CODE HERE END """


        # decay the learning rate based on our progress
        if config.lr_decay:
            tokens += (y >= 0).sum() # number of tokens processed this step (i.e. label is not -100)
            if tokens < config.warmup_tokens:
                # linear warmup
                lr_mult = float(tokens) / float(max(1, config.warmup_tokens))
            else:
                # cosine learning rate decay
                progress = float(tokens - config.warmup_tokens) / float(max(1, config.final_tokens - config.warmup_tokens))
                lr_mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
            lr = config.learning_rate * lr_mult
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
        else:
            lr = config.learning_rate
        # report progress
        pbar.set_description(f"epoch {epoch+1} iter {iter}: train loss {loss.item():.5f}. lr {lr:e}")


epoch 1 iter 8: train loss 1.97857. lr 5.994512e-04: 100%|██████████| 9/9 [00:00<00:00, 34.82it/s]
epoch 2 iter 8: train loss 1.78857. lr 5.977197e-04: 100%|██████████| 9/9 [00:00<00:00, 36.27it/s]
epoch 3 iter 8: train loss 1.69308. lr 5.948114e-04: 100%|██████████| 9/9 [00:00<00:00, 39.84it/s]
epoch 4 iter 8: train loss 1.57619. lr 5.907379e-04: 100%|██████████| 9/9 [00:00<00:00, 40.28it/s]
epoch 5 iter 8: train loss 1.53862. lr 5.855153e-04: 100%|██████████| 9/9 [00:00<00:00, 39.65it/s]
epoch 6 iter 8: train loss 1.46657. lr 5.791641e-04: 100%|██████████| 9/9 [00:00<00:00, 38.75it/s]
epoch 7 iter 8: train loss 1.40720. lr 5.717095e-04: 100%|██████████| 9/9 [00:00<00:00, 15.94it/s]
epoch 8 iter 8: train loss 1.37622. lr 5.631810e-04: 100%|██████████| 9/9 [00:00<00:00, 26.73it/s]
epoch 9 iter 8: train loss 1.31371. lr 5.536122e-04: 100%|██████████| 9/9 [00:00<00:00, 34.34it/s]
epoch 10 iter 8: train loss 1.26130. lr 5.430411e-04: 100%|██████████| 9/9 [00:00<00:00, 34.63it/s]
epoch 11 

Now you can run the following code to evaluate the model on the training and test data. You should reach more than 95% accuracy on both.

In [12]:
def top_k_logits(logits, k):
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('Inf')
    return out

def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):
    """
    take a conditioning sequence of indices in x (of shape (b,t)) and predict the next token in
    the sequence, feeding the predictions back into the model each time. 
    """
    block_size = train_dataset.block_size
    model.eval()
    for k in range(steps):
        x_cond = x if x.size(1) <= block_size else x[:, -block_size:] # crop context if needed
        logits = model(x_cond)  # no targets are passed, so the model returns only the logits
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop probabilities to only the top k options
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)
    return x
def Addition_GPT(dataset, batch_size=32, max_batches=-1):
    
    results = []
    loader = DataLoader(dataset, batch_size=batch_size)
    for b, (x, y) in enumerate(loader):
        x = x.to(device)
        d1d2 = x[:, :ndigit*2]
        d1d2d3 = sample(model, d1d2, ndigit+1)
        d3 = d1d2d3[:, -(ndigit+1):]
        factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]]).to(device)
        # decode the integers from individual digits
        d1i = (d1d2[:,:ndigit] * factors[:,1:]).sum(1)
        d2i = (d1d2[:,ndigit:ndigit*2] * factors[:,1:]).sum(1)
        d3i_pred = (d3 * factors).sum(1)
        d3i_gt = d1i + d2i
        correct = (d3i_pred == d3i_gt).cpu() # Software 1.0 vs. Software 2.0 fight RIGHT on this line, lol
        for i in range(x.size(0)):
            results.append(int(correct[i]))
            judge = 'CORRECT' if correct[i] else 'WRONG'
            if not correct[i]:
                print("GPT claims that %03d + %03d = %03d (gt is %03d; %s)" 
                      % (d1i[i], d2i[i], d3i_pred[i], d3i_gt[i], judge))
        
        if max_batches >= 0 and b+1 >= max_batches:
            break

    print("final score: %d/%d = %.2f%% correct" % (np.sum(results), len(results), 100*np.mean(results)))
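The `top_k_logits` helper above keeps the `k` largest logits and sets the rest to `-inf`, so that softmax assigns them zero probability. A plain-Python sketch of the same filtering on a single row of logits follows; the name `top_k_filter` is illustrative.

```python
def top_k_filter(logits, k):
    """Keep the k largest logits; set everything below the k-th largest to -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [v if v >= threshold else float("-inf") for v in logits]

print(top_k_filter([1.0, 3.0, 2.0, 0.5], k=2))  # [-inf, 3.0, 2.0, -inf]
```

As in the tensor version, ties with the k-th largest value are kept, so more than `k` entries can survive when logits are equal.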

In [13]:
Addition_GPT(train_dataset, batch_size=1024, max_batches=10)

final score: 9000/9000 = 100.00% correct


In [14]:
Addition_GPT(test_dataset, batch_size=1024, max_batches=10)

final score: 1000/1000 = 100.00% correct





## Q4 minGPT Text Completion (20 points)


In this question, we will train a GPT to write something fun. First, we need to download some Harry Potter novels as the training dataset. Then a `CharDataset` class is provided to help you prepare the training data at the character level.

In [15]:
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt"
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%202%20-%20The%20Chamber%20Of%20Secrets.txt"
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%203%20-%20Prisoner%20of%20Azkaban.txt"


--2023-04-25 22:24:54--  https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439742 (429K) [text/plain]
Saving to: 'J. K. Rowling - Harry Potter 1 - Sorcerer\'s Stone.txt.1'


2023-04-25 22:24:54 (9.92 MB/s) - 'J. K. Rowling - Harry Potter 1 - Sorcerer\'s Stone.txt.1' saved [439742/439742]

--2023-04-25 22:24:55--  https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%202%20-%20The%20Chamber%20Of%20Secrets.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubuserco

In [16]:
import math
import numpy as np
from tqdm import tqdm
from matplotlib import pyplot as plt
from collections.abc import Iterable
from torch.optim import Optimizer
from torch.utils.data import Dataset

In [17]:
class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = list(set(data))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))

        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data

    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __getitem__(self, idx):
        # we're actually going to "cheat" and pick a spot in the dataset at random
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i+self.block_size+1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

In [18]:
# the "block size" is the number of characters the model takes as input.
# in this case, it can look at up to 128 characters when predicting the next
# character.
block_size = 128 # spatial extent of the model for its context

# For our training set, we will use the text of the first three Harry Potter books.
text = open("J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt", 'rb').read()
text += open('J. K. Rowling - Harry Potter 2 - The Chamber Of Secrets.txt', 'rb').read()
text += open('J. K. Rowling - Harry Potter 3 - Prisoner of Azkaban.txt', 'rb').read()

train_dataset = CharDataset(text, block_size) 

data has 1543473 characters, 87 unique.
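To make the input/target layout concrete, here is a minimal sketch of what `__getitem__` produces for one chunk. The toy byte string, the `block_size` of 8, the fixed start index, and the `decode` helper are illustrative assumptions; the real class picks the start index at random and wraps the lists in `torch.tensor`.

```python
# Toy stand-in for the training text (assumption: any bytes object behaves the same way).
data = b"Harry Potter and the Sorcerer's Stone"
block_size = 8

# Iterating over bytes yields ints, so the vocabulary maps byte values to indices.
# sorted() is used here only to get a stable ordering for the example.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Take one chunk of block_size + 1 tokens; CharDataset chooses `i` at random instead.
i = 0
chunk = data[i:i + block_size + 1]
dix = [stoi[s] for s in chunk]

x = dix[:-1]  # model input: the first block_size tokens
y = dix[1:]   # targets: the same tokens shifted left by one position


def decode(ids):
    """Hypothetical helper: map indices back to text via itos."""
    return bytes(itos[t] for t in ids).decode("ascii")


print(repr(decode(x)))  # 'Harry Po'
print(repr(decode(y)))  # 'arry Pot'
```

Each position in `y` is the character that follows the corresponding position in `x`, so a single chunk yields `block_size` next-character prediction targets.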


### Q4.1 Train the model on texts and Save the model

Here we are going to train a language model on the `train_dataset`. In this question, you need to finish writing a training loop and save the best model you get (either the one with the smallest loss or the one whose output is the most fun). You can refer to the training loop from Q3, but remember to save the model checkpoint(s) during training or at the end of the training loop. You will need them in the next question!



In the next cell, an example setup and parameters for the minGPT language model are provided; feel free to change the parameters. If you encounter a training loss plateau (the loss oscillates around some value with no obvious downward trend) and cannot overcome it by trying different parameter combinations,
- document the parameters in `TrainerConfig`, the training loss curves or the loss of your best model, and the machine you used to train the model, and
- propose a solution to overcome this plateau.

If there is no such plateau, tell us the parameters in `TrainerConfig` and which GPU was used for training.

In this question, you are required to
- write a training loop that saves checkpoints during training
- discuss the training loss plateau

In [19]:
from torch import optim

config = TrainerConfig(max_epochs=50, batch_size=64, learning_rate=6e-4,
                       lr_decay=True, warmup_tokens=1024, final_tokens=150*len(train_dataset),
                       num_workers=4)

model = GPT(vocab_size=train_dataset.vocab_size, n_embd=128, n_head=8,
            block_size=train_dataset.block_size, n_layer=12)

# Use the GPU when available; otherwise fall back to the CPU.
device = 'cpu'
if torch.cuda.is_available():
    device = torch.cuda.current_device()
model.to(device)
# Create the optimizer unconditionally so the training loop also works on CPU.
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)


In [20]:
import math
import numpy as np
import torch
from torch.utils.data import DataLoader
from torch import optim
from torch.optim.lr_scheduler import LambdaLR
import torch.nn.functional as F

In [25]:
# Your training loop
## -- ! code required

# Linear warmup followed by linear decay. warmup_tokens and final_tokens are
# divided by batch_size to turn them into approximate step counts.
scheduler = None
if config.lr_decay:
    warmup_steps = max(1, int(config.warmup_tokens // config.batch_size))
    decay_steps = max(1, int(config.final_tokens // config.batch_size))
    # Clamp the decay factor at 0 so the learning rate can never become negative.
    lr_lambda = lambda step: (min((step + 1) / warmup_steps, 1.0)
                              * max(0.0, 1.0 - max(0, step - warmup_steps) / decay_steps))
    scheduler = LambdaLR(optimizer, lr_lambda)

# CharDataset already returns a random chunk per item, so no shuffling is needed.
loader = DataLoader(train_dataset, batch_size=config.batch_size, num_workers=config.num_workers)

best_loss = float('inf')

for epoch in range(config.max_epochs):
    model.train()
    epoch_loss = 0.0
    pbar = tqdm(enumerate(loader), total=len(loader))
    for it, (x, y) in pbar:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)  # (batch, block_size, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
        pbar.set_description(f"epoch {epoch+1} iter {it}: train loss {loss.item():.5f}")
        epoch_loss += loss.item()

    # Save a checkpoint whenever the average epoch loss improves, so that
    # pretrained.pth always holds the best model seen so far.
    avg_loss = epoch_loss / len(loader)
    if avg_loss < best_loss:
        best_loss = avg_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict() if scheduler is not None else None,
            'loss': best_loss,
        }, "pretrained.pth")


epoch 1 iter 186: train loss 1.11559: 100%|██████████| 187/187 [00:12<00:00, 15.20it/s]
epoch 2 iter 186: train loss 1.11478: 100%|██████████| 187/187 [00:12<00:00, 15.06it/s]
epoch 3 iter 186: train loss 1.12161: 100%|██████████| 187/187 [00:12<00:00, 15.06it/s]
epoch 4 iter 186: train loss 1.08764: 100%|██████████| 187/187 [00:12<00:00, 15.51it/s]
epoch 5 iter 186: train loss 1.08733: 100%|██████████| 187/187 [00:11<00:00, 15.91it/s]
epoch 6 iter 186: train loss 1.11215: 100%|██████████| 187/187 [00:12<00:00, 15.49it/s]
epoch 7 iter 186: train loss 1.09134: 100%|██████████| 187/187 [00:11<00:00, 16.01it/s]
epoch 8 iter 186: train loss 1.13979: 100%|██████████| 187/187 [00:12<00:00, 15.35it/s]
epoch 9 iter 186: train loss 1.10191: 100%|██████████| 187/187 [00:12<00:00, 15.13it/s]
epoch 10 iter 186: train loss 1.09025: 100%|██████████| 187/187 [00:12<00:00, 15.27it/s]
epoch 11 iter 186: train loss 1.09257: 100%|██████████| 187/187 [00:12<00:00, 15.32it/s]
epoch 12 iter 186: train loss 
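
The plateau described in the task (loss oscillating with no obvious downward trend) can be checked numerically. The sketch below is one possible heuristic, not part of the assignment: `is_plateau`, `window`, and `min_rel_improvement` are hypothetical names and thresholds. The input values are the last-batch losses printed in the log above for epochs 1 to 11, used here in place of true per-epoch averages.

```python
# Last-batch losses copied from the progress-bar log above (epochs 1 to 11).
losses = [1.11559, 1.11478, 1.12161, 1.08764, 1.08733, 1.11215,
          1.09134, 1.13979, 1.10191, 1.09025, 1.09257]


def is_plateau(history, window=5, min_rel_improvement=0.02):
    """Return True if the mean of the latest `window` losses improved by less
    than `min_rel_improvement` relative to the mean of the `window` before it."""
    if len(history) < 2 * window:
        return False  # not enough history to judge
    previous = sum(history[-2 * window:-window]) / window
    recent = sum(history[-window:]) / window
    return (previous - recent) / previous < min_rel_improvement


print(is_plateau(losses))  # True
```

Comparing window means rather than single values keeps the check from reacting to the batch-to-batch noise visible in the log.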

Discussion

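A training loss plateau is a common issue when training neural networks: the loss stops decreasing and oscillates around a certain value, which indicates that the model is no longer improving. In the log above, the loss printed at the end of each epoch stays between about 1.09 and 1.14 for epochs 1 through 11. To overcome this, we can adjust the learning rate or its schedule, or try a different optimizer. Early stopping does not remove the plateau, but it avoids spending further compute once the loss has stopped improving.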
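
Before adjusting the learning rate, it can help to check what the current schedule actually does over the run. The sketch below reproduces the arithmetic from the notebook values (dataset size, `block_size`, and the `TrainerConfig` arguments). It assumes all 50 epochs run and that `DataLoader` keeps the last partial batch, which is its default. `lr_multiplier` is a hypothetical name that mirrors the warmup-then-linear-decay formula of `lr_lambda` in the training loop.

```python
import math

# Values taken from the notebook: dataset size, block size, and TrainerConfig.
n_chars, block_size = 1543473, 128
max_epochs, batch_size = 50, 64
warmup_tokens = 1024

samples_per_epoch = math.ceil(n_chars / (block_size + 1))    # CharDataset.__len__
steps_per_epoch = math.ceil(samples_per_epoch / batch_size)  # batches per epoch
total_steps = max_epochs * steps_per_epoch

final_tokens = 150 * samples_per_epoch
warmup_steps = warmup_tokens // batch_size
decay_steps = final_tokens // batch_size


def lr_multiplier(step):
    """Warmup-then-linear-decay factor applied to the base learning rate."""
    warmup = min((step + 1) / warmup_steps, 1.0)
    decay = 1.0 - max(0, step - warmup_steps) / decay_steps
    return warmup * decay


print(samples_per_epoch, steps_per_epoch, total_steps)  # 11965 187 9350
print(warmup_steps, decay_steps)                        # 16 28042
print(round(lr_multiplier(total_steps - 1), 3))         # 0.667
```

The 187 steps per epoch match the progress bars in the log. Under these assumptions the multiplier ends near 0.667, so the learning rate never falls below roughly two thirds of its peak during the run. A smaller `final_tokens` would make the decay steeper; whether that resolves the plateau would need to be tested.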

### Q4.2 Load the pretrained model and write from a prompt

Load the best model from Q4.1, provide a prompt and let the model continue writing for you. Feel free to try different prompts or different models.  
If your best model was trained on SCC or another machine, load the model in the Jupyter notebook you are going to submit and print out the 'writing'. Screenshots of the output 'writing' will not be accepted. 

In this question, you are required to 
- show the prompt writing output
- show some discussion for 4.2.1

#### 4.2.1 What would you do to improve the text generation quality (readability, spelling, grammar, logic, etc.) of a transformer-based language model? If you refer to some papers or posts, remember to cite them. 

Decoding strategies: Use decoding strategies such as beam search, nucleus (top-p) sampling, or temperature scaling, which can help generate more coherent and contextually relevant text. Paper: The Curious Case of Neural Text Degeneration (Holtzman et al., 2020).

Regularization: Use regularization techniques such as dropout or weight decay to improve the generalization of the model, which can lead to better text generation quality. Layer normalization is primarily a training-stability technique rather than a regularizer, but it can help as well.

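Adversarial training: Incorporate adversarial training techniques, where a discriminator model is trained to distinguish between real and generated text. This approach encourages the generator to produce text that is more realistic and coherent. Paper: Adversarial Training Methods for Semi-Supervised Text Classification (Miyato et al., 2017). Note that this paper applies adversarial perturbations to word embeddings for classification rather than training a generator against a discriminator.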

Evaluation metrics: Use comprehensive evaluation metrics like BLEU, ROUGE, or human evaluation to assess the quality of generated text, which will help in identifying areas that need improvement.
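
The decoding strategies mentioned in the answer above can be sketched on a single next-token distribution. This is a minimal pure-Python illustration, not the notebook's `sample` function: `softmax`, `filter_top_k`, `filter_top_p`, and the five toy logits are hypothetical. The temperature of 0.6 matches the value used in the example generation cell.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalise."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def filter_top_p(probs, p):
    """Nucleus filtering: keep the smallest set of most likely tokens whose
    cumulative probability reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


# Hypothetical logits for a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
probs = softmax(logits, temperature=0.6)
top_k_probs = filter_top_k(probs, k=2)
top_p_probs = filter_top_p(probs, p=0.9)

# Sample the next token from the nucleus-filtered distribution.
rng = random.Random(0)
next_token = rng.choices(range(len(logits)), weights=top_p_probs, k=1)[0]
print(top_k_probs, top_p_probs, next_token)
```

Top-k keeps a fixed number of candidates, while top-p adapts the number of candidates to how peaked the distribution is. Both remove the low-probability tail that tends to produce misspelled or incoherent continuations.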

In [26]:
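# Example output 
# model architecture should be defined before you load the parameters 
PATH = 'pretrained.pth'
# Load tensors onto the CPU first so the checkpoint also opens on a machine
# without the GPU it was trained on; load_state_dict then copies the values
# into the model's parameters on whatever device the model lives on.
checkpoint = torch.load(PATH, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])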

<All keys matched successfully>

In [27]:
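# Example output 
prompt = "Harry Potter turned on the TV,  "
# The dataset was built from bytes, so stoi maps integer byte values to
# indices and itos maps indices back to byte values.
context = [ord(c) for c in prompt]
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None, ...].to(device)
# Generate 200 new tokens, sampling from the 5 most likely candidates at each step.
y = sample(model, x, 200, temperature=0.6, sample=True, top_k=5)[0]
completion = ''.join([chr(train_dataset.itos[int(i)]) for i in y])
print(completion)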

Harry Potter turned on the TV,  Riddle's diary face in the school
armchairs and saw a pair of tea and then put the blood as though they
were about to crash it to their window.

"How did you know who they got to get any more than any
