# Pretraining + Full FineTuning an Encoder-Decoder Transformer - PyTorch Implementation

### Links for the datasets used in this notebook: [Pretraining - C4 Dataset](https://github.com/Ariyanbgd/T5_Q-A) and [Finetuning - SQuAD 2.0 Dataset](https://huggingface.co/datasets/rajpurkar/squad_v2/tree/main)


# Question Answering

In this notebook we'll explore question answering by implementing the "Text-to-Text Transfer Transformer" (better known as T5). Since we have already implemented a transformer from scratch ([Transformer from Scratch](https://github.com/AnsImran/Transformer_from_Scratch_for_Text_Summarization/tree/master)), we can reuse that model here.



## Table of Contents

### Overview  
### Importing the Packages  
### Prepare the data for pretraining T5  
#### Pre-Training Objective  
#### C4 Dataset  
#### Process C4  
#### Decode to Natural Language  
#### Tokenizing and Masking  
#### Exercise - tokenize_and_mask  
#### Creating the Pairs  
### Pretrain a T5 model using C4  
#### Instantiate a new transformer model  
#### C4 pretraining  
### Fine tune the T5 model for Question Answering  
#### Creating a list of paired questions and answers  
#### Exercise - Parse the SQuAD 2.0 dataset  
#### Fine tune the T5 model  
#### Implement your Question Answering model  
#### Exercise - Implement the question answering function  


# Overview

Because of the memory constraints of this environment and for the sake of time, the model is trained on small datasets. The result won't be usable in production, but it shows how generative language models are trained and used. We also won't spend much time on the model architecture, since we already built it from scratch (see: [Transformer from Scratch](https://github.com/AnsImran/Transformer_from_Scratch_for_Text_Summarization)). Instead, we'll take a model that is pre-trained on a larger dataset and fine-tune it to get better results.

In this lab we'll do the following:
* Understand how the C4 dataset is structured.
* Pretrain a transformer model using a masked language model objective.
* Understand how the "Text-to-Text Transfer Transformer" (T5) model works.
* Fine-tune the T5 model for question answering.

# Importing the Packages


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.functional import log_softmax

device_ = 'cuda' if torch.cuda.is_available() else 'cpu'



import string
import itertools
import transformer_utils

import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import time

import textwrap
wrapper = textwrap.TextWrapper(width=70)

import json

from termcolor import colored

import sentencepiece as spm




In [213]:
torch.cuda.is_available()


True

In [2]:
# Connecting with google drive

# from google.colab import drive
# drive.mount('/content/drive')
# import os

# # Define a path inside your Google Drive
# SAVE_PATH = '/content/drive/MyDrive/model_checkpoints'



In [3]:
# Unzipping Tokenizer Model
# !unzip data.zip
# !unzip models.zip

# **Pre-Training Section**

# Data Processing for Pre-Training

## Pre-Training Objective

In the first phase of training a T5 model for question answering, we pre-train it with a masked language model (MLM) objective on a very large dataset such as C4. The goal is for the model to learn contextualized representations of words and phrases. T5 is built on the Transformer architecture, whose self-attention mechanism lets the model weigh different parts of the input sequence dynamically and capture long-range dependencies.

Before pre-training, the data must be preprocessed. The C4 dataset, a large and diverse collection of web pages, is tokenized into smaller units such as subwords or words. The token sequences are then truncated or padded to a fixed length and grouped into batches for efficient training.

For the masked language modeling objective, a percentage of the tokens is randomly masked and the model is trained to predict the original content of the masked tokens. This pushes the model to learn contextual relationships between words and phrases, which later helps it produce coherent, contextually appropriate answers in downstream tasks such as question answering.

In summary, pre-training T5 means training a Transformer on a sizable dataset like C4 with a masked language modeling objective, after converting the raw text into a format suitable for training. The resulting contextual representations are the foundation for the subsequent fine-tuning on question answering.

**Note:** The word "mask" is used throughout this assignment in the context of hiding/removing word(s).

We'll be implementing the Masked language model (MLM) as shown in the following image.

<img src = "images/loss.png" width="600" height = "400">

Assume you have the following text: <span style = "color:blue"> **Thank you <span style = "color:red">for inviting </span> me to your party <span style = "color:red">last</span>  week** </span>


Now as input we'll mask the words in red in the text:

<span style = "color:blue"> **Input:**</span> Thank you  **X** me to your party **Y** week.

<span style = "color:blue">**Output:**</span> The model should predict the word(s) for **X** and **Y**.

**[EOS]** will be used to mark the end of the target sequence.
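
The example above can be reproduced at the word level with a few lines of Python. This is only an illustrative sketch: the helper `mask_spans`, the sentinel labels `<X>`/`<Y>`/`<Z>`, and the literal `[EOS]` string are assumptions made for readability. The actual pipeline later in this notebook works on token ids and picks the masked positions at random.

```python
def mask_spans(words, masked_positions, sentinels=("<X>", "<Y>", "<Z>"), eos="[EOS]"):
    """Replace each run of consecutive masked words with one sentinel.

    Hypothetical helper for illustration; returns (input_words, target_words).
    """
    inputs, targets = [], []
    next_sentinel = 0
    prev_masked = False
    for pos, word in enumerate(words):
        if pos in masked_positions:
            if not prev_masked:
                # A new masked span starts: mark it in both sequences.
                sentinel = sentinels[next_sentinel]
                next_sentinel += 1
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(word)
            prev_masked = True
        else:
            inputs.append(word)
            prev_masked = False
    targets.append(eos)  # EOS closes the target sequence
    return inputs, targets


words = "Thank you for inviting me to your party last week".split()
masked_input, masked_target = mask_spans(words, masked_positions={2, 3, 8})
print(" ".join(masked_input))   # Thank you <X> me to your party <Y> week
print(" ".join(masked_target))  # <X> for inviting <Y> last [EOS]
```

Note that "for inviting" is hidden behind a single sentinel because the two words are adjacent, which is the same rule the token-level implementation follows.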

## C4 Dataset Description

The [C4 dataset](https://www.tensorflow.org/datasets/catalog/c4) (Colossal Clean Crawled Corpus) is a large-scale dataset of cleaned web pages derived from the crawls published by the [Common Crawl organization](https://commoncrawl.org/). It is commonly used for natural language processing tasks and machine learning research, and it was introduced for pretraining T5. Each sample follows a consistent format, which makes it convenient for pretraining language models. Here's a short description of the C4 dataset:

- Format: Each sample in the C4 dataset is represented as a JSON object, containing several key-value pairs.

- Content: The 'text' field in each sample contains the actual text content extracted from web pages. This text often includes a wide range of topics and writing styles, making it diverse and suitable for training language models.

- Metadata: The dataset includes metadata such as 'content-length,' 'content-type,' 'timestamp,' and 'url,' providing additional information about each web page. 'Content-length' specifies the length of the content, 'content-type' describes the type of content (e.g., 'text/plain'), 'timestamp' indicates when the web page was crawled, and 'url' provides the source URL of the web page.

- Applications: The C4 dataset is commonly used for pretraining large-scale language models, such as T5. Models pretrained on it can then be fine-tuned for tasks like text classification, named entity recognition, question answering, and more.

- Size: The C4 dataset contains more than 800 GiB of text data, making it suitable for training models with billions of parameters.



## Loading the Data

In [13]:
# Load example jsons
with open('data/c4-en-10k.jsonl', 'r') as file:
    example_jsons = [json.loads(line.strip()) for line in file]

# Print a few examples to see what the data looks like
for i in range(5):
    print(f'example number {i+1}: \n\n{example_jsons[i]} \n')


example number 1: 

{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'} 

example number 2: 

{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve got a 500gb internal drive and a 240gb SSD.\nWhen trying to restore using disk utility i\'m given the err

## Processing C4 Dataset

For the purpose of pretraining the T5 model, we'll only use the `text` field of each entry. The following code extracts that field from all the entries in the dataset. This is the data that we'll use to create the `inputs` and `targets` of our language model.


In [15]:
# Grab text field from dictionary
natural_language_texts = [example_json['text'] for example_json in example_jsons]

# Print the first text example
print(natural_language_texts[0])


Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.
The cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.


### Decoding to Natural Language - Loading Pretrained Tokenizer


The [SentencePieceTokenizer](https://www.tensorflow.org/text/api_docs/python/text/SentencepieceTokenizer) tokenizes text into subword units, which helps with complex word structures, out-of-vocabulary words, and multilingual text. The linked page documents the TensorFlow Text wrapper; the code below uses the `sentencepiece` Python package directly.

In this task, a SentencePiece model is loaded from a file and used to tokenize text into subwords represented by integer IDs.

In [3]:
import sentencepiece as spm
# Load the SentencePiece model
tokenizer = spm.SentencePieceProcessor()
tokenizer.load('./models/sentencepiece.model')

# Tokenize text using the loaded model
text = "This is a sample text."
tokenized_text = tokenizer.encode(text, out_type=int)  # out_type=int returns integer token ids; wrapping them in a torch.tensor is worth trying, but plain integers work fine too

print(tokenized_text)


[100, 19, 3, 9, 3106, 1499, 5]


In this tokenizer the string `</s>` is used as the `EOS` token. By default, the tokenizer does not add `EOS` to the end of each sentence, so we need to add it manually when required. Let's verify which id corresponds to this token:

In [17]:
eos = tokenizer.piece_to_id("</s>")
print("EOS:", eos)


EOS: 1
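
Because the tokenizer does not append `EOS` on its own, a small wrapper can be handy. The sketch below is illustrative only: `encode_with_eos` is a hypothetical helper, and `FakeTokenizer` merely stands in for the loaded `SentencePieceProcessor` so the snippet runs without the model file. It mimics just the two calls used above, `encode(text, out_type=int)` and `piece_to_id(piece)`, and its ids are made up (except `EOS = 1`).

```python
def encode_with_eos(tokenizer, text, eos_piece="</s>"):
    """Encode text to ids and append the EOS id (hypothetical helper)."""
    ids = list(tokenizer.encode(text, out_type=int))
    ids.append(tokenizer.piece_to_id(eos_piece))
    return ids


class FakeTokenizer:
    """Stand-in for the real tokenizer; ids are made up except EOS = 1."""

    def __init__(self):
        self.vocab = {"</s>": 1}

    def encode(self, text, out_type=int):
        # One id per whitespace-separated word, assigned on first sight.
        return [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]

    def piece_to_id(self, piece):
        return self.vocab[piece]


ids = encode_with_eos(FakeTokenizer(), "this is a sample")
print(ids)  # the last id is the EOS id
```

With the real tokenizer, the same helper would be called as `encode_with_eos(tokenizer, text)`.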


In [18]:
# printing the encoding of each word to see how subwords are tokenized
tokenized_text = [(tokenizer.encode(word, out_type=int), word) for word in natural_language_texts[2].split()]

print("Word\t\t-->\tTokenization\n")
for tokens, word in tokenized_text:
    print(f"{word}\t-->\t{tokens}")


Word		-->	Tokenization

Foil	-->	[4452, 173]
plaid	-->	[30772]
lycra	-->	[3, 120, 2935]
and	-->	[11]
spandex	-->	[8438, 26, 994]
shortall	-->	[710, 1748]
with	-->	[28]
metallic	-->	[18813]
slinky	-->	[3, 7, 4907, 63]
insets.	-->	[16, 2244, 7, 5]
Attached	-->	[28416, 15, 26]
metallic	-->	[18813]
elastic	-->	[15855]
belt	-->	[6782]
with	-->	[28]
O-ring.	-->	[411, 18, 1007, 5]
Headband	-->	[3642, 3348]
included.	-->	[1285, 5]
Great	-->	[1651]
hip	-->	[5436]
hop	-->	[13652]
or	-->	[42]
jazz	-->	[9948]
dance	-->	[2595]
costume.	-->	[11594, 5]
Made	-->	[6465]
in	-->	[16]
the	-->	[8]
USA.	-->	[2312, 5]


# Skip the following cells until you reach the one labeled "skip_end!"

In [19]:
###################################################################################################################
# The cells enclosed within ##### **** ##### can be skipped

In [20]:
# confirming the length of the dataset
len(natural_language_texts)


10000

In [21]:
# looking at some words in the dataset text
print(natural_language_texts[2].split())


['Foil', 'plaid', 'lycra', 'and', 'spandex', 'shortall', 'with', 'metallic', 'slinky', 'insets.', 'Attached', 'metallic', 'elastic', 'belt', 'with', 'O-ring.', 'Headband', 'included.', 'Great', 'hip', 'hop', 'or', 'jazz', 'dance', 'costume.', 'Made', 'in', 'the', 'USA.']


In [22]:
# tokenizing a word
tokenizer.tokenize('plaid')


[30772]

In [24]:
# tokenizing another word
tokenizer.tokenize('lycra')


[3, 120, 2935]

In [26]:
# decoding / detokenizing 
tokenizer.decode([120, 2935]), tokenizer.detokenize([3])


('lycra', '')

In [27]:
###################################################################################################################

# skip_end!

The library also provides a function to turn numeric tokens back into human-readable text. Here's how it works.

In [28]:
# We can see that detokenize successfully undoes the tokenization
print(f"tokenized: {tokenizer.tokenize('Beginners')}\ndetokenized: {tokenizer.detokenize(tokenizer.tokenize('Beginners'))}")


tokenized: [12847, 277]
detokenized: Beginners


As we can see above, we can tokenize a string and convert the tokens back to text.

Now we'll create `input` and `target` pairs that will allow us to train our model. T5 uses the ids at the end of the vocab file as sentinels. For readability, we display them as letters:
   - `vocab_size - 1` is shown as `<Z>`
   - `vocab_size - 2` is shown as `<Y>`
   - and so forth.
   
Each of the last 52 ids is assigned one character of `string.ascii_letters`, taken in reverse order.

The `pretty_decode` function below, which we'll use in a bit, accepts either a string or a list of token ids and replaces the sentinel tokens with these labels when decoding.


Notice that:
```python
string.ascii_letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
```

**NOTE:** Targets may have more than the 52 sentinels we replace, but this is just to get an idea of things.
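
To make the id-to-label mapping concrete, here is a minimal sketch. The function `sentinel_label` is a hypothetical helper (it is not part of the notebook's code), and the vocabulary size of 32000 is the value this notebook's tokenizer reports.

```python
import string


def sentinel_label(token_id, vocab_size):
    """Return '<Z>', '<Y>', ... for the last 52 ids, or None otherwise."""
    offset = vocab_size - token_id        # 1 for the last id, 2 for the one before, ...
    letters = string.ascii_letters[::-1]  # 'ZYX...A' followed by 'zyx...a'
    if 1 <= offset <= len(letters):
        return f"<{letters[offset - 1]}>"
    return None


vocab_size = 32000  # value reported by tokenizer.vocab_size() in this notebook
for token_id in (31999, 31998, 31974, 31973, 31948, 31947):
    print(token_id, sentinel_label(token_id, vocab_size))
```

The uppercase letters are used first (`<Z>` down to `<A>`), then the lowercase ones (`<z>` down to `<a>`); ids below `vocab_size - 52` have no label.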

In [29]:
def get_sentinels(tokenizer, display=False):
    sentinels  = {}
    vocab_size = tokenizer.vocab_size()
    
    for i, char in enumerate(reversed(string.ascii_letters), 1):
        decoded_text = tokenizer.detokenize([vocab_size - i])

        # Sentinels, ex: <Z> - <a>
        sentinels[decoded_text] = f'<{char}>'

        if display:
            print(f'The sentinel is <{char}> and the decoded token is:', decoded_text)

    return sentinels


# Skip the following cells until you reach the one labeled "skip_end!"

In [30]:
###################################################################################################################
# The cells enclosed within ##### **** ##### can be skipped

In [31]:
string.ascii_letters


'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [32]:
# Return a reverse iterator over the values of the given sequence.
reversed('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')



<reversed at 0x794dae9d47f0>

In [None]:
for i, char in enumerate(reversed('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'), 1):
    print('i: ',i, '\tchar: ', char)


In [36]:
# confirming vocab size
tokenizer.vocab_size()

32000

In [37]:
# exploring tokens
tokenizer.detokenize([32000 - 1])


'Internațional'

In [38]:
#############################################################################################################################

# skip_end!

In [None]:
sentinels = get_sentinels(tokenizer, display=True)


In [40]:
def pretty_decode(encoded_str_list, sentinels, tokenizer):
    # If it's already a string, just replace sentinel tokens with their mapped characters
    if isinstance(encoded_str_list, str):
        for token, char in sentinels.items():
            encoded_str_list = encoded_str_list.replace(token, char)
        return encoded_str_list

    # If it's a list of token IDs, decode and then apply replacements
    decoded_str = tokenizer.detokenize(encoded_str_list)
    for token, char in sentinels.items():
        decoded_str = decoded_str.replace(token, char)
    return decoded_str


Now, let's use the `pretty_decode` function on the following sentence. Note that every word that happens to be one of the sentinel tokens is replaced with the corresponding sentinel label. This is a drawback of the method, but don't worry about it now.

In [42]:
# example usage
pretty_decode("I want to dress up as an Intellectual this halloween.", sentinels, tokenizer)


'I want to dress up as an <V> this <b>.'

In [43]:
##################################################################################################################

The functions above make our `inputs` and `targets` more readable. For example, we might see something like this once we implement the masking function below.

- <span style="color:red"> Input sentence: </span> Younes and Lukasz were working together in the lab yesterday after lunch.
- <span style="color:red">Input: </span> Younes and Lukasz  **Z** together in the **Y** yesterday after lunch.
- <span style="color:red">Target: </span> **Z** were working **Y** lab.


### Tokenizing and Masking - Masked Language Modelling

In this task, we'll implement the `tokenize_and_mask` function, which tokenizes and masks input words based on a given probability. The probability is controlled by the `noise` parameter, typically set to mask around `15%` of the words in the input text. The function will generate two lists of tokenized sequences following the algorithm outlined below:


#### tokenize_and_mask

- Start with two empty lists: `inps` and `targs`
- Tokenize the input text using the given tokenizer.
- For each `token` in the tokenized sequence:
  - Generate a random number (simulating a weighted coin toss)
  - If the random value is greater than or equal to the given threshold (noise):
    - Add the current token to the `inps` list
  - Else:
    - If a new sentinel must be included (read note **):
      - Compute the next sentinel ID using a progression.
      - Add a sentinel into the `inps` and `targs` to mark the position of the masked element.
    - Add the current token to the `targs` list.

** There's a special case to consider. If two consecutive tokens get masked during the process, you don't need to add a new sentinel to the sequences. To account for this, use the `prev_no_mask` flag, which starts as `True` but is turned to `False` each time you mask a new element. The code that adds sentinels will only be executed if, before masking the token, the flag was in the `True` state.


In [46]:
def tokenize_and_mask(text,
                      noise=0.15,
                      randomizer=np.random.uniform,
                      tokenizer=None):
    """Tokenizes and masks a given input.

    Args:
        text (str or bytes): Text input.
        noise (float, optional): Probability of masking a token. Defaults to 0.15.
        randomizer (function, optional): Function that generates random values. Defaults to np.random.uniform.
        tokenizer (sentencepiece.SentencePieceProcessor): Loaded tokenizer. Required, even though the default is None.

    Returns:
        inps, targs: Lists of integers associated to inputs and targets.
    """

    # Current sentinel number (starts at 0)
    cur_sentinel_num = 0

    # Inputs and targets
    inps, targs = [], []

    # Vocab_size
    vocab_size = int(tokenizer.vocab_size())

    # EOS token id
    # Must be at the end of each target!
    eos = tokenizer.piece_to_id("</s>")

    ### START CODE HERE ###

    # prev_no_mask is True if the previous token was NOT masked, False otherwise
    # set prev_no_mask to True
    prev_no_mask = True

    # Loop over the tokenized text
    for token in tokenizer.encode(text, out_type=int):

        # Generate a random value between 0 and 1
        rnd_val = randomizer()

        # Check if the noise is greater than a random value (weighted coin flip)
        if rnd_val < noise:

            # Check if previous token was NOT masked
            if prev_no_mask:

                # Current sentinel increases by 1
                cur_sentinel_num += 1

                # Compute end_id by subtracting current sentinel value out of the total vocabulary size
                end_id = vocab_size - cur_sentinel_num

                # Append end_id at the end of the targets
                targs.append(end_id)

                # Append end_id at the end of the inputs
                inps.append(end_id)

            # Append token at the end of the targets
            targs.append(token)

            # set prev_no_mask accordingly
            prev_no_mask = False

        else:

            # Append token at the end of the inputs
            inps.append(token)

            # Set prev_no_mask accordingly
            prev_no_mask = True


    # Add EOS token to the end of the targets
    targs.append(eos)

    ### END CODE HERE ###

    return inps, targs

In [50]:
# Some logic to mock a np.random value generator
# Needs to be in the same cell for it to always generate same output
def testing_rnd():
    def dummy_generator():
        vals        = np.linspace(0, 1, 10)
        cyclic_vals = itertools.cycle(vals)
        for _ in range(100):
            yield next(cyclic_vals)

    dumr = itertools.cycle(dummy_generator())

    def dummy_randomizer():
        return next(dumr)

    return dummy_randomizer

input_str = 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers.'

inps, targs = tokenize_and_mask(input_str, randomizer=testing_rnd(), tokenizer=tokenizer)
print(f"tokenized inputs - shape={len(inps)}:\n\n{inps}\n\ntargets - shape={len(targs)}:\n\n{targs}")


tokenized inputs - shape=53:

[31999, 15068, 4501, 3, 12297, 3399, 16, 5964, 7115, 31998, 531, 25, 241, 12, 129, 394, 44, 492, 31997, 58, 148, 56, 43, 8, 1004, 6, 474, 31996, 39, 4793, 230, 5, 2721, 6, 1600, 1630, 31995, 1150, 4501, 15068, 16127, 6, 9137, 2659, 5595, 31994, 782, 3624, 14627, 15, 12612, 277, 5]

targets - shape=19:

[31999, 12847, 277, 31998, 9, 55, 31997, 3326, 15068, 31996, 48, 30, 31995, 727, 1715, 31994, 45, 301, 1]
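
A quick sanity check for a masked pair is to merge the inputs and targets back into the original token sequence. The sketch below is an assumption-based illustration: `reconstruct` is a hypothetical helper, it treats the last 52 ids of the vocabulary as sentinels, and it assumes no genuine token from that id range occurs in the text. The toy ids are made up.

```python
def reconstruct(inps, targs, vocab_size, eos_id=1, num_sentinels=52):
    """Rebuild the original token ids from a masked (inps, targs) pair."""
    def is_sentinel(token):
        return vocab_size - num_sentinels <= token < vocab_size

    # Drop the trailing EOS, then collect the tokens hidden behind each sentinel.
    body = targs[:-1] if targs and targs[-1] == eos_id else targs
    spans, current = {}, None
    for token in body:
        if is_sentinel(token):
            current = token
            spans[current] = []
        elif current is not None:
            spans[current].append(token)

    # Walk the inputs and splice every span back in place of its sentinel.
    original = []
    for token in inps:
        if is_sentinel(token):
            original.extend(spans.get(token, []))
        else:
            original.append(token)
    return original


# Toy example with a made-up vocabulary of 100 ids (sentinels are 99, 98, ...).
toy_tokens = [10, 11, 12, 13, 14, 15]
toy_inps   = [10, 99, 13, 14, 98]     # 11, 12 and 15 were masked
toy_targs  = [99, 11, 12, 98, 15, 1]  # 1 is EOS
print(reconstruct(toy_inps, toy_targs, vocab_size=100))
```

On real data the same call would be `reconstruct(inps, targs, tokenizer.vocab_size())`, and the result should match `tokenizer.encode(input_str, out_type=int)` as long as the stated assumptions hold.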


We'll now use the inputs and the targets from the `tokenize_and_mask` function we implemented above. Let's look at the decoded version of our masked sentence using the `inps` and `targs` from the sentence above.

In [51]:
print('Inputs: \n\n', pretty_decode(inps, sentinels, tokenizer))
print('\nTargets: \n\n', pretty_decode(targs, sentinels, tokenizer))


Inputs: 

 <Z> BBQ Class Taking Place in Missoul <Y> Do you want to get better at making <X>? You will have the opportunity, put <W> your calendar now. Thursday, September 22 <V> World Class BBQ Champion, Tony Balay <U>onestar Smoke Rangers.

Targets: 

 <Z> Beginners <Y>a! <X> delicious BBQ <W> this on <V>nd join <U> from L


### Creating input-target Pairs

We'll now create pairs using our dataset. We'll iterate over the data and create (inp, targ) pairs.

In [52]:
# Apply tokenize_and_mask
inputs_targets_pairs = [tokenize_and_mask(text.encode('utf-8', errors='ignore').decode('utf-8'), tokenizer=tokenizer)
                        for text in natural_language_texts]


# Skip the following cells until you reach the one labeled "skip_end!"

In [None]:
################################################################################################################################

In [54]:
# looking at a typical text body in our dataset
natural_language_texts[10]


'Pencarian FILM Untuk "Peace Breaker 2017"\nyuk mampir ke channel say..\nEdges East provides the l..\nA corrupt cop makes one w..\nPeace Breaker 2017 ~ 破�..\nNáo Loạn - Peace Break..\nPlease subscribe and hit ..\nuploaded in HD at http://..\nI cannot believe I manage..'

In [56]:
natural_language_texts[10].encode('utf-8', errors='ignore')


b'Pencarian FILM Untuk "Peace Breaker 2017"\nyuk mampir ke channel say..\nEdges East provides the l..\nA corrupt cop makes one w..\nPeace Breaker 2017 ~ \xe7\xa0\xb4\xef\xbf\xbd..\nN\xc3\xa1o Lo\xe1\xba\xa1n - Peace Break..\nPlease subscribe and hit ..\nuploaded in HD at http://..\nI cannot believe I manage..'

In [57]:
natural_language_texts[10].encode('utf-8', errors='ignore').decode('utf-8')


'Pencarian FILM Untuk "Peace Breaker 2017"\nyuk mampir ke channel say..\nEdges East provides the l..\nA corrupt cop makes one w..\nPeace Breaker 2017 ~ 破�..\nNáo Loạn - Peace Break..\nPlease subscribe and hit ..\nuploaded in HD at http://..\nI cannot believe I manage..'

In [58]:
################################################################################################################################

# skip_end!

In [None]:
import textwrap

def display_input_target_pairs(inputs_targets_pairs, sentinels, wrapper=textwrap.TextWrapper(width=70), tokenizer=None):
    for i, inp_tgt_pair in enumerate(inputs_targets_pairs, 1):
        inps, tgts = inp_tgt_pair

        # Directly decode the token ID lists to strings
        decoded_inps = pretty_decode(inps, sentinels, tokenizer)
        decoded_tgts = pretty_decode(tgts, sentinels, tokenizer)

        print(f'[{i}]\n\n'
              f'inputs:\n{wrapper.fill(text=decoded_inps)}\n\n'
              f'targets:\n{wrapper.fill(text=decoded_tgts)}\n\n\n')


# Print the first 5 samples
display_input_target_pairs(inputs_targets_pairs[0:5], sentinels, wrapper, tokenizer)


# **Pretraining Section - Part-2**

We'll now reuse the Transformer architecture that we previously coded to summarize text ([Transformer_from_Scratch](https://github.com/AnsImran/Transformer_from_Scratch_for_Text_Summarization)), this time to answer questions. Instead of training the question answering model from scratch, we'll first "pre-train" the model on the C4 dataset we just processed. This helps the model learn the general structure of language from a large dataset. Pre-training is also much easier to set up, because we don't need to label any data: the masking is done automatically. We'll then use the SQuAD 2.0 dataset to teach the model to answer questions given a context. To start, let's review the Transformer's architecture.

<img src = "images/fulltransformer.png" width="300" height="600">



## Instantiating a New Transformer Model

The code implemented in the previous week has been packaged into the `transformer_utils.py` file. We can import it here and set it up with the same configuration used there.


In [4]:
# Define the model parameters
num_layers                 = 2
embedding_dim              = 128
fully_connected_dim        = 128
num_heads                  = 2
positional_encoding_length = 256

encoder_vocab_size = int(tokenizer.vocab_size())
decoder_vocab_size = encoder_vocab_size

# Initialize the model
transformer = transformer_utils.Transformer(
                                                num_layers,
                                                embedding_dim,
                                                num_heads,
                                                fully_connected_dim,
                                                encoder_vocab_size,
                                                decoder_vocab_size,
                                                positional_encoding_length,
                                                positional_encoding_length,
                                            )

device_ = 'cuda' if torch.cuda.is_available() else 'cpu'


In [5]:
# our architecture
transformer

Transformer(
  (encoder): Encoder(
    (embedding): Embedding(32000, 128, padding_idx=0)
    (enc_layers): ModuleList(
      (0-1): 2 x EncoderLayer(
        (attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (layernorm1): BatchNorm1d(128, eps=1e-06, momentum=0.1, affine=True, track_running_stats=True)
        (fc1): Linear(in_features=128, out_features=128, bias=True)
        (fc2): Linear(in_features=128, out_features=128, bias=True)
        (layernorm2): BatchNorm1d(128, eps=1e-06, momentum=0.1, affine=True, track_running_stats=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(32000, 128, padding_idx=0)
    (dec_layers): ModuleList(
      (0-1): 2 x DecoderLayer(
        (mha1): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, o


Now we'll define the optimizer and the loss function. For this task the model tries to predict the masked words, so, as in the previous lab, the loss function will be `torch.nn.NLLLoss(...)`.
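
`NLLLoss` expects log-probabilities rather than raw scores, and `ignore_index` lets it skip padded positions. The sketch below illustrates both points on toy tensors. It is not the notebook's training code: the sizes and values are made up, and whether `transformer_utils.Transformer` already applies `log_softmax` to its output is not shown in this section, so that step is an assumption here.

```python
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax

torch.manual_seed(0)
batch, seq_len, vocab = 2, 4, 10                  # toy sizes, not the notebook's

toy_logits  = torch.randn(batch, seq_len, vocab)  # stand-in for raw model scores
toy_targets = torch.tensor([[3, 5, 1, 0],         # trailing 0s are padding
                            [7, 1, 0, 0]])

log_probs = log_softmax(toy_logits, dim=-1)       # NLLLoss needs log-probabilities
criterion = nn.NLLLoss(ignore_index=0)            # padded positions do not contribute

# Flatten to (batch * seq_len, vocab) and (batch * seq_len,) before computing the loss.
loss = criterion(log_probs.reshape(-1, vocab), toy_targets.reshape(-1))

# log_softmax + NLLLoss is equivalent to CrossEntropyLoss applied to the raw scores.
reference = nn.CrossEntropyLoss(ignore_index=0)(
    toy_logits.reshape(-1, vocab), toy_targets.reshape(-1)
)
print(loss.item(), reference.item())
```

The two printed values agree, which is a convenient way to check that a model's output really is in log-probability form before pairing it with `NLLLoss`.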


## Setting Up Dataloader

To train a PyTorch model we need to arrange the data into a dataset and a dataloader. We'll get the `inputs` and the `targets` for the transformer model from `inputs_targets_pairs`. Before creating the dataset, we need to make sure that all `inputs` have the same length by truncating the longer sequences and padding the shorter ones with `0`. The same must be done for the targets.

We'll use `BATCH_SIZE = 64`.

In [None]:
# Imports needed for the dataset and dataloader
from torch.utils.data import TensorDataset, DataLoader

# Parameters
encoder_maxlen = 150
decoder_maxlen = 50
BATCH_SIZE     = 64   # matches the batch size stated above and the shapes printed below
num_epochs     = 40


# inputs_targets_pairs must already be defined: a list of (input_seq, target_seq)

# Split inputs and targets
input_seqs  = [x[0] for x in inputs_targets_pairs]
target_seqs = [x[1] for x in inputs_targets_pairs]

# Pad sequences
def pad_sequences(sequences, maxlen, pad_value=0):
    """Truncate each sequence to maxlen and right-pad shorter ones with pad_value."""
    return np.array([
        seq[:maxlen] + [pad_value] * max(0, maxlen - len(seq))
        for seq in sequences
    ])

inputs  = pad_sequences(input_seqs, encoder_maxlen)
targets = pad_sequences(target_seqs, decoder_maxlen)

# Convert to tensors
inputs_tensor  = torch.tensor(inputs, dtype=torch.long)
targets_tensor = torch.tensor(targets, dtype=torch.long)

# Create dataset and dataloader
dataset = TensorDataset(inputs_tensor, targets_tensor)
dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    pin_memory=True,
    num_workers=2  # Tune this depending on your CPU
)

# Skip the following cells until you reach the one labeled "skip_end!"


In [62]:
# ignore this cell
# torch.cuda.empty_cache()
# torch.cuda.reset_peak_memory_stats()


In [64]:
####################################################################################################################

In [65]:
# ignore this cell
# testing dataloader
for x, y in dataloader:
    print(x.shape, y.shape)  # Should print [64, 150] and [64, 50]
    break


torch.Size([64, 150]) torch.Size([64, 50])


In [68]:
# ignore this cell
len(dataloader)


157
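
The number of batches follows directly from the dataset size and the batch size. A small worked check, assuming the 10000 examples loaded earlier, the batch size of 64 implied by the shapes printed above, and the `DataLoader` default of keeping the last incomplete batch:

```python
import math

num_examples = 10_000  # len(natural_language_texts)
batch_size   = 64      # batch size implied by the recorded shapes

num_batches = math.ceil(num_examples / batch_size)
last_batch  = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch)  # 157 batches, the last one holding 16 examples
```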

In [71]:
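# ignore this cell
# Inspecting predictions shape
inp, tar = next(iter(dataloader))            # grab one batch
inp, tar = inp.to(device_), tar.to(device_)  # move it to the same device as the model
transformer.to(device_)
preds, _ = transformer(inp, tar)
print('preds: ', preds.shape)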


preds:  torch.Size([64, 50, 32000])


In [72]:
# ignore this cell
# testing with optimizer and loss function
criterion = nn.NLLLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.001)

# Ensure outputs is of type Float
outputs = preds.float().clone()
outputs = outputs.reshape(-1, preds.shape[2])
print('outputs: ', outputs.shape)

# Ensure targets is of type Long
targets = tar
targets = targets.reshape(-1)
print('targets: ', targets.shape)

optimizer.zero_grad()

# Calculate loss
loss = criterion(outputs, targets)
loss.backward()

print(loss.item())


outputs:  torch.Size([3200, 32000])
targets:  torch.Size([3200])
10.636120796203613
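
As a sanity check on the loss printed above: `NLLLoss` averages the negative log-probability of the correct token over the positions that are not ignored. A model that spreads probability uniformly over the 32,000-token vocabulary would score ln(32000) ≈ 10.37, so a value near 10.6 is roughly what an untrained model should give. The sketch below recomputes that baseline and mimics `ignore_index=0` by hand in plain Python. The helper name `masked_nll` and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import math

def masked_nll(log_probs, targets, pad_idx=0):
    """Mean negative log-likelihood over positions whose target is not pad_idx.

    log_probs: list of per-position lists of log-probabilities (one per vocab id)
    targets:   list of target token ids, same length as log_probs
    """
    kept = [-lp[t] for lp, t in zip(log_probs, targets) if t != pad_idx]
    return sum(kept) / len(kept)

# Baseline: uniform prediction over a 32,000-token vocabulary
vocab_size = 32000
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.4f}")

# Toy check with a 4-token vocabulary; the last position is padding and is skipped
uniform4 = [math.log(0.25)] * 4
toy_loss = masked_nll([uniform4, uniform4, uniform4], targets=[2, 3, 0])
print(f"{toy_loss:.4f}")
```

Because padded positions are dropped before averaging, long runs of `0` in a target do not pull the loss down artificially.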


In [74]:
# ignore this cell
tar[:, :-1] # tar_inp |


tensor([[31999,   549, 31998,  ..., 31979,  8160, 31978],
        [31999,    46, 31998,  ...,   268, 31979,    19],
        [31999,     7, 31998,  ...,     0,     0,     0],
        ...,
        [31999,   242, 31998,  ...,     0,     0,     0],
        [31999,   208, 31998,  ..., 31977,    63, 31976],
        [31999,  1609,     3,  ...,     0,     0,     0]], device='cuda:0')

In [75]:
# ignore this cell
tar[:, 1:] # tar_real | to loss fn


tensor([[  549, 31998,  9537,  ...,  8160, 31978,  6708],
        [   46, 31998,    25,  ..., 31979,    19, 31978],
        [    7, 31998, 26306,  ...,     0,     0,     0],
        ...,
        [  242, 31998,    49,  ...,     0,     0,     0],
        [  208, 31998, 29077,  ...,    63, 31976,  9754],
        [ 1609,     3, 31998,  ...,     0,     0,     0]], device='cuda:0')

In [77]:
# ignore this cell
inp[0,:]


tensor([31999, 22735,  1820, 31998,    33,  6470,    12,  2467,    69,  3718,
         1476,  4677,  1213, 31997,    13,   334,   847,  1684,    44,     3,
        18735,  6366,    44,     8,  1166, 31996,  3227,  2904,  4466,   137,
        17173,  1820, 31995,  1338, 31994,    36,  1213,    16,  1718,    11,
         1660,     6,   383,     8,  2578,     3,   184, 15740,  9088,     7,
            7,     5,   332, 22684, 20344, 18206,  1820,    71,   372,  2634,
           56,    36,  2098, 31993, 23096,  2028,    78,    24,    62,   164,
        31992,  1338,    44,   489,  2028,     5,   863,   391,     5,   134,
            5,   553,     5,   345,     5,    12,  2753,  1741, 31991,    17,
            5,  1677,     3,    99,    25,   515,    12,  2467,    78, 31990,
           54,   766,    62,    43,   631,   542,    21,     8,   372,  2634,
            5,  4083, 13961, 22492,  3430,     3, 31989, 24596,   283, 26418,
         2365,  5568,  2853,  4305, 17116,  4677,  1522,    36, 

In [78]:
# ignore this cell
inp_de = inp[0, :].to('cpu').clone().numpy()
inp_de = tokenizer.detokenize(inp_de.tolist())
inp_de


'InternaționalHEN | erwachsene are encouraged to attend our monthly board meetings held Cushion of every month starting at 7:00 pm at the Center imunitarunless otherwise noted). NOTE | Intellectual meeting traditi be held in July and August, during the Director & Volunteer recess. TEAM DINNER | A team dinner will be served disguise 6:30pm so that we mayexerce meeting at 7pm. Please R.S.V.P. to president@nourishet.org if you plan to attend so predominant can ensure we have enough food for the team dinner. REGULAR AND amitiéUAL MEETING Section 4.02 Regular meetings shall be held on the second Thursday of every month at 7:00 p. erkennt., unless such daydimension'

In [79]:
# ignore this cell
tar[0,:]


tensor([31999,   549, 31998,  9537, 31997,    30,     8,   511,  2721, 31996,
           41, 31995,   465, 31994,    56, 31993,    44, 31992,   456,     8,
        31991,  9361,     7, 31990,    62, 31989, 21478, 31988,    51, 31987,
         7250, 31986,  1522, 31985,  1781, 31984,   416, 31983,     5, 31982,
          284, 31981,   215, 31980,     8,  7389, 31979,  8160, 31978,  6708],
       device='cuda:0')

In [80]:
# ignore this cell
tar_de = tar[0, :].to('cpu').clone().numpy()
tar_de = tokenizer.detokenize(tar_de.tolist())
tar_de



'Internațional W erwachsene Members Cushion on the second Thursday imunitar ( Intellectual No traditi will disguise atexerce start thenourisheidess predominant weamitiéANN erkenntmdimension falls inférieur shall refugi hour cheddar next unterlieg. garanteaz eachfăcute yearréglage the Annual pedepse electedGermain Corporation'

In [81]:
# ignore this cell
pretty_decode(inp_de, sentinels, tokenizer)


'<Z>HEN | <Y> are encouraged to attend our monthly board meetings held <X> of every month starting at 7:00 pm at the Center <W>unless otherwise noted). NOTE | <V> meeting <U> be held in July and August, during the Director & Volunteer recess. TEAM DINNER | A team dinner will be served <T> 6:30pm so that we may<S> meeting at 7pm. Please R.S.V.P. to president@<R>t.org if you plan to attend so <Q> can ensure we have enough food for the team dinner. REGULAR AND <P>UAL MEETING Section 4.02 Regular meetings shall be held on the second Thursday of every month at 7:00 p. <O>., unless such day<N>'

In [82]:
# ignore this cell
pretty_decode(tar_de, sentinels, tokenizer)


'<Z> W <Y> Members <X> on the second Thursday <W> ( <V> No <U> will <T> at<S> start the<R>idess <Q> we<P>ANN <O>m<N> falls <M> shall <L> hour <K> next <J>. <I> each<H> year<G> the Annual <F> elected<E> Corporation'
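
The two pretty-printed strings above are complementary: the input holds the text with sentinels in place of the masked spans, and the target lists each sentinel followed by the span it hides. A minimal sketch of putting them back together is shown below. The function `merge_spans` is hypothetical (it is not defined in this notebook), it only handles the single-letter `<A>`..`<Z>` sentinels seen in these outputs, and it strips whitespace around each span as a simplifying assumption.

```python
import re

SENTINEL = re.compile(r"(<[A-Z]>)")

def merge_spans(masked_input, target):
    """Put each span from `target` back where its sentinel sits in `masked_input`."""
    parts = SENTINEL.split(target)
    # parts alternates: leading text, sentinel, span, sentinel, span, ...
    fills = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
    # Sentinels without a matching span in the target are left untouched
    return SENTINEL.sub(lambda m: fills.get(m.group(1), m.group(1)), masked_input)

masked_input = "The <Z> fox jumps over the <Y> dog"
target = "<Z> quick brown <Y> lazy"
restored = merge_spans(masked_input, target)
print(restored)

# The opening of the example decoded above
prefix = merge_spans("<Z>HEN | <Y> are encouraged", "<Z> W <Y> Members")
print(prefix)
```

This is only a reading aid for the decoded strings; exact whitespace cannot be recovered reliably from pretty-printed text.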

In [83]:
########################################################################################################################

# skip_end!

## Training Loop

Now we can run the training loop for 10 epochs. Training on the full C4 dataset, even on a machine with ample memory and a good GPU, could take more than 24 hours. Here, we run only a few epochs on a small portion of C4 for illustration. This takes just a few minutes, but the resulting model won't be very powerful.


In [87]:
from torch import cuda
from torch import device, tensor
from torch.nn import NLLLoss
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import time
from torch.amp import GradScaler, autocast
from torch import save

# Set device
device_ = device('cuda' if cuda.is_available() else 'cpu')


# checkpoint = torch.load('./drive/MyDrive/model_checkpoints/checkpoint_epoch_15.pt', map_location=device_)
# transformer.load_state_dict(checkpoint['model_state_dict'])
# transformer.eval()  # Optional: set to eval mode if you're doing inference


learning_rate  = 1e-3
pad_idx        = 0


# Define model, criterion, optimizer
# transformer = YourTransformerModel(...) # Define or import your model
transformer.to(device_)

criterion = nn.NLLLoss(ignore_index=pad_idx)
optimizer = AdamW(transformer.parameters(), lr=learning_rate)
scaler    = GradScaler()  # For mixed precision

# Training loop
for epoch in range(num_epochs):
    print(f'\nEpoch [{epoch + 1}/{num_epochs}]')
    start_time = time.time()

    transformer.train()
    running_loss = 0

    for batch_idx, (inp, tar) in enumerate(dataloader):
        inp = inp.to(device_, non_blocking=True)
        tar = tar.to(device_, non_blocking=True)

        optimizer.zero_grad()

        with autocast(device_type=device_.type):

            # Teacher forcing: every 'tar' sequence already starts with token 31999, which acts as the start token,
            # so tar[:, :-1] is the decoder input and tar[:, 1:] (below) is the label sequence | same effect as a right shift
            preds, _ = transformer(inp, tar[:, :-1])
            outputs  = preds.reshape(-1, preds.shape[2])
            targets  = tar[:, 1:].reshape(-1)
            loss     = criterion(outputs, targets)

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)  # Unscale gradients first so clipping sees their true norm
        torch.nn.utils.clip_grad_norm_(transformer.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()

        running_loss += loss.item()
        avg_loss = running_loss / (batch_idx + 1)

        if (batch_idx + 1) % 20 == 0:
            print(f"[Batch {batch_idx + 1}/{len(dataloader)}] Loss: {avg_loss:.4f}")

    epoch_time = time.time() - start_time
    print(f"Epoch Time: {epoch_time:.2f}s | Average Loss: {avg_loss:.4f}")
    save({'model_state_dict': transformer.state_dict()}, f'checkpoint_epoch_{epoch+1}.pt')
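
The slicing in the loop above (`tar[:, :-1]` fed to the decoder, `tar[:, 1:]` used as labels) is easier to see on a toy sequence. In this sketch plain Python lists stand in for tensors, and the token ids are made up for illustration.

```python
# Toy target batch with one sequence: start sentinel, three tokens, end token (ids are made up)
tar = [[31999, 549, 31998, 9537, 1]]

decoder_input = [row[:-1] for row in tar]  # mirrors tar[:, :-1]
labels        = [row[1:]  for row in tar]  # mirrors tar[:, 1:]

# At each step the decoder has seen everything up to `seen` and must predict `want`
for step, (seen, want) in enumerate(zip(decoder_input[0], labels[0])):
    print(f"step {step}: decoder sees {seen:>5} -> should predict {want}")
```

Both slices have length `decoder_maxlen - 1`, which is why the prediction tensor inspected later in the notebook has 49 positions for 50-token targets.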




In [88]:
# # model saving and loading code

# import torch

# torch.cuda.empty_cache()
# torch.cuda.reset_peak_memory_stats()


# torch.save({
#     'epoch': epoch,
#     'model_state_dict': transformer.state_dict(),
#     'optimizer_state_dict': optimizer.state_dict(),
#     'scaler_state_dict': scaler.state_dict(),  # If using AMP
#     'loss': avg_loss
# }, os.path.join(SAVE_PATH, f'checkpoint_epoch_{epoch+1}.pt'))


# checkpoint = torch.load('your checkpoint', map_location=device_)

# transformer.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
# scaler.load_state_dict(checkpoint['scaler_state_dict'])  # Only if you're using AMP

# start_epoch = checkpoint['epoch'] + 1
# transformer.to(device_)
# transformer.train()


# Skip the following cells until you reach the one labeled "skip_end!"

In [None]:
##########################################################################################################################

#### Checking model's inputs and outputs

In [142]:
# from torch import load
# from torch import device, tensor
# device_ = device('cuda' if torch.cuda.is_available() else 'cpu')

# path = SAVE_PATH + '/pretrain checkpoint_epoch_15.pt'

# checkpoint = load(path, map_location=device_, weights_only=True)
# transformer.load_state_dict(checkpoint['model_state_dict'])



<All keys matched successfully>

In [None]:
inp[2,:]

tensor([    3,  8691,    16,     8,   842,    13, 27874,   896,  2969,     6,
           69,  1487,    18,  4905,  6432,    19,     3, 20690, 31999,    45,
         1012,  8548, 10108,     6, 31998,  2698,  6411,  1384,    11,  1651,
         7488,  7120, 31997,  1651,  2309,     6,  3661,     6,  5441, 15754,
            7,    11,   452,  1855,   931,    33,   269,    30,     8, 26929,
        31996,   492,    69, 27874,   896,  2969,  1128,     8,  1523,  1247,
           21, 31995,   384,    42,   268,  1469, 31994,   421,  1595,    19,
         1772,   204, 31993,  8538, 31992,  2801,    28,   305,   941,  2542,
           11,  1338,  2801,     5,    37,  4983,    32,    26,   397, 27874,
           19,  1069,    16,     8,   842,    13, 31991,     5,    37,  2062,
         1983, 31990,  3885,     7, 31989,  8382,    16,     8, 31988,    13,
        31987, 15916,  2893,     6, 27874,     5,    71,   775,     3,    31,
        31986,  2237,    31,   785, 31985,     8,   182, 31984, 

In [None]:
tar[2,:]


tensor([31999,  8382,   676, 31998,     8, 31997,     5, 31996,     6, 31995,
          136, 31994,     5, 31993,  4118, 31992, 20680, 31991, 27874,   690,
        31990,   106,    11, 31989,    19, 31988,   842, 31987,     8, 31986,
         9124, 31985,    16, 31984,  2050, 31983,  4629, 31982,  1310,  2946,
        31981,  1212, 31980,    49,    30, 31979,     3, 31978,   298,   464],
       device='cuda:0')

In [None]:
inp_de = tokenizer.detokenize(inp[2,:].tolist())

inp_de


"Located in the heart of Belfast City Centre, our budget-friendly accommodation is ideally Internațional from popular tourist attractions, erwachsene Grand Opera House and Great Victoria Square Cushion Great shopping, restaurants, historic landmarks and public transport options are right on the doorstep imunitar making our Belfast City Centre location the ideal base for Intellectual family or business trip traditi Our hotel is offering 2 disguise beautifullyexerce rooms with 5 modern conference and meeting rooms. The Travelodge Belfast is located in the heart ofnourishe. The restaurant Act predominant Sonsamitié situated in the erkennt ofdimension linen quarter, Belfast. A unique ' inférieur led' property refugi the very cheddar of Belfast. E unterliegIC is the most garanteaz and most sophisticated of the restaurants in the Deanes portfolio"

In [None]:
pretty_decode(inp_de, sentinels, tokenizer)


"Located in the heart of Belfast City Centre, our budget-friendly accommodation is ideally <Z> from popular tourist attractions, <Y> Grand Opera House and Great Victoria Square <X> Great shopping, restaurants, historic landmarks and public transport options are right on the doorstep <W> making our Belfast City Centre location the ideal base for <V> family or business trip <U> Our hotel is offering 2 <T> beautifully<S> rooms with 5 modern conference and meeting rooms. The Travelodge Belfast is located in the heart of<R>. The restaurant Act <Q> Sons<P> situated in the <O> of<N> linen quarter, Belfast. A unique ' <M> led' property <L> the very <K> of Belfast. E <J>IC is the most <I> and most sophisticated of the restaurants in the Deanes portfolio"

In [None]:
tar_de = tokenizer.detokenize(tar[2,:].tolist())
tar_de


'Internațional situated minutes erwachsene the Cushion. imunitar, Intellectual any traditi. disguise37exerce furnishednourishe Belfast city predominanton andamitié is erkennt heartdimension the inférieurdesign refugi in cheddar centre unterliegIP garanteaz recently openedfăcute Meréglageer on pedepse Germain while working'

In [None]:
pretty_decode(tar_de, sentinels, tokenizer)


'<Z> situated minutes <Y> the <X>. <W>, <V> any <U>. <T>37<S> furnished<R> Belfast city <Q>on and<P> is <O> heart<N> the <M>design <L> in <K> centre <J>IP <I> recently opened<H> Me<G>er on <F> <E> while working'

In [None]:
##########################################################################

In [None]:
transformer.eval()
preds, _ = transformer(inp[2:4,:], tar[2:4,:-1])


In [None]:
preds.shape

torch.Size([2, 49, 32000])

In [None]:
preds2 = torch.argmax(preds, dim=-1)
preds2


tensor([[ 4983, 31998, 31998,  8548, 31997,  8548, 31996,  8548, 31995,  5441,
         31994,     8, 31993,  4983,  2801,  4983, 31991,  8538, 31990, 31990,
          2801, 31989, 31989,  4983, 31988,  2698, 31987,   628, 31986,  4983,
         31985,  4983, 31984,     8, 31983,     8, 31982,  3214, 31981, 31981,
          2801, 31980,  4983, 31979, 31979,   628, 31978,  3214, 31977],
        [    3, 31998,  7717, 31997, 31997, 31997,     3, 31996, 16959, 31995,
             3, 31994,     3, 31993,     3, 31992,     3, 31991, 10503, 31990,
            12, 31989,     3, 31988,     3, 31987,     3, 31986,     3, 31985,
             3, 31984,     3, 31983,     3, 31982,     3, 31981, 26661, 31980,
             3, 31979, 16959, 31978,     3, 31977,     3, 31976,    12]],
       device='cuda:0')

In [None]:
inp_de = tokenizer.detokenize(preds2[0].tolist())

inp_de

'Travel erwachsene erwachsene tourist Cushion tourist imunitar tourist Intellectual historic traditi the disguise Travel rooms Travelnourishe beautifully predominant predominant roomsamitiéamitié Travel erkennt Granddimension space inférieur Travel refugi Travel cheddar the unterlieg the garanteaz walkingfăcutefăcute roomsréglage Travel pedepse pedepse spaceGermain walkingdistinctly'

In [None]:
pretty_decode(preds2[0].tolist(), sentinels, tokenizer)


'Travel <Y> <Y> tourist <X> tourist <W> tourist <V> historic <U> the <T> Travel rooms Travel<R> beautifully <Q> <Q> rooms<P><P> Travel <O> Grand<N> space <M> Travel <L> Travel <K> the <J> the <I> walking<H><H> rooms<G> Travel <F> <F> space<E> walking<D>'

In [None]:
#############################################################################################################################

# skip_end!


# **Fine-Tuning Section Part-1**


## Loading a pretrained model

To show how powerful this model actually is, we trained it for several epochs with the full dataset in Colab and saved the weights. We can load them using the cell below. For the rest of the notebook, we'll see transfer learning in action.

### Instantiating model architecture

In [None]:
# Define the model parameters
num_layers                 = 2
embedding_dim              = 128
fully_connected_dim        = 128
num_heads                  = 2
positional_encoding_length = 256

encoder_vocab_size = int(tokenizer.vocab_size())
decoder_vocab_size = encoder_vocab_size

# Initialize the model
transformer = transformer_utils.Transformer(
    num_layers,
    embedding_dim,
    num_heads,
    fully_connected_dim,
    encoder_vocab_size,
    decoder_vocab_size,
    positional_encoding_length,
    positional_encoding_length,
)

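# Use a torch.device (not a plain string) so that device_.type works with autocast in the training loop
device_ = torch.device('cuda' if torch.cuda.is_available() else 'cpu')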
transformer.to(device_)

### Loading Weights

In [141]:
# from torch import load
# from torch import device, tensor
# device_ = device('cuda' if torch.cuda.is_available() else 'cpu')


# checkpoint = load('pretrain checkpoint_epoch_15.pt', map_location=device_, weights_only=True)
# transformer.load_state_dict(checkpoint['model_state_dict'])



# Data Processing for Fine-Tuning

Now we'll fine-tune the pretrained model for question answering using the [SQuAD 2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/).

SQuAD, short for Stanford Question Answering Dataset, is a dataset designed for training and evaluating question answering systems. It consists of real questions posed by humans on a set of Wikipedia articles, where the answer to each question is a specific span of text within the corresponding article.

SQuAD 1.1, the previous version of the dataset, contains 100,000+ question-answer pairs on about 500 articles.
SQuAD 2.0 adds more than 50,000 questions that cannot be answered from the context. This extra set of questions helps train models to detect unanswerable questions.

Let's load the dataset.

## Data Loading - SQuAD 2.0 Dataset

In [67]:
with open('data/train-v2.0.json', 'r') as f:
    example_jsons = json.load(f)

example_jsons = example_jsons['data']

print('Number of articles: ' + str(len(example_jsons)))


Number of articles: 442


The structure of each article is as follows:
- `title`: The article title
- `paragraphs`: A list of paragraphs and questions related to them
    - `context`: The actual paragraph text
    - `qas`: A list of questions related to the paragraph
        - `question`: A question
        - `id`: The question's unique identifier
        - `is_impossible`: Boolean, `True` when the question cannot be answered from the context
        - `answers`: A list of possible answers to the question
            - `text`: The answer
            - `answer_start`: The character index in `context` at which the answer text begins
            
Take a look at an article by running the next cell. Notice that the `context` is usually the last element for every paragraph:           

In [68]:
example_article = example_jsons[0]
example_article

print("Title: " + example_article["title"])
example_article["paragraphs"][0]


Title: Beyoncé


{'qas': [{'question': 'When did Beyonce start becoming popular?',
   'id': '56be85543aeaaa14008c9063',
   'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
   'is_impossible': False},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'id': '56be85543aeaaa14008c9065',
   'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
   'is_impossible': False},
  {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
   'id': '56be85543aeaaa14008c9066',
   'answers': [{'text': '2003', 'answer_start': 526}],
   'is_impossible': False},
  {'question': 'In what city and state did Beyonce  grow up? ',
   'id': '56bf6b0f3aeaaa14008c9601',
   'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
   'is_impossible': False},
  {'question': 'In which decade did Beyonce become famous?',
   'id': '56bf6b0f3aeaaa14008c9602',
   'answers': [{'text': 'late 1990s', 'answer_start': 276}],
   'is_impossible': False},
  {'q

The previous article might be difficult to navigate so here is a nicely formatted example paragraph:
```python
{
  "context": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles 'Crazy in Love' and 'Baby Boy'",
  "qas": [
    {
      "question": "When did Beyonce start becoming popular?",
      "id": "56be85543aeaaa14008c9063",
      "answers": [
        {
          "text": "in the late 1990s",
          "answer_start": 269
        }
      ],
      "is_impossible": false
    },
    {
      "question": "What areas did Beyonce compete in when she was growing up?",
      "id": "56be85543aeaaa14008c9065",
      "answers": [
        {
          "text": "singing and dancing",
          "answer_start": 207
        }
      ],
      "is_impossible": false
    }
  ]
}
```
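
To make the role of `answer_start` concrete, the sketch below slices a context string at that offset and checks that the slice equals the answer text. The paragraph is a shortened toy version written for this example, so its offset (19) applies to this string only and differs from the offsets in the real dataset.

```python
# Toy paragraph in the same shape as a SQuAD entry (offset computed for this string only)
paragraph = {
    "context": "Born and raised in Houston, Texas, she performed in various competitions.",
    "qas": [
        {
            "question": "In what city and state did she grow up?",
            "answers": [{"text": "Houston, Texas", "answer_start": 19}],
            "is_impossible": False,
        }
    ],
}

context = paragraph["context"]
for qa in paragraph["qas"]:
    for answer in qa["answers"]:
        start = answer["answer_start"]
        end = start + len(answer["text"])
        span = context[start:end]  # characters covered by the answer
        print(repr(span), span == answer["text"])
```

The same check can be run over the real `example_jsons` entries to confirm that every answer is a literal span of its context.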

## Data Parsing - Creating Input-Target Pairs

In [69]:
def parse_squad(dataset):
    """Extract all the question/answer pairs from the SQuAD dataset

    Args:
        dataset (list): The list of articles stored under the 'data' key of the imported JSON

    Returns:
        inputs, targets: Two lists containing the inputs and the targets for the QA model
    """

    inputs, targets = [], []

    ### START CODE HERE ###

    # Loop over all the articles
    for article in dataset:

        # Loop over each paragraph of each article
        for paragraph in article['paragraphs']:

            # Extract context from the paragraph
            context = paragraph['context']

            # Loop over each question of the given paragraph
            for qa in paragraph['qas']:

                # If this question is not impossible and there is at least one answer
                if len(qa['answers']) > 0 and not qa['is_impossible']:

                    # Create the question/context sequence
                    question_context = 'question: ' + qa['question'] + ' context: ' + context

                    # Create the answer sequence. Use the text field of the first answer
                    answer = 'answer: ' + qa['answers'][0]['text']

                    # Add the question_context to the inputs list
                    inputs.append(question_context)

                    # Add the answer to the targets list
                    targets.append(answer)

    ### END CODE HERE ###

    return inputs, targets
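
To see the filtering on a small case, the sketch below runs the same logic on a hand-made dataset with one answerable and one unanswerable question. `parse_squad_sketch` is a compact restatement of the function above so that the block runs on its own, and the toy article is invented for illustration.

```python
def parse_squad_sketch(dataset):
    """Compact restatement of parse_squad so this block runs on its own."""
    inputs, targets = [], []
    for article in dataset:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Keep only answerable questions that have at least one answer
                if qa["answers"] and not qa["is_impossible"]:
                    inputs.append(f"question: {qa['question']} context: {context}")
                    targets.append(f"answer: {qa['answers'][0]['text']}")
    return inputs, targets

toy_dataset = [{
    "title": "Toy",
    "paragraphs": [{
        "context": "The sky is blue.",
        "qas": [
            {"question": "What color is the sky?", "id": "q1",
             "answers": [{"text": "blue", "answer_start": 11}], "is_impossible": False},
            {"question": "What color is the sea?", "id": "q2",
             "answers": [], "is_impossible": True},
        ],
    }],
}]

toy_inputs, toy_targets = parse_squad_sketch(toy_dataset)
print(toy_inputs)
print(toy_targets)
```

Only the first question survives, which is why the pair count reported for the real data is smaller than the total number of questions in SQuAD 2.0.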


In [70]:
inputs, targets =  parse_squad(example_jsons)
print("Number of question/answer pairs: " + str(len(inputs)))

print('\nFirst Q/A pair:\n\ninputs: ' + colored(inputs[0], 'blue'))
print('\ntargets: ' + colored(targets[0], 'green'))
print('\nLast Q/A pair:\n\ninputs: ' + colored(inputs[-1], 'blue'))
print('\ntargets: ' + colored(targets[-1], 'green'))


Number of question/answer pairs: 86821

First Q/A pair:

inputs: question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

targets: answer: in the late 1990s

Last Q/A pair:

inputs: question: What is KMC an initialism of? context: Kathmandu Metropolitan City (KMC), in order to 

#### **Expected Output:**
```
Number of question/answer pairs: 86821

First Q/A pair:

inputs: question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

targets: answer: in the late 1990s

Last Q/A pair:

inputs: question: What is KMC an initialism of? context: Kathmandu Metropolitan City (KMC), in order to promote international relations has established an International Relations Secretariat (IRC). KMC's first international relationship was established in 1975 with the city of Eugene, Oregon, United States. This activity has been further enhanced by establishing formal relationships with 8 other cities: Motsumoto City of Japan, Rochester of the USA, Yangon (formerly Rangoon) of Myanmar, Xi'an of the People's Republic of China, Minsk of Belarus, and Pyongyang of the Democratic Republic of Korea. KMC's constant endeavor is to enhance its interaction with SAARC countries, other International agencies and many other major cities of the world to achieve better urban management and developmental programs for Kathmandu.

targets: answer: Kathmandu Metropolitan City
```

We'll use 40,000 samples for training and 5,000 samples for testing.

### Creating Train and Test Splits

In [71]:
# 40K pairs for training
inputs_train = inputs[0:40000]
targets_train = targets[0:40000]

# 5K pairs for testing
inputs_test = inputs[40000:45000]
targets_test = targets[40000:45000]
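
As a quick arithmetic check on these slices, the sketch below computes how many batches each split yields, assuming the batch size of 64 that this notebook uses for its dataloaders and the default `DataLoader` behavior of keeping the final partial batch.

```python
import math

n_train = 40000   # len(inputs[0:40000])
n_test = 5000     # len(inputs[40000:45000])
batch_size = 64   # BATCH_SIZE used for the dataloaders in this notebook

# A DataLoader that keeps the last partial batch yields ceil(n / batch_size) batches
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
print(train_batches, test_batches)
```

The training figure should match the value of `len(dataloader)` once the fine-tuning dataloader is built.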


# **Fine-Tuning Section Part-2**

## Setting up Dataloader
Now we can build the batched dataset of padded sequences. We first tokenize the inputs and the targets, then force every sequence to the required length: sequences longer than the limit are truncated, and shorter ones are padded with `0`. This mirrors the setup used earlier for pretraining.

In [135]:
import torch
from torch.utils.data import TensorDataset, DataLoader
import numpy as np

# Hyperparameters
encoder_maxlen = 150
decoder_maxlen = 50
BATCH_SIZE     = 64

# EOS token id (usually 1 for SentencePiece)
eos_id = 1

# Tokenize inputs
inputs_str = [tokenizer.encode(s, out_type=int) for s in inputs_train]

# Tokenize targets and add EOS token
targets_str = [tokenizer.encode(s, out_type=int) + [eos_id] for s in targets_train]

# Padding function
def pad_sequences(sequences, maxlen, pad_value=0):
    return np.array([
        seq[:maxlen] + [pad_value] * max(0, maxlen - len(seq))
        for seq in sequences
    ])

# Pad inputs and targets
inputs_padded  = pad_sequences(inputs_str, encoder_maxlen)
targets_padded = pad_sequences(targets_str, decoder_maxlen)

# Convert to torch tensors
inputs_tensor  = torch.tensor(inputs_padded, dtype=torch.long)
targets_tensor = torch.tensor(targets_padded, dtype=torch.long)

# Create PyTorch dataset and dataloader
dataset    = TensorDataset(inputs_tensor, targets_tensor)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
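
The truncate-or-pad rule inside `pad_sequences` can be checked on two short sequences. The sketch below is a list-only variant written for illustration (the name `pad_or_truncate` is not part of the notebook), applied with a deliberately small `maxlen` of 5.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """List-only version of the per-sequence rule inside pad_sequences (no NumPy)."""
    return seq[:maxlen] + [pad_value] * max(0, maxlen - len(seq))

short_seq = [1525, 10, 24207, 1]          # shorter than maxlen -> padded with 0
long_seq = [822, 10, 363, 1440, 43, 600]  # longer than maxlen  -> truncated

padded = pad_or_truncate(short_seq, 5)
truncated = pad_or_truncate(long_seq, 5)
print(padded)
print(truncated)
```

One consequence of truncating after the EOS token is appended: any target longer than `decoder_maxlen` loses its trailing EOS.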


# Skip the following cells until you reach the one labeled "skip_end!"

In [136]:
for (batch, (inp, tar)) in enumerate(dataloader):
    if batch >=2:
        break
inp.shape, tar.shape


(torch.Size([64, 150]), torch.Size([64, 50]))

In [100]:
inp

tensor([[ 822,   10,  571,  ...,    5,   37,  167],
        [ 822,   10,  363,  ...,    0,    0,    0],
        [ 822,   10,  363,  ...,   31,    7, 3251],
        ...,
        [ 822,   10,   37,  ...,  892,    5,    0],
        [ 822,   10,  366,  ...,   33,   59,  347],
        [ 822,   10,  363,  ..., 6982, 5752,  120]])

In [101]:
inp[5,:]


tensor([  822,    10,   363,  1440,    43,   600,   452,  8981,    68,   731,
        13100,    58,  2625,    10,  1881,   324,     7,   757,    11,  2399,
          452,  2887,    19,     6,    16,  1402,     6,    16, 29112,    44,
         1020,    13,     3, 18036,    63,   159,    51,     6,  4583,  1549,
            7,     6,    11, 10960,    15, 15133,   297,     5,  4961, 26221,
           26,  4750,    11,     3, 26968,     6,    73, 23313,  2314,  3498,
            3, 31488,     8,   682,     5,   100,    19,    80,  5464,    21,
        11596,  1707,    11,    20, 27522,     5,  4495,  9977,     7,    13,
        11596,  1707,   217,     8,  5464,    38, 30981,     5,    37,  5464,
           24, 13100,  6539,  6963,    45,     8,  1004,    19,     3, 31315,
           57,     8,  6831,    13,  1440,    28,   731,    12,   529,    18,
        13957, 13100,    68,   508,   452,  8981,     6,   114,     8, 24207,
         1440,     5,   506,  1440,  2604,   306,    30,     8, 

In [102]:
inp[1,:]


tensor([  822,    10,   363,    19, 18908,    57,     8,  3053,    13,     3,
          476,  1626,  7379, 10739,   445,    58,  2625,    10,  1908,  2528,
            7,    33, 18908,    57,     8,  3053,    13,     3,   476,  1626,
         7379, 10739,   445,   859,   135,     5,  1844,    83,   920,    12,
        26843,    45, 26181,  3826,     6,    34,    19,   435,    44,   306,
         1917,    16,  4575,     9,  2176,   151,     7,     5,    94,     7,
         3053,    16,  1908,  2528,     7,    41,  9341,    96, 22969,    49,
           29,  4263,     7,   121,    16,     8,   934,    61,    44,  4377,
            7,    12,     8,   529,    18,   134,   521,  7287, 14430,     7,
           41, 13682,    53,    28, 22896,   447, 14430,     7,    13,  8390,
         4491, 19867,     9,   137,  2040, 10348,  1427,     6,  4263,     7,
           33,  2389,  1126,    12, 11683,    16,  2069,    18,    15,     9,
        13072,  1740,    68,   128,  8390,  4263,     7,    33, 

In [103]:
tokenizer.id_to_piece(822)


'▁question'

In [104]:
tokenizer.id_to_piece(5)


'.'

In [105]:
tar[5,:]


tensor([ 1525,    10, 24207,     1,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])

In [106]:
tar[1,:]


tensor([1525,   10, 1908, 2528,    7,    1,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0])

In [107]:
tokenizer.id_to_piece(1525)


'▁answer'

In [108]:
tokenizer.detokenize(inp[5,:].tolist())

'question: What countries have big public sectors but low corruption? context: Extensive and diverse public spending is, in itself, inherently at risk of cronyism, kickbacks, and embezzlement. Complicated regulations and arbitrary, unsupervised official conduct exacerbate the problem. This is one argument for privatization and deregulation. Opponents of privatization see the argument as ideological. The argument that corruption necessarily follows from the opportunity is weakened by the existence of countries with low to non-existent corruption but large public sectors, like the Nordic countries. These countries score high on the Ease of Doing Business Index, due to good and often simple regulations, and have rule of'

In [109]:
tokenizer.detokenize(tar[5,:].tolist())


'answer: Nordic'

In [110]:
len(dataloader)


625


## Training Loop

Now we'll fine-tune the model. In the T5 setup, all the weights are updated during fine-tuning (full fine-tuning). The cell below sets `num_epochs = 100`; lower it if you only want a quick run. Fine-tuning this model to get state-of-the-art results would require more time and resources than are available in this environment, but you are welcome to train the model for more epochs and with more data using Colab GPUs.


In [None]:
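import os                                      # Needed for os.path.join when saving checkpoints
import time

import torch                                   # Needed for torch.no_grad, torch.topk and torch.save below
from torch import save
from torch.amp import GradScaler, autocast
from torch.nn.utils import clip_grad_norm_
from torch.nn import NLLLoss
from torch.optim import AdamW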

# Move model to device
transformer.to(device_)

# Loss, optimizer, scaler
pad_idx       = 0                              # Padding index to ignore in loss calculation
learning_rate = 1e-3                           # Learning rate for optimizer
num_epochs    = 100                            # Total number of training epochs

criterion = NLLLoss(ignore_index=pad_idx)      # Negative log-likelihood loss, ignoring pad tokens
optimizer = AdamW(transformer.parameters(), lr=learning_rate)  # AdamW optimizer for weight decay
scaler    = GradScaler()                       # Gradient scaler for mixed precision training

# Training loop
for epoch in range(num_epochs):
    print(f'\nEpoch [{epoch + 1}/{num_epochs}]')
    start_time = time.time()

    transformer.train()                                   # Set model to training mode
    running_loss = 0

    for batch_idx, (inp, tar) in enumerate(dataloader):
        inp = inp.to(device_, non_blocking=True)          # Move input to device asynchronously
        tar = tar.to(device_, non_blocking=True)          # Move target to device asynchronously

        optimizer.zero_grad()                             # Zero out previous gradients

        with autocast(device_type=device_.type):          # Enable autocasting for mixed precision
            preds, _ = transformer(inp, tar[:, :-1])      # Forward pass with input and target excluding last token
            outputs  = preds.reshape(-1, preds.shape[2])  # Flatten output for loss calculation
            targets  = tar[:, 1:].reshape(-1)             # Flatten shifted targets
            loss     = criterion(outputs, targets)        # Compute loss

        scaler.scale(loss).backward()                     # Backprop with scaled loss
        scaler.unscale_(optimizer)                        # Unscale gradients so clipping sees their true magnitudes
        clip_grad_norm_(transformer.parameters(), max_norm=1.0)  # Gradient clipping to stabilize training
        scaler.step(optimizer)                            # Optimizer step with scaler
        scaler.update()                                   # Update scaler for next iteration

        running_loss += loss.item()                       # Accumulate loss
        avg_loss      = running_loss / (batch_idx + 1)    # Compute average loss

        if (batch_idx + 1) % 20 == 0:                     # Print loss every 20 batches
            print(f"[Batch {batch_idx + 1}/{len(dataloader)}] Loss: {avg_loss:.4f}")

    epoch_time = time.time() - start_time
    print(f"Epoch Time: {epoch_time:.2f}s | Average Loss: {avg_loss:.4f}")

    # Evaluation: Top-k prediction on an eval example
    transformer.eval()                                                              # Set model to evaluation mode
    with torch.no_grad():                                                           # Disable gradient computation
        eval_preds, _   = transformer(eval_inp, eval_tar_inp)                       # Get predictions on evaluation input
        _, topk_indices = torch.topk(eval_preds[:, -1, :], k=10, dim=-1)            # Top-10 predictions on last token
        topk_indices    = topk_indices.to('cpu').int().tolist()                     # Move to CPU and convert to list
        decoded_answers = [tokenizer.detokenize([idx]) for idx in topk_indices[0]]  # Decode predictions

        print(f"\n[Eval Example]:\n{example_question}")                             # Print example input
        print(f"Top 10 Predictions: {decoded_answers}\n")                           # Print top-10 predicted answers

    # Save the model (you can customize path)
#    SAVE_PATH = f"transformer_epoch_{epoch + 1}.pt"                                # Optional: define save path per epoch
    torch.save({
                'epoch':                epoch,                                      # Save current epoch
                'model_state_dict':     transformer.state_dict(),                   # Save model parameters
                'optimizer_state_dict': optimizer.state_dict(),                     # Save optimizer state
                'scaler_state_dict':    scaler.state_dict(),                        # Save AMP scaler state
                'loss':                 avg_loss                                    # Save loss for reference
                },
        os.path.join(SAVE_PATH, f'checkpoint_epoch_{epoch+1}.pt'))                  # Save to checkpoint file

#    save({'model_state_dict': transformer.state_dict()}, save_path)                # Optional save using shorthand
    print(f"Model saved to {SAVE_PATH}")                                            # Confirm save location
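
The loop above feeds `tar[:, :-1]` to the decoder and scores the predictions against `tar[:, 1:]`, with padding ignored by the loss. The following is a minimal sketch of that shift on a toy batch. It assumes the model's last layer is a log-softmax (which is what `NLLLoss` expects); the random `log_probs` tensor and the `vocab_size` value are stand-ins for illustration, not values taken from the notebook's model.

```python
import torch
from torch.nn import NLLLoss

pad_idx = 0

# Toy target batch: two answers padded with 0 (ids copied from the printed targets above)
tar = torch.tensor([[1525, 10, 24207, 1, 0, 0],
                    [1525, 10, 1908, 2528, 7, 1]])

decoder_input = tar[:, :-1]   # what the decoder sees at each step
labels        = tar[:, 1:]    # what it should predict at each step (shifted by one)

vocab_size = 32000            # assumed toy value, only needs to exceed the largest id
torch.manual_seed(0)

# Stand-in for the model output: log-probabilities of shape (batch, seq_len, vocab)
log_probs = torch.log_softmax(
    torch.randn(tar.shape[0], decoder_input.shape[1], vocab_size), dim=-1
)

criterion = NLLLoss(ignore_index=pad_idx)
loss = criterion(log_probs.reshape(-1, vocab_size), labels.reshape(-1))

# Manual check: average the negative log-probability over non-pad positions only
mask        = labels != pad_idx
picked      = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
manual_loss = -picked[mask].mean()

print(loss.item(), manual_loss.item(), int(mask.sum()))
```

The two loss values agree because `ignore_index` drops the padded positions from both the sum and the averaging denominator.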


In [None]:
# code for saving and loading the model

# import torch

# torch.cuda.empty_cache()
# torch.cuda.reset_peak_memory_stats()


# torch.save({
#     'epoch': epoch,
#     'model_state_dict': transformer.state_dict(),
#     'optimizer_state_dict': optimizer.state_dict(),
#     'scaler_state_dict': scaler.state_dict(),  # If using AMP
#     'loss': avg_loss
# }, os.path.join(SAVE_PATH, f'checkpoint_epoch_{epoch+1}.pt'))


# checkpoint = torch.load('your checkpoint', map_location=device_)

# transformer.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
# scaler.load_state_dict(checkpoint['scaler_state_dict'])  # Only if you're using AMP

# start_epoch = checkpoint['epoch'] + 1
# transformer.to(device_)
# transformer.train()
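
As a rough, self-contained illustration of the save/load pattern in the commented cell above, the sketch below round-trips a checkpoint for a tiny stand-in model. The `nn.Linear` model, the temporary directory, and the file name are assumptions made for the example; in the notebook the same keys would hold the `transformer`, `optimizer`, and `scaler` states.

```python
import os
import tempfile

import torch
from torch import nn
from torch.optim import AdamW

model     = nn.Linear(4, 2)                      # stand-in for `transformer`
optimizer = AdamW(model.parameters(), lr=1e-3)

# One dummy step so the optimizer has internal state worth saving
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

with tempfile.TemporaryDirectory() as save_dir:  # hypothetical save location
    ckpt_path = os.path.join(save_dir, 'checkpoint_epoch_1.pt')
    torch.save({
        'epoch':                0,
        'model_state_dict':     model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, ckpt_path)

    # Rebuild fresh objects, then restore their states from the checkpoint
    restored     = nn.Linear(4, 2)
    restored_opt = AdamW(restored.parameters(), lr=1e-3)

    checkpoint = torch.load(ckpt_path, map_location='cpu', weights_only=True)
    restored.load_state_dict(checkpoint['model_state_dict'])
    restored_opt.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch'] + 1        # resume from the next epoch

print(start_epoch)
```

Restoring the optimizer state alongside the weights matters when resuming training, since AdamW keeps per-parameter moment estimates.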


To get a model that works properly, we would need to train for about 100 epochs, so we have already trained one. Let's load its weights into the current model and use it to answer questions.

# Inference

In this final step, you will implement the `answer_question` function, using a pre-trained transformer model for question answering.

To help you out, the `transformer_utils.next_word` function is provided. It receives the question and the beginning of the answer (both as tensors) along with the model, and predicts the next token in the answer. The following cells load the trained weights and run a single prediction step:

In [63]:
import torch
from torch import load
from torch import device, tensor
device_ = device('cuda' if torch.cuda.is_available() else 'cpu')

path       = 'best_qA_model.pt'
checkpoint = load(path, map_location=device_, weights_only=True)
transformer.load_state_dict(checkpoint['model_state_dict'])



<All keys matched successfully>

In [64]:
# # Define an example question and context
# example_question = "question: What color is the sky? context: Sky is blue."
# example_tar     = 'answer: '

# example_question = "question: What is the color of his shirt? context: His boots are red. His pants are orange. His shirt is pink. His hair are yellow"
# example_question = "question: Where is he sitting? context: He is sitting on a chair"

example_question = "question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles Crazy in Love and Baby Boy"
example_tar     = 'answer: '
# example_question = "question: What color is the sky? context: Sky is blue answer: blue. question: What color is his shirt? context: His shirt is yellow. answer: yellow. question: what color is his bike? context: His bike is red."
# example_tar     = 'answer: '


In [65]:
# Creating Inputs

eval_inp     = example_question
eval_tar_inp = 'answer: '

eval_inp     = tokenizer.tokenize(eval_inp)
eval_tar_inp = tokenizer.tokenize(eval_tar_inp)

eval_inp     = torch.tensor(eval_inp, dtype=torch.long, device=device_)
eval_tar_inp = torch.tensor(eval_tar_inp, dtype=torch.long, device=device_)

eval_inp     = eval_inp.unsqueeze(0)
eval_tar_inp = eval_tar_inp.unsqueeze(0)

transformer.to(device_)
print('ok')

ok


In [66]:
# Getting Predictions and Printing Them Out | On Training Data | See Deployment Snapshots for results on unseen data
transformer.eval()
with torch.no_grad():
    eval_preds, _   = transformer(eval_inp, eval_tar_inp)
    _, topk_indices = torch.topk(eval_preds[:, -1, :], k=10, dim=-1)
    topk_indices    = topk_indices.int().tolist()
    decoded_answers = [tokenizer.detokenize([idx]) for idx in topk_indices[0]]

    print(f"\n[Eval Example]:\n{example_question}")
    print(f"Top 10 Predictions: {decoded_answers}\n")



[Eval Example]:
question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles Crazy in Love and Baby Boy
Top 10 Predictions: ['July', '2014', 'November', 'late', '1977', 'June', '2003', 'September', '1990', '2013']



**The correct answer given the context is "late 1990s". The cell above only ranks candidates for the next token after the `answer:` prefix, and both `late` and `1990` appear in the model's top-10 predictions, which indicates that the model is learning and that our setup works as intended.**
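
The single-step top-k prediction can be extended into a full answer by repeatedly appending the most likely token and calling the model again. The sketch below shows one possible greedy loop. It is not the notebook's `answer_question` implementation: `ToySeq2Seq` is a hypothetical stand-in that mimics the `(inp, tar_inp) -> (log_probs, _)` call signature used above, and `eos_id=1` is an assumption based on the `1` that ends each printed target before the padding.

```python
import torch
from torch import nn


class ToySeq2Seq(nn.Module):
    """Hypothetical stand-in for `transformer`: always predicts (last decoder token - 1)."""

    def __init__(self, vocab_size=8):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, inp, tar_inp):
        batch, seq_len = tar_inp.shape
        logits   = torch.full((batch, seq_len, self.vocab_size), -1e9)
        next_ids = (tar_inp - 1).clamp(min=0)
        logits.scatter_(2, next_ids.unsqueeze(-1), 0.0)   # put all mass on the scripted token
        return torch.log_softmax(logits, dim=-1), None


def greedy_answer(model, inp, tar_inp, eos_id=1, max_new_tokens=20):
    """Greedy decoding for a single example: append the argmax token until EOS."""
    model.eval()
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            log_probs, _ = model(inp, tar_inp)
            next_id = log_probs[:, -1, :].argmax(dim=-1)   # shape (1,)
            if next_id.item() == eos_id:
                break
            generated.append(next_id.item())
            tar_inp = torch.cat([tar_inp, next_id.unsqueeze(0)], dim=1)
    return generated


inp        = torch.tensor([[5, 6, 7]])   # toy "question" ids (ignored by the toy model)
tar_inp    = torch.tensor([[4]])         # toy "answer:" prefix
answer_ids = greedy_answer(ToySeq2Seq(), inp, tar_inp)
print(answer_ids)
```

With the real model, the returned ids would be passed to `tokenizer.detokenize` to obtain the answer text.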


In [75]:
# torch.save({'model_state_dict': transformer.state_dict()}, 'best_qA_model.pt')


# Conclusion

In this notebook, we pretrained our model for approximately 100 epochs. To evaluate the effectiveness of this pretraining, we conducted two experiments:

1. **Fine-tuning a pretrained model**: The pretrained model started yielding promising results within just 3–4 epochs of fine-tuning.
2. **Training an uninitialized (randomly initialized) model**: In contrast, the untrained model required around 60 epochs to achieve similar performance.

This suggests that pretraining was effective and substantially accelerated convergence during fine-tuning.

During inference, we passed questions along with contextual information to the model. We observed that the pretrained and fine-tuned model was often able to return the correct answer — or something close — within its top-10 (Top-K) predictions. For example, when asked:

> *"What color is the sky?"*  
> *Context: "The sky is blue."*

At later stages of pretraining, the model frequently returned tokens like `"color"`, `"black"`, `"blue"`, and `"red"` — showing that it was learning meaningful representations. In contrast, earlier in training, its predictions were much less relevant.

Interestingly, right after pretraining, the model began producing good results within 4–5 epochs of fine-tuning, but then its performance declined — the correct answers were no longer within the top-10 predictions. It wasn’t until around epoch 80–90 that it began recovering and producing even better results than before. This behavior suggests potential **catastrophic forgetting** during full fine-tuning.

---

### What’s Next

To mitigate this issue in future experiments, we plan to explore **parameter-efficient fine-tuning** methods such as **LoRA** or **QLoRA**. These approaches could help preserve the model's pretrained knowledge while still adapting it effectively to downstream tasks.
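
To make the idea concrete, here is a minimal sketch of a LoRA-style wrapper around a linear layer. It is an illustration under stated assumptions rather than the method this notebook uses: the class name `LoRALinear`, the `rank` and `alpha` values, and the layer sizes are all chosen for the example. The pretrained weight is frozen and only two small low-rank matrices are trained.

```python
import math

import torch
from torch import nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained weights fixed

        self.lora_A = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x):
        update = (x @ self.lora_A.T) @ self.lora_B.T     # low-rank path
        return self.base(x) + update * self.scaling


torch.manual_seed(0)
layer = LoRALinear(nn.Linear(16, 32))
x     = torch.randn(2, 16)
out   = layer(x)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen    = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen)
```

Because `lora_B` starts at zero, the wrapped layer initially reproduces the pretrained layer exactly, which is one reason such methods may be less prone to the forgetting behavior described above.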
