# Creating an Encoder-Decoder Transformer (T5) from scratch + Pretraining + Full fine tuning - pytorch implementation

### Links for the datasets used in this notebook: [1](https://github.com/Ariyanbgd/T5_Q-A) [2](https://huggingface.co/datasets/rajpurkar/squad_v2/tree/main)


#### The GitHub repository for the full [Natural Language Processing Specialisation](https://github.com/AnsImran/DeepLearning.AI-Natural-Language-Processing-Specialization/tree/main/4-Natural%20Language%20Processing%20with%20Attention%20Models/Week%203).







# Assignment 3: Question Answering

Welcome to the third assignment of course 4. In this assignment you will explore question answering. You will implement the "Text-to-Text Transfer Transformer" (better known as T5). Since you implemented transformers from scratch last week, you will now be able to use them. 

<img src = "images/qa.png"> 


## Table of Contents

- [Overview](#0-1)
- [Importing the Packages](#0-2)
- [1 - Prepare the data for pretraining T5](#1)
    - [1.1 - Pre-Training Objective](#1-1)
    - [1.2 - C4 Dataset](#1-2)
    - [1.3 - Process C4](#1-3)
    - [1.4 - Decode to Natural Language](#1-4)
    - [1.5 - Tokenizing and Masking](#1-5)
        - [Exercise 1 - tokenize_and_mask](#ex-1)
    - [1.6 - Creating the Pairs](#1-6)
- [2 - Pretrain a T5 model using C4](#2)
    - [2.1 - Instantiate a new transformer model](#2-1)
    - [2.2 - C4 pretraining](#2-2)
- [3 - Fine tune the T5 model for Question Answering](#3)
    - [3.1 - Creating a list of paired question and answers](#3-1)
        - [Exercise 2 - Parse the SQuAD 2.0 dataset](#ex-2)
    - [3.2 - Fine tune the T5 model](#3-2)    
    - [3.3 - Implement your Question Answering model](#3-3)
        - [Exercise 3 - Implement the question answering function](#ex-3)    

<a name='0-1'></a>
## Overview

This assignment will be different from the two previous ones. Due to the memory constraints of this environment and for the sake of time, your model will be trained on small datasets, so you won't get models that you could use in production, but you will gain the necessary knowledge about how generative language models are trained and used. Also, you won't spend too much time on the architecture of the models; instead you will take a model that is pre-trained on a larger dataset and fine tune it to get better results.

After completing this lab you will:
* Understand how the C4 dataset is structured.
* Pretrain a transformer model using a masked language modeling objective.
* Understand how the "Text-to-Text Transfer Transformer" (T5) model works.
* Fine tune the T5 model for question answering.

<a name='0-2'></a>
## Importing the Packages

Let's start by importing all the required libraries. 

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.functional import log_softmax
import torchtext
torchtext.disable_torchtext_deprecation_warning()

import os
# Set before importing TensorFlow so that the log level takes effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
import tensorflow_text as tf_text
import string
import itertools
import transformer_utils 

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import time
import utils

import textwrap
wrapper = textwrap.TextWrapper(width=70)

#import datasets
import json

from termcolor import colored

device = 'cuda' if torch.cuda.is_available() else 'cpu'


In [3]:
torch.cuda.is_available()


True


<a name='1'></a>
## 1 -  Prepare the data for pretraining T5 

<a name='1-1'></a>
### 1.1 - Pre-Training Objective

In the initial phase of training a T5 model for a Question Answering task, the pre-training process involves leveraging a masked language model (MLM) on a very large dataset, such as the C4 dataset. The objective is to allow the model to learn contextualized representations of words and phrases, fostering a deeper understanding of language semantics. To initiate pre-training, it is essential to employ the Transformer architecture, which forms the backbone of T5. The Transformer's self-attention mechanism enables the model to weigh different parts of the input sequence dynamically, capturing long-range dependencies effectively.

Before delving into pre-training, thorough data preprocessing is crucial. The C4 dataset, a diverse and extensive collection of web pages, provides a rich source for language understanding tasks. The dataset needs to be tokenized into smaller units, such as subwords or words, to facilitate model input. Additionally, the text is often segmented into fixed-length sequences or batches, optimizing computational efficiency during training.

For the masked language modeling objective, a percentage of the tokenized input is randomly masked, and the model is trained to predict the original content of these masked tokens. This process encourages the T5 model to grasp contextual relationships between words and phrases, enhancing its ability to generate coherent and contextually appropriate responses during downstream tasks like question answering.

In summary, the pre-training of the T5 model involves utilizing the Transformer architecture on a sizable dataset like C4, coupled with meticulous data preprocessing to convert raw text into a format suitable for training. The incorporation of a masked language modeling objective ensures that the model learns robust contextual representations, laying a solid foundation for subsequent fine-tuning on specific tasks such as question answering.

**Note:** The word "mask" will be used throughout this assignment in the context of hiding/removing word(s).

You will be implementing the Masked language model (MLM) as shown in the following image. 

<img src = "images/loss.png" width="600" height = "400">

Assume you have the following text: <span style = "color:blue"> **Thank you <span style = "color:red">for inviting </span> me to your party <span style = "color:red">last</span>  week** </span> 


Now as input you will mask the words in red in the text: 

<span style = "color:blue"> **Input:**</span> Thank you  **X** me to your party **Y** week.

<span style = "color:blue">**Output:**</span> The model should predict the word(s) for **X** and **Y**. 

**[EOS]** will be used to mark the end of the target sequence.
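
The example above can be reproduced with a few lines of plain Python. This is only a word-level sketch: the helper name `mask_spans` and the hand-picked span positions are assumptions made for illustration, whereas the notebook itself masks SentencePiece tokens at random.

```python
def mask_spans(words, spans, sentinels=("X", "Y", "Z")):
    """Replace each (start, end) word span with a single sentinel.

    Hypothetical helper for illustration; returns (input_words, target_words).
    """
    inputs, targets = [], []
    pos = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(words[pos:start])   # keep the unmasked words
        inputs.append(sentinel)           # one sentinel stands for the whole span
        targets.append(sentinel)
        targets.extend(words[start:end])  # the masked words go to the target
        pos = end
    inputs.extend(words[pos:])
    targets.append("[EOS]")               # marks the end of the target sequence
    return inputs, targets


words = "Thank you for inviting me to your party last week".split()
inp, tgt = mask_spans(words, spans=[(2, 4), (8, 9)])
print(" ".join(inp))
print(" ".join(tgt))
```

The target lists each sentinel followed by the words it hides, so the model only has to generate the missing pieces rather than the whole sentence.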

<a name='1-2'></a>
### 1.2 - C4 Dataset

The [C4 dataset](https://www.tensorflow.org/datasets/catalog/c4), short for Colossal Clean Crawled Corpus, is a large-scale dataset of cleaned web text derived from the pages collected by the [Common Crawl organization](https://commoncrawl.org/). It is commonly used for various natural language processing tasks and machine learning research. Each sample in the C4 dataset follows a consistent format, making it suitable for pretraining models like T5. Here's a short description of the C4 dataset:

- Format: Each sample in the C4 dataset is represented as a JSON object, containing several key-value pairs.

- Content: The 'text' field in each sample contains the actual text content extracted from web pages. This text often includes a wide range of topics and writing styles, making it diverse and suitable for training language models.

- Metadata: The dataset includes metadata such as 'content-length,' 'content-type,' 'timestamp,' and 'url,' providing additional information about each web page. 'Content-length' specifies the length of the content, 'content-type' describes the type of content (e.g., 'text/plain'), 'timestamp' indicates when the web page was crawled, and 'url' provides the source URL of the web page.

- Applications: The C4 dataset is commonly used for pretraining large-scale language models, such as T5. Models pretrained on it can then be fine-tuned for tasks like text classification, named entity recognition, question answering, and more.

- Size: The C4 dataset contains more than 800 GiB of text data, making it suitable for training models with billions of parameters.

Run the cell below to see what the C4 dataset looks like. 

In [7]:
# Load example jsons
with open('data/c4-en-10k.jsonl', 'r') as file:
    example_jsons = [json.loads(line.strip()) for line in file]

# Printing the examples to see how the data looks like
for i in range(5):
    print(f'example number {i+1}: \n\n{example_jsons[i]} \n')


example number 1: 

{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'} 

example number 2: 

{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve got a 500gb internal drive and a 240gb SSD.\nWhen trying to restore using disk utility i\'m given the err

In [6]:
# I guess this .jsonl file ('l' for lines) only holds the 'text' entries, one per line;
# the metadata etc. is not present in each individual dict
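
A quick way to check which fields a JSON Lines file actually contains is to collect the keys of every record. The sketch below uses two made-up records held in an in-memory `io.StringIO` object instead of `data/c4-en-10k.jsonl`, so the contents are assumptions for illustration only.

```python
import io
import json

# Two made-up records standing in for the lines of a .jsonl file
fake_file = io.StringIO(
    '{"text": "First document."}\n'
    '{"text": "Second document.\\nWith two lines."}\n'
)

# One JSON object per line; skip blank lines defensively
records = [json.loads(line) for line in fake_file if line.strip()]

# Union of the keys found across all records
keys = set().union(*(record.keys() for record in records))
print(len(records), sorted(keys))
```

Note that a newline inside a document is stored as the escape sequence `\n` within the JSON string, which is why each record can still occupy exactly one line of the file.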


<a name='1-3'></a>
### 1.3 - Process C4

For the purpose of pretraining the T5 model, you will only use the `text` field of each entry. The following code extracts that field from every entry in the dataset. This is the data that you will use to create the `inputs` and `targets` of your language model.


In [7]:
# Grab text field from dictionary
natural_language_texts = [example_json['text'] for example_json in example_jsons]

# Print the first text example
print(natural_language_texts[0])


Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.
The cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.


<a name='1-4'></a>
### 1.4 - Decode to Natural Language

The [SentencePieceTokenizer](https://www.tensorflow.org/text/api_docs/python/text/SentencepieceTokenizer), used in the code snippet, tokenizes text into subword units, enhancing handling of complex word structures, out-of-vocabulary words, and multilingual support. It simplifies preprocessing, ensures consistent tokenization, and seamlessly integrates with machine learning frameworks.

In this task, a SentencePiece model is loaded from a file, which is used to tokenize text into subwords represented by integer IDs.

In [8]:
# Special tokens
# PAD, EOS = 0, 1

with open("./models/sentencepiece.model", "rb") as f:
    pre_trained_tokenizer = f.read()
    
tokenizer = tf_text.SentencepieceTokenizer(pre_trained_tokenizer, out_type=tf.int32)


# # Another possibility

# import sentencepiece as spm

# # Load the SentencePiece model
# sp = spm.SentencePieceProcessor()
# sp.load('./models/sentencepiece.model')

# # Tokenize text using the loaded model
# text = "This is a sample text."
# tokenized_text = sp.encode(text, out_type=int)  # Use out_type=int to get int token ids | could also try converting to torch.tensor; plain integers are fine too

# print(tokenized_text)


In this tokenizer the string `</s>` is used as the `EOS` token. By default, the tokenizer does not add `EOS` to the end of each sentence, so you need to add it manually when required. Let's verify which id corresponds to this token:

In [9]:
eos = tokenizer.string_to_id("</s>").numpy()

print("EOS: " + str(eos))


EOS: 1


In [10]:
# printing the encoding of each word to see how subwords are tokenized
tokenized_text = [(list(tokenizer.tokenize(word).numpy()), word) for word in natural_language_texts[2].split()]

print("Word\t\t-->\tTokenization\n")
for element in tokenized_text:
    print(f"{element[1]}\t-->\t{element[0]}")
    

Word		-->	Tokenization

Foil	-->	[4452, 173]
plaid	-->	[30772]
lycra	-->	[3, 120, 2935]
and	-->	[11]
spandex	-->	[8438, 26, 994]
shortall	-->	[710, 1748]
with	-->	[28]
metallic	-->	[18813]
slinky	-->	[3, 7, 4907, 63]
insets.	-->	[16, 2244, 7, 5]
Attached	-->	[28416, 15, 26]
metallic	-->	[18813]
elastic	-->	[15855]
belt	-->	[6782]
with	-->	[28]
O-ring.	-->	[411, 18, 1007, 5]
Headband	-->	[3642, 3348]
included.	-->	[1285, 5]
Great	-->	[1651]
hip	-->	[5436]
hop	-->	[13652]
or	-->	[42]
jazz	-->	[9948]
dance	-->	[2595]
costume.	-->	[11594, 5]
Made	-->	[6465]
in	-->	[16]
the	-->	[8]
USA.	-->	[2312, 5]


In [11]:
###################################################################################################################

In [12]:
len(natural_language_texts)


10000

In [13]:
print(natural_language_texts[2].split())


['Foil', 'plaid', 'lycra', 'and', 'spandex', 'shortall', 'with', 'metallic', 'slinky', 'insets.', 'Attached', 'metallic', 'elastic', 'belt', 'with', 'O-ring.', 'Headband', 'included.', 'Great', 'hip', 'hop', 'or', 'jazz', 'dance', 'costume.', 'Made', 'in', 'the', 'USA.']


In [14]:
tokenizer.tokenize('plaid')


<tf.Tensor: shape=(1,), dtype=int32, numpy=array([30772])>

In [15]:
tokenizer.tokenize('plaid').numpy()


array([30772])

In [16]:
tokenizer.tokenize('lycra')


<tf.Tensor: shape=(3,), dtype=int32, numpy=array([   3,  120, 2935])>

In [17]:
tokenizer.tokenize('lycra').numpy()


array([   3,  120, 2935])

In [18]:
tokenizer.detokenize([120, 2935]), tokenizer.detokenize([3])


(<tf.Tensor: shape=(), dtype=string, numpy=b'lycra'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b''>)

In [19]:
###################################################################################################################

And as usual, the library provides a function to turn numeric tokens into human-readable text. See how it works. 

In [20]:
# We can see that detokenize successfully undoes the tokenization
print(f"tokenized: {tokenizer.tokenize('Beginners')}\ndetokenized: {tokenizer.detokenize(tokenizer.tokenize('Beginners'))}")


tokenized: [12847   277]
detokenized: b'Beginners'


As you can see above, you were able to take a string and tokenize it. 

Now you will create `input` and `target` pairs that will allow you to train your model. T5 uses the ids at the end of the vocabulary as sentinels. For readability, the helper below labels them with letters: 
   - id `vocab_size - 1` is displayed as `<Z>`
   - id `vocab_size - 2` is displayed as `<Y>`
   - and so forth. 
   
Each sentinel id is assigned one character from `string.ascii_letters`, taken in reverse order.

The `pretty_decode` function below, which you will use in a bit, accepts either token ids or an already-decoded string tensor and replaces the sentinel tokens with their labels. Take a look and try to understand what the function is doing.


Notice that:
```python
string.ascii_letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
```

**NOTE:** Targets may have more than the 52 sentinels we replace, but this is just to give you an idea of things.
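
The id-to-label mapping can be sketched without the tokenizer. The snippet assumes a vocabulary size of 32000, which is the value this notebook's tokenizer reports; the dictionary name `id_to_sentinel` is made up for the example.

```python
import string

vocab_size = 32000  # assumed: the value reported by tokenizer.vocab_size() in this notebook

# Map the last 52 ids to letters: vocab_size - 1 -> <Z>, vocab_size - 2 -> <Y>, ...
id_to_sentinel = {
    vocab_size - i: f"<{char}>"
    for i, char in enumerate(reversed(string.ascii_letters), 1)
}

print(id_to_sentinel[31999], id_to_sentinel[31998], id_to_sentinel[31948])
```

Because `string.ascii_letters` lists the lowercase letters first, reversing it yields the uppercase labels `<Z>` to `<A>` before the lowercase ones, and the 52nd sentinel from the end is labelled `<a>`.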

In [21]:
def get_sentinels(tokenizer, display=False):
    sentinels = {}
    vocab_size = tokenizer.vocab_size(name=None)
    for i, char in enumerate(reversed(string.ascii_letters), 1):
        decoded_text = tokenizer.detokenize([vocab_size - i]).numpy().decode("utf-8")
        
        # Sentinels, ex: <Z> - <a>
        sentinels[decoded_text] = f'<{char}>'    
    
        if display:
            print(f'The sentinel is <{char}> and the decoded token is:', decoded_text)

    return sentinels



In [22]:
#############################################################################################################################

In [23]:
string.ascii_letters


'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [24]:
reversed('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
# Return a reverse iterator over the values of the given sequence.


<reversed at 0x21cbe6cee90>

In [25]:
for i, char in enumerate(reversed('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'), 1):
    print('i: ',i, '\tchar: ', char)


i:  1 	char:  Z
i:  2 	char:  Y
i:  3 	char:  X
i:  4 	char:  W
i:  5 	char:  V
i:  6 	char:  U
i:  7 	char:  T
i:  8 	char:  S
i:  9 	char:  R
i:  10 	char:  Q
i:  11 	char:  P
i:  12 	char:  O
i:  13 	char:  N
i:  14 	char:  M
i:  15 	char:  L
i:  16 	char:  K
i:  17 	char:  J
i:  18 	char:  I
i:  19 	char:  H
i:  20 	char:  G
i:  21 	char:  F
i:  22 	char:  E
i:  23 	char:  D
i:  24 	char:  C
i:  25 	char:  B
i:  26 	char:  A
i:  27 	char:  z
i:  28 	char:  y
i:  29 	char:  x
i:  30 	char:  w
i:  31 	char:  v
i:  32 	char:  u
i:  33 	char:  t
i:  34 	char:  s
i:  35 	char:  r
i:  36 	char:  q
i:  37 	char:  p
i:  38 	char:  o
i:  39 	char:  n
i:  40 	char:  m
i:  41 	char:  l
i:  42 	char:  k
i:  43 	char:  j
i:  44 	char:  i
i:  45 	char:  h
i:  46 	char:  g
i:  47 	char:  f
i:  48 	char:  e
i:  49 	char:  d
i:  50 	char:  c
i:  51 	char:  b
i:  52 	char:  a


In [26]:
tokenizer.vocab_size

<bound method SentencepieceTokenizer.vocab_size of <tensorflow_text.python.ops.sentencepiece_tokenizer.SentencepieceTokenizer object at 0x0000021CB5D8AB90>>

In [27]:
tokenizer.vocab_size(name=None)

<tf.Tensor: shape=(), dtype=int32, numpy=32000>

In [28]:
tokenizer.vocab_size()

<tf.Tensor: shape=(), dtype=int32, numpy=32000>

In [29]:
tokenizer.detokenize([32000 - 1]).numpy().decode("utf-8")


'Internațional'

In [30]:
#############################################################################################################################

In [31]:
sentinels = get_sentinels(tokenizer, display=True)


The sentinel is <Z> and the decoded token is: Internațional
The sentinel is <Y> and the decoded token is: erwachsene
The sentinel is <X> and the decoded token is: Cushion
The sentinel is <W> and the decoded token is: imunitar
The sentinel is <V> and the decoded token is: Intellectual
The sentinel is <U> and the decoded token is: traditi
The sentinel is <T> and the decoded token is: disguise
The sentinel is <S> and the decoded token is: exerce
The sentinel is <R> and the decoded token is: nourishe
The sentinel is <Q> and the decoded token is: predominant
The sentinel is <P> and the decoded token is: amitié
The sentinel is <O> and the decoded token is: erkennt
The sentinel is <N> and the decoded token is: dimension
The sentinel is <M> and the decoded token is: inférieur
The sentinel is <L> and the decoded token is: refugi
The sentinel is <K> and the decoded token is: cheddar
The sentinel is <J> and the decoded token is: unterlieg
The sentinel is <I> and the decoded token is: garanteaz
Th

In [32]:
def pretty_decode(encoded_str_list, sentinels, tokenizer):
    # If already a string, just do the replacements.
    if tf.is_tensor(encoded_str_list) and encoded_str_list.dtype == tf.string:
        for token, char in sentinels.items():
            encoded_str_list = tf.strings.regex_replace(encoded_str_list, token, char)
        return encoded_str_list
  
    # We need to decode and then prettify it.
    return pretty_decode(tokenizer.detokenize(encoded_str_list), sentinels, tokenizer)
    

Now, let's use the `pretty_decode` function on the following sentence. Note that every word whose text matches a decoded sentinel token will be replaced by the corresponding sentinel label, even when it is an ordinary word in the sentence. This is a drawback of the method, but don't worry about it now.

In [33]:
pretty_decode(tf.constant("I want to dress up as an Intellectual this halloween."), sentinels, tokenizer)


<tf.Tensor: shape=(), dtype=string, numpy=b'I want to dress up as an <V> this <b>.'>
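
The replacement behaviour seen in this output can be mimicked with the standard `re` module. The sketch below is not the TensorFlow implementation: it uses a two-entry table whose pairs are inferred from the notebook's outputs, and the names `sentinels_demo` and `replace_sentinel_words` are hypothetical.

```python
import re

# Two entries inferred from the notebook's outputs; the real table has 52
sentinels_demo = {"Intellectual": "<V>", "halloween": "<b>"}


def replace_sentinel_words(text, table):
    """Replace every occurrence of a sentinel's decoded token with its label."""
    for token, label in table.items():
        # re.escape guards against tokens that contain regex metacharacters
        text = re.sub(re.escape(token), label, text)
    return text


print(replace_sentinel_words(
    "I want to dress up as an Intellectual this halloween.", sentinels_demo))
```

Since the replacement is purely textual, an ordinary word in the sentence is indistinguishable from a real sentinel, which is the drawback mentioned above.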

In [34]:
##################################################################################################################

In [35]:
sentinels

{'Internațional': '<Z>',
 'erwachsene': '<Y>',
 'Cushion': '<X>',
 'imunitar': '<W>',
 'Intellectual': '<V>',
 'traditi': '<U>',
 'disguise': '<T>',
 'exerce': '<S>',
 'nourishe': '<R>',
 'predominant': '<Q>',
 'amitié': '<P>',
 'erkennt': '<O>',
 'dimension': '<N>',
 'inférieur': '<M>',
 'refugi': '<L>',
 'cheddar': '<K>',
 'unterlieg': '<J>',
 'garanteaz': '<I>',
 'făcute': '<H>',
 'réglage': '<G>',
 'pedepse': '<F>',
 'Germain': '<E>',
 'distinctly': '<D>',
 'Schraub': '<C>',
 'emanat': '<B>',
 'trimestre': '<A>',
 'disrespect': '<z>',
 'Erasmus': '<y>',
 'Australia': '<x>',
 'permeabil': '<w>',
 'deseori': '<v>',
 'manipulated': '<u>',
 'suggér': '<t>',
 'corespund': '<s>',
 'nitro': '<r>',
 'oyons': '<q>',
 'Account': '<p>',
 'échéan': '<o>',
 'laundering': '<n>',
 'genealogy': '<m>',
 'QuickBooks': '<l>',
 'constituted': '<k>',
 'Fertigung': '<j>',
 'goutte': '<i>',
 'regulă': '<h>',
 'overwhelmingly': '<g>',
 'émerg': '<f>',
 'broyeur': '<e>',
 'povești': '<d>',
 'emulator': '<c

In [36]:
a = tf.constant("I want to dress up as an Intellectual this halloween.")
a

<tf.Tensor: shape=(), dtype=string, numpy=b'I want to dress up as an Intellectual this halloween.'>

In [37]:
tf.strings.regex_replace(a, 'Internațional', '<Z>')


<tf.Tensor: shape=(), dtype=string, numpy=b'I want to dress up as an Intellectual this halloween.'>

In [38]:
##################################################################################################################

The functions above make your `inputs` and `targets` more readable. For example, you might see something like this once you implement the masking function below. 

- <span style="color:red"> Input sentence: </span> Younes and Lukasz were working together in the lab yesterday after lunch. 
- <span style="color:red">Input: </span> Younes and Lukasz  **Z** together in the **Y** yesterday after lunch.
- <span style="color:red">Target: </span> **Z** were working **Y** lab.


<a name='1-5'></a>
### 1.5 - Tokenizing and Masking

In this task, you will implement the `tokenize_and_mask` function, which tokenizes the input text and masks tokens with a given probability. The probability is controlled by the `noise` parameter, typically set to mask around `15%` of the tokens in the input text. The function will generate two lists of token ids, the inputs and the targets, following the algorithm outlined below:


<a name='ex-1'></a>
### Exercise 1 - tokenize_and_mask

- Start with two empty lists: `inps` and `targs`
- Tokenize the input text using the given tokenizer.
- For each `token` in the tokenized sequence:
  - Generate a random number (simulating a weighted coin toss)
  - If the random value is greater than or equal to the given threshold (`noise`):
    - Add the current token to the `inps` list
  - Else:
    - If a new sentinel must be included (see note **):
      - Compute the next sentinel ID using a progression.
      - Add a sentinel into the `inps` and `targs` to mark the position of the masked element.
    - Add the current token to the `targs` list.

** There's a special case to consider. If two consecutive tokens get masked during the process, you don't need to add a new sentinel to the sequences. To account for this, use the `prev_no_mask` flag, which starts as `True` but is turned to `False` each time you mask a new element. The code that adds sentinels will only be executed if, before masking the token, the flag was in the `True` state.
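
Before working with the real tokenizer, it may help to trace the algorithm on a toy sequence. The sketch below operates on plain integer ids; the function name `mask_ids`, the toy vocabulary size of 100, and the fixed list of "random" draws are all assumptions chosen so the result can be followed by hand.

```python
def mask_ids(token_ids, noise, randomizer, vocab_size, eos_id):
    """Toy version of the masking loop, working on plain integer ids."""
    inps, targs = [], []
    cur_sentinel_num = 0
    prev_no_mask = True  # True while the previous token was NOT masked
    for token in token_ids:
        if randomizer() < noise:
            # Start of a new masked span: emit one sentinel on both sides
            if prev_no_mask:
                cur_sentinel_num += 1
                sentinel_id = vocab_size - cur_sentinel_num
                inps.append(sentinel_id)
                targs.append(sentinel_id)
            targs.append(token)
            prev_no_mask = False
        else:
            inps.append(token)
            prev_no_mask = True
    targs.append(eos_id)  # targets always end with EOS
    return inps, targs


# Fixed draws instead of real randomness, so the trace is reproducible
draws = iter([0.9, 0.1, 0.05, 0.8, 0.5, 0.0])
inps, targs = mask_ids([10, 11, 12, 13, 14, 15], noise=0.15,
                       randomizer=lambda: next(draws),
                       vocab_size=100, eos_id=1)
print(inps)
print(targs)
```

Tokens 11 and 12 are masked consecutively, so they share the single sentinel 99, while token 15 starts a new span and gets sentinel 98.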


In [39]:
# GRADED FUNCTION: tokenize_and_mask
def tokenize_and_mask(text, 
                      noise =0.15, 
                      randomizer=np.random.uniform, 
                      tokenizer=None):
    """Tokenizes and masks a given input.

    Args:
        text (str or bytes): Text input.
        noise (float, optional): Probability of masking a token. Defaults to 0.15.
        randomizer (function, optional): Function that generates random values. Defaults to np.random.uniform.
        tokenizer (tf_text.SentencepieceTokenizer): Tokenizer providing tokenize, vocab_size and string_to_id. Must be passed explicitly; the None default does not work.

    Returns:
        inps, targs: Lists of integers associated to inputs and targets.
    """
    
    # Current sentinel number (starts at 0)
    cur_sentinel_num = 0
    
    # Inputs and targets
    inps, targs = [], []

    # Vocab_size
    vocab_size = int(tokenizer.vocab_size())
    
    # EOS token id 
    # Must be at the end of each target!
    eos = tokenizer.string_to_id("</s>").numpy()
    
    ### START CODE HERE ###
    
    # prev_no_mask is True if the previous token was NOT masked, False otherwise
    # set prev_no_mask to True
    prev_no_mask = True
    
    # Loop over the tokenized text
    for token in tokenizer.tokenize(text).numpy():
        
        # Generate a random value between 0 and 1
        rnd_val = randomizer() 
        
        # Check if the noise is greater than a random value (weighted coin flip)
        if rnd_val < noise:
            
            # Check if previous token was NOT masked
            if prev_no_mask:
                
                # Current sentinel increases by 1
                cur_sentinel_num += 1
                
                # Compute end_id by subtracting current sentinel value out of the total vocabulary size
                end_id = vocab_size - cur_sentinel_num
                
                # Append end_id at the end of the targets
                targs.append(end_id)
                
                # Append end_id at the end of the inputs
                inps.append(end_id)
                
            # Append token at the end of the targets
            targs.append(token)
            
            # set prev_no_mask accordingly
            prev_no_mask = False

        else:
            
            # Append token at the end of the inputs
            inps.append(token)
            
            # Set prev_no_mask accordingly
            prev_no_mask = True
    
    
    # Add EOS token to the end of the targets
    targs.append(eos)
    
    ### END CODE HERE ###
    
    return inps, targs

In [40]:
################################################################################################################

In [41]:
if True:
    print('1')
    if False:
        print('2')
else:
    print('3')
    

1


In [42]:
################################################################################################################

In [43]:
# Some logic to mock a np.random value generator
# Needs to be in the same cell for it to always generate same output
def testing_rnd():
    def dummy_generator():
        vals = np.linspace(0, 1, 10)
        cyclic_vals = itertools.cycle(vals)
        for _ in range(100):
            yield next(cyclic_vals)

    dumr = itertools.cycle(dummy_generator())

    def dummy_randomizer():
        return next(dumr)
    
    return dummy_randomizer

input_str = 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers.'

inps, targs = tokenize_and_mask(input_str, randomizer=testing_rnd(), tokenizer=tokenizer)
print(f"tokenized inputs - shape={len(inps)}:\n\n{inps}\n\ntargets - shape={len(targs)}:\n\n{targs}")

tokenized inputs - shape=53:

[31999, 15068, 4501, 3, 12297, 3399, 16, 5964, 7115, 31998, 531, 25, 241, 12, 129, 394, 44, 492, 31997, 58, 148, 56, 43, 8, 1004, 6, 474, 31996, 39, 4793, 230, 5, 2721, 6, 1600, 1630, 31995, 1150, 4501, 15068, 16127, 6, 9137, 2659, 5595, 31994, 782, 3624, 14627, 15, 12612, 277, 5]

targets - shape=19:

[31999, 12847, 277, 31998, 9, 55, 31997, 3326, 15068, 31996, 48, 30, 31995, 727, 1715, 31994, 45, 301, 1]
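
To see which positions the mocked randomizer masks, here is a small sketch. It assumes, as in `dummy_generator` above, that the draws cycle through ten evenly spaced values from 0 to 1 (written here as `i / 9` rather than calling `np.linspace`) and that `noise` keeps its default of 0.15.

```python
import itertools

# The mock cycles through ten evenly spaced values: 0, 1/9, 2/9, ..., 1
values = [i / 9 for i in range(10)]
noise = 0.15

draws = itertools.cycle(values)
# A position is masked when its draw falls below the noise threshold
masked_positions = [pos for pos in range(20) if next(draws) < noise]
print(masked_positions)
```

Only the first two draws of every ten fall below the threshold. This is consistent with the output above: each sentinel in the targets is followed by two masked tokens, and each sentinel in the inputs by up to eight unmasked ones.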


You will now use the inputs and the targets from the `tokenize_and_mask` function you implemented above. Take a look at the decoded version of your masked sentence using your `inps` and `targs` from the sentence above. 

In [44]:
print('Inputs: \n\n', pretty_decode(inps, sentinels, tokenizer).numpy())
print('\nTargets: \n\n', pretty_decode(targs, sentinels, tokenizer).numpy())


Inputs: 

 b'<Z> BBQ Class Taking Place in Missoul <Y> Do you want to get better at making <X>? You will have the opportunity, put <W> your calendar now. Thursday, September 22 <V> World Class BBQ Champion, Tony Balay <U>onestar Smoke Rangers.'

Targets: 

 b'<Z> Beginners <Y>a! <X> delicious BBQ <W> this on <V>nd join <U> from L'


<a name='1-6'></a>
### 1.6 - Creating the Pairs

You will now create pairs using your dataset. You will iterate over your data and create (inp, targ) pairs using the functions that we have given you. 

In [45]:
# Apply tokenize_and_mask
inputs_targets_pairs = [tokenize_and_mask(text.encode('utf-8', errors='ignore').decode('utf-8'), tokenizer=tokenizer) 
                        for text in natural_language_texts]
################################################################################################################################


In [46]:
################################################################################################################################

In [47]:
natural_language_texts[10]


'Pencarian FILM Untuk "Peace Breaker 2017"\nyuk mampir ke channel say..\nEdges East provides the l..\nA corrupt cop makes one w..\nPeace Breaker 2017 ~ 破�..\nNáo Loạn - Peace Break..\nPlease subscribe and hit ..\nuploaded in HD at http://..\nI cannot believe I manage..'

In [48]:
type(natural_language_texts[10])


str

In [49]:
natural_language_texts[10].encode('utf-8', errors='ignore')


b'Pencarian FILM Untuk "Peace Breaker 2017"\nyuk mampir ke channel say..\nEdges East provides the l..\nA corrupt cop makes one w..\nPeace Breaker 2017 ~ \xe7\xa0\xb4\xef\xbf\xbd..\nN\xc3\xa1o Lo\xe1\xba\xa1n - Peace Break..\nPlease subscribe and hit ..\nuploaded in HD at http://..\nI cannot believe I manage..'

In [50]:
natural_language_texts[10].encode('utf-8', errors='ignore').decode('utf-8')


'Pencarian FILM Untuk "Peace Breaker 2017"\nyuk mampir ke channel say..\nEdges East provides the l..\nA corrupt cop makes one w..\nPeace Breaker 2017 ~ 破�..\nNáo Loạn - Peace Break..\nPlease subscribe and hit ..\nuploaded in HD at http://..\nI cannot believe I manage..'
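
The round trip leaves this text unchanged because every character in it, including the replacement character `�` (U+FFFD), is valid UTF-8. The `errors='ignore'` option only matters for characters that cannot be encoded at all, such as a lone surrogate. The sketch below illustrates the difference; the sample strings and the helper name `round_trip` are made up for the example.

```python
clean = "Náo Loạn \ufffd"   # U+FFFD is an ordinary, encodable character
broken = "abc\ud800def"     # a lone surrogate cannot be encoded as UTF-8


def round_trip(text):
    """Encode to UTF-8, silently dropping unencodable characters, then decode."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")


print(round_trip(clean) == clean)
print(round_trip(broken))
```

Without `errors='ignore'`, encoding the second string would raise a `UnicodeEncodeError`, so the round trip acts as a cheap sanitizing step before the text reaches the tokenizer.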

In [51]:
################################################################################################################################

In [52]:
def display_input_target_pairs(inputs_targets_pairs, sentinels, wrapper=textwrap.TextWrapper(width=70), tokenizer=tokenizer):
    for i, inp_tgt_pair in enumerate(inputs_targets_pairs, 1):
        inps, tgts = inp_tgt_pair
        inps = str(pretty_decode(inps, sentinels, tokenizer).numpy(), encoding='utf-8')
        tgts = str(pretty_decode(tgts, sentinels, tokenizer).numpy(), encoding='utf-8')
        print(f'[{i}]\n\n'
              f'inputs:\n{wrapper.fill(text=inps)}\n\n'
              f'targets:\n{wrapper.fill(text=tgts)}\n\n\n')

# Print the first 5 samples
display_input_target_pairs(inputs_targets_pairs[0:5], sentinels, wrapper, tokenizer)


[1]

inputs:
Beginners BBQ Class Taking Place in Missoula <Z> Do you want to get
<Y> at making delicious BBQ? You <X> the opportunity, put this on <W>
calendar now. Thursday, September <V>nd join <U> Class BBQ Champion,
Tony Balay <T> Lonestar Smoke Rangers. He will<S> teaching<R>a
beginner <Q> class for everyone who wants to get better with their
culinary skills. He will teach you everything you need to know to
compete<P> a KCBS BBQ competition, including techniques, recipes <O>,
meat selection and trimming, plus<N> and fire information. The cost to
be in <M> class is $35 per person, and for spectators it is free.
Included in the cost will be <L> a t-shirt or apron and you will be
tasting samples of each <K> that is prepared.

targets:
<Z>! <Y> better <X> will have <W> your <V> 22 <U> World <T> from<S>
be<R>  <Q> level<P> in <O>, timelines<N> smoker <M> the <L> either <K>
meat



[2]

inputs:
Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012.
<Z>'ve got a 500g <Y> 

<a name='2'></a>
## 2 - Pretrain a T5 model using C4

Now you are going to use the Transformer architecture that you coded in the previous assignment, this time to answer questions rather than to summarize text. Instead of training the question answering model from scratch, you will first "pre-train" the model using the C4 dataset you just processed. This helps the model learn the general structure of language from a large dataset. Pre-training is much easier to set up because you don't need to label any data: the masking is done automatically and provides the targets. You will then use the data from the SQuAD set to teach the model to answer questions given a context. To start, let's review the Transformer's architecture.

<img src = "images/fulltransformer.png" width="300" height="600">



<a name='2-1'></a>
### 2.1 - Instantiate a new transformer model

We have packaged the code implemented in the previous week into the `Transformer.py` file. You can import it here and set it up with the same configuration used there.

In [53]:
# Define the model parameters
num_layers                 = 2
embedding_dim              = 128
fully_connected_dim        = 128
num_heads                  = 2
positional_encoding_length = 256

encoder_vocab_size = int(tokenizer.vocab_size())
decoder_vocab_size = encoder_vocab_size

# Initialize the model
transformer = transformer_utils.Transformer(
    num_layers, 
    embedding_dim, 
    num_heads, 
    fully_connected_dim,
    encoder_vocab_size, 
    decoder_vocab_size, 
    positional_encoding_length, 
    positional_encoding_length,
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'


In [54]:
transformer

Transformer(
  (encoder): Encoder(
    (embedding): Embedding(32000, 128, padding_idx=0)
    (enc_layers): ModuleList(
      (0-1): 2 x EncoderLayer(
        (attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (layernorm1): BatchNorm1d(128, eps=1e-06, momentum=0.1, affine=True, track_running_stats=True)
        (fc1): Linear(in_features=128, out_features=128, bias=True)
        (fc2): Linear(in_features=128, out_features=128, bias=True)
        (layernorm2): BatchNorm1d(128, eps=1e-06, momentum=0.1, affine=True, track_running_stats=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(32000, 128, padding_idx=0)
    (dec_layers): ModuleList(
      (0-1): 2 x DecoderLayer(
        (mha1): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, o


Now, you will define the optimizer and the loss function. For this task the model will try to predict the masked words, so, as in the previous lab, the loss function will be `torch.nn.NLLLoss(...)`.
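To make the loss concrete, here is a minimal framework-free sketch of what `nn.NLLLoss(ignore_index=0)` computes with its default mean reduction: the average negative log-probability of the target token, skipping padded positions. It assumes the model's last layer already outputs log-probabilities; the helper names `log_softmax` and `nll_loss` are illustrative and not part of the notebook.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def nll_loss(log_probs, targets, ignore_index=0):
    """Mean negative log-likelihood over targets that are not ignore_index.

    Assumes at least one target differs from ignore_index.
    """
    total, count = 0.0, 0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # padded positions contribute nothing to the loss
        total += -row[target]
        count += 1
    return total / count

# Three positions over a toy 4-token vocabulary; the last target is padding (0).
logits = [[2.0, 0.5, 0.1, 0.0], [0.0, 0.0, 3.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
log_probs = [log_softmax(row) for row in logits]
targets = [1, 2, 0]

loss = nll_loss(log_probs, targets)
print(loss)
```

Only the first two positions are averaged, which is why padding the targets with `0` does not distort the loss.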


<a name='2-2'></a>
### 2.2 - C4 pretraining

The model in this notebook is implemented in PyTorch, but the data is still padded and batched with TensorFlow utilities, so you need to arrange it into a `tf.data.Dataset`. Now, you will get the `inputs` and the `targets` for the transformer model from the `inputs_targets_pairs`. Before creating the dataset, you need to be sure that all `inputs` have the same length by truncating the longer sequences and padding the shorter ones with `0`. The same must be done for the targets. The function `tf.keras.preprocessing.sequence.pad_sequences` will help you here, as in the previous week's assignment.

You will use a `BATCH_SIZE = 64`
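As a rough illustration of what `padding='post', truncating='post'` does to a single sequence, the sketch below reproduces the behavior in plain Python. The function name `pad_or_truncate` is hypothetical; the notebook itself relies on `pad_sequences`.

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Cut a token-id list to maxlen, or right-pad it with pad_id up to maxlen."""
    ids = list(ids)[:maxlen]                      # truncating='post': drop the tail
    return ids + [pad_id] * (maxlen - len(ids))   # padding='post': pad at the end

short_seq = pad_or_truncate([5, 6, 7], maxlen=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_seq, long_seq)
```

Every row then has exactly `maxlen` entries, which is what allows the sequences to be stacked into one rectangular batch.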

In [55]:
# Limit the size of the input and output data so this can run in this environment
encoder_maxlen = 150
decoder_maxlen = 50

inputs  = tf.keras.preprocessing.sequence.pad_sequences([x[0] for x in inputs_targets_pairs], maxlen=encoder_maxlen, padding='post', truncating='post')
targets = tf.keras.preprocessing.sequence.pad_sequences([x[1] for x in inputs_targets_pairs], maxlen=decoder_maxlen, padding='post', truncating='post')

inputs  = tf.cast(inputs,  dtype=tf.int32)
targets = tf.cast(targets, dtype=tf.int32)

# Create the final training dataset.
BUFFER_SIZE = 10000
BATCH_SIZE  = 64

dataset = tf.data.Dataset.from_tensor_slices((inputs, targets)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)


In [56]:
####################################################################################################################

In [57]:
print(inputs_targets_pairs[0][0]), print(), print(inputs_targets_pairs[0][1])


[12847, 277, 15068, 4501, 3, 12297, 3399, 16, 5964, 7115, 9, 31999, 531, 25, 241, 12, 129, 31998, 44, 492, 3326, 15068, 58, 148, 31997, 8, 1004, 6, 474, 48, 30, 31996, 4793, 230, 5, 2721, 6, 1600, 31995, 727, 1715, 31994, 4501, 15068, 16127, 6, 9137, 2659, 5595, 31993, 301, 782, 3624, 14627, 15, 12612, 277, 5, 216, 56, 31992, 2119, 31991, 9, 19529, 31990, 853, 21, 921, 113, 2746, 12, 129, 394, 28, 70, 17712, 1098, 5, 216, 56, 3884, 25, 762, 25, 174, 12, 214, 12, 5978, 31989, 3, 9, 3, 23405, 4547, 15068, 2259, 6, 379, 2097, 6, 5459, 31988, 6, 3604, 1801, 11, 27856, 6, 303, 31987, 11, 1472, 251, 5, 37, 583, 12, 36, 16, 31986, 853, 19, 25264, 399, 568, 6, 11, 21, 21380, 7, 34, 19, 339, 5, 15746, 26, 16, 8, 583, 56, 36, 31985, 3, 9, 3, 17, 18, 9486, 42, 3, 9, 1409, 29, 11, 25, 56, 36, 12246, 5977, 13, 284, 31984, 24, 19, 2657, 5]

[31999, 55, 31998, 394, 31997, 56, 43, 31996, 39, 31995, 1630, 31994, 1150, 31993, 45, 31992, 36, 31991, 3, 31990, 593, 31989, 16, 31988, 6, 13618, 7, 31987, 241

(None, None, None)

In [58]:
####################################################################################################################

Now, you can run the training loop for 20 epochs. Running it with a big dataset such as C4 on a good computer with enough memory and a good GPU could take more than 24 hours. Here, you will run a few epochs using a small portion of the C4 dataset for illustration. It will only take a few minutes, but the model won't be very powerful.


In [59]:
for (batch, (inp, tar)) in enumerate(dataset):
    if batch >=2:
        break


In [60]:
len(dataset)

157

In [61]:
inp = torch.tensor(inp.numpy()).to(torch.int32).to(device)
tar = torch.tensor(tar.numpy()).to(torch.int32).to(device)
inp.shape, tar.shape


(torch.Size([64, 150]), torch.Size([64, 50]))

In [62]:
inp


tensor([[ 1210,    12,     8,  ...,     0,     0,     0],
        [ 4467,     7,  1849,  ...,   123,     6,   123],
        [31999,    12,   240,  ...,   256,    51,   202],
        ...,
        [20876,    19,     3,  ...,  2121,    11,  1769],
        [   37, 31999,  3849,  ...,   109,    18, 12931],
        [  421,  5193,  3325,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32)

In [63]:
transformer.to(device)
preds, _ = transformer(inp, tar)
print('preds: ', preds.shape)


preds:  torch.Size([64, 50, 32000])


In [64]:
criterion = nn.NLLLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.001)

# Ensure outputs is of type Float
outputs = preds.float().clone()
outputs = outputs.reshape(-1, preds.shape[2])
print('outputs: ', outputs.shape)

# Ensure targets is of type Long
targets = tar.long().clone()
targets = targets.reshape(-1)
print('targets: ', targets.shape)

optimizer.zero_grad()

# Calculate loss
loss = criterion(outputs, targets)
loss.backward()

print(loss.item())


outputs:  torch.Size([3200, 32000])
targets:  torch.Size([3200])
10.490058898925781
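A quick sanity check on this first loss value: if the model outputs log-probabilities and its initial predictions are roughly uniform over the 32,000-token vocabulary (both are assumptions here), the expected loss is about $\ln(32000)$. The sketch below computes that reference value.

```python
import math

vocab_size = 32000                    # size of the embedding table printed above
uniform_loss = math.log(vocab_size)   # loss of a model that guesses uniformly
print(round(uniform_loss, 4))         # 10.3735
```

The measured 10.49 is close to this reference, which is what one would expect from a model that has not learned anything yet.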


In [65]:
tar

tensor([[31999,  4314, 31998,  ...,     0,     0,     0],
        [31999,    28, 31998,  ..., 31978,    75,     1],
        [31999,   571, 31998,  ...,    16, 31977,   225],
        ...,
        [31999,    11, 31998,  ..., 31976,    12, 31975],
        [31999,   638, 31998,  ..., 31978,    12, 31977],
        [31999,   876,    21,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32)

In [66]:
tar[:, :-1] # tar_inp: decoder input (last position dropped)


tensor([[31999,  4314, 31998,  ...,     0,     0,     0],
        [31999,    28, 31998,  ...,    23, 31978,    75],
        [31999,   571, 31998,  ..., 31978,    16, 31977],
        ...,
        [31999,    11, 31998,  ...,  2849, 31976,    12],
        [31999,   638, 31998,  ...,     7, 31978,    12],
        [31999,   876,    21,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32)

In [67]:
tar[:, 1:] # tar_real: labels for the loss function (first position dropped)


tensor([[ 4314, 31998, 10149,  ...,     0,     0,     0],
        [   28, 31998,   543,  ..., 31978,    75,     1],
        [  571, 31998,    19,  ...,    16, 31977,   225],
        ...,
        [   11, 31998,    18,  ..., 31976,    12, 31975],
        [  638, 31998, 12185,  ..., 31978,    12, 31977],
        [  876,    21,     8,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32)
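The two slices above implement teacher forcing: the decoder reads the target without its last position and is scored against the target without its first position, so position `t` of the input is paired with token `t+1` as its label. A minimal sketch with a short, made-up id sequence (sentinel, token, sentinel, token, EOS `1`, padding `0`):

```python
# Illustrative target row; the ids are chosen for readability, not taken from a real batch.
tar = [31999, 4314, 31998, 10149, 1, 0, 0]

decoder_input = tar[:-1]   # what the decoder sees
labels = tar[1:]           # what the loss compares against

# Each pair is (token fed to the decoder, token it should predict next).
pairs = list(zip(decoder_input, labels))
for fed, expected in pairs:
    print(f'{fed:>6} -> {expected}')
```

Both slices have the same length, so the prediction tensor and the labels line up position by position.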

In [68]:
tokenizer.detokenize([31999])

<tf.Tensor: shape=(), dtype=string, numpy=b'Interna\xc8\x9bional'>

In [69]:
inp[0,:]

tensor([ 1210,    12,     8, 10043, 17368, 18360, 31999,    61,    16, 31998,
        10669,     6,    37,  1485, 31997, 18301,   257,   138,  2345,    47,
         5330,  1192,    16,   507,  4560,    38,     3,     9,  1679,    18,
        11415,  2078,     5,     3, 13283,     3,     9,  2646,   865,    16,
          957,  3076,     6,     3,     9,  1472, 10932,    24, 31996,     5,
           71,   215,   865,     6,     8,  2078, 31995, 26935,     6,    48,
           97,    91,    13,  4459,  7944,     6, 31994,    47,  2425,    30,
            8,   337,  2140,     5,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0, 

In [70]:
inp_de = inp[0,:].to('cpu').clone()
inp_de = inp_de.numpy()

inp_de = tokenizer.detokenize(inp_de)

inp_de


<tf.Tensor: shape=(), dtype=string, numpy=b'Home to the oldest congregation (17 Interna\xc8\x9bional) in erwachseneetta, The First Cushiongregational Church was originally built in 1807 as a wood-frame church. Almost a century later in 1905, a fire destroyed that imunitar. A year later, the church Intellectual rebuilt, this time out of yellow brick, traditi was dedicated on the same spot.'>

In [71]:
tar[0,:]

tensor([31999,  4314, 31998, 10149, 31997,  1193, 31996,  1809, 31995,    47,
        31994,    11,     1,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
       device='cuda:0', dtype=torch.int32)

In [72]:
tar_de = tar[0,:].to('cpu').clone()
tar_de = tar_de.numpy()

tar_de = tokenizer.detokenize(tar_de)

tar_de


<tf.Tensor: shape=(), dtype=string, numpy=b'Interna\xc8\x9bional96 erwachsene Mari Cushion Con imunitar structure Intellectual was traditi and'>

In [73]:
pretty_decode(inp_de, sentinels, tokenizer)


<tf.Tensor: shape=(), dtype=string, numpy=b'Home to the oldest congregation (17 <Z>) in <Y>etta, The First <X>gregational Church was originally built in 1807 as a wood-frame church. Almost a century later in 1905, a fire destroyed that <W>. A year later, the church <V> rebuilt, this time out of yellow brick, <U> was dedicated on the same spot.'>

In [74]:
pretty_decode(tar_de, sentinels, tokenizer)


<tf.Tensor: shape=(), dtype=string, numpy=b'<Z>96 <Y> Mari <X> Con <W> structure <V> was <U> and'>

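The outputs above show that the highest ids in the vocabulary act as sentinels: id `31999` detokenizes to an ordinary-looking word piece, and `pretty_decode` displays it as `<Z>`, `31998` as `<Y>`, and so on. The sketch below reproduces that id-to-label mapping arithmetically. The function name `sentinel_label` is hypothetical, and the continuation into lowercase letters after `<A>` is an assumption; the actual mapping is built earlier in the notebook.

```python
import string

def sentinel_label(token_id, vocab_size=32000):
    """Return the display label for a sentinel id, or None for ordinary tokens."""
    letters = string.ascii_letters[::-1]   # 'ZYX...A' followed by 'zyx...a'
    offset = vocab_size - 1 - token_id     # 0 for the last id in the vocabulary
    if not 0 <= offset < len(letters):
        return None
    return f'<{letters[offset]}>'

for token_id in (31999, 31998, 31977, 4314):
    print(token_id, sentinel_label(token_id))
```

For example, id `31977` maps to `<D>`, which matches the last sentinel in the decoded target shown further down for `tar[2,:]`.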
In [75]:
########################################################################################################################

In [101]:
transformer.to(device)

# test_example  = 0
# true_summary  = summary_test[test_example]
# true_document = document_test[test_example]


# Training Hyperparameters
num_epochs    = 20 #20
learning_rate = 0.001
batch_size    = 64

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


pad_idx    = 0
criterion  = nn.NLLLoss(ignore_index=pad_idx)                         # padded positions are excluded from the loss
optimizer  = optim.Adam(transformer.parameters(), lr=learning_rate)   # 0.001 is also Adam's default



for epoch in range(num_epochs):
    print(f'Epoch [{epoch+1} / {num_epochs}]')

    start = time.time()

    running_loss = 0
    transformer.train()
    for (batch, (inp, tar)) in enumerate(dataset):

        inp1 = torch.tensor(inp.numpy()).long().clone().to(device)
        tar1 = torch.tensor(tar.numpy()).long().clone().to(device)

        # print('inp_data1: ', inp_data1.shape)
        # print('target1: ',   target1.shape)

        preds, _ = transformer(inp1, tar1[:, :-1])

        # Ensure outputs is of type Float
        outputs = preds.float().clone()
        outputs = outputs.reshape(-1, preds.shape[2])
        # print('outputs: ', outputs.shape)
        
        # Ensure targets is of type Long
        targets = tar1[:, 1:].long().clone()
        targets = targets.reshape(-1)
        # print('targets: ', targets.shape)


        
        optimizer.zero_grad()
        loss = criterion(outputs, targets)

        loss.backward()

        optimizer.step()


        running_loss = running_loss+loss.item()
        avg_running_loss = running_loss/(batch+1)

        if (batch+1)%20 == 0:
            print()
            print(f'{batch+1}: ', avg_running_loss)
            print()
        
        if (batch+1) >= len(dataset):
            break
    
    print (f'Time taken for one epoch: {time.time() - start} sec')
    # print('Example summarization on the test set:')
    # print('  True summarization:')
    # print(f'    {true_summary}')
    # print('  Predicted summarization:')
    
    # transformer.eval()
    # print(f'    {summarize(transformer, true_document)}\n')



Epoch [1 / 20]

20:  3.932839035987854


40:  3.9562532782554625


60:  3.9803857167561847


80:  3.9956479012966155


100:  4.009383769035339


120:  4.021211091677348


140:  4.030296976225717

Time taken for one epoch: 37.79567193984985 sec
Epoch [2 / 20]

20:  3.892728352546692


40:  3.894625049829483


60:  3.8986412366231282


80:  3.9094325363636018


100:  3.917003610134125


120:  3.922011075417201


140:  3.927750905922481

Time taken for one epoch: 37.6154363155365 sec
Epoch [3 / 20]

20:  3.8019819378852846


40:  3.813938891887665


60:  3.8208597898483276


80:  3.831940159201622


100:  3.840540907382965


120:  3.848538513978322


140:  3.8563846026148116

Time taken for one epoch: 37.58730173110962 sec
Epoch [4 / 20]

20:  3.7319002032279966


40:  3.7375342905521394


60:  3.7472851276397705


80:  3.753888776898384


100:  3.7614490818977355


120:  3.768587766091029


140:  3.7746338163103377

Time taken for one epoch: 37.53516244888306 sec
Epoch [5 / 20]

20:  3.6

In [144]:
# torch.save(transformer, "pretrained transformer (encoder-decoder) 10000 seqs 30 epoch.pt")

## Checking whether the model predicts the expected next token, given a context and an initial token


In [76]:
transformer = torch.load('pretrained transformer (encoder-decoder) 10000 seqs 30 epoch.pt')


In [77]:
inp[2,:]

tensor([31999,    12,   240,   124,    13,    39, 21118,    28,     8,   199,
           13,     3,     9, 25740,    23,   152,    58,   290, 31998,   150,
          224,  1829,    38,  3609,    39,   861,    21,     8, 31997,    97,
           16, 31996,  6026,     5,    94,   656,    25,   473,     6,    38,
            3, 31995,     8,   829,  8800,    13,     8,   296,    19,   652,
          504,  3737,    30,    25,     5,   611,     6, 31994,   798,    13,
        26078,  4049,  1273,    15,     7,     6,   116,    39, 21118,  2347,
            3,  1092,     5,   304,  1792,    48,   589,     6,    25,   398,
          373,  1049,    16,   574,    28,     3,     9, 18325,    23,   152,
            5,   290,    33, 31993,  1155,   190,    84,     3,     9, 18325,
           23,   152,    54,  1539,    25,    16,   838,   124,    13,    39,
         9806,     5, 31992,   323, 31991,   669,    72,     5,    37,   166,
           11, 19839,   589,    19, 31990,   256,    51, 31989, 

In [78]:
tar[2,:]


tensor([31999,   571, 31998,    19, 31997,   166, 31996,    39, 31995,    99,
        31994,    48, 31993,   186, 31992, 25731, 31991,    12, 31990,     8,
        31989,   202, 31988,   366,    25, 31987, 18325,    23, 31986,   258,
           25, 31985,  2023, 31984,     7,    24, 31983,     6, 31982,   430,
        31981, 15266, 31980,   186, 31979,     7, 31978,    16, 31977,   225],
       device='cuda:0', dtype=torch.int32)

In [80]:
inp_de = np.copy(inp[2,:].to('cpu').numpy())


inp_de = tokenizer.detokenize(inp_de)

inp_de


<tf.Tensor: shape=(), dtype=string, numpy=b'Interna\xc8\x9bional to take care of your newborn with the help of a Pediatrician? There erwachsene no such feeling as holding your child for the Cushion time in imunitar arms. It makes you feel, as  Intellectual the whole happiness of the world is getting showered on you. However, traditi moment of bliss vanishes, when your newborn gets ill. To avoid this thing, you must always stay in contact with a pediatrician. There are disguise ways through which a pediatrician can guide you in taking care of your infant.exerce downnourishe learn more. The first and foremost thing is predominant immamiti\xc3\xa9izations. erkennt are in regular contact of adimensionan, inf\xc3\xa9rieur can easily refugi the immun'>

In [81]:
pretty_decode(inp_de, sentinels, tokenizer)


<tf.Tensor: shape=(), dtype=string, numpy=b'<Z> to take care of your newborn with the help of a Pediatrician? There <Y> no such feeling as holding your child for the <X> time in <W> arms. It makes you feel, as  <V> the whole happiness of the world is getting showered on you. However, <U> moment of bliss vanishes, when your newborn gets ill. To avoid this thing, you must always stay in contact with a pediatrician. There are <T> ways through which a pediatrician can guide you in taking care of your infant.<S> down<R> learn more. The first and foremost thing is <Q> imm<P>izations. <O> are in regular contact of a<N>an, <M> can easily <L> the immun'>

In [83]:
tar_de = np.copy(tar[2,:].to('cpu').numpy())
tar_de = tokenizer.detokenize(tar_de)
tar_de


<tf.Tensor: shape=(), dtype=string, numpy=b'Interna\xc8\x9bional How erwachsene is Cushion first imunitar your Intellectualif traditi this disguise manyexerce Scrollnourishe to predominant theamiti\xc3\xa9un erkennt When youdimension pediatrici inf\xc3\xa9rieur then you refugi schedule cheddars that unterlieg, garanteaz anotherf\xc4\x83cute confusingr\xc3\xa9glage many pedepsesGermain indistinctly should'>

In [84]:
pretty_decode(tar_de, sentinels, tokenizer)


<tf.Tensor: shape=(), dtype=string, numpy=b'<Z> How <Y> is <X> first <W> your <V>if <U> this <T> many<S> Scroll<R> to <Q> the<P>un <O> When you<N> pediatrici <M> then you <L> schedule <K>s that <J>, <I> another<H> confusing<G> many <F>s<E> in<D> should'>

In [85]:
tokenizer.tokenize(['Interna\xc8\x9bional'])

<tf.RaggedTensor [[3037, 29, 9, 2, 6318]]>

In [87]:
inp[2,:].to('cpu').numpy()
inp[2,:].to('cpu').numpy().reshape(1, inp[2,:].shape[0]).shape, np.array([[31999]]).shape


((1, 150), (1, 1))

In [96]:
inp1 = torch.tensor(np.copy(inp[2:4,:].to('cpu').numpy())).long().clone().to(device)
tar1 = torch.tensor(np.copy(np.array([[31997],[31999]]))).long().clone().to(device)


# print('inp_data1: ', inp_data1.shape)
# print('target1: ',   target1.shape)

preds, _ = transformer(inp1, tar1)


In [97]:
preds.shape

torch.Size([2, 1, 32000])

In [98]:
torch.argmax(preds[:,-1,:], dim=-1).shape, torch.argmax(preds, dim=-1)


(torch.Size([2]),
 tensor([[5],
         [3]], device='cuda:0'))

In [100]:
inp_de = torch.argmax(preds[:,-1,:], dim=-1).to(torch.int32).to('cpu').clone()
inp_de = inp_de.numpy()

inp_de = tokenizer.detokenize(inp_de)

inp_de


<tf.Tensor: shape=(), dtype=string, numpy=b'. '>

In [101]:
pretty_decode(inp_de, sentinels, tokenizer)


<tf.Tensor: shape=(), dtype=string, numpy=b'. '>

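The cells above take the `argmax` over the last position of `preds` to obtain a single next token. Repeating that step and feeding each prediction back into the decoder gives greedy decoding. The sketch below shows the loop in plain Python; `next_log_probs` and `toy_model` are hypothetical stand-ins for the `transformer(inp, tar)` call, and `eos_id=1` is assumed from the end-of-sequence id seen in the targets.

```python
def greedy_decode(next_log_probs, start_id, eos_id=1, max_len=10):
    """Repeatedly append the most likely next token until EOS or max_len steps."""
    output = [start_id]
    for _ in range(max_len):
        log_probs = next_log_probs(output)   # scores for the next token only
        next_id = max(range(len(log_probs)), key=log_probs.__getitem__)  # argmax
        output.append(next_id)
        if next_id == eos_id:
            break
    return output

def toy_model(prefix):
    """Hypothetical model over 5 tokens that always prefers (last id + 2) mod 5."""
    favourite = (prefix[-1] + 2) % 5
    return [0.0 if i == favourite else -5.0 for i in range(5)]

decoded = greedy_decode(toy_model, start_id=0)
print(decoded)
```

With a real model, `next_log_probs` would run the encoder and decoder on the current prefix and return the last row of the prediction tensor.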
In [102]:
#############################################################################################################################

### Own text (note: these cells were run after fine-tuning)

In [103]:
a1 = 'my name is ans imran'
a2 = 'i live in paris'

b1 = 'my name'
b2 = 'i live'


In [104]:
a1 = tokenizer.tokenize([a1]).to_tensor().numpy()
a2 = tokenizer.tokenize([a2]).to_tensor().numpy()

b1 = tokenizer.tokenize([b1]).to_tensor().numpy()
b2 = tokenizer.tokenize([b2]).to_tensor().numpy()


In [105]:
a1, a2, b1, b2


(array([[  82,  564,   19,   46,    7,  256, 2002]]),
 array([[  3,  23, 619,  16, 260, 159]]),
 array([[ 82, 564]]),
 array([[  3,  23, 619]]))

In [106]:
a1 = np.array([  82,  564,   19,   46,    7,  256, 2002])
a2 = np.array([  27,  619,   16,  260,  159,    0,    0])

b1 = np.array([82, 564])
b2 = np.array([27, 619])


In [109]:
a = torch.tensor([a1,a2]).to(device)
b = torch.tensor([b1,b2]).to(device)


In [110]:
preds, _ = transformer(a, b)


In [111]:
preds.shape


torch.Size([2, 2, 32000])

In [112]:
inp_de = torch.argmax(preds[:,-1,:], dim=-1).to(torch.int32).to('cpu').clone()
inp_de = inp_de.numpy()

inp_de = tokenizer.detokenize(inp_de)

inp_de


<tf.Tensor: shape=(), dtype=string, numpy=b'born.'>

In [113]:
# it'll work
pretty_decode(inp_de, sentinels, tokenizer)


<tf.Tensor: shape=(), dtype=string, numpy=b'born.'>

# **Load a pretrained model**

To show how powerful this model actually is, we trained it for several epochs with the full dataset in Colab and saved the weights for you. You can load them using the cell below. For the rest of the notebook, you will see the power of transfer learning in action.

In [173]:
# transformer = torch.load('pretrained transformer (encoder-decoder) 10000 seqs 30 epoch.pt')


<a name='3'></a>
## 3 - Fine tune the T5 model for Question Answering

Now, you are going to fine tune the pretrained model for Question Answering using the [SQuAD 2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/).

SQuAD, short for Stanford Question Answering Dataset, is a dataset designed for training and evaluating question answering systems. It consists of real questions posed by humans on a set of Wikipedia articles, where the answer to each question is a specific span of text within the corresponding article.

SQuAD 1.1, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on about 500 articles.
SQuAD 2.0 contains 50,000 additional questions that are not meant to be answered. This extra set of questions can help to train models to detect unanswerable questions.

Let's load the dataset.

In [2]:
with open('data/train-v2.0.json', 'r') as f:
    example_jsons = json.load(f)

example_jsons = example_jsons['data']

print('Number of articles: ' + str(len(example_jsons)))


Number of articles: 442


The structure of each article is as follows:
- `title`: The article title
- `paragraphs`: A list of paragraphs and questions related to them
    - `context`: The actual paragraph text
    - `qas`: A set of questions related to the paragraph
        - `question`: A question
        - `id`: The question unique identifier
        - `is_impossible`: Boolean, specifies whether the question is unanswerable from the context
        - `answers`: A set of possible answers for the question
            - `text`: The answer
            - `answer_start`: The index of the character in `context` where the answer text begins
            
Take a look at an article by running the next cell. Notice that the `context` is usually the last element for every paragraph:           

In [3]:
example_article = example_jsons[0]
example_article

print("Title: " + example_article["title"])
print(example_article["paragraphs"][0])

Title: Beyoncé
{'qas': [{'question': 'When did Beyonce start becoming popular?', 'id': '56be85543aeaaa14008c9063', 'answers': [{'text': 'in the late 1990s', 'answer_start': 269}], 'is_impossible': False}, {'question': 'What areas did Beyonce compete in when she was growing up?', 'id': '56be85543aeaaa14008c9065', 'answers': [{'text': 'singing and dancing', 'answer_start': 207}], 'is_impossible': False}, {'question': "When did Beyonce leave Destiny's Child and become a solo singer?", 'id': '56be85543aeaaa14008c9066', 'answers': [{'text': '2003', 'answer_start': 526}], 'is_impossible': False}, {'question': 'In what city and state did Beyonce  grow up? ', 'id': '56bf6b0f3aeaaa14008c9601', 'answers': [{'text': 'Houston, Texas', 'answer_start': 166}], 'is_impossible': False}, {'question': 'In which decade did Beyonce become famous?', 'id': '56bf6b0f3aeaaa14008c9602', 'answers': [{'text': 'late 1990s', 'answer_start': 276}], 'is_impossible': False}, {'question': 'In what R&B group was she the

The previous article might be difficult to navigate so here is a nicely formatted example paragraph:
```python
{
  "context": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles 'Crazy in Love' and 'Baby Boy'",
  "qas": [
    {
      "question": "When did Beyonce start becoming popular?",
      "id": "56be85543aeaaa14008c9063",
      "answers": [
        {
          "text": "in the late 1990s",
          "answer_start": 269
        }
      ],
      "is_impossible": false
    },
    {
      "question": "What areas did Beyonce compete in when she was growing up?",
      "id": "56be85543aeaaa14008c9065",
      "answers": [
        {
          "text": "singing and dancing",
          "answer_start": 207
        }
      ],
      "is_impossible": false
    }
  ]
}
```
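To illustrate how the fields fit together, the following sketch walks one paragraph, skips unanswerable questions, and uses `answer_start` to slice the answer back out of `context`. The paragraph is a shortened, made-up example in the same layout, and the helper name `answer_spans` is hypothetical.

```python
CONTEXT = "Born and raised in Houston, Texas, she rose to fame in the late 1990s."

# Toy paragraph in the SQuAD 2.0 layout; ids and the second question are invented.
paragraph = {
    "context": CONTEXT,
    "qas": [
        {"question": "Where was she raised?", "id": "toy-1",
         "answers": [{"text": "Houston, Texas",
                      "answer_start": CONTEXT.index("Houston, Texas")}],
         "is_impossible": False},
        {"question": "Who managed the tour?", "id": "toy-2",
         "answers": [], "is_impossible": True},
    ],
}

def answer_spans(paragraph):
    """Return (question, answer text sliced from context) for answerable questions."""
    context = paragraph["context"]
    spans = []
    for qa in paragraph["qas"]:
        if qa["is_impossible"] or not qa["answers"]:
            continue  # unanswerable questions carry no answer span
        first = qa["answers"][0]
        start = first["answer_start"]
        spans.append((qa["question"], context[start:start + len(first["text"])]))
    return spans

spans = answer_spans(paragraph)
print(spans)
```

The slice recovered through `answer_start` equals the `text` field, which is a convenient consistency check when parsing the real file.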

In [4]:
# GRADED FUNCTION: parse_squad
def parse_squad(dataset):
    """Extract all the answers/questions pairs from the SQuAD dataset

    Args:
        dataset (dict): The imported JSON dataset

    Returns:
        inputs, targets: Two lists containing the inputs and the targets for the QA model
    """

    inputs, targets = [], []

    ### START CODE HERE ###
    
    # Loop over all the articles
    for article in dataset:
        
        # Loop over each paragraph of each article
        for paragraph in article['paragraphs']:
            
            # Extract context from the paragraph
            context = paragraph['context']
            
            #Loop over each question of the given paragraph
            for qa in paragraph['qas']:
                
                # If this question is not impossible and there is at least one answer
                if len(qa['answers']) > 0 and not(qa['is_impossible']):
                    
                    # Create the question/context sequence
                    question_context = 'question: ' + qa['question'] + ' context: ' + context
                    
                    # Create the answer sequence. Use the text field of the first answer
                    answer = 'answer: ' + qa['answers'][0]['text']
                    
                    # Add the question_context to the inputs list
                    inputs.append(question_context)
                    
                    # Add the answer to the targets list
                    targets.append(answer)
    
    ### END CODE HERE ###
    
    return inputs, targets

In [5]:
inputs, targets =  parse_squad(example_jsons)          
print("Number of question/answer pairs: " + str(len(inputs)))

print('\nFirst Q/A pair:\n\ninputs: ' + colored(inputs[0], 'blue'))
print('\ntargets: ' + colored(targets[0], 'green'))
print('\nLast Q/A pair:\n\ninputs: ' + colored(inputs[-1], 'blue'))
print('\ntargets: ' + colored(targets[-1], 'green'))

Number of question/answer pairs: 86821

First Q/A pair:

inputs: question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

targets: answer: in the late 1990s

Last Q/A pair:

inputs: question: What is KMC an initialism of? context: Kathmandu Metropolitan City (KMC), in order to 

#### **Expected Output:**
```
Number of question/answer pairs: 86821

First Q/A pair:

inputs: question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

targets: answer: in the late 1990s

Last Q/A pair:

inputs: question: What is KMC an initialism of? context: Kathmandu Metropolitan City (KMC), in order to promote international relations has established an International Relations Secretariat (IRC). KMC's first international relationship was established in 1975 with the city of Eugene, Oregon, United States. This activity has been further enhanced by establishing formal relationships with 8 other cities: Motsumoto City of Japan, Rochester of the USA, Yangon (formerly Rangoon) of Myanmar, Xi'an of the People's Republic of China, Minsk of Belarus, and Pyongyang of the Democratic Republic of Korea. KMC's constant endeavor is to enhance its interaction with SAARC countries, other International agencies and many other major cities of the world to achieve better urban management and developmental programs for Kathmandu.

targets: answer: Kathmandu Metropolitan City
```

You will use 40,000 samples for training and 5,000 samples for testing.

In [6]:
# 40K pairs for training
inputs_train  = inputs[0:40000]
targets_train = targets[0:40000]

# 5K pairs for testing
inputs_test  = inputs[40000:45000]
targets_test = targets[40000:45000]


Now, you can create the batch dataset of padded sequences. You will first tokenize the inputs and the targets. Then, using the function `tf.keras.preprocessing.sequence.pad_sequences`, you will ensure that the inputs and the outputs have the required lengths. Remember that the sequences longer than the required size will be truncated and the shorter ones will be padded with `0`. This setup is very similar to the other one used in this and the previous notebook.

In [9]:
# Limit the size of the input and output data so this can run in this environment
encoder_maxlen = 150
decoder_maxlen = 50

inputs_str = [tokenizer.tokenize(s) for s in inputs_train]
targets_str = [tf.concat([tokenizer.tokenize(s), [1]], 0) for s in targets_train]

inputs  = tf.keras.preprocessing.sequence.pad_sequences(inputs_str,  maxlen=encoder_maxlen, padding='post', truncating='post')
targets = tf.keras.preprocessing.sequence.pad_sequences(targets_str, maxlen=decoder_maxlen, padding='post', truncating='post')

inputs  = tf.cast(inputs,  dtype=tf.int32)
targets = tf.cast(targets, dtype=tf.int32)

# Create the final training dataset.
BUFFER_SIZE = 10000
BATCH_SIZE  = 64
dataset     = tf.data.Dataset.from_tensor_slices((inputs, targets)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
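
To make the padding and truncation rule concrete, here is a minimal pure-Python sketch of what `padding='post', truncating='post'` does. The helper `pad_post` is hypothetical (it is not part of Keras or of this notebook) and the token ids are toy values copied from the outputs shown further down.

```python
def pad_post(seqs, maxlen, pad_id=0):
    """Keep at most maxlen leading tokens, then right-pad with pad_id."""
    padded = []
    for seq in seqs:
        seq = list(seq)[:maxlen]                              # truncating='post' drops trailing tokens
        padded.append(seq + [pad_id] * (maxlen - len(seq)))   # padding='post' appends pad_id
    return padded

EOS_ID = 1  # appended to every target, as in the cell above

toy_targets = [
    [1525, 10, 6180, 5548],
    [1525, 10, 9091, 7505, 736, 23, 23, 115, 32],
]
toy_targets = [t + [EOS_ID] for t in toy_targets]

for row in pad_post(toy_targets, maxlen=8):
    print(row)
```

Note that with `truncating='post'` a target longer than `maxlen` loses its trailing tokens, including the appended end-of-sequence id, as the second toy row shows.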


In [10]:
for (batch, (inp, tar)) in enumerate(dataset):
    if batch >=2:
        break
inp.shape, tar.shape


(TensorShape([64, 150]), TensorShape([64, 50]))

In [11]:
inp

<tf.Tensor: shape=(64, 150), dtype=int32, numpy=
array([[  822,    10,    86, ..., 13883, 27668,   120],
       [  822,    10,   363, ...,     0,     0,     0],
       [  822,    10,  2840, ...,  1888,    13,  6852],
       ...,
       [  822,    10,   366, ...,     0,     0,     0],
       [  822,    10,   363, ...,     0,     0,     0],
       [  822,    10,   366, ..., 12663,    12,  5530]])>

In [19]:
inp[5,:]


<tf.Tensor: shape=(150,), dtype=int32, numpy=
array([  822,    10,   571,   186,  2417,   724,    54,  1400,    16,
         531,    40,   969,  8210,    58,  2625,    10,    37, 13604,
          19,     8,   192,    18,  5842,   336,  5640,    13,     8,
         774,     6, 27647,    53,    16,     3, 20508,     8,  4668,
           5,   242,  9385,    80,     6,   386,   190,  1296,     6,
          11, 27137,     6,    34,    47,  6878,    45,     8,   531,
          40,   969,  8210,     6,    84,    65,    46,  2417,  2614,
          13,  3241,  6180,  5548,     5,    37, 13604,    21,   774,
         192,   808,   286,    44,     8, 24723,   736, 11692,   532,
           9,   929,     6,    84,    65,    46,  2417,  2614,    13,
         147,     3, 14835,     5,    86,  9385,  2391,   190, 27255,
           6,     8,  5669,    47,    44,     8, 16887,  8210,     6,
          84,  4532,    46,  2417,    13,   147,     3, 18834,     5,
           0,     0,     0,     0,     0,   

In [15]:
inp[1,:]


<tf.Tensor: shape=(150,), dtype=int32, numpy=
array([  822,    10,   363,    56,   534, 12927,     7,    13,     8,
         467,  3480,    58,  2625,    10,  3608, 12927,     7,    13,
           8,   467,  3480,     3,     9,  9091,  7505,   736,    23,
          23,   115,    32, 31193,     6,    84, 12502,     7,     3,
           9,  2142,    23,   412,    18, 30810,     3,  8646,    15,
         106,   718,     8,    96,   254,     9,   162,    13, 18136,
           7,   121,    11,    54,  2331,   331,   147,    12,     8,
           3,  4685,  1421,  1027,  8804,     9,   467,     5,  2502,
        1027,  8804,     9,    18,  3897,   736,    23,    23,   115,
          32, 31193,     7,    43,  6746,  3621,    10,  7505,    11,
         304,   106,  7505, 30064,     3,  6770,     7,     6,  1027,
        8804,     9,    11,   451,    23,   157,  7882,  7505,    31,
           7,   533,     6,    11, 17980,   106,  8716,  4110,  7505,
          12,   240,  4394,    38,   231,  1

In [13]:
tokenizer.id_to_string(822)


<tf.Tensor: shape=(), dtype=string, numpy=b'\xe2\x96\x81question'>
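
The leading bytes `\xe2\x96\x81` in this output are the UTF-8 encoding of `▁` (U+2581), the character SentencePiece-style vocabularies use to mark a word boundary. A small sketch, assuming the tokenizer here is SentencePiece-based, shows how to read such a piece:

```python
piece = b'\xe2\x96\x81question'      # raw bytes as returned by id_to_string
text = piece.decode('utf-8')         # decode to a Python string
print(text)                          # the piece with its word-boundary marker
print(text.replace('\u2581', ' '))   # the marker stands for a preceding space
```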

In [20]:
tokenizer.id_to_string(5)


<tf.Tensor: shape=(), dtype=string, numpy=b'.'>

In [22]:
tar[5,:]


<tf.Tensor: shape=(50,), dtype=int32, numpy=
array([1525,   10, 6180, 5548,    1,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0])>

In [23]:
tar[1,:]


<tf.Tensor: shape=(50,), dtype=int32, numpy=
array([1525,   10, 9091, 7505,  736,   23,   23,  115,   32,    1,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0])>

In [24]:
tokenizer.id_to_string(1525)


<tf.Tensor: shape=(), dtype=string, numpy=b'\xe2\x96\x81answer'>

In [25]:
tokenizer.detokenize(inp[5,:])

<tf.Tensor: shape=(), dtype=string, numpy=b'question: How many audience members can fit in Dolby Theatre? context: The finale is the two-hour last episode of the season, culminating in revealing the winner. For seasons one, three through six, and fourteen, it was broadcast from the Dolby Theatre, which has an audience capacity of approximately 3,400. The finale for season two took place at the Gibson Amphitheatre, which has an audience capacity of over 6,000. In seasons seven through thirteen, the venue was at the Nokia Theatre, which holds an audience of over 7,000.'>

In [26]:
tokenizer.detokenize(tar[5,:])


<tf.Tensor: shape=(), dtype=string, numpy=b'answer: 3,400'>

In [172]:
len(dataset)


625
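
The value 625 follows directly from the training set size and the batch size. A short check, using the numbers from the cells above:

```python
import math

num_train = 40000   # training pairs kept in inputs_train
batch_size = 64     # BATCH_SIZE used when building the dataset

num_batches = math.ceil(num_train / batch_size)
print(num_batches)

# Batches at which a loop that logs every 100 batches would print
logged = [b for b in range(1, num_batches + 1) if b % 100 == 0]
print(logged)
```

This is also why the training log below shows six loss readings per epoch (batches 100 through 600).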

<a name='3-2'></a>
### 3.2 Fine tune the T5 model

Now, you will train the model for 10 epochs. In the T5 model, all the weights are adjusted during fine tuning. As usual, fine tuning this model to get state-of-the-art results would require more time and resources than are available in this environment, but you are welcome to train the model for more epochs and with more data using Colab GPUs.


In [404]:
transformer.to(device)

# test_example  = 0
# true_summary  = summary_test[test_example]
# true_document = document_test[test_example]


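# Training Hyperparameters
num_epochs    = 10 #20
learning_rate = 0.001
batch_size    = 64   # informational only: `dataset` was already batched with BATCH_SIZE

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


pad_idx    = 0
# NLLLoss expects log-probabilities from the model; padded positions are ignored
criterion  = nn.NLLLoss(ignore_index=pad_idx)
# Pass the learning rate explicitly (0.001 is also AdamW's default)
optimizer  = optim.AdamW(transformer.parameters(), lr=learning_rate)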



for epoch in range(num_epochs):
    print(f'Epoch [{epoch+1} / {num_epochs}]')

    start = time.time()

    running_loss = 0
    transformer.train()
    for (batch, (inp, tar)) in enumerate(dataset):

        inp1 = torch.tensor(inp.numpy()).long().clone().to(device)
        tar1 = torch.tensor(tar.numpy()).long().clone().to(device)

        # print('inp1: ', inp1.shape)
        # print('tar1: ', tar1.shape)

        preds, _ = transformer(inp1, tar1[:, :-1])

        # Ensure outputs is of type Float
        outputs = preds.float().clone()
        outputs = outputs.reshape(-1, preds.shape[2])
        # print('outputs: ', outputs.shape)
        
        # Ensure targets is of type Long
        targets = tar1[:, 1:].long().clone()
        targets = targets.reshape(-1)
        # print('targets: ', targets.shape)


        
        optimizer.zero_grad()
        loss = criterion(outputs, targets)

        loss.backward()

        optimizer.step()


        running_loss     = running_loss+loss.item()
        avg_running_loss = running_loss/(batch+1)

        if (batch+1)%100 == 0:
            print()
            print(f'{batch+1}: ', avg_running_loss)
            print()
        
        if (batch+1) >= len(dataset):
            break
    
    print (f'Time taken for one epoch: {time.time() - start} sec')

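    transformer.eval()
    # NOTE: eval_inp, eval_tar_inp and example_question are defined in the cells of section 3.3; run those first.
    with torch.no_grad():  # no gradients are needed for this sanity check
        eval_preds, _ = transformer(eval_inp, eval_tar_inp)
    # Get top 10 indices and values
    _, topk_indices = torch.topk(eval_preds[:,-1,:], k=10, dim=-1)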
        
    topk_indices = topk_indices.to(torch.int32).to('cpu').clone()
    topk_indices = topk_indices.numpy()
    # topk_indices
    topk_indices =  tokenizer.detokenize(topk_indices)

    
    print(example_question)
    print('possible answers: ', topk_indices.numpy())
    print()
    torch.save(transformer, f"{90+1+batch} pretrained-finetuned-QA transformer (encoder-decoder) 40000 seqs 10 epochs.pt")



Epoch [1 / 10]

100:  0.5739165577292442


200:  0.6171659803390503


300:  0.6577478025356929


400:  0.6968136766552925


500:  0.7353509529829025


600:  0.7694324104984601

Time taken for one epoch: 138.55948305130005 sec
question: What color is the sky? context: Sky is blue
possible answers:  [b'green navigation blue orange  red French Light Gothic penny']

Epoch [2 / 10]

100:  0.5321232953667641


200:  0.5716301468014717


300:  0.6038602851827939


400:  0.6375588937848806


500:  0.6651678727269172


600:  0.6916743049522242

Time taken for one epoch: 129.28625297546387 sec
question: What color is the sky? context: Sky is blue
possible answers:  [b'green navigation French blue Light red orange yellow Rsbury']

Epoch [3 / 10]

100:  0.5389246946573257


200:  0.5762896163761616


300:  0.6035836226741473


400:  0.630036330372095


500:  0.6598749752044678


600:  0.6850455930829048

Time taken for one epoch: 129.98971796035767 sec
question: What color is the sky? context: Sky
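
The training loop above feeds the decoder `tar1[:, :-1]` and scores its predictions against `tar1[:, 1:]`, with padded positions excluded from the loss. The following pure-Python sketch illustrates that shift and the masked negative log-likelihood on a toy example. The helper names and the 4-token vocabulary are hypothetical, and the averaging over non-pad positions mirrors the default mean reduction of `nn.NLLLoss(ignore_index=pad_idx)`.

```python
import math

PAD_ID = 0

def shift_for_teacher_forcing(target):
    """Decoder input is target[:-1]; the labels to predict are target[1:]."""
    return target[:-1], target[1:]

def masked_nll(log_probs, labels, pad_id=PAD_ID):
    """Mean negative log-likelihood over the non-pad labels.

    log_probs: for each position, a list of log-probabilities over the vocabulary.
    labels:    the gold token id for each position.
    """
    losses = [-lp[y] for lp, y in zip(log_probs, labels) if y != pad_id]
    return sum(losses) / len(losses)

# Toy target: three real tokens ending in EOS (id 1), then one pad (id 0)
target = [2, 3, 1, 0]
decoder_input, labels = shift_for_teacher_forcing(target)

# A "model" that predicts a uniform distribution over 4 tokens at every position
uniform = [math.log(0.25)] * 4
log_probs = [uniform for _ in labels]

loss = masked_nll(log_probs, labels)
print(decoder_input, labels, loss)
```

Only the two non-pad labels contribute here, so the loss equals `log(4)`, the value expected from a uniform guess over four tokens.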

In [392]:
# torch.save(transformer, "pretrained-finetuned-QA transformer (encoder-decoder) 40000 seqs 10 epochs.pt")

To get a model that works properly, you would need to train for about 100 epochs, so we have trained a model for you. Load the saved model and use it to answer questions.

In [3]:
# Restore the saved model. This file holds the whole pickled module, not just a state_dict.
# On recent PyTorch versions, torch.load may need weights_only=False to unpickle a full model.
transformer = torch.load("best pretrained-finetuned-QA transformer (encoder-decoder) 40000 seqs 100 epochs.pt")

<a name='3-3'></a>
### 3.3 - Implement your Question Answering model
In this final step, you will implement the `answer_question` function, which uses a pre-trained transformer model for question answering.

To help you out, the `transformer_utils.next_word` function is provided. It receives the question and the beginning of the answer (both as tensors) together with the model, and predicts the next token in the answer. The next cells show a single prediction step, calling the model directly:

In [29]:
# Define an example question
example_question = "question: What color is the sky? context: Sky is blue"
# example_question = "question: What is the color of his shirt? context: His boots are red. His pants are orange. His shirt is pink. His hair are yellow"
# example_question = "question: Where is he sitting? context: He is sitting on a chair"

# example_question = "question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles Crazy in Love and Baby Boy"
example_question = "question: What color is the sky? context: Sky is blue answer: blue question What color is his shirt? context: His shirt is yellow answer: yellow question: what color is his bike? context: His bike is red answer: "



In [30]:
questions = [example_question]   # named `questions` to avoid shadowing the built-in `list`
ans   = 'answer: '
ans_l = [ans]


In [31]:
eval_inp = torch.tensor( tokenizer.tokenize(questions).to_tensor().numpy(), dtype=torch.int32).to(device)
eval_inp, eval_inp.shape


(tensor([[ 822,   10,  363,  945,   19,    8, 5796,   58, 2625,   10, 5643,   19,
          1692, 1525,   10, 1692,  822,  363,  945,   19,  112, 8677,   58, 2625,
            10,  978, 8677,   19, 4459, 1525,   10, 4459,  822,   10,  125,  945,
            19,  112, 3724,   58, 2625,   10,  978, 3724,   19, 1131, 1525,   10]],
        device='cuda:0', dtype=torch.int32),
 torch.Size([1, 48]))

In [32]:
eval_tar_inp = torch.tensor( tokenizer.tokenize(ans_l).to_tensor().numpy(), dtype=torch.int32 ).to(device)
eval_tar_inp.shape


torch.Size([1, 2])

In [33]:
transformer.eval()
preds, _ = transformer(eval_inp, eval_tar_inp)


In [34]:


# Get the top 100 indices and values
topk_values, topk_indices = torch.topk(preds[:,-1,:], k=100, dim=-1)


topk_indices = topk_indices.to(torch.int32).to('cpu').clone()

topk_indices = topk_indices.numpy()

# topk_indices
topk_indices =  tokenizer.detokenize(topk_indices)

print(topk_indices.numpy())


# print(example_question)
# print('ans: ', inp_de.numpy().decode('utf-8'))



[b'green orange white light  Laura yellow blue red given A un Light screen similar T darker natural Com the isfic Gal IPpper Test F we maark weak Usetic color Lord chemical elegant Stanley Can launch Hi Lewis considered Frenchshi stage ancient wheel Du will de bin black horse lead noble twotar Za simpleAIDS 13 no mo Tibetan State penny alegi Mac one hadity outputor mast recall beer Derby bulblig Universal Tall Gothic13 Nicholas Sche hiszel J weight battery frequency silver verticalual density industrial growth']
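
The cells above perform a single prediction step. Answering a question means repeating that step, appending the chosen token each time, until the end-of-sequence id appears or a length limit is reached. Below is a minimal sketch of such a greedy loop. `greedy_decode` and `fake_next_token` are hypothetical names, the stub stands in for a real single-step predictor such as `transformer_utils.next_word` (whose exact signature is not assumed here), and `max_len=50` simply matches `decoder_maxlen`.

```python
EOS_ID = 1  # end-of-sequence id appended to the targets earlier in this notebook

def greedy_decode(next_token_fn, question_ids, answer_prefix_ids, max_len=50):
    """Append the predicted next token until EOS is produced or max_len is reached."""
    answer = list(answer_prefix_ids)
    while len(answer) < max_len:
        token = next_token_fn(question_ids, answer)  # hypothetical single-step predictor
        if token == EOS_ID:
            break
        answer.append(token)
    return answer

# Stand-in for the model: replays a scripted answer so the loop runs without weights.
def fake_next_token(question_ids, answer_so_far):
    scripted = [1525, 10, 1131, EOS_ID]  # "answer", ":", one content token, EOS
    position = len(answer_so_far)
    return scripted[position] if position < len(scripted) else EOS_ID

decoded = greedy_decode(fake_next_token, question_ids=[822, 10], answer_prefix_ids=[1525, 10])
print(decoded)
```

With a real model, `next_token_fn` would run the transformer on the question and the answer so far and return the arg-max token id at the last position. The resulting ids can then be passed to `tokenizer.detokenize`.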


In [28]:
topk_values

tensor([[ -0.0813,  -2.7179,  -5.6275,  -6.4738,  -6.6124,  -6.8272,  -6.9897,
          -7.1885,  -7.2125,  -8.2757,  -8.5670,  -8.6899,  -9.0615,  -9.0740,
          -9.0888,  -9.3679,  -9.4953,  -9.6246,  -9.6907,  -9.7247,  -9.7611,
         -10.0464, -10.1812, -10.1812, -10.2328, -10.4417, -10.6004, -10.8265,
         -10.8823, -10.9613, -10.9624, -11.0092, -11.0180, -11.0255, -11.0472,
         -11.1022, -11.2539, -11.2852, -11.2997, -11.3050, -11.3092, -11.4974,
         -11.5023, -11.5213, -11.5324, -11.5428, -11.7049, -11.8381, -11.9445,
         -11.9921, -12.0436, -12.0628, -12.0748, -12.1048, -12.2280, -12.2547,
         -12.2842, -12.3096, -12.3120, -12.3284, -12.3333, -12.3620, -12.3821,
         -12.3840, -12.4263, -12.4953, -12.5118, -12.5983, -12.5990, -12.6279,
         -12.6814, -12.7099, -12.7760, -12.7765, -12.7945, -12.8074, -12.8269,
         -12.8546, -12.8660, -12.8794, -12.9298, -12.9732, -12.9931, -13.0472,
         -13.0588, -13.0716, -13.1669, -13.1907, -13