# KAIST AI605 Assignment 4: Question Answering with BERT and T5

TA in charge: Miyoung Ko (miyoungko@kaist.ac.kr)

**Due date**:  June 7 (Tue) 11:00pm, 2022  


## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/FSng5HUwtQinTFAU8). 

You need to submit both (1) .ipynb file (needs to be fully executable on CoLab), and (2) a pdf of the file.

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 20 points.. For every late day, your grade will be deducted by 2 points (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will need Python 3.7+ and PyTorch 1.9+, which are already available on Colab:

In [11]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.13
torch 1.11.0+cu113


## 1. Hugging Face Library

In this assignment, you will  use `transformers` library by Hugging Face. The library provides you an easy way to utilize diverse pretrained language models. 
You should be familiar with the library by now, but if not, please go over Lab 09 and Lab 10, and/or Hugging Face's sequence classification tutorial (https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb). Also, you code might get quite long at the end, so it is recommended to use a library like PyTorch Lightning (https://www.pytorchlightning.ai/) that helps you organize yoru code.

For this assignment, make sure to install both `transformers` and `datasets` packages:

In [1]:
!pip install -q transformers datasets

In [2]:
import numpy as np

> **Problem 1.1** *(1 point)* Put your favorite emoji here 😇
https://getemoji.com/

Your favorite emoji: 💩

## 2. Machine Reading Comprehension with BERT

Here, you will formulate machine reading comprehension as a token classification problem, which means you attempt to predict the start and the end position of the answer in the context. In other words, you will create a d-by-2 linear layer on top of each token-level output of BERT (where the question and the context are concatenated), and you will use each dimension of the linear layer's output for the logits of the start and the end positions.

> **Problem 2.1** *(3 points)* Finetune `bert-base-cased` model for `squad` question answering dataset and report the accuracy on the validation set. For convenience, ignore examples that have more than 256 tokens after tokenization.  *Hint*: If you are having difficulty in implementation, take a peek at  (but do not copy!) https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering, though keep in mind that the answer extraction module there is quite complex to maximize accuracy. In this assignment, however, consider simplifying it at the cost of (a bit of) accuracy.


# BERT Fine Tuning

## Imports

In [3]:
!pip install -q transformers datasets

In [4]:
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer


In [5]:
from torch.utils.data import DataLoader 
from transformers import AdamW

In [6]:
from datasets import load_dataset
from pprint import pprint
import numpy as np
squad_dataset = load_dataset('squad')
pprint(squad_dataset['train'][0]) # 'context' contains the document

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the '
            "Main Building's gold dome is a golden statue of the Virgin Mary. "
            'Immediately in front of the Main Building and facing it, is a '
            'copper statue of Christ with arms upraised with the legend '
            '"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
            'of the Sacred Heart. Immediately behind the basilica is the '
            'Grotto, a Marian place of prayer and reflection. It is a replica '
            'of the grotto at Lourdes, France where the Virgin Mary reputedly '
            'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
            'main drive (and in a direct line that connects through 3 statues '
            'and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did t

##  Data set preparation

In [15]:
# training set 
context  = squad_dataset['train']['context'][0:30000]
questions= squad_dataset['train']['question'][0:30000]
answers  = squad_dataset['train']['answers'][0:30000]
print(answers[0]['text'])

# validation set 
context_v  = squad_dataset['validation']['context'][0:30000]
questions_v= squad_dataset['validation']['question'][0:30000]
answers_v  = squad_dataset['validation']['answers'][0:30000]


['Saint Bernadette Soubirous']


In [9]:
def process(context, questions, answers):
    """document is a list of sentences, be carefull anwser is a dictionnary with start token 
"""
    context_dict = {}
    light_context= []
    query_answers= []
    answers_l    = []
    count        = 0

    for context, question, answer in zip(context, questions, answers):
        if context in context_dict:
            context_id = context_dict[context]
        else:
            context_id = count
            context_dict[context] = count
            count += 1
            light_context.append(context)
        query_answers.append([context_id, question, answer ])
        answers_l.append(answer)
    return light_context, query_answers, answers_l


In [10]:
light_context    , light_qestions    , light_answers      = process(context  , questions  , answers)
light_context_val, light_qestions_val, light_answers_val  = process(context_v, questions_v, answers_v)

print(np.shape(light_context)) 

(6654,)


In [11]:
#Small test
print("question: ", light_qestions[0][1])
print("answer :  ", light_qestions[0][2]['text'], light_answers[0])
print("contex :  " , light_context[light_qestions[0][0]])

question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
answer :   ['Saint Bernadette Soubirous'] {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
contex :   Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


In [12]:
## Add the end index to the answers and correct the wrong start index : 

def start_end_index(contexts, answers): 
  """ Be carefull answers can be either a list of plosible elements or a list of one element """
  # firstly we goes throught each of them : 
  for context, answer in zip (contexts, answers): 
    answer_text = answer['text']
    answer_start= answer['answer_start'][0]
    answer_end  = answer_start + len(answer_text)  # ideal end token if there is no mistakes in the start token
    if len(context[answer_start:answer_end]) == len(answer_text): # every thing is fine
      answer['answer_end']= [answer_end]
    else:  
      for index_shift in [1,2]:  # it is said that it is usually shif from one or two characters : 
        if context[answer_start-index_shift : answer_end - index_shift] == len(answer_text): # every thing is fine
          answer['answer_start']= [answer_start-index_shift]
          answer['answer_end']= [answer_end -index_shift]
          print(answer)

        elif context[answer_start + index_shift :answer_end + index_shift] == len(answer_text): # every thing is fine
          answer['answer_start']= [answer_start+index_shift]
          answer['answer_end']= [answer_end + index_shift]

start_end_index(light_context, light_answers)
#Small test
print("question: ", light_qestions[1000][1])
print("answer :  ", light_qestions[1000][2]['text'], light_answers[1000])
print("contex :  " , light_context[light_qestions[1000][0]])

question:  How much did Beyonce initially contribute to the foundation?
answer :   ['$250,000'] {'text': ['$250,000'], 'answer_start': [190], 'answer_end': [191]}
contex :   After Hurricane Katrina in 2005, Beyoncé and Rowland founded the Survivor Foundation to provide transitional housing for victims in the Houston area, to which Beyoncé contributed an initial $250,000. The foundation has since expanded to work with other charities in the city, and also provided relief following Hurricane Ike three years later.


In [13]:
def group_matrix(light_context,light_questions): 
  #contexts = [ [] for i in range( len(light_context))]
  #answers  = [[] for i in range( len(light_context))]
  #questions= [[] for i in range( len(light_context))]
  contexts = []
  answers  = []
  questions= []
  i = 0
  for q in light_questions : 
    contexts.append(light_context[q[0]])
    answers.append(q[2])
    questions.append(q[1])
  return contexts, answers, questions 

context_matrix  , answer_matrix  , question_matrix   = group_matrix(light_context,light_qestions)
context_matrix_v, answer_matrix_v, questions_matrix_v = group_matrix(light_context_val,light_qestions_val)


In [14]:
pprint(np.shape(question_matrix))
pprint(np.shape(context_matrix))
pprint(np.shape(answer_matrix))

pprint(question_matrix[0:3])
pprint(context_matrix[0:3])
pprint(answer_matrix[0:3])


(30000,)
(30000,)
(30000,)
['To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'What is in front of the Notre Dame Main Building?',
 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?']
['Architecturally, the school has a Catholic character. Atop the Main '
 "Building's gold dome is a golden statue of the Virgin Mary. Immediately in "
 'front of the Main Building and facing it, is a copper statue of Christ with '
 'arms upraised with the legend "Venite Ad Me Omnes". Next to the Main '
 'Building is the Basilica of the Sacred Heart. Immediately behind the '
 'basilica is the Grotto, a Marian place of prayer and reflection. It is a '
 'replica of the grotto at Lourdes, France where the Virgin Mary reputedly '
 'appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive '
 '(and in a direct line that connects through 3 statues and the Gold Dome), is '
 'a simple, modern stone statue of Mary.',
 'Architecturally, the sch

In [15]:
def get_info(light_context, light_qestions,i): 
  "return question, answer context"
  return (light_qestions[i][1], light_qestions[i][2], light_context[light_qestions[i][0]])
get_info(light_context, light_qestions,100) 

('In what year did the team lead by Knute Rockne win the Rose Bowl?',
 {'answer_end': [355], 'answer_start': [354], 'text': ['1925']},
 'One of the main driving forces in the growth of the University was its football team, the Notre Dame Fighting Irish. Knute Rockne became head coach in 1918. Under Rockne, the Irish would post a record of 105 wins, 12 losses, and five ties. During his 13 years the Irish won three national championships, had five undefeated seasons, won the Rose Bowl in 1925, and produced players such as George Gipp and the "Four Horsemen". Knute Rockne has the highest winning percentage (.881) in NCAA Division I/FBS football history. Rockne\'s offenses employed the Notre Dame Box and his defenses ran a 7–2–2 scheme. The last game Rockne coached was on December 14, 1930 when he led a group of Notre Dame all-stars against the New York Giants in New York City.')

In [16]:
## Make a dataset class to use the dataloader method after : 
import torch 

class Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  
  def __getitem__(self, index) -> dict: 
    return ({ key: torch.tensor(value[index]) for key, value in self.encodings.items() })
  
  def __len__(self)->int: 
    return (len(self.encodings.input_ids))

## Tokenizer

In [None]:
##### Tokenizer : 
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

#context_matrix, answer_matrix, question_matrix

encoded_train = tokenizer(context_matrix, question_matrix, truncation=True,  padding='max_length')
encoded_val   = tokenizer(context_matrix_v, questions_matrix_v,truncation=True, padding='max_length')

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

In [None]:
# element retruned 
print(encoded_train.keys())
print(len(encoded_train['input_ids']))
pprint(encoded_train['input_ids'][0][0:10])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
30000
[101, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959]


In [None]:
from re import A
c = 477
def list_in(a, b):
  return any(map(lambda x: b[x:x + len(a)] == a, range(len(b) - len(a) + 1)))

encoded_answer = tokenizer( answer_matrix[c]['text'],  truncation=True)
a = len(encoded_answer['input_ids'][0])
b = encoded_answer['input_ids'][0][1:a-1]

print("encoded_answer", encoded_answer['input_ids'][0])
print("answre", b)

text = tokenizer.convert_ids_to_tokens(encoded_train['input_ids'][c])
answer = tokenizer.convert_ids_to_tokens(encoded_answer['input_ids'][0])
print("text", text)
print("anser", answer)
def get_index(word, liste): 
  for i, word_list in enumerate(liste):
    if word==word_list: 
      return i

def get_bornes ( input_answer, input_context): 
  """ imput context is the matrix of the embedded context answer is the reel sentence """
  context =  tokenizer.convert_ids_to_tokens(input_context)
  encoded_answer = tokenizer(input_answer ,  truncation=True)
  answer = tokenizer.convert_ids_to_tokens(encoded_answer['input_ids'][0])
  word_start = answer[1]
  word_end = answer[-2]
  idx_start = get_index(word_start,context )
  if idx_start ==None: 
    return (1,2)
  if (len(b)==1): 
    #print("indx_start", idx_start)
    return (idx_start,idx_start+1)

  idx_end   = get_index(word_end, context)
  #print(idx_end)
  if idx_end < idx_start: 
      idx_end = idx_start +get_index(word_end,context[idx_start:] )+1
      #print("index", (idx_start, idx_end))
      return (idx_start, idx_end)
  else : 
    #print("index", (idx_start, idx_end))
    return (idx_start, idx_end)
print ("answer", answer_matrix[c]['text'])
if (len(encoded_answer['input_ids'][0])!=3):
  ind1, ind2 = get_bornes(answer_matrix[c]['text'],[c] )
  #print(text[ind1:ind2])
if (len(encoded_answer['input_ids'][0])==3):
  ind1 , id2= get_bornes(answer_matrix[c]['text'], encoded_train['input_ids'][c] )
  #print(text[ind1:id2])

encoded_answer [101, 1565, 102]
answre [1565]
text ['[CLS]', 'At', 'the', '52', '##nd', 'Annual', 'Grammy', 'Awards', ',', 'Beyoncé', 'received', 'ten', 'nominations', ',', 'including', 'Album', 'of', 'the', 'Year', 'for', 'I', 'Am', '.', '.', '.', 'Sasha', 'Fi', '##er', '##ce', ',', 'Record', 'of', 'the', 'Year', 'for', '"', 'Hal', '##o', '"', ',', 'and', 'Song', 'of', 'the', 'Year', 'for', '"', 'Single', 'Ladies', '(', 'Put', 'a', 'Ring', 'on', 'It', ')', '"', ',', 'among', 'others', '.', 'She', 'tied', 'with', 'Lau', '##ryn', 'Hill', 'for', 'most', 'Grammy', 'nominations', 'in', 'a', 'single', 'year', 'by', 'a', 'female', 'artist', '.', 'In', '2010', ',', 'Beyoncé', 'was', 'featured', 'on', 'Lady', 'Gaga', "'", 's', 'single', '"', 'Telephone', '"', 'and', 'its', 'music', 'video', '.', 'The', 'song', 'topped', 'the', 'US', 'Pop', 'Songs', 'chart', ',', 'becoming', 'the', 'sixth', 'number', '-', 'one', 'for', 'both', 'Beyoncé', 'and', 'Gaga', ',', 'tying', 'them', 'with', 'Maria', '##

In [None]:
from tqdm import tqdm 
def add_start_end_pos (encoded_data , answers): 
  contexts = encoded_data['input_ids']
  end_pos= []
  start_pos = []
  for index, answer in tqdm(enumerate (answers)): 
    answ       = answer['text']
    context    = contexts[index]
    #print(answ, index)
    start, end = get_bornes ( answ, context)
    
    start_pos.append(start)
    end_pos.append(end)
 
  encoded_data.update(
      {'start_positions' : start_pos, 
       'end_positions': end_pos
       }
  )

add_start_end_pos(encoded_train, answer_matrix)
add_start_end_pos(encoded_val, answer_matrix_v)


30000it [00:15, 1913.26it/s]
10570it [00:08, 1209.46it/s]


## Model Creation 

In [None]:
data_train = Dataset(encoded_train)
data_val   = Dataset(encoded_val)

In [None]:
from transformers import BertForQuestionAnswering
model = BertForQuestionAnswering.from_pretrained("bert-base-cased")


Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

## Training

In [None]:
#### Fine tuning 

from torch.utils.data import DataLoader 
from transformers import AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()
optim = AdamW(model.parameters(), lr = 5e-5)



In [None]:
!nvidia-smi

Tue Jun  7 23:34:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    25W /  70W |   1854MiB / 15109MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
train_loader = DataLoader( data_train ,batch_size= 10, shuffle= True)
val_loader   = DataLoader( data_val   ,batch_size= 10, shuffle= True)

In [None]:
""" from https://debuggercafe.com/saving-and-loading-the-best-model-in-pytorch/ """

import torch
import matplotlib.pyplot as plt
plt.style.use('ggplot')
class SaveBestModel:
    """
    Class to save the best model while training. If the current epoch's 
    validation loss is less than the previous least less, then save the
    model state.
    """
    def __init__(
        self, path, best_valid_loss=float('inf')
    ):
        self.best_valid_loss = best_valid_loss
        self.path = path 
        
    def __call__(
        self, current_valid_loss, 
        epoch, model, optimizer
    ):
        
        if current_valid_loss < self.best_valid_loss:
            self.best_valid_loss = current_valid_loss
            #print(f"\nBest validation loss: {self.best_valid_loss}")
            #print(f"\nSaving best model for epoch: {epoch+1}\n")
            #torch.save(model.state_dict(), '/content/gdrive/MyDrive/model_epoch200_tres.pt')
            torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': current_valid_loss,
                }, self.path)
# save on collab
#torch.save(model.state_dict(), '/content/gdrive/MyDrive/model_epoch200_tres.pt')

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')    
#saver1         = SaveBestModel('/content/gdrive/MyDrive/model_1.pth')

Mounted at /content/gdrive


In [None]:
from tqdm import tqdm
epochs = 2 

for epoch in range (epochs):
  loop = tqdm(train_loader)
  for batch in loop:   
    optim.zero_grad()
    input_ids   = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_position  = batch['start_positions'].to(device)
    end_position    = batch['end_positions'].to(device)

    outputs = model(input_ids, attention_mask = attention_mask,  
                     start_positions = start_position,
                     end_positions =end_position  )
    
    loss = outputs[0]
    loss.backward()
    optim.step()
    #saver1.__call__(
    #    loss, 
    #    epoch,
    #    model, 
    #    optim
    #    )

    loop.set_description(f'Epoch {epoch}')
    loop.set_postfix(loss = loss.item())


Epoch 0: 100%|██████████| 3000/3000 [46:18<00:00,  1.08it/s, loss=1.27]
Epoch 1: 100%|██████████| 3000/3000 [46:22<00:00,  1.08it/s, loss=0.817]


In [None]:
model_path = '/content/gdrive/MyDrive/model/bert-custom'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/gdrive/MyDrive/model/bert-custom/tokenizer_config.json',
 '/content/gdrive/MyDrive/model/bert-custom/special_tokens_map.json',
 '/content/gdrive/MyDrive/model/bert-custom/vocab.txt',
 '/content/gdrive/MyDrive/model/bert-custom/added_tokens.json')

## Evaluation 

In [None]:
model.eval()

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

In [None]:

loop = tqdm(val_loader)
acc_start=[]
acc_end= []
for batch in tqdm (val_loader): 
  with torch.no_grad():
    input_ids   = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_true  = batch['start_positions'].to(device)
    end_true    = batch['end_positions'].to(device)

    outputs = model(input_ids, attention_mask = attention_mask)
    
    start_pred = torch.argmax(outputs['start_logits'], dim = 1)
    end_pred   = torch.argmax(outputs['end_logits'], dim = 1)
    
    acc_start.append(((start_pred == start_true).sum()/len(start_pred)).item())
    acc_end.append(((end_pred == end_true).sum()/len(start_pred)).item())


  0%|          | 0/1057 [00:00<?, ?it/s][A

  0%|          | 0/1057 [1:07:00<?, ?it/s]


  0%|          | 1/1057 [00:00<06:02,  2.92it/s][A[A

  0%|          | 2/1057 [00:00<05:31,  3.18it/s][A[A

  0%|          | 3/1057 [00:00<05:22,  3.27it/s][A[A

  0%|          | 4/1057 [00:01<05:18,  3.30it/s][A[A

  0%|          | 5/1057 [00:01<05:16,  3.33it/s][A[A

  1%|          | 6/1057 [00:01<05:13,  3.35it/s][A[A

  1%|          | 7/1057 [00:02<05:13,  3.35it/s][A[A

  1%|          | 8/1057 [00:02<05:13,  3.35it/s][A[A

  1%|          | 9/1057 [00:02<05:11,  3.36it/s][A[A

  1%|          | 10/1057 [00:03<05:10,  3.37it/s][A[A

  1%|          | 11/1057 [00:03<05:10,  3.37it/s][A[A

  1%|          | 12/1057 [00:03<05:09,  3.38it/s][A[A

  1%|          | 13/1057 [00:03<05:09,  3.38it/s][A[A

  1%|▏         | 14/1057 [00:04<05:08,  3.38it/s][A[A

  1%|▏         | 15/1057 [00:04<05:09,  3.36it/s][A[A

  2%|▏         | 16/1057 [00:04<05:08,  3.37it/s][A[A

  2%|▏

In [None]:
print("======== Results of accuracy on the validation set==================")
print( "accuracy for the start position : " , sum(acc_start)/len(acc_start) ) 
print( "accuracy for the end position   : " , sum(acc_end)/len(acc_end) ) 
b = acc_start.copy()+acc_end
print("__________________________________")
print( "accuracy for the all set   : " , sum(b)/len(b) ) 


accuracy for the start position :  0.6073793820268496
accuracy for the end position   :  0.6028382284469839
___________________________________
accuracy for the all set   :  0.6051088052369168


**==============================================================================================================**

$$\text{So with bert we get around 60 % of accuracy on the all positions}$$

**==============================================================================================================**

> **Problem 3.2** *(2 points)* Try your own context/questions and find two failure cases. Explain why you think the model got them wrong.

In [None]:
from tqdm import tqdm 

In [None]:
my_context = [" I am a cat who lives in a flat between a bakery and a bank. My owner is a nice old lady, her name is Marry who has grey hair and glases. The neighbourg has a doc which I hate, the best fiend of the old lady has a rabbit which is my friend. ", 
              " Korea is a country of Asia, located next to Japan", 
              "Dog are disgusting while Cats are beautiful. Cats are way better than all the Dogs. "]
my_question = ["Who is the owner of the cat ? ", 
                "Where is Korea? ", 
               "Who is the best beteween Cats and Dogs ?  "]

encoded = tokenizer(my_context, my_question, truncation=True,  padding='max_length')
model_path = '/content/gdrive/MyDrive/model/bert-custom'
model_BERT = BertForQuestionAnswering.from_pretrained(model_path)
Bert_tokenizer = BertTokenizer.from_pretrained(model_path)

my_data = Dataset(encoded)
my_loader = DataLoader( my_data)
start_pred=[]
end_pred  =[]
for batch in tqdm (my_loader): 
  with torch.no_grad():
    input_ids      = batch['input_ids']
    attention_mask = batch['attention_mask']

    outputs = model_BERT(input_ids, attention_mask = attention_mask)
    start_pred.append(torch.argmax(outputs['start_logits'], dim = 1))
    end_pred.append(torch.argmax(outputs['end_logits'], dim = 1))

100%|██████████| 3/3 [00:06<00:00,  2.02s/it]


In [None]:
answer1 = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0][int(start_pred[0]):(int(end_pred[0])+1)])
answer2 = tokenizer.convert_ids_to_tokens(encoded['input_ids'][1][int(start_pred[1]):(int(end_pred[1])+1)])
answer3 = tokenizer.convert_ids_to_tokens(encoded['input_ids'][2][int(start_pred[2]):(int(end_pred[2])+1)])

print("Question 1 : ", my_question[0])
print("answer: ", " ".join(answer1))
print("Question 2 :", my_question[1])
print("answer : ", " ".join(answer2))
print("Question 3 :", my_question[2])
print("answer : ", " ".join(answer3))

Question 1 :  Who is the owner of the cat ? 
answer:  Mar ##ry
Question 2 : Where is Korea? 
answer :  Asia ,
Question 3 : Who is the best beteween Cats and Dogs ?  
answer :  Dog are


Here BERT Bert manadge to find the answer for the first and second question.
For the question 3 BERT simply gets the oposite result, the answer was not dog but cats. 

## Truncation strategies 

> **Problem 3.3** *(2 points)* Can we do better than truncating tokens if the input length is too long? Suggest (but do not code) a strategy for a problem like SQuAD when the input has an arbitrary length with a pretrained model like BERT that has a predefined input length.

We can find to strategies the first one is to stick with truncature and the second one is to reduce the size of the texte without any truncature. 



1° - Truncature : 
This method would consist in reusing the tuncated part as anoter context. It appears like a subcontext. Like that we don't lost and aswer wich might be in the truncated part of the text. We can add truncation strategie to this method. One of the stratgie is to avoid as much as possible the [PAD] token by overlapping context one one another. Like in the following exemple : 
text(ex: 5, "I have a brown cat") longer than max_length(for ex:3), then instext of simply producing 2 subcontext with one with which is padded like a we go for b 

$$ \text{a: [("I have a"), ("brown cat [pad]")] becomes b:[("I have a"), ("have a brown"), ("a brown cat")]} $$

Like that we keep as mush information as possible in each context. 




2° - Summarization : 
An unbiased summarization model is potentially able to extract important parts of the text regardless of their original position, resulting in a highly generalized shortening method. With this method we could shortened the the size of the imput to precisely Max_Bert_Size - 2 (cause we have to had the start and end token). The summarization   

# 3. Machine Reading Comprehension with T5

Here, you will instead formulate machine reading comprehension as a generation problem, which means the answer is generated on the decoder side given the inputs on the encoder side. There are several choices for pretrained language models but you will use T5 here, as you have been familiarized in Lab 10.

In [None]:
"""https://www.youtube.com/watch?v=r6XY80Z9eSA"""

'https://www.youtube.com/watch?v=r6XY80Z9eSA'

> **Problem 3.1** *(3 points)* Finetune `t5-small` model for `squad` question answering dataset and report the accuracy on the validation set. For convenience, ignore examples that have more than 256 tokens after tokenization. 

# Imports 

In [None]:
! pip install -q tokenizer

[?25l[K     |████▏                           | 10 kB 29.8 MB/s eta 0:00:01[K     |████████▎                       | 20 kB 24.8 MB/s eta 0:00:01[K     |████████████▌                   | 30 kB 12.4 MB/s eta 0:00:01[K     |████████████████▋               | 40 kB 9.6 MB/s eta 0:00:01[K     |████████████████████▊           | 51 kB 4.7 MB/s eta 0:00:01[K     |█████████████████████████       | 61 kB 5.6 MB/s eta 0:00:01[K     |█████████████████████████████   | 71 kB 6.1 MB/s eta 0:00:01[K     |████████████████████████████████| 78 kB 3.9 MB/s 
[?25h

In [None]:
! pip install -q transformers

In [None]:
!pip install -q transformers datasets

In [None]:
!pip install -q pytorch-lightning==1.1.3

[K     |████████████████████████████████| 680 kB 5.5 MB/s 
[K     |████████████████████████████████| 829 kB 57.7 MB/s 
[?25h  Building wheel for future (setup.py) ... [?25l[?25hdone


In [None]:
### Imports
from transformers import AdamW
from transformers import T5ForConditionalGeneration
from transformers import T5TokenizerFast as T5Tokenizer

import torch 
import numpy  as np
import pandas as pd


from tqdm import tqdm

from torch.utils.data import Dataset, DataLoader

from datasets import load_dataset
from datasets import load_metric

from pprint import pprint

squad_dataset = load_dataset('squad')

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

# Dataset :

In [None]:
### Dataset creation 

# training set 
contexts  = squad_dataset['train']['context']
questions = squad_dataset['train']['question']
answers   = squad_dataset['train']['answers']
print(answers[0]['text'])

# validation set 
contexts_v = squad_dataset['validation']['context']
questions_v= squad_dataset['validation']['question']
answers_v  = squad_dataset['validation']['answers']

## Add the end index to the answers and correct the wrong start index : 

def start_end_index(contexts, answers): 
  """ Be carefull answers can be either a list of plosible elements or a list of one element 
      We add in the answer dictionnairy the start index and the endex index of the answer in 
      the context, the transformer requires it.  """
  # firstly we goes throught each of them : 
  for context, answer in zip (contexts, answers): 
    answer_text = answer['text']
    answer_start= answer['answer_start'][0]
    answer_end  = answer_start + len(answer_text)  # ideal end token if there is no mistakes in the start token
    if len(context[answer_start:answer_end]) == len(answer_text): # every thing is fine
      answer['answer_end']= [answer_end]
    else:  
      for index_shift in [1,2]:  # it is said that it is usually shif from one or two characters : 
        if context[answer_start-index_shift : answer_end - index_shift] == len(answer_text): # every thing is fine
          answer['answer_start']= [answer_start-index_shift]
          answer['answer_end']= [answer_end -index_shift]
          print(answer)

        elif context[answer_start + index_shift :answer_end + index_shift] == len(answer_text): # every thing is fine
          answer['answer_start']= [answer_start+index_shift]
          answer['answer_end']= [answer_end + index_shift]

# Add of the end in the training and validation set 
start_end_index(contexts  , answers)
start_end_index(contexts_v, answers_v)

#Small test
print("question: ", questions[1000])
print("answer :  ", answers[1000])
print("contex :  " , contexts[1000])

['Saint Bernadette Soubirous']
question:  How much did Beyonce initially contribute to the foundation?
answer :   {'text': ['$250,000'], 'answer_start': [190], 'answer_end': [191]}
contex :   After Hurricane Katrina in 2005, Beyoncé and Rowland founded the Survivor Foundation to provide transitional housing for victims in the Houston area, to which Beyoncé contributed an initial $250,000. The foundation has since expanded to work with other charities in the city, and also provided relief following Hurricane Ike three years later.


In [None]:
print("answer :  ", answers[1000].items())

answer :   dict_items([('text', ['$250,000']), ('answer_start', [190]), ('answer_end', [191])])


As we ca see the dictionnairy with the answers now contains the strat index and the endex index of the aswer in the context. 

In [None]:
# Made a clean panda dataset 

def datacontructor(questions, answers, contexts): 
  """ This methode creates a clean panda dataframe with all the info we will 
  feed to the Transformer dataframe"""
  data = []
  for question, context, answer in zip (questions, contexts, answers): 
    text         = answer["text"][0] 
    answer_start = answer["answer_start"]
    answer_end   = answer["answer_end"]
    data.append({
        "question" : question, 
        "context"  : context, 
        "text"          : text,
        "answer_start"  : answer_start,
        "answer_end"    : answer_end
    })
  return pd.DataFrame(data)

In [None]:
dataset = datacontructor(questions, answers, contexts)

In [None]:
pprint(dataset.head(10))

                                            question  \
0  To whom did the Virgin Mary allegedly appear i...   
1  What is in front of the Notre Dame Main Building?   
2  The Basilica of the Sacred heart at Notre Dame...   
3                  What is the Grotto at Notre Dame?   
4  What sits on top of the Main Building at Notre...   
5  When did the Scholastic Magazine of Notre dame...   
6   How often is Notre Dame's the Juggler published?   
7  What is the daily student paper at Notre Dame ...   
8  How many student news papers are found at Notr...   
9  In what year did the student paper Common Sens...   

                                             context  \
0  Architecturally, the school has a Catholic cha...   
1  Architecturally, the school has a Catholic cha...   
2  Architecturally, the school has a Catholic cha...   
3  Architecturally, the school has a Catholic cha...   
4  Architecturally, the school has a Catholic cha...   
5  As at most other universities, Notre Dame's 

In [None]:
len(dataset)

87599

# Tokenizer

Import of the T5 tokenizer

In [None]:
### Model Creation 
# We use the base T5 model wich is a lighter one 
tokenizer = T5Tokenizer.from_pretrained("t5-base")

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


## Tokenizer 

Small test with the tokenizer : 

In [None]:
test = tokenizer("I am a lovely cat", "I am a lively frog")  # no padding that what there are only ones in the attention mask 
test  # 1 is the end of sequence token 

{'input_ids': [27, 183, 3, 9, 3061, 1712, 1, 27, 183, 3, 9, 18399, 3, 89, 3822, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
decoded_sentence = [ tokenizer.decode(input, skip_special_tokens=True, clean_up_tokenization_spaces=True) # 
                    for input in test['input_ids']]
print(decoded_sentence)
print(" ".join(decoded_sentence)) # we can reform the sentence

['I', 'am', '', 'a', 'lovely', 'cat', '', 'I', 'am', '', 'a', 'lively', '', 'f', 'rog', '']
I am  a lovely cat  I am  a lively  f rog 


In [None]:
# special tokens
print(tokenizer.special_tokens_map)

{'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_id_53

T5 model on the Squad Dataset : 

## Dataset class

In [None]:
# tokenisation of the dataset 

# for the answers , we have to change the 0 of the padding 
# to -100 we have to change them to -100

class SquadDataset(Dataset): 
  """ Class which build the dataset using buy extention of the Dataset module of pytorch, 
  it takes a Dataframe into input
  """
  def __init__(
      self, 
      data : pd.DataFrame, 
      tokenizer: T5Tokenizer, 
      len_max_token_source : int = 396  ,
      len_max_token_target : int = 40
  ): 
    self.tokenizer = tokenizer
    self.data      = data
    self.len_max_token_source = len_max_token_source
    self.len_max_token_target = len_max_token_target

  def __len__(self): 
     return (len(self.data))
  
  def __getitem__(self, index: int): 
    data_ligne = self.data.iloc[index] # to get theitem of location idex 

    # we use the tokenizer 
    encoded_source = tokenizer(
        data_ligne["question"],
        data_ligne["context"],
        max_length = self.len_max_token_source,
        padding = "max_length",
        truncation = "only_second",
        return_attention_mask = True, 
        add_special_tokens = True, 
        return_tensors = "pt"

    )

    encoded_target = tokenizer(
        data_ligne["text"],
        max_length = self.len_max_token_target,
        padding = "max_length",
        truncation = True,
        return_attention_mask = True, 
        add_special_tokens = True, 
        return_tensors = "pt"
    )

    labels = encoded_target["input_ids"]
    labels[labels ==0 ] = -100 #the labels, here the answers must have their label =-100

    return (dict(
        question = data_ligne['question'], 
        context  = data_ligne['context'], 
        text     = data_ligne['text'],   # the answer
        input_ids= encoded_source ['input_ids'].flatten(),
        attention_mask_source= encoded_source['attention_mask'].flatten(), 
        labels = labels.flatten()
    ))

## Small Test 

In [None]:
test_dataset = SquadDataset(dataset, tokenizer)


In [None]:
print(dict(dataset.head(1))["text"])

0    Saint Bernadette Soubirous
Name: text, dtype: object


In [None]:
print(test_dataset.__getitem__(0))

{'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'text': 'Saint Bernadette Soubirous', 'input_ids': tensor([  304,  4068,   410,     8, 16823,  3790,     3, 18280,  2385,    16,
          507,  3449,    16,   301,  1211,  1395,  1410,    58,     1, 3

In [None]:
# retreive the infos of our dataset 
for data in test_dataset : 
  print(data["question"])    # question
  print(data['text'])
  print(data['input_ids'][:20])
  print(data['labels'])      # answer
  break 

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Saint Bernadette Soubirous
tensor([  304,  4068,   410,     8, 16823,  3790,     3, 18280,  2385,    16,
          507,  3449,    16,   301,  1211,  1395,  1410,    58,     1, 30797])
tensor([2788, 8942,    9,   26, 1954,  264, 8371, 8283,    1, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100])


## Creation of the dataframes - dataset 

Creation of the two dataframe which will be converted in dataset in the following contructing class

In [None]:
def generate_answer(question): 
  source_encoding = tokenizer(
      question["question"], 
      questions 
  )

In [None]:
### train dataframe
train_dataframe      = datacontructor(questions, answers, contexts)

### validation dataframe
validation_dataframe =  datacontructor(questions_v, answers_v, contexts_v)

## Tokenization of the dataframe

In [None]:
test = tokenizer ( "question :" + train_dataframe.iloc[1]["question"] + "context :" + train_dataframe.iloc[1]["context"])
print(test)

{'input_ids': [822, 3, 10, 5680, 19, 16, 851, 13, 8, 7711, 3, 17084, 5140, 5450, 58, 1018, 6327, 3, 10, 16768, 450, 1427, 6, 8, 496, 65, 3, 9, 6502, 1848, 5, 71, 2916, 8, 5140, 5450, 31, 7, 2045, 22161, 19, 3, 9, 7069, 12647, 13, 8, 16823, 3790, 5, 3, 29167, 16, 851, 13, 8, 5140, 5450, 11, 5008, 34, 6, 19, 3, 9, 8658, 12647, 13, 2144, 28, 6026, 3, 76, 24266, 28, 8, 9503, 96, 553, 15, 7980, 1980, 1212, 13285, 1496, 1280, 3021, 12, 8, 5140, 5450, 19, 8, 23711, 2617, 13, 8, 3, 24756, 6219, 5, 3, 29167, 1187, 8, 20605, 2617, 19, 8, 8554, 17, 235, 6, 3, 9, 17535, 286, 13, 7029, 11, 9619, 5, 94, 19, 3, 9, 16455, 13, 8, 3, 3844, 17, 235, 44, 301, 1211, 1395, 6, 1410, 213, 8, 16823, 3790, 3, 28285, 26, 120, 4283, 12, 2788, 8942, 9, 26, 1954, 264, 8371, 8283, 16, 507, 3449, 5, 486, 8, 414, 13, 8, 711, 1262, 41, 232, 16, 3, 9, 1223, 689, 24, 1979, 7, 190, 220, 12647, 7, 11, 8, 2540, 10576, 15, 201, 19, 3, 9, 650, 6, 941, 3372, 12647, 13, 3790, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1,

In [None]:
def dataframe_tokenizer(dataframe): 
  # creation of the sequence 
  input_sequences = []
  for i in range(0,len(dataframe)):
      sequence = 'question: '+dataframe.iloc[i]["question"] +' context: '+dataframe.iloc[i]["context"]
      input_sequences.append(sequence)
  
  max_source_length = 512
  max_target_length = 128

  # encoding
  encoding = tokenizer(   # T5 tokenizer
    input_sequences,
    padding="longest",
    max_length=max_source_length,
    truncation=True,
    return_tensors="pt",
  )

  target_encoding = tokenizer(
    dataframe["text"].tolist() ,
    padding="longest",
    max_length=max_target_length, 
    truncation=True, 
    return_tensors="pt",

  )

  labels = target_encoding.input_ids
  labels[labels == tokenizer.pad_token_id] = -100

  input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
  return ({"input_ids"     : input_ids, 
           "attention_mask": attention_mask,
           "labels"        : labels})

train_input       = dataframe_tokenizer(train_dataframe)
validation_inputs = dataframe_tokenizer(train_dataframe)

In [None]:
train_input["input_ids"][0:10]

tensor([[822,  10, 304,  ...,   0,   0,   0],
        [822,  10, 363,  ...,   0,   0,   0],
        [822,  10,  37,  ...,   0,   0,   0],
        ...,
        [822,  10, 363,  ...,   0,   0,   0],
        [822,  10, 571,  ...,   0,   0,   0],
        [822,  10,  86,  ...,   0,   0,   0]])

In [None]:
(train_input["labels"][0:400])  ## we have -100 like that it know it is labels 

tensor([[ 2788,  8942,     9,  ...,  -100,  -100,  -100],
        [    3,     9,  8658,  ...,  -100,  -100,  -100],
        [    8,  5140,  5450,  ...,  -100,  -100,  -100],
        ...,
        [   37, 30979,     3,  ...,  -100,  -100,  -100],
        [30979,     3, 15291,  ...,  -100,  -100,  -100],
        [  381,   662,     1,  ...,  -100,  -100,  -100]])

# T5 Model

## Training 

In [None]:
n_epoch = 6
batch_size = 10
batch_number = 10
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer =AdamW(model.parameters(), lr=0.0001)
for i in (range(3)):
  print('\n')
  for j in range(0,10):
    
    optimizer.zero_grad()
    loss = model(
                 input_ids     = train_input     ["input_ids"][j*batch_size:((j+1)*batch_size)], 
                 attention_mask= train_input["attention_mask"][j*batch_size:((j+1)*batch_size)], 
                 labels        = train_input        ["labels"][j*batch_size:((j+1)*batch_size)]
                 ).loss
    loss.backward()
    optimizer.step()
  print("Epoch number ",  i ,":____________")
  print(loss.item(), '\n')
  

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]





Epoch number  0 :____________
0.3186635673046112 



Epoch number  1 :____________
0.07763504981994629 



Epoch number  2 :____________
0.030573375523090363 



In [None]:
model_path = '/content/gdrive/MyDrive/model/T5-custom'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/gdrive/MyDrive/model/T5-custom/tokenizer_config.json',
 '/content/gdrive/MyDrive/model/T5-custom/special_tokens_map.json',
 '/content/gdrive/MyDrive/model/T5-custom/tokenizer.json')

## Validation 

In [None]:
model_path = '/content/gdrive/MyDrive/model/T5-custom'
T5_tokenizer = T5Tokenizer.from_pretrained(model_path)
model_T5 = T5ForConditionalGeneration.from_pretrained(model_path)

def validation(train_dataframe):  
  formatted_predictions = list()
  references = list()
  for index in range(len(train_dataframe)):
      data_ligne = train_dataframe.iloc[index]
      
      inputs           = tokenizer('question: '+data_ligne["question"] +' context: '+data_ligne["context"]
                                  , return_tensors="pt"
                                  , padding=True)
      output_sequences = model_T5.generate(
                input_ids      = inputs["input_ids"],
                attention_mask = inputs["attention_mask"],
                do_sample      = False,  # disable sampling to test if batching affects output
                )

      dict_result_T5 = {'id':index, 
                        'prediction_text': tokenizer.batch_decode(output_sequences, skip_special_tokens=True)[0]}

      dict_reality   = {'answers':{
                                    'answer_start':data_ligne["answer_start"]*3, 
                                    'text':data_ligne["text"]*3},
                        'id': index
      }
      formatted_predictions.append(dict_result_T5)
      references.append(dict_reality)
  metric = load_metric("squad")
  print(metric.compute(predictions=formatted_predictions, references=references))
  return (formatted_predictions,references )

formatted_predictions, references = validation(validation_dataframe)

Token indices sequence length is longer than the specified maximum sequence length for this model (640 > 512). Running this sequence through the model will result in indexing errors


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

{'exact_match': 0.8703878902554399, 'f1': 0.4862559908091554}


In [None]:
metric = load_metric("squad")
print(metric.compute(predictions=formatted_predictions, references=references))

{'exact_match': 0.8703878902554399, 'f1': 0.4862559908091554}


The scores are really low because usually T5 get one are two other word and not the exact answer. Nevertheless those score are strangely low so we look at manulaly the percentage of exact match and the pourcentage of time the answer proposed by T5 is in the real answer. 

In [None]:
exact_match = 0
real_answeri_in_T5 = 0 
T5_in_real = 0
size = 500
for i in tqdm(range (0, size)): 
  data_ligne = validation_dataframe.iloc[i]
  inputs           = T5_tokenizer('question: '+data_ligne["question"] +' context: '+data_ligne["context"]
                                    , return_tensors="pt"
                                    , padding=True)
  output_sequences = model_T5.generate(
                  input_ids      = inputs["input_ids"],
                  attention_mask = inputs["attention_mask"],
                  do_sample      = False,  # disable sampling to test if batching affects output
                  )

  predicted_text_T5 = T5_tokenizer.batch_decode(output_sequences, skip_special_tokens=True)[0]
  #print("======================================")
  #print("reel answer: " , data_ligne["text"])
  #print(" T5 answer : ", predicted_text_T5 , '\n')
  if (data_ligne["text"] ==predicted_text_T5 ):
    exact_match+=1
  if (predicted_text_T5 in data_ligne["text"] ):
    T5_in_real += 1
  if ( data_ligne["text"] in predicted_text_T5):
     real_answeri_in_T5 += 1
print('\n')
print( "exact match :" , exact_match*100 / size , "%")
print( "T5 answer is present in the real answer :" , T5_in_real*100 / size ,"%")
print( "The real answer in in T5" , real_answeri_in_T5*100 / size ,"%")

100%|██████████| 500/500 [01:56<00:00,  4.28it/s]



exact match : 64.6 %
T5 answer is present in the real answer : 71.6 %
The real answer in in T5 76.6 %





**==============================================================================================================**

So as we can see above : 
$$\text{ 64.6% of the time the answer and T5 answer are exactly the same. }$$
$$\text{And in 71.6% of the time the aswer proposed by T5 is in the real answer, and 76.6% of the time the real answer is in T5  }$$

**==============================================================================================================**

# Comparison between BERT and T5 

> **Problem 3.2** *(2 points)* Compare a few examples between BERT and T5 and explain how they differ. Note that `T5-small` is a smaller model than `BERT-base-cased`. 

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')    

Mounted at /content/gdrive


In [None]:
## Make a dataset class to use the dataloader method after : 
import torch 

class Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  
  def __getitem__(self, index) -> dict: 
    return ({ key: torch.tensor(value[index]) for key, value in self.encodings.items() })
  
  def __len__(self)->int: 
    return (len(self.encodings.input_ids))

In [None]:
### Examples with Bert : 

# load the model 
model_path = '/content/gdrive/MyDrive/model/bert-custom'
model_BERT = BertForQuestionAnswering.from_pretrained(model_path)
Bert_tokenizer = BertTokenizer.from_pretrained(model_path)

encoded   = Bert_tokenizer(context_matrix_v[0:20], questions_matrix_v[0:20],truncation=True, padding='max_length')


test_data = Dataset(encoded)
test_loader = DataLoader( test_data)
start_pred=[]
end_pred  =[]
for batch in tqdm (test_loader): 
  with torch.no_grad():
    input_ids      = batch['input_ids']
    attention_mask = batch['attention_mask']
    outputs = model_BERT(input_ids, attention_mask = attention_mask)
    start_pred.append(torch.argmax(outputs['start_logits'], dim = 1))
    end_pred.append(torch.argmax(outputs['end_logits'], dim = 1))


model_path = '/content/gdrive/MyDrive/model/T5-custom'
T5_tokenizer = T5Tokenizer.from_pretrained(model_path)
model_T5 = T5ForConditionalGeneration.from_pretrained(model_path)


for i in range (0, 20): 
  data_ligne = validation_dataframe.iloc[i]
  inputs           = T5_tokenizer('question: '+data_ligne["question"] +' context: '+data_ligne["context"]
                                    , return_tensors="pt"
                                    , padding=True)
  output_sequences = model_T5.generate(
                  input_ids      = inputs["input_ids"],
                  attention_mask = inputs["attention_mask"],
                  do_sample      = False,  # disable sampling to test if batching affects output
                  )

  predicted_text_T5 = T5_tokenizer.batch_decode(output_sequences, skip_special_tokens=True)[0]
  predicted_text_BERT = " ".join(Bert_tokenizer.convert_ids_to_tokens(encoded['input_ids'][i][int(start_pred[i]):(int(end_pred[i])+1)]))
  print("======================================")
  print("reel answer: " , data_ligne["text"])
  print("BERT answer: ", predicted_text_BERT)
  print(" T5 answer : ", predicted_text_T5 , '\n')


100%|██████████| 20/20 [00:31<00:00,  1.58s/it]


reel answer:  Denver Broncos
BERT answer:  Denver Broncos
 T5 answer :  Denver Broncos 

reel answer:  Carolina Panthers
BERT answer:  
 T5 answer :  Carolina Panthers 

reel answer:  Santa Clara, California
BERT answer:  San Francisco
 T5 answer :  San Francisco Bay Area 

reel answer:  Denver Broncos
BERT answer:  Denver Broncos
 T5 answer :  American Football Conference 

reel answer:  gold
BERT answer:  golden anniversary
 T5 answer :  gold 

reel answer:  "golden anniversary"
BERT answer:  golden anniversary
 T5 answer :  the "golden anniversary" 

reel answer:  February 7, 2016
BERT answer:  February 7
 T5 answer :  February 7, 2016 

reel answer:  American Football Conference
BERT answer:  Super Bowl 50 was an American football game to determine the champion of the National Football
 T5 answer :  Super Bowl L 

reel answer:  "golden anniversary"
BERT answer:  golden anniversary
 T5 answer :  the "golden anniversary" 

reel answer:  American Football Conference
BERT answer:  Amer

We can see that T5 tend to take more words into the answer, one or two tokens before and after the answer as with those exemples: 

 **reel answer:**  Santa Clara

 **T5 answer :**Santa Clara, California

**reel answer:** 2015

**T5 answer :** the 2015 season

Where the answer is correct but not exactly what is expected. 

> **Problem 3.3** *(2 points)* Generation model is known to often suffer from hallucination (see https://ehudreiter.com/2018/11/12/hallucination-in-neural-nlg/). Can you find or create one example that causes this phenomenon in T5? How does BERT behave with the same example?

In [None]:
print(questions_matrix_v[3999])
pprint(context_matrix_v[3999])
print(answer_matrix_v[3999])

encoded   = Bert_tokenizer(context_matrix_v[3999:4000], questions_matrix_v[3999:4000],truncation=True, padding='max_length')

#BERT
test_data = Dataset(encoded)
test_loader = DataLoader( test_data)
start_pred=[]
end_pred  =[]
for batch in  (test_loader): 
  with torch.no_grad():
    input_ids      = batch['input_ids']
    attention_mask = batch['attention_mask']
    outputs = model_BERT(input_ids, attention_mask = attention_mask)
    start_pred.append(torch.argmax(outputs['start_logits'], dim = 1))
    end_pred.append(torch.argmax(outputs['end_logits'], dim = 1))
predicted_text_BERT = " ".join(Bert_tokenizer.convert_ids_to_tokens(encoded['input_ids'][0][int(start_pred[0]):(int(end_pred[0])+1)]))

#T5
data_ligne = validation_dataframe.iloc[3999]
inputs           = T5_tokenizer('question: '+data_ligne["question"] +' context: '+data_ligne["context"]
                                    , return_tensors="pt"
                                    , padding=True)
output_sequences = model_T5.generate(
                  input_ids      = inputs["input_ids"],
                  attention_mask = inputs["attention_mask"],
                  do_sample      = False,  # disable sampling to test if batching affects output
                  )
predicted_text_T5 = T5_tokenizer.batch_decode(output_sequences, skip_special_tokens=True)[0]

print("BERT",predicted_text_BERT)
print("T5  ",  predicted_text_T5)

Was the testing of the LM during Apollo 5 a failure or a success?
('Apollo 5 (AS-204) was the first unmanned test flight of LM in Earth orbit, '
 'launched from pad 37 on January 22, 1968, by the Saturn IB that would have '
 'been used for Apollo 1. The LM engines were successfully test-fired and '
 'restarted, despite a computer programming error which cut short the first '
 'descent stage firing. The ascent engine was fired in abort mode, known as a '
 '"fire-in-the-hole" test, where it was lit simultaneously with jettison of '
 'the descent stage. Although Grumman wanted a second unmanned test, George '
 'Low decided the next LM flight would be manned.')
{'text': ['success', 'success', 'LM engines were successfully test-fired and restarted', 'successfully'], 'answer_start': [194, 194, 178, 194]}
BERT failure or a success
T5   failure


We can see here that BERT and T5 come up with an answer even if the answer is not present in the text. More over T5 got it wrong and BERT doesn't answer properly to the question. That is a caise of allucination. 

# 4. Dense Retrieval

Remember that in Assignment 2, you created a simple dense retrieval model that just averages BERT embeddings? In this part of the assignment, you will create a more practical dense retrieval system with the `[cls]` output of BERT. This means you use the BERT encoder to map each quesiton to a vector and the corresponding paragraph to another vector, and train it so that their inner product is higher than that of unrelated pairs. There are several ways to approach this problem and you are free to design however you like (just note that you need to use inner product, not L2). Note that Dense Passage Retrieval (https://arxiv.org/abs/2004.04906), and in particular In-Batch Negative training, is one of the simplest and effective ways and it is highly recommended to refer to it.

> **Problem 4.1** *(3 points)* Create a BERT-based (output, not just the embedding) dense retrieval model that finetunes on SQuAD and evaluate it on the validation data, in a similar setting to that of Assignment 2 (i.e. measure Recall@10).

## Import 

In [7]:
!pip install -q sentence-transformers datasets transformers

In [8]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset
import torch.nn.functional as F
import torch.optim as optim

## Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

Downloading:   0%|          | 0.00/399 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
light_contexts  = context[0:1000]
light_questions = questions[0:1000]

In [None]:
# Tokenize sentences
encoded_context  = tokenizer(light_contexts, 
                             padding=True, 
                             truncation=True, 
                             return_tensors='pt')

encoded_question = tokenizer(light_questions, 
                             padding=True, 
                             truncation=True, 
                             return_tensors='pt')

In [None]:
(encoded_question)

{'input_ids': tensor([[  101,  2000,  3183,  ...,     0,     0,     0],
        [  101,  2054,  2003,  ...,     0,     0,     0],
        [  101,  1996, 13546,  ...,     0,     0,     0],
        ...,
        [  101,  2129,  2172,  ...,     0,     0,     0],
        [  101,  2054,  7064,  ...,     0,     0,     0],
        [  101,  2054,  2106,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

## Model

In [None]:

#passage_encoder = TFAutoModel.from_pretrained("nlpconnect/dpr-ctx_encoder_bert_uncased_L-12_H-128_A-2")
#query_encoder = TFAutoModel.from_pretrained("nlpconnect/dpr-question_encoder_bert_uncased_L-12_H-128_A-2")

#p_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/dpr-ctx_encoder_bert_uncased_L-12_H-128_A-2")
#q_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/dpr-question_encoder_bert_uncased_L-12_H-128_A-2")


In [None]:
model_c     = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model_q     = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

optimizer_c = optim.Adam(model_c.parameters(), lr=0.0001)
optimizer_q = optim.Adam(model_q.parameters(), lr=0.0001)
model_c.to('cuda')
model_q.to('cuda')

Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [None]:
""" from https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens """
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

In [None]:

def shuffle(c:torch.tensor, q:torch.tensor): 
  idx = torch.randperm(q.shape[0])
  permuted_c = c[idx].view(c.size())
  permuted_q = q[idx].view(q.size())
  return permuted_c, permuted_q 


In [None]:
encoded_context['input_ids'].size()

torch.Size([1000, 428])

In [None]:
encoded_question['input_ids'].size()

torch.Size([1000, 31])

In [None]:
from tqdm import tqdm
import torch 
n_epochs = 2
batch_size = 14
"""Without sentence-transformers, you can use the model like this: First, you pass
 your input through the transformer model, then you have to apply the right 
 pooling-operation on-top of the contextualized word embeddings.
"""
for epoch in range(n_epochs):
  loop = tqdm(range (0, len(light_questions),batch_size ))
  for  index in loop:
      optimizer_c.zero_grad()
      optimizer_q.zero_grad()

      #model_output_context = model(**encoded_context.to('cuda'))
      #model_output_question = model(**encoded_question.to('cuda'))
      #model_output_context  = model_c(**encoded_context[index:(index+batch_size)].to('cuda'))
      #model_output_question = model_q(**encoded_question[index:(index+batch_size)].to('cuda'))
      
      batch_c , batch_q = shuffle(encoded_context['input_ids'][index:(index+batch_size)] , 
                                  encoded_question['input_ids'][index:(index+batch_size)])
      model_output_context  = model_c(batch_c.to('cuda')).pooler_output.to('cuda')
      model_output_question = model_q(batch_q.to('cuda')).pooler_output.to('cuda')
      #print(model_output_question)
      #print("output question epoch" , i, '\n',  model_output_question[0][0].size())
      
      ### Perform pooling. In this case, max pooling.
      #context_embeddings  = mean_pooling(model_output_context, encoded_context['attention_mask'][index:(index+batch_size)])
      #question_embeddings = mean_pooling(model_output_question, encoded_question['attention_mask'][index:(index+batch_size)])
      #print(torch.eq(context_embeddings, question_embeddings))
      #print(context_embeddings)
      #print(question_embeddings)
      
      ### Inner product
      # as said in the subject, we cannot use the L2, we have to use the inner product 
      #between the embedded questions and the context 
      similarity = torch.mm(model_output_question,model_output_context.T) 
      # we obatin a tensor (tensor([[116.0925, 116.0925, 116.0925, 116.0925, 116.0925,  81.7312,  81.7312,
      #      81.7312,  81.7312,  81.7312],  -> question 1 
      #    [ 93.6791,  93.6791,  93.6791,  93.6791,  93.6791,  35.1938,  35.1938,
      #      35.1938,  35.1938,  35.1938],)  -> question 2  where the score is hirgher for the context of the question 
      #

      ### In batch negative samples
      if len(model_output_question.size()) > 1: # if there is no batch this loop is useless 
          q_num = model_output_question.size(0)
          scores = similarity.view(q_num, -1)  #we change the shape of similarity like that it has qnum lines
      #print("sc" , scores.size()) # line = question column = context 
      softmax_scores = F.log_softmax(scores, dim=1).to('cuda') # solftmax on the input of the loss
      #softmax_scores = F.log_softmax(similarity, dim=1)  # not somilar are spread aprart while similar are cloth to 0 
      #print("similarity epoch :", i, ' : \n',  similarity[:10,:10 ]) 
      # square matrix = num question x num context  (with num context = questions) if no batching
      
      loss = F.nll_loss(
          softmax_scores,
          torch.tensor([j for j in range(softmax_scores.shape[0])]).to('cuda'), 
          # if 14 questions tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]) because eacu question 
          # must be afficiated to the context on the diagonal of the embedding must be such as the score  has to be higher on the 
          # diagonale, question 1 has the context 5 the higher score is located in position (5,5) 
          reduction="mean",
      )
      
      loss.backward()
      optimizer_c.step()
      optimizer_q.step()
      loop.set_description(f'Epoch {epoch}')
      loop.set_postfix(loss = loss.item())
      if index == (len(light_questions)-1) :
        print(" epoch ",i ,"loss : " , loss.item())
        print("========================")



   

Epoch 0: 100%|██████████| 72/72 [01:35<00:00,  1.32s/it, loss=1.79]
Epoch 1: 100%|██████████| 72/72 [01:34<00:00,  1.32s/it, loss=1.79]


In [19]:
from google.colab import drive
drive.mount('/content/gdrive') 

Mounted at /content/gdrive


In [21]:

#model_path = '/content/gdrive/MyDrive/model/Bert_DRP_c'
#model_c.save_pretrained(model_path)
#tokenizer.save_pretrained(model_path)

#model_path = '/content/gdrive/MyDrive/model/Bert_DRP_q'
#model_q.save_pretrained(model_path)
#tokenizer.save_pretrained(model_path)

In [None]:
light_context_v = context_v[0:100]
light_questions_v = questions_v[0:100]
encoded_context_v  = tokenizer(light_context_v, 
                             padding=True, 
                             truncation=True, 
                             return_tensors='pt')

encoded_question_v = tokenizer(light_questions_v, 
                             padding=True, 
                             truncation=True, 
                             return_tensors='pt')
model_q.eval()
model_c.eval()
with torch.no_grad():
     model_output_context_v  = model_c(encoded_question_v['input_ids'][0:1].to('cuda')).pooler_output.to('cuda')
     model_output_question_v = model_q(encoded_context_v['input_ids'].to('cuda')).pooler_output.to('cuda')
     print(model_output_context_v.size, model_output_question_v.size)
     similarity = torch.mm(model_output_question_v,model_output_context_v.T) 
     print(similarity)
     context_index = torch.argmax(similarity)
     print(context_index)

<built-in method size of Tensor object at 0x7f9980d84d70> <built-in method size of Tensor object at 0x7f9980d84e30>
tensor([[286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [286.8829],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
        [289.4887],
    

In [None]:
model_c_t     = AutoModel.from_pretrained(model_path)
model_c_t.to('cuda')
print(model_c_t(encoded_question_v['input_ids'][0:1].to('cuda')).pooler_output.to('cuda'))

tensor([[-6.5102e-01, -2.3657e-01,  9.7032e-01,  3.5870e-01, -8.0167e-01,
         -4.1634e-01,  6.8466e-01,  3.8951e-01,  7.7361e-01, -9.7005e-01,
          5.4211e-01, -8.5508e-01,  9.2920e-01, -5.6524e-01,  8.4696e-01,
         -5.0329e-02,  7.8031e-02, -5.1392e-01,  4.0507e-01, -6.3107e-01,
          2.8595e-01, -9.8132e-01,  7.4457e-01,  3.8812e-01,  4.6911e-01,
         -9.5448e-01, -1.4167e-01,  8.4001e-01,  8.8603e-01,  4.2925e-01,
         -6.2648e-01,  4.4076e-01, -8.6885e-01, -4.7645e-01,  9.5903e-01,
         -8.4991e-01,  9.6467e-02, -6.8145e-01, -4.6164e-01, -2.5775e-01,
         -7.3630e-01,  3.7418e-01,  1.5683e-01, -1.5913e-01, -5.5493e-01,
         -5.5733e-01, -4.5068e-01,  4.7447e-01, -6.5579e-01, -9.5915e-01,
         -8.9668e-01, -9.7354e-01,  3.5796e-01,  3.0211e-01,  3.5895e-01,
          5.9699e-01,  1.6382e-01,  3.8684e-01, -1.7154e-01, -5.6182e-01,
         -6.7086e-01,  1.5218e-01,  7.8381e-01, -6.4030e-01, -9.3230e-01,
         -9.4135e-01, -3.4329e-01, -3.

## Validation

In [12]:
      
def shuffle(c:torch.tensor, q:torch.tensor): 
  idx = torch.randperm(q.shape[0])
  permuted_c = c[idx].view(c.size())
  permuted_q = q[idx].view(q.size())
  return permuted_c, permuted_q 


In [13]:
light_contexts_v  = list(set(squad_dataset['validation']['context']))  # we only need 1 context per question 
questions_v       = squad_dataset['validation']['question']
contexts_v        = squad_dataset['validation']['context']

In [16]:
import random as random 
index = [random.randint(0,1000) for i in range(100)]
light_context_v   = [ context_v[i] for i in index]
light_questions_v = [ questions_v[i] for i in index]

In [17]:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
encoded_context_v  = tokenizer(light_context_v, 
                             padding=True, 
                             truncation=True, 
                             return_tensors='pt')

encoded_question_v = tokenizer(light_questions_v, 
                             padding=True, 
                             truncation=True, 
                             return_tensors='pt')

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [18]:
# context dictionnairy 
contexts_to_idx = dict()
for id, context in enumerate(light_context_v):
    contexts_to_idx[context] = id
# question dictionnairy 
questions_to_ctx_idx = dict()
for id, question in enumerate(light_questions_v):
    questions_to_ctx_idx[id] = contexts_to_idx[context_v[id]]

In [23]:
from google.colab import drive
drive.mount('/content/gdrive') 

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [20]:
!pip install faiss-cpu
import faiss
index         = faiss.IndexFlatIP(416)


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [25]:
model_path_q = '/content/gdrive/MyDrive/model/Bert_DRP_q'
model_path_c = '/content/gdrive/MyDrive/model/Bert_DRP_c'


model_DRP_c = AutoModel.from_pretrained(model_path_c).to('cuda')
model_DRP_q = AutoModel.from_pretrained(model_path_q).to('cuda')
 
batch_size = 14
from tqdm import tqdm
for i in tqdm(range(0, len(light_context_v),batch_size )):
   out_put = model_DRP_c(encoded_context_v['input_ids'][i:i+batch_size].to('cuda')).pooler_output.to('cuda')
   index.add(out_put)



  0%|          | 0/8 [00:00<?, ?it/s]


AssertionError: ignored

In [None]:
k=10
right_guesses = 0
total_guesses = 0
j=0
for question in tqdm(light_questions_v):
    encoded_question_v = tokenizer(question, 
                             padding=True, 
                             truncation=True, 
                             return_tensors='pt')
    input_ids_q = encoded_question_v["input_ids"]
    embeddings_q = model_DRP_q(input_ids_q)
    scores = index.search(embeddings_q[1].detach().numpy(), k)
    print(context_index)
    context_index = questions_to_ctx_idx[j]
    #print(context_index)
    if context_index in scores[1][0].tolist():
        right_guesses+=1
     
    total_guesses+=1
    j+=1
print("")
print("Recall@10: "+str(right_guesses/total_guesses))

> **Problem 4.2** *(2 points)* Now that we have an MRC model and a retrieval model, if you want to create a open-domain QA system, you can simply connect them! Don't code, but describe in a few sentences how you will connect them and list two things that you need to be careful about (e.g. bias, inference speed, memory, etc.).

_Case 1_ : We know the context : 
In this case, we slip the context in different parts, for example , we can split it for each sentences. Then we use the MRC to find the sentence with the greatest similarity with this question. When we have find the precise part which answer the question, we use T5 or BERT to answer the question more precicely. Like that we could avoid the answer similar that might be true but which are not (due to similarity in word eg : Place nane or locations. 

_Case 2_ : We don't know the context
In that case you use the MRC to find the right context in the series of documents which can answer the problem. Then we use this context with T5 or BERT to retreive the answer 

# Test  - Garbadge 

In [None]:
class SquadModule(pl.LightningDataModule): 
  """ class with create the all dataset with the training and validation frames """
  def __init__(
      self, 
      train_dataframe: pd.DataFrame,
      validation_dataframe: pd.DataFrame,
      tokenizer : T5Tokenizer, 
      batch_size : int = 6,
      len_max_token_source : int = 396,
      len_max_token_target : int = 40,
  ):
    super().__init__()
    self.train_dataframe = train_dataframe
    self.validation_dataframe = validation_dataframe
    self.batch_size = batch_size
    self.tokenizer = tokenizer
    self.len_max_token_source = len_max_token_source
    self.len_max_token_target = len_max_token_target

  def setup (self): 
    self.train_dataset = SquadDataset (
        self.train_dataframe, 
        self.tokenizer ,
        self.len_max_token_source ,
        self.len_max_token_target 
    )
    self.validation_dataset = SquadDataset(
        self.validation_dataframe, 
        self.tokenizer ,
        self.len_max_token_source ,
        self.len_max_token_target
    )

  def train_dataloader(self): 
    return DataLoader(
      self.train_dataset, 
      batch_size = self.batch_size, 
      shuffle = True, 
      num_workers = 4
    )
  
  def validation_dataloader(self): 
    return DataLoader(
      self.validation_dataset, 
      batch_size = 1, 
      num_workers = 4
    )

  def test_dataloader(self): 
    return DataLoader(
      self.validation_dataset, 
      batch_size = 1, 
      num_workers = 4
    )

NameError: ignored

In [None]:
epoch_number = 6
batch_size   = 2

# creation of the dataset 
data_module  = SquadModule(train_dataframe, validation_dataframe, 
                           tokenizer, batch_size=batch_size)
data_module.setup()

In [None]:
model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

In [None]:
class SquadQA(pl.LightningModule):
  
  def __init__(self): 
    super().__init__()
    self.model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)

  def forward(self, input_ids, attention_mask, labels = None ): 
    print("a", input_ids)
    output = self.model(
        input_ids = input_ids, 
        attention_mask = attention_mask, 
        labels= labels
    )
    return output.loss, output.logits

  def training_step(self, batch, index_batch):
    input_ids      = batch ["input_ids"]
    attention_mask = batch["attention_mask" ]
    labels = batch["labels"]
    loss, outputs  = self(input_ids, attention_mask, labels)
    self.log("train_loss", loss, prog_bar= True, logger =True)
    return loss

  def validation_step(self, batch, index_batch):
    input_ids      = batch ["input_ids"]
    attention_mask = batch["attention_mask" ]
    labels = batch["labels"]
    loss, outputs  = self(input_ids, attention_mask, labels)
    self.log("val_loss", loss, prog_bar= True, logger =True)
    return loss

  def test_step(self, batch, index_batch):
    input_ids      = batch ["input_ids"]
    attention_mask = batch["attention_mask" ]
    labels = batch["labels"]
    loss, outputs  = self(input_ids, attention_mask, labels)
    self.log("test_loss", loss, prog_bar= True, logger =True)
    return loss

  def optim_setup(self): 
    return AdamW(self.parameters(), lr=0.0001)




In [None]:
model = SquadQA()

In [None]:
# we stave the best check point at each times
checkpoint_saver = ModelCheckpoint(
    dirpath  = "checkpoints", 
    filename = "best-check", 
    save_top_k = 1, 
    verbose    = True, 
    monitor    = "val_loss", 
    mode       = "min"
)

In [None]:
trainer = pl.Trainer(
    checkpoint_callback = checkpoint_saver, 
    max_epochs          = epoch_number, 
    gpus                = 1, 
    progress_bar_refresh_rate= 30
)

  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [None]:
trainer.fit-model (model, data_module )

a SquadQA(
  (model): T5ForConditionalGeneration(
    (shared): Embedding(32128, 768)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32128, 768)
      (block): ModuleList(
        (0): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=768, out_features=768, bias=False)
                (k): Linear(in_features=768, out_features=768, bias=False)
                (v): Linear(in_features=768, out_features=768, bias=False)
                (o): Linear(in_features=768, out_features=768, bias=False)
                (relative_attention_bias): Embedding(32, 12)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=768, out_features=3072, bias=False)
                (wo): Linear(in_features

AttributeError: ignored

In [None]:
trainer.test()

In [None]:
### Test 
trained_model = SquadQA.load_from_checkpoint("checkpoints/best-checkpoint.ckpt")
trained_model.freeze()