
In [None]:
!pip install datasets
!pip install sacrerouge sacrebleu bert-score

!git clone https://github.com/huggingface/transformers.git
!pip install ./transformers/.

In [None]:
# import the necessary libraries
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
# loading dataset
train_dataset = load_dataset('xsum', split='train')
valid_dataset = load_dataset('xsum', split='validation')
test_dataset = load_dataset('xsum', split='test')

# in this example we will use only one batch containing 10 examples 
batch_input = test_dataset["document"][0:10]
batch_output = test_dataset["summary"][0:10]

# Each model has a name on the Hugging Face Hub; you can browse the list of all models at https://huggingface.co/models
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-xsum-6-6")

# Each model usually comes with its own tokenizer, which holds the vocabulary of all tokens, so pick the one that matches the model you use.
# The tokenizer usually has the same name as the model. "sshleifer/distilbart-xsum-6-6" is a community model adapted from BART,
# so it works with the BART tokenizer.
tok = BartTokenizer.from_pretrained("facebook/bart-base")  ## tokenizer

# Using the loaded tokenizer, encode the input documents into token ids.
# max_length is 1024 because the BART model accepts at most 1024 input tokens.
input_encodings = tok.batch_encode_plus(batch_input, padding=True, max_length=1024, truncation=True, return_tensors="pt")
target_encodings = tok.batch_encode_plus(batch_output, padding=True, max_length=1024, truncation=True, return_tensors="pt")



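# Generate summaries for the batch. model.generate takes many decoding parameters; when none are given it uses the defaults from the model's generation config.
# The attention mask is passed so that padding tokens are ignored.
model_output = model.generate(input_encodings["input_ids"], attention_mask=input_encodings["attention_mask"])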

# Convert the generated token ids back into text. This is tokenizer "decoding", not to be confused with decoding algorithms such as sampling and beam search, which are selected through model.generate.
model_output_decoded = tok.batch_decode(model_output, skip_special_tokens=True)
for s,t,g in zip(batch_input, batch_output, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g))

**Question 1.1**

**Task 1.1: Translation**

**Bible-para Dataset**

In [None]:
valid_dataset = load_dataset("bible_para",lang1 = 'en', lang2 = 'fr', split = "train[:20%]")
trans_bible_eval = valid_dataset[0:1000]

In [None]:
# Load the pre-trained model for bible_para dataset
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenizer
tok = T5Tokenizer.from_pretrained("t5-small")

batch = valid_dataset['translation'][0:10]
input_sequences = [i['en'] for i in batch]
batch_output = [i['fr'] for i in batch]

task_prefix = "translate English to French: "
encoding = tok(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
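input_ids = encoding.input_ids
# Pass the attention mask so that padding tokens are ignored during generation
model_outputs = model.generate(input_ids, attention_mask=encoding.attention_mask)

# Convert the generated token ids back into text. This is tokenizer "decoding", not to be confused with decoding algorithms such as sampling and beam search, which are selected through model.generate.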
model_output_decoded = tok.batch_decode(model_outputs, skip_special_tokens=True)
for s,t,g in zip(input_sequences, batch_output, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g))

**un-multi dataset**

In [None]:
# This cell loads the TED IWSLT 2013 English-French corpus
valid_dataset = load_dataset("ted_iwlst2013", "en-fr", split = "train[:1%]")
trans_unmulti_eval = valid_dataset[0:1000]

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the pre-trained T5 model used for translation
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenizer
tok = T5Tokenizer.from_pretrained("t5-small")

batch = valid_dataset["translation"][0:10]
input_sequences = [i['en'] for i in batch]
batch_output = [i['fr'] for i in batch]

task_prefix = "translate English to French: "
encoding = tok(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
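input_ids = encoding.input_ids
# Pass the attention mask so that padding tokens are ignored during generation
model_outputs = model.generate(input_ids, attention_mask=encoding.attention_mask)

# Convert the generated token ids back into text. This is tokenizer "decoding", not to be confused with decoding algorithms such as sampling and beam search, which are selected through model.generate.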
model_output_decoded = tok.batch_decode(model_outputs, skip_special_tokens=True)
for s,t,g in zip(input_sequences, batch_output, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g))

**Task 1.1: Summarization**

In [None]:
train_data = load_dataset("ccdv/cnn_dailymail", '3.0.0',split="train")
val_data = load_dataset("ccdv/cnn_dailymail", '3.0.0', split="validation[:10%]")
test_data =load_dataset("ccdv/cnn_dailymail", '3.0.0', split="test")
sum_cnn_eval = test_data[0:1000]

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

#model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-cnn_dailymail").to("cuda")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-xsum-6-6")

# Tokenizer
tok = BartTokenizer.from_pretrained("facebook/bart-base")

batch_input = test_data["article"][0:10]
batch_output = test_data["highlights"][0:10]

# Using the loaded tokenizer, encode the input documents into token ids.
# max_length is 1024 because the BART model accepts at most 1024 input tokens.
input_encodings = tok.batch_encode_plus(batch_input, padding=True, max_length=1024, truncation=True, return_tensors="pt")
target_encodings = tok.batch_encode_plus(batch_output, padding=True, max_length=1024, truncation=True, return_tensors="pt")

# Generate summaries for the batch. model.generate takes many decoding parameters; when none are given it uses the defaults from the model's generation config.
model_output = model.generate(input_encodings["input_ids"], attention_mask=input_encodings["attention_mask"])

# Convert the generated token ids back into text. This is tokenizer "decoding", not to be confused with decoding algorithms such as sampling and beam search, which are selected through model.generate.
model_output_decoded = tok.batch_decode(model_output, skip_special_tokens=True)
for s,t,g in zip(batch_input, batch_output, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g))

**Task 1.1: Question Answering**

In [None]:
# loading dataset
valid_dataset = load_dataset("boolq", split = "train")
qa_eval = valid_dataset[0:1000]

In [None]:
boolq_dataset = load_dataset('boolq')

batch = boolq_dataset['validation'][0:10]
passages = batch["passage"]
questions = batch["question"]
answers = batch["answer"]

# Define a function to combine passages and questions
def process_text(passage, question):
    text = "passage: <s> %s </s> question: %s" % (passage, question)
    return text 

context = [process_text(passage,question) for passage,question in zip(passages, questions)] 

tok.pad_token = tok.eos_token

task_prefix = "truefalse: "
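# Each input is a single sequence, so there is no second segment for "only_second" to truncate; plain truncation is used instead
encoding = tok(
                   [task_prefix + sequence for sequence in context],
                    max_length=512,
                    padding="max_length",
                    truncation=True,
                    add_special_tokens=True,
                    return_tensors="pt")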

# encoding = tokenizer(passages, questions,return_tensors = "pt")
input_ids = encoding["input_ids"]
attention_masks = encoding["attention_mask"]

model_output = model.generate(input_ids = input_ids,
                                attention_mask = attention_masks,)

# Convert the generated token ids back into text. This is tokenizer "decoding", not to be confused with decoding algorithms such as sampling and beam search, which are selected through model.generate.
model_output_decoded = tok.batch_decode(model_output, 
                                              skip_special_tokens=True)
                                              
for s,t,g in zip(passages, questions,model_output_decoded):
  print("source:\t {}\nquestions:\t {}\ngenerated\t{}\n------\n".format(s[0:1000],t,g))

**Task 1.2: Implement Extra Evaluation Metrics**

**Implementation of BLEU**
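
As a point of reference for the class in this section, here is a minimal sketch of what BLEU computes on a single toy sentence pair: clipped n-gram precisions, their geometric mean, and the brevity penalty. The helper names `ngram_counts` and `toy_bleu`, the toy sentences, and the choice of `max_order=2` are assumptions made for illustration only; the sketch uses whitespace tokens, uniform weights and no smoothing.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(reference, candidate, max_order=2):
    """Sentence-level BLEU sketch: uniform weights, no smoothing (illustrative only)."""
    log_precisions = []
    for n in range(1, max_order + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipping: an n-gram is credited at most as often as it occurs in the reference
        clipped = sum((cand & ref).values())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Geometric mean of the modified precisions
    geo_mean = math.exp(sum(log_precisions) / max_order)

    # Brevity penalty: candidates shorter than the reference are penalised
    ratio = len(candidate) / len(reference)
    bp = 1.0 if ratio >= 1.0 else math.exp(1.0 - 1.0 / ratio)
    return bp * geo_mean


reference = "the cat is on the mat".split()
candidate = "the cat is on mat".split()
score = toy_bleu(reference, candidate)
print(round(score, 4))
```

In this toy pair every candidate unigram appears in the reference, three of the four candidate bigrams match, and the candidate is one token shorter than the reference, so the score is the product of a brevity penalty below 1 and the geometric mean of the two precisions.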

In [None]:
!pip install rouge_score
!pip install mosestokenizer

In [None]:
import re 
import numpy as np 
from collections import Counter, namedtuple
from tqdm import tqdm 
from datasets import load_metric
import pandas as pd
from mosestokenizer import MosesTokenizer

class BLEU:

    def __init__(self):
        self.max_order = 4
        self.BleuScore = namedtuple('BleuScore', ['bleu','precisions','bp','translation_length','reference_length'])

    def _get_ngrams(self, sentence):
        ngram_counts = Counter()
        for i in range(1, self.max_order + 1):
            for j in range(0, len(sentence) - i+1):
                ngram = tuple(sentence[j:j+i])
                ngram_counts[ngram] += 1
        return ngram_counts

    def bleu_on_corpus(self,ref,tran,smooth=False):
        ordered_matches, possible_matches = [0] * self.max_order, [0] * self.max_order
        len_reference, len_translation = 0, 0

        for (reference,translation) in zip(ref,tran):
            len_reference += len(reference)
            len_translation += len(translation)

            # Each example has a single reference, so its n-gram counts are used directly
            merged_ngrams_ref = self._get_ngrams(reference)
            merged_ngrams_trans = self._get_ngrams(translation)

            #Get the overlap now 
            intersect = merged_ngrams_ref & merged_ngrams_trans
            for grams in intersect:
                ordered_matches[len(grams)-1] += intersect[grams]

            #Obtain the denominator of the precision scores (number of candidate n-grams per order)
            for i in range(1, self.max_order + 1):
                matches = len(translation) - i + 1
                if matches > 0:
                    possible_matches[i-1] += matches

        #Obtain the modified precision 
        precision = [0] * self.max_order
        for i in range(0, self.max_order):
            if smooth:
                precision[i] = ((ordered_matches[i] + 1.0) / 
                               (possible_matches[i] + 1.0))
            else:
                if possible_matches[i] > 0:
                    precision[i] = (float(ordered_matches[i])/possible_matches[i])
                else:
                    precision[i] = 0.0

        if min(precision) > 0:
            geometric_mean = np.round(np.exp(sum((1.0/self.max_order) * np.log(p) for p in precision)),3)
        else:
            geometric_mean = 0 

        #Brevity penalty 
        ratio = np.round(float(len_translation)/len_reference,2)
        
        if ratio > 1.0:
            bp = 1.0
        else:
            bp = np.exp(1 - 1.0/ratio)

        bleu = np.round(geometric_mean * bp, 4)

        return self.BleuScore(bleu=bleu,precisions=precision,bp=bp,translation_length=len_translation,
                             reference_length=len_reference)
                                    



In [None]:
def compute_bleu(dataset):

    # Avoid shadowing the built-in input()
    examples = dataset['translation'][0:1000]
    batch_input = [example['en'] for example in examples]
    batch_output = [example['fr'] for example in examples]
    
    task_prefix = "translate English to French: "
    inputs = tok([task_prefix + sentence for sentence in batch_input],
                       return_tensors="pt", 
                       padding=True)
    
    model_output = model.generate(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"],
                                    do_sample=False)
    
    output = tok.batch_decode(model_output, skip_special_tokens=True)

    fr_tokenizer = MosesTokenizer('fr')
    reference = []
    prediction = []
    for i in batch_output:
        reference.append(fr_tokenizer(i))
        
    for i in output:
        prediction.append(fr_tokenizer(i))
    
    # My bleu
    bleu = BLEU()
    my_bleu = bleu.bleu_on_corpus(reference,prediction,smooth=False)[0]

    # Bleu inside huggingface
    reference_snts = [[r] for r in reference]
    bleu_metric = load_metric("bleu")
    true_bleu = np.round(bleu_metric.compute(predictions=prediction,references=reference_snts)['bleu'],4)

    return my_bleu, true_bleu

In [None]:
bible_dataset = load_dataset("bible_para",lang1 = 'en', lang2 = 'fr', split = "train[:20%]")
eu_dataset = load_dataset("ted_iwlst2013", "en-fr", split = "train[:1%]")

In [None]:
my_bleu_bible, true_bleu_bible = compute_bleu(bible_dataset)
my_bleu_eu, true_bleu_eu = compute_bleu(eu_dataset)

df_bleu = pd.DataFrame({"HuggingFace": [true_bleu_bible, true_bleu_eu],
                         "My result": [my_bleu_bible, my_bleu_eu]}, index=['bible_para Dataset','ted_iwlst2013 Dataset'])

df_bleu

![image.png](attachment:30496f17-3b02-42d9-a89f-b2b30f22b91c.png)

**Implementation of ROUGE**
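
To make the quantities in this section concrete, the following is a small sketch of ROUGE-2 on one toy pair of sentences: the overlap is the clipped bigram intersection, precision divides it by the number of candidate bigrams, and recall divides it by the number of reference bigrams. The helper name `bigrams` and the two sentences are hypothetical and chosen only for illustration; tokenisation here is a plain lowercase whitespace split.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


reference = "police killed the gunman"
candidate = "police killed the armed gunman"

ref_counts = bigrams(reference)
cand_counts = bigrams(candidate)

# Clipped overlap: a bigram counts at most as often as it occurs in both texts
overlap = sum((ref_counts & cand_counts).values())

precision = overlap / sum(cand_counts.values())
recall = overlap / sum(ref_counts.values())
f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
print(precision, recall, f1)
```

The candidate is longer than the reference, so precision is lower than recall here; F1 balances the two.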

In [None]:
#N-gram function
def generate_ngrams(s, n):
    # Convert to lowercase and drop apostrophes
    s = s.lower()
    s = s.replace("'","")
    
    # Replace all non-alphanumeric characters with spaces, then drop any remaining non-ASCII characters
    s = re.sub(r'[^\w\s]', ' ', s)
    s = re.sub(r'[^a-zA-Z0-9\s]', '', s)
    
    # Break the sentence into tokens and remove empty tokens
    tokens = [token for token in s.split(" ") if token != ""]
    
    # Use the zip function to generate the n-grams,
    # then concatenate the tokens of each n-gram and return them
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

#ROUGE metric 
def ROUGE(candidate: str, reference: str, n: int):

    #Generate n-grams of the reference and candidate
    reference_grams = Counter(generate_ngrams(reference,n))
    candidate_grams = Counter(generate_ngrams(candidate,n))

    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts
    intersect = reference_grams & candidate_grams
    intersect = sum(intersect.values())

    # Compute the precision, recall and F1 score, guarding against empty n-gram lists
    precision = intersect / max(sum(candidate_grams.values()), 1)
    recall = intersect / max(sum(reference_grams.values()), 1)
    f1 = 2*precision*recall/(precision+recall+1e-8)

    return precision, recall, f1
       

In [None]:
# Compare ROUGE score for the summarization 
def compute_rouge(dataset):

    precision, recall, f1_ = [],[],[]
    precision_rouge, recall_rouge, f1_rouge = [],[],[]

    # Load the HuggingFace ROUGE metric once, outside the loop
    rouge_metric = load_metric("rouge")

    # Process the test set in batches of 10 to avoid memory overload
    for i in tqdm(range(0,1000,10)):
        test_input = dataset['test'][i:i+10]

        # Drop a leading "(CNN)" marker; str.strip("(CNN)") would strip any of those characters from both ends
        batch_input = [re.sub(r"^\(CNN\)\s*", "", article) for article in test_input['article']]
        # Join the highlight lines with a space so that words on adjacent lines are not merged
        batch_output = [highlight.replace("\n", " ") for highlight in test_input['highlights']]
        
        task_prefix = "summarize: "
        inputs = tok([task_prefix + sentence for sentence in batch_input],
                     max_length=512,
                     padding=True,
                     truncation=True,
                     return_tensors="pt")

        model_output = model.generate(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        output = tok.batch_decode(model_output, skip_special_tokens=True)

        #HuggingFace implementation of ROUGE 
        scores = rouge_metric.compute(predictions=output,references=batch_output)
        precision_rouge.append(scores["rouge2"].mid.precision)
        recall_rouge.append(scores["rouge2"].mid.recall)
        f1_rouge.append(scores["rouge2"].mid.fmeasure)

        #My ROUGE 
        for j in range(len(output)):
            pre,rec,f1 = ROUGE(output[j], batch_output[j], n=2)
            precision.append(pre)
            recall.append(rec)
            f1_.append(f1)

    # Average over all examples; the F1 scores are collected in f1_ (f1 is only the last scalar)
    return np.mean(precision) , np.mean(recall), np.mean(f1_), np.mean(precision_rouge), np.mean(recall_rouge), np.mean(f1_rouge)



In [None]:
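# compute_rouge indexes dataset['test'], so load the full DatasetDict rather than a single split
cnn_dataset = load_dataset("ccdv/cnn_dailymail", '3.0.0')
precision,recall,f1,precision_rouge,recall_rouge,f1_rouge = compute_rouge(cnn_dataset)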

df_rouge_cnn = pd.DataFrame({"HuggingFace (n=2)":[precision_rouge,recall_rouge,f1_rouge],
                         "Hard Coded (n=2)":[precision, recall,f1]},index=['Precision','Recall','F1'])

df_rouge_cnn

![image.png](attachment:40873463-450b-4732-89bd-4e095fedb1e1.png)

**Task 1.3 - Implement Decoding Methods on Your Own**

**Table 2 calculation and implementation**

**Bible-para translation: Beam Search**
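
The cells in this part call `model.generate` with `num_beams`. As a rough illustration of what beam search does, here is a self-contained sketch over a hypothetical toy "model": `next_token_logprobs` is an invented lookup table of next-token log-probabilities, not part of transformers, and the sketch omits length normalisation and the other options that `model.generate` supports. At every step each hypothesis is extended by every possible token and only the `num_beams` highest-scoring hypotheses are kept; with `num_beams=1` the procedure reduces to greedy decoding.

```python
import math


def next_token_logprobs(prefix):
    """Hypothetical toy model: next-token log-probabilities given the prefix."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"x": math.log(0.55), "y": math.log(0.45)},
        ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
    }
    # Any prefix not in the table can only be followed by the end-of-sequence token
    return table.get(tuple(prefix), {"<eos>": 0.0})


def beam_search(num_beams, max_length=3):
    """Return the kept hypotheses as (tokens, cumulative log-probability), best first."""
    beams = [([], 0.0)]
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished hypotheses are carried over
                continue
            for token, logp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the num_beams best hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams


greedy_tokens, greedy_score = beam_search(num_beams=1)[0]
beam_tokens, beam_score = beam_search(num_beams=2)[0]
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

In this toy table the greedy path commits to the locally best first token and ends with a lower overall probability than the path found with two beams, which is the behaviour beam search is meant to address.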

In [None]:
# Load the pre-trained model for bible_para dataset
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenizer
tok = T5Tokenizer.from_pretrained("t5-small")

num_return_sequences = 1
idx = 3 

valid_dataset = load_dataset("bible_para",lang1 = 'en', lang2 = 'fr', split = "train[:20%]")
trans_bible_eval = valid_dataset[0:1000]

batch = valid_dataset['translation'][0:10]
input_sequences = [i['en'] for i in batch]
batch_output = [i['fr'] for i in batch]

task_prefix = "translate English to French: "
encoding = tok(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
input_ids = encoding.input_ids

#Result given by Beam Search
beam_outputs = model.generate(
    input_ids, 
    max_length=60, 
    num_beams= 10, 
    no_repeat_ngram_size=2, 
    num_return_sequences=num_return_sequences, 
    early_stopping=True,
    output_scores=True 
)

# Baseline: result with the model's default generation settings
model_outputs = model.generate(input_ids, output_scores=True)

model_output_decoded = tok.batch_decode(model_outputs, skip_special_tokens=True)

beam_search_output_decoded = tok.batch_decode(beam_outputs, skip_special_tokens=True)

for s,t,g_beam, g in zip(input_sequences, batch_output, beam_search_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_beam\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_beam, g))

Checking results for different numbers of beams

In [None]:
import bert_score
import sacrebleu
import pandas as pd 

In [None]:
beam_sizes = [5,10,15,20,25,30,35,40,45,50]
score_precision = []
score_recall = []
score_f1 = []
for num_beams in beam_sizes:
  beam_outputs = model.generate(
    input_ids, 
    max_length=60, 
    num_beams=num_beams, 
    num_return_sequences=num_return_sequences, 
    early_stopping=True,
    output_scores=True 
  )

  beam_search_output_decoded = tok.batch_decode(beam_outputs, skip_special_tokens=True)

  # Candidates and references are French translations, so BERTScore is computed with lang="fr"
  precision_beam,recall_beam,fscore_beam = bert_score.score(cands=beam_search_output_decoded, refs=batch_output, lang="fr")
  score_precision.append(float(precision_beam.mean()))
  score_recall.append(float(recall_beam.mean()))
  score_f1.append(float(fscore_beam.mean()))

df_beam = pd.DataFrame(list(zip(beam_sizes, score_precision, score_recall, score_f1)), columns = ["num_beams", "precision", "recall", "f1_score"])

df_beam

**Summarization - CNN/DailyMail dataset: Beam Search**

In [None]:
train_data = load_dataset("ccdv/cnn_dailymail", '3.0.0',split="train")
val_data = load_dataset("ccdv/cnn_dailymail", '3.0.0', split="validation[:10%]")
test_data =load_dataset("ccdv/cnn_dailymail", '3.0.0', split="test")
sum_cnn_eval = test_data[0:100]

num_return_sequences = 1
idx = 3 

model = T5ForConditionalGeneration.from_pretrained("t5-small")

tok = T5Tokenizer.from_pretrained("t5-small")

batch_input = test_data["article"][0:10]
batch_output = test_data["highlights"][0:10]

input_encodings = tok.batch_encode_plus(batch_input, padding=True, max_length=512, truncation=True, return_tensors="pt")
target_encodings = tok.batch_encode_plus(batch_output, padding=True, max_length=512, truncation=True, return_tensors="pt")

model_output = model.generate(input_encodings["input_ids"], max_length = 60)

beam_outputs = model.generate(
    input_encodings["input_ids"], 
    max_length=60, 
    num_beams= 10, 
    no_repeat_ngram_size=1, 
    num_return_sequences=num_return_sequences, 
    early_stopping=True 
)

# Convert the generated token ids back into text. This is tokenizer "decoding", not to be confused with decoding algorithms such as sampling and beam search, which are selected through model.generate.
model_output_decoded = tok.batch_decode(model_output, skip_special_tokens=True)
beam_search_output_decoded = tok.batch_decode(beam_outputs, skip_special_tokens=True)

for s,t,g_beam, g in zip(batch_input, batch_output, beam_search_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_beam\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_beam, g))

Checking results for different numbers of beams

In [None]:
beam_sizes = [5,10,15,20]
score_precision = []
score_recall = []
score_f1 = []
for num_beams in beam_sizes:
  beam_outputs = model.generate(
    input_encodings["input_ids"], 
    max_length=60, 
    num_beams=num_beams, 
    no_repeat_ngram_size=1, 
    num_return_sequences=num_return_sequences, 
    early_stopping=True 
  ) 

  beam_search_output_decoded = tok.batch_decode(beam_outputs, skip_special_tokens=True)

  precision_beam,recall_beam,fscore_beam = bert_score.score(cands=beam_search_output_decoded, refs=batch_output, lang="en")
  score_precision.append(float(precision_beam.mean()))
  score_recall.append(float(recall_beam.mean()))
  score_f1.append(float(fscore_beam.mean()))

df_beam = pd.DataFrame(list(zip(beam_sizes, score_precision, score_recall, score_f1)), columns = ["num_beams", "precision", "recall", "f1_score"])

df_beam

**Question Answering - BoolQ dataset: Beam Search**

In [None]:
boolq_dataset = load_dataset('boolq')

batch = boolq_dataset['validation'][0:10]
passages = batch["passage"]
questions = batch["question"]
answers = batch["answer"]

# Define a function to combine passages and questions
def process_text(passage, question):
    text = "passage: <s> %s </s> question: %s" % (passage, question)
    return text 

context = [process_text(passage,question) for passage,question in zip(passages, questions)] 

tok.pad_token = tok.eos_token

task_prefix = "truefalse: "
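# Each input is a single sequence, so there is no second segment for "only_second" to truncate; plain truncation is used instead
encoding = tok(
                   [task_prefix + sequence for sequence in context],
                    max_length=512,
                    padding="max_length",
                    truncation=True,
                    add_special_tokens=True,
                    return_tensors="pt")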

# encoding = tokenizer(passages, questions,return_tensors = "pt")
input_ids = encoding["input_ids"]
attention_masks = encoding["attention_mask"]

model_output = model.generate(
                                input_ids = input_ids,
                                attention_mask = attention_masks, max_length = 60)
beam_outputs = model.generate(
    input_ids,
    attention_mask = attention_masks,
    max_length=60, 
    num_beams= 10, 
    no_repeat_ngram_size=2, 
    num_return_sequences=num_return_sequences, 
    early_stopping=True 
)
model_output_decoded = tok.batch_decode(model_output, skip_special_tokens=True)
beam_search_output_decoded = tok.batch_decode(beam_outputs, skip_special_tokens=True)

for s,t,g_beam, g in zip(questions, answers, beam_search_output_decoded, model_output_decoded):
    print("source:\t {}\ntarget:\t{}\ngenerated_beam\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_beam, g))


Checking results for different numbers of beams

In [None]:
beam_sizes = [5,10,15,20]
score_precision = []
score_recall = []
score_f1 = []
# batch_output still holds the summarization highlights from an earlier cell,
# so the references are built from the BoolQ answers instead
answer_refs = [str(answer) for answer in answers]
for num_beams in beam_sizes:
  beam_outputs = model.generate(
    input_ids,
    attention_mask = attention_masks,
    max_length=60, 
    num_beams=num_beams, 
    no_repeat_ngram_size=1, 
    num_return_sequences=num_return_sequences, 
    early_stopping=True 
  ) 

  beam_search_output_decoded = tok.batch_decode(beam_outputs, skip_special_tokens=True)

  precision_beam,recall_beam,fscore_beam = bert_score.score(cands=beam_search_output_decoded, refs=answer_refs, lang="en")
  score_precision.append(float(precision_beam.mean()))
  score_recall.append(float(recall_beam.mean()))
  score_f1.append(float(fscore_beam.mean()))

df_beam = pd.DataFrame(list(zip(beam_sizes, score_precision, score_recall, score_f1)), columns = ["num_beams", "precision", "recall", "f1_score"])

df_beam

**Nucleus Sampling: Bible-para translation**
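
The cells in this part rely on the `top_p` argument of `model.generate`. As an illustration of the idea, here is a minimal pure-Python sketch of nucleus (top-p) filtering on a toy next-token distribution: tokens are ranked by probability, the smallest prefix whose cumulative mass reaches `top_p` is kept, the kept probabilities are renormalised, and one token is sampled from them. The function name `top_p_filter` and the toy probabilities are assumptions for illustration and are not how transformers implements it internally.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p, then renormalise."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Map each kept token id to its renormalised probability
    return {i: probs[i] / mass for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution over 4 token ids
nucleus = top_p_filter(probs, top_p=0.9)

# Sample one token id from the truncated, renormalised distribution
random.seed(0)
token_ids = list(nucleus)
sampled = random.choices(token_ids, weights=[nucleus[i] for i in token_ids], k=1)[0]
print(nucleus, sampled)
```

With `top_p=0.9` the least likely token is removed from the candidate set, so it can never be sampled; lowering `top_p` shrinks the set further, and `top_p=1.0` keeps the full distribution.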

In [None]:
# Load the pre-trained model for bible_para dataset
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenizer
tok = T5Tokenizer.from_pretrained("t5-small")

num_return_sequences = 1
idx = 3 

valid_dataset = load_dataset("bible_para",lang1 = 'en', lang2 = 'fr', split = "train[:20%]")
trans_bible_eval = valid_dataset[0:1000]

batch = valid_dataset['translation'][0:10]
input_sequences = [i['en'] for i in batch]
batch_output = [i['fr'] for i in batch]

task_prefix = "translate English to French: "
encoding = tok(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
input_ids = encoding.input_ids

# Result given by nucleus (top-p) sampling; top_k=0 disables top-k filtering
nucleus_outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

# Baseline: result with the model's default generation settings
model_outputs = model.generate(input_ids, output_scores=True)

model_output_decoded = tok.batch_decode(model_outputs, skip_special_tokens=True)

nucleus_output_decoded = tok.batch_decode(nucleus_outputs, skip_special_tokens=True)

for s,t,g_nucleus, g in zip(input_sequences, batch_output, nucleus_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_nucleus\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_nucleus, g))

Checking results for different top-p values

In [None]:
top_p = [0.70,0.75,0.80,0.85,0.90,0.95,1.0]
score_precision = []
score_recall = []
score_f1 = []
for p in top_p:
    nucleus_outputs = model.generate(
        input_ids, 
        do_sample=True, 
        max_length=50, 
        top_p=p, 
        top_k=0)
    
    nucleus_output_decoded = tok.batch_decode(nucleus_outputs, skip_special_tokens=True)

    # Candidates and references are French translations, so BERTScore is computed with lang="fr"
    precision_nucleus,recall_nucleus,fscore_nucleus = bert_score.score(cands=nucleus_output_decoded, refs=batch_output, lang="fr")
    score_precision.append(float(precision_nucleus.mean()))
    score_recall.append(float(recall_nucleus.mean()))
    score_f1.append(float(fscore_nucleus.mean()))

df_nucleus = pd.DataFrame(list(zip(top_p, score_precision, score_recall, score_f1)), columns = ["top_p", "precision", "recall", "f1_score"])

df_nucleus

**Nucleus Sampling: Summarization**

In [None]:
train_data = load_dataset("ccdv/cnn_dailymail", '3.0.0',split="train")
val_data = load_dataset("ccdv/cnn_dailymail", '3.0.0', split="validation[:10%]")
test_data =load_dataset("ccdv/cnn_dailymail", '3.0.0', split="test")
sum_cnn_eval = test_data[0:100]

num_return_sequences = 1
idx = 3 

model = T5ForConditionalGeneration.from_pretrained("t5-small")

tok = T5Tokenizer.from_pretrained("t5-small")

batch_input = test_data["article"][0:10]
batch_output = test_data["highlights"][0:10]

input_encodings = tok.batch_encode_plus(batch_input, padding=True, max_length=512, truncation=True, return_tensors="pt")
target_encodings = tok.batch_encode_plus(batch_output, padding=True, max_length=512, truncation=True, return_tensors="pt")

model_output = model.generate(input_encodings["input_ids"], max_length = 60)

nucleus_outputs = model.generate(
    input_encodings["input_ids"], 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

#Result given by T5 
model_outputs = model.generate(input_encodings["input_ids"], output_scores=True)

model_output_decoded = tok.batch_decode(model_outputs, skip_special_tokens=True)

nucleus_output_decoded = tok.batch_decode(nucleus_outputs, skip_special_tokens=True)

for s,t,g_nucleus, g in zip(batch_input, batch_output, nucleus_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_nucleus\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_nucleus, g))

Checking results for different top-p values

In [None]:
top_p = [0.80,0.85,0.90,0.95]
score_precision = []
score_recall = []
score_f1 = []
for p in top_p:
    nucleus_outputs = model.generate(
        input_encodings["input_ids"],
        do_sample=True, 
        max_length=50, 
        top_p=p, 
        top_k=0)
    
    nucleus_output_decoded = tok.batch_decode(nucleus_outputs, skip_special_tokens=True)

    precision_nucleus,recall_nucleus,fscore_nucleus = bert_score.score(cands=nucleus_output_decoded, refs=batch_output, lang="en")
    score_precision.append(float(precision_nucleus.mean()))
    score_recall.append(float(recall_nucleus.mean()))
    score_f1.append(float(fscore_nucleus.mean()))

df_nucleus = pd.DataFrame(list(zip(top_p, score_precision, score_recall, score_f1)), columns = ["top_p", "precision", "recall", "f1_score"])

df_nucleus

**Nucleus Sampling: Question Answering**

In [None]:
boolq_dataset = load_dataset('boolq')

batch = boolq_dataset['validation'][0:10]
passages = batch["passage"]
questions = batch["question"]
answers = batch["answer"]

# Define a function to combine passages and questions
def process_text(passage, question):
    text = "passage: <s> %s </s> question: %s" % (passage, question)
    return text 

context = [process_text(passage,question) for passage,question in zip(passages, questions)] 

tok.pad_token = tok.eos_token

task_prefix = "truefalse: "
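# Each input is a single sequence, so there is no second segment for "only_second" to truncate; plain truncation is used instead
encoding = tok(
                   [task_prefix + sequence for sequence in context],
                    max_length=512,
                    padding="max_length",
                    truncation=True,
                    add_special_tokens=True,
                    return_tensors="pt")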

# encoding = tokenizer(passages, questions,return_tensors = "pt")
input_ids = encoding["input_ids"]
attention_masks = encoding["attention_mask"]

model_output = model.generate(
                                input_ids = input_ids,
                                attention_mask = attention_masks, max_length = 60)
nucleus_outputs = model.generate(
    input_ids,
    attention_mask = attention_masks,
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)


model_output_decoded = tok.batch_decode(model_output, skip_special_tokens=True)

nucleus_output_decoded = tok.batch_decode(nucleus_outputs, skip_special_tokens=True)

for s,t,g_nucleus, g in zip(questions, answers,nucleus_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_nucleus\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_nucleus, g))

Checking results for different top-p values

In [None]:
top_p = [0.80,0.85,0.90,0.95]
score_precision = []
score_recall = []
score_f1 = []
for p in top_p:
    nucleus_outputs = model.generate(
        input_ids,
        attention_mask = attention_masks,
        do_sample=True, 
        max_length=50, 
        top_p=p, 
        top_k=0)
    
    nucleus_output_decoded = tok.batch_decode(nucleus_outputs, skip_special_tokens=True)

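    # batch_output still holds the summarization highlights from an earlier cell; score against the BoolQ answers instead
    precision_nucleus,recall_nucleus,fscore_nucleus = bert_score.score(cands=nucleus_output_decoded, refs=[str(answer) for answer in answers], lang="en")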
    score_precision.append(float(precision_nucleus.mean()))
    score_recall.append(float(recall_nucleus.mean()))
    score_f1.append(float(fscore_nucleus.mean()))

df_nucleus = pd.DataFrame(list(zip(top_p, score_precision, score_recall, score_f1)), columns = ["top_p", "precision", "recall", "f1_score"])

df_nucleus

**Softmax with temperature: Bible-para translation**
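
The cells in this part pass `temperature` to `model.generate`. The underlying operation can be sketched in a few lines: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution towards the most likely token and values above 1 flatten it. The function name `softmax_with_temperature` and the toy logits are assumptions for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy next-token logits
for t in (0.3, 0.7, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

sharp = softmax_with_temperature(logits, 0.3)
flat = softmax_with_temperature(logits, 1.0)
```

Sampling from `sharp` picks the top token far more often than sampling from `flat`, which is why low temperatures tend to give outputs closer to greedy decoding.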

In [None]:
# Load the pre-trained model for bible_para dataset
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenizer
tok = T5Tokenizer.from_pretrained("t5-small")

num_return_sequences = 1
idx = 3 

valid_dataset = load_dataset("bible_para",lang1 = 'en', lang2 = 'fr', split = "train[:20%]")
trans_bible_eval = valid_dataset[0:1000]

batch = valid_dataset['translation'][0:10]
input_sequences = [i['en'] for i in batch]
batch_output = [i['fr'] for i in batch]

task_prefix = "translate English to French: "
encoding = tok(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
input_ids = encoding.input_ids

# Result given by sampling with temperature
temperature_outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

# Result given by T5 with the default generate() settings
model_outputs = model.generate(input_ids, output_scores=True)

model_output_decoded = tok.batch_decode(model_outputs, skip_special_tokens=True)

temperature_output_decoded = tok.batch_decode(temperature_outputs, skip_special_tokens=True)

for s,t,g_temperature, g in zip(input_sequences, batch_output, temperature_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_temp\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_temperature, g))

Checking results for different temperature values

In [None]:
temp = [0.3,0.5,0.6,0.7,0.8]
score_precision = []
score_recall = []
score_f1 = []
for i in temp:
    temperature_outputs = model.generate(
        input_ids, 
        do_sample=True, 
        max_length=50, 
        top_k=0, 
        temperature=i)
    
    temperature_output_decoded = tok.batch_decode(temperature_outputs, skip_special_tokens=True)

    # candidates and references are French, so BERTScore is computed with lang="fr"
    precision_temperature,recall_temperature,fscore_temperature = bert_score.score(cands=temperature_output_decoded, refs=batch_output, lang="fr")
    score_precision.append(float(precision_temperature.mean()))
    score_recall.append(float(recall_temperature.mean()))
    score_f1.append(float(fscore_temperature.mean()))

df_temp = pd.DataFrame(list(zip(temp, score_precision, score_recall, score_f1)), columns = ["temperature", "precision", "recall", "f1_score"])

df_temp

**Softmax with temperature: Summarization**

In [None]:
train_data = load_dataset("ccdv/cnn_dailymail", '3.0.0',split="train")
val_data = load_dataset("ccdv/cnn_dailymail", '3.0.0', split="validation[:10%]")
test_data =load_dataset("ccdv/cnn_dailymail", '3.0.0', split="test")
sum_cnn_eval = test_data[0:100]

num_return_sequences = 1
idx = 3 

model = T5ForConditionalGeneration.from_pretrained("t5-small")

tok = T5Tokenizer.from_pretrained("t5-small")

batch_input = test_data["article"][0:10]
batch_output = test_data["highlights"][0:10]

input_encodings = tok.batch_encode_plus(batch_input, padding=True, max_length=512, truncation=True, return_tensors="pt")
target_encodings = tok.batch_encode_plus(batch_output, padding=True, max_length=512, truncation=True, return_tensors="pt")

model_output = model.generate(input_encodings["input_ids"], max_length = 60)

temperature_outputs = model.generate(
    input_encodings["input_ids"],
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

model_outputs = model.generate(input_encodings["input_ids"], output_scores=True)

model_output_decoded = tok.batch_decode(model_outputs, skip_special_tokens=True)

temperature_output_decoded = tok.batch_decode(temperature_outputs, skip_special_tokens=True)

for s,t,g_temperature, g in zip(batch_input, batch_output, temperature_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_temp\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_temperature, g))

Checking results for different temperature values

In [None]:
temp = [0.3,0.5,0.6,0.7,0.8]
score_precision = []
score_recall = []
score_f1 = []
for i in temp:
    temperature_outputs = model.generate(
        input_encodings["input_ids"],
        do_sample=True, 
        max_length=50, 
        top_k=0, 
        temperature=i)
    
    temperature_output_decoded = tok.batch_decode(temperature_outputs, skip_special_tokens=True)

    precision_temperature,recall_temperature,fscore_temperature = bert_score.score(cands=temperature_output_decoded, refs=batch_output, lang="en")
    score_precision.append(float(precision_temperature.mean()))
    score_recall.append(float(recall_temperature.mean()))
    score_f1.append(float(fscore_temperature.mean()))

df_temp = pd.DataFrame(list(zip(temp, score_precision, score_recall, score_f1)), columns = ["temperature", "precision", "recall", "f1_score"])

df_temp

**Softmax with temperature: Question Answering**

In [None]:
boolq_dataset = load_dataset('boolq')

batch = boolq_dataset['validation'][0:10]
passages = batch["passage"]
questions = batch["question"]
answers = batch["answer"]

# Define a function to combine passages and questions
def process_text(passage, question):
    text = "passage: <s> %s </s> question: %s" % (passage, question)
    return text 

context = [process_text(passage,question) for passage,question in zip(passages, questions)] 

tok.pad_token = tok.eos_token

task_prefix = "truefalse: "
encoding = tok(
                   [task_prefix + sequence for sequence in context],
                    max_length=512,
                    padding="max_length",
                    truncation=True,  # inputs are single sequences, so "only_second" has nothing to truncate
                    add_special_tokens=True,
                    return_tensors="pt")

# encoding = tokenizer(passages, questions,return_tensors = "pt")
input_ids = encoding["input_ids"]
attention_masks = encoding["attention_mask"]

model_output = model.generate(
                                input_ids = input_ids,
                                attention_mask = attention_masks, max_length = 60)
temperature_outputs = model.generate(
    input_ids,
    attention_mask = attention_masks,
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

model_output_decoded = tok.batch_decode(model_output, skip_special_tokens=True)

temperature_output_decoded = tok.batch_decode(temperature_outputs, skip_special_tokens=True)

# Compare against the BoolQ questions and answers loaded in this cell
for s,t,g_temperature, g in zip(questions, answers, temperature_output_decoded, model_output_decoded):
  print("source:\t {}\ntarget:\t{}\ngenerated_temp\t{}\ngenerated\t{}\n------\n".format(s[0:1000],t,g_temperature, g))

**Checking results for different temperature values**

In [None]:
temp = [0.3,0.5,0.6,0.7,0.8]
score_precision = []
score_recall = []
score_f1 = []
for i in temp:
    temperature_outputs = model.generate(
        input_ids,
        attention_mask = attention_masks,
        do_sample=True, 
        max_length=50, 
        top_k=0, 
        temperature=i)
    
    temperature_output_decoded = tok.batch_decode(temperature_outputs, skip_special_tokens=True)

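    # Score against the BoolQ answers (booleans converted to strings), not the summarization targets
    precision_temperature,recall_temperature,fscore_temperature = bert_score.score(cands=temperature_output_decoded, refs=[str(a) for a in answers], lang="en")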
    score_precision.append(float(precision_temperature.mean()))
    score_recall.append(float(recall_temperature.mean()))
    score_f1.append(float(fscore_temperature.mean()))

df_temp = pd.DataFrame(list(zip(temp, score_precision, score_recall, score_f1)), columns = ["temperature", "precision", "recall", "f1_score"])

df_temp

**TABLE 2 SUMMARY**


![image.png](attachment:790ed1a5-8538-46a5-9fbe-dbc88edd600f.png)

**Short Report**

**Observations (see plots below for support): For beam search, precision initially decreases by larger margins as the beam size increases, but it stagnates once the number of beams goes beyond a certain point (around 40 for the translation dataset). F1-score and precision vary in the same manner in all beam search cases. For nucleus sampling, precision usually peaks at a top-p value of around 0.85-0.90. For very high or very low top-p, precision, recall and F1 take close values. The reverse usually happens to the F1-score. For softmax with temperature, the cooler the setting (lower temperature), the higher the F1-score. The optimum temperature for best performance is around 0.5. My preferred algorithm is nucleus sampling: precision, recall and F1 can be controlled at high top-p to achieve the desired high performance.**

![image.png](attachment:08700db0-fb6a-4f5d-b5db-f85aca264db2.png)

Beam search: Bible para translation plot

![image.png](attachment:ee95f3a1-1dde-4b65-9b97-79dce114d13f.png)

Nucleus Sampling: Bible para dataset

![image.png](attachment:189d7b50-fa83-4ad5-b43d-9da66a7ee862.png)

Softmax with temperature: Bible para dataset plot

**Table 1 calculation**

**Implementation of Beam Search**

In [None]:
import torch

valid_dataset = load_dataset("bible_para",lang1 = 'en', lang2 = 'fr', split = "train[:20%]")
trans_bible_eval = valid_dataset[0:1000]
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenizer
tok = T5Tokenizer.from_pretrained("t5-small")

batch = valid_dataset['translation'][0:10]
input_sequences = [i['en'] for i in batch]
batch_output = [i['fr'] for i in batch]

task_prefix = "translate English to French: "
encoding = tok(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
input_ids = encoding.input_ids

In [None]:
iterations = 10
num_beams = 5

def beam_search(model, inputs, iterations=10, num_beams=5):
  """Decode each row of `inputs` with a fixed number of beam search steps."""
  result = []
  for input_id in inputs:
    # Each hypothesis is (score, token ids); T5 starts decoding from token id 0 (pad)
    seq = [(1, [0])]
    for _ in range(iterations):
      # Run every live hypothesis through the model in one batch
      enc_input = input_id.repeat(len(seq), 1)
      dec_input = torch.tensor([ids for _, ids in seq])
      with torch.no_grad():
        outputs = model(enc_input, decoder_input_ids=dec_input)
      logits = outputs.logits[:, -1, :]
      probs = torch.softmax(logits, 1)
      top_beam = probs.topk(num_beams, 1)

      # Extend each hypothesis with its num_beams most probable next tokens
      temp = []
      for i, (score, ids) in enumerate(seq):
        for k in range(num_beams):
          temp.append((score * top_beam.values[i, k].item(), ids + [top_beam.indices[i, k].item()]))
      # Keep the num_beams highest-scoring hypotheses
      seq = sorted(temp, reverse=True)[:num_beams]
    # Decode the best hypothesis
    result.append(tok.decode(seq[0][1], skip_special_tokens=True))

  return result

model_output_decoded = beam_search(model, input_ids, iterations=iterations, num_beams=num_beams)
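To make the expand-and-prune step of beam search easier to follow without loading T5, here is a minimal sketch on a toy next-token table. `TOY_PROBS` and `toy_beam_search` are hypothetical names introduced only for this illustration, and the sketch sums log-probabilities instead of multiplying probabilities, which ranks hypotheses the same way while avoiding underflow on longer sequences.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token
TOY_PROBS = {
    "<s>": {"a": 0.5, "b": 0.4, "</s>": 0.1},
    "a": {"a": 0.1, "b": 0.3, "</s>": 0.6},
    "b": {"a": 0.1, "b": 0.1, "</s>": 0.8},
    "</s>": {"</s>": 1.0},
}


def toy_beam_search(num_beams=2, max_steps=3):
    # Each hypothesis is (log-probability, tokens so far)
    beams = [(0.0, ["<s>"])]
    for _ in range(max_steps):
        candidates = []
        for log_prob, tokens in beams:
            # Expand the hypothesis with every possible next token
            for token, p in TOY_PROBS[tokens[-1]].items():
                candidates.append((log_prob + math.log(p), tokens + [token]))
        # Prune: keep only the num_beams most probable hypotheses
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams


best_log_prob, best_tokens = toy_beam_search(num_beams=2)[0]
greedy_log_prob, greedy_tokens = toy_beam_search(num_beams=1)[0]
print("beam  :", best_tokens, math.exp(best_log_prob))
print("greedy:", greedy_tokens, math.exp(greedy_log_prob))
```

In this toy table the greedy path commits to the locally best first token and ends with a lower overall probability than the two-beam search, which is the motivation for keeping several hypotheses alive.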

Evaluation

In [None]:
tok_moses = MosesTokenizer('en')
rouge = load_metric('rouge')
# candidates and references are French, so BERTScore is computed with lang="fr"
percision,recall,fscore = bert_score.score(cands=model_output_decoded, refs=batch_output ,lang="fr")
bleu = sacrebleu.corpus_bleu(model_output_decoded, [batch_output]).score #machine translation


df = pd.DataFrame({"dataset": 'bible',
    "BLEU":bleu,
    "BERTSCORE-precision": [float(percision.mean())],
    "BERTSCORE-recall": [float(recall.mean())],
    "BERTSCORE-fscore": [float(fscore.mean())]
    })

df

**Implementation of Nucleus Sampling**

In [None]:
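t = 0.92  # top-p threshold, must lie in (0, 1]; same value as the top_p passed to model.generate earlier
iterations = 10
result = []

for input_id in input_ids:
  seq = [0]  # T5 starts decoding from token id 0 (pad)
  enc_input = input_id.unsqueeze(0)
  for i in range(iterations):
    dec_input = torch.tensor([seq])
    with torch.no_grad():
      outputs = model(enc_input, decoder_input_ids=dec_input)
    # Sort the next-token logits and convert them to probabilities
    sorted_logits, idx = torch.sort(outputs[0][0, -1, :], descending=True)
    sorted_probs = torch.softmax(sorted_logits, dim=0)
    cumsum_probs = torch.cumsum(sorted_probs, 0)
    # Keep the smallest prefix whose cumulative probability reaches t
    probs_p = (cumsum_probs < t).nonzero()
    cutoff = 1 if len(probs_p) == 0 else probs_p[-1][0].item() + 2
    # Sample inside the nucleus (multinomial accepts unnormalized weights)
    choice = torch.multinomial(sorted_probs[:cutoff], 1)
    seq.append(idx[choice].item())
  result.append(tok.decode(seq, skip_special_tokens=True))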
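The truncation step of nucleus sampling can be checked by hand on a tiny distribution. The sketch below is a standalone illustration in plain Python; `top_p_filter` and the toy vocabulary are hypothetical and not part of the notebook's pipeline.

```python
import random


def top_p_filter(probs, top_p):
    """Return the smallest set of (token, prob) pairs, most probable first,
    whose cumulative probability reaches top_p, renormalized to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return [(token, p / total) for token, p in nucleus]


# Hypothetical next-token distribution
toy = {"le": 0.5, "la": 0.3, "les": 0.15, "un": 0.05}

nucleus = top_p_filter(toy, top_p=0.9)
tokens, weights = zip(*nucleus)

# Sample one token from the renormalized nucleus (seeded for repeatability)
rng = random.Random(0)
sample = rng.choices(tokens, weights=weights, k=1)[0]
print(nucleus, sample)
```

With `top_p=0.9` the least probable token is excluded, while a threshold of 1 or more would keep the whole vocabulary and reduce the method to plain sampling.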

In [None]:
model_t5_output_decoded = result
tok_moses = MosesTokenizer('en')
rouge = load_metric('rouge')
# candidates and references are French, so BERTScore is computed with lang="fr"
percision,recall,fscore = bert_score.score(cands=model_t5_output_decoded, refs=batch_output, lang="fr")
# Score the nucleus sampling outputs, not the beam search outputs from the previous section
bleu = sacrebleu.corpus_bleu(model_t5_output_decoded, [batch_output]).score 

df = pd.DataFrame({"dataset": 'bible',
    "BLEU":bleu,
    "BERTSCORE-precision": [float(percision.mean())],
    "BERTSCORE-recall": [float(recall.mean())],
    "BERTSCORE-fscore": [float(fscore.mean())]
    })

df

**Implementation of Softmax with Temperature**

In [None]:
result = []
T = 2  # temperature: values above 1 flatten the distribution, values below 1 sharpen it
iterations = 10

for input_id in input_ids: 
  seq = [0]  # T5 starts decoding from token id 0 (pad)
  enc_input = input_id.unsqueeze(0)
  for i in range(iterations):
    dec_input = torch.tensor([seq])
    with torch.no_grad():
      outputs = model(enc_input, decoder_input_ids=dec_input)
    # Divide the next-token logits by T before the softmax, then sample
    probs = torch.softmax(outputs[0][0, -1, :] / T, 0)
    next_token = torch.multinomial(probs, 1)
    seq.append(next_token.item())
  result.append(tok.decode(seq, skip_special_tokens=True))
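The effect of the temperature on the sampling distribution can be seen on three hypothetical logits. This is a minimal standalone sketch in plain Python (the function name and the logit values are assumptions for illustration), applying the same scaling as the cell above: logits are divided by T before the softmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-token vocabulary
logits = [2.0, 1.0, 0.0]

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, so sampling behaves more like greedy decoding; higher temperatures spread it out and make the output more varied.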

In [None]:
model_output_decoded = result
tok_moses = MosesTokenizer('en')
rouge = load_metric('rouge')

# Recompute BERTScore for the temperature outputs (candidates and references are French)
percision,recall,fscore = bert_score.score(cands=model_output_decoded, refs=batch_output, lang="fr")
bleu = sacrebleu.corpus_bleu(model_output_decoded, [batch_output]).score 
scores_rouge = rouge.compute(predictions = model_output_decoded, references= batch_output) 


df = pd.DataFrame({"dataset": 'bible',
    "BLEU":bleu,
    "BERTSCORE-precision": [float(percision.mean())],
    "BERTSCORE-recall": [float(recall.mean())],
    "BERTSCORE-fscore": [float(fscore.mean())]
    })

df

**TABLE 1 SUMMARY**

![image.png](attachment:785fd27b-d507-4915-bb32-f436bad16d6c.png)

**Question 2.1**

In [None]:
!pip install bertviz
!pip install datasets
!pip install sacrerouge sacrebleu bert-score

!git clone https://github.com/huggingface/transformers.git
!pip install ./transformers/.
!pip install SentencePiece==0.1.95

In [None]:
from transformers import AutoTokenizer, AutoModel, utils
from bertviz import model_view, head_view
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5EncoderModel
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = "t5-small" 
input_text = 'Fear is the path to the dark side. Fear leads to anger. Anger leads to hate. Hate leads to suffering.' 
label_text = 'La peur est le chemin vers le côté obscur : la peur mène à la colère, la colère mène à la haine, la haine mène à la souffrance.'
model = T5ForConditionalGeneration.from_pretrained(model_name, output_attentions=True)  # Configure model to return attention values
tokenizer = T5Tokenizer.from_pretrained(model_name)
input_ids = tokenizer(input_text, return_tensors='pt').input_ids  # Tokenize input text
labels = tokenizer(label_text, return_tensors='pt').input_ids
outputs = model(input_ids = input_ids, labels = labels, output_attentions = True)  # Run model
attention = outputs.encoder_attentions  # Encoder self-attention: one tensor per layer, shape (batch, heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

In [None]:
head_view(attention, tokens)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
for i in range(8):  # t5-small has 8 attention heads per layer
  fig, ax = plt.subplots(figsize=(8, 5))
  sns.heatmap(attention[0][0][i].detach().numpy(), ax=ax)
  ax.set_title(f"Encoder layer 0, head {i}")

**Report: The first thing we observe is that most words attend to the end-of-sentence symbol, whichever visualization we choose. It also appears (most clearly in the last representation) that each word attends mostly to the word right after or right before it. Most of the layers show a diagonal pattern, which means that words are treated largely independently of one another. This hints that some layers have more impact than others in the translation process, hence the proposal in the article 'Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads?' to 'prune' the attentions.**
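The patterns described in this report (attention to the end-of-sentence token, to neighbouring words, and along the diagonal) can be quantified for a single head. The sketch below uses a hypothetical 4x4 attention matrix in plain Python; `attention_shares` is a name introduced for this illustration only.

```python
def attention_shares(matrix):
    """Average attention mass per query token that goes to the last token,
    to adjacent tokens (|i - j| == 1), and to the token itself (diagonal).
    The first two categories can overlap for the second-to-last token."""
    n = len(matrix)
    last = sum(row[-1] for row in matrix) / n
    adjacent = sum(
        matrix[i][j] for i in range(n) for j in range(n) if abs(i - j) == 1
    ) / n
    diagonal = sum(matrix[i][i] for i in range(n)) / n
    return last, adjacent, diagonal


# Hypothetical attention weights for one head; each row sums to 1
toy_attention = [
    [0.1, 0.3, 0.0, 0.6],
    [0.2, 0.1, 0.2, 0.5],
    [0.0, 0.3, 0.1, 0.6],
    [0.0, 0.0, 0.2, 0.8],
]

last, adjacent, diagonal = attention_shares(toy_attention)
print(f"last token: {last:.3f}, adjacent: {adjacent:.3f}, diagonal: {diagonal:.3f}")
```

Assuming the attention tuple from the cells above, a real head could be passed as `attention[0][0][i].tolist()` to compare heads numerically instead of by eye.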

**Question 2.2**

In [None]:
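import numpy as np

# Sum the 8 head matrices of encoder layer 0, dropping the last row and column (the end-of-sequence token)
confidence = np.delete(np.delete(attention[0][0][0].detach().numpy(), -1, 1), -1, 0)
for head in range(1, 8):
  confidence += np.delete(np.delete(attention[0][0][head].detach().numpy(), -1, 1), -1, 0)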

In [None]:
sns.heatmap(confidence/8)

**Report: Here we chose to represent the confidence aggregation method. The confidence here is the mean of the attention weights over the heads of a layer (here the first encoder layer). Thanks to confidence, we observe that the word 'anger' is considered rather independent, while 'path' relates mostly to itself and to 'to'. This aggregated visualization is less precise than the previous observations, but it helps to understand the decision process better, which is the aim of attention. The results are even more striking and easier to interpret for a shorter sentence.**
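The aggregation itself is an element-wise mean over the head matrices. A minimal sketch in plain Python on two hypothetical 2x2 heads (the function name and values are assumptions for illustration; the notebook additionally drops the end-of-sequence row and column before averaging):

```python
def mean_over_heads(heads):
    """Element-wise mean of a list of square attention matrices, one per head."""
    num_heads = len(heads)
    n = len(heads[0])
    return [
        [sum(head[i][j] for head in heads) / num_heads for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical heads that disagree on the first token and agree on the second
toy_heads = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]

confidence_toy = mean_over_heads(toy_heads)
print(confidence_toy)
```

Averaging hides the disagreement between the two heads on the first token, which is why the aggregated map is easier to read but less precise than the per-head views.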