# 0 Preparations
First, install the packages needed in this notebook:

In [1]:
! pip install transformers[torch] datasets evaluate bert_score sacrebleu spacy rouge_score
! pip install git+https://github.com/google-research/bleurt.git


In [2]:
# Download the spaCy model
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# 1 Seq2seq evaluation metrics

### 1.1 You are given a candidate and a reference translation together with the score of a metric. Which type of metric was used? Can you suggest a better metric? Justify your answer!

```
Reference: "My cat loves to watch the birds outside the window."
Candidate: "My cat hates to watch the birds outside the window."
-> score: 0.99
```


BERTScore was used, which is an embedding-based metric. Antonyms such as "loves" and "hates" occur in very similar contexts, so their embeddings are close and the metric barely registers the change in meaning. An overlap-based metric would be a better choice because it at least penalises the mismatching word. A learned metric is another option, as it is trained to detect such inconsistencies.
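To make the overlap argument concrete, here is a minimal sketch of a clipped n-gram precision for the example pair. It is a simplified stand-in for BLEU-style matching, not the sacreBLEU implementation: tokens are lowercased and split on whitespace, and `ngram_precision` is a hypothetical helper name.

```python
from collections import Counter


def ngram_precision(reference: str, candidate: str, n: int = 1) -> float:
    """Clipped n-gram precision of the candidate with respect to the reference."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[ngram]) for ngram, count in cand_ngrams.items())
    return matched / sum(cand_ngrams.values())


reference = "My cat loves to watch the birds outside the window."
candidate = "My cat hates to watch the birds outside the window."

print(f"unigram precision: {ngram_precision(reference, candidate, 1):.3f}")
print(f"bigram precision:  {ngram_precision(reference, candidate, 2):.3f}")
```

A single swapped word removes one of ten unigrams and two of nine bigrams, so the overlap score drops visibly, even though it cannot tell that the swap inverts the meaning.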

### 1.2 You want to train a machine translation system but you only have a few thousand aligned sentences. Are there metrics that are especially suited for this low-resource scenario? Why?

Metrics that evaluate the candidate against the source instead of a reference translation, i.e. reference-free metrics, are especially suited here. You can build the validation and test sets from monolingual source-language data and use all of the available aligned data for training.

### 1.3 Your friend tells you this: "I cannot use a learned metric for my task because my data is from a very special domain and there will be a domain mismatch." - Is she right? Is she missing something?

Most learned metrics are trained on data that is available at large scale, e.g. from the news domain. A domain mismatch is therefore likely when evaluating on a very specialised domain. Nevertheless, learned metrics are adaptable and can be fine-tuned for a specific task or domain, so they are still worth considering.

## 1.4 Recreate the scores from the lecture slides with Huggingface evaluate

In [3]:
%%capture
from evaluate import load # use the Huggingface evaluate implementations
bertscore = load("bertscore")
bleu = load("sacrebleu")
bleurt = load("bleurt", module_type="metric", checkpoint="Elron/bleurt-base-128")



In [5]:
print(bleu.compute(predictions=["My weekend was bad"], references=["My weekend was superb"])['score'])
print(bleu.compute(predictions=["At the weekend, we ate my grandma's house."], references=["At the weekend, we visited my grandma's house and ate cake."])['score'])
print(bleu.compute(predictions=["At the weekend, we visited my grandma's house. And we ate cake."], references=["At the weekend, we visited my grandma's house and ate cake."])['score'])

{'score': 59.460355750136046, 'counts': [3, 2, 1, 0], 'totals': [4, 3, 2, 1], 'precisions': [75.0, 66.66666666666667, 50.0, 50.0], 'bp': 1.0, 'sys_len': 4, 'ref_len': 4}
41.154215810165745
64.75445426291287
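The dictionary printed for the first pair shows how the BLEU score is assembled: it is the geometric mean of the four n-gram precisions, multiplied by the brevity penalty `bp`. The following sketch recomputes the score from the values in that output. `bleu_from_stats` is a hypothetical helper name, and the smoothed 4-gram precision of 50.0 is taken from the output as given rather than recomputed.

```python
import math


def bleu_from_stats(precisions, brevity_penalty):
    """Combine n-gram precisions (in percent) and a brevity penalty into a BLEU score."""
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * brevity_penalty * math.exp(log_mean)


# Values copied from the output for "My weekend was bad" vs. "My weekend was superb"
precisions = [75.0, 200.0 / 3.0, 50.0, 50.0]
brevity_penalty = 1.0

print(bleu_from_stats(precisions, brevity_penalty))  # approximately 59.46
```

The product of the four precisions is 0.75 · 0.667 · 0.5 · 0.5 = 0.125, and its fourth root is about 0.5946, which matches the reported score of 59.46.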


In [16]:
# Helper that prints several metric scores for one reference-candidate pair
def evaluate_and_compare_scores(reference: str, candidate: str, language: str = 'en') -> None:
    print("Reference: ", reference)
    print("Candidate: ", candidate)

    score_bleu = bleu.compute(predictions=[candidate], references=[reference], smooth_method='none')['score']
    print(f"BLEU: {score_bleu}")
    score_bertscore = bertscore.compute(predictions=[candidate], references=[reference], lang=language)['f1']
    print(f"BERTScore: {score_bertscore}")
    score_bleurt = bleurt.compute(predictions=[candidate], references=[reference])['scores']
    print(f"BLEURT: {score_bleurt}")

In [19]:
####################################################################
# TODO Come up with your own examples and try to fool the metrics
# Can you make further observations?
####################################################################
ref = "My cat loves to watch the birds outside the window."
cands = ["My cat hates to watch the birds outside the window."]
####################################################################
for cand in cands:
    evaluate_and_compare_scores(ref, cand)
    print('***')

ref_de = "Dieses Haus ist in einer großen Stadt."
cand_de = "Das Haus in einer großen Stadt ist."
#evaluate_and_compare_scores(ref_de, cand_de, language='de')

Reference:  My cat loves to watch the birds outside the window.
Candidate:  My cat hates to watch the birds outside the window.
BLEU: 74.19446627365011
BERTScore: [0.989577054977417]
BLEURT: [0.04989204928278923]
***
***


In [20]:
####################################################################
# TODO Look at the Huggingface metrics page (https://huggingface.co/metrics)
# Select two additional metrics and test them on our sample sentences
# Note!: you may have to install additional packages to use these metrics!
####################################################################
metric1 = load("rouge")
metric2 = load("ter")
####################################################################

for cand in cands:
  print("Reference: ", ref)
  print("Candidate: ", cand)
  print(f"{metric1.name}: ", metric1.compute(predictions=[cand], references=[ref]))
  print(f"{metric2.name}: ", metric2.compute(predictions=[cand], references=[ref]))

Reference:  My cat loves to watch the birds outside the window.
Candidate:  My cat hates to watch the birds outside the window.
rouge:  {'rouge1': 0.9, 'rouge2': 0.7777777777777778, 'rougeL': 0.9, 'rougeLsum': 0.9}
ter:  {'score': 10.0, 'num_edits': 1, 'ref_length': 10.0}
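The TER output can be reproduced by hand for this pair: one substitution ("loves" to "hates") against a ten-word reference gives 1 / 10 = 10%. The following sketch is a simplification, assuming whitespace tokenization and ignoring the block-shift operation that full TER also allows. `word_edit_distance` and `simple_ter` are hypothetical names.

```python
def word_edit_distance(ref_tokens, cand_tokens):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(cand_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        current = [i]
        for j, cand_tok in enumerate(cand_tokens, start=1):
            cost = 0 if ref_tok == cand_tok else 1
            current.append(min(previous[j] + 1,          # delete a reference word
                               current[j - 1] + 1,       # insert a candidate word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def simple_ter(reference: str, candidate: str) -> float:
    """Edits divided by reference length, in percent (no shifts: a simplification)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    return 100.0 * word_edit_distance(ref_tokens, cand_tokens) / len(ref_tokens)


ref = "My cat loves to watch the birds outside the window."
cand = "My cat hates to watch the birds outside the window."
print(simple_ter(ref, cand))
```

Note that TER is an error rate, so lower is better, in contrast to BLEU, ROUGE, BERTScore and BLEURT.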


## 1.5 Explain the predicted scores

Instead of using the Huggingface evaluate library, you can also load the scoring models directly with the transformers library. This lets you apply any explainability framework that works with Huggingface models to explain a score.

In [14]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [21]:
#%%capture
model_name = "Elron/bleurt-base-128"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_bleurt = AutoModelForSequenceClassification.from_pretrained(model_name)
model_bleurt.eval()

def predict_bleurt_score(reference:str, candidate:str) -> None:
    print("Reference: ", reference)
    print("Candidate: ", candidate)
    ####################################################################
    # TODO Tokenize the reference and candidate and feed the tokenizer
    # output into the model. Print the score prediction.
    ####################################################################
    tokenizer_out = tokenizer([reference], [candidate], return_tensors='pt', padding=True, truncation=True)
    print(tokenizer.batch_decode(tokenizer_out['input_ids']))
    print(model_bleurt(**tokenizer_out).logits.item())
    ####################################################################

In [22]:
ref = ("At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. "
  "Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids.")

cand = ("At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. It was really delicious! "
  "Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids.")

cand2 = ("At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. "
  "Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids. It was really delicious!")
predict_bleurt_score(ref, cand)
print('***')
predict_bleurt_score(ref, cand2)

Reference:  At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids.
Candidate:  At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. It was really delicious! Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids.
["[CLS] at the weekend, we visited my grandma's house and ate cake. she has baked a chocolate cake especially for me as it is my favourite cake. afterwards, we went for a long walk across the fields. the weather was superb and we saw a lot of birds, squirrels and even some wild rabbids [SEP] at the weekend, we visited my grandma's house and ate cake. she has baked a chocolate cake especia

### Both candidates hallucinate "It was really delicious!". However, the second candidate does not seem to get punished for it. Can you think of an explanation why?


# 2 Faithfulness

In this section, we fine-tune a question generation system to create a question-answering based hallucination detection system.

The steps for such a system are:


1.   Answer span extraction
2.   Question generation
3.   Question answering
4.   Answer comparison
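The four steps above can be sketched as one function that takes the individual components as arguments. This is only an illustrative skeleton: `qa_consistency_check` and the toy stand-ins are hypothetical, and the toy lookups simply hard-code one question and its two answers in place of real models.

```python
from typing import Callable, Dict, List


def qa_consistency_check(
    source: str,
    candidate: str,
    extract_answers: Callable[[str], List[str]],
    generate_question: Callable[[str, str], str],
    answer_question: Callable[[str, str], str],
) -> List[Dict[str, object]]:
    """Run the four pipeline steps for one source-candidate pair."""
    report = []
    for answer in extract_answers(candidate):                    # 1. answer span extraction
        question = generate_question(answer, candidate)          # 2. question generation
        answer_candidate = answer_question(question, candidate)  # 3. question answering
        answer_source = answer_question(question, source)
        # 4. answer comparison (here: case-insensitive exact match)
        consistent = answer_candidate.strip().lower() == answer_source.strip().lower()
        report.append({
            "question": question,
            "answer_candidate": answer_candidate,
            "answer_source": answer_source,
            "consistent": consistent,
        })
    return report


# Toy stand-ins (assumptions for illustration, not real models)
source = "John became an older brother because Mary gave birth to a girl."
candidate = "John gave birth to a girl."
toy_answers = {candidate: ["John"]}
toy_qa = {
    ("Who gave birth to a girl?", candidate): "John",
    ("Who gave birth to a girl?", source): "Mary",
}

report = qa_consistency_check(
    source,
    candidate,
    extract_answers=lambda text: toy_answers[text],
    generate_question=lambda answer, context: "Who gave birth to a girl?",
    answer_question=lambda question, context: toy_qa[(question, context)],
)
print(report)
```

Passing the components in as callables keeps the skeleton independent of the concrete models, so each step can be swapped without touching the loop.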



In [1]:
####################################################################
# TODO think of additional candidates that you want to evaluate
####################################################################
source = "John became an older brother because Mary gave birth to a girl."
candidates = [
    "Mary had a baby.",
    "John gave birth to a girl.",
    "John has a younger sister."
]
####################################################################

## 2.1 Answer span extraction

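For simplicity, we only consider noun answers.

Parse the candidates with spaCy and extract the nouns. Note that the solution in this notebook keeps proper nouns (`PROPN`) only, so common nouns such as "baby" or "girl" are not used as answers.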

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Parse the first candidate and print its annotations.
doc = nlp(candidates[0])
for token in doc:
  print(token.text, token.dep_, token.pos_, token.morph)

Mary nsubj PROPN Number=Sing
had ROOT VERB Tense=Past|VerbForm=Fin
a det DET Definite=Ind|PronType=Art
baby dobj NOUN Number=Sing
. punct PUNCT PunctType=Peri


In [3]:
# Extract all nouns from the candidates

answers = {candidate: [] for candidate in candidates}
for candidate in candidates:
  ####################################################################
  # TODO parse the candidate with spacy and append all noun tokens to
  # the answers of that candidate
  ####################################################################
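  for token in nlp(candidate):
    # Only proper nouns (PROPN) are kept here; test for
    # token.pos_ in ("NOUN", "PROPN") to also collect common nouns.
    if token.pos_ == "PROPN":
      answers[candidate].append(token.text)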
  ####################################################################
answers

{'Mary had a baby.': ['Mary'],
 'John gave birth to a girl.': ['John'],
 'John has a younger sister.': ['John']}

## 2.2.1 Train a question generation system

In [4]:
# Load the SQuAD dataset
from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)
squad["train"][0]



{'id': '56d3a9282ccc5a1400d82dc8',
 'title': 'Frédéric_Chopin',
 'context': 'Polish composers of the following generation included virtuosi such as Moritz Moszkowski, but, in the opinion of J. Barrie Jones, his "one worthy successor" among his compatriots was Karol Szymanowski (1882–1937). Edvard Grieg, Antonín Dvořák, Isaac Albéniz, Pyotr Ilyich Tchaikovsky and Sergei Rachmaninoff, among others, are regarded by critics as having been influenced by Chopin\'s use of national modes and idioms. Alexander Scriabin was devoted to the music of Chopin, and his early published works include nineteen mazurkas, as well as numerous études and preludes; his teacher Nikolai Zverev drilled him in Chopin\'s works to improve his virtuosity as a performer. In the 20th century, composers who paid homage to (or in some cases parodied) the music of Chopin included George Crumb, Bohuslav Martinů, Darius Milhaud, Igor Stravinsky and Heitor Villa-Lobos.',
 'question': "Who was Chopin's worthy successor accor

In [5]:
# Load the model's tokenizer
from transformers import AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer_args = {
    #"padding": "max_length",
    #"return_tensors": "pt",
    "truncation": True
}

In [6]:
def prompt_pattern(answer, context):
  ####################################################################
  # TODO Design a prompt pattern for the question generation
  ####################################################################
  prompt = f"answer: {answer} context: {context}"
  ####################################################################
  return prompt

def preprocess(samples):
  ####################################################################
  # TODO Write a preprocessing function:
  # 1. Combine the answers and the contexts in a prompt
  # 2. Tokenize the inputs
  # 3. Tokenize the questions
  ####################################################################
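  # 1. Build one prompt per sample from its first gold answer and its context
  text_inputs = [prompt_pattern(answer["text"][0], context) for answer, context in zip(samples["answers"], samples["context"])]
  # 2. Tokenize the prompts (calling the tokenizer directly replaces the deprecated batch_encode_plus)
  inputs = tokenizer(text_inputs, **tokenizer_args)
  # 3. Tokenize the questions as labels, truncated to at most 64 tokens
  inputs["labels"] = tokenizer(samples["question"], **tokenizer_args, max_length=64)["input_ids"]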
  ####################################################################
  return inputs

tokenized_squad = squad.map(preprocess, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [7]:
# Load the model
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(model_name)

In [8]:
# Train the model
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_qg_model",
    ####################################################################
    # Set the hyperparameters for training
    ####################################################################
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    #gradient_accumulation_steps=4,
    per_device_eval_batch_size=4,
    num_train_epochs=4,
    weight_decay=0.01,
    predict_with_generate=True, #!!
    push_to_hub=False
    ####################################################################
)

data_collator = DataCollatorForSeq2Seq(tokenizer)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.745163        |
| 2     | 2.075200      | 1.707578        |
| 3     | 2.075200      | 1.692752        |
| 4     | 1.904500      | 1.687621        |


TrainOutput(global_step=1000, training_loss=1.9898489990234376, metrics={'train_runtime': 505.6829, 'train_samples_per_second': 31.64, 'train_steps_per_second': 1.978, 'total_flos': 2145728286720000.0, 'train_loss': 1.9898489990234376, 'epoch': 4.0})
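The numbers in the training output can be checked with a little arithmetic. Assuming a single device and no gradient accumulation, 4000 training examples at a batch size of 16 give 250 optimizer steps per epoch, so four epochs end at the reported `global_step=1000`. The Trainer's default `logging_steps` of 500 would also explain why epoch 1 shows "No log" and why epochs 2 and 3 report the same training loss.

```python
import math

# Values taken from the cells above; single device and no gradient
# accumulation are assumptions.
num_train_examples = 4000
per_device_train_batch_size = 16
num_train_epochs = 4
logging_steps = 500  # Trainer default

steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a training loss is logged, and the epoch each one falls in
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))
logged_epochs = [step / steps_per_epoch for step in logged_steps]

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"loss logged at steps {logged_steps}, i.e. after epochs {logged_epochs}")
```

With these settings the first training loss is only available after epoch 2, so a smaller `logging_steps` value would be needed to see a loss for every epoch.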

In [9]:
model.save_pretrained("my_awesome_qg_model")

## 2.2.2 Generate questions

In [10]:
from transformers import pipeline

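# Load the model from the directory it was saved to in the previous cell
# (a relative path also works outside of Colab's /content working directory)
question_generator = pipeline("text2text-generation", model="my_awesome_qg_model", tokenizer=tokenizer)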

In [11]:
questions = {candidate: [] for candidate in candidates}
for candidate in candidates:
  ####################################################################
  # TODO Use the trained model to generate questions for our samples
  ####################################################################
  for answer in answers[candidate]:
    questions[candidate].append(question_generator(prompt_pattern(answer, candidate))[0]["generated_text"])
  ####################################################################
questions



{'Mary had a baby.': ['What did Mary do to her baby?'],
 'John gave birth to a girl.': ['Who gave birth to a girl?'],
 'John has a younger sister.': ['Who is the younger sister?']}

## 2.3 Question answering

Open the [HuggingFace model hub](https://huggingface.co/models) and search for a suitable question answering model.

In [12]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

####################################################################
# TODO Load the model and write a function to call the model and
# retrieve the answer based on the context
####################################################################
model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# T5 is an encoder-decoder model, so it is loaded with AutoModelForSeq2SeqLM
# (AutoModelWithLMHead is deprecated)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def question_answering(question, context):
  # Build the prompt; `prompt` avoids shadowing the built-in `input`
  prompt = f"question: {question} context: {context}"
  encoded_input = tokenizer([prompt],
                              return_tensors='pt',
                              max_length=512,
                              truncation=True)
  output = model.generate(input_ids=encoded_input.input_ids,
                              attention_mask=encoded_input.attention_mask)
  # Decode the generated token ids back into the answer string
  return tokenizer.decode(output[0], skip_special_tokens=True)
####################################################################



In [13]:
for candidate in candidates:
  print("****", candidate)
  for answer, question in zip(answers[candidate], questions[candidate]):
    print("\t", question)
    print("\t\t Original answer:", answer)
    print("\t\t Answer candidate:", question_answering(question, candidate))
    print("\t\t Answer source:", question_answering(question, source))

**** Mary had a baby.
	 What did Mary do to her baby?
		 Original answer: Mary
		 Answer candidate: Mary had a baby.
		 Answer source: 
**** John gave birth to a girl.
	 Who gave birth to a girl?
		 Original answer: John
		 Answer candidate: John
		 Answer source: Mary
**** John has a younger sister.
	 Who is the younger sister?
		 Original answer: John
		 Answer candidate: John
		 Answer source: John
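Step 4 of the pipeline, answer comparison, is only done by eye in the output above. A minimal sketch of an automatic comparison follows, assuming SQuAD-style answer normalization and token-level F1 as the similarity measure. Both are choices made here for illustration, not something the notebook prescribes, and `normalize_answer` and `token_f1` are hypothetical helper names.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(answer_candidate: str, answer_source: str) -> float:
    """Token-level F1 between the two normalized answers."""
    cand_tokens = normalize_answer(answer_candidate).split()
    src_tokens = normalize_answer(answer_source).split()
    if not cand_tokens or not src_tokens:
        # Two empty answers agree; one empty answer does not.
        return float(cand_tokens == src_tokens)
    overlap = sum((Counter(cand_tokens) & Counter(src_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(src_tokens)
    return 2 * precision * recall / (precision + recall)


# (answer from candidate, answer from source) pairs copied from the output above
answer_pairs = [("Mary had a baby.", ""), ("John", "Mary"), ("John", "John")]
for answer_candidate, answer_source in answer_pairs:
    print(repr(answer_candidate), repr(answer_source), token_f1(answer_candidate, answer_source))
```

A low score flags a possible hallucination, but it can also come from a poorly generated question, as in the first pair, so the flagged cases still need inspection.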


### **Discussion**
*  Did you find any hallucinations?
*  What kind of hallucinations cannot be detected with such a system?
*  What system could you use for these hallucinations?


