<a href="https://colab.research.google.com/github/MJavadHzr/LLMs/blob/master/ETH_LLM_Assignment2_Q4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2 - RAG

Parts which require your interaction are marked with `TODO:`

In [1]:
%%capture
# Install the relevant dependencies
!pip3 install datasets sentence_transformers tqdm numpy

Imagine your task is to build a question-answering (QA) system for a company. You are given a language model and have to build this product on top of it.
The system must adapt very quickly to new data without any training.
For this, we will use **Retrieval Augmented Generation (RAG)**.
The company insists that you use their in-house language model trained on multiple tasks, a _flan-t5-small_.
You can test its QA functionality by asking the question _"When was ETH founded?"_:

In [2]:
# Example inference with the model.
# TODO: run me to test the environment

# TODO: if you are using Colab, make sure to go to Runtime->Change runtime type and select GPU
# if you are running this without GPU (not recommended), remove `device=0`.
from transformers import pipeline
vanilla_qa_pipe = pipeline("text2text-generation", model="google/flan-t5-small", device=0, truncation=True)

QUESTION = "QUESTION: When was ETH founded?"

vanilla_qa_pipe(f"{QUESTION} ANSWER:", max_new_tokens=10)[0]["generated_text"]

'1897'

In [3]:
vanilla_qa_pipe(f"""
        CONTEXT: ETH Zurich (German: Eidgenoessische Technische Hochschule Zurich; English:
        Federal Institute of Technology Zurich) is a public research university in Zurich,
        Switzerland. Founded in 1854 with the stated mission to educate engineers and scientists,
        the university focuses primarily on science, technology, engineering, and mathematics. It
        consistently ranks among the top universities in the world and its 16 departments span a
        variety of disciplines and subjects.
        {QUESTION}
        ANSWER:
    """,
    max_new_tokens=10
)[0]["generated_text"]

'1854'

The first output is 1897, which is incorrect: ETH was founded in 1854.

This is not a problem: we can use RAG to automatically provide a passage from an [external source](https://en.wikipedia.org/wiki/ETH_Zurich) and let the model answer from it. Concatenating the first paragraph of the Wikipedia article with the question makes the model yield the correct answer, 1854.

In [4]:
# Define model function; do not modify
from typing import List

def rag_qa_pipe(question: str, passages: List[str]) -> str:
    """
    Define the RAG pipeline which concatenates passages to the question.
    :param question: Question text.
    :param passages: Relevant text passages.
    :return: Generated text from the pipeline.
    """
    passages = "\n".join([f"CONTEXT: {c}" for c in passages])
    return vanilla_qa_pipe(f"{passages}\nQUESTION: {question}\nANSWER: ", max_new_tokens=10)[0]["generated_text"]

To make sure you understand the function `rag_qa_pipe`, ask it a question both without and with relevant context.

In [5]:
# TODO: call rag_qa_pipe with a question of your choice, just to test the function

context="""
Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held as one of the most influential scientists. Best known for developing the theory of relativity, Einstein also made important contributions to quantum mechanics. His mass–energy equivalence formula E = mc2, which arises from special relativity, has been called "the world's most famous equation". He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory.
"""

print(rag_qa_pipe("When was Einstein born?", []))
print(rag_qa_pipe("When was Einstein born?", [context]))

18 September 1897
14 March 1879


Start with the provided model and the first 500 questions from the validation part of the _SQuAD_ dataset. Each question in the dataset is linked to a ground-truth Wikipedia passage that you can use directly.

Then, compute the QA performance of the model with and without prepended passage using `rag_qa_pipe(question, passages)`.

Report the average case-sensitive answer exact match (model output is identical to the gold answer, EM) and case-insensitive [answer F1 scores](https://kierszbaumsamuel.medium.com/f1-score-in-nlp-span-based-qa-task-5b115a5e7d41) (F1) for both setups.
Because each question has multiple possible answers, take the maximum score for a model answer across all gold answers.
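
As an illustration of the scoring rule described above, here is a minimal sketch of token-level F1 with the maximum taken over several gold answers. The names `token_f1` and `best_over_golds` and the toy answers are hypothetical and only serve the example; tokens are assumed to be whitespace-separated words.

```python
from collections import Counter
from typing import List


def token_f1(pred: str, gold: str) -> float:
    """Case-insensitive F1 over whitespace-separated tokens."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection: each gold token can be matched at most once.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(pred: str, golds: List[str]) -> float:
    """Score a prediction against every gold answer and keep the best score."""
    return max(token_f1(pred, gold) for gold in golds)


# Toy gold answers (hypothetical), as SQuAD provides several per question.
golds = ["Denver Broncos", "The Denver Broncos"]
print(best_over_golds("denver broncos", golds))  # 1.0
print(best_over_golds("Broncos", golds))         # partial credit, best gold wins
```

Taking the maximum means a prediction is not punished for matching the shorter of two equally valid gold spans.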

In [6]:
# baseline model evaluation
# TODO: this cell requires <30 new lines

import tqdm
import numpy as np
from datasets import load_dataset
dataset = load_dataset("rajpurkar/squad")

def metric_exact_match(ans_pred: str, ans_true: str) -> float:
    """
    Case-sensitive answer exact match, model output is identical to the gold answer.
    :param ans_pred: Predicted answer
    :param ans_true: Ground truth answer
    :return: 1. if the answers are the same, 0. otherwise
    """
    # TODO: ~1 line
    return 1. * (ans_pred == ans_true)

def metric_f1(ans_pred: str, ans_true: str) -> float:
    """
    Case-insensitive answer F1 score. Use white-space separated words as "tokens".
    :param ans_pred: Predicted answer.
    :param ans_true: Ground truth answer.
    :return: F1 score between the predicted and ground truth answers.
    """
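    # TODO: ~10 lines
    from collections import Counter

    # Lowercase for case-insensitivity; split() also collapses repeated whitespace.
    pred_tokens = ans_pred.lower().split()
    true_tokens = ans_true.lower().split()
    # Multiset overlap, so a repeated predicted token cannot match one gold token twice.
    shared_tokens = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if shared_tokens == 0:
        return 0.
    precision = shared_tokens / len(pred_tokens)
    recall = shared_tokens / len(true_tokens)
    return 2 * precision * recall / (precision + recall)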


w_f1, w_em = [], []
wo_f1, wo_em = [], []
for line in tqdm.tqdm(dataset["validation"].select(range(500))):
    # hint: use `line["question"]`, `line["context"]`, and `line["answers"]`
    # TODO: run with and without prepended passage
    w_pred = rag_qa_pipe(line['question'], passages=[line['context']])
    wo_pred = rag_qa_pipe(line['question'], passages=[])

    w_f1.append(np.max([metric_f1(w_pred, gt) for gt in line['answers']['text']]))
    w_em.append(np.max([metric_exact_match(w_pred, gt) for gt in line['answers']['text']]))

    wo_f1.append(np.max([metric_f1(wo_pred, gt) for gt in line['answers']['text']]))
    wo_em.append(np.max([metric_exact_match(wo_pred, gt) for gt in line['answers']['text']]))


# TODO: Print mean of the exact match and mean of F1 scores for the model with and without prepended passage
print(f'\nw passages:\tF1:{np.mean(w_f1):.2f},\tEM: {np.mean(w_em):.2f}')
print(f'w/o passages:\tF1:{np.mean(wo_f1):.2f},\tEM: {np.mean(wo_em):.2f}')

w passages:	F1:0.77,	EM: 0.71
w/o passages:	F1:0.08,	EM: 0.02





You will likely see improvements in scores by providing a passage to the model.

In contrast to the previous evaluation, at inference time in a real-world scenario we do not have access to the ground-truth passage.
All we have access to is the question from a user.
Luckily, the company is providing you with an unstructured knowledge base. This could be the whole of Wikipedia, but in our scenario we use all the passages from the SQuAD validation split and shuffle them to remove any existing structure.

In [7]:
import random

kb = list(set(dataset["validation"]["context"]))

# make sure that there is no remaining structure
random.Random(42).shuffle(kb)
print(len(kb), "passages in the knowledge base")

2067 passages in the knowledge base


Now, whenever we receive a question, we need to find the relevant passage(s) in the knowledge base and put them in the model input.
This is a non-trivial task, and the whole research field of Information Retrieval is devoted to it.


We are going to convert all the knowledge base passages into vectors using TF-IDF and the provided embedding model ([bert-base-nli-mean-tokens](https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens)).
The model inference is already implemented for you, but you need to fill in all the functions in the `KnowledgeBase` class.
You will need to implement the retrieval and the three distance metrics (Euclidean, cosine, inner product).
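
To make the TF-IDF idea concrete, the following is a minimal sketch that weights each term by its count in the passage times the log of its inverse document frequency. It is a textbook variant written for illustration, not what the notebook runs: scikit-learn's `TfidfVectorizer` differs in details such as tokenization and idf smoothing. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors using raw term counts and idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    return [
        {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in Counter(tokens).items()
        }
        for tokens in tokenized
    ]


docs = [
    "eth zurich is a university",
    "einstein studied at eth zurich",
    "the super bowl is a game",
]
vectors = tfidf_vectors(docs)

# A term that occurs in a single passage gets a larger weight than a shared one.
print(vectors[0]["university"] > vectors[0]["eth"])  # True
```

Rare terms dominate the vector, which is why TF-IDF retrieval works best when the question shares distinctive words with the passage.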

We need to build an abstraction for the knowledge base. It needs to support:
- adding new keys (vectors) and their corresponding values
- retrieving the values of the closest keys for a given query key, based on 3 vector distance metrics

The implementation does not need to be efficient.

Hint: it's ok to just add all the elements to a list and on retrieval sort the list by the distance.
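
The hint above can be sketched as follows: keep the key-value pairs in a plain list, sort by distance to the query at retrieval time, and return the first k values. This is a standalone illustration in pure Python; `brute_force_top_k`, `l2_distance`, and the toy `store` are hypothetical names and not part of the assignment skeleton.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_top_k(
    query: Vector,
    items: List[Tuple[Vector, Any]],
    distance: Callable[[Vector, Vector], float],
    k: int = 1,
) -> List[Any]:
    """Return the values of the k stored keys closest to the query."""
    # Sorting everything costs O(n log n) per query, which is fine for a small store.
    ranked = sorted(items, key=lambda item: distance(query, item[0]))
    return [value for _, value in ranked[:k]]


store = [([0.0, 0.0], "origin"), ([1.0, 1.0], "diagonal"), ([5.0, 0.0], "far")]
print(brute_force_top_k([0.9, 0.8], store, l2_distance, k=2))  # ['diagonal', 'origin']
```

Passing the distance function as an argument keeps the retrieval logic identical for all three metrics.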

In [8]:
# Knowledge base building. This cell requires <20 new lines.
from typing import Literal, List, Any

Vec = List
Val = Any

class KnowledgeBase:
    def __init__(self, dim: int):
        """
        Initialize a knowledge base with a given dimensionality.
        :param dim: the dimensionality of the vectors to be stored
        """
        # TODO: initialize a persistent structure, such as a simple list
        self.dim = dim
        self.data = list()
        # Map metric names to distance functions once, rather than on every insert.
        self.dist_metrics = {
            'l2': self._sim_euclidean,
            'cos': self._sim_cosine,
            'ip': self._sim_inner_product
        }

    def add_item(self, key: Vec, val: Val):
        """
        Store the key-value pair in the knowledge base.
        :param key: key
        :param val: value
        """
        # TODO: add to the persistent structure
        self.data.append({'key': key, 'val': val})

    def retrieve(
        self, key: Vec, metric: Literal['l2', 'cos', 'ip'], k: int = 1
    ) -> List[Val]:
        """
        Retrieve the top k values from the knowledge base given a key and similarity metric.
        :param key: key
        :param metric: Similarity metric to use.
        :param k: Top k similar items to retrieve.
        :return: List of top k similar values.
        """
        # TODO: retrieve the k closest vectors and return their corresponding values
        # Hint: this does not have to be efficient, feel free to just sort the whole persistent structure and return the top k
        dists = [self.dist_metrics[metric](key, x['key']) for x in self.data]
        # Smaller distance means more similar, so take the k smallest.
        return [self.data[i]['val'] for i in np.argsort(dists)[:k]]

    @staticmethod
    def _sim_euclidean(a: Vec, b: Vec) -> float:
        """
        Compute Euclidean (L2) distance between two vectors.
        :param a: Vector a
        :param b: Vector b
        :return: Distance (smaller means more similar)
        """
        # hint: use numpy
        # TODO: compute the Euclidean distance between two vectors
        return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

    @staticmethod
    def _sim_cosine(a: Vec, b: Vec) -> float:
        """
        Compute the cosine distance (1 - cosine similarity) between two vectors.
        :param a: Vector a
        :param b: Vector b
        :return: Distance (smaller means more similar)
        """
        # hint: use numpy
        # TODO: compute the cosine distance between two vectors
        return float(1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    @staticmethod
    def _sim_inner_product(a: Vec, b: Vec) -> float:
        """
        Compute the inner-product distance (1 - inner product) between two vectors.
        :param a: Vector a
        :param b: Vector b
        :return: Distance (smaller means more similar)
        """
        # hint: use numpy
        # TODO: compute the inner-product distance between two vectors
        return float(1 - np.dot(a, b))


In [9]:
# Build knowledge base index
# In the ideal case this does not need to be changed and can just be run.
# Make modifications if you feel they are necessary.

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# Sparse retrieval using TF-IDF - vectorize with tfidf and retrieve
vectorizer = TfidfVectorizer(max_features=768, norm=None)
kb_vectorized = np.asarray(vectorizer.fit_transform(kb).todense())
kb_index_tfidf = KnowledgeBase(dim=768)
for passage_index, passage_embd in enumerate(kb_vectorized):
    kb_index_tfidf.add_item(passage_embd.squeeze(), passage_index)

# Dense retrieval using Sentence Transformers
model_embd = SentenceTransformer("bert-base-nli-mean-tokens").to("cuda:0")
kb_index_embed = KnowledgeBase(dim=768)
for passage_index, passage in enumerate(tqdm.tqdm(kb)):
    kb_index_embed.add_item(model_embd.encode(passage).squeeze(), passage_index)


For the same first 500 questions from the validation split, evaluate how often the top retrieved passage is the correct one (Recall@1) and how often the correct passage is among the top 5 retrieved (Recall@5).
Perform the retrieval with the three distance metrics: Euclidean distance, cosine distance, and inner product. With two vectorizations, two values of k, and three metrics, the result should be 12 numbers.
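
As a small worked example of the metric, the sketch below computes Recall@k from ranked lists of passage indices. The function name `recall_at_k` and the toy rankings are hypothetical; since each question here has exactly one gold passage, Recall@k reduces to the fraction of questions whose gold passage appears in the top k.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[int]], gold: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold passage index is among the first k retrieved."""
    hits = sum(g in list(ranked[:k]) for ranked, g in zip(retrieved, gold))
    return hits / len(gold)


# Toy rankings (hypothetical passage indices), best match first.
rankings = [[3, 7, 1], [2, 0, 5], [4, 9, 8], [6, 1, 0]]
gold_indices = [3, 5, 0, 1]

print(recall_at_k(rankings, gold_indices, k=1))  # 0.25
print(recall_at_k(rankings, gold_indices, k=3))  # 0.75
```

Recall@k can only stay the same or grow as k increases, because the top-k list always contains the top-1 list.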

In [10]:
import itertools

def evaluate_retrieval(metric, k):
    embed_succ, tfidf_succ = 0, 0
    for line in tqdm.tqdm(dataset["validation"].select(range(500))):
        # embed query
        q_embed = model_embd.encode(line['question']).squeeze()
        q_tfidf = np.asarray(vectorizer.transform([line['question']]).todense()).reshape(-1)

        # retrieve closest indices
        embed_retrieved = kb_index_embed.retrieve(q_embed, metric=metric, k=k)
        tfidf_retrieved = kb_index_tfidf.retrieve(q_tfidf, metric=metric, k=k)

        # get gt indices
        gt_index = kb.index(line['context'])
        embed_succ += gt_index in embed_retrieved
        tfidf_succ += gt_index in tfidf_retrieved

    return embed_succ, tfidf_succ


metrics = ["l2", "cos", "ip"]
K = [1, 5]
for k, metric in itertools.product(K, metrics):
    print(f'k: {k}, metric: {metric.upper()}')
    embed_succ, tfidf_succ = evaluate_retrieval(metric, k)
    print(f'Embed: {embed_succ/500:.2f}\tTF-IDF: {tfidf_succ/500:.2f}')

k: 1, metric: L2
Embed: 0.23	TF-IDF: 0.01
k: 1, metric: COS
Embed: 0.24	TF-IDF: 0.13
k: 1, metric: IP
Embed: 0.21	TF-IDF: 0.04
k: 5, metric: L2
Embed: 0.52	TF-IDF: 0.04
k: 5, metric: COS
Embed: 0.56	TF-IDF: 0.26
k: 5, metric: IP
Embed: 0.55	TF-IDF: 0.20





In production, you receive a question from the user and to answer it, you need to first retrieve the relevant passage(s), pass it to the model, and only then generate the answer.

Evaluate the model performance with passages retrieved using the TF-IDF and embedding vectorizations.
Consider the top-1 and top-5 passages.
This time use only case-insensitive F1.
The result for this cell should be $|\text{vectorizations}| \times |\text{passage counts}| \times |\text{distance metrics}| = 2 \times 2 \times 3 = 12$ numbers.

Answer the following questions:
* Based on the results, what are the advantages and disadvantages of using multiple retrieved passages?
* Describe one approach to detect if none of the retrieved passages is relevant to the user question.

TODO:
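* Retrieving multiple passages increases the chance that the correct context is among them (for the embedding index, Recall@5 is 0.52-0.56 versus 0.21-0.24 for Recall@1). However, most of the additional passages are unrelated and the prompt becomes much longer, so the model can be distracted by irrelevant context, and since the pipeline truncates its input, a long prompt may even be cut off before the question. In our runs the embedding F1 drops from 0.25-0.27 with one passage to 0.04-0.10 with five.
* Apply a threshold to the retrieval distance: if even the closest passage is farther from the question than a threshold tuned on held-out questions, treat the retrieval as failed and either answer without context or abstain.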
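
The thresholding idea from the second answer could look like the sketch below. It is a pure-Python illustration under stated assumptions: `retrieve_or_abstain`, the toy 2-dimensional keys, and the threshold value `MAX_DISTANCE = 0.3` are all hypothetical, and a real threshold would have to be tuned on held-out questions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def retrieve_or_abstain(
    query: Vector, items: List[Tuple[Vector, str]], max_distance: float
) -> Optional[str]:
    """Return the closest passage, or None if even the best match is too far away."""
    best_distance, best_passage = min(
        ((cosine_distance(query, key), passage) for key, passage in items),
        key=lambda pair: pair[0],
    )
    return best_passage if best_distance <= max_distance else None


MAX_DISTANCE = 0.3  # hypothetical value, to be tuned on held-out data
items = [([1.0, 0.0], "passage about ETH"), ([0.0, 1.0], "passage about Einstein")]

print(retrieve_or_abstain([0.9, 0.1], items, MAX_DISTANCE))    # passage about ETH
print(retrieve_or_abstain([-1.0, -1.0], items, MAX_DISTANCE))  # None
```

When the function returns `None`, the caller can fall back to answering without context or tell the user that no relevant passage was found.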



In [11]:
# In the ideal case this does not need to be changed and can just be run.
# Make modifications if you feel they are necessary.

metrics = ["l2", "cos", "ip"]
K = [1, 5]

for k, metric in itertools.product(K, metrics):
    print(f'k: {k}, metric: {metric.upper()}')

    f1_embed, f1_tfidf = 0., 0.
    for line in tqdm.tqdm(dataset["validation"].select(range(500))):
        # TODO: evaluate the retrieval
        # TODO: store RAG model output
        # This requires <30 new lines
        # embed query
        q_embed = model_embd.encode(line['question']).squeeze()
        q_tfidf = np.asarray(vectorizer.transform([line['question']]).todense()).reshape(-1)

        # retrieve closest indices and passages
        embed_retrieved = kb_index_embed.retrieve(q_embed, metric=metric, k=k)
        tfidf_retrieved = kb_index_tfidf.retrieve(q_tfidf, metric=metric, k=k)
        embed_passages = [kb[i] for i in embed_retrieved]
        tfidf_passages = [kb[i] for i in tfidf_retrieved]

        # predict using passages
        embed_pred = rag_qa_pipe(line['question'], passages=embed_passages)
        tfidf_pred = rag_qa_pipe(line['question'], passages=tfidf_passages)

        # calculate f1
        f1_embed += np.max([metric_f1(embed_pred, gt) for gt in line['answers']['text']])
        f1_tfidf += np.max([metric_f1(tfidf_pred, gt) for gt in line['answers']['text']])

    print(f'Embed: {f1_embed/500:.2f}\tTF-IDF: {f1_tfidf/500:.2f}\n=============')

k: 1, metric: L2
Embed: 0.26	TF-IDF: 0.09
k: 1, metric: COS
Embed: 0.27	TF-IDF: 0.18
k: 1, metric: IP
Embed: 0.25	TF-IDF: 0.11
k: 5, metric: L2
Embed: 0.04	TF-IDF: 0.09
k: 5, metric: COS
Embed: 0.06	TF-IDF: 0.03
k: 5, metric: IP
Embed: 0.10	TF-IDF: 0.01





Answer the following questions about similarity metrics:
* Compare and contrast the three metrics, what they might be influenced by, and their advantages and disadvantages.
* Consider the scenario in which the vectors in the knowledge base are normalized so that $\|x\|_2 = 1$. What would the results look like? Hint: look at the formulas under this assumption.
* State how the three distance metrics compare in terms of Recall@k (i.e. which metric is the best and which is the worst).

TODO:

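* In embedding spaces the direction of a vector usually carries the meaning, while its norm is influenced by other factors such as passage length (especially for the unnormalized TF-IDF vectors used here). Cosine distance depends only on the angle and is therefore robust to differences in norm, at the cost of computing two norms per comparison. Euclidean distance and the inner product are simple to compute, but both are sensitive to the norm: L2 penalizes vectors with a large norm, whereas the inner product favors them.
* If all vectors satisfy $\|x\|_2 = 1$, then the cosine and inner product distances are the same:
$$
\operatorname{cosine}(x_1, x_2) = 1 - \frac{x_1\cdot x_2}{\|x_1\|_2\cdot\|x_2\|_2} = 1 - x_1\cdot x_2 = \operatorname{inner}(x_1, x_2)
$$
Also, $\|x_1 - x_2\|_2^2 = 2 - 2\,x_1\cdot x_2$, so the L2 distance is proportional to the square root of the other two distances. Therefore all three metrics would induce the same ranking, and there would not be any difference in their retrieval performance.
* Cosine distance is the best in every setting. L2 is the worst overall, most clearly for TF-IDF, although for the embeddings at k=1 the inner product scores slightly lower (0.21 versus 0.23).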
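
The claim about normalized vectors can be checked numerically. The sketch below draws random vectors, scales them to unit length, and compares the L2 and cosine rankings. It is a self-contained illustration in pure Python; all names (`normalize`, `dot`, `l2`, the random toy data) are hypothetical, and for simplicity the query is normalized as well.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
keys = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(50)]
query = normalize([rng.gauss(0, 1) for _ in range(8)])

# For unit vectors the cosine distance equals the inner-product distance 1 - q.x.
cos_dist = [1 - dot(query, key) for key in keys]
l2_dist = [l2(query, key) for key in keys]

# Identity check: ||q - x||^2 = 2 * (1 - q.x) for unit vectors.
max_gap = max(abs(d ** 2 - 2 * c) for d, c in zip(l2_dist, cos_dist))

# Rank the keys by each distance and compare the resulting orderings.
rank_cos = sorted(range(len(keys)), key=cos_dist.__getitem__)
rank_l2 = sorted(range(len(keys)), key=l2_dist.__getitem__)
print(max_gap < 1e-9, rank_cos == rank_l2)
```

Because the L2 distance is a monotone function of the cosine distance in this setting, the two rankings coincide and so would any Recall@k computed from them.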


Lastly, it is a good practice to analyze failure cases of your solution to better understand the pipeline.
Find the first example of each of the following situations and compute how often it occurs (as a percentage). Use the maximum exact match to determine correctness and L2 distance on the embeddings for retrieval.

- For top-1: The retrieved passage is **correct** but the model is **not correct**.
- For top-1: The retrieved passage is **not correct** but the model is **still correct**.
- For top-5: One of the retrieved passages is the **correct** one but the model is **not correct**.
- For top-1: Without a retrieved passage the model is **correct**, but with the passage it becomes **incorrect**.
- For top-1: Without a retrieved passage the model is **incorrect**, and with the passage it is still **incorrect** but in a different way (different answer).

In [12]:
# compute the 5 phenomena statistics (relative frequency) and find examples

# TODO: <30 lines
stats = [0] * 5
stats_idx = [None] * 5
model_preds = [None] * 5
retrieved_passages = [None] * 5
for idx, line in enumerate(tqdm.tqdm(dataset["validation"].select(range(500)))):
        q_embed = model_embd.encode(line['question']).squeeze()

        retrieved_top1 = kb_index_embed.retrieve(q_embed, metric='l2', k=1)
        retrieved_top5 = kb_index_embed.retrieve(q_embed, metric='l2', k=5)

        passages_top1 = [kb[i] for i in retrieved_top1]
        passages_top5 = [kb[i] for i in retrieved_top5]

        gt_index = kb.index(line['context'])
        retrieved1_corr = gt_index in retrieved_top1
        retrieved5_corr = gt_index in retrieved_top5

        w1_pred = rag_qa_pipe(line['question'], passages=passages_top1)
        w5_pred = rag_qa_pipe(line['question'], passages=passages_top5)
        wo_pred = rag_qa_pipe(line['question'], passages=[])

        w1_pred_corr = 1. in [metric_exact_match(w1_pred, gt) for gt in line['answers']['text']]
        w5_pred_corr = 1. in [metric_exact_match(w5_pred, gt) for gt in line['answers']['text']]
        wo_pred_corr = 1. in [metric_exact_match(wo_pred, gt) for gt in line['answers']['text']]

        stats_check = [
            retrieved1_corr and not w1_pred_corr,
            not retrieved1_corr and w1_pred_corr,
            retrieved5_corr and not w5_pred_corr,
            wo_pred_corr and not w1_pred_corr,
            not wo_pred_corr and not w1_pred_corr and wo_pred != w1_pred
        ]
        for i, flag in enumerate(stats_check):
            if flag:
                stats[i] += 1
                if stats_idx[i] is None:  # remember only the first example of each case
                    stats_idx[i] = idx
                    # case index 2 concerns the top-5 setup, all others the top-1 setup
                    model_preds[i] = (w5_pred if i == 2 else w1_pred, wo_pred)
                    retrieved_passages[i] = passages_top5 if i == 2 else passages_top1


In [13]:
for i, (count, idx) in enumerate(zip(stats, stats_idx)):
    if count == 0:
        print('No example found for this case.')
    else:
        print(f'freq\t: {count/500}')
        print('context:\t', dataset["validation"][idx]['context'])
        print('question:\t', dataset["validation"][idx]['question'])
        print('answer:\t', dataset["validation"][idx]['answers'])
        print('retrieved:\t', retrieved_passages[i])
        print('preds:\t', model_preds[i])
    print()

freq	: 0.056
context:	 Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
question:	 Which NFL team represented the NFC at Super Bowl 50?
answer:	 {'text': ['Carolina Panthers', 'Carolina Panthers', 'Carolina Panthers'], 'answer_start': [249, 249, 249]}
retrieved:	 ['Super B

A client is complaining that the model answers the question _"Who is the current Premier of Victoria?"_ incorrectly.
1. Show your model output for this question with the top-1 retrieved passage using any metric.
2. Show which top-1 context is retrieved by L2 distance on the embeddings.

Hint: for the correct answer, see [en.wikipedia.org/wiki/Premier_of_Victoria](https://en.wikipedia.org/wiki/Premier_of_Victoria).

In [14]:
QUESTION = "Who is the premier of Victoria?"
# TODO: < 20 lines

q_embed = model_embd.encode(QUESTION).squeeze()
retrieved = kb_index_embed.retrieve(q_embed, metric='l2', k=1)
passages = [kb[i] for i in retrieved]
pred = rag_qa_pipe(QUESTION, passages=passages)

print(f'retrieved passage: {passages}')
print(f'model output: {pred}')

retrieved passage: ["The Premier of Victoria is the leader of the political party or coalition with the most seats in the Legislative Assembly. The Premier is the public face of government and, with cabinet, sets the legislative and political agenda. Cabinet consists of representatives elected to either house of parliament. It is responsible for managing areas of government that are not exclusively the Commonwealth's, by the Australian Constitution, such as education, health and law enforcement. The current Premier of Victoria is Daniel Andrews."]
model output: Daniel Andrews


Answer the following questions:
* Provide a reason why your model is giving the incorrect answer. (information tracing)
* Propose a way by which this could be remedied. (information editing)

TODO:
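* The retrieved context is relevant, but the information in our knowledge base is outdated: the Premier changed after the dataset was collected, so the model faithfully reproduces a stale answer from the passage.
* Keep the knowledge base up to date: replace or add the affected passage and re-embed it. Because RAG reads the answer from the retrieved context, no retraining of the model is needed.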
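
One way to picture such an update is the sketch below: a toy store keyed by passage id whose `upsert` method re-embeds the text on every write, so the stored vector never goes stale. Everything here is hypothetical and separate from the notebook's `KnowledgeBase`: the class `EditableKnowledgeBase`, the placeholder `toy_embed` function (a real system would call the sentence encoder), and the placeholder passage texts.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


class EditableKnowledgeBase:
    """Toy store mapping a passage id to (embedding, text)."""

    def __init__(self, embed: Callable[[str], Vector]):
        self.embed = embed
        self.entries: Dict[str, Tuple[Vector, str]] = {}

    def upsert(self, passage_id: str, text: str) -> None:
        # Re-embed on every write so the key always matches the current text.
        self.entries[passage_id] = (self.embed(text), text)

    def get_text(self, passage_id: str) -> str:
        return self.entries[passage_id][1]


def toy_embed(text: str) -> Vector:
    # Placeholder embedding (character and word counts), for illustration only.
    return [float(len(text)), float(len(text.split()))]


kb_store = EditableKnowledgeBase(toy_embed)
kb_store.upsert("premier_of_victoria", "The current Premier of Victoria is <old name>.")
# Later, when the fact changes, overwrite the passage under the same id.
kb_store.upsert("premier_of_victoria", "The current Premier of Victoria is <new name>.")
print(kb_store.get_text("premier_of_victoria"))
```

Overwriting under a stable id, rather than appending, avoids leaving the outdated passage in the index where it could still be retrieved.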

Note on compute: the GPU time of the gold solution is ~15 minutes. If your solution requires much more compute (e.g. hours), then you are likely doing something incorrectly.