## Authors: Guerlain Messin & Ion Panteleiciuc

# Lab: Semantic search on Question Answering datasets

## Objectives:

1. Explore and understand the **S**tanford **Qu**estion **A**nswering **D**ataset [Squad](https://aclanthology.org/D16-1264/) dataset and the associated task.  
2. Adapt this dataset for a *local* semantic search task and propose an appropriate evaluation metric:
    - Implement a simple baseline based on **TF-IDF**.
    - Use a pre-trained transformer-based model, and fine-tune it.
3. Test these approaches on the [CommonSense QA](https://aclanthology.org/N19-1421/) dataset. 
4. Adapt these approaches for a *global* semantic search task on the [WikiQA](https://aclanthology.org/D15-1237/) dataset for open domain question answering.
5. **Bonus** (Optional) Apply a model (any, as long as it's running) to the original Squad QA task.

In [107]:
import numpy as np
import pandas as pd

import spacy
from evaluate import load
from datasets import load_dataset, DatasetDict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sentence_transformers import SentenceTransformer, losses, SentencesDataset, InputExample
from torch.utils.data import DataLoader
import faiss
from rouge_score import rouge_scorer

## 1 - The SQuAD dataset

<div class='alert alert-block alert-warning'>
            Questions:</div>

1. Load the dataset **SQuAD** - for example, using the [```dataset``` package](https://huggingface.co/docs/datasets/index) from Huggingface and loading the dataset ```'squad'```. You can also explore it using the [website](https://rajpurkar.github.io/SQuAD-explorer/). 
2. Look at the metrics used to evaluate models on the dataset. You can also load the metric ```'squad'``` from the [```evaluate``` package](https://huggingface.co/docs/evaluate/index) from Huggingface. 
3. Explain succintly - and in your own words - what is the task: how could we use a model to solve it ? Treat the case of encoder models adapted to *classification tasks* and encoder-decoder models adapted to *text generation*. 

In [2]:
squad_dataset = load_dataset('squad')
metric = load('squad')

train_dataset = squad_dataset['train'].shuffle(seed=42).select([i for i in range(1000)])
valid_dataset = squad_dataset['validation'].shuffle(seed=42).select([i for i in range(1000)])

squad_dataset = DatasetDict({'train': train_dataset, 'validation': valid_dataset})

Found cached dataset squad (/Users/Guerlain/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

Loading cached shuffled indices for dataset at /Users/Guerlain/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-10de9997c4b83f65.arrow
Loading cached shuffled indices for dataset at /Users/Guerlain/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-f3a033b6ac26514f.arrow


In [3]:
df_train = pd.DataFrame(train_dataset)
df_valid = pd.DataFrame(valid_dataset)

In [4]:
df_train.head()

Unnamed: 0,id,title,context,question,answers
0,573173d8497a881900248f0c,Egypt,The Pew Forum on Religion & Public Life ranks ...,What percentage of Egyptians polled support de...,"{'text': ['84%'], 'answer_start': [468]}"
1,57277e815951b619008f8b52,"Ann_Arbor,_Michigan",The Ann Arbor Hands-On Museum is located in a ...,Ann Arbor ranks 1st among what goods sold?,"{'text': ['books'], 'answer_start': [402]}"
2,5727e2483acd2414000deef0,Rule_of_law,One important aspect of the rule-of-law initia...,"In developing countries, who makes most of the...","{'text': ['the executive'], 'answer_start': [6..."
3,5728f5716aef0514001548cc,Samurai,"In December 1547, Francis was in Malacca (Mala...",Who impressed Xavier by taking notes in church?,"{'text': ['Anjiro'], 'answer_start': [160]}"
4,572826002ca10214002d9f16,Group_(mathematics),Groups are also applied in many other mathemat...,What represents elements of the fundamental gr...,"{'text': ['loops'], 'answer_start': [489]}"


In [5]:
df_train.context[40]

'Gaddafi was a very private individual, who described himself as a "simple revolutionary" and "pious Muslim" called upon by Allah to continue Nasser\'s work. Reporter Mirella Bianco found that his friends considered him particularly loyal and generous, and asserted that he adored children. She was told by Gaddafi\'s father that even as a child he had been "always serious, even taciturn", a trait he also exhibited in adulthood. His father said that he was courageous, intelligent, pious, and family oriented.'

In [6]:
df_train.loc[40].question

'Whose efforts did Gaddafi see himself as continuing?'

In [7]:
df_train.loc[40].answers

{'text': ['Nasser'], 'answer_start': [141]}

In [8]:
df_valid.head()

Unnamed: 0,id,title,context,question,answers
0,572759665951b619008f8884,Private_school,Private schooling in the United States has bee...,In what year did Massachusetts first require c...,"{'text': ['1852', '1852', '1852'], 'answer_sta..."
1,57296de03f37b3190047839e,Chloroplast,The chloroplast membranes sometimes protrude o...,When were stromules discovered?,"{'text': ['1962', '1962', '1962'], 'answer_sta..."
2,5726d4a45951b619008f7f6c,Victoria_and_Albert_Museum,Not only the work of British artists and craft...,Which artist who had a major influence on the ...,"{'text': ['Horace Walpole', 'Horace Walpole', ..."
3,572843304b864d1900164848,University_of_Chicago,"In the 1890s, the University of Chicago, fearf...","In 1890, who did the university decide to team...",{'text': ['several regional colleges and unive...
4,56d729180d65d21400198427,Super_Bowl_50,"After a punt from both teams, Carolina got on ...",Who got a touchdown making the score 10-7?,"{'text': ['Jonathan Stewart', 'Jonathan Stewar..."


In [9]:
df_train.answers.apply(lambda x: len(x['text'])).value_counts()

answers
1    1000
Name: count, dtype: int64

In [10]:
df_valid.answers.apply(lambda x: len(x['text'])).value_counts()

answers
3    803
5    120
4     59
2     16
6      2
Name: count, dtype: int64

<div class="alert alert-success">
The task in the Stanford Question Answering Dataset (SQuAD) is to develop models that can accurately answer questions based on a given passage of text. In SQuAD, answers can be any sequence of words within the provided text.</br>

- For <b>encoder models</b> adapted to classification tasks, the goal is to train a model that takes both the question and the passage as input and predicts the span of text that forms the correct answer. The model's encoder processes the input text and produces a representation that is used for classification, identifying the start and end positions of the answer within the passage.

- For <b>encoder-decoder models</b> adapted to text generation, the focus is on generating coherent and contextually relevant answers. The encoder processes the input question and passage, while the decoder generates a sequence of words constituting the answer. This type of model is well-suited for tasks where the answer is not a direct span of text but requires a more expressive generation of language.

In summary, for encoder models adapted to classification, the task is to predict answer spans, while for encoder-decoder models adapted to text generation, the task is to generate answers in a more flexible manner. Both approaches aim to make the model understand and process the context provided by the passage to accurately respond to questions.
</div>

## 2 - Design a *local* semantic search from squad

This taks is a little complicated to implement. Let us simplify squad to be a **semantic search** task !
We will divide the context containing the answer into several pieces, and ask a model to find which one contains the answer **by vectorizing the question and each piece** and trying to look for the most relevant piece using **cosine similarity** between the vectors, making it a fairly simple task.


For example, the following question of the dataset:

```python
'Which NFL team represented the AFC at Super Bowl 50?'
```

with the answer:

```python
'Denver Broncos'
```

We could divide the corresponding ```'context'``` into the following list:

```python
['Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos d',
 "efeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Franc",
 'isco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending',
 ' the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.']
```

and indicate the location of the answer as:

```python
label = [1, 0, 0, 0]
```

<div class='alert alert-block alert-warning'>
            Question:</div>
            
At first, we won't do any training: you should work with the ```validation``` part of the dataset. Be careful, there may be several good answers ! Propose a scheme to divide the context into pieces and to label each piece as containing the answer or not. How do we evaluate for this task - would simple accuracy suffice ?

<div class="alert alert-success">

- <b>1. Divide the context into pieces:</b> We can divide the context into chunks, considering each chunk as a potential candidate for containing the answer. A simple approach would be to use sliding windows to create overlapping or non-overlapping chunks.

- <b>2. Label each piece:</b> Label each chunk based on whether it contains the answer or not. If a chunk contains the answer, label it as 1; otherwise, label it as 0.

- <b>3. Vectorize the question and each piece:</b> Use vector embeddings for the question and each chunk of the context (with BERT for example).

- <b>4. Calculate cosine similarity:</b> Compute the cosine similarity between the vector representation of the question and each chunk. The chunk with the highest cosine similarity is considered the most relevant.

- <b>5. Evaluate the model:</b> For our evaluation, we could have used the dedicated squad metric to evaluate the quality of our model. This metric takes into account the exact matchs and the f1-score of our propositon. But here, we will just use classic metrics (precision, recall, and f1-score).
</div>

<div class='alert alert-block alert-info'>
            Code:</div>
            
For efficient processing, you can use the ```map``` method associated to the dataset. It can create a new feature for each example. In this case, you can create a new feature containing the context divided into pieces, and a new feature containing labels for if the pieces contain the answer. You can also use it for your evaluation function.

In [11]:
nlp = spacy.load("en_core_web_sm")

In [12]:
def cut_and_label(dataset):
    doc = nlp(dataset['context'])
    cut_context = [sent.text for sent in doc.sents]
    labels = [1 if dataset['answers']['text'][0] in sentence else 0 for sentence in cut_context]

    return {'cut_context': cut_context, 'labels': labels}

In [13]:
squad_dataset = squad_dataset.map(cut_and_label)

Loading cached processed dataset at /Users/Guerlain/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-1750a4f86fb84402.arrow


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [14]:
squad_dataset['train'][0]

{'id': '573173d8497a881900248f0c',
 'title': 'Egypt',
 'context': 'The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.',
 'question': 'What percentage of Egyptians polled support death penalty for those leaving Islam?',
 'answers': {'text': ['84%'], 'answer_start': [468]},
 'cut_context': ['The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the 

### 2.1 - Local search: Independant Tf-idf representations

<div class='alert alert-block alert-info'>
            Code:</div>
            
Implement a function that will for each example:
- Create a tf-idf ```vectorizer``` from all the text in the question and context. 
- Create tf-idf representations for the question and the pieces of the context,
- Find the representation the closest to the question among the pieces.

Then, evaluate the method !

In [15]:
def tfidf_predict_label(dataset):
    vectorizer = TfidfVectorizer()
    vectorizer.fit([dataset['context'] + ' ' + dataset['question']])

    tfidf_vectorized_cut_contexts = vectorizer.transform(dataset['cut_context']).toarray()
    tfidf_vectorized_questions = vectorizer.transform([dataset['question']]).toarray()

    similarities = [cosine_similarity(tfidf_vectorized_questions, vectorized_cut_context.reshape(1, -1))[0][0] for vectorized_cut_context in tfidf_vectorized_cut_contexts]

    index_with_most_similarity = np.argmax(similarities)

    predicted_labels = [1 if i == index_with_most_similarity else 0 for i in range(len(similarities))]

    return {'index_with_most_similarity': index_with_most_similarity, 'predicted_labels': predicted_labels}

In [16]:
tfidf_squad_dataset = squad_dataset['validation'].map(tfidf_predict_label)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [17]:
predicted_labels = tfidf_squad_dataset['predicted_labels']
labels = tfidf_squad_dataset['labels']

flat_predicted_labels = [predicted_label for predicted_unit in predicted_labels for predicted_label in predicted_unit]
flat_labels = [label for label_unit in labels for label in label_unit]

score_precision = precision_score(flat_predicted_labels, flat_labels)
score_recall = recall_score(flat_predicted_labels, flat_labels)
score_f1 = f1_score(flat_predicted_labels, flat_labels)

print(f'precision: {100*score_precision:.1f}%')
print(f'recall: {100*score_recall:.1f}%')
print(f'f1 score: {100*score_f1:.1f}%')

precision: 65.7%
recall: 73.8%
f1 score: 69.5%


### 2.2 - Local search: Pre-trained sentence representations transformer-based model

<div class='alert alert-block alert-info'>
            Code:</div>

Reproduce the same process using a pre-trained transformer model. You can use a model that you will find on huggingface. You can also look into the [```SentenceTransformer``` library](https://www.sbert.net/), dedicated to represent documents. Also:
- Try to verify if the model has been trained on SQuAD !
- Fine-tune the model (at least a little) to check that it improves results.

In [18]:
model = SentenceTransformer('nq-distilbert-base-v1')

In [19]:
def pretrained_vectorizer_predict_label(dataset):

    bert_encoded_cut_contexts = model.encode(dataset['cut_context'])
    bert_encoded_questions = model.encode(dataset['question'])

    similarities = [cosine_similarity([bert_encoded_questions], [encoded_cut_context])[0][0] for encoded_cut_context in bert_encoded_cut_contexts]

    index_with_most_similarity = np.argmax(similarities)

    predicted_labels = [1 if i == index_with_most_similarity else 0 for i in range(len(similarities))]

    return {'index_with_most_similarity': index_with_most_similarity, 'predicted_labels': predicted_labels}

In [20]:
pretrained_vectorizer_squad_dataset = squad_dataset['validation'].map(pretrained_vectorizer_predict_label)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [21]:
predicted_labels = tfidf_squad_dataset['predicted_labels']
labels = tfidf_squad_dataset['labels']

flat_predicted_labels = [predicted_label for predicted_unit in predicted_labels for predicted_label in predicted_unit]
flat_labels = [label for label_unit in labels for label in label_unit]

score_precision = precision_score(flat_predicted_labels, flat_labels)
score_recall = recall_score(flat_predicted_labels, flat_labels)
score_f1 = f1_score(flat_predicted_labels, flat_labels)

print(f'precision: {100*score_precision:.1f}%')
print(f'recall: {100*score_recall:.1f}%')
print(f'f1 score: {100*score_f1:.1f}%')

precision: 65.7%
recall: 73.8%
f1 score: 69.5%


We want to finetune this model to see if we can improve the results on the QA labeling context task.

In [22]:
# Triplet approach with the context, the question and the labels
df_train = squad_dataset['train']
cut_context_train = df_train['cut_context']
question_train = df_train['question']
labels_train = df_train['labels']

In [23]:
question_for_train = []
for i in range(len(question_train)):
    for _ in range(len(cut_context_train[i])):
        question_for_train.append(question_train[i])

In [24]:
cut_context_for_train = [cut_context for cut_context_unit in cut_context_train for cut_context in cut_context_unit]
labels_for_train = [label for label_unit in labels_train for label in label_unit]

In [25]:
data_for_training = []

for index in range(len(cut_context_for_train)):
    new_row = {
        'question': question_for_train[index],
        'context': cut_context_for_train[index],
        'label': labels_for_train[index]
    }
    data_for_training.append(new_row)

In [26]:
train_examples = [InputExample(texts=[data['question'], data['context']], label=float(data['label'])) for data in data_for_training]

We have generated a training dataset for the purpose of fine-tuning our Sentence Transformer model. In this process, we will employ the Cosine Similarity loss function. The input pairs consist of two sentences, representing a question and its corresponding context. The model's objective is to predict whether these two sentences are similar (indicating that the answer to the question is present in the context, labeled as 1) or dissimilar (labeled as 0).

In [27]:
train_dataset = SentencesDataset(train_examples, model=model)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

train_loss = losses.CosineSimilarityLoss(model=model)

In [28]:
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3, warmup_steps=100, output_path='fine_tuned_model_squad')

model.save('fine_tuned_model_squad')

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/621 [00:00<?, ?it/s]

Iteration:   0%|          | 0/621 [00:00<?, ?it/s]

Iteration:   0%|          | 0/621 [00:00<?, ?it/s]

In [30]:
finetuned_model = SentenceTransformer('fine_tuned_model_squad')

In [31]:
def optim_vectorizer_predict_label(dataset):

    bert_encoded_cut_contexts = model.encode(dataset['cut_context'])
    bert_encoded_questions = model.encode(dataset['question'])

    similarities = [cosine_similarity([bert_encoded_questions], [encoded_cut_context])[0][0] for encoded_cut_context in bert_encoded_cut_contexts]

    index_with_most_similarity = np.argmax(similarities)

    predicted_labels = [1 if i == index_with_most_similarity else 0 for i in range(len(similarities))]

    return {'index_with_most_similarity': index_with_most_similarity, 'predicted_labels': predicted_labels}

In [32]:
finetuned_vectorizer_squad_dataset = squad_dataset['validation'].map(optim_vectorizer_predict_label)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [33]:
predicted_labels = finetuned_vectorizer_squad_dataset['predicted_labels']
labels = finetuned_vectorizer_squad_dataset['labels']

flat_predicted_labels = [predicted_label for predicted_unit in predicted_labels for predicted_label in predicted_unit]
flat_labels = [label for label_unit in labels for label in label_unit]

finetuned_score_precision = precision_score(flat_predicted_labels, flat_labels)
finetuned_score_recall = recall_score(flat_predicted_labels, flat_labels)
finetuned_score_f1 = f1_score(flat_predicted_labels, flat_labels)

In [35]:
print('------------- Fine tuned model -------------')
print(f'Fine tuned precision: {100*finetuned_score_precision:.1f}%')
print(f'Fine tuned recall: {100*finetuned_score_recall:.1f}%')
print(f'Fine tuned f1 score: {100*finetuned_score_f1:.1f}%')

print('')

print('------------- Base model -------------')
print(f'Basic precision: {100*score_precision:.1f}%')
print(f'Basic recall: {100*score_recall:.1f}%')
print(f'Basic f1 score: {100*score_f1:.1f}%')

------------- Fine tuned model -------------
Fine tuned precision: 74.3%
Fine tuned recall: 83.4%
Fine tuned f1 score: 78.6%

------------- Base model -------------
Basic precision: 65.7%
Basic recall: 73.8%
Basic f1 score: 69.5%


Upon evaluating both models, we observed that fine-tuning on the SQuAD dataset enhanced the model's performance in the context of the question-answering task.

## 3 - Local search on another dataset: does it work ? 

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Let's implement our local semantic search on another dataset, to check if performance follows the same trend. You can use the [```commonsense_qa``` dataset](https://huggingface.co/datasets/commonsense_qa). Do the same exploration and explanation you did for the SQuAD task. How is this dataset different ? 

<div class='alert alert-block alert-info'>
            Code:</div>
            
Look at the data and apply the same two approaches you did before. What do you observe ? Propose an explanation.

In [36]:
cqa_dataset = load_dataset("commonsense_qa")

Downloading builder script:   0%|          | 0.00/3.64k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.04k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.22k [00:00<?, ?B/s]

Downloading and preparing dataset commonsense_qa/default to /Users/Guerlain/.cache/huggingface/datasets/commonsense_qa/default/1.0.0/28d68f56649a7f0c23bc68eae850af914aa03f95f810011ae8cf58cc5ff5051b...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/3.79M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/472k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/423k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/9741 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1221 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1140 [00:00<?, ? examples/s]

Dataset commonsense_qa downloaded and prepared to /Users/Guerlain/.cache/huggingface/datasets/commonsense_qa/default/1.0.0/28d68f56649a7f0c23bc68eae850af914aa03f95f810011ae8cf58cc5ff5051b. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [37]:
print(squad_dataset['validation'].features.keys())
print(cqa_dataset['validation'].features.keys())

dict_keys(['id', 'title', 'context', 'question', 'answers', 'cut_context', 'labels'])
dict_keys(['id', 'question', 'question_concept', 'choices', 'answerKey'])


The dataset is designed to address questions by selecting from various answer options. It comprises questions with multiple potential answers and includes a conceptual component to provide additional context cues to the model. Unlike the SQuAD dataset, where the model extracts information from a given text to answer the question, this task relies on the model's commonsense knowledge. Consequently, the dataset offers only a single word of context to assist the model, as opposed to an entire text containing the answer.

### TF-IDF

In [41]:
def tfidf_predict_label(dataset):
    vectorizer = TfidfVectorizer()
    vectorizer.fit([dataset['question'] + ' ' + dataset['question_concept'] + ' ' + ' '.join(dataset['choices']['text'])])

    tfidf_vectorized_questions = vectorizer.transform([dataset['question']]).toarray()
    tfidf_vectorized_choices = vectorizer.transform(dataset['choices']['text']).toarray()

    similarities = [cosine_similarity(tfidf_vectorized_questions, tfidf_vectorized_choice.reshape(1, -1))[0][0] for tfidf_vectorized_choice in tfidf_vectorized_choices]

    index_with_most_similarity = np.argmax(similarities)

    predicted_answer = dataset['choices']['label'][index_with_most_similarity]

    return {'index_with_most_similarity': index_with_most_similarity, 'predicted_answer': predicted_answer}

In [42]:
tfidf_cqa_dataset = cqa_dataset['validation'].map(tfidf_predict_label)

Map:   0%|          | 0/1221 [00:00<?, ? examples/s]

In [46]:
predictions = tfidf_cqa_dataset['predicted_answer']
labels = tfidf_cqa_dataset['answerKey']

score_accuracy = accuracy_score(predictions, labels)

print(f'Accuracy: {100*score_accuracy:.1f}%')

Accuracy: 20.1%


### Pre-trained vectorizer

In [49]:
model = SentenceTransformer('nq-distilbert-base-v1')

def pretrained_vectorizer_predict_label(dataset):

    bert_encoded_choices = model.encode(dataset['choices']['text'])
    bert_encoded_questions = model.encode(dataset['question'] + ' ' + dataset['question_concept'])

    similarities = [cosine_similarity([bert_encoded_questions], [bert_encoded_choice])[0][0] for bert_encoded_choice in bert_encoded_choices]

    index_with_most_similarity = np.argmax(similarities)

    predicted_answer = dataset['choices']['label'][index_with_most_similarity]

    return {'index_with_most_similarity': index_with_most_similarity, 'predicted_answer': predicted_answer}


In [50]:
pretrained_vectorizer_cqa_dataset = cqa_dataset['validation'].map(pretrained_vectorizer_predict_label)

Map:   0%|          | 0/1221 [00:00<?, ? examples/s]

In [51]:
predictions = pretrained_vectorizer_cqa_dataset['predicted_answer']
labels = pretrained_vectorizer_cqa_dataset['answerKey']

score_accuracy = accuracy_score(predictions, labels)

print(f'Accuracy: {100*score_accuracy:.1f}%')

Accuracy: 36.2%


In this dataset that demands common sense knowledge, it is observed that the Transformer model outperforms the alternative. This outcome aligns with expectations, given that the Transformer model is pre-trained, and as a consequence, the embeddings it produces for phrases and words encompass significant semantic and contextual information.

### Pretrained model - finetuning

In [52]:
# Approach with the question, the question concept, the choices and the answer
df_train = cqa_dataset['train']
question_train = df_train['question']
concept_train = df_train['question_concept']
choices_train = df_train['choices']
answers_train = df_train['answerKey']

In [53]:
question_for_train = []
concept_for_train = []
for i in range(len(question_train)):
    for _ in range(len(choices_train[i]['label'])):
        question_for_train.append(question_train[i])
        concept_for_train.append(concept_train[i])

In [54]:
choices_dict = {key: [choice[key] for choice in choices_train] for key in choices_train[0]}
texts_for_train = [text for texts_unit in choices_dict['text'] for text in texts_unit]

In [56]:
binary_labels = []
for i in range(len(answers_train)):
    binary_labels.append([1 if label == answers_train[i] else 0 for label in choices_dict['label'][i]])

labels_for_train = [label for labels_unit in binary_labels for label in labels_unit]

In [57]:
data_for_training = []

for i in range(len(question_train)):
    new_row = {
        'question_concept': question_train[i] + ' ' + concept_train[i],
        'choices': choices_train[i],
        'labels': labels_for_train[i]
    }
    data_for_training.append(new_row)

In [58]:
train_examples = [InputExample(texts=[data['question_concept'], data['choices']], label=float(data['labels'])) for data in data_for_training]

In [59]:
train_dataset = SentencesDataset(train_examples, model=model)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

train_loss = losses.CosineSimilarityLoss(model=model)

In [61]:
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3, warmup_steps=100, output_path='fine_tuned_model_cqa')

model.save('fine_tuned_model_cqa')

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1218 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1218 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1218 [00:00<?, ?it/s]

In [62]:
finetuned_model = SentenceTransformer('fine_tuned_model_cqa')

In [63]:
def optim_vectorizer_predict_label(dataset):

    bert_encoded_choices = finetuned_model.encode(dataset['choices']['text'])
    bert_encoded_questions = finetuned_model.encode(dataset['question'] + ' ' + dataset['question_concept'])

    similarities = [cosine_similarity([bert_encoded_questions], [bert_encoded_choice])[0][0] for bert_encoded_choice in bert_encoded_choices]

    index_with_most_similarity = np.argmax(similarities)

    predicted_answer = dataset['choices']['label'][index_with_most_similarity]

    return {'index_with_most_similarity': index_with_most_similarity, 'predicted_answer': predicted_answer}

In [70]:
finetuned_vectorizer_cqa_dataset = cqa_dataset['validation'].map(optim_vectorizer_predict_label)

Map:   0%|          | 0/1221 [00:00<?, ? examples/s]

In [71]:
predicted_answers = finetuned_vectorizer_cqa_dataset['predicted_answer']
labels = finetuned_vectorizer_cqa_dataset['answerKey']

finetuned_score_accuracy = accuracy_score(predicted_answers, labels)

In [72]:
print(f'Accuracy: {100*finetuned_score_accuracy:.1f}%')

Accuracy: 31.7%


We should increase the number of epochs to reach a better fine-tuned model. We chose 3 epochs because our computation lasts for 30 minutes with only 3 epochs so we are ressourced-llimited to make bigger improvements on our task.

## 4 - Global search on Wikipedia data

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Again, look at the data of the [```wiki_qa``` dataset](https://huggingface.co/datasets/wiki_qa), understand the task. We are now going to perform a **global** search, as the dataset is open domain: when trying to answer for a question, we will search among all vectors, rather than only the ones representing the context the answer is found in. How would you verify that the model managed to find the right answer ? Let's try to use to very different ways to evaluate how well the approaches work:
- Looking if the right result is in the top-$k$ predictions returned by the model.
- Using the [ROUGE](https://aclanthology.org/W04-1013/) score. 
Explain how you understand these metrics and how they could be useful here.

<div class='alert alert-block alert-info'>
            Code:</div>
            
We will use the same embeddings as before, but we will use a tool called ```faiss``` for indexing all of them and facilitate the search ! Look at the [documentation](https://huggingface.co/docs/datasets/faiss_es). Then, implement or use tools implementing the two metrics, and evaluate both approaches.

In [79]:
wiki_dataset = load_dataset('wiki_qa')
wiki_test = wiki_dataset['test']

Downloading builder script:   0%|          | 0.00/3.79k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/13.6k [00:00<?, ?B/s]

Downloading and preparing dataset wiki_qa/default to /Users/Guerlain/.cache/huggingface/datasets/wiki_qa/default/0.1.0/d2d236b5cbdc6fbdab45d168b4d678a002e06ddea3525733a24558150585951c...


Downloading data:   0%|          | 0.00/7.09M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/6165 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2733 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/20360 [00:00<?, ? examples/s]

Dataset wiki_qa downloaded and prepared to /Users/Guerlain/.cache/huggingface/datasets/wiki_qa/default/0.1.0/d2d236b5cbdc6fbdab45d168b4d678a002e06ddea3525733a24558150585951c. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [80]:
model = SentenceTransformer('nq-distilbert-base-v1')

In [81]:
answers = [elem['answer'] for elem in wiki_dataset['train']]

bert_encoded_answers = np.array(model.encode(answers)).astype('float32')

In [82]:
index = faiss.IndexFlatL2(bert_encoded_answers.shape[1])
index.add(bert_encoded_answers)

In [83]:
def k_global_search(question, k=0):

    bert_encoded_question = np.array(model.encode(question)).astype('float32').reshape(1,-1)

    _, bound = index.search(bert_encoded_question, k)
    answers_top_k = [answers[i] for i in bound[0]]

    return answers_top_k

In [87]:
def evaluate_base(question, answer, k=0):

    answers_top_k = k_global_search(question, k)
    
    return answer in answers_top_k

In [86]:
def evaluate_rouge(prediction, ground_truth):

    scoring_module = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge_scores = scoring_module.score(ground_truth, prediction)

    return rouge_scores

In [104]:
wiki_val = wiki_dataset['validation']  

correct_count, total_count, k = 0, 0, 3

In [105]:
for row in wiki_test:

    question = row['question']
    ground_truth = row['answer']

    if row['label'] == 1:

        if evaluate_base(question, ground_truth, k):
            correct_count += 1
            
        total_count += 1

accuracy = correct_count / total_count

In [106]:
print(f'Top-{k} accuracy: {100*accuracy:.1f}%')

Top-3 accuracy: 9.2%


In [100]:
for element in wiki_val.select(range(12,16)):

    question = element['question']
    ground_truth = element['answer'] 
    prediction = k_global_search(question, k=5)[0] if element['label'] == 1 else 'Not in Top K result'
    
    rouge_scores = evaluate_rouge(prediction, ground_truth) if element['label'] == 1 else 'N/A'
    
    print(f'Question: {question}')
    print(f'Prediction: {prediction}')
    print(f'Ground truth: {ground_truth}')
    print(f'ROUGE scores: {rouge_scores}\n')

Question: how big is bmc software in houston, tx
Prediction: It is based in Houston, Texas .
Ground truth: Employing over 6,000, BMC is often credited with pioneering the BSM concept as a way to help better align IT operations with business needs.
ROUGE scores: {'rouge1': Score(precision=0.3333333333333333, recall=0.08, fmeasure=0.12903225806451613), 'rougeL': Score(precision=0.16666666666666666, recall=0.04, fmeasure=0.06451612903225806)}

Question: how big is bmc software in houston, tx
Prediction: It is based in Houston, Texas .
Ground truth: For 2011, the company recorded an annual revenue of $2.1 billion, making it the #20 largest software company in terms of revenue for that year.
ROUGE scores: {'rouge1': Score(precision=0.3333333333333333, recall=0.07692307692307693, fmeasure=0.125), 'rougeL': Score(precision=0.3333333333333333, recall=0.07692307692307693, fmeasure=0.125)}

Question: how long was i love lucy on the air
Prediction: Not in Top K result
Ground truth: I Love Lucy is

Leveraging ROUGE scores has enabled us to gauge the degree of alignment between the content predicted in the response and the correct answer. Upon scrutinizing our examples, it becomes apparent that the ROUGE scores exhibit a degree of deficiency. This implies a constrained similarity between the predicted responses and the correct answers, both concerning individual words (ROUGE-1) and the sequence/order of words (ROUGE-L).

These scores function constitute a valuable metric for assessing our model's proficiency in encapsulating the fundamental content of correct answers. Diminished scores may potentially indicate challenges, such as the model misconstruing the question or falling short in retrieving pertinent information.