# Lab: Semantic search on Question Answering datasets

Groups:

**Valentina HU**

**Nicolas SAINT**

**Selma ZARGA**


## Objectives:

1. Explore and understand the **S**tanford **Qu**estion **A**nswering **D**ataset ([SQuAD](https://aclanthology.org/D16-1264/)) and the associated task.
2. Adapt this dataset for a *local* semantic search task and propose an appropriate evaluation metric:
    - Implement a simple baseline based on **TF-IDF**.
    - Use a pre-trained transformer-based model, and fine-tune it.
3. Test these approaches on the [CommonSense QA](https://aclanthology.org/N19-1421/) dataset. 
4. Adapt these approaches for a *global* semantic search task on the [WikiQA](https://aclanthology.org/D15-1237/) dataset for open domain question answering.
5. **Bonus** (Optional) Apply a model (any, as long as it's running) to the original Squad QA task.

## Modalities:

The goal of this lab is to make you search for and learn to use recent tools for NLP tasks. 
- You should feel free to use any tool, implementation and model you prefer. 
- You are not expected to reach a particular performance. 
- You can work on this lab in groups of up to 3. 
- You should submit this lab on the Moodle by Friday the 22nd.

## 1 - The SQuAD dataset

<div class='alert alert-block alert-warning'>
            Questions:</div>

1. Load the dataset **SQuAD** - for example, using the [```datasets``` package](https://huggingface.co/docs/datasets/index) from Huggingface and loading the dataset ```'squad'```. You can also explore it using the [website](https://rajpurkar.github.io/SQuAD-explorer/). 
2. Look at the metrics used to evaluate models on the dataset. You can also load the metric ```'squad'``` from the [```evaluate``` package](https://huggingface.co/docs/evaluate/index) from Huggingface. 
3. Explain succinctly - and in your own words - what the task is: how could we use a model to solve it? Treat the case of encoder models adapted to *classification tasks* and encoder-decoder models adapted to *text generation*. 

<div class='alert alert-block alert-warning' style='color:green;'>
    Answer:
</div>


In [13]:
from datasets import load_dataset

dataset = load_dataset("squad")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [14]:
from datasets import load_metric

metrics = load_metric("squad")
print(metrics)

Metric(name: "squad", features: {'predictions': {'id': Value(dtype='string', id=None), 'prediction_text': Value(dtype='string', id=None)}, 'references': {'id': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}}, usage: """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positions for the answer, as a list of ints
   

**Remark :**

Models on the SQuAD dataset are evaluated with two metrics. The **Exact Match (EM) score** is the percentage of predicted answers that exactly match one of the ground-truth answers. The **F1 score** measures the word-level overlap between the predicted and true answers, which gives partial credit to predictions that are close but not identical.
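To make the two metrics concrete, here is a minimal sketch of how EM and token-level F1 can be computed for one prediction. It follows the usual SQuAD v1.1 normalization (lowercasing, removing punctuation and English articles); the function names are ours and this is an illustration, not the code behind the `'squad'` metric loaded above.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """With several reference answers, keep the best score."""
    return max(metric(prediction, ref) for ref in references)


references = ["Denver Broncos", "Denver Broncos", "Denver Broncos"]
print(best_over_references(exact_match, "the Denver Broncos", references))
print(best_over_references(token_f1, "Broncos", references))
```

The article "the" is removed by the normalization, so the first prediction is an exact match, while the second one only gets partial credit through F1.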

### Models

The SQuAD task is a question-answering (QA) task where the model is given a context paragraph and a question and needs to identify the answer span within the context paragraph. The dataset consists of passages from a variety of sources, and each passage is associated with multiple questions.

- **Encoder Models for Classification Tasks**: In this setting, the model takes a context paragraph and a question as input and outputs the start and end indices of the answer span within the context. This can be formulated as a classification task, where the model is trained to predict whether each token in the context is the start or end of the answer span.

- **Encoder-Decoder Models for Text Generation**: Alternatively, an encoder-decoder model could be used, where the context and question are encoded, and the model generates the answer span as a sequence of words. This approach involves training the model in a sequence-to-sequence fashion.

The task involves understanding the context, comprehending the question, and precisely identifying the answer within the context. The success of the model is measured by its ability to produce accurate and exact answers. Thus, we can use the same metrics as mentioned before (EM and F1 scores).
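For the encoder case, the model produces one start score and one end score per token, and the answer is the span that maximizes their sum. The sketch below shows this decoding step on made-up logits; the helper name `best_span`, the `max_answer_len` parameter and the toy values are assumptions for illustration only.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e], with s <= e."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy example: hypothetical per-token scores for five context tokens
tokens = ["the", "denver", "broncos", "defeated", "carolina"]
start_logits = [0.1, 3.2, 0.5, -1.0, 0.3]
end_logits = [-0.5, 0.4, 2.9, 0.0, 1.0]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

An encoder-decoder model skips this step: it generates the answer text directly, so its output is compared to the references as a string.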



## 2 - Design a *local* semantic search from squad

This task is a little complicated to implement. Let us simplify squad to be a **semantic search** task !
We will divide the context containing the answer into several pieces, and ask a model to find which one contains the answer **by vectorizing the question and each piece** and trying to look for the most relevant piece using **cosine similarity** between the vectors, making it a fairly simple task.


For example, the following question of the dataset:

```python
'Which NFL team represented the AFC at Super Bowl 50?'
```

with the answer:

```python
'Denver Broncos'
```

We could divide the corresponding ```'context'``` into the following list:

```python
['Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos d',
 "efeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Franc",
 'isco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending',
 ' the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.']
```

and indicate the location of the answer as:

```python
label = [1, 0, 0, 0]
```

<div class='alert alert-block alert-warning'>
            Question:</div>
            
At first, we won't do any training: you should work with the ```validation``` part of the dataset. Be careful, there may be several good answers ! Propose a scheme to divide the context into pieces and to label each piece as containing the answer or not. How do we evaluate for this task - would simple accuracy suffice ?

1) **Scheme to Divide Context into Pieces**

To design a local semantic search from the SQuAD dataset, we can follow these steps:

- **Divide Context into Pieces**: Break the context down into smaller, meaningful pieces. Each piece should be a coherent segment, such as a sentence.

- **Vectorize Question and Context Pieces**: Use a vectorizer or a pre-trained transformer-based model (e.g., BERT, RoBERTa) to obtain an embedding for the question and for each context piece.

- **Cosine Similarity**: Compute the cosine similarity between the question vector and each piece vector. Cosine similarity is the cosine of the angle between two vectors and is commonly used to measure similarity.

- **Select Most Relevant Piece**: Take the context piece with the highest cosine similarity as the predicted answer-containing piece.

- **Labeling**: Create a label vector over the context pieces. In our implementation below, the label marks the piece selected by the search (the one with the highest cosine similarity); the evaluation then checks whether that piece contains a reference answer.
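Since there may be several good answers, an alternative is to build ground-truth labels from the `answer_start` offsets of all reference answers, independently of any model. The sketch below is one possible way to do it with fixed-size character windows; `split_fixed`, `label_pieces` and the toy context are hypothetical and are not the scheme used in the rest of this notebook.

```python
def split_fixed(context, size):
    """Cut the context into consecutive character windows of length `size`.

    Returns a list of (start_offset, text) pairs.
    """
    return [(start, context[start:start + size])
            for start in range(0, len(context), size)]


def label_pieces(pieces, answers):
    """Label 1 every piece that overlaps at least one reference answer span."""
    labels = [0] * len(pieces)
    for text, ans_start in zip(answers["text"], answers["answer_start"]):
        ans_end = ans_start + len(text)
        for i, (piece_start, piece) in enumerate(pieces):
            piece_end = piece_start + len(piece)
            # Two half-open intervals overlap if each starts before the other ends
            if ans_start < piece_end and piece_start < ans_end:
                labels[i] = 1
    return labels


# Toy example in the SQuAD 'answers' format
context = "Super Bowl 50 was played in 2016. The Denver Broncos won the game."
answers = {"text": ["Denver Broncos"],
           "answer_start": [context.index("Denver Broncos")]}

print(label_pieces(split_fixed(context, 34), answers))  # answer inside one piece
print(label_pieces(split_fixed(context, 40), answers))  # answer cut across two pieces
```

With the second window size the answer straddles a boundary, so two pieces are positive. This is one reason why a single-label accuracy may not suffice for this task.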


<div class='alert alert-block alert-info'>
            Code:</div>
            
For efficient processing, you can use the ```map``` method associated with the dataset. It can create a new feature for each example. In this case, you can create one new feature containing the context divided into pieces, and another containing labels indicating whether each piece contains the answer. You can also use it for your evaluation function.

In [15]:
import nltk
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Valen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
def local_semantic(example):
    # Split the context into sentences
    sentences = nltk.sent_tokenize(example['context'])
    vectorizer = CountVectorizer()

    all_text = sentences + [example['question']]
    all_bow = vectorizer.fit_transform(all_text)

    bow_sentences = all_bow[:-1].toarray()
    bow_question = all_bow[-1].toarray()

    # Cosine similarity between the question and each sentence
    q = bow_question.flatten()
    sims = []
    for bow_sent in bow_sentences:
        denom = np.linalg.norm(bow_sent) * np.linalg.norm(q)
        # Guard against empty vectors (no token kept by the vectorizer)
        sims.append(float(bow_sent @ q / denom) if denom > 0 else 0.0)
    
    # Determine the index of the sentence with the highest similarity
    max_index = np.argmax(sims) if sims else -1

    # Create a label vector and assign the label to the sentence with the highest similarity
    labels = [1 if i == max_index else 0 for i in range(len(sentences))]

    return {'sentences': sentences, 'labels': labels}


# Add the 'sentences' and 'labels' features to every validation example
processed_dataset = dataset['validation'].map(local_semantic, batched=False)

In [17]:
#print the first 10 examples
for i in range(10):
    print(processed_dataset[i])

{'id': '56be4db0acb8001400a502ec', 'title': 'Super_Bowl_50', 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': 'Which NFL team represented the AFC at Super Bowl 50?', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'ans

Evaluating this model is complicated because the standard metrics used for SQuAD (Exact Match and F1 score) compare predicted answer spans (start and end positions) with the actual spans. Our local semantic search, however, returns the sentence most similar to the question, not a specific answer span.

To evaluate the model, we can define a simple accuracy metric that counts how often the true answer appears in the chosen sentence. This metric has limits because it assumes the answer is contained in a single sentence, which may not always be true.


In [18]:
def simple_accuracy_evaluation(dataset, num_samples=100):
    correct = 0
    for i in range(num_samples):

        example = dataset[i]
        answer = example['answers']['text'][0]  # get the first real answer
        
        # Get the predicted sentence by our local semantic search
        predicted_sentences = example['sentences']
        labels = example['labels']
        predicted_sentence = predicted_sentences[labels.index(1)] if 1 in labels else ""

        # Check if the answer is contained in the predicted sentence
        if answer in predicted_sentence:
            correct += 1

    # Compute the accuracy
    accuracy = correct / num_samples
    return accuracy

accuracy = simple_accuracy_evaluation(processed_dataset)
print("Accuracy:", accuracy)

Accuracy: 0.56


In [19]:
def simple_f1_evaluation(dataset, num_samples=100):
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    for i in range(num_samples):
        example = dataset[i]
        answer = example['answers']['text'][0]  # Get the first real answer

        # Get the predicted sentence by our local semantic search
        predicted_sentences = example['sentences']
        labels = example['labels']
        predicted_sentence = predicted_sentences[labels.index(1)] if 1 in labels else ""

        # Check if the answer is contained in the predicted sentence
        if answer in predicted_sentence:
            true_positives += 1
        else:
            false_negatives += 1

        # Check for false positives
        if 1 in labels and not answer in predicted_sentence:
            false_positives += 1

    # Compute precision, recall, and F1 score
    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    return f1_score

# Calculate F1 score
f1_score = simple_f1_evaluation(processed_dataset)
print("F1 Score:", f1_score)


F1 Score: 0.56


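The accuracy of our simple local semantic search is 0.56. The F1 score is identical by construction: exactly one sentence is predicted per question, so every miss counts both as a false positive and as a false negative, and precision, recall and accuracy coincide. With BOW vectors, the search selects a sentence containing the answer for only about half of the questions. The Bag of Words model is a simplistic representation that disregards grammar and word order, so there is every reason to consider a richer embedding.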
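Two limits of the evaluation above are that it only looks at the first reference answer and only at the top-ranked sentence. A possible refinement, sketched below under our own assumptions, accepts any reference answer and also reports the mean reciprocal rank (MRR) of the first sentence containing one. The function names and the toy data are hypothetical.

```python
def contains_any(sentence, reference_texts):
    """True if at least one reference answer appears verbatim in the sentence."""
    return any(ref in sentence for ref in reference_texts)


def evaluate_ranking(examples):
    """Compute accuracy@1 and MRR.

    examples: non-empty list of (ranked_sentences, reference_texts) pairs,
    where ranked_sentences is sorted from most to least similar to the question.
    """
    hits, reciprocal_ranks = 0, 0.0
    for ranked_sentences, reference_texts in examples:
        for rank, sentence in enumerate(ranked_sentences, start=1):
            if contains_any(sentence, reference_texts):
                if rank == 1:
                    hits += 1
                reciprocal_ranks += 1.0 / rank
                break
    total = len(examples)
    return {"accuracy@1": hits / total, "mrr": reciprocal_ranks / total}


# Toy ranked outputs for two questions
examples = [
    (["The Denver Broncos won.", "It was played in 2016."],
     ["Denver Broncos", "Broncos"]),
    (["It was played in 2016.", "The venue was Levi's Stadium in Santa Clara."],
     ["Santa Clara", "Levi's Stadium"]),
]
scores = evaluate_ranking(examples)
print(scores)
```

MRR rewards a method that ranks the right sentence second rather than last, which plain accuracy cannot distinguish.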

### 2.1 - Local search: Independent Tf-idf representations

<div class='alert alert-block alert-info'>
            Code:</div>
            
Implement a function that will for each example:
- Create a tf-idf ```vectorizer``` from all the text in the question and context. 
- Create tf-idf representations for the question and the pieces of the context,
- Find the representation closest to the question among the pieces.

Then, evaluate the method !

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_local_search(example):
    # Split the context into sentences
    sentences = nltk.sent_tokenize(example['context'])
    question = example['question']

    # Combine the question and sentences for vectorization
    texts = [question] + sentences

    # Create a tf-idf vectorizer and fit it on the texts
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)

    # tf-idf representation for the question and sentences
    question_vec = tfidf_matrix[0]
    sentences_vecs = tfidf_matrix[1:]

    # Compute cosine similarities between question and each sentence
    similarities = cosine_similarity(question_vec, sentences_vecs)

    # Find the index of the sentence with the highest similarity
    most_similar_idx = similarities.argsort()[0][-1]

    # Return the most similar sentence
    return sentences[most_similar_idx]

In [21]:
# Evaluate the model
def evaluate_tfidf_accuracy(dataset, num_samples=100):
    correct = 0

    for i in range(num_samples):
        example = dataset[i]
        answer = example['answers']['text'][0]  

        # Use the tf-idf model to predict the sentence
        predicted_sentence = tfidf_local_search(example)

        # Check if the predicted sentence contains the answer
        if answer in predicted_sentence:
            correct += 1

    accuracy = correct / num_samples
    return accuracy

dataset = load_dataset("squad", split='validation') 
accuracy = evaluate_tfidf_accuracy(dataset)
print("TF-IDF Model Accuracy:", accuracy)

TF-IDF Model Accuracy: 0.63


Keeping the same evaluation method as before, we now try the TF-IDF representation. This representation is richer than BOW: it weights each word by its frequency in a sentence and down-weights words that appear in many sentences.
As expected, the accuracy is better (0.63 against 0.56). TF-IDF therefore captures the relation between the words of the question and those of the context better than raw counts.
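To see why the weighting helps, the sketch below recomputes TF-IDF by hand on three toy sentences. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is the default documented for scikit-learn's `TfidfVectorizer`; the whitespace tokenization and the toy sentences are simplifying assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, idf, vectors) with L2-normalized TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, idf, vectors


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(a * b for a, b in zip(u, v))


docs = [
    "which team won the game",             # question
    "the game was played in the stadium",  # sentence 0
    "the broncos team won",                # sentence 1
]
vocab, idf, vectors = tfidf_vectors(docs)
question_vec, sentence_vecs = vectors[0], vectors[1:]
sims = [cosine(question_vec, v) for v in sentence_vecs]
print(idf["the"], idf["team"], idf["broncos"])
print(sims)
```

The word "the" appears in every text and gets the lowest weight, so the repeated "the" in sentence 0 contributes little, and sentence 1 ends up closest to the question.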

### 2.2 - Local search: Pre-trained sentence representations transformer-based model

<div class='alert alert-block alert-info'>
            Code:</div>

Reproduce the same process using a pre-trained transformer model. You can use a model that you will find on huggingface. You can also look into the [```SentenceTransformer``` library](https://www.sbert.net/), dedicated to represent documents. Also:
- Try to verify if the model has been trained on SQuAD !
- Fine-tune the model (at least a little) to check that it improves results.

In [22]:
import json
import torch
from pathlib import Path
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast, BertForQuestionAnswering, AdamW
from datasets import load_metric

We chose DistilBERT because the Hugging Face website shows several applications of this model to the SQuAD dataset.

In [23]:
from transformers import DistilBertTokenizerFast, DistilBertModel
import numpy as np
from scipy.spatial.distance import cosine

# Load the tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

def vectorize(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    return torch.mean(outputs.last_hidden_state, dim=1)[0]

def transformer_local_search(question, sentences):
    # Vectorize the question
    question_vec = vectorize(question, tokenizer, model).detach().numpy()

    # Vectorize sentences and compute cosine similarities
    sims = []
    for sent in sentences:
        sent_vec = vectorize(sent, tokenizer, model).detach().numpy()
        sim = 1 - cosine(sent_vec, question_vec)
        sims.append(sim)

    # Find the most similar sentence
    return sentences[np.argmax(sims)]

# Example usage
example = dataset[0]
print(example)

print("Question:", example['question'])
print("Answer:", example['answers']['text'][0])
closest_sentence = transformer_local_search(example['question'], nltk.sent_tokenize(example['context']))
print("Closest sentence:", closest_sentence)

{'id': '56be4db0acb8001400a502ec', 'title': 'Super_Bowl_50', 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': 'Which NFL team represented the AFC at Super Bowl 50?', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'ans

#### Verify if the model has been trained on SQuAD !

In [24]:
def squad_question(size):
    context = dataset['context'][:size]
    question = dataset['question'][:size]
    answer = [ans['text'][0] for ans in dataset['answers'][:size]]

    return {'context': context, 'question': question, 'answer': answer}

def run_inference():
    size = 300
    squad_like_data = squad_question(size)
    accuracy = 0

    for i in range(size):
        context_sentences = nltk.sent_tokenize(squad_like_data['context'][i])
        question = squad_like_data['question'][i]
        answer = squad_like_data['answer'][i]
        bert_answer = transformer_local_search(question, context_sentences)

        if answer in bert_answer:
            accuracy += 1

    return accuracy / size

In [25]:
acc = run_inference()
print(f"Accuracy : {acc}")

Accuracy : 0.64


Information about the data used to train a model is usually given on its model card or in the training logs. In some cases, however, the list of training datasets is not publicly available.

Failing that, we can use the method above: we send a sample of questions to the pre-trained model and measure how often the retrieved sentence contains the ground-truth answer. This remains indirect evidence, since a score alone cannot prove whether the dataset was or was not used for training.

On our sample of 300 questions, the accuracy is 0.64, barely above the TF-IDF baseline (0.63 on 100 samples). We would expect a clearly higher score from a model trained on SQuAD, so we suppose that this one was not.

In the end, without official information about the training data, we cannot know for certain whether our pre-trained model was trained on SQuAD.

#### Fine Tuned BERT model
##### Preprocessing

To help us with the preprocessing, we used an example from the Hugging Face website.

In [26]:
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch
from transformers import AdamW
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering

# DistillBert
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

dataset = load_dataset('squad')

def process_data(batch):
    answer_starts = []
    answer_ends = []
    batch_size = len(batch['question'])

    for i in range(batch_size):
        answer_start = batch['answers'][i]['answer_start'][0]
        answer_text = batch['answers'][i]['text'][0]
        answer_end = answer_start + len(answer_text)
        answer_starts.append(answer_start)
        answer_ends.append(answer_end)

    # input_ids
    tokenized_inputs = tokenizer(batch['question'], batch['context'], truncation=True, padding='max_length', max_length=512, return_tensors='pt')
    tokenized_inputs.update({'start_positions': answer_starts, 'end_positions': answer_ends})

    attention_masks = tokenized_inputs['attention_mask']
    batch['attention_masks'] = torch.tensor(attention_masks)

    return tokenized_inputs

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
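The `answer_start` field in SQuAD is a character offset in the context, whereas the `start_positions` and `end_positions` expected by a question-answering head are token indices. One common way to convert between the two relies on the offset mapping of a fast tokenizer (`return_offsets_mapping=True`) and on the `sequence_ids` of the encoding. The sketch below works on plain Python lists so that it stays self-contained; the helper name and the hand-written tokenization are assumptions, and a real tokenizer may split the text differently.

```python
def char_span_to_token_span(offsets, sequence_ids, answer_start, answer_end):
    """Map a character span of the context to (start_token, end_token).

    offsets: per-token (char_start, char_end) pairs.
    sequence_ids: per-token segment id (None for special tokens,
        0 for the question, 1 for the context).
    Returns (0, 0) when the answer is not fully inside the tokenized
    context, for example after truncation.
    """
    context_tokens = [i for i, sid in enumerate(sequence_ids) if sid == 1]
    if not context_tokens:
        return 0, 0
    first, last = context_tokens[0], context_tokens[-1]
    if offsets[first][0] > answer_start or offsets[last][1] < answer_end:
        return 0, 0

    # Last context token that starts at or before the answer start
    start_token = first
    while start_token <= last and offsets[start_token][0] <= answer_start:
        start_token += 1
    start_token -= 1

    # First context token that ends at or after the answer end
    end_token = last
    while end_token >= first and offsets[end_token][1] >= answer_end:
        end_token -= 1
    end_token += 1
    return start_token, end_token


# Hand-written encoding of: [CLS] who won ? [SEP] the denver broncos won . [SEP]
context = "The Denver Broncos won."
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 7), (7, 8), (0, 0),
           (0, 3), (4, 10), (11, 18), (19, 22), (22, 23), (0, 0)]

answer_start = context.index("Denver Broncos")
answer_end = answer_start + len("Denver Broncos")
span = char_span_to_token_span(offsets, sequence_ids, answer_start, answer_end)
print(answer_start, answer_end, span)
```

In this toy case the character span (4, 18) corresponds to the token span (6, 7), which shows that the two kinds of positions are not interchangeable.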


In [27]:
train_dataset = dataset['train'].map(process_data, batched=True)
val_dataset = dataset['validation'].map(process_data, batched=True)
train_dataset.set_format(type='torch', columns=['id', 'input_ids', 'attention_mask', 'start_positions', 'end_positions'])
val_dataset.set_format(type='torch', columns=['id', 'input_ids', 'attention_mask', 'start_positions', 'end_positions'])

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

model.to(device)
model.train()

cuda


DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
      

We only ran the fine-tuning for 1 epoch because of the long training time (between 1h30 and 2h).

In [28]:
optimizer = AdamW(model.parameters(), lr=5e-5)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

for epoch in range(1):
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]

        if torch.isnan(loss):
            continue

        # Backward
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        # Display training information
        print(f"Epoch: {epoch}, Loss: {loss.item()}")

model.eval()


##### Save the fine tuned model

In [29]:
torch.save(model, 'fine_tuned_bert_model_full.pth')
torch.save(model.state_dict(), 'fine_tuned_bert_model_state_dict.pth')

##### Use fine tuned model

We saved our fine tuned model : [Download Link](https://drive.google.com/file/d/1-bBx7A_xkj9RgWen_3KF_Nrkb27dwUv4/view?usp=sharing)

In [30]:
fine_tuned_model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
fine_tuned_model.load_state_dict(torch.load('fine_tuned_bert_model_state_dict.pth'))
fine_tuned_model.eval()

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



##### Evaluate fine tuned model

In [31]:
from transformers import pipeline

example = dataset["validation"][0]
qa_pipeline = pipeline(
    "question-answering",
    model=fine_tuned_model,
    tokenizer=tokenizer
)

output = qa_pipeline({
    "question": example["question"],
    "context": example["context"]
})
answer_text = example["answers"]["text"][0]
print("output answer matches expected answer: ", output["answer"] == answer_text)
output

output answer matches expected answer:  False


{'score': 0.0001058857305906713,
 'start': 116,
 'end': 163,
 'answer': '2015 season. The American Football Conference ('}

In [33]:
def fine_tuned_eval(dataset, size=500):
    # Build the pipeline once instead of once per example
    qa_pipeline = pipeline(
        "question-answering",
        model=fine_tuned_model,
        tokenizer=tokenizer
    )
    correct = 0
    for i in range(size):
        sample = dataset[i]
        output = qa_pipeline({
            "question": sample["question"],
            "context": sample["context"]
        })
        answer_text = sample["answers"]["text"][0]

        # Count a hit when the predicted span is a substring of the first reference answer
        if output["answer"] in answer_text:
            correct += 1

    return correct / size

print('Fine tuned accuracy :', fine_tuned_eval(dataset['validation']))

Fine tuned accuracy : 0.006


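After one epoch of fine-tuning, the training loss is still around 5, which is consistent with the extremely low accuracy. A second epoch only brought the loss down to about 4, so more epochs would be needed. Note also that `process_data` passes the character offsets from `answer_start` as `start_positions` and `end_positions`, whereas `DistilBertForQuestionAnswering` expects token indices; this mismatch likely prevents the model from learning the correct spans.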

## 3 - Local search on another dataset: does it work ? 

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Let's implement our local semantic search on another dataset, to check if performance follows the same trend. You can use the [```commonsense_qa``` dataset](https://huggingface.co/datasets/commonsense_qa). Do the same exploration and explanation you did for the SQuAD task. How is this dataset different ? 

<div class='alert alert-block alert-info'>
            Code:</div>
            
Look at the data and apply the same two approaches you did before. What do you observe ? Propose an explanation.

In [34]:
from datasets import load_dataset
# Load dataset
dataset_commonsense = load_dataset("commonsense_qa")
print(dataset_commonsense)

print(dataset_commonsense['train'][2])

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
        num_rows: 9741
    })
    validation: Dataset({
        features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
        num_rows: 1221
    })
    test: Dataset({
        features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
        num_rows: 1140
    })
})
{'id': '4c1cb0e95b99f72d55c068ba0255c54d', 'question': 'To locate a choker not located in a jewelry box or boutique where would you go?', 'question_concept': 'choker', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['jewelry store', 'neck', 'jewlery box', 'jewelry box', 'boutique']}, 'answerKey': 'A'}


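Looking at the dataset, each example contains a question, five candidate answers (`choices`, labelled A to E) and the label of the correct one (`answerKey`). Unlike SQuAD, there is no context paragraph: the task is multiple-choice, so the candidate answers are already separated and the correct answer is not a span of a given text.

Even before applying the same approaches as before, examining an example suggests that our local search will struggle: the correct answer generally shares no words with the question, so picking the choice most similar to the question has no reason to favour it.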
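To make this structure concrete, here is a minimal sketch that resolves `answerKey` to the text of the correct choice. The dictionary is copied from the training sample printed above; `answer_text` is a hypothetical helper name, not part of the `datasets` API.

```python
# Example copied from the training sample printed above
example = {
    'question': 'To locate a choker not located in a jewelry box or boutique where would you go?',
    'choices': {
        'label': ['A', 'B', 'C', 'D', 'E'],
        'text': ['jewelry store', 'neck', 'jewlery box', 'jewelry box', 'boutique'],
    },
    'answerKey': 'A',
}

def answer_text(example):
    """Return the text of the choice whose label equals answerKey (hypothetical helper)."""
    labels = example['choices']['label']
    return example['choices']['text'][labels.index(example['answerKey'])]

print(answer_text(example))
```

The correct answer here is a new phrase rather than a span of the question, which is what separates this task from extractive QA.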


#### Local search: TF-IDF representation

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from datasets import load_dataset

def tfidf_local_search(example):
  
    choices = example['choices']['text']
    question = example['question']

    # Combine the question and choices for vectorization
    texts = [question] + choices

    # Create a tf-idf vectorizer and fit it on the texts
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)

    # tf-idf representation for the question and choices
    question_vec = tfidf_matrix[0]
    choices_vecs = tfidf_matrix[1:]

    # Compute cosine similarities between question and each choice
    similarities = cosine_similarity(question_vec, choices_vecs)
    most_similar_idx = similarities.argsort()[0][-1]

    return example['choices']['label'][most_similar_idx]

def evaluate_tfidf_accuracy(dataset):
    correct = 0

    num_samples = len(dataset['validation'])
    for i in range(num_samples):
        example = dataset['validation'][i]
        answer = example['answerKey']
        predicted_choice_label = tfidf_local_search(example)
        if answer == predicted_choice_label:
            correct += 1

    accuracy = correct / num_samples
    return accuracy


dataset_commonsense = load_dataset("commonsense_qa")
accuracy = evaluate_tfidf_accuracy(dataset_commonsense)
print("TF-IDF Model Accuracy on Commonsense Dataset:", accuracy)



TF-IDF Model Accuracy on Commonsense Dataset: 0.17526617526617527


The accuracy is about 0.175, which is below the 0.2 expected from picking one of the five choices at random. We obtain this result because, in this dataset, the correct answer usually shares no words with the question, whereas a wrong answer often does.

For instance, consider the question 'To locate a choker not located in a jewelry box or boutique where would you go?' with possible answers ['jewelry store', 'neck', 'jewlery box', 'jewelry box', 'boutique'], where the correct answer is 'jewelry store'. Our model selects the answer with the highest similarity to the question, so in this case it is misled and chooses 'jewelry box', which the question explicitly excludes.

To answer correctly, a model has to handle negation and grasp the actual meaning of the question rather than rely on word overlap.
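The failure on the choker example can be reproduced with a deliberately simplified stand-in for TF-IDF cosine similarity: counting the words each choice shares with the question, with no weighting at all. This is only a sketch to illustrate the lexical-overlap bias; the helper names `tokens` and `overlap_scores` are ours.

```python
import re

question = 'To locate a choker not located in a jewelry box or boutique where would you go?'
labels = ['A', 'B', 'C', 'D', 'E']
choices = ['jewelry store', 'neck', 'jewlery box', 'jewelry box', 'boutique']

def tokens(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r'\w+', text.lower()))

def overlap_scores(question, choices):
    """Count, for each choice, how many of its words also appear in the question."""
    question_words = tokens(question)
    return [len(question_words & tokens(choice)) for choice in choices]

scores = overlap_scores(question, choices)
# Pick the choice with the largest overlap (first one in case of ties)
predicted = labels[scores.index(max(scores))]
print(scores, predicted)
```

The highest score goes to 'jewelry box' (label D), the option the question rules out, while the correct answer A only shares the word 'jewelry'.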



#### Local search with a pre-trained model


In [36]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from scipy.spatial.distance import cosine
from datasets import load_dataset

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Load the commonsense_qa dataset
dataset_commonsense = load_dataset("commonsense_qa")

def vectorize(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    return torch.mean(outputs.last_hidden_state, dim=1)[0]

def transformer_local_search(question, choices):
    # Vectorize the question
    question_vec = vectorize(question, tokenizer, model).detach().numpy()

    # Vectorize choices and compute cosine similarities
    sims = []
    for choice in choices['text']:
        choice_vec = vectorize(choice, tokenizer, model).detach().numpy()
        sim = 1 - cosine(choice_vec, question_vec)
        sims.append(sim)

    # Find the most similar choice
    return choices['label'][np.argmax(sims)]


def evaluate_transformer_local_search_accuracy(dataset):
    correct = 0

    num_samples = len(dataset['validation'])
    for i in range(num_samples):
        example = dataset['validation'][i]
        answer = example['answerKey']
        predicted_choice_label = transformer_local_search(example['question'], example['choices'])
        if answer == predicted_choice_label:
            correct += 1

    accuracy = correct / num_samples
    return accuracy

# Example usage
accuracy = evaluate_transformer_local_search_accuracy(dataset_commonsense)
print("Accuracy:", accuracy)

Accuracy: 0.2858312858312858


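As we guessed, even with a pre-trained model, accuracy remains low: about 0.286, which is better than TF-IDF (0.175) but only slightly above the 0.2 chance level for five choices.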
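One detail of the `vectorize` helper is worth noting: it averages every token vector, which is fine when texts are embedded one at a time (no padding is added). If questions and choices were batched to speed up evaluation, padded positions would have to be excluded from the average. A minimal NumPy sketch of such masked mean pooling, with a hypothetical function name and toy values:

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real tokens only.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Avoid division by zero for sequences that are entirely padding
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: the second sequence has one padded position with large values
hidden = np.array([[[1.0, 1.0], [3.0, 3.0]],
                   [[2.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1], [1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With the mask applied, the padded vector in the second sequence does not influence its embedding.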

## 4 - Global search on Wikipedia data

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Again, look at the data of the [```wiki_qa``` dataset](https://huggingface.co/datasets/wiki_qa), understand the task. We are now going to perform a **global** search, as the dataset is open domain: when trying to answer a question, we will search among all vectors, rather than only the ones representing the context the answer is found in. How would you verify that the model managed to find the right answer? Let's try to use two very different ways to evaluate how well the approaches work:
- Looking if the right result is in the top-$k$ predictions returned by the model.
- Using the [ROUGE](https://aclanthology.org/W04-1013/) score. 
Explain how you understand these metrics and how they could be useful here.

<div class='alert alert-block alert-info'>
            Code:</div>
            
We will use the same embeddings as before, but we will use a tool called ```faiss``` for indexing all of them and facilitate the search ! Look at the [documentation](https://huggingface.co/docs/datasets/faiss_es). Then, implement or use tools implementing the two metrics, and evaluate both approaches.

In [37]:
wiki_dataset = load_dataset("wiki_qa", split='validation')
print(wiki_dataset)
print(wiki_dataset[0])

Dataset({
    features: ['question_id', 'question', 'document_title', 'answer', 'label'],
    num_rows: 2733
})
{'question_id': 'Q8', 'question': 'How are epithelial tissues joined together?', 'document_title': 'Tissue (biology)', 'answer': 'Cross section of sclerenchyma fibers in plant ground tissue', 'label': 0}


FAISS is a library for dense retrieval: it indexes a set of vectors and retrieves the nearest neighbours of a query vector. It will allow us to search through all the answers to find the ones closest to the question. We first need a vector representation of each element.

Then, we pass our query and the number of results we want to the `.search()` method, which returns the distances and indices of the k vectors closest to the query.
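Conceptually, an exhaustive ("flat") L2 index compares the query with every stored vector and keeps the k closest. The NumPy sketch below illustrates that idea with squared Euclidean distances on toy vectors; it is not FAISS's implementation, and `flat_l2_search` is a hypothetical name.

```python
import numpy as np

def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search with squared L2 distance."""
    # Squared Euclidean distance from the query to every stored vector
    distances = ((vectors - query) ** 2).sum(axis=1)
    # Indices of the k smallest distances, closest first
    order = np.argsort(distances)[:k]
    return distances[order], order

# Toy index of three 2-dimensional vectors
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype='float32')
query = np.array([0.9, 0.1], dtype='float32')
distances, indices = flat_l2_search(vectors, query, k=2)
print(indices, distances)
```

The cost grows linearly with the number of stored vectors, which is acceptable for the few thousand answers of the validation split.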

In [38]:
import faiss
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset

Create the vectorize method to embed the elements

In [39]:
def vectorize(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    return torch.mean(outputs.last_hidden_state, dim=1)[0]

Now we build our FAISS index from the answer embeddings.

In [40]:

def build_faiss_index(dataset, model, tokenizer):
    # Generate one embedding per candidate answer, reusing vectorize()
    answer_embeddings = []

    # Inference only: disable gradient tracking to save memory
    with torch.no_grad():
        for answer in dataset['answer']:
            answer_embedding = vectorize(answer, tokenizer, model).detach().numpy()
            answer_embeddings.append(answer_embedding)

    # Convert embeddings to a float32 NumPy array for Faiss
    answer_embeddings = np.array(answer_embeddings, dtype='float32')

    # Build an exact (flat) L2 index and add all answer vectors
    index = faiss.IndexFlatL2(answer_embeddings.shape[1])
    index.add(answer_embeddings)
    return index



In [41]:
def global_search_top_k(k, dataset, model, tokenizer, faiss_index, question):
    # Embed the question
    new_question_embedding = vectorize(question, tokenizer, model).detach().numpy()
    # Search for the k closest vectors
    distances, indices = faiss_index.search(new_question_embedding.reshape(1, -1), k)
    # Retrieve the top-K answers using indices
    top_k_answers = [dataset['answer'][i] for i in indices[0]]

    return top_k_answers

In [44]:
def find_correct_answer(question, dataset):
    # Collect every answer sentence labelled as correct (label == 1) for this question
    correct_answers = [
        row['answer'] for row in dataset
        if row['question'] == question and row['label'] == 1
    ]
    # Return False when the question has no correct answer in the dataset
    return correct_answers if correct_answers else False

def check_answer(top_k_answers, question, dataset=None):
    # Default to the WikiQA split loaded above
    if dataset is None:
        dataset = wiki_dataset
    correct_answers = find_correct_answer(question, dataset)
    if correct_answers is not False:
        # Success if any retrieved answer is one of the correct answers
        return any(answer in correct_answers for answer in top_k_answers)
    return False

Evaluation of the model using the top-k predictions

In [45]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
wiki_dataset = load_dataset("wiki_qa", split='validation')

# Build Faiss index
faiss_index = build_faiss_index(wiki_dataset, model, tokenizer)

# Perform global search
question = wiki_dataset[0]['question']
top_k_answers = global_search_top_k(5, wiki_dataset, model, tokenizer, faiss_index, question)
print(top_k_answers)

res = check_answer(top_k_answers, question)
print(res)

['Organs are then formed by the functional grouping together of multiple tissues.', 'Changes to the oxygen-absorbing tissues', 'Cross section of sclerenchyma fibers in plant ground tissue', 'Alveoli are particular to mammalian lungs.', 'The alveolar membrane is the gas-exchange surface.']
False


### Using the ROUGE score

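ROUGE compares a candidate text with a reference by counting what they share: ROUGE-1 uses single words, ROUGE-2 uses pairs of consecutive words, and ROUGE-L uses the longest common subsequence. Each variant reports recall (`r`), precision (`p`) and F-score (`f`), all between 0 and 1, with higher scores indicating greater overlap between the retrieved answers and the reference.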
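To make the definition concrete, here is a hand-written sketch of ROUGE-1 with plain whitespace tokenisation. It is meant to show the arithmetic only; the `rouge` package may tokenise and count differently, so its numbers need not match exactly.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap ROUGE-1 (recall, precision, F1) with whitespace tokenisation."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f1}

print(rouge_1('the cat sat on the mat', 'the cat lay on the mat'))
```

Five of the six words are shared here, so recall, precision and F1 are all 5/6. For retrieval, a partial score like this rewards an answer that is close to the reference even when it is not the exact labelled sentence.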


In [60]:
from rouge import Rouge

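def find_answer_text(wiki_dataset, question):
    # Return the first answer labelled as correct (label == 1) for this question
    for i in range(len(wiki_dataset)):
        if wiki_dataset[i]['question'] == question and wiki_dataset[i]['label'] == 1:
            return wiki_dataset[i]['answer']
    # Only give up after every row has been checked
    return 'Not found'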

def compute_rouge_scores(retrieved_answers, reference_answers):
    # Initialize the Rouge scoring object
    rouge = Rouge()

    if isinstance(retrieved_answers, str):
        retrieved_answers = [retrieved_answers]
    if isinstance(reference_answers, str):
        reference_answers = [reference_answers]

    if not retrieved_answers or not reference_answers:
        print("One or both of the answer sets are empty.")
        return None

    # Prepare the answers for Rouge evaluation
    retrieved_concatenated = ' '.join(retrieved_answers)
    reference_concatenated = ' '.join(reference_answers)

    try:
        # Compute ROUGE scores
        scores = rouge.get_scores(retrieved_concatenated, reference_concatenated, avg=True)
        return scores
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return None

reference_answers = find_answer_text(wiki_dataset, question)

print("Reference Answer:", reference_answers)
print("Retrieved Answers:", top_k_answers)

# Compute ROUGE scores
rouge_scores = compute_rouge_scores(top_k_answers, reference_answers)
print("ROUGE Scores:", rouge_scores)

Reference Answer: Not found
Retrieved Answers: ['Organs are then formed by the functional grouping together of multiple tissues.', 'Changes to the oxygen-absorbing tissues', 'Cross section of sclerenchyma fibers in plant ground tissue', 'Alveoli are particular to mammalian lungs.', 'The alveolar membrane is the gas-exchange surface.']
ROUGE Scores: {'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-l': {'r': 0.0, 'p': 0.0, 'f': 0.0}}


## 5 - BONUS: Run a model on the original Squad task

<div class='alert alert-block alert-info'>
            Code:</div>

Of course, we need to know that you understood the code: simplify it to the maximum (only what's necessary to obtain predictions) and comment abundantly ! 

Many models could be used for this bonus question. From a few searches online, BERT-based models appear to be among the most accurate for the SQuAD task. Choosing a model, however, depends on the balance we want between accuracy and computational resources. Since there are many variants of BERT that make this trade-off differently, we compare two of them.

From what we saw, **RoBERTa** is among the strongest models for this task. As we saw in class, it is a BERT model that was trained for longer and on more data. It is thus more accurate, but the checkpoint we use is larger than DistilBERT and requires more computational resources.

We then compare it with **DistilBERT**, a distilled version of BERT that retains most of BERT’s performance with fewer parameters and faster processing.

In [61]:
from transformers import AutoModelForQuestionAnswering

#### Generic functions to load the model, tokenizer and get the predictions for Q&A

In [62]:
# Function to load a model and its tokenizer
def load_model_and_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    return tokenizer, model

# Function to get predictions from the model
def get_prediction(model, tokenizer, context, question):
    inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    with torch.no_grad():
        outputs = model(**inputs)
        answer_start_scores = outputs.start_logits
        answer_end_scores = outputs.end_logits

        # Finds the tokens with the highest 'start' and 'end' scores
        answer_start = torch.argmax(answer_start_scores)
        answer_end = torch.argmax(answer_end_scores) + 1

        # Convert tokens to the answer string
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

        return answer
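Taking the two argmaxes independently can produce an end position that comes before the start, in which case the decoded answer is empty. A common remedy is to search for the best pair with `start <= end` and a bounded length. The NumPy sketch below shows the idea on toy logits; `best_span` and `max_answer_len` are hypothetical names, not part of `transformers`.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximising start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), -np.inf
    for start in range(len(start_logits)):
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Toy logits where the independent argmaxes would give start=3 > end=1
start_logits = np.array([0.1, 2.0, 0.2, 3.0])
end_logits = np.array([0.0, 4.0, 1.5, 1.0])
print(best_span(start_logits, end_logits))
```

In `get_prediction`, such a function could be applied to `outputs.start_logits[0]` and `outputs.end_logits[0]` after converting them to NumPy, with `end + 1` used as the slice bound.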

#### Function to compute the accuracy of the models

In [63]:
def compute_accuracy(dataset, model, tokenizer, num_samples=100):
    correct = 0

    for item in dataset.select(range(num_samples)):
        context = item['context']
        question = item['question']
        # SQuAD has multiple answers, here we use the first answer as an example
        actual_answer = item['answers']['text'][0] if item['answers']['text'] else ''

        predicted_answer = get_prediction(model, tokenizer, context, question)

        # Increments correct count if the prediction matches the actual answer
        if predicted_answer.lower().strip() == actual_answer.lower().strip():
            correct += 1

    accuracy = correct / num_samples
    return accuracy
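The comparison above is strict: only lowercasing and stripping are applied, and only the first reference answer is used. The SQuAD exact-match metric is usually computed after also removing punctuation and English articles, and against every reference answer. A minimal sketch of that idea, with our own helper names rather than the official evaluation script:

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, references):
    """Return True if the prediction matches any reference after normalisation."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)

print(exact_match('denver broncos', ['Denver Broncos', 'The Denver Broncos']))
```

In `compute_accuracy`, passing the full `item['answers']['text']` list as `references` would count a prediction as correct when it matches any of the annotated answers.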

In [64]:
# Reload the dataset for safety measures
squad_dataset = load_dataset("squad")

# Choosing the question we want to answer with its associated context
question_to_answer = 0
context = squad_dataset['validation'][question_to_answer]['context']
question = squad_dataset['validation'][question_to_answer]['question']

**Remark :** We will use only a sample of the dataset to compute the accuracies, in order to limit computation time.

### DistilBERT

In [65]:
# Loading the model and tokenizer for DistilBERT
distilbert_tokenizer, distilbert_model = load_model_and_tokenizer("distilbert-base-uncased-distilled-squad")

In [66]:
# Testing DistilBERT on SQuAD
answer = get_prediction(distilbert_model, distilbert_tokenizer, context, question)
print("Question : ", question)
print("\nDistilBERT answer :", answer)

Question :  Which NFL team represented the AFC at Super Bowl 50?

DistilBERT answer : denver broncos


In [67]:
# Accuracy of DistilBERT
distilbert_accuracy = compute_accuracy(squad_dataset['validation'], distilbert_model, distilbert_tokenizer, 500)
print(f"DistilBERT Accuracy: {distilbert_accuracy * 100:.2f}%")

DistilBERT Accuracy: 64.20%


### RoBERTa

In [68]:
# Loading the model and tokenizer for RoBERTa
roberta_tokenizer, roberta_model = load_model_and_tokenizer("deepset/roberta-base-squad2")

In [69]:
# Testing RoBERTa on SQuAD
answer = get_prediction(roberta_model, roberta_tokenizer, context, question)
print("Question : ", question)
print("\nRoBERTa answer :", answer)

Question :  Which NFL team represented the AFC at Super Bowl 50?

RoBERTa answer :  Denver Broncos


In [70]:
# Accuracy of RoBERTa
roberta_accuracy = compute_accuracy(squad_dataset['validation'], roberta_model, roberta_tokenizer, 500)
print(f"RoBERTa Accuracy: {roberta_accuracy * 100:.2f}%")

RoBERTa Accuracy: 75.40%


**Conclusion :** As expected, RoBERTa is more accurate than DistilBERT on the 500 samples we took (75.40% vs. 64.20% exact match against the first reference answer). However, it also takes much longer to process the samples, almost twice as long. The balance between accuracy and computational resources is thus an important parameter to take into account in NLP. SOTA models are not always suited to every task, as they are often larger and slower to run.