# Lab: Semantic search on Question Answering datasets

## Objectives:

1. Explore and understand the **S**tanford **Qu**estion **A**nswering **D**ataset ([SQuAD](https://aclanthology.org/D16-1264/)) and the associated task.
2. Adapt this dataset for a *local* semantic search task and propose an appropriate evaluation metric:
    - Implement a simple baseline based on **TF-IDF**.
    - Use a pre-trained transformer-based model, and fine-tune it.
3. Test these approaches on the [CommonSense QA](https://aclanthology.org/N19-1421/) dataset. 
4. Adapt these approaches for a *global* semantic search task on the [WikiQA](https://aclanthology.org/D15-1237/) dataset for open domain question answering.
5. **Bonus** (optional): Apply a model (any, as long as it runs) to the original SQuAD QA task.

## Modalities:

The goal of this lab is to make you search for and learn to use recent tools for NLP tasks. 
- You should feel free to use any tool, implementation and model you prefer. 
- You are not expected to reach a particular performance. 
- You can work on this lab in groups of up to 3.
- You should submit this lab on Moodle by Friday the 22nd.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

from datasets import load_dataset
# new comment


## 1 - The SQuAD dataset

<div class='alert alert-block alert-warning'>
            Questions:</div>

1. Load the **SQuAD** dataset, for example by using the [```datasets``` package](https://huggingface.co/docs/datasets/index) from Huggingface and loading the dataset ```'squad'```. You can also explore it using the [website](https://rajpurkar.github.io/SQuAD-explorer/).
2. Look at the metrics used to evaluate models on the dataset. You can also load the metric ```'squad'``` from the [```evaluate``` package](https://huggingface.co/docs/evaluate/index) from Huggingface.
3. Explain succinctly, and in your own words, what the task is: how could we use a model to solve it? Treat the case of encoder models adapted to *classification tasks* and encoder-decoder models adapted to *text generation*.
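
As a rough guide to what the ```'squad'``` metric reports, the sketch below re-implements exact match and token-level F1 in plain Python. The function names are ours, and the normalization (lowercasing, removing punctuation and English articles) is our reading of the official evaluation script; rely on the ```evaluate``` version for any number you report.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if both strings are identical after normalization, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """A question can have several gold answers: keep the best score."""
    return max(metric(prediction, ref) for ref in references)
```

For instance, ```best_over_references(token_f1, prediction, example['answers']['text'])``` scores one prediction against all the gold answers of an example.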

In [5]:
squad_dataset = load_dataset('squad')
df_train = squad_dataset['train']
df_valid = squad_dataset['validation']

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [4]:
# Convert a single split: the DatasetDict holds splits of different lengths,
# so passing it whole to pd.DataFrame raises a ValueError.
squad_train_df = load_dataset('squad', split='train').to_pandas()
squad_train_df.head()

## 2 - Design a *local* semantic search from SQuAD

This task is a little complicated to implement. Let us simplify SQuAD into a **semantic search** task!
We will divide the context containing the answer into several pieces, **vectorize the question and each piece**, and ask a model to find the piece that contains the answer by picking the one whose vector has the highest **cosine similarity** with the question vector. This makes for a fairly simple task.


For example, take the following question from the dataset:

```python
'Which NFL team represented the AFC at Super Bowl 50?'
```

with the answer:

```python
'Denver Broncos'
```

We could divide the corresponding ```'context'``` into the following list:

```python
['Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos d',
 "efeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Franc",
 'isco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending',
 ' the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.']
```

and indicate the location of the answer as:

```python
label = [1, 0, 0, 0]
```

<div class='alert alert-block alert-warning'>
            Question:</div>
            
At first, we won't do any training: you should work with the ```validation``` part of the dataset. Be careful, there may be several good answers! Propose a scheme to divide the context into pieces and to label each piece as containing the answer or not. How should we evaluate this task: would simple accuracy suffice?

<div class='alert alert-block alert-info'>
            Code:</div>
            
For efficient processing, you can use the ```map``` method of the dataset. It can create a new feature for each example. In this case, you can create a new feature containing the context divided into pieces, and a new feature containing labels that indicate whether each piece contains the answer. You can also use it for your evaluation function.
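
One possible scheme, given here only as a starting point, is to cut the context into chunks of equal character length and to mark every chunk that overlaps the character span of a gold answer. The function name ```split_and_label``` and the parameter ```n_pieces``` are our own; the ```answers``` layout (parallel ```text``` and ```answer_start``` lists) is the one shown by the dataset.

```python
def split_and_label(example, n_pieces=4):
    """Split the context into at most n_pieces chunks and label each one.

    A piece gets label 1 if it overlaps the character span of any gold answer,
    so an answer cut by a boundary marks both neighbouring pieces.
    """
    context = example["context"]
    size = max(1, -(-len(context) // n_pieces))  # ceiling division
    bounds = [(i, min(i + size, len(context))) for i in range(0, len(context), size)]
    pieces = [context[start:end] for start, end in bounds]

    labels = [0] * len(pieces)
    answers = example["answers"]
    for text, ans_start in zip(answers["text"], answers["answer_start"]):
        ans_end = ans_start + len(text)
        for i, (start, end) in enumerate(bounds):
            # Two half-open intervals overlap when each starts before the other ends.
            if ans_start < end and ans_end > start:
                labels[i] = 1
    return {"pieces": pieces, "labels": labels}
```

With the notebook's ```df_valid```, a call such as ```df_valid.map(split_and_label)``` should add both features to every example. Splitting on sentence boundaries instead of character counts is an equally reasonable choice.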

### 2.1 - Local search: Independent TF-IDF representations

<div class='alert alert-block alert-info'>
            Code:</div>
            
Implement a function that will, for each example:
- Create a TF-IDF ```vectorizer``` from all the text in the question and context.
- Create TF-IDF representations for the question and the pieces of the context.
- Find the piece whose representation is closest to the question.

Then, evaluate the method!
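
To make the arithmetic visible, here is a hand-rolled sketch of these steps using only the standard library, rather than a library vectorizer. All function names are ours, the tokenizer is a deliberately naive regular expression, and the idf uses the smoothed formula that scikit-learn applies by default; in practice a ready-made ```TfidfVectorizer``` is the more convenient route.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase alphanumeric runs."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(texts):
    """Fit TF-IDF on `texts` and return one sparse vector (dict) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_rank(question, pieces):
    """Return piece indices sorted from most to least similar to the question."""
    vectors = tfidf_vectors([question] + pieces)  # fitted per example
    q_vec, piece_vecs = vectors[0], vectors[1:]
    scores = [cosine(q_vec, p) for p in piece_vecs]
    return sorted(range(len(pieces)), key=lambda i: scores[i], reverse=True)


def hit_at_1(ranking, labels):
    """1.0 if the top-ranked piece is labelled as containing an answer."""
    return float(labels[ranking[0]] == 1)
```

Averaging ```hit_at_1``` over the validation examples gives one simple score; it counts a prediction as correct when it lands on any of the positive pieces.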

### 2.2 - Local search: Sentence representations from a pre-trained transformer-based model

<div class='alert alert-block alert-info'>
            Code:</div>

Reproduce the same process using a pre-trained transformer model. You can use a model that you will find on Huggingface. You can also look into the [```SentenceTransformer``` library](https://www.sbert.net/), dedicated to computing sentence and document representations. Also:
- Try to verify whether the model has been trained on SQuAD!
- Fine-tune the model (at least a little) to check that it improves results.
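
The ranking step itself does not depend on where the vectors come from. The sketch below takes the encoder as a parameter, so the same function works with a toy encoder (used here so the block runs without downloading anything) or with a real model. ```rank_with_encoder``` and ```toy_encode``` are hypothetical names of ours, and the checkpoint named in the comment is only one possible choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_with_encoder(question, pieces, encode):
    """Rank pieces by cosine similarity to the question.

    `encode` is any callable mapping a list of strings to a sequence of
    vectors, for instance the `encode` method of a SentenceTransformer model.
    """
    vectors = encode([question] + pieces)
    q_vec, piece_vecs = vectors[0], vectors[1:]
    scores = [cosine(q_vec, p) for p in piece_vecs]
    return sorted(range(len(pieces)), key=lambda i: scores[i], reverse=True)


def toy_encode(texts):
    """Stand-in encoder (hypothetical): counts of the five vowels."""
    return [[text.lower().count(ch) for ch in "aeiou"] for text in texts]


# With the real library (assumptions: sentence-transformers is installed and
# the checkpoint below is an arbitrary example, not a recommendation):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   ranking = rank_with_encoder(question, pieces, model.encode)
```

The model card of whichever checkpoint you pick is the first place to check whether SQuAD was part of its training data.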

## 3 - Local search on another dataset: does it work?

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Let's implement our local semantic search on another dataset, to check if performance follows the same trend. You can use the [```commonsense_qa``` dataset](https://huggingface.co/datasets/commonsense_qa). Do the same exploration and explanation you did for the SQuAD task. How is this dataset different?

<div class='alert alert-block alert-info'>
            Code:</div>
            
Look at the data and apply the same two approaches you used before. What do you observe? Propose an explanation.
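
To reuse the earlier ranking and evaluation code unchanged, one option is to treat each answer choice as a candidate piece. The sketch below assumes the field layout shown on the dataset card (a ```choices``` dictionary with parallel ```label``` and ```text``` lists, and an ```answerKey``` letter); the function name ```choices_to_pieces``` is ours, and the sample in the usage note is made up rather than taken from the dataset.

```python
def choices_to_pieces(example):
    """Treat each answer choice as a candidate piece and label the correct one."""
    choices = example["choices"]
    labels = [int(label == example["answerKey"]) for label in choices["label"]]
    return {"pieces": list(choices["text"]), "labels": labels}
```

For a made-up example with choices ```['oven', 'refrigerator', 'bookshelf', 'garden', 'attic']``` and ```answerKey``` equal to ```'B'```, this returns the labels ```[0, 1, 0, 0, 0]```.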

## 4 - Global search on Wikipedia data

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Again, look at the data of the [```wiki_qa``` dataset](https://huggingface.co/datasets/wiki_qa) and understand the task. We are now going to perform a **global** search, as the dataset is open domain: when trying to answer a question, we will search among all vectors, rather than only the ones representing the context the answer is found in. How would you verify that the model managed to find the right answer? Let's try two very different ways to evaluate how well the approaches work:
- Checking whether the right result is in the top-$k$ predictions returned by the model.
- Using the [ROUGE](https://aclanthology.org/W04-1013/) score.

Explain how you understand these metrics and how they could be useful here.
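
To fix ideas, both metrics can be sketched in a few lines of plain Python. The function names are ours, tokenization is a simple whitespace split, and the ROUGE variants shown are the balanced F-scores of ROUGE-1 and ROUGE-L; packaged implementations offer more options (stemming, other variants), so treat this as an illustration only.

```python
from collections import Counter


def top_k_hit(ranked_ids, relevant_ids, k):
    """1.0 if any of the first k retrieved ids is relevant, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def f_score(match, cand_len, ref_len):
    """Balanced F-score from a match count and the two sequence lengths."""
    if match == 0:
        return 0.0
    precision, recall = match / cand_len, match / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f(candidate, reference):
    """Unigram-overlap F-score between two whitespace-tokenized strings."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f_score(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate, reference):
    """F-score based on the longest common subsequence of tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f_score(lcs_length(cand, ref), len(cand), len(ref))
```

Top-$k$ only asks whether the labelled answer was retrieved, while ROUGE gives partial credit to a retrieved sentence that shares words with the labelled one.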

<div class='alert alert-block alert-info'>
            Code:</div>
            
We will use the same embeddings as before, but we will use a tool called ```faiss``` to index all of them and speed up the search. Look at the [documentation](https://huggingface.co/docs/datasets/faiss_es). Then, implement or use tools implementing the two metrics, and evaluate both approaches.
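
Before switching to ```faiss```, it can help to see what the index computes. The sketch below is an exhaustive search in plain Python, with hypothetical function names of ours: on L2-normalized vectors the inner product equals the cosine similarity, which is the quantity an exact inner-product index would return. It is only meant for checking results on a small subset, since its cost grows linearly with the number of candidates.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(vectors):
    """The 'index' is just the list of normalized vectors, searched exhaustively."""
    return [normalize(v) for v in vectors]


def search(index, query, k):
    """Return the (score, position) pairs of the k most similar vectors."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(index)]
    return sorted(scored, reverse=True)[:k]
```

The positions returned by ```search``` can be compared with the nearest neighbours obtained from the ```faiss``` index described in the linked documentation.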

## 5 - BONUS: Run a model on the original SQuAD task

<div class='alert alert-block alert-info'>
            Code:</div>

Of course, we need to know that you understood the code: simplify it as much as possible (keep only what is necessary to obtain predictions) and comment it abundantly!
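
If you pick an extractive encoder model, the step most worth commenting is how its outputs become an answer. Such models typically produce one start score and one end score per token, and the prediction is the valid span with the highest combined score. The sketch below shows that decoding step on plain lists; ```best_span``` and ```max_answer_len``` are hypothetical names, and a ready-made question-answering pipeline performs an equivalent step internally.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring valid token span.

    A span is valid when start <= end and it covers at most max_answer_len tokens.
    """
    best = (0, 0, float("-inf"))
    for start, s_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best
```

Taking the two argmax positions independently can yield an end before the start, which is why the search runs over pairs.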