#Introduction
In this project, you will build a production-ready question-answering system.

"Question-answering" means that the system stores a large number of texts on various topics in a text-oriented database called ElasticSearch. You then ask the system a question, and hopefully it finds the most relevant answer in one of the texts.

"Production-ready" means that the response time is fast, whatever the size of the dataset, and also that the data doesn't need to fit in memory.

To make our work easier, we will use a library called [Haystack](https://haystack.deepset.ai/overview/intro).

**As you can see, you get 17 points by answering the questions and filling in the code below. You can get 3 additional points, depending on the quality of the code, the global understanding of the topic, and tests that you might have decided to try by yourself.**

## [ElasticSearch](https://www.elastic.co/fr/)
This database has become increasingly popular in recent years. In a nutshell, it is a NoSQL text-indexing database that can store and retrieve vast amounts of text quickly.

It can be used in many ways, from log storage and analysis to website indexing.

In this project, our usage of ElasticSearch will be simple: store the texts in an "index" (the equivalent of a table in an RDBMS), and search this index for answers to the questions.

#Install Haystack
... Then click on the button "restart runtime" that will appear at the end of the installation

In [None]:
#!pip install grpcio-tools==1.34.1
!pip install farm-haystack==1.11

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack==1.11
  Downloading farm_haystack-1.11.0-py3-none-any.whl (588 kB)
Collecting sentence-transformers>=2.2.0
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting rapidfuzz<2.8.0,>=2.0.15
  Downloading rapidfuzz-2.7.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting transformers==4.21.2
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)

#Download and setup the Elasticsearch instance



In [None]:
%%bash

#Warning: the latest version 7.15.1 crashes in Google Colab, so we use 7.9.2

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz.sha512
# Verify the checksum before extracting the archive
shasum -a 512 -c elasticsearch-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/

elasticsearch-7.9.2-linux-x86_64.tar.gz: OK


Run the instance as a daemon process

In [None]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

In [None]:
import time

# Sleep for a few seconds to let the instance start.
time.sleep(30)
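A fixed 30-second sleep may be too short on a slow machine and wasteful on a fast one. As a minimal sketch, the helpers below poll a readiness check until it succeeds or a timeout expires. The names `elasticsearch_is_up` and `wait_until_ready` are hypothetical helpers, not part of Haystack or Elasticsearch, and the URL assumes the default local port 9200 used in this notebook.

```python
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200/"):
    """Return True if an HTTP GET on `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the instance is not ready yet.
        return False


def wait_until_ready(probe, timeout=120.0, interval=2.0):
    """Call `probe()` every `interval` seconds until it returns True.

    Returns True as soon as the probe succeeds, False if `timeout`
    seconds elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook, a call such as `wait_until_ready(elasticsearch_is_up)` could stand in for the fixed sleep.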

Once the instance has been started, grep for `elasticsearch` in the processes list to confirm the availability.

In [None]:
%%bash

ps -ef | grep elasticsearch

root        1124    1122  0 18:42 ?        00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
daemon      1125    1124 81 18:42 ?        00:00:30 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-1600510571720068807 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecou

Query the base endpoint to retrieve information about the cluster.

In [None]:
%%bash

curl "http://localhost:9200/"

{
  "name" : "99dc4b589416",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "4GmYos5DQ-KKzbXsfVZw3Q",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}




#Create ElasticSearch index before storing data inside
Before storing data inside ElasticSearch, we have to create the index in which we will store the texts.

Here we call our index 'aurelius', but we could have chosen any other name.

In [None]:
from haystack.document_stores import ElasticsearchDocumentStore

In [None]:
aurelius_doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='aurelius'
)

In [None]:
!curl http://localhost:9200/_cat/indices

yellow open aurelius 5OO3T911SF-YDwP07UYe7Q 1 1 0 0 208b 208b
yellow open label    dwSO-7ceQUesBT6lmRgiHw 1 1 0 0 208b 208b


#Download the dataset and store it into ElasticSearch (1 point)

In [None]:
!wget https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt

--2023-01-08 18:43:47--  https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 241387 (236K) [text/plain]
Saving to: ‘clean.txt’


2023-01-08 18:43:47 (52.3 MB/s) - ‘clean.txt’ saved [241387/241387]



In [None]:
with open('clean.txt') as file:
    text = file.readlines()

In [None]:
text[:5]

['From my grandfather Verus I learned good morals and the government of my temper.\n',
 'From the reputation and remembrance of my father, modesty and a manly character.\n',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.\n',
 'From my great-grandfather, not to have frequented public schools, and to have had good teachers at home, and to know that on such things a man should spend liberally.\n',
 "From my governor, to be neither of the green nor of the blue party at the games in the Circus, nor a partizan either of the Parmularius or the Scutarius at the gladiators' fights; from him too I learned endurance of labour, and to want little, and to work with my own hands, and not to meddle with other people's affairs, and not to be ready to listen to slander.\n"]

As explained in the [documentation](https://haystack.deepset.ai/components/document-store#writing-documents-sparse-retrievers), we have to use the method `write_documents(...)` of the object `aurelius_doc_store` to insert the texts into ElasticSearch. Which data format is this method expecting?

```
Which data format is this method expecting ?
```
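This method expects a **list** of dictionaries (or Haystack `Document` objects), where each dictionary holds the text under the `content` key and optional metadata under `meta`.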

In [None]:
dicts = [
    {
      'content' : sentence,
      'meta' : {
          'source' : 'clean.txt'
      }
    } for sentence in text
]
 #YOUR CODE HERE

In [None]:
aurelius_doc_store.write_documents(dicts)

Now you can call the ElasticSearch REST endpoint '[_count](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html)' to find out how many texts you have inserted into the index.

In [None]:
!curl http://localhost:9200/aurelius/_count

{"count":507,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}

#Setup Question - Answer retriever - BM25

##[ElasticsearchRetriever](https://haystack.deepset.ai/components/retriever) (4 points)

Please read the documentation of Haystack by clicking on the link above, and answer the questions.

According to the documentation, this class (imported below as `BM25Retriever`) implements the BM25 algorithm, which is an improvement over TF-IDF.

- Why is it better than TF-IDF ?
- Is it a BOW model ? Why ? 
- What other kind of model might give better results than BOW models ? Why ?

```
Why is it better than TF-IDF?
```
**BM25 is generally considered a more effective ranking function than TF-IDF, for several reasons:**

* BM25 takes document length into account, whereas plain TF-IDF does not. BM25 can therefore penalize long documents that would otherwise dominate the ranking: they may contain more occurrences of a given term without being more relevant to the query than shorter documents.

* BM25 handles term frequency in a more sophisticated way than TF-IDF. Where TF-IDF simply counts the occurrences of a term in a document, BM25 applies a saturating function (controlled by the parameter `k1`) that dampens the impact of high term frequencies. This prevents a single very frequent term from dominating the ranking.
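To make the two points above concrete, here is a minimal sketch of the BM25 score of a single query term, using a common Okapi-style formulation with the usual default values `k1 = 1.2` and `b = 0.75`. The function name `bm25_term_score` and the toy numbers are illustrative assumptions; the implementation that Elasticsearch actually runs may differ in details.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score.

    tf          -- occurrences of the term in the document
    doc_len     -- number of words in the document
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Inverse document frequency: rare terms weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturation: the ratio tends towards (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + norm)


common = dict(avg_doc_len=100, n_docs=1000, doc_freq=50)

# Same term frequency, different lengths: the shorter document scores higher.
short_doc = bm25_term_score(tf=3, doc_len=50, **common)
long_doc = bm25_term_score(tf=3, doc_len=400, **common)

# Saturation: going from 1 to 2 occurrences helps more than from 10 to 11.
gain_low = bm25_term_score(2, 100, **common) - bm25_term_score(1, 100, **common)
gain_high = bm25_term_score(11, 100, **common) - bm25_term_score(10, 100, **common)

print(short_doc > long_doc, gain_low > gain_high)
```

Setting `b = 0` in this sketch switches length normalization off, and a larger `k1` delays the saturation.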

```
Is it a BOW model ? Why ? 
```
BM25 is a bag-of-words (BOW) model: it ranks documents according to the frequency of the query terms that appear in each document, regardless of the order of these terms or of any relationship between them.

```
What other kind of model might give better results than BOW models ? Why ?
```
Several kinds of models used for natural language processing can give better results than bag-of-words (BOW) models:

* Word embedding models (Word2Vec, GloVe)
* Recurrent neural networks (RNNs)
* Transformer-based models (such as BERT)

These models can outperform BOW models because they capture the meaning and context of words, which matters for many natural language processing tasks.

In [None]:
# BM25Retriever implements the BM25 algorithm, which is
# an improvement over TF-IDF: it normalizes scores by document length,
# so long documents are not favored just for containing more words.
# Also it is better suited for question answering searches, see
# https://www.quora.com/Aarkstore-Global-Text-Analytics-Market-Which-one-is-more-robust-between-tf-idf-and-BM25
from haystack.nodes import BM25Retriever

## [Reader](https://haystack.deepset.ai/components/reader) (2 points)
As you can see from the documentation, Haystack chains several modules. Now we have to instantiate the reader.

- In this architecture, what is the role of the ElasticsearchRetriever that we instantiated above? And what is the role of the Reader that we are going to instantiate now?

```
In this architecture, what is the role of the ElasticsearchRetriever that we instantiated above? And what is the role of the Reader that we are going to instantiate now?
```
The role of the **ElasticsearchRetriever** is to return a set of documents that are likely to answer a given query.

The role of the **Reader** is to extract an answer to the question from the documents returned by the ElasticsearchRetriever.

In [None]:
from haystack.nodes import FARMReader

In [None]:
reader = FARMReader(model_name_or_path="deepset/bert-base-cased-squad2", use_gpu=True)
# YOUR CODE HERE: instantiate a reader as per the documentation

Downloading config.json:   0%|          | 0.00/508 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/413M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/152 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

##[ExtractiveQAPipeline](https://haystack.deepset.ai/components/ready-made-pipelines) (2 points)
According to the documentation, what is the purpose of the ExtractiveQAPipeline ?

```
what is the purpose of the ExtractiveQAPipeline ?
```
Given a question, the ExtractiveQAPipeline chains the ElasticsearchRetriever and the Reader to return the answer to the query.

Note: parameters can be passed to each component:
- for the retriever -> the number of documents to select
- for the reader -> the number of answers to return

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

In [None]:
retriever = BM25Retriever(aurelius_doc_store)
qa = ExtractiveQAPipeline(reader, retriever)
# YOUR CODE HERE: instantiate an ExtractiveQAPipeline

In [None]:
query = "What did your grandfather teach?"
result = qa.run(query=query, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}})

from haystack.utils import print_answers
print_answers(result)
# YOUR CODE HERE: find the 3 most relevant answers to the question 'What did your grandfather teach?'

Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.47s/ Batches]


Query: What did your grandfather teach?
Answers:
[   <Answer {'answer': 'good morals and the government of my temper', 'type': 'extractive', 'score': 0.1665748953819275, 'context': 'From my grandfather Verus I learned good morals and the government of my temper.\n', 'offsets_in_document': [{'start': 36, 'end': 79}], 'offsets_in_context': [{'start': 36, 'end': 79}], 'document_id': '3b11ad328b74aeb6facbfa9de45c71bd', 'meta': {'source': 'clean.txt'}}>,
    <Answer {'answer': 'thy life under thy mother', 'type': 'extractive', 'score': 0.0241286251693964, 'context': 'rn thy thoughts now to thy life under thy grandfather, then to thy life under thy mother, then to thy life under thy father; and as thou findest many ', 'offsets_in_document': [{'start': 352, 'end': 377}], 'offsets_in_context': [{'start': 63, 'end': 88}], 'document_id': '405c064e1e0bdb647bda713c10b9ee07', 'meta': {'source': 'clean.txt'}}>,
    <Answer {'answer': 'good teachers', 'type': 'extractive', 'score': 0.011039214208722




#Squad dataset, using Dense vectors




##Pros and cons (2 points)

Please answer the question:

According to the course, Deep Learning has been a big improvement over the earlier BOW models, but it is not perfect. 

- In both cases, texts are represented by vectors of numbers, but there is a big difference between the two approaches. Can you tell what it is?

- What are the pros and cons of the deep learning approach, vs. the BOW one ?

```
In both cases, texts are represented by vectors of numbers, but there is a big difference between the two approaches. Can you tell what it is?
```

**Deep learning approaches:**

Deep learning approaches treat text as a sequence of words and use neural networks to learn complex relationships between inputs and outputs. The resulting vectors are dense, and they capture the context and meaning of the words in a text. However, these models need a large amount of labeled data to be trained.

**BOW approaches:**

BOW approaches represent a text as a bag of its words: the vector counts or weights each vocabulary term, ignoring word order and context. They are less powerful and are ill-suited to tasks that require understanding the context or meaning of words. However, they are simpler and faster to implement and train.
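As a small illustration of what ignoring word order means in practice, the sketch below builds bag-of-words count vectors with the standard library only. The helper `bow_vector` and the two example sentences are made up for this illustration.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


sentence_a = "the dog bites the man"
sentence_b = "the man bites the dog"

# One vector dimension per distinct word of the two sentences.
vocabulary = sorted(set(sentence_a.split()) | set(sentence_b.split()))

vector_a = bow_vector(sentence_a, vocabulary)
vector_b = bow_vector(sentence_b, vocabulary)

print(vocabulary)
print(vector_a, vector_b)
# Both sentences get the same vector although they mean different things.
print(vector_a == vector_b)
```

A model that reads the words in sequence can tell these two sentences apart, which a pure count vector cannot.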


```
What are the pros and cons of the deep learning approach, vs. the BOW one ?
```

**Deep learning:**

* Pros:
  * Deep learning models can process large amounts of data.
  * They can recognize relationships between words.

* Cons:
  * Deep learning models need a large amount of labeled data for training.
  * Deep learning models can be hard to interpret and understand, because they are made of many layers of interconnected neurons.

**Bag-of-words (BOW):**

* Pros:
  * BOW models are simple to understand and implement.

* Cons:
  * BOW models do not capture the context or meaning of words.

##Download the Squad dataset

As seen in the course, Squad is a well-known dataset that is used to evaluate the quality of question-answering algorithms.

In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

--2023-01-08 18:44:22--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.111.153, 185.199.109.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2023-01-08 18:44:24 (345 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]



In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2023-01-08 18:44:24--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.111.153, 185.199.109.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2023-01-08 18:44:25 (229 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [None]:
import json

In [None]:
def preprocess_squad_file(filename):
  """Flatten a SQuAD file into a list of {'question', 'answer', 'context'} dicts."""
  with open(filename, 'rb') as f:
    squad = json.load(f)

  new_squad = []
  for group in squad['data']:
    for paragraph in group['paragraphs']:
      context = paragraph['context']
      for qa_pair in paragraph['qas']:
        question = qa_pair['question']
        # Use the first entry of 'answers' when there is one,
        # otherwise fall back to the first entry of 'plausible_answers'.
        if qa_pair.get('answers'):
          answer = qa_pair['answers'][0]['text']
        elif qa_pair.get('plausible_answers'):
          answer = qa_pair['plausible_answers'][0]['text']
        else:
          answer = None

        new_squad.append({'question': question, 'answer': answer, 'context': context})

  return new_squad

In [None]:
squad_train = preprocess_squad_file("train-v2.0.json")
## YOUR CODE HERE: read the file train-v2.0.json and create a dict that contains the data, in a less complex format.
## You should get a list of dicts {'question': question, 'answer': answer, 'context': context}
## Beware: for some questions, the answer is stored into a field called 'answers' in the squad dataset, but for other
## questions, the answer is stored into a field called 'plausible_answers'

In [None]:
squad_dev = preprocess_squad_file("dev-v2.0.json")
## YOUR CODE HERE: same for the file 'dev-v2.0.json'
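For reference, the sketch below shows the nesting that the preprocessing has to walk through, on a hand-written sample that mimics the layout of the Squad files (`data`, then `paragraphs`, then `qas`). The sample content is invented, and `pick_answer` is a hypothetical helper that mirrors the `answers` / `plausible_answers` fallback described in the comments above.

```python
# Hand-written sample mimicking the layout of the Squad JSON files (invented content).
sample = {
    "data": [{
        "title": "France",
        "paragraphs": [{
            "context": "Paris is the capital of France. Lyon is a large city.",
            "qas": [
                {"question": "What is the capital of France?",
                 "is_impossible": False,
                 "answers": [{"text": "Paris", "answer_start": 0}]},
                {"question": "What is the largest city in Spain?",
                 "is_impossible": True,
                 "answers": [],
                 "plausible_answers": [{"text": "Lyon", "answer_start": 32}]},
            ],
        }],
    }]
}


def pick_answer(qa_pair):
    """Return the first answer text, falling back to plausible answers."""
    for key in ("answers", "plausible_answers"):
        candidates = qa_pair.get(key)
        if candidates:
            return candidates[0]["text"]
    return None


# Flatten the three nested levels into one list of dicts.
flat = [
    {"question": qa["question"], "answer": pick_answer(qa), "context": p["context"]}
    for group in sample["data"]
    for p in group["paragraphs"]
    for qa in p["qas"]
]
print(flat)
```

The second question has no real answer in the context, so only its plausible answer is kept, which is worth remembering when the scores are interpreted later.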

##Store the questions into ElasticSearch (2 points)

In [None]:
def format_for_write_documents(preprocess_squad):
    questions = [sample['question'] for sample in preprocess_squad]
    questions = list(set(questions))
    result = [{'content': question} for question in questions]
    return result

squad_train_es = format_for_write_documents(squad_train)
## YOUR CODE HERE: extract the questions from the list squad_train, and store them into the format expected by the method write_documents(...),
## as we did previously
squad_dev_es = format_for_write_documents(squad_dev)
## YOUR CODE HERE: same with the list squad_dev
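Note that `list(set(...))` removes duplicates but does not keep the original order, and the resulting order can differ between runs. If a reproducible order matters, one possible alternative (a sketch with a hypothetical function name) relies on the fact that dictionaries preserve insertion order:

```python
def format_questions_in_order(samples):
    """Deduplicate questions while keeping their first-seen order."""
    # dict.fromkeys keeps one entry per distinct key, in insertion order.
    unique_questions = dict.fromkeys(sample["question"] for sample in samples)
    return [{"content": question} for question in unique_questions]


toy_samples = [
    {"question": "Who wrote Meditations?"},
    {"question": "What is Stoicism?"},
    {"question": "Who wrote Meditations?"},
]
print(format_questions_in_order(toy_samples))
```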

In [None]:
squad_doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='squad_docs',
    similarity="dot_product"
)

In [None]:
squad_doc_store.write_documents(squad_train_es)

##Facebook Dense Passage Retriever (DPR) (2 points)
Here we use Facebook's state-of-the-art sentence embedding model for question answering, see https://huggingface.co/sentence-transformers/facebook-dpr-ctx_encoder-single-nq-base

You can find the name of the model here, in order to download it: https://huggingface.co/models?sort=downloads&search=dpr

Facebook's DPR works as follows:
- dpr-ctx_encoder-single-nq-base computes the vectors corresponding to the texts that you have stored in the database. Haystack stores these vectors in the database, along with the texts, as we will see below.
- dpr-question_encoder-single-nq-base computes the vector corresponding to your search, in a clever way. Let's say that your question is "What is the capital of France?". Intuitively, the question encoder will generate a vector corresponding to an incomplete answer, such as "The capital of France is...", because it is aware of the structure of the language.
- Then the DensePassageRetriever searches for the nearest vectors stored in the database.
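The last step, the nearest-vector search, can be sketched without any model. The 3-dimensional vectors below are invented toy values (real DPR embeddings have 768 dimensions and come from the two encoders), and `top_k_by_dot_product` is a hypothetical helper, not a Haystack function. The dot product is used as the similarity, matching the `similarity="dot_product"` setting of the document store.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_dot_product(query_vector, passages, k=2):
    """Return the k (score, text) pairs with the highest dot product."""
    scored = [(dot(query_vector, vector), text) for text, vector in passages]
    return sorted(scored, reverse=True)[:k]


# Toy passage embeddings (invented values).
passages = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The mandolin is a string instrument.", [0.0, 0.2, 0.9]),
    ("France is a country in Europe.", [0.6, 0.4, 0.1]),
]
# Toy embedding of the question "What is the capital of France?".
query_vector = [1.0, 0.2, 0.0]

for score, text in top_k_by_dot_product(query_vector, passages):
    print(f"{score:.2f}  {text}")
```

This brute-force loop compares the query with every stored vector; it is only meant to show the ranking principle.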

In [None]:
from haystack.nodes import DensePassageRetriever

In [None]:
squad_dense_retriever = DensePassageRetriever(
    document_store=squad_doc_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    embed_title=True
)

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/493 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/418M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/492 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


Downloading pytorch_model.bin:   0%|          | 0.00/418M [00:00<?, ?B/s]

Now you have to compute the embeddings of the texts, and store them into ElasticSearch. This computation can be very long on CPUs, depending on the number of texts that you have in the database.

In [None]:
squad_doc_store.update_embeddings(squad_dense_retriever)

Updating embeddings:   0%|          | 0/130217 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/10000 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/224 [00:00<?, ? Docs/s]

In [None]:
# YOUR CODE HERE: 
# select a random question from squad_train, 
import random
random_squad = random.choice(squad_train)

query = random_squad["question"]

# search for the best answer to that question into ElasticSearch using our squad_dense_retriever and reader
squad_dense_reader = FARMReader(model_name_or_path="deepset/bert-base-cased-squad2", use_gpu=True)

qa = ExtractiveQAPipeline(squad_dense_reader, squad_dense_retriever)

result = qa.run(query=query, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 1}})

# compare it with the expected answer from squad_train
print_answers(result)

# The expected answer can be None, so format it instead of concatenating strings
print(f"Expected answer: {random_squad['answer']}")

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.92 Batches/s]


Query:  Where do traditional mandolin orchestras remain unpopular?
Answers:
[   <Answer {'answer': 'dance music', 'type': 'extractive', 'score': 0.4510022699832916, 'context': 'These mandolin are used in unpopular dance music called?', 'offsets_in_document': [{'start': 37, 'end': 48}], 'offsets_in_context': [{'start': 37, 'end': 48}], 'document_id': 'c06467c0120aa92d17acc368430f15e4', 'meta': {}}>]
Expected answer: Japan and Germany





#ROUGE score (2 points)
A basic way to score the model would be to count exact matches: is the predicted answer exactly equal to the reference?

It would not be a very good score because, as you know, a human could consider many answers correct even if they are not expressed in exactly the same words as the reference.

In order to fix that issue, you could introduce basic NLP treatments, such as punctuation removal, stopword removal, stemming or lemmatization... But still you could do better.
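The exact-match check and the basic normalization mentioned above can be sketched as follows. The helper names and the short article list are assumptions made for this example; stemming or lemmatization would need an NLP library and is left out.

```python
import string


def normalize(text):
    """Lowercase, strip punctuation and drop English articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    words = [word for word in text.split() if word not in {"a", "an", "the"}]
    return " ".join(words)


def exact_match(prediction, reference):
    """Strict string equality."""
    return prediction == reference


def normalized_exact_match(prediction, reference):
    """Equality after normalization of both strings."""
    return normalize(prediction) == normalize(reference)


prediction = "The fox jumps."
reference = "the fox jumps"

print(exact_match(prediction, reference))
print(normalized_exact_match(prediction, reference))
# A correct but differently worded answer still fails both checks.
print(normalized_exact_match("a leaping fox", reference))
```

Normalization rescues answers that differ only in case or punctuation, but a reworded answer is still rejected, which is the limitation that an overlap-based score tries to soften.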

To be more tolerant, the ROUGE score measures the overlap of n-grams between the prediction and the reference.

In [None]:
!pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [None]:
from rouge import Rouge

f = F1 score 
   - harmonic mean of precision and recall, see [chapter 1.9](https://colab.research.google.com/drive/1OlagQQMeEjAyYT73O5mBnLAySzMi8IDg#scrollTo=UIcqq_TJHamB)

p = precision 
   - % of the prediction which is present in the correct answer
   - nb n-grams in common in prediction and reference / nb n-grams in prediction = 3/7 in the example below, if estimated on unigrams
   - favors shorter answers

r = recall 
   - % of the correct answer which is present in the prediction 
   - nb n-grams in common in prediction and reference / nb n-grams in reference = 3/3 in the example below, if estimated on unigrams
   - favors longer answers

rouge-1 = 1-grams, i.e. words: 'the', 'hello', 'a', 'cat'...

rouge-2 = 2-grams: 'the hello', 'hello a', 'a cat', ...

rouge-l = longest common subsequence (LCS): words that appear in the same order in both texts, not necessarily next to each other
  - 'the fox jumps' in the example below
  - rouge-l recall = nb words in LCS / nb words in reference = 3/3
  - rouge-l precision = nb words in LCS / nb words in prediction = 3/7
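These definitions can be checked by hand. The sketch below recomputes ROUGE-1, ROUGE-2 and ROUGE-L with plain Python for the example sentences used in the next cell. It is a simplified re-implementation for illustration, not the code of the `rouge` package, so tiny numerical differences in the F-score are to be expected.

```python
from collections import Counter


def ngrams(words, n):
    """Multiset of the n-grams of a word list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, start=1):
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def prf(overlap, n_prediction, n_reference):
    """Precision, recall and F1 from an overlap count."""
    p = overlap / n_prediction if n_prediction else 0.0
    r = overlap / n_reference if n_reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


prediction = "the hello a cat dog fox jumps".split()
reference = "the fox jumps".split()

scores = {}
for n in (1, 2):
    pred_ngrams, ref_ngrams = ngrams(prediction, n), ngrams(reference, n)
    # Counter intersection keeps the n-grams present on both sides.
    overlap = sum((pred_ngrams & ref_ngrams).values())
    scores[f"rouge-{n}"] = prf(overlap, sum(pred_ngrams.values()), sum(ref_ngrams.values()))

# 'the fox jumps' appears in order (with gaps) in the prediction.
lcs = lcs_length(prediction, reference)
scores["rouge-l"] = prf(lcs, len(prediction), len(reference))

for name, values in scores.items():
    print(name, values)
```

Because a subsequence may skip words, the LCS here covers all three words of the reference even though 'the' is separated from 'fox jumps' in the prediction.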

In [None]:
predicted_answer = 'the hello a cat dog fox jumps '
correct_answer = 'the fox jumps'

rouge_scorer = Rouge()

rouge_scorer.get_scores(predicted_answer, correct_answer)

[{'rouge-1': {'r': 1.0, 'p': 0.42857142857142855, 'f': 0.5999999958},
  'rouge-2': {'r': 0.5, 'p': 0.16666666666666666, 'f': 0.24999999625000005},
  'rouge-l': {'r': 1.0, 'p': 0.42857142857142855, 'f': 0.5999999958}}]

In [None]:
squad_dev_doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='squad_dev_docs',
    similarity="dot_product"
)
squad_dev_doc_store.write_documents(squad_dev_es)

rouge_dpr_retriever = DensePassageRetriever(
    document_store=squad_dev_doc_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    embed_title=True
)

squad_dev_doc_store.update_embeddings(rouge_dpr_retriever)

rouge_reader = FARMReader(model_name_or_path="deepset/bert-base-cased-squad2", use_gpu=True)

rouge_bow_retriever = BM25Retriever(squad_dev_doc_store)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


Updating embeddings:   0%|          | 0/11864 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/10000 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/1872 [00:00<?, ? Docs/s]

In [None]:
print(len(squad_dev))

11873


In [None]:
## YOUR CODE HERE:

qa_bow = ExtractiveQAPipeline(rouge_reader, rouge_bow_retriever)
qa_dpr = ExtractiveQAPipeline(rouge_reader, rouge_dpr_retriever)

score_bow_sum = 0
score_dpr_sum = 0

def get_rouge_score(qa, sample):
  """Return the ROUGE-L F1 score of the pipeline's top answer for one sample."""
  result = qa.run(query=sample["question"], params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 1}})

  correct_answer = sample["answer"]
  answers = result["answers"]
  predicted_answer = answers[0].answer if answers else None
  # A missing or empty prediction or reference cannot be scored: count it as 0.
  if not predicted_answer or not correct_answer:
    return 0
  try:
    return rouge_scorer.get_scores(predicted_answer, correct_answer)[0]["rouge-l"]["f"]
  except Exception:
    # Any other scoring failure also counts as 0.
    return 0


for sample in squad_dev:
  score_bow_sum += get_rouge_score(qa_bow, sample)
  score_dpr_sum += get_rouge_score(qa_dpr, sample)

avg_bow_score = score_bow_sum / len(squad_dev)
avg_dpr_score = score_dpr_sum / len(squad_dev)

print("")
print("Average BOW score: ", avg_bow_score)
print("Average DPR score: ", avg_dpr_score)

## - compute the average ROUGE scores of the BOW model on the dev Squad dataset
## - compute the average ROUGE scores of the DPR model on the dev Squad dataset

Streaming output truncated to the last 5000 lines.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.18 Batches/s]


Average BOW score:  0.05488946217277628
Average DPR score:  0.048836173608780224



