# Long Form Question Answering with ELI5 and Wikipedia  

---  

### Table of Contents  

1. [Introduction](#intro)  
    a. [Preliminaries](#prelims)
2. [Task and Data Description](#task_description)  
    a. [Note on Data and Biases](#reddit_biases)
3. [Retrieving Support Documents](#retrieval)  
    a. [Sparse Retrieval with ElasticSearch](#elasticsearch)  
    b. [Training a Dense Retriever with ELI5 and in-batch Negatives](#dense_train)  
    c. [Using a Trained Dense Retriever](#dense_use)  
    d. [Retriever Evaluation](#dense_eval)  
4. [Answer Generation Model](#generation)  
    a. [Conditional Generation with Seq2seq Models](#seq2seq_presentation)  
    b. [Fine-Tuning Seq2seq Models](#seq2seq_train)  
5. [Conclusion](#conclusion)  


---

<img src="images/choco_bis.svg" width="900" align="center"/>  


## Introduction
<a id='intro'></a>

Imagine that you are taken with a sudden desire to understand **how the fruit of a tropical tree gets transformed into chocolate bars**, or want to understand **the role of fever in the human body's immune response**: how would you go about finding that information?

If your specific question has already been asked and answered clearly and succinctly on one of the many question answering platforms available on the Internet (such as [**Quora**](https://www.quora.com/How-is-chocolate-made), [**Reddit**](https://www.reddit.com/user/ex_5_libris/comments/9c8gb1/chocolate_how_chocolate_is_made/), or [**Yahoo Answers**](https://answers.yahoo.com/question/index?qid=20070615082202AArsYN1)), you're in luck: modern search engines will probably take you to that pre-existing answer pretty reliably in a matter of a few clicks.  

Otherwise, the process will be a little more involved. You will likely have to collect relevant information from a variety of sources, figure out how these pieces of knowledge fit together in relation to your query, and synthesize a narrative that answers your initial question.

Now, wouldn't it be great if your computer could do all of that for you: **gather** the right sources, **synthesize** the information, and **write up** an easy-to-read summary of the relevant points? Such a system isn't quite available yet, at least not one that can provide *reliable* information in its summary. However, a number of recent advances in natural language understanding and generation have made working toward solving this problem much easier! These advances include progress in the pre-training (e.g. [BART](https://arxiv.org/abs/1910.13461), [T5](https://arxiv.org/abs/1910.10683)) and evaluation (e.g. for [factuality](https://arxiv.org/abs/2004.04228)) of sequence-to-sequence models for conditional text generation, new ways to use language understanding models to find information in Wikipedia (e.g. [REALM](https://kentonl.com/pub/gltpc.2020.pdf), [DPR](https://arxiv.org/abs/2004.04906)), and new [training datasets](https://arxiv.org/abs/1907.09190).

**In this notebook,** we show how we can take advantage of some of these recent works to train a **long form question answering** system which takes in a question, fetches 10 relevant passages from a [Wikipedia snapshot](https://www.aclweb.org/anthology/2020.lrec-1.297/), and writes a multi-sentence answer based on the question and retrieved passages. Follow along to learn about the steps involved and read some background on the state of the art for some related tasks, or go straight to the:  
## [**Live Demo!**](http://35.226.96.115:8080/)  
(And don't forget to scroll down on the left sidebar to show all of the generation options!)

### Preliminaries  
<a id='prelims'></a>

The implementation presented here relies on the [HuggingFace](https://huggingface.co/) [🤗transformers](https://github.com/huggingface/transformers) and [🤗nlp](https://github.com/huggingface/nlp) libraries. Wikipedia indexing relies on [ElasticSearch](https://www.elastic.co/elasticsearch) with its [python bindings](https://github.com/elastic/elasticsearch-py) for the sparse version, and [faiss](https://github.com/facebookresearch/faiss/) for the dense version. You can get all of these by running:
> pip install elasticsearch  
> pip install faiss_gpu  
> pip install nlp  
> pip install transformers  
>  
> wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.1-linux-x86_64.tar.gz  
> tar -xzvf elasticsearch-7.7.1-linux-x86_64.tar.gz  

The training relies on two datasets: [ELI5](https://arxiv.org/abs/1907.09190), a processed version of the [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/) subreddit, and the [Wiki40b](https://www.aclweb.org/anthology/2020.lrec-1.297/) Wikipedia image.

Downloading ELI5 can take up to 72 hours since we need to filter through all of the Reddit dumps for 8 years, so we suggest that you do that first (you will need a good download speed and about 10GB of disk space):

In [1]:
import nlp
eli5 = nlp.load_dataset('explainlikeimfive', name='LFQA_reddit', experimental=True)

This notebook is meant to be run from the `transformers/examples/eli5` folder of the [🤗transformers](https://github.com/huggingface/transformers) repository, as all of the useful methods called here are compiled in the [eli5_utils.py](https://github.com/yjernite/transformers/blob/eli5_examples/examples/eli5/eli5_utils.py) script located there:

In [2]:
from eli5_utils import *

## Task and Data Description
<a id='task_description'></a>

Let's recap: we are interested in the task of Long Form Question Answering. As in other Question Answering tasks, the model is presented with a question, and is required to generate a natural language answer. Whereas a majority of QA datasets contain mostly **factoid** questions, where the answer, such as a date or the name of a single entity, can be expressed in a few words or single sentence, Long Form QA focuses on questions which call for an **explanation** consisting of a few sentences or a few paragraphs.

In order to teach a model to answer such questions, we use questions and answers written by Reddit users. Note that the `nlp.load_dataset` command above actually downloaded questions and their associated answers from the [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/), [r/askscience](https://www.reddit.com/r/askscience/), and [r/AskHistorians](https://www.reddit.com/r/AskHistorians/) subreddits. We focus here on the **ELI5/explainlikeimfive** part to train the system, as these examples tend to be a little simpler.  

Let's look at one item from the test set:

In [3]:
eli5['test_eli5'][12345]

{'q_id': '8houtx',
 'title': 'Why does water heated to room temperature feel colder than the air around it?',
 'selftext': '',
 'document': '',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dylcnfk', 'dylcj49'],
  'text': ["Water transfers heat more efficiently than air. When something feels cold it's because heat is being transferred from your skin to whatever you're touching. Since water absorbs the heat more readily than air, it feels colder.",
   "Air isn't as good at transferring heat compared to something like water or steel (sit on a room temperature steel bench vs. a room temperature wooden bench, and the steel one will feel more cold).\n\nWhen you feel cold, what you're feeling is heat being transferred out of you.  If there is no breeze, you feel a certain way.  If there's a breeze, you will get colder faster (because the moving air is pulling the heat away from you), and if you get into water, its quite good at pulling heat from you.   Get out of the water and ha

So here we have the question:
> Why does water heated to room temperature feel colder than the air around it?  


This definitely requires a multi-step explanation: no single phrase can sum up all of the information we are looking for. Here are the answers that were given on ELI5, and were given scores of +5 and +2 respectively by Reddit users:
> 1. Water transfers heat more efficiently than air. When something feels cold it's because heat is being transferred from your skin to whatever you're touching. Since water absorbs the heat more readily than air, it feels colder.  

> 2. Air isn't as good at transferring heat compared to something like water or steel (sit on a room temperature steel bench vs. a room temperature wooden bench, and the steel one will feel more cold). When you feel cold, what you're feeling is heat being transferred out of you. If there is no breeze, you feel a certain way.  If there's a breeze, you will get colder faster (because the moving air is pulling the heat away from you), and if you get into water, its quite good at pulling heat from you. Get out of the water and have a breeze blow on you while you're wet, all of the water starts evaporating, pulling even more heat from you.  

First, note that in this case **we have two answers** which broadly describe the same phenomenon: the first one is scored higher because it is more succinct and to the point. This example already illustrates one important feature of the LFQA task: **there are usually several valid ways to answer a given question.** Of the 272K examples in the ELI5 training set, nearly two thirds (167K) have at least two answers. We'll need to keep this in mind when training and evaluating the model.  

Secondly, we need to give our model access to the information that is expressed in both these answers. While recently released large models have been shown to hold a significant amount of information about the world in their parameters (see e.g. the [Closed-book QA performance of the T5 model](https://arxiv.org/abs/2002.08910)), there are several advantages to giving the model explicit access to information in text form. First, a larger number of parameters in a model implies a larger computational cost. Secondly, getting information from a text database allows us to easily update the model's knowledge without having to re-train its parameters.

Here, we choose to give the model access to Wikipedia text. We follow previous work in splitting Wikipedia articles into disjoint snippets of 100 words, and keep track of the title of the article and sections a snippet came from. Here's how you can get a pre-processed Wiki40b version split into 100-word passages with the `nlp` library, and an example snippet which has some of the information we're looking for ("*little conduction would occur since air is a poor conductor of heat*"):

In [25]:
wiki40b_snippets = nlp.load_dataset('wiki_snippets', name='wiki40b_en_100_0', experimental=True)['train']
wiki40b_snippets[8991855]

{'_id': '{"nlp_id": 1665419, "wiki_id": "Q179635", "sp": 12, "sc": 653, "ep": 12, "ec": 1223}',
 'nlp_id': 1665419,
 'wiki_id': 'Q179635',
 'start_paragraph': 12,
 'start_character': 653,
 'end_paragraph': 12,
 'end_character': 1223,
 'article_title': 'Heat transfer',
 'section_title': 'Conduction',
 'passage_text': 'from one place to another place without the movement of particles is called conduction, such as when placing a hand on a cold glass of water - heat is conducted from the warm skin to the cold glass, but if the hand is held a few inches from the glass, little conduction would occur since air is a poor conductor of heat. Steady state conduction is an idealized model of conduction that happens when the temperature difference driving the conduction is constant, so that after a time, the spatial distribution of temperatures in the conducting object does not change any'}
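
To make the splitting step concrete, here is a minimal sketch of how an article could be cut into disjoint 100-word snippets. It assumes plain whitespace tokenization, and `split_into_passages` is a hypothetical helper written for illustration: the actual `wiki_snippets` pre-processing also records the paragraph and character offsets shown in the example above.

```python
def split_into_passages(text, words_per_passage=100):
    """Split `text` into disjoint passages of at most `words_per_passage` words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# toy "article" made of 250 distinct words
article = " ".join("w{}".format(i) for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])  # [100, 100, 50]
```

Because the passages are disjoint, joining them back together recovers the original sequence of words.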

In the next section, we show how we can use either a [sparse retriever](#elasticsearch) or a [trained dense retriever](#dense_train) to automatically find relevant snippets for a question.

### Note on Data and Biases
<a id='reddit_biases'></a>

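Before we go any further, a word of caution about the data: content from Reddit is known to include toxic language. We hope that the ELI5 and askscience subreddits fare a bit better in this respect, but much work remains to be done.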


## Retrieving Support Documents
<a id='retrieval'></a>

There has been a renewed interest in open domain question answering tasks in the last few years, as the availability of better text representations has made it possible to train models to first re-rank the outputs of "classical" information retrieval systems, and more recently to function as full IR systems themselves. See [this presentation](https://docs.google.com/presentation/d/1A5wJEzFYGdNem7egJ-BTm6EMI3jGNe1lalyChYL54gw) from the [Hugging Face reading group](https://github.com/huggingface/awesome-papers) for a non-exhaustive overview of work in this field published up to April 2020.

In the rest of this section, we show how to either use such a "classical" IR system based on **sparse** word matching with ElasticSearch, or train a model to compute **dense** vector representations of Wikipedia passages which can be queried through Max Inner Product Search (MIPS) with a query embedding.

### Sparse Retrieval with ElasticSearch
<a id='elasticsearch'></a>

[ElasticSearch](https://www.elastic.co/elasticsearch/) provides a convenient way to index documents so they can easily be queried for nearest neighbor search using the [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) similarity function. In other words, given a query, the system can efficiently return a list of documents that have the most TF/IDF-weighted words in common with that query. While this word-matching based approach has obvious limitations, such as failing to take synonyms and sometimes grammatical variation into account, it does pretty well overall and has only recently been overtaken by embedding-based systems for Wikipedia-based Open-Domain QA tasks.
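
For intuition about what BM25 rewards, here is a small pure-Python sketch of the scoring function. It is not ElasticSearch's implementation: the `bm25_scores` helper, the whitespace tokenization, the toy documents, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in term_freq:
                continue
            # smoothed inverse document frequency: rare terms weigh more
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # term frequency, saturated by k1 and normalized by document length
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "air is a poor conductor of heat",
    "water conducts heat away from the skin faster than air",
    "chocolate is made from cocoa beans",
]
scores = bm25_scores("water heat air", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

The second document wins because it shares all three query words, including the rarer *water*, while the third shares none and scores zero. Note that the score depends only on exact word matches, which is the limitation discussed above.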

In order to use ElasticSearch, you will first need to launch a server. In a different window, run:
> ./elasticsearch-7.7.1/bin/elasticsearch

By default, your ElasticSearch server will be listening on `localhost` port `9200`. To connect to it run: 

In [34]:
es_client = Elasticsearch([{'host': 'localhost', 'port': '9200'}])

The `eli5_utils.py` script provides utilities to create (`make_es_index_snippets`) and query (`query_es_index`) an ElasticSearch index from within Python.

The main implementation details are:  
1. We index the article title, section title, and text of each of the passages for BM25 matching, using the standard ElasticSearch list of English stopwords. These choices are implemented in the `index_config` variable:  
```python
    index_config = {
      "settings": {
        "number_of_shards": 1,
        "analysis": {
          "analyzer": {
            "stop_standard": {"type": "standard", "stopwords": "_english_"}
          }
        }
      },
      "mappings": {
        "properties": {
          "article_title": {"type": "text", "analyzer": "standard", "similarity": "BM25"},
          "section_title": {"type": "text", "analyzer": "standard", "similarity": "BM25"},
          "passage_text": {"type": "text", "analyzer": "standard", "similarity": "BM25"}
        }
      }
    }
```
2. To query the index, we found it useful to add a few task-dependent stop-words. The query text is then compared to all of the indexed fields, giving more weight to the passage text:
```python
    banned = ['how', 'why', 'what', 'where', 'which', 'do', 'does', 'is', '?', 'eli5', 'eli5:']
    q = ' '.join([w for w in q.split() if w not in banned])
    response = es_client.search(
        index = index_name,
        body = {
            "query": {
                "multi_match": {
                    "query": q,
                    "fields": ["article_title", "section_title", "passage_text^2"],
                    "type": "cross_fields",
                }
            },
            "size": n_results,
        }
    )
```  
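
To see what the query filtering does in isolation, here is a short sketch applied to our running example. The `banned` list is the one from the query snippet above; the `strip_question_words` name and the lowercasing step are assumptions added for illustration.

```python
# Question words that carry little retrieval signal (list from the query snippet)
BANNED = ['how', 'why', 'what', 'where', 'which', 'do', 'does', 'is', '?', 'eli5', 'eli5:']


def strip_question_words(question):
    """Lowercase the question (an assumption) and drop the task-dependent stop-words."""
    return ' '.join(w for w in question.lower().split() if w not in BANNED)


print(strip_question_words(
    "Why does water heated to room temperature feel colder than the air around it?"
))
# water heated to room temperature feel colder than the air around it?
```

Since the filter works on whitespace-separated tokens, a question mark attached to a word (as in `it?`) is left in place; only a standalone `?` is dropped.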

Here's the command to create the index. It should take one to three hours depending on your system.

In [30]:
if not es_client.indices.exists('wiki40b_snippets_100w'):
    make_es_index_snippets(es_client, wiki40b_snippets, index_name='wiki40b_snippets_100w')

Now let's test the ElasticSearch retriever with our running example ELI5 question about skin-to-water heat transfer by returning the 10 best candidate passages:

In [35]:
question = eli5['test_eli5'][12345]['title']
doc, res_list = query_es_index(question, es_client, index_name='wiki40b_snippets_100w', n_results=10)

print(question)
print('-----\n')
for i, res in enumerate(res_list):
    print(i+1, "{}: \n  {}\n----\n{}\n".format(
        res['article_title'],
        res['section_title'] if res['section_title'].strip() != '' else res['article_title'],
        res['passage_text']
    ))

Why does water heated to room temperature feel colder than the air around it?
-----

1 Salt fingering: 
  Salt fingering
----
Salt fingering Salt fingering is a mixing process that occurs when relatively warm, salty water overlies relatively colder, fresher water. It is driven by the fact that heated water diffuses more readily than salty water. A small parcel of warm, salty water sinking downwards into a colder, fresher region will lose its heat before losing its salt, making the parcel of water increasingly denser than the water around it and sinking further. Likewise, a small parcel of colder, fresher water will be displaced upwards and gain heat by diffusion from surrounding water, which will then make it lighter than the

2 Solar water heating: 
  Flat plate & Evacuated tube
----
protected by a glass panel. Consequently, these types of collectors are much less efficient when water temperature exceeds ambient air temperatures. For pool heating applications, the water to be heated i

We can immediately see both the strengths and limitations of this approach. The system manages to retrieve documents that are all broadly on topic, mentioning some combination of *water*, *air*, *relative temperature*, and *temperature transfer*. In spite of this, only example 8 ends up containing information that is actually relevant to the question:
> Cold air with high relative humidity "feels" colder than dry air of the same temperature because high humidity in cold weather increases the conduction of heat from the body.  

We got lucky this time, but this passage could as easily have been ranked 11th and not been included in the support document we provide to the answer generation system. As it is, the model will have to sort through mostly off-topic information to find this sentence when reading the resulting supporting document.

### Training a Dense Retriever with ELI5 and in-batch Negatives
<a id='dense_train'></a>

The sparse retriever seems to struggle with understanding the central theme of the query (human-perceived temperature), and gives equal weights to all of the words mentioned. Can we take advantage of our data to train a system that better understands the intent of the question?  

[DPR](https://arxiv.org/abs/2004.04906) is an appealing approach, but it relies on gold annotations, which we don't have.

[REALM](https://arxiv.org/abs/2002.08909) with [Inverse Cloze Task](https://arxiv.org/abs/1906.00300) pre-training works with an intrinsic objective, but requires very large batches and expensive pre-training.

Instead, we propose a contrastive training objective that matches each ELI5 question to its own answer against the answers to the other questions in the batch.

We start with a pre-trained sentence embedding model. We want a model of reasonable size, so we use one of the distilled BERT models presented in [this paper](https://arxiv.org/abs/1909.10351). We learn two different projection matrices to dimension 128, one for the question embedding and one for the answer embedding.

We compute the dot product between a question embedding and the embeddings of all the answers in the batch, and compute the cross-entropy loss for the matching score. With gradient checkpointing, we can use a batch size of 512 and training remains pretty compute-efficient on one GPU.

We then use the answer embedding model to compute a representation for all Wikipedia snippets. Querying the index is then just MIPS between the question embedding and the snippet representations.
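
The in-batch objective described above can be sketched in a few lines. This is an illustration only, written with NumPy on random vectors rather than with the PyTorch training code in `eli5_utils.py`; the `in_batch_negatives_loss` name is hypothetical, and the sketch assumes the loss is computed in the question-to-answer direction only.

```python
import numpy as np


def in_batch_negatives_loss(q_emb, a_emb):
    """Cross-entropy over question/answer dot products within one batch.

    q_emb, a_emb: arrays of shape (batch, dim). Row i of a_emb is the answer
    to question i, and every other row serves as a negative example.
    """
    scores = q_emb @ a_emb.T                                # (batch, batch) dot products
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # the correct answer for question i sits on the diagonal
    return -np.mean(np.diag(log_probs))


rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))
loss_unrelated = in_batch_negatives_loss(q, rng.normal(size=(8, 128)))
loss_matched = in_batch_negatives_loss(q, q)
print(loss_unrelated, loss_matched)
```

When the embeddings carry no information (for example, all zeros), every answer is equally likely and the loss equals the log of the batch size. As matching question and answer embeddings align, the loss approaches zero. Larger batches therefore make the task harder, since each question has to be told apart from more negatives.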

In [3]:
qar_train_dset = ELI5DatasetQARetriver(eli5['train_eli5'], min_answer_length=64, training=True)
qar_valid_dset = ELI5DatasetQARetriver(eli5['validation_eli5'], min_answer_length=64, training=False)

class ArgumentsQAR():
    def __init__(self):
        self.batch_size = 512
        self.max_length = 128
        self.checkpoint_batch_size = 32
        self.print_freq = 10
        self.pretrained_model_name = "google/bert_uncased_L-8_H-768_A-12"
        self.model_save_name = "retriever_models/eli5_retriever"
        self.learning_rate = 2e-4
        self.num_epochs = 20

qar_args = ArgumentsQAR()

qar_tokenizer, qar_model = make_qa_retriever_model(
    model_name=qar_args.pretrained_model_name,
    from_file=None,
    device="cuda:0"
)

qar_optimizer = AdamW(qar_model.parameters(), lr=qar_args.learning_rate, eps=1e-8)
qar_scheduler = get_linear_schedule_with_warmup(
        qar_optimizer,
        num_warmup_steps=100,
        num_training_steps=qar_args.num_epochs * math.ceil(len(qar_train_dset) / qar_args.batch_size)
)

In [None]:
for e in range(qar_args.num_epochs):
    train_qa_retriever_epoch(
        qar_model, qar_train_dset, qar_tokenizer,
        qar_optimizer, qar_scheduler, qar_args, e
    )
    m_save_dict = {
        'model': qar_model.state_dict(),
        'optimizer': qar_optimizer.state_dict(),
        'scheduler': qar_scheduler.state_dict(),
    }
    print("Saving model {}".format(qar_args.model_save_name))
    torch.save(m_save_dict, '{}_{}.pth'.format(qar_args.model_save_name, e))
    eval_loss = evaluate_qa_retriever(qar_model, qar_valid_dset, qar_tokenizer, qar_args)
    print("Evaluation loss epoch {:4d}: {:.3f}".format(e, eval_loss))

 0     0 of   532 	 L: 6.489 	 -- 6.480
 0     1 of   532 	 L: 6.465 	 -- 12.869
 0    10 of   532 	 L: 6.345 	 -- 70.560
 0    20 of   532 	 L: 6.190 	 -- 134.626
 0    30 of   532 	 L: 5.623 	 -- 198.792
 0    40 of   532 	 L: 4.562 	 -- 262.964
 0    50 of   532 	 L: 3.838 	 -- 327.112
 0    60 of   532 	 L: 3.340 	 -- 391.108
 0    70 of   532 	 L: 3.009 	 -- 455.193
 0    80 of   532 	 L: 2.823 	 -- 519.262
 0    90 of   532 	 L: 2.625 	 -- 583.353
 0   100 of   532 	 L: 2.514 	 -- 647.421
 0   110 of   532 	 L: 2.432 	 -- 711.494
 0   120 of   532 	 L: 2.266 	 -- 775.736
 0   130 of   532 	 L: 2.260 	 -- 839.770
 0   140 of   532 	 L: 2.196 	 -- 903.798
 0   150 of   532 	 L: 2.060 	 -- 968.103
 0   160 of   532 	 L: 2.055 	 -- 1032.143
 0   170 of   532 	 L: 1.981 	 -- 1096.322
 0   180 of   532 	 L: 1.910 	 -- 1160.407
 0   190 of   532 	 L: 1.931 	 -- 1224.338
 0   200 of   532 	 L: 1.865 	 -- 1288.429
 0   210 of   532 	 L: 1.829 	 -- 1352.615
 0   220 of   532 	 L: 1.802 	 -

The code to build the index is:

### Using a Trained Dense Retriever
<a id='dense_use'></a>

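With a trained retriever in hand, we load the checkpoint, build a FAISS inner-product index over the pre-computed Wikipedia passage embeddings, and query it with our running example question: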

In [4]:
qar_tokenizer, qar_model = make_qa_retriever_model(
    model_name="google/bert_uncased_L-8_H-768_A-12",
    from_file="retriever_models/eli5_retriever_model_l-8_h-768_b-512-512_9.pth",
    device="cuda:0"
)

In [7]:
faiss_res = faiss.StandardGpuResources()
wiki40b_passage_reps = np.memmap(
            'wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat',
            dtype='float32', mode='r',
            shape=(wiki40b_snippets.num_rows, 128)
)

wiki40b_index_flat = faiss.IndexFlatIP(128)
wiki40b_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 1, wiki40b_index_flat)
wiki40b_gpu_index.add(wiki40b_passage_reps)
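
For intuition, exact Max Inner Product Search over a flat index boils down to one matrix-vector product followed by a sort. The NumPy sketch below illustrates this on toy 2-dimensional vectors; `mips_top_k` is a hypothetical helper, not part of FAISS, which performs the same kind of exhaustive search far more efficiently.

```python
import numpy as np


def mips_top_k(query_emb, passage_embs, k=10):
    """Return indices and scores of the k passages with the largest inner product."""
    scores = passage_embs @ query_emb       # one dot product per passage
    top = np.argsort(-scores)[:k]           # indices sorted by decreasing score
    return top, scores[top]


passages = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]])
query = np.array([1.0, 0.5])
ids, scores = mips_top_k(query, passages, k=2)
print(ids, scores)  # [2 0] [3. 1.]
```

Note that the inner product is not normalized, so passages with larger embedding norms (like the third one here) can outrank passages that point in a more similar direction.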

In [33]:
question = eli5['test_eli5'][12345]['title']
doc, res_list = query_qa_dense_index(
    question,
    qar_model, qar_tokenizer,
    wiki40b_snippets, wiki40b_gpu_index,
    n_results=10
)

print(question)
print('-----\n')
for i, res in enumerate(res_list):
    print(i, "{}: \n  {}\n----\n{}\n".format(
        res['article_title'],
        res['section_title'] if res['section_title'].strip() != '' else res['article_title'],
        res['passage_text']
    ))

Why does water heated to room temperature feel colder than the air around it?
-----

0 Fugacity: 
  History
----
played a similar role to that of temperature in heat flow.

1 Heat transfer: 
  Heat transfer in the human body & Evaporative cooling
----
when the skin is completely wet. The body continuously loses water by evaporation but the most significant amount of heat loss occurs during periods of increased physical activity. Evaporative cooling Evaporative cooling happens when water vapor is added to the surrounding air. The energy needed to evaporate the water is taken from the air in the form of sensible heat and converted into latent heat, while the air remains at a constant enthalpy. Latent heat describes the amount of heat that is needed to evaporate the liquid; this heat comes from the liquid itself and the surrounding gas and surfaces.

2 Johan Sandström: 
  Sandström  Theorem
----
at greater pressures. There is an ambiguity, however, as to the meaning of the terms 'heating'

### Retriever Evaluation
<a id='dense_eval'></a>

How can we evaluate the embedding model? Let's start by grabbing a couple of useful metrics from the `nlp` library:

In [124]:
%%capture --no-stdout
# load the ROUGE and BERTscore metrics from the nlp library
nlp_rouge = nlp.load_metric('rouge')
nlp_bertscore = nlp.load_metric('bertscore')

# takes a list of retrieved documents and a list of possible answers
# for a question and returns a measure of the lexical overlap between the
# passages and answer
def get_aggregate_rouge(res_list, answers):
    res = np.zeros((len(res_list), len(answers), 3))
    for i, hit in enumerate(res_list):
        for j, a in enumerate(answers):
            if len(hit.strip()) > 0 and len(a.strip()) > 0:
                # get Rouge-1 P/R/F for each passage/answer pair
                score = nlp_rouge.compute([hit], [a], rouge_types=['rouge1'])['rouge1'].mid
                res[i,j] = np.array([score.precision, score.recall, score.fmeasure])
    # average P/R/F rouge scores, then find best passage-answer match
    return res.mean(axis=2).max()

# Same with the BERTScore metric, which aligns contextual word embeddings
def get_aggregate_bertscore(res_list, answers):
    res = np.zeros((len(res_list), len(answers), 3))
    for i, hit in enumerate(res_list):
        for j, a in enumerate(answers):
            if len(hit.strip()) > 0 and len(a.strip()) > 0:
                # get BERTScore P/R/F for each passage/answer pair
                score = nlp_bertscore.compute([hit], [a], lang='en')
                res[i,j] = np.array([score['precision'].item(), score['recall'].item(), score['f1'].item()])
    # average P/R/F BERTScore values, then find best passage-answer match
    return res.mean(axis=2).max()

# Compare which retriever finds passages that have the most
# lexical overlap with the ELI5 answers
st_time = time()
tot_rg_sparse = 0.
tot_bs_sparse = 0.
tot_rg_dense = 0.
tot_bs_dense = 0.
valid_slice = eli5['validation_eli5'][:1000]
for i, (question, answers) in enumerate(zip(valid_slice['title'], valid_slice['answers'])):
    # get documents with sparse retriever
    _, sparse_res_list = query_es_index(
        question,
        es_client, index_name='wiki40b_snippets_100w',
        n_results=5
    )
    sparse_passages = [res['passage_text'] for res in sparse_res_list]
    if len(sparse_passages) == 0:
        sparse_passages = [question]
    tot_rg_sparse += get_aggregate_rouge(sparse_passages, answers['text'])
    tot_bs_sparse += get_aggregate_bertscore(sparse_passages, answers['text'])
    # get documents with dense retriever
    _, dense_res_list = query_qa_dense_index(
        question,
        qar_model, qar_tokenizer,
        wiki40b_snippets, wiki40b_gpu_index,
        n_results=5
    )
    dense_passages = [res['passage_text'] for res in dense_res_list]
    tot_rg_dense += get_aggregate_rouge(dense_passages, answers['text'])
    tot_bs_dense += get_aggregate_bertscore(dense_passages, answers['text'])
    # show average scores side by side
    if (i+1) % 10 == 0:
        print("{:03d} Sparse: RG-{:.4f} BS-{:.4f} | Dense: RG-{:.4f} BS-{:.4f} \t {:.2f}".format(
            i+1,
            tot_rg_sparse / (i+1), tot_bs_sparse / (i+1),
            tot_rg_dense / (i+1), tot_bs_dense / (i+1),
            time() - st_time
        ))

010 Sparse: RG-0.2652 BS-0.8053 | Dense: RG-0.2521 BS-0.8141 	 103.34
020 Sparse: RG-0.2657 BS-0.8068 | Dense: RG-0.2647 BS-0.8193 	 174.52
030 Sparse: RG-0.2631 BS-0.8052 | Dense: RG-0.2591 BS-0.8156 	 261.19
040 Sparse: RG-0.2623 BS-0.8049 | Dense: RG-0.2594 BS-0.8149 	 352.42
050 Sparse: RG-0.2660 BS-0.8060 | Dense: RG-0.2639 BS-0.8176 	 457.43
060 Sparse: RG-0.2698 BS-0.8053 | Dense: RG-0.2649 BS-0.8172 	 540.86
070 Sparse: RG-0.2684 BS-0.8058 | Dense: RG-0.2630 BS-0.8187 	 602.46
080 Sparse: RG-0.2671 BS-0.8062 | Dense: RG-0.2640 BS-0.8185 	 694.04
090 Sparse: RG-0.2646 BS-0.8063 | Dense: RG-0.2622 BS-0.8182 	 763.54
100 Sparse: RG-0.2627 BS-0.8058 | Dense: RG-0.2619 BS-0.8190 	 822.09
110 Sparse: RG-0.2646 BS-0.8056 | Dense: RG-0.2626 BS-0.8186 	 900.97
120 Sparse: RG-0.2673 BS-0.8055 | Dense: RG-0.2661 BS-0.8177 	 1013.28
130 Sparse: RG-0.2685 BS-0.8053 | Dense: RG-0.2678 BS-0.8175 	 1080.84
140 Sparse: RG-0.2660 BS-0.8050 | Dense: RG-0.2654 BS-0.8180 	 1380.19
150 Sparse: RG-0.

KeyboardInterrupt: 

In [123]:
sparse_res_list

[]

In [120]:
print("{:03d} Sparse: RG-{:.4f} BS-{:.4f} | Dense: RG-{:.4f} BS-{:.4f} \t {:.2f}".format(
            i+1,
            tot_rg_sparse / (i+1), tot_bs_sparse / (i+1),
            tot_rg_dense / (i+1), tot_bs_dense / (i+1),
            time() - st_time
        ))

199 Sparse: RG-0.2609 BS-0.8014 | Dense: RG-0.2634 BS-0.8135 	 1923.04
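
To clarify what the ROUGE-based aggregate measures, here is a simplified, dependency-free sketch. It assumes lowercased whitespace tokens and no stemming, so its numbers will not match the `nlp` ROUGE metric used above; `rouge1_prf` and `best_passage_answer_match` are hypothetical names that mirror the logic of `get_aggregate_rouge`.

```python
from collections import Counter


def rouge1_prf(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())    # clipped count of shared unigrams
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_passage_answer_match(passages, answers):
    """Average P/R/F for every passage/answer pair and keep the best pair."""
    return max(sum(rouge1_prf(p, a)) / 3 for p in passages for a in answers)


passages = ["air is a poor conductor of heat", "chocolate comes from cocoa beans"]
answers = ["water is a better conductor of heat than air"]
print(best_passage_answer_match(passages, answers))
```

Taking the maximum over pairs means a retriever gets credit as soon as one of its passages overlaps strongly with one of the reference answers, which fits a task where several answers can be valid.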


## Answer Generation Model
<a id='generation'></a>

Once we have a question and a document containing



In [2]:
class ArgumentsS2S():
    def __init__(self):
        self.batch_size = 16
        self.backward_freq = 8
        self.max_length = 1024
        self.print_freq = 100
        self.model_save_name = "seq2seq_models/bart_model"
        self.learning_rate = 2e-4
        self.num_epochs = 20

s2s_args = ArgumentsS2S()

In [3]:
qa_s2s_tokenizer, pre_model = make_qa_s2s_model(
    model_name="facebook/bart-large",
    from_file=None,
    device="cuda:0"
)
qa_s2s_model = torch.nn.DataParallel(pre_model)

In [4]:
eli5_train_docs = json.load(open('precomputed/eli5_train_precomputed_dense_docs.json'))
eli5_valid_docs = json.load(open('precomputed/eli5_valid_precomputed_dense_docs.json'))

s2s_train_dset = ELI5DatasetS2S(eli5['train_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_train_docs]))
s2s_valid_dset = ELI5DatasetS2S(eli5['validation_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_valid_docs]), training=False)

In [5]:
s2s_optimizer = AdamW(qa_s2s_model.parameters(), lr=s2s_args.learning_rate, eps=1e-8)
s2s_scheduler = get_linear_schedule_with_warmup(
        s2s_optimizer,
        num_warmup_steps=400,
        num_training_steps=s2s_args.num_epochs * math.ceil(len(s2s_train_dset) / s2s_args.batch_size)
)

In [6]:
for e in range(s2s_args.num_epochs):
    train_qa_s2s_epoch(
        qa_s2s_model,
        s2s_train_dset, qa_s2s_tokenizer,
        s2s_optimizer, s2s_scheduler,
        s2s_args, e
    )
    m_save_dict = {
        'model': qa_s2s_model.state_dict(),
        'optimizer': s2s_optimizer.state_dict(),
        'scheduler': s2s_scheduler.state_dict(),
    }
    print("Saving model {}".format(s2s_args.model_save_name))
    torch.save(m_save_dict, '{}_{}.pth'.format(s2s_args.model_save_name, e))



 0     0 of 36184 	 L: 4.835 	 -- 29.596
 0     1 of 36184 	 L: 4.687 	 -- 31.556
 0   100 of 36184 	 L: 4.478 	 -- 150.420
 0   200 of 36184 	 L: 3.755 	 -- 269.327
 0   300 of 36184 	 L: 3.469 	 -- 387.855
 0   400 of 36184 	 L: 3.360 	 -- 505.592
 0   500 of 36184 	 L: 3.311 	 -- 624.245
 0   600 of 36184 	 L: 3.274 	 -- 741.951
 0   700 of 36184 	 L: 3.242 	 -- 860.024
 0   800 of 36184 	 L: 3.231 	 -- 977.969
 0   900 of 36184 	 L: 3.214 	 -- 1096.494
 0  1000 of 36184 	 L: 3.230 	 -- 1214.687
 0  1100 of 36184 	 L: 3.227 	 -- 1332.183
 0  1200 of 36184 	 L: 3.186 	 -- 1450.259
 0  1300 of 36184 	 L: 3.206 	 -- 1568.458
 0  1400 of 36184 	 L: 3.209 	 -- 1686.349
 0  1500 of 36184 	 L: 3.184 	 -- 1804.035
 0  1600 of 36184 	 L: 3.204 	 -- 1922.431
 0  1700 of 36184 	 L: 3.190 	 -- 2040.502
 0  1800 of 36184 	 L: 3.174 	 -- 2158.630
 0  1900 of 36184 	 L: 3.181 	 -- 2276.222
 0  2000 of 36184 	 L: 3.187 	 -- 2393.720
 0  2100 of 36184 	 L: 3.192 	 -- 2511.511
 0  2200 of 36184 	 L: 

KeyboardInterrupt: 

In [15]:
torch.cuda.empty_cache()

In [11]:
_ = qa_s2s_model.eval()
s2s_args.print_freq = 100
eval_qa_s2s_epoch(
        qa_s2s_model,
        s2s_valid_dset, qa_s2s_tokenizer,
        s2s_args
)

    0 of  2453 	 L: 3.521 	 -- 0.315
 1000 of  2453 	 L: 3.260 	 -- 319.746
 2000 of  2453 	 L: 3.264 	 -- 638.111
Total 	 L: 3.265 	 -- 782.534


In [29]:
eli5['validation_eli5'][11]

{'q_id': '20q8w1',
 'title': 'How do apps like soundhound and shazam know what song is playing?',
 'selftext': '',
 'document': '',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['cg5r130'],
  'text': ['ELI5:\n\nThink about when you hear your parents, you can recognize their voice right? Or when you see a dog, you can recognize it\'s a dog in general. Now what kind of dog? You can typically recognize it\'s a chihuahua or, my fav, a golden retriever. How? Chihuahuas are small and annoying with short hair, whereas a golden retriever is cute, cuddly, friendly, with long hair (I may have some bias here).\n\nIn the same way, Shazam and Soundhound does that! They take a look at features of a song, like the pitch, tone, or waveform (the "shape" of the song) and try to match it to a song in their memory.'],
  'score': [2]},
 'title_urls': {'url': []},
 'selftext_urls': {'url': []},
 'answers_urls': {'url': []}}

In [9]:
print(qa_s2s_generate(
        s2s_valid_dset[11][0], qa_s2s_model.module, qa_s2s_tokenizer,
        num_answers=1,
        num_beams=8,
        min_len=64,
        max_len=256,
        max_input_length=1024,
        device="cuda:0"
    )[0])

They don't know what song is playing, they just know that it's playing.

Shazam, for example, has an app called "shazam" that has a list of songs that can be played at any time of the day.  Shazam then uses that list to determine which songs are being played at that time.


In [None]:
generated = []
st_time = time()
for i in range(2000):
    generated += [qa_s2s_generate(
        s2s_valid_dset[i][0], qa_s2s_model.module, qa_s2s_tokenizer,
        num_answers=1,
        num_beams=8,
        min_len=64,
        max_len=256,
        max_input_length=1024,
        device="cuda:0"
    )[0]]
    if i % 100 == 0:
        print(eli5['validation_eli5'].num_rows, i, time() - st_time)

In [37]:
def qda_difficulty(question_doc, answer):
    # fraction of the answer's words that also appear in the question + support document
    qd_words = set(question_doc.lower().split())
    answer_words = answer.lower().split()
    recall = len([w for w in answer_words if w in qd_words]) / len(answer_words)
    return recall

In [47]:
recall_diff = [(i, qda_difficulty(*s2s_train_dset[i])) for i in range(10000)]

In [48]:
sorted(recall_diff, key=lambda x:x[1], reverse=True)[:10]

[(4885, 1.0),
 (4112, 0.9523809523809523),
 (8829, 0.9090909090909091),
 (9692, 0.9090909090909091),
 (3443, 0.9032258064516129),
 (2563, 0.9),
 (4930, 0.9),
 (6940, 0.9),
 (9267, 0.8928571428571429),
 (7644, 0.8888888888888888)]