# Long Form Question Answering with ELI5 and Wikipedia  

---  

### Table of Contents  

1. [Introduction](#intro)  
2. [Task and Data Description](#task_description)  
3. [Retrieving Support Documents](#retrieval)  
    a. [Sparse Retrieval with ElasticSearch](#elasticsearch)  
    b. [Using a Trained Dense Retriever](#dense_train)  
    c. [Dense Retriever Evaluation](#dense_eval)  
4. [Answer Generation Model](#generation)  
    a. [Conditional Generation with Seq2seq Models](#seq2seq_presentation)  
    b. [Fine-Tuning Seq2seq Models](#seq2seq_train)  
5. [Conclusion](#conclusion)  


---

<img src="images/choco_bis.svg" width="900" align="center"/>  


## Introduction
<a id='intro'></a>

Imagine that you are taken with a sudden desire to understand **how the fruit of a tropical tree gets transformed into chocolate bars**, or want to understand **the role of fever in the human body's immune response**: how would you go about finding that information?

If your specific question has already been asked and given a clear and succinct answer on one of the many question answering platforms on the Internet (such as [**Quora**](https://www.quora.com/How-is-chocolate-made), [**Reddit**](https://www.reddit.com/user/ex_5_libris/comments/9c8gb1/chocolate_how_chocolate_is_made/), or [**Yahoo Answers**](https://answers.yahoo.com/question/index?qid=20070615082202AArsYN1)), you're in luck: modern search engines will probably take you to that pre-existing answer pretty reliably. Otherwise, the process will be a little more involved. You will likely have to collect relevant information from a variety of sources, figure out how these pieces of knowledge fit together in relation to your query, and synthesize a narrative that answers your initial question.

Now, wouldn't it be great if your computer could do all of that for you: **gather** the right sources, **synthesize** the information, and **write up** an easy-to-read summary of the relevant points? The bad news is: such a system isn't quite available yet, at least not one that can provide *reliable* information in its summary. The good news, on the other hand: a number of recent advances in natural language understanding and generation have made working toward solving this task much easier. These advances include progress in the pre-training (e.g. [BART](https://arxiv.org/abs/1910.13461), [T5](https://arxiv.org/abs/1910.10683)) and evaluation (e.g. for [factuality](https://arxiv.org/abs/2004.04228)) of sequence-to-sequence models used for conditional text generation, new ways to use these models to find information in Wikipedia (e.g. [REALM](https://kentonl.com/pub/gltpc.2020.pdf), [DPR](https://arxiv.org/abs/2004.04906)), and new [training datasets](https://arxiv.org/abs/1907.09190).

**In this notebook,** we show how we can take advantage of some of these recent works to train a **long form question answering** system which takes in a question, fetches 10 relevant passages from a [Wikipedia snapshot](https://www.aclweb.org/anthology/2020.lrec-1.297/), and writes a multi-sentence answer based on the question and retrieved passages. Follow along to learn about the steps involved and read some background on the state of the art for some related tasks, or go straight to the:  
## [**Live Demo!**](http://35.226.96.115:8080/)  
(And don't forget to scroll down on the left sidebar to show all of the generation options!)

#### Preliminaries  

The implementation presented here relies on the [HuggingFace](https://huggingface.co/) [🤗transformers](https://github.com/huggingface/transformers) and [🤗nlp](https://github.com/huggingface/nlp) libraries. Wikipedia indexing relies on [ElasticSearch](https://www.elastic.co/elasticsearch) with its [Python bindings](https://github.com/elastic/elasticsearch-py) for the sparse version, and [faiss](https://github.com/facebookresearch/faiss/) for the dense version. You can get all of these by running:
> pip install elasticsearch  
> pip install faiss_gpu  
> pip install nlp  
> pip install transformers  
>  
> wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.1-linux-x86_64.tar.gz  
> tar -xzvf elasticsearch-7.7.1-linux-x86_64.tar.gz  

The training relies on two datasets: [ELI5](https://arxiv.org/abs/1907.09190), a processed version of the [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/) subreddit, and the [Wiki40b](https://www.aclweb.org/anthology/2020.lrec-1.297/) Wikipedia snapshot. Downloading these and splitting Wikipedia into snippets for indexing can take a long time (up to 72 hours for the ELI5 dataset creation, which has to filter through all of the Reddit dumps). We suggest that you start by downloading and pre-processing them as follows before doing anything else:

```python
import nlp

wiki40b_snippets = nlp.load_dataset('wiki_snippets', name='wiki40b_en_100_0', experimental=True)['train']
eli5 = nlp.load_dataset('explainlikeimfive', name='LFQA_reddit', experimental=True)
```
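
To give a rough idea of what "splitting Wikipedia into snippets" involves, here is a minimal sketch that cuts a text into consecutive passages of a fixed number of words. The helper name `split_into_snippets` and its parameters are hypothetical, and the 100-word length with no overlap is only our reading of the `wiki40b_en_100_0` configuration name; the actual `wiki_snippets` pre-processing may well differ.

```python
def split_into_snippets(text, snippet_len=100, overlap=0):
    """Split `text` into passages of at most `snippet_len` whitespace-separated words.

    Hypothetical helper for illustration only; not the wiki_snippets implementation.
    """
    if not 0 <= overlap < snippet_len:
        raise ValueError("overlap must satisfy 0 <= overlap < snippet_len")
    words = text.split()
    step = snippet_len - overlap  # how far the window advances each time
    snippets = []
    for start in range(0, len(words), step):
        snippets.append(" ".join(words[start:start + snippet_len]))
        # Stop once the current window has reached the end of the text.
        if start + snippet_len >= len(words):
            break
    return snippets


# Toy "article" of 250 dummy words.
article = " ".join("w{}".format(i) for i in range(250))
snippets = split_into_snippets(article)
print([len(s.split()) for s in snippets])  # [100, 100, 50]
```

Fixed-length passages keep every indexed unit short enough to be scored and later fed to a generation model, at the cost of occasionally cutting a sentence in two.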

## Task and Data Description
<a id='task_description'></a>

The task of Long Form Question Answering

```python
eli5['validation_eli5'][123]
```

```
{'q_id': '37a8or',
 'title': 'Why is Google Fibre taking so long to roll out?',
 'selftext': '',
 'document': '',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['crkzpne',
   'crl0n7a',
   'crkzcxh',
   'crkyyph',
   'crl3fkq',
   'crl8oid'],
  'text': ["One does not simply lay down a large fiber network. First, you have to have the money. That's not really an issue for Google. Then, you have to convince municipal governments to let you build a network, and you have to get past the incumbent ISP, who wants to keep their monopoly intact. You have to find enough subscribers, you have to find people to build the network, you have to do customer service and installation, and you have to not be hated by the public. Throwing money at those problems is ineffective.",
   "I'm in Georgia so hopefully I'll see fiber within my lifetime.  \n\nConsidering that we've paid for fiber to the home twice over.  Please see the $300 billion broadband scandal.  \n\nThe fiber is actually the cheap 
```

## Retrieving Support Documents
<a id='retrieval'></a>

The first question is...

### Sparse Retrieval with ElasticSearch
<a id='elasticsearch'></a>

The traditional approach until...  

First, let's create a sparse index:

```python
from eli5_utils import *

es_client = Elasticsearch([{'host': 'localhost', 'port': '9200'}])
if not es_client.indices.exists('wiki40b_snippets_100w'):
    make_es_index_snippets(es_client, wiki40b_snippets, index_name='wiki40b_snippets_100w')
```

Now let's test for one of the ELI5 questions:

```python
question = eli5['validation_eli5'][123]['title']
doc, res_list = query_es_index(question, es_client, index_name='wiki40b_snippets_100w', n_results=10)

print(question)
print('-----\n')
for res in res_list:
    print("{}: \n  {}\n".format(
        res['article_title'],
        res['section_title'] if res['section_title'].strip() != '' else res['article_title']
    ))
```

```
Why is Google Fibre taking so long to roll out?
-----

Internet in New Zealand: 
  Local loop unbundling and the structural separation of Telecom & Recent developments

Bharat Broadband Network: 
  BharatNet Phase-II (Dec 2018)

Google Voice Search: 
  Google Voice Search on Google.com & History

Internet in New Zealand: 
  DSL & Fibre

Roving: 
  Roving

Rolag: 
  Rolag

NBN Co: 
  National Broadband Network

Digital loop carrier: 
  Configuration

After School Club: 
  Premise and format

MNSi Telecom: 
  Acquisitions & Introduction to fibre
```
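
To make the idea of sparse retrieval a little more concrete, the sketch below scores a few toy passages against a query with BM25, the term-matching function that ElasticSearch uses by default. It is a simplified, pure-Python illustration: the `tokenize` and `bm25_scores` helpers, the whitespace tokenizer, and the toy passages are all assumptions made for this example, and this is not how `query_es_index` or ElasticSearch are actually implemented.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption; real analyzers do much more).
    return text.lower().split()


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Return one BM25 score per passage for `query` (toy illustration)."""
    if not passages:
        return []
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # only terms shared by query and passage contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with passage-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


passages = [
    "fibre networks are expensive to roll out",
    "chocolate is made from cocoa beans",
    "google search history",
]
scores = bm25_scores("google fibre roll out", passages)
best = max(range(len(passages)), key=lambda i: scores[i])
print(passages[best])  # fibre networks are expensive to roll out
```

Because only exact word overlap counts, a passage can score well simply by sharing a word such as "Google" with the question, whether or not it helps answer it.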



### Using a Trained Dense Retriever
<a id='dense_train'></a>

Can we take advantage of our data to do better?

### Dense Retriever Evaluation
<a id='dense_eval'></a>

How can we evaluate the embedding model?

## Answer Generation Model
<a id='generation'></a>

Once we have a question and a document containing

