# Question Answering with arXiv papers at scale
This notebook is about neural question answering with transformer models (RoBERTa) at scale. The approach below can perform Q&A across millions of documents in a few seconds.

I will be using the abstracts of arXiv papers for Q&A at this point in time, as I do not have access to the full PDF texts. The same approach can, however, be followed to seek answers in the full text instead of just the abstracts.

I will post another notebook when I get my hands on the full texts of the papers. Now let's dive in...

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Reading the entire JSON metadata
This cell may take a minute to run, considering the volume of data.

In [None]:
import json
data  = []
with open("/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f: 
        data.append(json.loads(line))

I'm limiting my analysis to just 50,000 documents because of the compute limit.
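
Since only the first 50,000 records are kept, one option is to stop reading the file once that many lines have been parsed, rather than loading the whole snapshot first. The sketch below assumes the file holds one JSON object per line, as the loading cell above implies; `read_first_records` is a hypothetical helper name, not part of any library.

```python
import json
from itertools import islice


def read_first_records(path, limit):
    """Parse at most `limit` lines of a JSON-lines file into a list of dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        # islice stops pulling lines from the file once `limit` lines were read
        for line in islice(f, limit):
            line = line.strip()
            if line:  # ignore blank lines, if there are any
                records.append(json.loads(line))
    return records
```

With this helper, `read_first_records("/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json", 50000)` would return the same first 50,000 records without holding the full dataset in memory.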

In [None]:
data = pd.DataFrame(data[:50000])

### Welcome Haystack!

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/sketched_concepts_white.png">

The secret sauce behind scaling up is **Haystack**. It lets you scale QA models to large collections of documents! 
You can read more about this amazing library here: https://github.com/deepset-ai/haystack

For installation: `! pip install git+https://github.com/deepset-ai/haystack.git`

To give some background, Haystack has three major components:
1. **Document Store:** Database storing the documents for our search. The Haystack authors recommend Elasticsearch, but lighter-weight options (SQL or in-memory) are available for fast prototyping.
2. **Retriever:** Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever narrows the scope for the Reader down to smaller units of text where a given question could be answered.
3. **Reader:** Powerful neural model that reads through texts in detail to find an answer. It can use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. You can load a pretrained model from Hugging Face's model hub or fine-tune one on your own domain data.

And then there is **Finder** which glues together a Reader and a Retriever as a pipeline to provide an easy-to-use question answering interface.
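
To make the retrieve-then-read idea concrete, here is a toy sketch in plain Python. It is not Haystack code: `retrieve`, `read` and `find` are hypothetical names, word overlap stands in for BM25, and picking the best-matching sentence stands in for the neural reader. The only point is the shape of the pipeline: a cheap filter over all documents, followed by a more detailed pass over the few survivors.

```python
def _words(text):
    """Lowercased whitespace tokens as a set (deliberately crude)."""
    return set(text.lower().split())


def retrieve(question, documents, top_k=2):
    """Cheap first stage: keep the top_k documents sharing the most words with the question."""
    q = _words(question)
    ranked = sorted(documents, key=lambda d: len(q & _words(d["text"])), reverse=True)
    return ranked[:top_k]


def read(question, candidates):
    """Stand-in for the reader: return the sentence that overlaps most with the question."""
    q = _words(question)
    best = None
    for doc in candidates:
        for sentence in doc["text"].split(". "):
            score = len(q & _words(sentence))
            if best is None or score > best["score"]:
                best = {"answer": sentence, "score": score, "name": doc["name"]}
    return best


def find(question, documents, top_k=2):
    """Glue the two stages together, in the spirit of the Finder."""
    return read(question, retrieve(question, documents, top_k))
```

The expensive step (`read`) only ever sees `top_k` documents, which is why the overall approach stays fast even when the collection is large.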

In [None]:
# installing haystack

! pip install git+https://github.com/deepset-ai/haystack.git

In [None]:
# importing necessary dependencies

from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

### Setting up DocumentStore
Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

The Haystack authors recommend `ElasticsearchDocumentStore` because it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

So let's set up an `ElasticsearchDocumentStore`.

In [None]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2
 
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
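
A fixed 30-second sleep can be too short on a slow machine and needlessly long on a fast one. A possible alternative is to poll the server until it responds. The sketch below uses only the standard library; `wait_for_http` is a hypothetical helper name, and the URL in the usage note assumes Elasticsearch is listening on its default HTTP port 9200.

```python
import http.client
import time
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timed out, or an error status: not ready yet
            pass
        time.sleep(interval)
    return False
```

It could be used as `wait_for_http("http://localhost:9200")` in place of the sleep, checking the returned boolean before creating the document store.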

In [None]:
from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

Once the `ElasticsearchDocumentStore` is set up, we will write our documents/texts to the DocumentStore.
* Writing documents to `ElasticsearchDocumentStore` requires a specific format: a **list of dictionaries**.
The default format here is: 
`[{"name": "<some-document-name>", "text": "<the-actual-text>"},
{"name": "<some-document-name>", "text": "<the-actual-text>"},
{"name": "<some-document-name>", "text": "<the-actual-text>"}]`

(Optionally, you can also add more key-value pairs here, which will be indexed as fields in Elasticsearch and can be accessed later for filtering or shown in the responses of the Finder.)

* We will pass the **title** column as `name` and the **abstract** column as `text`.
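
As an illustration of the `name`/`text` format described above, here is a small pure-Python sketch that maps records with `title` and `abstract` keys to that shape. `to_haystack_dicts` is a hypothetical helper name; the whitespace clean-up and the skipping of empty abstracts are my own assumptions about what is useful here, not something Haystack requires.

```python
def to_haystack_dicts(records):
    """Map records with 'title'/'abstract' keys to {"name": ..., "text": ...} dicts."""
    docs = []
    for record in records:
        # Titles and abstracts may contain hard line breaks;
        # collapse every run of whitespace into a single space
        name = " ".join(record["title"].split())
        text = " ".join(record["abstract"].split())
        if text:  # skip records whose abstract is empty
            docs.append({"name": name, "text": text})
    return docs
```

For example, `to_haystack_dicts(data[['title', 'abstract']].to_dict(orient='records'))` would produce a list that can be passed to `write_documents`.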

In [None]:
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(data[['title', 'abstract']].rename(columns={'title':'name','abstract':'text'}).to_dict(orient='records'))

### Let's prepare the Retriever, Reader & Finder
**Retrievers** help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.

Here: we use Elasticsearch's default BM25 algorithm.
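
For intuition about what BM25 rewards, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not what Elasticsearch runs internally (tokenization, analyzers and implementation details differ); `bm25_scores` is a hypothetical name, whitespace tokenization is a simplifying assumption, and `k1=1.2`, `b=0.75` are the commonly used default parameters. The function expects a non-empty list of non-empty documents.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent from this document
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents containing rare query terms score highest, repeated occurrences give diminishing returns, and long documents are penalized relative to the average length.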

In [None]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

A **Reader** scans the texts returned by retrievers in detail and extracts the k best answers. Readers are based on powerful, but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers. With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

Here: a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)
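
To give a feel for how "the k best answers" can come out of a SQuAD-style model: such models typically emit one start score and one end score per token, and a candidate answer span is scored by adding the start score of its first token to the end score of its last token. The sketch below shows only that selection step on hand-made numbers. `best_spans` is a hypothetical name, and FARM's actual implementation handles more cases (for example, "no answer" predictions) than this does.

```python
def best_spans(start_logits, end_logits, top_k=2, max_answer_len=30):
    """Return the top_k (score, start, end) spans, with score = start logit + end logit."""
    candidates = []
    for start, start_logit in enumerate(start_logits):
        # A span must end at or after its start and contain at most max_answer_len tokens
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            candidates.append((start_logit + end_logits[end], start, end))
    # Highest combined score first
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_k]
```

Given the token indices of the winning span, the answer text is simply the tokens from `start` to `end` inclusive.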

In [None]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True, context_window_size=500)

And finally: the **Finder** sticks together the Reader and the Retriever in a pipeline to answer our actual questions.

In [None]:
finder = Finder(reader, retriever)

### And we're done!
Below is the list of questions that I asked the model, and the results were pleasing.

In [None]:
sample_questions = ["What do we know about Bourin and Uchiyama?",
       "How is structure of event horizon linked with Morse theory?",
       "What do we know about symbiotic stars"]

In [None]:
prediction = finder.get_answers(question="What do we know about symbiotic stars", top_k_retriever=10, top_k_reader=2)
# print_answers prints the answers itself, so there is no need to store its result
print_answers(prediction, details="minimal")
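
The cell above answers only one question. To run every entry of `sample_questions`, a small loop is enough. In the sketch below, `ask_all` is a hypothetical helper that takes any callable, so it can be tried without a running Finder.

```python
def ask_all(questions, answer_fn):
    """Call answer_fn(question) for every question and collect the results by question."""
    results = {}
    for question in questions:
        results[question] = answer_fn(question)
    return results
```

In this notebook it could be called as `ask_all(sample_questions, lambda q: finder.get_answers(question=q, top_k_retriever=10, top_k_reader=2))`, and each stored prediction can then be passed to `print_answers`.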