# Question answering using ElasticSearch and SciBERT
This notebook attempts to answer the most questions in the vaccines and [therapeutics tasks](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=561) using a combination of ElasticSearch for the initial information retrieval and SciBERT for the further answering of the questions. It is loosly based on [this paper by David R. Cheriton](https://arxiv.org/pdf/1902.01718.pdf). 

Roughly what it does is the following:

1.   Retrieve relevant papers based on keywords (this is annotated by humans)
2.   Train a SciBERT model on the SQuAD 2.0 set for Question and Answering
3.   Predict the answer for each of the questions based on each of the relevant articles and display the results.



## Prerequisites
This code depends on Google's BERT implementation and the training script they've written to retrain the BERT model for the SQuAD challenge.
It also requires an ElasticSearch server as well as an Google TPU for faster training and prediction. Alternatively you can also run this code locally, though a GPU is highly recommended.  

In [0]:
!git clone https://github.com/ofjpostema/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 343, done.[K
remote: Total 343 (delta 0), reused 0 (delta 0), pack-reused 343[K
Receiving objects: 100% (343/343), 305.80 KiB | 3.00 MiB/s, done.
Resolving deltas: 100% (187/187), done.


In [0]:
from google.colab import auth
auth.authenticate_user()

In [0]:
# You can't import these by default on Google Colab
from elasticsearch_dsl import connections, Index, Search
from elasticsearch_dsl import Document, Text, Boolean
from elasticsearch import Elasticsearch

In [0]:
import os
import pandas as pd
import json
import pprint
pp = pprint.PrettyPrinter(indent=2)
from collections import Counter
import re
import numpy as np
from tqdm.notebook import tqdm
import datetime
import random
import string
import sys
import collections
import tensorflow as tf

In [0]:
QUESTION_DIR = os.path.join("..", "data", "interim", "questions")
if not os.path.isdir(QUESTION_DIR):
    os.mkdir(QUESTION_DIR)

In [0]:
BUCKET = 'of-covid-19-clean'
output_dir_name = 'bert_output'
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET, output_dir_name)
QUESTION_DIR = "questions"
GS_QUESTION_DIR = 'gs://{}/{}'.format(BUCKET, QUESTION_DIR)
ANSWER_DIR = "answers"
GS_ANSWER_DIR = 'gs://{}/{}'.format(BUCKET, ANSWER_DIR)
TPU_ADDRESS = 'grpc://10.10.15.42:8470'
GCP_PROJECT = 'covid-19-271609'

In [0]:
connections.create_connection(hosts=['localhost'], timeout=20)

<Elasticsearch([{'host': 'localhost'}])>

## ElasticSearch
This section of the code processes all of the documents and reads them into the ElasticSearch index.

In [0]:
# Define the paths to the data
dir_data_raw = os.path.join("..", "data", "raw")
data_dir_interim = os.path.join("..", "data", "interim")
datasets = ['biorxiv_medrxiv', 'comm_use_subset', 'custom_license', 'noncomm_use_subset']

### Formatting
These formatting helper functions are courtesy of [xhlulu](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv)

In [0]:
def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body

### Preprocessing
We'll attempt to extract the results and conclusion sections from the articles. This is done based on the heading titles.

In [0]:
def parse_article(full_path, file_path):
    """
    Parse an article's body text and extract the full text, the results and the conclusion.
    
    full_path: str: The fully qualified path to the file
    file_path: str: The file path starting from the data_raw dir
    """
    section_headings = {
        "results": ["results and discussion", "results"],
        "conclusion": ["conclusion", "conclusions", "discussion and conclusions"],
        #TODO: Intro
    }
    with open(full_path) as file:
        json_article = json.load(file)["body_text"]
        article_sections = []
        # For extracting the main body we 
        metadata.loc[index, 'full_text'] = format_body(json_article)
        for body_text in json_article:
            # Clean the section headings, lowercase and trim them
            section_heading = re.sub(r'[^a-zA-Z0-9 ]', '', body_text["section"]).lower().strip()
            for section, headings in section_headings.items():
                if section_heading in headings:
                    metadata.loc[index, section] =  article[section] + body_text["text"]

In [0]:
# Load the metadata and initialize the new, empty, columns
metadata = pd.read_csv(os.path.join(dir_data_raw, "metadata.csv"))
metadata["full_text"] = ""
metadata["file_path"] = None
metadata["results"] = ""
metadata["conclusion"] = ""

In [0]:
for index, article in tqdm(metadata.iterrows()):
    # We only need to update if there's a full text
    if article["has_full_text"]:
        for dataset in datasets:
            file_path = os.path.join(dataset, dataset, str(article["sha"]) + ".json")
            metadata.loc[index, "file_path"] = file_path
            full_path = os.path.join(dir_data_raw, file_path)
            if os.path.exists(full_path):
                parse_article(full_path, file_path)

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




#### Checkpointing
Optional: Store the resutls in a CSV file.

In [0]:
metadata.to_csv(os.path.join(data_dir_interim, "1_full_data.csv"))

In [0]:
metadata = pd.read_csv(os.path.join(data_dir_interim, "1_full_data.csv"))

### Ingestion
Create a document type for the data and upload all papers that have a full text.

In [0]:
class Paper(Document):
    id = Text(required=True, index='covid')
    title = Text(required=True)
    authors = Text(required=True)
    abstract = Text(required=True)
    text = Text(required=True)
    results = Text(required=True)
    conclusion = Text(required=True)
    bibliography = Text(required=False)

    class Meta:
        name = 'covid'

In [0]:
for index, paper in tqdm(metadata.iterrows()):
    pass

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




In [0]:
index = Index("covid")

for index, paper in tqdm(metadata.iterrows()):
    if paper["has_full_text"]:
        paper_doc = Paper(
            id=paper["sha"] if type(paper["sha"]) == str else "",
            title=paper["title"] if type(paper["title"]) == str else "",
            authors=paper["authors"] if type(paper["authors"]) == str else "",
            abstract=paper["abstract"] if type(paper["abstract"]) == str else "",
            text=paper["full_text"] if type(paper["full_text"]) == str else "",
            results=paper["results"] if type(paper["results"]) == str else "",
            conclusion=paper["conclusion"] if type(paper["conclusion"]) == str else "",
            bibliography=""
        )
        paper_doc.save(index="covid")

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




In [0]:
client = Elasticsearch()

In [0]:
queries = [
    {
        "id": 1,
        "question": "What is the clinical effectiveness of antiviral agents?",
        "keywords": ["clinical effectiveness", "therapeutic", "antiviral agents"],
    },
    {
        "id": 2,
        "question": "What is the effectiveness of drugs being developed and tried to treat COVID-19 patients?",
        "keywords": ["clinical trials", "bench trials", "viral inhibitors", "naproxen", "clarithromycin", "minocyclinethat", "viral replication"],
    },
    {
        "id": 3,
        "question": "Are there potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients?",
        "keywords": ["complications", "Antibody-Dependent Enhancement", "vaccine", "antiviral proteins"],
    },
    {
        "id": 4,
        "question": "Are there animal models that offer predictive value for a human vaccine?",
        "keywords": ["animal models", "predictive", "vaccine"],
    },
    {
        "id": 5,
        "question": "How to distribute scarces therapeutics?",
        "keywords": ["distribution", "therapeutics", "antiviral agents", "decision making", "prioritizing"],
    },
    {
        "id": 6,
        "question": "How to expand production capacity of antiviral agents?",
        "keywords": ["production capacity", "therapeutic", "antiviral agents"],
    },
    {
        "id": 7,
        "question": "Are there universal coronavirus vaccines?",
        "keywords": ["coronavirus vaccine", "universal vaccine"],
    },
    {
        "id": 8,
        "question": "Which animal models are there?",
        "keywords": ["animal models", "challenge studies"],
    },
    {
        "id": 9,
        "question": "Which prophylaxis clinical studies are there?",
        "keywords": ["prevention", "prophylaxis", "clinical study"],
    },
    {
        "id": 10,
        "question": "What is the clinical effectiveness of antiviral agents?",
        "keywords": ["clinical effectiveness", "therapeutic", "antiviral agents"],
    },
]

### Collecting data
We'll collect all relevant data from ElasticSearch. To do this we

1. Create a search query based on the question
2. Get all nouns from the query
3. Get synonyms for all nouns
4. Search using this search query (in the abstract and the keywords)
5. Collect the top 50 results
6. Create a train.json file for this question, posing it to each article

In [0]:
def search_get_results(question, limit_from, limit_size):
    should = [{"match": {"text": keyword}} for keyword in question["keywords"]]
    response = client.search(
        index="covid",
        body={
          "from": limit_from,
          "size": limit_size,
          "query": {
                "bool": {
                  "should": [{"match": {"text": "covid"}}, {"match": {"text": "ncov"}}],
                  "should": should,
                }
          },
        }
    )
    return response


def get_all_results(question, min_score):
    last_score = 10000
    limit_size = 50
    limit_from = 0
    hits = []
    while last_score > min_score:
        search_results = search_get_results(question, limit_from, limit_size)
        hits = hits + search_results["hits"]["hits"]
        limit_from += limit_size
        last_score = hits[-1]["_score"]
    return hits

In [0]:
for query in queries:
    hits = get_all_results(query, 11)
    print("{} hits for query {}".format(len(hits), query["id"]))
    input_questions = {
        "version": "v0.1",
        "data": [
            {
                "title": hit["_source"]["title"],
                "paragraphs": []
            }
        ]
    }
    # Get the query
    for hit in hits:
        input_questions["data"][0]["paragraphs"].append({
            "qas": [{
                "question": query["question"],
                "id": "q_{}_h_{}".format(query["id"], hit["_source"]["id"]),
                "is_impossible": ""
            }],
            "context": hit["_source"]["text"].lower()
        })
    with open(os.path.join(QUESTION_DIR, 
                            "q_{}_h_{}.json".format(
                                query["id"],
                                hit["_source"]["id"]
                            )), 'w') as outfile:
        json.dump(input_questions, outfile)

50 hits for query 1
900 hits for query 2
350 hits for query 3
50 hits for query 4
150 hits for query 5
50 hits for query 6
50 hits for query 7
50 hits for query 8
50 hits for query 9
50 hits for query 10


In [0]:
!gsutil -m cp -r $QUESTION_DIR $GS_QUESTION_DIR

In [0]:
!gsutil -m rm -r $GS_QUESTION_DIR

## Training
We first need to re-train SciBERT to actually answer questions.

In [0]:
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

TPU address is =>  grpc://10.49.216.202:8470


In [0]:
!wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_scivocab_uncased.tar.gz
!tar -xf scibert_scivocab_uncased.tar.gz

--2020-03-23 14:30:05--  https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_scivocab_uncased.tar.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.229.176
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.229.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1216161420 (1.1G) [application/x-tar]
Saving to: ‘scibert_scivocab_uncased.tar.gz’


2020-03-23 14:30:37 (36.9 MB/s) - ‘scibert_scivocab_uncased.tar.gz’ saved [1216161420/1216161420]

CommandException: No URLs matched: /content/bert/scibert_scivocab_uncased


In [0]:
!gsutil mv /content/scibert_scivocab_uncased $BUCKET_NAME

Copying file:///content/scibert_scivocab_uncased/bert_config.json [Content-Type=application/json]...
Removing file:///content/scibert_scivocab_uncased/bert_config.json...
Copying file:///content/scibert_scivocab_uncased/bert_model.ckpt.data-00000-of-00001 [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

Removing file:///content/

In [0]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2020-03-23 14:33:49--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.110.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2020-03-23 14:33:49 (176 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2020-03-23 14:33:50--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.111.153, 185.199.108.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2020-03-23 14:33:51 (49.3 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [0]:
!python bert/run_squad.py \
  --vocab_file=$BUCKET_NAME/scibert_scivocab_uncased/vocab.txt \
  --bert_config_file=$BUCKET_NAME/scibert_scivocab_uncased/bert_config.json \
  --init_checkpoint=$BUCKET_NAME/scibert_scivocab_uncased/bert_model.ckpt \
  --do_train=True \
  --train_file=train-v2.0.json \
  --do_predict=True \
  --predict_file=dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --use_tpu=True \
  --tpu_name=$TPU_ADDRESS \
  --max_seq_length=512 \
  --doc_stride=128 \
  --version_2_with_negative=True \
  --output_dir=$OUTPUT_DIR

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
I0323 15:13:15.614530 139694172755840 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0323 15:13:15.624404 139694172755840 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0323 15:13:15.624521 139694172755840 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0323 15:13:15.633375 139694172755840 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0323 15:13:15.633485 139694172755840 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0323 15:13:15.643283 139694172755840 tpu_estimator.py:600] Enqueue next (1) bat

## Predicting
This assumes that there is already a trained model in the previously mentioned directory in the Google Cloud bucket. If you want to train using a TPU you also need to enter the TPU's address. 

In [0]:
from google.cloud import storage
storage_client = storage.Client(project=GCP_PROJECT)
bucket = storage_client.get_bucket(BUCKET)

In [0]:
question_blobs = bucket.list_blobs(
    prefix=QUESTION_DIR
)

In [0]:
for question in tqdm(question_blobs):
    question_name = question.name
    output_dir_answer = question.name.split(".")[0].split("/")[-1]
    !python bert/run_squad.py \
      --vocab_file=$BUCKET_NAME/scibert_scivocab_uncased/vocab.txt \
      --bert_config_file=$BUCKET_NAME/scibert_scivocab_uncased/bert_config.json \
      --init_checkpoint=$BUCKET_NAME/scibert_scivocab_uncased/bert_model.ckpt \
      --do_train=False \
      --max_query_length=30  \
      --do_predict=True \
      --predict_file=$BUCKET_NAME/$question_name \
      --use_tpu=True \
      --tpu_name=$TPU_ADDRESS \
      --predict_batch_size=8 \
      --n_best_size=3 \
      --max_seq_length=512 \
      --doc_stride=128 \
      --output_dir=$BUCKET_NAME/answers/$output_dir_answer/

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




W0324 15:59:19.537562 139695366555520 module_wrapper.py:139] From bert/run_squad.py:1147: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0324 15:59:19.537784 139695366555520 module_wrapper.py:139] From bert/run_squad.py:1147: The name tf.logging.ERROR is deprecated. Please use tf.compat.v1.logging.ERROR instead.

I0324 15:59:24.522022 139695366555520 utils.py:141] NumExpr defaulting to 2 threads.
2020-03-24 15:59:59.064572: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
2020-03-24 16:00:00.346172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-24 16:00:00.348584: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-

# Postprocessing
After the predictions have been made, we should do some post processing. This is mainly matching the position of the found answer to the original text and extracting passages from it that contain the answer.

In [22]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [23]:
import tensorflow as tf
print(tf.version)

<module 'tensorflow._api.v2.version' from '/usr/local/lib/python3.6/dist-packages/tensorflow/_api/v2/version/__init__.py'>


In [0]:
from bert import tokenization

In [0]:
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self,
                 unique_id,
                 example_index,
                 doc_span_index,
                 tokens,
                 token_to_orig_map,
                 token_is_max_context,
                 input_ids,
                 input_mask,
                 segment_ids,
                 start_position=None,
                 end_position=None,
                 is_impossible=None):
        self.unique_id = unique_id
        self.example_index = example_index
        self.doc_span_index = doc_span_index
        self.tokens = tokens
        self.token_to_orig_map = token_to_orig_map
        self.token_is_max_context = token_is_max_context
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible

In [0]:
tokenizer = tokenization.BasicTokenizer(do_lower_case=True)

In [0]:
def parse_questions(paragraphs):
  """
  Parse the questions from the paragraphs in the questions file.
  This will extract the article texts and the question.
  args
  paragraphs: list: A list of paragraph dicts.

  return 
  articles: dict
  question: str
  """
  qa_id_text = {}
  for paragraph in original_json["data"][0]["paragraphs"]:
    for qas in paragraph["qas"]:
      qa_id_text[qas["id"]] = paragraph["context"]
    question_text = qas["question"]
  return qa_id_text, question_text

def get_sentence_index(word_index, sentences):
  """
  Get the index of a sentence given the index of a word in the text.

  args
  word_index: int: The index of the word
  sentences: list<str>: The list of sentences

  return
  sentence_index: int: The index of the sentence
  """
  i = 0
  for idx_sentence, sentence in enumerate(sentences):
    sententence_tokens = sentence.split(" ")
    i += len(sententence_tokens)
    if i > word_index:
      return idx_sentence

def get_passage(word_index, sentences, text_to_find):
  """
  Get a passage from the text, given a list of sentences, the text to find and 
  an index of the word.
  """
  selected_sentence = get_sentence_index(word_index, sentences)
  distance = 1
  found = False
  while not found and distance < 6:
    combined_sentences = " ".join([
      sentence 
      for idx_sentence, sentence in enumerate(sentences) 
      if  idx_sentence >= selected_sentence - distance and 
          idx_sentence <= selected_sentence + distance
      ]
    )
    if text_to_find in combined_sentences:
      return combined_sentences
    else:
      distance += 1
  return None

In [28]:
answers_blobs = bucket.list_blobs(
    prefix=ANSWER_DIR
)

qa_overview = collections.defaultdict(list)

for answer_blob in tqdm(answers_blobs):
  # Check if these are the n best predictions
  if answer_blob.name.endswith("nbest_predictions.json"):
   
    # Load the question file, to get the original texts
    question_key = answer_blob.name.split("/")[-2]

    # We'll first load the original questions file that was used to predict on
    original = storage.blob.Blob("questions_small/"+question_key+".json", bucket)
    original_json = json.loads(original.download_as_string())
    qa_id_text, question_text = parse_questions(original_json["data"][0]["paragraphs"])

    # Now we'll get the predicted answers
    question_results = json.loads(answer_blob.download_as_string())

    for question, results in tqdm(question_results.items()):
      if results[0]["text"] != "empty":
        text_tokenized = qa_id_text[question].split(" ")
        text = tokenizer._clean_text(qa_id_text[question])
        text = re.sub(' +', ' ', text)

        sentences = nltk.sent_tokenize(text)
        passage = get_passage(results[0]["start_orig_doc"], sentences, results[0]["text"])
        
        if passage:
          id_question = question.split("_")[1]
          id_article = question.split("_")[3].split(".")[0]
          qa_overview[id_question].append({
              "id": id_article,
              "passage": passage, 
              "probability": results[0]["probability"]
          })

for id_question, articles in qa_overview.items():
  qa_overview[id_question] = sorted(articles, key = lambda i: i['probability']) 

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

HBox(children=(IntProgress(value=0, max=49), HTML(value='')))




HBox(children=(IntProgress(value=0, max=49), HTML(value='')))




HBox(children=(IntProgress(value=0, max=900), HTML(value='')))




HBox(children=(IntProgress(value=0, max=350), HTML(value='')))




HBox(children=(IntProgress(value=0, max=50), HTML(value='')))




HBox(children=(IntProgress(value=0, max=150), HTML(value='')))




HBox(children=(IntProgress(value=0, max=50), HTML(value='')))




HBox(children=(IntProgress(value=0, max=50), HTML(value='')))




HBox(children=(IntProgress(value=0, max=50), HTML(value='')))




HBox(children=(IntProgress(value=0, max=50), HTML(value='')))





# Answers
This section attempts to provide answers to the questions that were asked before.

In [0]:
from tabulate import tabulate
from IPython.display import HTML, display
import ipywidgets as widgets

In [30]:
button = widgets.Dropdown(
    options=[(query["question"], query["id"]) for query in queries],
    value=None,
    description='Question:',
    disabled=False,
)
output = widgets.Output()

display(button, output)

def on_change(change):
    with output:
        output.clear_output()
        data = [[article["id"], article["passage"]] for article in qa_overview[str(change.new)]]
        display(HTML(tabulate(data, 
                              tablefmt='html', 
                              headers=["Paper SHA","Passage"])))

button.observe(on_change, 'value')

Dropdown(description='Question:', options=(('What is the clinical effectiveness of antiviral agents?', 1), ('W…

Output()