## Embeddings
In this notebook, we carry out a semantic search using embeddings.

The overall idea is the following:
1. Produce embeddings for the titles of academic papers.
2. Produce an embedding for the query.
3. Use cosine similarity to find the papers that are most similar to this query.
4. Use a question answering model to extract, from the abstracts of the top papers, the passage that best answers this query.
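
Steps 1 to 3 above amount to a nearest-neighbour search under cosine similarity. The following is a minimal, dependency-free sketch on made-up 3-dimensional vectors. The names `cosine` and `rank_by_similarity` and the toy data are illustrative assumptions, not the helpers from `utils` used later in the notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, n=3):
    """Return the indices and scores of the n documents most similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:n], [scores[i] for i in order[:n]]

# Toy "embeddings" standing in for real title embeddings (hypothetical values).
toy_titles = ["risk factors", "protein structure", "mortality estimates"]
toy_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
toy_query = [1.0, 0.2, 0.0]

top, top_scores = rank_by_similarity(toy_query, toy_vecs, n=2)
print([toy_titles[i] for i in top])
```

Because cosine similarity ignores vector length, only the direction of an embedding matters for the ranking.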

We make use of pretrained BERT models fine-tuned for specific tasks, an approach that has recently become very popular in NLP.

BERT is a bidirectional transformer model from Google. Its main difference from earlier unidirectional language models is that it takes both the preceding and the following tokens into account when computing the embedding of the current token. Embeddings from BERT, unlike those from word2vec or fastText, are therefore contextual: the same word gets a different embedding depending on the sentence it appears in.

Unlike those models, BERT also prepends a special [CLS] token to every input. The output at this first position is commonly used as an embedding of the whole sentence, which is very useful for us.
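
To make the idea of a sentence embedding concrete, here is a small sketch of two common ways to turn per-token outputs into one vector: reading the first ([CLS]) position, or averaging over the non-padding tokens. The function names and the toy 2-dimensional vectors are assumptions for illustration only; real BERT-base outputs have 768 dimensions per token.

```python
def cls_pooling(token_vectors):
    """Use the vector at position 0 (the [CLS] token) as the sentence embedding."""
    return list(token_vectors[0])

def mean_pooling(token_vectors, attention_mask):
    """Average the token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy model output for one sentence: 4 positions x 2 dimensions (made-up numbers).
toy_tokens = [[1.0, 0.0],   # [CLS]
              [0.0, 2.0],   # a word piece
              [2.0, 1.0],   # [SEP]
              [9.0, 9.0]]   # [PAD], must not influence mean pooling
toy_mask = [1, 1, 1, 0]

print(cls_pooling(toy_tokens))
print(mean_pooling(toy_tokens, toy_mask))
```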

The real power of BERT comes from fine-tuning it for specific tasks. This is where it beat previous state-of-the-art approaches, for example in question answering, natural language inference and semantic textual similarity.

We work with the following four BERT models. References can be found by following the links.

1. [SciBERT](https://arxiv.org/abs/1903.10676): A BERT model pretrained on a large corpus of scientific papers, with its own vocabulary (the `scivocab` variant loaded below). It should therefore be better suited to the vocabulary and sentence structure we expect.

2. [CovidBERT](https://huggingface.co/deepset/covid_bert_base): This is the standard BERT model fine-tuned specifically on this coronavirus dataset.

3. [Sentence-BERT](https://arxiv.org/abs/1908.10084): This model is trained specifically so that the sentence embeddings it produces are meaningful when compared using cosine similarity. The checkpoint used here, `bert-base-nli-mean-tokens`, obtains them by averaging the token embeddings rather than by reading the [CLS] position.

4. [BERT for question answering](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad): This model is fine-tuned for the question answering task on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset. It is given a question and a piece of text as input, and is trained to output the start and end indices of the answer span within that text.
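
A question answering model of this kind yields one start score and one end score per input token. The sketch below shows one way to turn such scores into an answer span, restricted to the context part of the input and to spans where the start does not come after the end. Note that this is a stricter variant than taking the two argmaxes independently; `best_span`, `max_len` and the toy scores are assumptions made for illustration.

```python
def best_span(start_scores, end_scores, context_start, max_len=30):
    """Return (start, end) maximizing start + end score with context_start <= start <= end."""
    best, best_score = None, float("-inf")
    for s in range(context_start, len(start_scores)):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up tokens and scores; a real model would produce the scores.
toy_tokens = ["[CLS]", "who", "is", "at", "risk", "?", "[SEP]",
              "older", "adults", "are", "at", "risk", "[SEP]"]
toy_start = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 1.0, 0.0, 0.0, 0.0, 0.0]
toy_end = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 4.0, 0.0, 0.0, 1.0, 0.0]

# The context begins right after the first [SEP], which closes the question.
context_start = toy_tokens.index("[SEP]") + 1
s, e = best_span(toy_start, toy_end, context_start)
print(" ".join(toy_tokens[s:e + 1]))
```

Restricting the search to the context prevents answers that start inside the question itself.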

Finally, we compare these models with a baseline that uses the well-known BM25 ranking algorithm.
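
For readers unfamiliar with BM25, the following is a compact sketch of the scoring function on a toy corpus. It is not the `rank_bm25` implementation imported below: the smoothed IDF used here (with `+ 1` inside the logarithm) and the values `k1=1.5`, `b=0.75` are assumptions chosen for the illustration, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in the corpus against the tokenized query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

toy_corpus = [
    "risk factors for covid-19 mortality".split(),
    "spike protein structure of the coronavirus".split(),
    "estimating the case fatality risk".split(),
]
scores = bm25_scores("risk factors".split(), toy_corpus)
print(scores)
```

Unlike the embedding models, BM25 only rewards exact term overlap, which is what makes it a useful lexical baseline.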

In [69]:
import torch
from torch.utils import data
from transformers import AutoTokenizer, AutoModel, AutoModelForQuestionAnswering
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from itertools import islice
import os
import json
from time import time
from collections import Counter
import numpy as np
import pandas as pd
import torch.nn.functional as F
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tnrange, tqdm
from utils import *
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from rank_bm25 import BM25Okapi

%load_ext autoreload
%autoreload 2
text_path = 'data/preprocessed_text.json'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
print("Loading all paper data...")
with open('data/all_text.json', 'r') as f:
    json_data = json.load(f)

print("Loading preprocessed paper data...")
with open('data/preprocessed_text.json', 'r') as f:
    articles = json.load(f)
len(articles.keys())

Loading all paper data...
Loading preprocessed paper data...


33375

### Load the different models that will be used in this work
When run for the first time, this cell may take some time as the models must first be downloaded.

In [4]:
# original scibert
tokenizer_scibert = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model_scibert = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
# bert finetuned on covid
tokenizer_covid = AutoTokenizer.from_pretrained('deepset/covid_bert_base')
model_covid = AutoModel.from_pretrained('deepset/covid_bert_base')
# bert for sentences
model_sent = SentenceTransformer('bert-base-nli-mean-tokens')
# question answering fine-tuned bert
tokenizer_qa = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model_qa = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

### Sentence similarity example
The following is an example of using cosine similarity to find sentences relevant to the query "What are the symptoms for the virus?". We want a **high score** for the first sentence, "Fever was one of the symptoms of the virus.", and a **low score** for the second, "The virus mainly spreads through water molecules.".

In this case, the fine-tuning behind Sentence-BERT really pays off: it is the only model that clearly separates the two sentences (0.81 versus 0.49), whereas SciBERT (0.80 versus 0.76) and CovidBERT (0.90 versus 0.88) score both almost equally.

In [5]:
def sentence_similarity(first, second, model, tokenizer):
    return cosine_similarity(sentence_embedding(tokenizer, model, first), sentence_embedding(tokenizer, model, second))

#### SciBERT

In [6]:
print(sentence_similarity("What are the symptoms for the virus?", "Fever was one of the symptoms of the virus.", model_scibert, tokenizer_scibert))
print(sentence_similarity("What are the symptoms for the virus?", "The virus mainly spreads through water molecules.", model_scibert, tokenizer_scibert))

0.8017190098762512
0.760513424873352


#### CovidBERT

In [7]:
print(sentence_similarity("What are the symptoms for the virus?", "Fever was one of the symptoms of the virus.", model_covid, tokenizer_covid))
print(sentence_similarity("What are the symptoms for the virus?", "The virus mainly spreads through water molecules.", model_covid, tokenizer_covid))

0.8973249197006226
0.8796727061271667


#### Sentence-BERT

In [8]:
print(sentence_similarity("What are the symptoms for the virus?", "Fever was one of the symptoms of the virus.", model_sent, None))
print(sentence_similarity("What are the symptoms for the virus?", "The virus mainly spreads through water molecules.", model_sent, None))

0.8109130859375
0.48866531252861023


### Crop paper titles to the first sentence and drop those that are still too long
In this project, we are limited by the computational power available to us. While it is possible in principle to compute embeddings for all the titles, this would require either a very long computation on every run or very large files to be loaded. Because of this, we limit ourselves to 2000 papers as a proof of concept, knowing that this solution could be extended if successful.

Additionally, we keep only the first sentence of each title to avoid very long computation times, and we impose a limit of 50 tokens after tokenization.

Finally, we drop papers for which the title is empty.

In total, we drop 119 papers (84 with empty titles and 35 with titles that are too long), about 6% of the data.

In [9]:
# select n papers
n = 2000
max_length = 50
selection = take(n, articles)
selected_papers = {key: articles[key] for key in selection}

# used only to display the original (non-processed) abstracts at the end
selected_papers_original = {key: json_data[key] for key in selection}
paper_ids = list(selected_papers_original.keys())

titles = [paper['title'] for paper in selected_papers.values()]
cropped_titles = []
dropped = 0
to_drop = []
for index, title in enumerate(titles):
    if title == '':
        dropped += 1
        to_drop.append(index)
        continue
    dot_index = title.find(".")
    if dot_index == -1:
        cropped_titles.append(title)
    else:
        cropped_titles.append(title[0:dot_index + 1])

print("Dropped {} empty titles".format(dropped))
drop_from_lists([titles, paper_ids], to_drop)
# First pass (one call per tokenizer): drop titles longer than max_length tokens.
# Second pass: rebuild both encodings on the final title list so that they stay aligned.

encoded_scibert, indices_to_drop = get_encodings_drop_long(cropped_titles, tokenizer_scibert, max_length = max_length)
drop_from_lists([cropped_titles, titles, paper_ids], indices_to_drop)

encoded_covid, indices_to_drop = get_encodings_drop_long(cropped_titles, tokenizer_covid, max_length = max_length)
drop_from_lists([cropped_titles, titles, paper_ids], indices_to_drop)

encoded_scibert, indices_to_drop = get_encodings_drop_long(cropped_titles, tokenizer_scibert, max_length = max_length)
drop_from_lists([cropped_titles, titles, paper_ids], indices_to_drop)

encoded_covid, indices_to_drop = get_encodings_drop_long(cropped_titles, tokenizer_covid, max_length = max_length)
drop_from_lists([cropped_titles, titles, paper_ids], indices_to_drop)

index_to_paperid_map = {ind: paper_ids[ind] for ind in range(len(paper_ids))}

# Sanity check: makes sure pre-processed and original data are consistent with each other after selecting papers
assert len(cropped_titles) == len(titles) == len(paper_ids)

Dropped 84 empty titles
Dropped 16 titles
Dropped 19 titles
Dropped 0 titles
Dropped 0 titles
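
The helper `drop_from_lists` comes from `utils` and is not shown in this notebook. As a rough sketch of what such a helper needs to do, the hypothetical function below deletes the same positions from several parallel lists. The name `drop_indices_in_place` and the toy data are assumptions; the real helper may differ.

```python
def drop_indices_in_place(lists, indices_to_drop):
    """Delete the given positions from every list, keeping the lists aligned."""
    # Delete from the highest index down so earlier deletions do not shift later ones.
    for index in sorted(set(indices_to_drop), reverse=True):
        for lst in lists:
            del lst[index]

toy_titles = ["a", "", "c", "d"]
toy_ids = ["id0", "id1", "id2", "id3"]
drop_indices_in_place([toy_titles, toy_ids], [1, 3])
print(toy_titles, toy_ids)
```

Deleting in descending index order is the important detail: deleting in ascending order would shift the remaining positions and remove the wrong entries.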


#### Generate Embeddings
We generate the embeddings for each title by passing them through the model in batches.
This should take 5-10 minutes.

If this fails, try setting num_workers to 0.

In [10]:
batch_size = 32
num_workers = 4

def embed_titles(encoded, model):
    """Return the [CLS] output (position 0 of the last hidden state) for every encoded title."""
    loader = data.DataLoader(encoded, batch_size=batch_size, num_workers=num_workers)
    embeddings = torch.zeros(encoded.shape[0], 768)  # 768 is the hidden size of BERT-base
    cur_index = 0
    with torch.no_grad():
        for batch in tqdm(loader, leave=False, total=len(loader)):
            # batch.shape[0] handles a final batch that is smaller than batch_size
            embeddings[cur_index: cur_index + batch.shape[0]] = model(batch)[0][:, 0, :]
            cur_index += batch.shape[0]
    return embeddings

embeddings_scibert = embed_titles(encoded_scibert, model_scibert)
embeddings_covid = embed_titles(encoded_covid, model_covid)
# Sentence-BERT handles tokenization, batching and pooling itself
embeddings_sent = torch.tensor(model_sent.encode(cropped_titles))


## Similarity search
We perform a search for the query "Risk factors for covid-19 death" and show the results.

The results from Sentence-BERT are much better: almost all of the returned titles are clearly related to the query, especially the top few.

The other two models do not show such strong results: several of their top titles concern other viruses or unrelated topics.

In [11]:
query = "Risk factors for covid-19 death"
query_embedding_scibert = get_query_embedding(tokenizer_scibert, model_scibert, query)
query_embedding_covid = get_query_embedding(tokenizer_covid, model_covid, query)
query_embedding_sent = get_query_embedding(None, model_sent, query)

In [12]:
n = 20
indices_scibert, titles_scibert = find_top_n_similar(embeddings_scibert, query_embedding_scibert, titles, n=n)
titles_scibert

['Outbound traffic from Wuhan and COVID-19 incidence Temporal relationship between outbound traffic from Wuhan and the 2019 coronavirus disease (COVID-19) incidence in China',
 'Comparative Epidemiology of Human Fatal Infections with Novel, High (H5N6 and H5N1) and Low (H7N9 and H9N2) Pathogenicity Avian Influenza A Viruses',
 'Transmission and epidemiological characteristics of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infected Pneumonia (COVID-19): preliminary evidence obtained in comparison with 2003-SARS',
 'In silico study of the spike protein from SARS-CoV-2 interaction with ACE2: similarity with SARS-CoV, hot-spot analysis and effect of the receptor polymorphism',
 'An exploratory randomized, controlled study on the efficacy and safety of lopinavir/ritonavir or arbidol treating adult patients hospitalized with mild/moderate COVID-19 (ELACOI)',
 'The role of institutional trust in preventive and treatment-seeking behaviors during the 2019 novel coronavirus (201

In [13]:
indices_covid, titles_covid = find_top_n_similar(embeddings_covid, query_embedding_covid, titles, n=n)
titles_covid

['Application of extracorporeal membrane oxygenation in patients with severe acute respiratory distress syndrome induced by avian influenza A (H7N9) viral pneumonia: national data from the Chinese multicentre collaboration',
 'Vaccination with single plasmid DNA encoding IL-12 and antigens of 2 severe fever with thrombocytopenia syndrome virus elicits complete 3 protection in IFNAR knockout mice Republic of Korea Author summary 57',
 'BMC Infectious Diseases Development of a reverse transcription-loop-mediated isothermal amplification (RT-LAMP) system for a highly sensitive detection of enterovirus in the stool samples of acute flaccid paralysis cases',
 'Title: Clinical features and progression of acute respiratory distress syndrome in Nutrition and Safety, Ministry of Education Key Lab of Environment and # Corresponding authors: 1 7 Word count: 2,681 (including Research in context) 2 0',
 'Teacher led school-based surveillance can allow accurate tracking of emerging infectious diseas

In [14]:
indices_sent, titles_sent = find_top_n_similar(embeddings_sent, query_embedding_sent, titles, n=n)
titles_sent

['Estimation of risk factors for COVID-19 mortality -preliminary results',
 'Estimates of the severity of COVID-19 disease',
 'Building a COVID-19 Vulnerability Index',
 'Potential Factors for Prediction of Disease Severity of COVID-19 Patients',
 'Investigating the Impact of Asymptomatic Carriers on COVID-19 Transmission',
 'Risk factors related to hepatic injury in patients with corona virus disease 2019',
 'Potential biochemical markers to identify severe cases among COVID-19 patients',
 'Dynamic profile of severe or critical COVID-19 cases',
 'Assessing the Global Tendency of COVID-19 Outbreak',
 'Impact of the contact and exclusion rates on the spread of COVID-19 pandemic',
 'Potential Biases in Estimating Absolute and Relative Case-Fatality Risks during Outbreaks',
 'The time scale of asymptomatic transmission affects estimates of epidemic potential in the COVID-19 outbreak',
 'Estimating the cure rate and case fatality rate of the ongoing epidemic COVID-19',
 'Relations of param

## Visualization (cells must be run)
We take the embeddings from each of the models and reduce their dimensionality using PCA followed by t-SNE.
We highlight the top n results for a query in red.

We would expect the search results to lie mostly in the same region, with different searches producing mostly different red clusters.

Surprisingly, although Sentence-BERT produces clearly superior results, it is the other two models that follow our initial hypothesis, with clearly defined and concentrated clusters, whereas Sentence-BERT produces several sparser clusters.

We were not able to find a fully satisfactory explanation for this. However, upon further inspection, we find that one of these clusters tends to contain the papers that are most related to the search query, while the others contain those that are less related. This can be confirmed by **hovering** over the plots below to see the titles.

This suggests that the extra clusters appear because there are not enough similar papers in our reduced dataset. By reducing n appropriately, we get only one cluster of related papers. This gives us additional evidence that the embeddings produced by Sentence-BERT capture semantic similarity well.
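
The helper `get_tsne_embeddings` used below is defined in `utils` and not shown here. The following is a sketch of what a PCA-then-t-SNE reduction could look like with scikit-learn, run on random data. The function name `reduce_to_2d`, the 20 PCA components and the perplexity of 10 are assumptions chosen so that the toy example runs quickly; they are not the settings of the actual helper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_to_2d(embeddings, pca_dims=20, perplexity=10.0, seed=0):
    """Reduce with PCA to pca_dims dimensions, then with t-SNE to 2 for plotting."""
    reduced = PCA(n_components=pca_dims, random_state=seed).fit_transform(embeddings)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)

# Random vectors standing in for title embeddings (60 points, 64 dimensions).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(60, 64))
coords = reduce_to_2d(toy_embeddings)
print(coords.shape)
```

Running PCA first is a common way to denoise the data and speed up t-SNE. Keep in mind that t-SNE preserves local neighbourhoods rather than global distances, so the apparent number and spacing of clusters should be interpreted with care.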

In [15]:
tsne_scibert = get_tsne_embeddings(embeddings_scibert)
tsne_covid = get_tsne_embeddings(embeddings_covid)
tsne_sent = get_tsne_embeddings(embeddings_sent)

In [16]:
def plot_query_embeddings_plotly(query, titles, n=30):
    models = [model_scibert, model_covid, model_sent]
    tokenizers = [tokenizer_scibert, tokenizer_covid, None]
    embeddings = [embeddings_scibert, embeddings_covid, embeddings_sent]
    tsnes = [tsne_scibert, tsne_covid, tsne_sent]
    plot_titles = ["Scibert", "Covid", "Bert Sentence"]
    fig = make_subplots(rows=1, cols=3, subplot_titles=plot_titles)
    columns = zip(models, tokenizers, embeddings, plot_titles, tsnes)
    for index, (model, tokenizer, embedding, plot_title, tsne) in enumerate(columns):
        query_embedding = get_query_embedding(tokenizer, model, query)
        similar, _ = find_top_n_similar(embedding, query_embedding, titles, n=n)
        similar_set = set(similar[:n].tolist())
        # Top-n results are drawn larger and in red; all other papers are blue
        sizes = [6 if i in similar_set else 4 for i in range(len(titles))]
        colors = ['red' if i in similar_set else 'blue' for i in range(len(titles))]
        fig.add_trace(go.Scatter(x=tsne[:, 0], y=tsne[:, 1], mode="markers", text=titles,
                                 marker=dict(size=sizes, color=colors)),
                      1, index + 1)
        print("--------Top 3 results for {}:--------".format(plot_title))
        for i in similar[:3]:
            print(titles[i])
    fig.update_layout(height=400, width=1000, title_text="Visualization of Search Results for '{}'".format(query))
    fig.show()

In [17]:
plot_query_embeddings_plotly("Risk factors for covid-19 death", titles)

--------Top 3 results for Scibert:--------
Outbound traffic from Wuhan and COVID-19 incidence Temporal relationship between outbound traffic from Wuhan and the 2019 coronavirus disease (COVID-19) incidence in China
Comparative Epidemiology of Human Fatal Infections with Novel, High (H5N6 and H5N1) and Low (H7N9 and H9N2) Pathogenicity Avian Influenza A Viruses
Transmission and epidemiological characteristics of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infected Pneumonia (COVID-19): preliminary evidence obtained in comparison with 2003-SARS
--------Top 3 results for Covid:--------
Application of extracorporeal membrane oxygenation in patients with severe acute respiratory distress syndrome induced by avian influenza A (H7N9) viral pneumonia: national data from the Chinese multicentre collaboration
Vaccination with single plasmid DNA encoding IL-12 and antigens of 2 severe fever with thrombocytopenia syndrome virus elicits complete 3 protection in IFNAR knockout mice 

In [18]:
plot_query_embeddings_plotly("Transmission mechanisms of covid-19", titles)

--------Top 3 results for Scibert:--------
Comparative Epidemiology of Human Fatal Infections with Novel, High (H5N6 and H5N1) and Low (H7N9 and H9N2) Pathogenicity Avian Influenza A Viruses
Structural modeling of 2019-novel coronavirus (nCoV) spike protein reveals a proteolytically- sensitive activation loop as a distinguishing feature compared to SARS-CoV and related SARS- like coronaviruses
In silico study of the spike protein from SARS-CoV-2 interaction with ACE2: similarity with SARS-CoV, hot-spot analysis and effect of the receptor polymorphism
--------Top 3 results for Covid:--------
medicina Antiviral Activity of Exopolysaccharides Produced by Lactic Acid Bacteria of the Genera Pediococcus, Leuconostoc and Lactobacillus against Human Adenovirus Type 5
The role of post-Golgi transport pathways and sorting motifs in the plasmodesmal targeting of the movement protein (MP) of Ourmia melon virus (OuMV)
Comparative Epidemiology of Human Fatal Infections with Novel, High (H5N6 and H5N

### Task 1
#### What is known about transmission, incubation, and environmental stability? What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control?
* Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.
* Prevalence of asymptomatic shedding and transmission (e.g., particularly children)
* Seasonality of transmission
* Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).
* Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).
* Disease models, including animal models for infection, disease and transmission
* Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings

In [52]:
def display_closest_papers(indices: list):
    for ind in indices:
        print("------------------------------------------------")
        paper_id = index_to_paperid_map[ind]
        abstract = selected_papers_original[paper_id]['abstract']
        title = selected_papers_original[paper_id]['title']
        print("Title: ", title)
        print("Abstract: ", abstract)

### Approach 1: Query for all subtasks of the task jointly
In this experiment, we represent **all subtasks of task 1 as a single query**. We then obtain the embedding of this query from Sentence-BERT and find the papers whose Sentence-BERT title embeddings are most similar to it.

In our opinion, the papers retrieved using embeddings are much more clearly relevant than those retrieved with the [LDA](./LDA-answer-finding.ipynb) approach. The embeddings seem to capture the semantic meaning of the query and produce results that mostly address the topic.

Although the set of task 1 subtasks is too broad to be handled well jointly, there is still a direct connection between most of the matched papers and a specific question in task 1. Examples:

* The papers "*Feline immunodeficiency virus in puma: Estimation of force of infection reveals insights into transmission*", "*Comparative Proteome Analysis of Porcine Jejunum Tissues in Response to a Virulent Strain of Porcine Epidemic Diarrhea Virus and Its Attenuated Strain*" and "*Comparative analysis of routes of immunization of a live porcine reproductive and respiratory syndrome virus (PRRSV) vaccine in a heterologous virus challenge study*" present animal-based disease models, answering the subtask "*Disease models, including animal models for infection, disease and transmission*".
* The 2nd paper, "*Middle East respiratory syndrome coronavirus: transmission, virology and therapeutic targeting to aid in outbreak control*", talks about transmission in general.
* The paper "*Structure and immune recognition of the porcine epidemic diarrhea virus spike protein*" provides results on a CoV-related virus that emerges from fecal matter and affects pigs. This result seems to match parts of two subtasks, "*Disease models in animals*" and "*Persistence and stability on substrates and sources (e.g. fecal matter)*", but in reality it does not answer either of them. This demonstrates the disadvantage of considering the subtasks jointly.


In [62]:
n = 10
task1_questions = [
    # general questions
    "What is known about transmission incubation and environmental stability? What do we know about natural history transmission and diagnostics for the virus? What have we learned about infection prevention and control?",
    
    # sub-task questions
    "Range of incubation periods for the disease in humans and how this varies across age and health status and how long individuals are contagious, even after recovery.",
    "Prevalence of asymptomatic shedding and transmission particularly children.", 
    "Seasonality season of transmission.",
    "Physical science of the coronavirus charge distribution, adhesion to hydrophilic phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding.",
    "Persistence and stability on a multitude of substrates and sources nasal discharge sputum  urine  fecal matter  blood.",
    "Disease models including animal models for infection disease and transmission",
    "Effectiveness of movement control strategies to prevent secondary transmission in health care healthcare and community settings."]

task1_query = '. '.join(task1_questions)
task1_embedding = get_query_embedding(None, model_sent, task1_query)
indices, relevant_titles = find_top_n_similar(embeddings_sent, task1_embedding, titles, n=n)
display_closest_papers(indices.tolist()[:n])

------------------------------------------------
Title:  Feline immunodeficiency virus in puma: Estimation of force of infection reveals insights into transmission
Abstract:  1. Determining parameters that govern pathogen transmission (such as the force of infection, FOI), and pathogen impacts on morbidity and mortality, is exceptionally challenging for wildlife. Vital parameters can vary, for example across host populations, between sexes and within an individual's lifetime. species, forming species-specific viral-host associations. FIV infection is common in populations of puma (Puma concolor), yet uncertainty remains over transmission parameters and the significance of FIV infection for puma mortality. In this study, the age-specific FOI of FIV in pumas was estimated from prevalence data, and the evidence for disease-associated mortality was assessed. 3. We fitted candidate models to FIV prevalence data and adopted a maximum likelihood method to estimate parameter values in each mod

### Approach 2: Querying subtasks separately
Once again, we achieve better results with this method than with the LDA method. The top results in particular are relevant to the queries in every case. However, it is not entirely clear which of the two approaches in this notebook is superior. It is possible that, with a larger dataset, this approach becomes more relevant.

Here, we also take the top 5 abstracts from the results and use the question answering model to attempt to retrieve the most relevant passage directly, with mixed results. Some of these abstracts are too long for the model, so we keep only the last 5 sentences of each, which we find to be a decent compromise.

The poor results may be due to the questions posed being lengthy in general, which tends to make question answering more difficult. Furthermore, sentence segmentation is sometimes unreliable: the abstracts are split on every period, so decimal points are mistaken for full stops.
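
One simple way to reduce the decimal-point problem is to split only where sentence-ending punctuation is followed by whitespace. The sketch below illustrates this; `last_sentences` and the toy abstract are assumptions for illustration, and the rule is still a heuristic (abbreviations such as "e.g. " would still be split).

```python
import re

# Split after '.', '!' or '?' only when whitespace follows, so "3.18" stays intact.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def last_sentences(text, k=5):
    """Return the last k sentences of text, joined by single spaces."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    return " ".join(sentences[-k:])

toy_abstract = ("We identified 36 studies. The odds ratio was 3.18 (1.02-9.90). "
                "Crowding was also associated with severe illness.")
print(last_sentences(toy_abstract, k=2))
```

By comparison, `toy_abstract.split(".")` breaks the same text into seven pieces, cutting through every number.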

### This example illustrates the desired outcome for question answering
Context text: "Based on currently available information and clinical expertise, older adults and people of any age who have serious underlying medical conditions might be at higher risk for severe illness from COVID-19."

In [77]:
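def answer(question, text, tokenizer, model):
    input_ids = tokenizer.encode(question, text)
    # Segment 0 covers the question up to and including the first [SEP]; segment 1 is the text
    first_sep = input_ids.index(tokenizer.sep_token_id)
    token_type_ids = [0 if i <= first_sep else 1 for i in range(len(input_ids))]
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
    # Index the output instead of unpacking it, so both tuple and ModelOutput returns work
    start_scores, end_scores = outputs[0], outputs[1]
    all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])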

In [169]:
question, text = "Who is at risk for coronavirus?", "Based on currently available information and clinical expertise, older adults and people of any age who have serious underlying medical conditions might be at higher risk for severe illness from COVID-19."
answer(question, text, tokenizer_qa, model_qa)

'older adults and people of any age who have serious underlying medical conditions'

#### Subtask: Prevalence of asymptomatic shedding and transmission particularly children.
All of the retrieved documents discuss COVID-related infections in children. From our investigation, only one paper explicitly focuses on children as asymptomatic carriers: "*Investigating the Impact of Asymptomatic Carriers on COVID-19 Transmission*". The reduced size of our dataset may play a larger role for this very specific query.

In [28]:
question = "What is the prevalence of asymptomatic shedding and transmission particularly children?"
question_embedding = get_query_embedding(None, model_sent, question)
indices, _ = find_top_n_similar(embeddings_sent, question_embedding, titles, n=n)
display_closest_papers(indices.tolist()[:n])

------------------------------------------------
Title:  Risk factors for severe acute lower respiratory infections in children -a systematic review and meta-analysis
Abstract:  Aim To identify the risk factors in children under five years of age for severe acute lower respiratory infections (ALRI), which are the leading cause of child mortality. We performed a systematic review of published literature available in the public domain. We conducted a quality assessment of all eligible studies according to GRADE criteria and performed a meta-analysis to report the odds ratios for all risk factors identified in these studies. We identified 36 studies that investigated 19 risk factors for severe ALRI. Of these, 7 risk factors were significantly associated with severe ALRI in a consistent manner across studies, with the following meta-analysis estimates of odds ratios (with 95% confidence intervals): low birth weight 3.18 (1.02-9.90), lack of exclusive breastfeeding 2.34 (1.42-3.88), crowdin

In [159]:
paper_ids = [index_to_paperid_map[indices[i].item()] for i in range(5)]
abstracts = [selected_papers_original[paper_id]['abstract'] for paper_id in paper_ids]
shorter_abstracts = ['.'.join(abstract.split(".")[-5:]) for abstract in abstracts]
for ab in shorter_abstracts:
    print(answer(question, ab, tokenizer_qa, model_qa))

under - five
unclear
18 %
severity of acute bro ##nch ##iol ##itis depends on carried viruses . pl ##os one 4 ( 2 ) : e ##45 ##9 ##6
what is the prevalence of as ##ym ##pt ##oma ##tic shed ##ding and transmission particularly children ? [SEP] however , if children continue to mix with others outside the home during the closure ##s , these measures are unlikely to be effective
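
The `##` fragments in the output above are WordPiece sub-word tokens. A small sketch of how they could be merged back into whole words for display follows; `merge_wordpieces` is a hypothetical helper and is not part of the notebook's `answer` function.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into words: a token starting with '##' continues the previous one."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(merge_wordpieces(["as", "##ym", "##pt", "##oma", "##tic", "shed", "##ding"]))
```

This only repairs sub-word splits; spacing around punctuation, as in `18 %`, would need separate handling.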


#### Subtask: Disease models including animal models for infection disease and transmission
Only the first 3 papers seem relevant, and they all relate to animal models. It seems that the search interpreted the phrase "including animal models" too restrictively. Interestingly, LDA actually provided stronger results in this category. This may be because we limit ourselves to 2000 papers here.

In [75]:
question = "What do disease models including animal models for infection disease and transmission tell us?"
question_embedding = get_query_embedding(None, model_sent, question)
indices, _ = find_top_n_similar(embeddings_sent, question_embedding, titles, n=n)
display_closest_papers(indices.tolist()[:n])

------------------------------------------------
Title:  Animal virus ecology and evolution are shaped by the virus host-body
Abstract:  12 The current classification of animal viruses primarily relates to the virus molecular world, the 13 genomic architecture and the corresponding host-cell infection cycle. This virus centered 14 perspective does not make allowance for the precept that virus fitness hinges on the virus 15 transmission success. Virus transmission reflects the infection-shedding-transmission 16 dynamics and, with it, the organ system involvement and other, macroscopic dimensions of 17 the host environment. This study examines the transmission ecology of the world main 18 livestock viruses, 36 in total, belonging to eleven different families, and a mix of RNA, DNA 19 and retroviruses. Viruses are virtually ranked in an outer-to inner-body fashion, based on 20 the shifting organ system involvement and associated infection-shedding-transmission 21 dynamics. As a next step,

In [78]:
paper_ids = [index_to_paperid_map[indices[i].item()] for i in range(5)]
abstracts = [selected_papers_original[paper_id]['abstract'] for paper_id in paper_ids]
shorter_abstracts = [' '.join(abstract.split(".")[-5:]) for abstract in abstracts]
for ab in shorter_abstracts:
    print(answer(question, ab, tokenizer_qa, model_qa))

viruses are virtually ranked in an outer - to inner - body fashion , based on 20 the shifting organ system involvement and associated infection - shed ##ding - transmission 21 dynamics
anti ##vira ##l immune responses and viral pathogen ##esis
disease models including animal models for infection disease and transmission tell us ? [SEP]
we propose that deployment of i ##g domain proteins is a widely - used strategy by 28 viruses
creative commons at ##tri ##bution 4 0 international license ( http : / / creative ##com ##mons org / licenses / by / 4 0 / ) , which permits unrest ##ricted use , distribution , and reproduction in any medium
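
The answers above are printed as raw WordPiece tokens, so words appear split into `##` fragments and special tokens such as `[SEP]` leak into the output. A minimal sketch of a post-processing step is shown below. The helper name `merge_wordpieces` is hypothetical and not part of the notebook, and it does not repair the spacing around punctuation.

```python
import re


def merge_wordpieces(text: str) -> str:
    """Join WordPiece fragments ("shed ##ding" -> "shedding") and drop special tokens."""
    # Remove BERT special tokens that can leak into the extracted answer span.
    text = re.sub(r"\s*\[(?:CLS|SEP|PAD)\]\s*", " ", text)
    # A "##" prefix marks a continuation of the previous token, so glue it back on.
    text = re.sub(r"\s+##", "", text)
    # Collapse any leftover runs of whitespace.
    return " ".join(text.split())


print(merge_wordpieces("anti ##vira ##l immune responses and viral pathogen ##esis"))
# antiviral immune responses and viral pathogenesis
```

An answer that is empty after cleaning, or that only repeats the question, suggests that the model did not find a usable span in that abstract.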


#### Subtask: Physical science of the coronavirus charge distribution, adhesion to hydrophilic phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding
We achieve mixed results for this subtask: some papers, such as the first two, relate to the physical properties of the virus and its interaction with the environment, while others cover seemingly unrelated topics.

In [79]:
question = "What is the physical science of the coronavirus charge distribution, adhesion to hydrophilic phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding?"
question_embedding = get_query_embedding(None, model_sent, question)
indices, _ = find_top_n_similar(embeddings_sent, question_embedding, titles, n=n)
display_closest_papers(indices.tolist()[:n])

------------------------------------------------
Title:  The Infectious Bronchitis Virus Coronavirus Envelope Protein Alters Golgi pH to Protect Spike Protein and Promote Release of Infectious Virus
Abstract:  Coronaviruses (CoVs) are important human pathogens with significant zoonotic potential. Progress has been made toward identifying potential vaccine candidates for highly pathogenic human CoVs, including use of attenuated viruses that lack the CoV envelope (E) protein or express E mutants. However, no approved vaccines or anti-viral therapeutics exist. CoVs assemble by budding into the lumen of the early Golgi prior to exocytosis. The small CoV E protein plays roles in assembly, virion release, and pathogenesis. CoV E has a single hydrophobic domain (HD), is targeted to Golgi membranes, and has cation channel activity in vitro. The E protein from the avian infectious bronchitis virus (IBV) has dramatic effects on the secretory system, which requires residues in the HD. Mutation of

In [80]:
paper_ids = [index_to_paperid_map[indices[i].item()] for i in range(5)]
abstracts = [selected_papers_original[paper_id]['abstract'] for paper_id in paper_ids]
shorter_abstracts = [' '.join(abstract.split(".")[-3:]) for abstract in abstracts]
for ab in shorter_abstracts:
    print(answer(question, ab, tokenizer_qa, model_qa))

alter ##s the secret ##ory pathway through interaction with host cells factors

[SEP]
ad ##hesion to hydro ##phi ##lic ph ##ob ##ic surfaces
ad ##hesion to hydro ##phi ##lic ph ##ob ##ic surfaces


## Comparison with BM25 search

As a baseline for comparison, we use the BM25 search algorithm, a classic information retrieval technique that is similar in spirit to TF-IDF. Because it is much more efficient, we can search over the abstract as well as the title.

This algorithm is much faster and can work with much longer texts, which need not be padded. However, unlike the embedding method, it has little information about the similarity between words. It also disregards the position of words in each sentence.
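
To make the scoring idea concrete, here is a minimal sketch of a textbook Okapi BM25 formula. It is an illustration only: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are assumptions, and a library implementation may use a slightly different IDF variant.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalised, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # only exact token matches contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, made up for illustration.
docs = [
    "animal models of viral infection".split(),
    "school closures and transmission in children".split(),
    "viral shedding in asymptomatic children".split(),
]
scores = bm25_scores("asymptomatic children".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 2
```

The score depends only on term counts and document lengths, which is why no padding is needed and why word order and synonyms play no role.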

Overall, we find almost no overlap between the results of the two search techniques, suggesting that they may be complementary. The BM25 baseline relies much more on finding exact word matches between the query and the title or abstract. The embedding technique, despite working only with titles, finds results that are clearly relevant to the query but that may use synonyms or related phrases.
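
The overlap between two result lists can be quantified rather than judged by eye. The sketch below assumes each search returns a list of paper indices ordered from best to worst. The helper `top_n_overlap` is hypothetical, and the index values are made up for illustration rather than taken from the runs in this notebook.

```python
def top_n_overlap(indices_a, indices_b, n=5):
    """Return the shared indices and the Jaccard similarity of two top-n result lists."""
    top_a, top_b = set(indices_a[:n]), set(indices_b[:n])
    shared = top_a & top_b
    union = top_a | top_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Made-up index lists standing in for the embedding and BM25 results.
embedding_top = [12, 7, 40, 3, 19]
bm25_top = [5, 40, 88, 21, 64]
shared, jaccard = top_n_overlap(embedding_top, bm25_top)
print(shared, round(jaccard, 3))  # {40} 0.111
```

A Jaccard value close to 0 means the two methods return almost entirely different papers.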

The best approach may be a combination of the two techniques: the embedding search for a fuller, context-aware search, and a technique such as BM25 to make use of the fuller information in the paper abstract or body.
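
One common way to combine two rankings is reciprocal rank fusion. This notebook does not implement it; the sketch below only illustrates how such a combination could look. The function name, the example rankings, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document indices into one fused ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_index in enumerate(ranking, start=1):
            # Documents near the top of any list receive the largest contribution.
            fused[doc_index] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by index for determinism.
    return sorted(fused, key=lambda doc_index: (-fused[doc_index], doc_index))


# Made-up rankings standing in for the embedding and BM25 results.
embedding_ranking = [12, 7, 40, 3]
bm25_ranking = [5, 40, 12, 21]
combined = reciprocal_rank_fusion([embedding_ranking, bm25_ranking])
print(combined)  # [12, 40, 5, 7, 3, 21]
```

Because only ranks are used, the cosine similarities and BM25 scores do not need to be put on a common scale before fusing.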

In [72]:
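# Setup: build a BM25 index over title + abstract.
# Assumption: BM25Okapi comes from the rank_bm25 package (it may already be imported earlier).
from rank_bm25 import BM25Okapi

# NOTE: `paper_ids` must hold all selected paper IDs here; the question-answering
# cells above reassign it to the top 5 results only.
corpus = [selected_papers_original[paper_id]['title'] + " " + selected_papers_original[paper_id]['abstract'] for paper_id in paper_ids]
tokenized = [doc.split(" ") for doc in corpus]  # simple whitespace tokenization
bm25 = BM25Okapi(tokenized)
n = 5  # number of results to display


def bm25_search(query):
    """Return corpus indices sorted from highest to lowest BM25 score."""
    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)
    indices = doc_scores.argsort()[::-1]
    return indices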

In [73]:
query = "Prevalence of asymptomatic shedding and transmission particularly children"
indices = bm25_search(query)
display_closest_papers(indices.tolist()[:n])

------------------------------------------------
Title:  Epidemiology and Transmission of COVID-19 in Shenzhen China: Analysis of 391 cases and 1,286 of their close contacts
Abstract:  Rapid spread of SARS-CoV-2 in Wuhan prompted heightened surveillance in Shenzhen and elsewhere in China. The resulting data provide a rare opportunity to measure key metrics of disease course, transmission, and the impact of control. The Shenzhen CDC identified 391 SARS-CoV-2 cases from January 14 to February 12, 2020 and 1286 close contacts. We compare cases identified through symptomatic surveillance and contact tracing, and estimate the time from symptom onset to confirmation, isolation, and hospitalization. We estimate metrics of disease transmission and analyze factors influencing transmission risk. Cases were older than the general population (mean age 45) and balanced between males (187) and females (204). Ninety-one percent had mild or moderate clinical severity at initial assessment. Three have 

In [74]:
query = "Disease models including animal models for infection disease and transmission"
indices = bm25_search(query)
display_closest_papers(indices.tolist()[:n])

------------------------------------------------
Title:  Three asymptomatic animal infection models of hemorrhagic fever with renal syndrome caused by hantaviruses
Abstract:  Hantaan virus (HTNV) and Puumala virus (PUUV) are rodent-borne hantaviruses that are the primary causes of hemorrhagic fever with renal syndrome (HFRS) in Europe and Asia. The development of well characterized animal models of HTNV and PUUV infection is critical for the evaluation and the potential licensure of HFRS vaccines and therapeutics. In this study we present three animal models of HTNV infection (hamster, ferret and marmoset), and two animal models of PUUV infection (hamster, ferret). Infection of hamsters with a~3 times the infectious dose 99% (ID 99 ) of HTNV by the intramuscular and~1 ID 99 of HTNV by the intranasal route leads to a persistent asymptomatic infection, characterized by sporadic viremia and high levels of viral genome in the lung, brain and kidney. In contrast, infection of hamsters with~

In [81]:
query = "physical science of the coronavirus charge distribution, adhesion to hydrophilic phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding"
indices = bm25_search(query)
display_closest_papers(indices.tolist()[:n])

------------------------------------------------
Title:  Viral gain-of-function experiments uncover residues under diversifying selection in nature
Abstract:  Viral gain-of-function mutations are commonly observed in the laboratory; however, it is unknown whether those mutations also evolve in nature. We identify two key residues in the host recognition protein of bacteriophage l that are necessary to exploit a new receptor; both residues repeatedly evolved among homologs from environmental samples. Our results provide evidence for widespread host-shift evolution in nature and a proof of concept for integrating experiments with genomic epidemiology. Many viruses can expand their host range with a few mutations 1-3 that enable the exploitation of new cellular receptors 2,3 . Such mutations may be the first steps toward an epidemic outbreak; this observation has driven an expansion of theoretical 4 , experimental and surveillance studies of host-range shifts in emergent pathogens, inclu