# Building and Evaluating a COVID-19 oriented Information Retrieval Engine


## 1. Processing data
First, we process the data.

In [3]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import json
import time
import xml.etree.ElementTree as ET

dt = pd.read_csv("../data/metadata.csv")
print(dt)

        cord_uid                                       sha  \
0       ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1       02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2       ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3       2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4       9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...          ...                                       ...   
192504  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
192505  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
192506  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
192507  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
192508  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                      source_x  \
0                          PMC   
1                          PMC   
2                          PMC   
3                          PMC   
4                          PMC   
...                        ...   
192504           



#### Preparing the dataset
We decided to work only with the papers in the PDF_JSON corpus. Therefore, the first step is to delete from the dataframe the rows that have no file in that folder, which reduces the number of examples from 192509 to 79755. There are still more documents in pdf_json (over 84000) than rows in the dataframe, because many PDFs in the corpus share the same cord_uid. Technically, the papers mapped to the same cord_uid are the same paper with differences in publication (if an article has been published by both Elsevier and Springer, it is mapped twice to the same cord_uid). Our approach is to consider just one of the documents associated with each cord_uid instead of the full list.
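
The "one document per cord_uid" rule can be sketched in plain Python. This is only an illustration on made-up rows, not the notebook's code: the helper name `first_parse_per_uid` and the sample paths are assumptions, and only the `; ` separator and the column meaning come from the metadata file.

```python
def first_parse_per_uid(rows):
    """Keep one PDF parse per cord_uid.

    `rows` is an iterable of (cord_uid, pdf_json_files) pairs, where
    pdf_json_files may list several paths separated by '; '.
    """
    chosen = {}
    for cord_uid, pdf_json_files in rows:
        if not isinstance(pdf_json_files, str):
            continue  # no PDF parse for this row (NaN in the metadata)
        if cord_uid in chosen:
            continue  # a later row with the same cord_uid is ignored
        chosen[cord_uid] = pdf_json_files.split("; ")[0]
    return chosen


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ("uid_a", "document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json"),
    ("uid_b", float("nan")),
    ("uid_a", "document_parses/pdf_json/ccc.json"),
    ("uid_c", "document_parses/pdf_json/ddd.json"),
]
selected = first_parse_per_uid(sample_rows)
print(selected)
```

Keeping the first listed path is an arbitrary choice; any other deterministic pick would serve the same purpose.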

In [5]:
dt = dt[dt.pdf_json_files.notnull()]
dt = dt.reset_index(drop = True)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                     source_x  \
0                         PMC   
1                         PMC   
2                         PMC   
3                         PMC   
4                         PMC   
...                       ...   
79750            Medline; PMC   
797

Next, we drop the columns that will not add information to our information retrieval system (such as the licenses or the doi) and that do not help to map each example of the dataframe with a document in the pdf_json corpus.

In [6]:
columns_to_delete = ["doi", "source_x", "pmcid", "pubmed_id", "license", "mag_id", "who_covidence_id", "arxiv_id", "pmc_json_files", "url", "s2_id"]
# dt_original = dt
dt = dt.drop(columns_to_delete, axis = 1)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                                                   title  \
0      Clinical features of culture-proven Mycoplasma...   
1      Nitric oxide: a pro-inflammatory mediator in l...   
2        Surfactant protein-D and pulmonary host defense   
3                   Role of

In [7]:
print(dt.shape[0])

79755


In [8]:
# Document 1 
print(dt.iloc[0])

cord_uid                                                   ug7v899j
sha                        d1aafb70c066a2068b02786f8929fd9c900897fb
title             Clinical features of culture-proven Mycoplasma...
abstract          OBJECTIVE: This retrospective chart review des...
publish_time                                             2001-07-04
authors                         Madani, Tariq A; Al-Ghamdi, Aisha A
journal                                              BMC Infect Dis
pdf_json_files    document_parses/pdf_json/d1aafb70c066a2068b027...
Name: 0, dtype: object


In [9]:
# Document 1
print(dt.iloc[0].title)
print(dt.iloc[0].abstract)

Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia
OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35

#### Parse the PDF
The articles mapped to each row of the metadata (cord_uid) are stored separately, parsed into .json files. The following script converts the article mapped to each cord_uid into a Python dictionary and stores the dictionaries in pdf_json_list.

##### This step is probably optional, because the metadata already gives us the title, authors and abstract (the three most important fields). Working with the full body of the articles may improve the results, but the computing time would increase considerably. Moreover, the title, author and abstract information is more reliable in the metadata, since it is provided directly by the journals, whereas the same fields in the pdf_json corpus are obtained automatically by parsing PDF to JSON and may contain errors.
##### No joke, this overflows my memory. My RAM was at 94% usage with the following script (obviously I could not finish running it).

In [10]:
'''
pdf_json_list = []
t0 = time.time()
for row in dt.index :
    json_path = (dt.loc[row]['pdf_json_files'].split('; '))[0]
    json_file = open("../data/"+json_path) 
    full_text_dict = json.load(json_file)
    pdf_json_list.append(full_text_dict)
t1 = time.time()
'''

'\npdf_json_list = []\nt0 = time.time()\nfor row in dt.index :\n    json_path = (dt.loc[row][\'pdf_json_files\'].split(\'; \'))[0]\n    json_file = open("../data/"+json_path) \n    full_text_dict = json.load(json_file)\n    pdf_json_list.append(full_text_dict)\nt1 = time.time()\n'
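
The cell above is disabled because holding every parse in one list exhausted the available memory. One possible workaround, sketched here under assumptions, is to stream the files one at a time with a generator and keep only the text that is needed. The helper `iter_body_texts` is hypothetical, and the sketch assumes each parse stores its paragraphs in a `body_text` list of objects with a `text` field.

```python
import json
import os
import tempfile


def iter_body_texts(json_paths, data_dir):
    """Yield (path, body text) one parse at a time instead of holding all in memory."""
    for rel_path in json_paths:
        with open(os.path.join(data_dir, rel_path), encoding="utf-8") as handle:
            parsed = json.load(handle)
        # Assumed layout: "body_text" is a list of {"text": ...} paragraphs.
        paragraphs = [p.get("text", "") for p in parsed.get("body_text", [])]
        yield rel_path, " ".join(paragraphs)


# Tiny self-contained demo with a made-up parse written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = {"paper_id": "demo", "body_text": [{"text": "First."}, {"text": "Second."}]}
    with open(os.path.join(tmp_dir, "demo.json"), "w", encoding="utf-8") as out:
        json.dump(demo, out)
    streamed = list(iter_body_texts(["demo.json"], tmp_dir))

print(streamed)
```

In the notebook, `json_paths` could be the first `; `-separated entry of each `pdf_json_files` value and `data_dir` could be `"../data/"`; each text would then be preprocessed as soon as it is read, so the full parse never needs to be kept.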

Script to read the test queries (https://towardsdatascience.com/download-and-parse-trec-covid-data-8f9840686c37)

In [11]:
topics = {}
root = ET.parse("../queries/test_queries.xml").getroot()
for topic in root.findall("topic"):
    topic_number = int(topic.attrib["number"])
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text  # We only need the query: 
    #for question in topic.findall("question"):
     #   topics[topic_number]["question"] = question.text        
    #for narrative in topic.findall("narrative"):
     #   topics[topic_number]["narrative"] = narrative.text
print(topics[1].keys())

dict_keys(['query'])
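
For reference, the loop above expects a file in which every `topic` element carries a `number` attribute and `query`, `question` and `narrative` children. The fragment below is a hand-written stand-in with placeholder question and narrative texts, parsed the same way; it is a sketch of the expected structure, not a copy of the real file.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the topics file; the real file holds many more topics.
sample_xml = """
<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>placeholder question text</question>
    <narrative>placeholder narrative text</narrative>
  </topic>
  <topic number="2">
    <query>coronavirus response to weather changes</query>
    <question>placeholder question text</question>
    <narrative>placeholder narrative text</narrative>
  </topic>
</topics>
""".strip()

sample_root = ET.fromstring(sample_xml)
# Keep only the short keyword query of each topic, keyed by topic number.
sample_queries = {
    int(topic.attrib["number"]): topic.findtext("query")
    for topic in sample_root.findall("topic")
}
print(sample_queries)
```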


In [12]:
print(topics)

{1: {'query': 'coronavirus origin'}, 2: {'query': 'coronavirus response to weather changes'}, 3: {'query': 'coronavirus immunity'}, 4: {'query': 'how do people die from the coronavirus'}, 5: {'query': 'animal models of COVID-19'}, 6: {'query': 'coronavirus test rapid testing'}, 7: {'query': 'serological tests for coronavirus'}, 8: {'query': 'coronavirus under reporting'}, 9: {'query': 'coronavirus in Canada'}, 10: {'query': 'coronavirus social distancing impact'}, 11: {'query': 'coronavirus hospital rationing'}, 12: {'query': 'coronavirus quarantine'}, 13: {'query': 'how does coronavirus spread'}, 14: {'query': 'coronavirus super spreaders'}, 15: {'query': 'coronavirus outside body'}, 16: {'query': 'how long does coronavirus survive on surfaces'}, 17: {'query': 'coronavirus clinical trials'}, 18: {'query': 'masks prevent coronavirus'}, 19: {'query': 'what alcohol sanitizer kills coronavirus'}, 20: {'query': 'coronavirus and ACE inhibitors'}, 21: {'query': 'coronavirus mortality'}, 22: 

In [13]:
for key in topics: 
    value = topics[key]
    print(key, value["query"])

1 coronavirus origin
2 coronavirus response to weather changes
3 coronavirus immunity
4 how do people die from the coronavirus
5 animal models of COVID-19
6 coronavirus test rapid testing
7 serological tests for coronavirus
8 coronavirus under reporting
9 coronavirus in Canada
10 coronavirus social distancing impact
11 coronavirus hospital rationing
12 coronavirus quarantine
13 how does coronavirus spread
14 coronavirus super spreaders
15 coronavirus outside body
16 how long does coronavirus survive on surfaces
17 coronavirus clinical trials
18 masks prevent coronavirus
19 what alcohol sanitizer kills coronavirus
20 coronavirus and ACE inhibitors
21 coronavirus mortality
22 coronavirus heart impacts
23 coronavirus hypertension
24 coronavirus diabetes
25 coronavirus biomarkers
26 coronavirus early symptoms
27 coronavirus asymptomatic
28 coronavirus hydroxychloroquine
29 coronavirus drug repurposing
30 coronavirus remdesivir
31 difference between coronavirus and flu
32 coronavirus subtyp

Script to read the relevance judgements (the information needed to evaluate our system). The round id is not needed and is therefore omitted. The relevance is also binarized: grade 2 is mapped to 1.

In [14]:
relevance_data = pd.read_csv("../queries/relevance_judgements.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]
relevance_data = relevance_data.drop("round_id" ,axis = 1)
relevance_data['relevancy'] = relevance_data['relevancy'].replace(2, 1)  # binarize: grade 2 becomes 1 (kept as an integer, not the string '1')
print(relevance_data)

       topic_id  cord_uid relevancy
0             1  005b2j4b         1
1             1  00fmeepz         1
2             1  010vptx3         1
3             1  0194oljo         1
4             1  021q9884         1
...         ...       ...       ...
69313        50  zvop8bxh         1
69314        50  zwf26o63         1
69315        50  zwsvlnwe         0
69316        50  zxr01yln         1
69317        50  zz8wvos9         1

[69318 rows x 3 columns]
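
For the evaluation it is convenient to have, for each topic, the set of documents judged relevant. The following plain-Python sketch builds such a mapping from lines in the same four-column layout (topic, round, cord_uid, grade). The function name `load_binary_qrels` and the sample lines are invented for illustration; grades 1 and 2 are both treated as relevant, mirroring the binarization above.

```python
from collections import defaultdict


def load_binary_qrels(lines):
    """Map each topic id to the set of cord_uids judged relevant (grade 1 or 2)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _round_id, cord_uid, grade = parts
        if int(grade) >= 1:
            relevant[int(topic_id)].add(cord_uid)
    return dict(relevant)


# Invented judgement lines in the same four-column layout.
sample_lines = [
    "1 0.5 doc_a 2",
    "1 0.5 doc_b 0",
    "1 1 doc_c 1",
    "2 1 doc_a 1",
]
qrels = load_binary_qrels(sample_lines)
print(qrels)
```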


With the metadata (and optionally the pdf_json parses), the test topics and the relevance judgements, we are ready to build and validate the system.

## 2. A simple VSM implementation
We have adapted the simple vector space model implementation for our code. 
#### Implementation

First, we install and import libraries 

In [15]:
# We first install the NLTK toolkit

In [16]:
pip install nltk 

Note: you may need to restart the kernel to use updated packages.


In [17]:
# We also need to download the NLTK data bundle

In [18]:
import nltk 

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     

[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package udhr to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package unicode_samples is already up-to-date!
[nltk_data]    | Downloading package universal_treebanks_v20 to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package universal_treebanks_v20 is already up-to-
[nltk_data]    |       date!
[nltk_data]    | Downloading package verbnet to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package verbnet is already up-to-date!
[nltk_d

True

In [19]:
#We now install the gensim package

In [20]:
pip install gensim 

Note: you may need to restart the kernel to use updated packages.


In [21]:
# All the required software is now installed. 

In [22]:
# We now import the functions provided by NLTK to perform tokenizing considering punctuation signs.
from nltk.tokenize import wordpunct_tokenize, regexp_tokenize
# Next, we import required functions to filter-out stopwords for the English language.
from nltk.corpus import stopwords
# Now we import the function that implements the Porter's stemming algorithm.
from nltk.stem import PorterStemmer

The first step is to preprocess each document in the collection. We write a function that receives one row of dt and returns a list with the stems of all tokens in its title and abstract that are longer than 2 characters and are not English stopwords.

In [23]:
def preprocess_document(doc): # Each doc is one df row. We only use title and abstract: dt.iloc[i].title and dt.iloc[i].abstract
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    if not isinstance(doc.title, str) and not isinstance(doc.abstract, str): # Empty documents without title and abstract
        return [""]
    tokens = []  # start empty so that a missing title cannot leave tokens undefined
    if isinstance(doc.title, str):
        tokens.extend(wordpunct_tokenize(doc.title))
    if isinstance(doc.abstract, str):
        tokens.extend(wordpunct_tokenize(doc.abstract))
    # clean keeps the lowercased tokens that are not stopwords, are longer than 2 characters and contain no "%"
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2 and "%" not in token]
    return [stemmer.stem(word) for word in clean]

In [24]:
print(preprocess_document(dt.iloc[0])) # Print STEMS for the document 1
print(type(dt.iloc[0].title))

['clinic', 'featur', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'object', 'retrospect', 'chart', 'review', 'describ', 'epidemiolog', 'clinic', 'featur', 'patient', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'method', 'patient', 'posit', 'pneumonia', 'cultur', 'respiratori', 'specimen', 'januari', '1997', 'decemb', '1998', 'identifi', 'microbiolog', 'record', 'chart', 'patient', 'review', 'result', 'patient', 'identifi', 'requir', 'admiss', 'infect', 'commun', 'acquir', 'infect', 'affect', 'age', 'group', 'common', 'infant', 'pre', 'school', 'children', 'occur', 'year', 'round', 'common', 'fall', 'spring', 'three', 'quarter', 'patient', 'comorbid', 'twenti', 'four', 'isol', 'associ', 'pneumonia', 'upper', 'respiratori', 'tract', 'infect', 'bronchiol', 'cough', 'fever', 'malais', 'common', 'symptom', 'crepit', 'wheez', '

In [25]:
print(dt.iloc[29264])
print(dt.iloc[29264].title)
print(type(dt.iloc[29264].title))

cord_uid                                                   n06og3cw
sha                        8d35867e078939b7f20187322e41011cec8b8cb3
title                                                           NaN
abstract                                                        NaN
publish_time                                             2020-05-13
authors           De Coninck, David; d'Haenens, Leen; Matthijs, ...
journal                                               Public Health
pdf_json_files    document_parses/pdf_json/8d35867e078939b7f2018...
Name: 29264, dtype: object
nan
<class 'float'>


In [26]:
print(preprocess_document(dt.iloc[29264]))

['']


In [27]:
def preprocess_query(q):
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = wordpunct_tokenize(q)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

Once all documents in the collection have been preprocessed, we need to create a dictionary containing the mappings WORD_ID -> WORD. This dictionary is required to create the vector-based word representations.

In [19]:
from gensim import corpora

# Different words in the collection
def create_dictionary(docs):
    #print(preprocess_document(docs.iloc[79754]))
    # List all pre-processing documents
    pdocs = [preprocess_document(docs.iloc[i]) for i in range(docs.shape[0])]
    #print("pdocs: ", pdocs)
    # Build the dictionary
    dictionary = corpora.Dictionary(pdocs)
    # Save in a file
    dictionary.save('simple_vsm.dict')
    return dictionary

Let us call the create_dictionary function feeding it with the complete dt.

In [54]:
dict = create_dictionary(dt)
print(dict)

Dictionary(123368 unique tokens: ['1997', '1998', 'abdulaziz', 'acquir', 'admiss']...)


We have now built the dictionary containing the vocabulary that we will use for indexing. Next, we write a function that creates the bag-of-words representation of each document in the collection.

In [20]:
def docs2bows(allData, dictionary):
    docs = [preprocess_document(allData.iloc[i]) for i in range(allData.shape[0])]
    # We obtain the set of frequencies for each term
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    corpora.MmCorpus.serialize('simple_vsm_docs.mm', vectors)
    return vectors
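
To make the two steps concrete, the plain-Python sketch below shows what a token-to-id dictionary and a bag-of-words vector contain. It is not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, and the ids are assigned in order of first appearance, which need not match the ids gensim assigns.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_bow(tokens, vocabulary):
    """Return sorted (token_id, frequency) pairs, ignoring unknown tokens."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())


# Two toy documents, already tokenized and stemmed.
toy_docs = [["coronaviru", "origin", "bat"], ["bat", "viru", "bat"]]
vocab = build_vocabulary(toy_docs)
toy_bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_bows)
```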

Let us now generate the BOWs for the complete dt.

In [56]:
bows = docs2bows(dt, dict)

In [57]:
print(bows)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



These are pairs (word identifier, frequency). Let us now convert them into something a bit more readable.

In [58]:
for v in bows:
    tvec = [(dict[id], freq) for (id, freq) in v]
    print(tvec)

[('1997', 1), ('1998', 1), ('abdulaziz', 2), ('acquir', 1), ('admiss', 1), ('affect', 1), ('age', 1), ('arabia', 2), ('associ', 1), ('breath', 1), ('bronchial', 1), ('bronchiol', 1), ('chart', 2), ('children', 2), ('clinic', 2), ('common', 5), ('commun', 1), ('comorbid', 4), ('complic', 1), ('conclus', 1), ('cough', 1), ('crepit', 2), ('cultur', 3), ('data', 1), ('decemb', 1), ('describ', 1), ('die', 3), ('due', 1), ('epidemiolog', 1), ('except', 1), ('fall', 1), ('featur', 2), ('fever', 1), ('find', 1), ('follow', 1), ('four', 1), ('group', 1), ('high', 1), ('hospit', 2), ('identifi', 2), ('immunocompromis', 2), ('infant', 2), ('infect', 7), ('isol', 1), ('januari', 1), ('jeddah', 2), ('king', 2), ('like', 1), ('malais', 1), ('method', 1), ('microbiolog', 1), ('mortal', 1), ('mycoplasma', 2), ('non', 1), ('object', 1), ('occur', 1), ('patient', 11), ('pneumonia', 11), ('posit', 1), ('pre', 1), ('preschool', 1), ('present', 1), ('proven', 2), ('publish', 1), ('quarter', 1), ('rate', 1)


[('age', 2), ('due', 1), ('fever', 3), ('infect', 3), ('like', 1), ('patient', 6), ('respiratori', 3), ('result', 1), ('symptom', 1), ('tract', 3), ('upper', 3), ('contribut', 1), ('might', 1), ('product', 2), ('sever', 2), ('variou', 2), ('also', 1), ('express', 1), ('mani', 1), ('respons', 1), ('show', 1), ('acut', 1), ('cytokin', 4), ('viral', 1), ('differ', 1), ('control', 5), ('lower', 2), ('mark', 1), ('syndrom', 2), ('stimuli', 1), ('complaint', 1), ('significantli', 3), ('frequent', 1), ('open', 1), ('promot', 2), ('measur', 2), ('possibl', 1), ('compar', 1), ('produc', 1), ('investig', 1), ('region', 2), ('match', 2), ('less', 1), ('stimul', 2), ('long', 1), ('term', 1), ('healthi', 5), ('whether', 1), ('immunolog', 1), ('008', 1), ('link', 1), ('chemokin', 2), ('explain', 1), ('recurr', 1), ('ligand', 2), ('monocyt', 2), ('tnf', 1), ('001', 6), ('chromatin', 1), ('lp', 1), ('mcp', 2), ('memori', 1), ('004', 2), ('rant', 1), ('fatigu', 3), ('043', 1), ('sex', 2), ('diminish',


[('find', 1), ('like', 1), ('result', 2), ('three', 1), ('deriv', 1), ('involv', 1), ('product', 3), ('protein', 1), ('sever', 1), ('addit', 1), ('bind', 1), ('human', 1), ('within', 2), ('acid', 2), ('amino', 2), ('complet', 2), ('gene', 3), ('genom', 3), ('polymeras', 1), ('relat', 2), ('strategi', 1), ('viru', 4), ('base', 1), ('end', 1), ('function', 1), ('one', 4), ('primari', 1), ('sequenc', 12), ('site', 1), ('suggest', 1), ('use', 2), ('growth', 1), ('analysi', 3), ('design', 1), ('two', 1), ('conserv', 1), ('nucleotid', 2), ('ribosom', 1), ('domain', 2), ('factor', 1), ('hepat', 1), ('mhv', 1), ('clone', 1), ('motif', 3), ('found', 1), ('could', 1), ('frame', 2), ('open', 1), ('read', 1), ('thu', 1), ('unknown', 1), ('infecti', 1), ('small', 1), ('structur', 1), ('transmiss', 1), ('reveal', 2), ('locat', 1), ('compar', 1), ('length', 1), ('put', 1), ('region', 1), ('larg', 1), ('less', 1), ('presenc', 1), ('determin', 2), ('replicas', 2), ('potenti', 1), ('assign', 1), ('prot


These are basically TF-weighted vectors. We now want to convert these vectors into their TF-IDF weighted counterparts. We need, however, to import the models module from Gensim.

In [30]:
from gensim import models

def create_TF_IDF_model(allData):
    dictionary = create_dictionary(allData)
    docs2bows(allData, dictionary)
    loaded_allData = corpora.MmCorpus('simple_vsm_docs.mm')
    tfidf = models.TfidfModel(loaded_allData)
    print("tfidf", tfidf)
    return tfidf, dictionary
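
The weighting itself can be written out by hand. The sketch below multiplies the raw term frequency by log2(N / df) and then normalizes each vector to unit length. As far as we understand, this corresponds to the default settings of gensim's `TfidfModel`, but treat that correspondence as an assumption; `tfidf_vectors` is a hypothetical helper, not part of gensim.

```python
import math


def tfidf_vectors(bows):
    """Turn (term_id, tf) lists into L2-normalised (term_id, weight) lists."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for term_id, _tf in bow:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted = []
    for bow in bows:
        vec = [(t, tf * math.log2(n_docs / doc_freq[t])) for t, tf in bow]
        vec = [(t, w) for t, w in vec if w > 0.0]  # terms present in every document get weight 0
        norm = math.sqrt(sum(w * w for _t, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted


# Toy corpus: term 2 appears in both documents, so its idf is log2(2/2) = 0.
toy_tf = [[(0, 1), (1, 1), (2, 1)], [(2, 2), (3, 1)]]
toy_tfidf = tfidf_vectors(toy_tf)
print(toy_tfidf)
```

In the toy corpus the term shared by both documents drops out, which is the intended effect of the IDF factor: terms that occur everywhere do not help to discriminate between documents.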

Let us now create the TF-IDF model.

In [60]:
tfidfm = create_TF_IDF_model(dt)
print(tfidfm)

(<gensim.models.tfidfmodel.TfidfModel object at 0x000001DA1F051B20>, <gensim.corpora.dictionary.Dictionary object at 0x000001DA3154F5E0>)


As can be seen, the function returns a tuple containing the TF-IDF model and the associated dictionary. Let us now take a closer look at the TF-IDF model.

In [61]:
print(tfidfm[0].__dict__)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Finally, we create a function that, given the dataframe and a query, returns a document ranking sorted in descending order of relevance (according to the cosine measure).

In [22]:
from operator import itemgetter
from gensim import similarities
 
    
def launch_query(allData, q, filename='simple_vsm_docs.mm'):
    tfidf, dictionary = create_TF_IDF_model(allData)
    loaded_allData = corpora.MmCorpus(filename)
    index = similarities.MatrixSimilarity(loaded_allData, num_features=len(dictionary))
    pq = preprocess_query(q)
    vq = dictionary.doc2bow(pq)
    qtfidf = tfidf[vq]
    sim = index[qtfidf]
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    print("##############")
    print("QUERY:", q)
    # Show only the first ten positions of the ranking
    for doc, score in ranking[:10]:
        # str() guards against rows whose title is NaN
        print("[ Score = " + "%.3f" % round(score, 3) + " ] " + str(allData.iloc[doc].title))
    return ranking
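
The cosine ranking performed by the similarity index can be illustrated with a few lines of plain Python over sparse vectors. This is a sketch of the measure only, with hypothetical helpers `cosine` and `rank_documents` and toy weights; it does not reproduce gensim's `MatrixSimilarity`.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (term_id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty query or document matches nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending cosine score."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy weighted vectors: the query shares term 1 with documents 0 and 2.
toy_doc_vecs = [[(0, 1.0), (1, 1.0)], [(3, 1.0)], [(1, 2.0)]]
toy_query = [(1, 1.0)]
toy_ranking = rank_documents(toy_query, toy_doc_vecs)
print(toy_ranking)
```

Note that `launch_query` rebuilds the dictionary and the TF-IDF model on every call; when many queries are run against the same collection, building them once outside the function would avoid repeating that work.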

We can now launch any query we see fit against our newly created Information Retrieval engine.


We choose query 1.

In [137]:
#for key in topics: 
    #value = topics[key]
    #launch_query(dt, value["query"])


In [39]:
# We select all cord_uid in relevance_data for the query 1
cords_relevance_data = set(relevance_data.loc[relevance_data.topic_id == 1, "cord_uid"])
print(cords_relevance_data)

{'p6a3j6mr', '06o2tbon', '1bpc8g6n', 'oxs4o9xe', 'igdkz3ht', 'l9vtsj3e', 'dgbf77hn', 'oewvpv66', 'qele28zk', '6c68pmem', '0qaoam29', '4hvv4sep', 'yy7abob9', 'l48iq9yj', 'bsnibsuf', 'sn7rswab', 'rerkx6js', 'zph6r4il', 'kromgrnu', '9blg1oke', '4b6jtbor', '1mjaycee', 'dle938bt', '5l10cbp2', 'nw0jbs1s', 'mrsya6wz', '3pqtmhob', 'uqahztur', '6ez0u7iq', 'p57tsbyw', '3qzlo90e', 'zy8qjaai', 'kvfau8j0', 'nk9hhco3', 'c3geck81', 'l6xgrxp5', '4j6cbnk2', 'g5fqxdqh', '1pnc889f', 'gq3965se', 'vnafx1ng', 'ns628u21', '2w0zr9c0', 'mvwkflnq', 'jgwvjkbj', '6dq8xx7c', '3okdfxzq', 'qla6edp4', '85acs4lk', 'm4y8tf6u', 'p9tx2oer', 'vydgwa8o', '87d7gzgb', 'tuzmu7p5', 'dnxhtbxn', '8xm0kacj', 'fihedvms', '95fc828i', 'z122v1uz', 'e1urdt9w', 'orqlbw9m', '127c5bve', 'pka6ipav', 'r8el8nqm', 'blqzi69t', '2f8c66zi', 'wco27nop', '2hb28brw', '89sg0cpk', 'ienet82k', 'vmtb5swj', 'rwsfw1ei', 'ki7bn67o', 'lmqm1bio', 'lxakf79k', 'p60xy2ki', '8arwlhf0', 'x7rqfsgs', 'b5329o75', 'sis4fjh5', 'fozglfc8', 'suhqgmlo', 'wzcaugst', '95

In [40]:
# We select all cord_uid that are in both (relevance_data and dt) for the query 1
# Keep only the rows of dt whose cord_uid was judged for query 1
dt_rel_data = dt[dt.cord_uid.isin(cords_relevance_data)].reset_index(drop=True)
print(dt_rel_data)
print(len(cords_relevance_data))
        

      cord_uid                                                sha  \
0     sw4wtxdk           4faf1ac964c605b384dda60bc37df300766401b9   
1     6wu024ng           f2ab1be1bbd80c0f102714fdc90597af2739442c   
2     sbxqwfmy           c9b0389a55de2f9cbfe37049d1072e0984613923   
3     1rhy8td0           8a6809df45d5f80a822d68d3c305f7640e10234a   
4     t7rxmzvi           1a162c4dd45cad2c49168b2d6f2c350e47a3db09   
...        ...                                                ...   
1215  py6qu4tl  dca8ced82157924ed86c698a7dd482be81b4b266; ff9f...   
1216  4qenzjiu           fdf021cfe745daed338cce7eaa5e548581477ff4   
1217  fozglfc8           422bcfff056d118337d3941c0afcb8b888142182   
1218  gy0kfhy6           877f7dfd596cabc12cf7228ffb19cd6b663cea93   
1219  rcwck1y3           7db1d6af433a96af8d1dac0fdd078ea9d1980e9c   

                                                  title  \
0     NSs Encoded by Groundnut Bud Necrosis Virus Is...   
1     Comparative Efficacy of Hemagglutinin, Nucleop.

In [51]:
# Now we rank for the query 1
value = topics[1]
rank= launch_query(dt_rel_data, value["query"])

tfidf TfidfModel(num_docs=1220, num_nnz=96453)
##############
QUERY: coronavirus origin
[ Score = 0.587 ] Origin and Evolution of the 2019 Novel Coronavirus
[ Score = 0.555 ] Characteristics of Metazoan DNA Replication Origins
[ Score = 0.536 ] Bat origin of a new human coronavirus: there and back again
[ Score = 0.411 ] Commentary: Origin and evolution of pathogenic coronaviruses
[ Score = 0.411 ] Strategies to trace back the origin of COVID-19
[ Score = 0.375 ] A glimpse into the origins of genetic diversity in SARS-CoV-2
[ Score = 0.364 ] A phylogenomic data-driven exploration of viral origins and evolution
[ Score = 0.343 ] Experimental infection of a US spike-insertion deletion porcine epidemic diarrhea virus in conventional nursing piglets and cross-protection to the original US PEDV infection
[ Score = 0.335 ] Zoonotic origins of human coronavirus 2019 (HCoV-19 / SARS-CoV-2): why is this work important?
[ Score = 0.334 ] Tracking the origin of early COVID-19 cases in Canada


#### Performance evaluation

In [148]:
pip install trectools

Collecting trectools
  Downloading trectools-0.0.44.tar.gz (26 kB)
Collecting sarge>=0.1.1
  Downloading sarge-0.1.6.tar.gz (26 kB)
Collecting bs4>=0.0.0.1
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: trectools, sarge, bs4
  Building wheel for trectools (setup.py): started
  Building wheel for trectools (setup.py): finished with status 'done'
  Created wheel for trectools: filename=trectools-0.0.44-py3-none-any.whl size=26303 sha256=449bbad25c09cf61f5c075753fa8c9ebd59f556c4fa7404b65859849d78cc200
  Stored in directory: c:\users\sandrus\appdata\local\pip\cache\wheels\3f\53\ca\9708354ec1c7272c3c4f954f8a1f300b676f8c36a33cacce60
  Building wheel for sarge (setup.py): started
  Building wheel for sarge (setup.py): finished with status 'done'
  Created wheel for sarge: filename=sarge-0.1.6-py3-none-any.whl size=19056 sha256=45f914a9d212be44fa1eae39ef5ec77ddee95424325d685cc04c746e0c6af84e
  Stored in directory: c:\users\sandrus\appdata\local\pip\cache\wheels

In [1]:
# TODO
# The result still needs to be binarized (1 = relevant, 0 = not relevant) by applying a threshold.
# Once binarized, compute the accuracy or other metrics.
# Remember that the three previous code cells, which compute the ranking, handle only one query;
# a loop is needed to compute the ranking for all queries :-)
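
The binarization and the per-query loop described in the TODO could be sketched as follows. This is only a minimal sketch, not the notebook's final evaluation: the threshold value, the helper names `binarize` and `precision_recall`, and the toy scores and relevance judgments are all assumptions made for illustration.

```python
def binarize(ranking, threshold):
    """Map (doc_id, score) pairs to {doc_id: 1 or 0} using a score threshold."""
    return {doc_id: int(score >= threshold) for doc_id, score in ranking}


def precision_recall(predicted, relevant_ids):
    """Precision and recall of the documents predicted as relevant."""
    retrieved = {doc_id for doc_id, label in predicted.items() if label == 1}
    hits = len(retrieved & relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical data: one ranking per query, plus the judged-relevant documents.
rankings = {
    "q1": [("d1", 0.59), ("d2", 0.41), ("d3", 0.12)],
    "q2": [("d4", 0.35), ("d5", 0.05)],
}
relevant = {"q1": {"d1", "d3"}, "q2": {"d4"}}

THRESHOLD = 0.3  # assumed value; it would need tuning on real judgments
results = {}
for query_id, ranking in rankings.items():  # loop over all queries
    predicted = binarize(ranking, THRESHOLD)
    results[query_id] = precision_recall(predicted, relevant[query_id])
    print(query_id, results[query_id])
```

Precision and recall are used here instead of plain accuracy because most documents are non-relevant for any given query, so accuracy alone can look high even for a poor ranking.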

## 3. VSM based on word co-occurrence implementation - WCSVSM
We applied the VSM based on word co-occurrence (WCSVSM) to our dataset.

This model was proposed in:

Chen, S., Chen, Y., Yuan, F., & Chang, X. (2020). Establishment of herbal prescription vector space model based on word co-occurrence. Journal of Supercomputing, 76(5), 3590–3601. https://doi.org/10.1007/s11227-018-2559-3

#### Implementation

First, we install and import the required libraries.

In [28]:
pip install nltk 

Note: you may need to restart the kernel to use updated packages.


In [29]:
pip install gensim 

Note: you may need to restart the kernel to use updated packages.


In [30]:
import nltk 

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     

True

In [31]:
# All the required software is now installed. 

In [32]:
# We import the NLTK functions that tokenize text while taking punctuation into account.
from nltk.tokenize import wordpunct_tokenize, regexp_tokenize
# Next, we import the stopword lists, used to filter out English stopwords.
from nltk.corpus import stopwords
# Finally, we import the class that implements Porter's stemming algorithm.
from nltk.stem import PorterStemmer

In [33]:
import numpy as np

We will use the following functions of the Simple VSM implementation.
** Please run the definitions of the first two beforehand:
- preprocess_document(doc) -> returns a list containing all the stems of a document. Input: one document.
- preprocess_query(q) -> returns a list containing all the stems of a query. Input: one query.
- create_dictionary(docs) -> returns a dictionary containing the mapping WORD_ID -> WORD. This dictionary is required to create the vector-based word representations. Input: all documents.
- docs2bows(allData, dictionary) -> returns the bag-of-words vector of each document in the collection. Each bow is a list of (id, freq) pairs, and dictionary[id] returns the corresponding word. These are the TF-weighted vectors.
- create_TF_IDF_model(allData)
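
To make the roles of these helpers concrete, here is a small pure-Python sketch of what a dictionary and a bag-of-words representation contain. It does not use gensim: `build_dictionary` and `to_bow` are hypothetical stand-ins for `create_dictionary` and `docs2bows`, the id assignment order is an arbitrary choice, and the token lists are made-up examples of already preprocessed documents.

```python
from collections import Counter


def build_dictionary(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    word_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in word_to_id:
                word_to_id[token] = len(word_to_id)
    return word_to_id


def to_bow(tokens, word_to_id):
    """Return a sorted list of (word_id, frequency) pairs for one document."""
    counts = Counter(word_to_id[t] for t in tokens if t in word_to_id)
    return sorted(counts.items())


# Made-up documents, assumed to be already tokenized, filtered and stemmed.
docs = [
    ["coronaviru", "origin", "bat", "origin"],
    ["bat", "viru", "evolut"],
]
word_to_id = build_dictionary(docs)
bows = [to_bow(doc, word_to_id) for doc in docs]
print(word_to_id)
print(bows)
```

Each bag of words stores only the terms that occur in the document, which is why the resulting vectors are sparse.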


In [34]:
from gensim import corpora

# Build the dictionary of distinct words in the collection
def create_dictionaryWCS(docs):
    # List of all preprocessed documents
    pdocs = [preprocess_document(docs.iloc[i]) for i in range(docs.shape[0])]
    # Build the dictionary
    dictionary = corpora.Dictionary(pdocs)
    # Save it to a file
    dictionary.save('wcs_vsm.dict')
    return dictionary

In [35]:
def docs2bowsWCS(allData, dictionary):
    docs = [preprocess_document(allData.iloc[i]) for i in range(allData.shape[0])]
    # We obtain the set of frequencies for each term
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    corpora.MmCorpus.serialize('wcs_vsm_docs.mm', vectors)
    return vectors

In [36]:
from gensim import models

def create_TF_IDF_modelWCS(allData):
    dictionary = create_dictionaryWCS(allData)
    docs2bowsWCS(allData, dictionary)
    loaded_allData = corpora.MmCorpus('wcs_vsm_docs.mm')
    tfidf = models.TfidfModel(loaded_allData)
    print("tfidf", tfidf)
    return tfidf, dictionary

We will calculate the Mutual Information (MI) of each word pair:

MI(wordi, wordj) = freq(wordi, wordj)^2 / (freq(wordi) * freq(wordj))

and, from it, the weight of the pair: weight = TF × IDF × MI(wordi, wordj).

First, we use the extraction algorithm of the paper to extract the word pairs.
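
As a worked example of the Mutual Information formula, the following pure-Python sketch counts frequencies at document level: freq(word) is taken to be the number of documents containing the word, and freq(wordi, wordj) the number of documents containing both. That counting convention, the function name `mutual_information` and the toy documents are assumptions; the extraction algorithm of the paper may count differently.

```python
from collections import Counter
from itertools import combinations


def mutual_information(token_lists):
    """Return {(word_i, word_j): MI} with MI = freq(i, j)^2 / (freq(i) * freq(j))."""
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in token_lists:
        vocab = sorted(set(tokens))  # each word counts once per document
        word_freq.update(vocab)
        pair_freq.update(combinations(vocab, 2))
    return {
        (wi, wj): count ** 2 / (word_freq[wi] * word_freq[wj])
        for (wi, wj), count in pair_freq.items()
    }


# Made-up, already preprocessed documents.
docs = [
    ["bat", "origin", "viru"],
    ["bat", "origin"],
    ["bat", "viru"],
    ["origin"],
]
mi = mutual_information(docs)
for pair, value in sorted(mi.items()):
    print(pair, round(value, 3))
```

For instance, "bat" and "origin" each occur in 3 documents and co-occur in 2, so MI = 2^2 / (3 * 3) ≈ 0.444.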



In [37]:
from gensim.matutils import corpus2csc
# We build the sparse term-document matrix from which the co-occurrences are derived
def coOccurrence_matrix(allData):
    dictionary = create_dictionaryWCS(allData)
    vectors = docs2bowsWCS(allData, dictionary)
    print(len(vectors))  # number of documents
    term_doc_mat = corpus2csc(vectors)  # sparse matrix of shape (terms, documents)
    return term_doc_mat
    #term_term_mat = np.dot(term_doc_mat, term_doc_mat.T)
    #return term_term_mat

In [41]:
bowsMatrix = coOccurrence_matrix(dt_rel_data) 
print(bowsMatrix) 
print(type(bowsMatrix))


1220
  (0, 0)	8.0
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 0)	1.0
  (5, 0)	1.0
  (6, 0)	2.0
  (7, 0)	1.0
  (8, 0)	2.0
  (9, 0)	2.0
  (10, 0)	1.0
  (11, 0)	1.0
  (12, 0)	1.0
  (13, 0)	2.0
  (14, 0)	1.0
  (15, 0)	3.0
  (16, 0)	1.0
  (17, 0)	1.0
  (18, 0)	1.0
  (19, 0)	1.0
  (20, 0)	1.0
  (21, 0)	1.0
  (22, 0)	1.0
  (23, 0)	1.0
  (24, 0)	1.0
  :	:
  (2018, 1219)	1.0
  (2076, 1219)	1.0
  (2082, 1219)	1.0
  (2094, 1219)	1.0
  (2104, 1219)	1.0
  (2225, 1219)	3.0
  (2228, 1219)	3.0
  (2233, 1219)	2.0
  (2242, 1219)	1.0
  (2413, 1219)	1.0
  (2416, 1219)	1.0
  (2566, 1219)	2.0
  (2892, 1219)	1.0
  (3588, 1219)	1.0
  (3590, 1219)	1.0
  (5563, 1219)	1.0
  (6015, 1219)	1.0
  (8156, 1219)	1.0
  (8157, 1219)	1.0
  (8215, 1219)	1.0
  (8411, 1219)	1.0
  (8412, 1219)	1.0
  (8413, 1219)	1.0
  (8414, 1219)	1.0
  (8415, 1219)	1.0
<class 'scipy.sparse.csc.csc_matrix'>


In [42]:
print(bowsMatrix.shape) # (terms, documents): 8416 terms x 1220 documents

(8416, 1220)


In [43]:
# Row 0 of bowsMatrix: frequency of term 0 in each document
print(bowsMatrix[0])

  (0, 0)	8.0
  (0, 2)	4.0
  (0, 4)	1.0
  (0, 7)	1.0
  (0, 8)	3.0
  (0, 10)	3.0
  (0, 17)	1.0
  (0, 27)	1.0
  (0, 91)	1.0
  (0, 101)	1.0
  (0, 104)	1.0
  (0, 120)	1.0
  (0, 125)	1.0
  (0, 136)	4.0
  (0, 139)	1.0
  (0, 185)	1.0
  (0, 192)	3.0
  (0, 196)	1.0
  (0, 204)	1.0
  (0, 216)	1.0
  (0, 234)	1.0
  (0, 238)	1.0
  (0, 244)	1.0
  (0, 248)	1.0
  (0, 261)	2.0
  :	:
  (0, 960)	1.0
  (0, 975)	2.0
  (0, 976)	1.0
  (0, 981)	3.0
  (0, 993)	1.0
  (0, 1024)	1.0
  (0, 1030)	1.0
  (0, 1039)	1.0
  (0, 1045)	2.0
  (0, 1053)	1.0
  (0, 1055)	1.0
  (0, 1063)	1.0
  (0, 1086)	1.0
  (0, 1100)	2.0
  (0, 1106)	3.0
  (0, 1111)	1.0
  (0, 1124)	1.0
  (0, 1134)	1.0
  (0, 1136)	1.0
  (0, 1180)	1.0
  (0, 1211)	1.0
  (0, 1214)	1.0
  (0, 1215)	2.0
  (0, 1217)	1.0
  (0, 1219)	1.0


In [44]:
def coOccurrence_matrixAr(bowsMatrix):
    bowsMatrixAr = bowsMatrix.toarray() 
    return bowsMatrixAr

In [45]:
bowsMatrixAr = coOccurrence_matrixAr(coOccurrence_matrix(dt_rel_data))
print(bowsMatrixAr)
print(len(bowsMatrixAr))
print(bowsMatrixAr[0,1219])

1220
[[8. 0. 4. ... 1. 0. 1.]
 [1. 0. 0. ... 0. 1. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]


In [None]:
# Code only for visualization: list every non-zero (term, document index, frequency) entry.
# Rows of bowsMatrixAr are terms and columns are documents.
dictWCS = create_dictionaryWCS(dt_rel_data)
coOccVec = []
for i in range(bowsMatrixAr.shape[0]):
    for j in range(bowsMatrixAr.shape[1]):
        if bowsMatrixAr[i,j] > 0:
            coOccVec.append([dictWCS[i], j, bowsMatrixAr[i,j]])
print(coOccVec)

Now we have the frequency of each word in each document, which is the basis for the co-occurrence frequency of each pair (wordi, wordj).

Support(wordi, wordj) = freq(wordi,wordj)

Confidence (wordi, wordj) = MI(wordi,wordj)

We are going to calculate the MI:

In [None]:
def MI(allData):
    # Term-document count matrix (terms x documents), as a SciPy sparse matrix
    term_doc = coOccurrence_matrix(allData)
    # Binary presence: 1 if the term occurs in the document, 0 otherwise
    presence = (term_doc > 0).astype(np.float64)
    # Assumption: co-occurrence is counted at document level, so
    # freq(wordi, wordj) = number of documents in which both words occur.
    # Note that this builds a dense (terms x terms) matrix.
    cooc = (presence @ presence.T).toarray()
    # freq(wordi) = number of documents containing word i (the diagonal of cooc)
    freq = cooc.diagonal()
    # MI(i, j) = freq(i, j)^2 / (freq(i) * freq(j)); entries with a zero denominator are set to 0
    denom = np.outer(freq, freq)
    MI_Matrix = np.divide(cooc ** 2, denom, out=np.zeros_like(cooc), where=denom > 0)
    return MI_Matrix

In [None]:
print(MI(dt_rel_data))

In [None]:
# We only choose pairs with support > 3 and confidence > 3 (for example: set a threshold for confidence once
# the cell above runs correctly; I think the support threshold would be fine as it is)
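
The selection of word pairs by support and confidence could look like the sketch below. The threshold values, the function name `select_pairs` and the example numbers are hypothetical. Note that if frequencies are counted as numbers of documents, MI cannot exceed 1, so in that setting the confidence threshold would have to be chosen below 1.

```python
def select_pairs(support, confidence, min_support, min_confidence):
    """Keep word pairs whose support and confidence both exceed the thresholds."""
    return sorted(
        pair
        for pair, sup in support.items()
        if sup > min_support and confidence.get(pair, 0.0) > min_confidence
    )


# Hypothetical co-occurrence counts (support) and MI values (confidence).
support = {("bat", "origin"): 5, ("bat", "viru"): 2, ("origin", "viru"): 7}
confidence = {("bat", "origin"): 0.60, ("bat", "viru"): 0.90, ("origin", "viru"): 0.10}

selected = select_pairs(support, confidence, min_support=3, min_confidence=0.5)
print(selected)
```

Only the pair that passes both filters survives: the second pair fails on support and the third on confidence.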


In [124]:
# TODO: adapt the following code to the new approach

Finally, we create a function that, given the dataframe and a query (topic), returns a document ranking sorted in descending order of relevance (according to the cosine measure).
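
For reference, ranking by the cosine measure amounts to the following pure-Python sketch over sparse vectors stored as `{term_id: weight}` dictionaries. The names `cosine` and `rank` and the toy vectors are illustrative assumptions, not the gensim code used in this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term_id: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy weighted vectors: three documents and one query.
doc_vecs = [{0: 1.0, 1: 2.0}, {1: 1.0}, {2: 3.0}]
query_vec = {1: 1.0}
ranking = rank(query_vec, doc_vecs)
print(ranking)
```

Because the cosine normalizes by vector length, the short document that contains only the query term ranks above the longer one that also contains other terms.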

In [9]:
from operator import itemgetter
from gensim import similarities


def launch_queryWCS(allData, q, filename='wcs_vsm_docs.mm'):
    # Build the TF-IDF model; this also writes the bag-of-words corpus to disk
    tfidf, dictionary = create_TF_IDF_modelWCS(allData)
    loaded_allData = corpora.MmCorpus(filename)
    index = similarities.MatrixSimilarity(loaded_allData, num_features=len(dictionary))
    # Convert the query to a TF-IDF vector and compare it against every document
    pq = preprocess_query(q)
    vq = dictionary.doc2bow(pq)
    qtfidf = tfidf[vq]
    sim = index[qtfidf]
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    print("##############")
    print("QUERY:",q)
    # Print only the first ten positions of the ranking
    for doc, score in ranking[:10]:
        print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title)
    return ranking

Now we can launch any query against our newly created Information Retrieval engine.

We choose query 1.

In [23]:
# We select all cord_uid values in relevance_data for query 1
cords_relevance_data = set(relevance_data.loc[relevance_data.topic_id == 1, "cord_uid"])

In [24]:
# We keep the rows of dt whose cord_uid also appears in relevance_data for query 1
dt_rel_data = dt[dt.cord_uid.isin(cords_relevance_data)].reset_index(drop=True)

In [28]:
# Now we rank the documents for query 1
value = topics[1]
rankWCS = launch_queryWCS(dt_rel_data, value["query"])

tfidf TfidfModel(num_docs=1220, num_nnz=96453)
##############
QUERY: coronavirus origin
[ Score = 0.587 ] Origin and Evolution of the 2019 Novel Coronavirus
[ Score = 0.555 ] Characteristics of Metazoan DNA Replication Origins
[ Score = 0.536 ] Bat origin of a new human coronavirus: there and back again
[ Score = 0.411 ] Commentary: Origin and evolution of pathogenic coronaviruses
[ Score = 0.411 ] Strategies to trace back the origin of COVID-19
[ Score = 0.375 ] A glimpse into the origins of genetic diversity in SARS-CoV-2
[ Score = 0.364 ] A phylogenomic data-driven exploration of viral origins and evolution
[ Score = 0.343 ] Experimental infection of a US spike-insertion deletion porcine epidemic diarrhea virus in conventional nursing piglets and cross-protection to the original US PEDV infection
[ Score = 0.335 ] Zoonotic origins of human coronavirus 2019 (HCoV-19 / SARS-CoV-2): why is this work important?
[ Score = 0.334 ] Tracking the origin of early COVID-19 cases in Canada
