# Building and Evaluating a COVID-19 oriented Information Retrieval Engine


## 1. Processing data
First, we load and process the metadata.

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.




In [2]:
import pandas as pd
import json
import time
import xml.etree.ElementTree as ET

dt = pd.read_csv("../data/metadata.csv")
print(dt)

        cord_uid                                       sha  \
0       ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1       02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2       ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3       2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4       9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...          ...                                       ...   
192504  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
192505  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
192506  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
192507  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
192508  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                      source_x  \
0                          PMC   
1                          PMC   
2                          PMC   
3                          PMC   
4                          PMC   
...                        ...   
192504           



#### Preparing the dataset
We decided to work only with the papers in the pdf_json corpus, so the first step is to drop from the dataframe the rows that have no file in that folder. This reduces the number of rows from 192509 to 79755. There are still more documents in pdf_json than rows in the dataframe (over 84000), because many PDFs in the corpus share the same cord_uid. Technically, papers mapped to the same cord_uid are the same paper with differences in publication (an article published by both Elsevier and Springer is mapped twice under the same cord_uid). Our approach is to keep just one of the documents associated with each cord_uid instead of the full list.

In [3]:
dt = dt[dt.pdf_json_files.notnull()]
dt = dt.reset_index(drop = True)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                     source_x  \
0                         PMC   
1                         PMC   
2                         PMC   
3                         PMC   
4                         PMC   
...                       ...   
79750            Medline; PMC   
797

Next, we drop the columns that add no information to our information retrieval system (such as the license or the DOI) and that do not help to map each row of the dataframe to a document in the pdf_json corpus.

In [4]:
columns_to_delete = ["doi", "source_x", "pmcid", "pubmed_id", "license", "mag_id", "who_covidence_id", "arxiv_id", "pmc_json_files", "url", "s2_id"]
# dt_original = dt
dt = dt.drop(columns_to_delete, axis = 1)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                                                   title  \
0      Clinical features of culture-proven Mycoplasma...   
1      Nitric oxide: a pro-inflammatory mediator in l...   
2        Surfactant protein-D and pulmonary host defense   
3                   Role of

In [5]:
print(dt.shape[0])

79755


In [6]:
# Document 1 
print(dt.iloc[0])

cord_uid                                                   ug7v899j
sha                        d1aafb70c066a2068b02786f8929fd9c900897fb
title             Clinical features of culture-proven Mycoplasma...
abstract          OBJECTIVE: This retrospective chart review des...
publish_time                                             2001-07-04
authors                         Madani, Tariq A; Al-Ghamdi, Aisha A
journal                                              BMC Infect Dis
pdf_json_files    document_parses/pdf_json/d1aafb70c066a2068b027...
Name: 0, dtype: object


In [7]:
# Document 1
print(dt.iloc[0].title)
print(dt.iloc[0].abstract)

Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia
OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35

#### Parse the pdf
The articles mapped to each row of the metadata (cord_uid) are stored separately, already parsed into .json files. The following script loads one article per cord_uid into a Python dictionary and stores the dictionaries in pdf_json_list.

##### This step is probably optional, because the metadata already gives us the title, authors and abstract (the three most important fields). Working with the full body of the articles may improve the results, but the computation time would increase considerably. Moreover, the title, authors and abstract are more reliable in the metadata, since the metadata is provided directly by the journals, whereas in the pdf_json corpus those fields are obtained automatically by parsing PDF to JSON, which can introduce errors.
##### No joke: this overflows my memory. My RAM was at 94% usage with the following script (obviously I could not let it finish).

In [8]:
# Disabled: loading every parsed article at once exhausts the available RAM (see the note above).
# pdf_json_list = []
# t0 = time.time()
# for row in dt.index:
#     # A row may list several files separated by '; '; keep only the first one
#     json_path = dt.loc[row, 'pdf_json_files'].split('; ')[0]
#     with open("../data/" + json_path) as json_file:  # close each file after reading it
#         pdf_json_list.append(json.load(json_file))
# t1 = time.time()
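One way around the memory problem described above is to read the articles lazily, one at a time, instead of accumulating them all in a list. The following is only a minimal sketch under assumptions: the function name `iter_parsed_articles`, the temporary directory and the two tiny JSON files are hypothetical stand-ins for the real corpus.

```python
import json
import tempfile
from pathlib import Path


def iter_parsed_articles(pdf_json_entries, data_dir):
    """Yield one parsed article (a dict) at a time instead of keeping them all in memory."""
    for entry in pdf_json_entries:
        # A metadata row may list several files separated by "; "; keep only the first one.
        first_path = entry.split("; ")[0]
        with open(Path(data_dir) / first_path, encoding="utf-8") as json_file:
            yield json.load(json_file)


# Tiny stand-in corpus (hypothetical file names and contents) to show the behaviour.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("a.json", "b.json"):
        Path(data_dir, name).write_text(json.dumps({"paper_id": name[0]}), encoding="utf-8")
    entries = ["a.json; b.json", "b.json"]
    paper_ids = [article["paper_id"] for article in iter_parsed_articles(entries, data_dir)]

print(paper_ids)
```

With the real data, the entries would come from `dt.pdf_json_files` and `data_dir` would be `"../data/"`; each article can then be preprocessed and discarded before the next one is read.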

Script to read the test queries (https://towardsdatascience.com/download-and-parse-trec-covid-data-8f9840686c37)

In [9]:
topics = {}
root = ET.parse("../queries/test_queries.xml").getroot()
for topic in root.findall("topic"):
    topic_number = int(topic.attrib["number"])
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text  # We only need the query: 
    #for question in topic.findall("question"):
     #   topics[topic_number]["question"] = question.text        
    #for narrative in topic.findall("narrative"):
     #   topics[topic_number]["narrative"] = narrative.text
print(topics[1].keys())

dict_keys(['query'])


In [10]:
print(topics)

{1: {'query': 'coronavirus origin'}, 2: {'query': 'coronavirus response to weather changes'}, 3: {'query': 'coronavirus immunity'}, 4: {'query': 'how do people die from the coronavirus'}, 5: {'query': 'animal models of COVID-19'}, 6: {'query': 'coronavirus test rapid testing'}, 7: {'query': 'serological tests for coronavirus'}, 8: {'query': 'coronavirus under reporting'}, 9: {'query': 'coronavirus in Canada'}, 10: {'query': 'coronavirus social distancing impact'}, 11: {'query': 'coronavirus hospital rationing'}, 12: {'query': 'coronavirus quarantine'}, 13: {'query': 'how does coronavirus spread'}, 14: {'query': 'coronavirus super spreaders'}, 15: {'query': 'coronavirus outside body'}, 16: {'query': 'how long does coronavirus survive on surfaces'}, 17: {'query': 'coronavirus clinical trials'}, 18: {'query': 'masks prevent coronavirus'}, 19: {'query': 'what alcohol sanitizer kills coronavirus'}, 20: {'query': 'coronavirus and ACE inhibitors'}, 21: {'query': 'coronavirus mortality'}, 22: 

In [144]:
for key in topics: 
    value = topics[key]
    print(key, value["query"])

1 coronavirus origin
2 coronavirus response to weather changes
3 coronavirus immunity
4 how do people die from the coronavirus
5 animal models of COVID-19
6 coronavirus test rapid testing
7 serological tests for coronavirus
8 coronavirus under reporting
9 coronavirus in Canada
10 coronavirus social distancing impact
11 coronavirus hospital rationing
12 coronavirus quarantine
13 how does coronavirus spread
14 coronavirus super spreaders
15 coronavirus outside body
16 how long does coronavirus survive on surfaces
17 coronavirus clinical trials
18 masks prevent coronavirus
19 what alcohol sanitizer kills coronavirus
20 coronavirus and ACE inhibitors
21 coronavirus mortality
22 coronavirus heart impacts
23 coronavirus hypertension
24 coronavirus diabetes
25 coronavirus biomarkers
26 coronavirus early symptoms
27 coronavirus asymptomatic
28 coronavirus hydroxychloroquine
29 coronavirus drug repurposing
30 coronavirus remdesivir
31 difference between coronavirus and flu
32 coronavirus subtyp

Script to read the relevance judgements (the information needed to evaluate our system). The round id is not needed and is therefore dropped. The relevance is also binarized: a grade of 2 is mapped to 1.

In [11]:
relevance_data = pd.read_csv("../queries/relevance_judgements.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]
relevance_data = relevance_data.drop("round_id" ,axis = 1)
relevance_data['relevancy'] = relevance_data['relevancy'].replace(2, 1)  # binarize with the integer 1, not the string '1', so the column keeps a single dtype
print(relevance_data)

       topic_id  cord_uid relevancy
0             1  005b2j4b         1
1             1  00fmeepz         1
2             1  010vptx3         1
3             1  0194oljo         1
4             1  021q9884         1
...         ...       ...       ...
69313        50  zvop8bxh         1
69314        50  zwf26o63         1
69315        50  zwsvlnwe         0
69316        50  zxr01yln         1
69317        50  zz8wvos9         1

[69318 rows x 3 columns]
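To illustrate how the binarized judgements can later be used for evaluation, here is a small sketch that groups the relevant cord_uids by topic and computes precision at k for a ranked list. It is an assumption about how the evaluation could be done, not the evaluation used in this notebook; the helper names, the round ids, the grades and the ranked list are made up for the example.

```python
from collections import defaultdict


def relevant_by_topic(qrels_lines):
    """Map each topic id to the set of cord_uids with a relevance grade of 1 or 2."""
    relevant = defaultdict(set)
    for line in qrels_lines:
        topic_id, _round_id, cord_uid, grade = line.split()
        if int(grade) >= 1:  # binarize: grades 1 and 2 both count as relevant
            relevant[int(topic_id)].add(cord_uid)
    return relevant


def precision_at_k(ranked_uids, relevant_uids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_uids[:k]
    return sum(uid in relevant_uids for uid in top_k) / k


# Hypothetical judgement lines in the "topic round cord_uid grade" layout read above.
sample_qrels = ["1 1 005b2j4b 2", "1 1 00fmeepz 1", "50 1 zwsvlnwe 0"]
relevant = relevant_by_topic(sample_qrels)

# Hypothetical ranking returned by a retrieval system for topic 1.
ranked = ["005b2j4b", "xxxxxxxx", "00fmeepz", "yyyyyyyy"]
p_at_4 = precision_at_k(ranked, relevant[1], 4)
print(p_at_4)
```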


With the metadata (and optionally the pdf_json articles), the test topics and the relevance judgements, we are ready to build and validate the system.

## 2. A simple VSM implementation
We have adapted the simple vector space model (VSM) implementation to our code.

First, we install and import the libraries.

In [17]:
# We first install the NLTK toolkit

In [18]:
pip install nltk 

Note: you may need to restart the kernel to use updated packages.




In [19]:
# We also need to download the NLTK data bundle

In [20]:
import nltk 

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     

True

In [21]:
#We now install the gensim package

In [22]:
pip install gensim 

Note: you may need to restart the kernel to use updated packages.




In [23]:
# All the required software is now installed.

In [109]:
# We now import the functions provided by NLTK to perform tokenizing considering punctuation signs.
from nltk.tokenize import wordpunct_tokenize, regexp_tokenize
# Next, we import required functions to filter-out stopwords for the English language.
from nltk.corpus import stopwords
# Now we import the function that implements the Porter's stemming algorithm.
from nltk.stem import PorterStemmer

The first step is to preprocess each document in the collection. We write a function that receives one row of dt and returns a list with the STEMS of every token in its title and abstract that is longer than 2 characters and is NOT an (English) stopword.

In [123]:
# Built once: recreating them for each of the ~80,000 rows is needlessly slow.
STOPSET = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_document(doc):
    """Return the stems of one dataframe row, using only its title and abstract."""
    tokens = []
    # A missing title or abstract is NaN (a float), so only real strings are tokenized.
    # Starting from an empty list also covers rows that have an abstract but no title.
    if isinstance(doc.title, str):
        tokens.extend(wordpunct_tokenize(doc.title))
    if isinstance(doc.abstract, str):
        tokens.extend(wordpunct_tokenize(doc.abstract))
    if not tokens:  # Empty document: neither title nor abstract
        return [""]
    # clean keeps the lowercased tokens that are not stopwords, are longer than 2 characters and contain no "%"
    clean = [token.lower() for token in tokens if token.lower() not in STOPSET and len(token) > 2 and "%" not in token]
    return [STEMMER.stem(word) for word in clean]

In [124]:
print(preprocess_document(dt.iloc[0])) # Print STEMS for the document 1
print(type(dt.iloc[0].title))

['clinic', 'featur', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'object', 'retrospect', 'chart', 'review', 'describ', 'epidemiolog', 'clinic', 'featur', 'patient', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'method', 'patient', 'posit', 'pneumonia', 'cultur', 'respiratori', 'specimen', 'januari', '1997', 'decemb', '1998', 'identifi', 'microbiolog', 'record', 'chart', 'patient', 'review', 'result', 'patient', 'identifi', 'requir', 'admiss', 'infect', 'commun', 'acquir', 'infect', 'affect', 'age', 'group', 'common', 'infant', 'pre', 'school', 'children', 'occur', 'year', 'round', 'common', 'fall', 'spring', 'three', 'quarter', 'patient', 'comorbid', 'twenti', 'four', 'isol', 'associ', 'pneumonia', 'upper', 'respiratori', 'tract', 'infect', 'bronchiol', 'cough', 'fever', 'malais', 'common', 'symptom', 'crepit', 'wheez', '

In [125]:
print(dt.iloc[29264])
print(dt.iloc[29264].title)
print(type(dt.iloc[29264].title))

cord_uid                                                   n06og3cw
sha                        8d35867e078939b7f20187322e41011cec8b8cb3
title                                                           NaN
abstract                                                        NaN
publish_time                                             2020-05-13
authors           De Coninck, David; d'Haenens, Leen; Matthijs, ...
journal                                               Public Health
pdf_json_files    document_parses/pdf_json/8d35867e078939b7f2018...
Name: 29264, dtype: object
nan
<class 'float'>


In [126]:
print(preprocess_document(dt.iloc[29264]))

['']


In [149]:
def preprocess_query(q):
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = wordpunct_tokenize(q)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

Once all documents in the collection have been preprocessed, we need to create a dictionary containing the mappings WORD_ID -> WORD. This dictionary is required to create the vector-based word representations.

In [127]:
from gensim import corpora

# Build the dictionary of distinct stems in the collection
def create_dictionary(docs):
    # Preprocess every document (row) of the dataframe
    pdocs = [preprocess_document(docs.iloc[i]) for i in range(docs.shape[0])]
    # Build the dictionary
    dictionary = corpora.Dictionary(pdocs)
    # Save it to a file
    dictionary.save('simple_vsm.dict')
    return dictionary

Let us call the create_dictionary function feeding it with the complete dt.

In [128]:
dict = create_dictionary(dt)
print(dict)

Dictionary(123368 unique tokens: ['1997', '1998', 'abdulaziz', 'acquir', 'admiss']...)


We have now built the dictionary containing the vocabulary that we will use for indexing. Next, we write a function that creates the bag-of-words representation of each document in the collection.

In [131]:
def docs2bows(allData, dictionary):
    docs = [preprocess_document(allData.iloc[i]) for i in range(allData.shape[0])]
    # We obtain the set of frequencies for each term
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    corpora.MmCorpus.serialize('simple_vsm_docs.mm', vectors)
    return vectors
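To make the bag-of-words representation concrete, the sketch below reproduces the idea in plain Python on two toy documents. The helper names are hypothetical, and the way ids are assigned here (order of first appearance) is an assumption that need not match the ids gensim assigns.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, frequency) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Two toy documents, already preprocessed into stems.
toy_docs = [["infect", "patient", "infect"], ["patient", "vaccin"]]
toy_token2id = build_vocabulary(toy_docs)
toy_bows = [doc_to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)
print(toy_bows)
```

Each document becomes a sparse list of (id, count) pairs, so a stem that does not occur in a document simply has no entry in its vector.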

Let us now generate the BOWs for the complete dt.

In [132]:
bows = docs2bows(dt, dict)

In [136]:
print(bows)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



These are pairs (word identifier, frequency). Let us now convert them into something a bit more readable.

In [133]:
for v in bows:
    tvec = [(dict[id], freq) for (id, freq) in v]
    print(tvec)

[('1997', 1), ('1998', 1), ('abdulaziz', 2), ('acquir', 1), ('admiss', 1), ('affect', 1), ('age', 1), ('arabia', 2), ('associ', 1), ('breath', 1), ('bronchial', 1), ('bronchiol', 1), ('chart', 2), ('children', 2), ('clinic', 2), ('common', 5), ('commun', 1), ('comorbid', 4), ('complic', 1), ('conclus', 1), ('cough', 1), ('crepit', 2), ('cultur', 3), ('data', 1), ('decemb', 1), ('describ', 1), ('die', 3), ('due', 1), ('epidemiolog', 1), ('except', 1), ('fall', 1), ('featur', 2), ('fever', 1), ('find', 1), ('follow', 1), ('four', 1), ('group', 1), ('high', 1), ('hospit', 2), ('identifi', 2), ('immunocompromis', 2), ('infant', 2), ('infect', 7), ('isol', 1), ('januari', 1), ('jeddah', 2), ('king', 2), ('like', 1), ('malais', 1), ('method', 1), ('microbiolog', 1), ('mortal', 1), ('mycoplasma', 2), ('non', 1), ('object', 1), ('occur', 1), ('patient', 11), ('pneumonia', 11), ('posit', 1), ('pre', 1), ('preschool', 1), ('present', 1), ('proven', 2), ('publish', 1), ('quarter', 1), ('rate', 1)

[('age', 1), ('clinic', 2), ('conclus', 1), ('group', 1), ('high', 1), ('identifi', 1), ('method', 1), ('object', 2), ('publish', 1), ('result', 1), ('year', 2), ('indic', 5), ('consist', 1), ('defici', 1), ('develop', 4), ('respons', 1), ('specif', 3), ('current', 1), ('display', 1), ('includ', 1), ('natur', 1), ('studi', 1), ('uniqu', 1), ('health', 1), ('level', 1), ('outcom', 1), ('question', 1), ('red', 6), ('use', 1), ('intern', 1), ('low', 3), ('physiolog', 2), ('system', 1), ('analys', 1), ('report', 1), ('time', 2), ('factor', 1), ('found', 2), ('tool', 1), ('awar', 5), ('environ', 1), ('perform', 1), ('rel', 1), ('address', 1), ('investig', 1), ('context', 1), ('screen', 1), ('characterist', 2), ('less', 1), ('full', 2), ('term', 1), ('avail', 3), ('consequ', 3), ('potenti', 2), ('risk', 3), ('train', 4), ('third', 1), ('rais', 1), ('energi', 4), ('demand', 1), ('quantifi', 1), ('mean', 1), ('practic', 1), ('situat', 1), ('administ', 1), ('onlin', 1), ('influenti', 1), ('val

[('medicin', 1), ('impact', 1), ('worker', 1), ('skill', 1), ('europ', 1), ('shortag', 1), ('geriatr', 1)]
[('affect', 1), ('common', 1), ('except', 1), ('infect', 2), ('like', 1), ('patient', 2), ('respiratori', 1), ('contribut', 1), ('diseas', 1), ('gener', 2), ('involv', 1), ('mechan', 1), ('sever', 2), ('stress', 2), ('also', 2), ('develop', 1), ('divers', 1), ('human', 2), ('virus', 1), ('acut', 2), ('chronic', 1), ('famili', 2), ('relat', 1), ('strategi', 1), ('understand', 1), ('attent', 1), ('distress', 1), ('medicin', 1), ('syndrom', 1), ('effect', 1), ('prevent', 1), ('resourc', 1), ('technic', 1), ('around', 1), ('global', 1), ('lack', 1), ('disord', 2), ('first', 1), ('major', 1), ('loss', 1), ('influenza', 1), ('pandem', 4), ('could', 1), ('hundr', 1), ('promot', 4), ('suffer', 1), ('case', 3), ('individu', 1), ('post', 1), ('rapidli', 1), ('spread', 3), ('member', 1), ('consid', 2), ('countri', 1), ('quarantin', 2), ('sar', 4), ('alreadi', 1), ('sinc', 2), ('context', 1)

[('patient', 2), ('respiratori', 1), ('gener', 1), ('respons', 1), ('report', 1), ('treatment', 2), ('case', 1), ('context', 1), ('statu', 1), ('close', 1), ('exacerb', 1), ('variabl', 1), ('monitor', 1), ('tailor', 1), ('concomit', 1), ('gravi', 1), ('myasthenia', 1), ('covid', 3)]
[('respons', 1), ('health', 1), ('global', 1), ('centr', 1), ('reproduct', 1), ('justic', 1), ('sexual', 1), ('covid', 1)]
[('cultur', 1), ('four', 1), ('posit', 1), ('requir', 1), ('result', 1), ('anti', 1), ('condit', 1), ('format', 2), ('line', 1), ('protein', 6), ('also', 2), ('cell', 18), ('form', 2), ('host', 1), ('interact', 1), ('respect', 2), ('viru', 2), ('either', 1), ('fuse', 1), ('suggest', 2), ('establish', 1), ('cellular', 1), ('constitut', 1), ('induc', 1), ('two', 2), ('factor', 2), ('howev', 1), ('lack', 1), ('type', 6), ('clone', 1), ('abil', 1), ('kind', 1), ('pathogen', 1), ('certain', 1), ('compar', 1), ('produc', 7), ('investig', 1), ('neutral', 2), ('potenti', 1), ('neither', 1), ('t

[('associ', 1), ('epidemiolog', 1), ('four', 1), ('infect', 1), ('like', 1), ('present', 2), ('publish', 1), ('although', 1), ('diseas', 4), ('evid', 1), ('focus', 1), ('protein', 1), ('support', 1), ('antigen', 2), ('cell', 1), ('challeng', 3), ('host', 1), ('human', 2), ('immun', 4), ('respons', 4), ('specif', 5), ('surfac', 4), ('may', 1), ('import', 5), ('mucos', 4), ('remain', 1), ('viru', 5), ('demonstr', 1), ('question', 1), ('agent', 2), ('induc', 3), ('protect', 10), ('tissu', 1), ('two', 1), ('replic', 2), ('type', 1), ('vaccin', 2), ('probabl', 2), ('candid', 1), ('compar', 1), ('optim', 1), ('necessari', 2), ('experiment', 2), ('hypothesi', 1), ('induct', 1), ('presenc', 1), ('clearli', 2), ('determin', 1), ('inocul', 2), ('neutral', 1), ('best', 1), ('neither', 1), ('formul', 1), ('achiev', 1), ('predict', 1), ('summari', 1), ('antibodi', 1), ('intestin', 5), ('aim', 1), ('mean', 1), ('advantag', 1), ('immunolog', 3), ('serotyp', 1), ('explain', 1), ('anim', 1), ('serum', 

[('find', 1), ('patient', 1), ('posit', 1), ('present', 2), ('review', 3), ('year', 1), ('literatur', 3), ('pulmonari', 1), ('acut', 1), ('ill', 1), ('number', 1), ('report', 2), ('found', 1), ('etiolog', 1), ('case', 4), ('small', 1), ('reveal', 1), ('man', 1), ('investig', 1), ('propos', 1), ('thorough', 1), ('diagnos', 1), ('histori', 1), ('relationship', 1), ('util', 1), ('criteria', 1), ('old', 1), ('manifest', 1), ('secondari', 2), ('effus', 3), ('radiolog', 1), ('rash', 1), ('pleural', 3), ('skin', 1), ('indian', 1), ('pleurisi', 1), ('syphili', 4), ('extramarit', 1)]
[('age', 1), ('commun', 3), ('data', 1), ('four', 1), ('group', 1), ('high', 1), ('identifi', 2), ('infect', 1), ('like', 1), ('patient', 1), ('pre', 1), ('present', 1), ('respiratori', 1), ('result', 1), ('inflammatori', 1), ('known', 1), ('lung', 1), ('model', 5), ('cell', 1), ('complex', 1), ('develop', 1), ('immun', 1), ('interact', 5), ('respons', 1), ('within', 1), ('acut', 2), ('biolog', 1), ('differenti', 

[('clinic', 1), ('due', 1), ('hospit', 2), ('infect', 2), ('pneumonia', 2), ('rate', 1), ('diseas', 2), ('sever', 1), ('also', 1), ('challeng', 1), ('develop', 1), ('specif', 1), ('caus', 1), ('genom', 1), ('includ', 2), ('one', 1), ('sequenc', 1), ('control', 1), ('effici', 1), ('health', 1), ('emerg', 2), ('progress', 1), ('low', 1), ('number', 1), ('prevent', 1), ('protect', 1), ('public', 2), ('event', 1), ('howev', 1), ('first', 1), ('major', 1), ('novel', 1), ('rout', 1), ('could', 1), ('fast', 1), ('treatment', 2), ('well', 1), ('case', 3), ('measur', 1), ('contact', 1), ('transmiss', 1), ('vaccin', 1), ('grow', 1), ('rel', 1), ('work', 1), ('sinc', 1), ('diagnos', 1), ('popul', 1), ('oversea', 1), ('mobil', 1), ('china', 2), ('chines', 1), ('cumul', 1), ('death', 1), ('govern', 2), ('meanwhil', 1), ('share', 1), ('articl', 1), ('face', 1), ('person', 1), ('right', 1), ('great', 1), ('build', 1), ('five', 1), ('month', 1), ('updat', 1), ('scientist', 1), ('keep', 1), ('prove', 1

These are basically TF-weighted vectors. We now want to convert these vectors into their TF-IDF weighted counterparts. We need, however, to import the models module from Gensim.

In [134]:
from gensim import models

def create_TF_IDF_model(allData):
    dictionary = create_dictionary(allData)
    docs2bows(allData, dictionary)
    loaded_allData = corpora.MmCorpus('simple_vsm_docs.mm')
    tfidf = models.TfidfModel(loaded_allData)
    return tfidf, dictionary
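As a rough illustration of what the TF-IDF transformation does to the bag-of-words vectors, the following plain-Python sketch weights each raw frequency by log2(N / df) and then normalizes every vector to unit length. To our understanding this mirrors the default settings of gensim's TfidfModel, but treat that as an assumption; the function name and the toy vectors are hypothetical.

```python
import math


def tfidf_weights(bows):
    """Convert (token_id, frequency) vectors into unit-length TF-IDF vectors."""
    n_docs = len(bows)
    # Document frequency: in how many documents each token id occurs.
    doc_freq = {}
    for bow in bows:
        for token_id, _freq in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted = []
    for bow in bows:
        vec = [(tid, freq * math.log2(n_docs / doc_freq[tid])) for tid, freq in bow]
        # A token present in every document gets weight 0 and is dropped.
        vec = [(tid, weight) for tid, weight in vec if weight > 0.0]
        norm = math.sqrt(sum(weight * weight for _tid, weight in vec))
        weighted.append([(tid, weight / norm) for tid, weight in vec] if norm else [])
    return weighted


# Toy vectors: token 1 occurs in both documents, tokens 0 and 2 in one each.
sample_bows = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
sample_tfidf = tfidf_weights(sample_bows)
print(sample_tfidf)
```

In this toy case the shared token 1 disappears from both vectors, which is the intended effect: terms that occur everywhere do not help to tell documents apart.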

Let us now create the TF-IDF model.

In [137]:
tfidfm = create_TF_IDF_model(dt)
print(tfidfm)

(<gensim.models.tfidfmodel.TfidfModel object at 0x00000195F731B0D0>, <gensim.corpora.dictionary.Dictionary object at 0x00000195D55D7130>)


As can be seen, the function returns a tuple containing the TF-IDF model and the associated dictionary. Let us now take a closer look at the TF-IDF model.

In [138]:
print(tfidfm[0].__dict__)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)






Finally, we create a function that, given dt and a query, prints the documents ranked in descending order of relevance (according to the cosine measure).

In [151]:
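from operator import itemgetter
from gensim import similarities


def launch_query(allData, q, filename='simple_vsm_docs.mm', top_k=None):
    # NOTE: the dictionary and the TF-IDF model are rebuilt on every call, which is slow for the full corpus.
    tfidf, dictionary = create_TF_IDF_model(allData)
    loaded_allData = corpora.MmCorpus(filename)
    # Index the TF-IDF vectors (not the raw counts) so that documents and query share the same weighting.
    # A sparse index is used because a dense documents x vocabulary matrix of this size would not fit in memory.
    index = similarities.SparseMatrixSimilarity(tfidf[loaded_allData], num_features=len(dictionary))
    pq = preprocess_query(q)
    vq = dictionary.doc2bow(pq)
    qtfidf = tfidf[vq]
    sim = index[qtfidf]
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    print("##############")
    print("QUERY:", q)
    # top_k=None prints the whole ranking; %s also copes with missing (NaN) titles.
    for doc, score in ranking[:top_k]:
        print("[ Score = %.3f ] %s" % (score, allData.iloc[doc].title))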
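For readers who want to see the cosine measure itself, here is a small plain-Python sketch that scores sparse vectors against a query and sorts them. It only illustrates the ranking principle under assumptions: the function names and toy vectors are hypothetical, and gensim computes the same quantity far more efficiently.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (token_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b[tid] for tid, weight in a.items() if tid in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs sorted by descending cosine similarity."""
    scores = [(i, cosine(query_vec, doc_vec)) for i, doc_vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy weighted vectors: document 0 matches the query exactly, document 2 only partly.
toy_doc_vecs = [[(0, 1.0)], [(1, 1.0)], [(0, 1.0), (1, 1.0)]]
toy_query_vec = [(0, 1.0)]
toy_ranking = rank_documents(toy_query_vec, toy_doc_vecs)
print(toy_ranking)
```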

And now we can launch any query we see fit against our newly created Information Retrieval engine.

In [None]:
for key in topics: 
    value = topics[key]
    launch_query(dt, value["query"])