# Building and Evaluating a COVID-19 oriented Information Retrieval Engine


## 1. Processing data
First, we process the data.

In [31]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [32]:
import pandas as pd
import json
import time
import xml.etree.ElementTree as ET
import math

In [33]:

dt = pd.read_csv("../data/metadata.csv")
print(dt)

        cord_uid                                       sha  \
0       ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1       02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2       ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3       2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4       9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...          ...                                       ...   
192504  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
192505  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
192506  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
192507  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
192508  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                      source_x  \
0                          PMC   
1                          PMC   
2                          PMC   
3                          PMC   
4                          PMC   
...                        ...   
192504           



#### Preparing the dataset
We decided to work only with the papers in the pdf_json corpus. Therefore, the first step is to remove from the dataframe the rows that have no file in that folder, which reduces the number of examples from 192509 to 79755. There are still more documents in pdf_json (over 84000) than rows in the dataframe, because many PDFs in the corpus share the same cord_uid. Technically, the papers mapped to the same cord_uid are the same paper, but with differences in the publication (if an article has been published by both Elsevier and Springer, it appears twice under the same cord_uid). Our approach is to consider just one of the documents associated with each cord_uid, instead of the full list.
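The selection rule described above can be sketched on a toy table. This is only an illustration in plain Python, assuming that pdf_json_files holds a "; "-separated list of paths; the rows and the helper name `select_one_parse` are made up for the example.

```python
import csv
import io

# Toy stand-in for metadata.csv; the rows are invented for illustration.
TOY_METADATA = """cord_uid,pdf_json_files
aaa11111,document_parses/pdf_json/x1.json; document_parses/pdf_json/x2.json
bbb22222,
ccc33333,document_parses/pdf_json/y1.json
"""

def select_one_parse(csv_text):
    """Keep rows that have a PDF parse and map each cord_uid to its first path."""
    selected = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        paths = row["pdf_json_files"]
        if not paths:                      # no PDF parse: drop the row
            continue
        first_path = paths.split("; ")[0]  # several parses: keep only the first
        selected.setdefault(row["cord_uid"], first_path)
    return selected

selected = select_one_parse(TOY_METADATA)
print(selected)
```

The row without a parse disappears, and the row with two parses keeps only the first path.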

In [34]:
dt = dt[dt.pdf_json_files.notnull()]
dt = dt.reset_index(drop = True)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                     source_x  \
0                         PMC   
1                         PMC   
2                         PMC   
3                         PMC   
4                         PMC   
...                       ...   
79750            Medline; PMC   
797

Next, we drop the columns that will not add information to our information retrieval system (such as the licenses or the doi) and that do not help to map each example of the dataframe with a document in the pdf_json corpus.

In [35]:
columns_to_delete = ["doi", "source_x", "pmcid", "pubmed_id", "license", "mag_id", "who_covidence_id", "arxiv_id", "pmc_json_files", "url", "s2_id"]
# dt_original = dt
dt = dt.drop(columns_to_delete, axis = 1)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                                                   title  \
0      Clinical features of culture-proven Mycoplasma...   
1      Nitric oxide: a pro-inflammatory mediator in l...   
2        Surfactant protein-D and pulmonary host defense   
3                   Role of

In [36]:
print(dt.shape[0])

79755


In [37]:
# Document 1 
print(dt.iloc[0])

cord_uid                                                   ug7v899j
sha                        d1aafb70c066a2068b02786f8929fd9c900897fb
title             Clinical features of culture-proven Mycoplasma...
abstract          OBJECTIVE: This retrospective chart review des...
publish_time                                             2001-07-04
authors                         Madani, Tariq A; Al-Ghamdi, Aisha A
journal                                              BMC Infect Dis
pdf_json_files    document_parses/pdf_json/d1aafb70c066a2068b027...
Name: 0, dtype: object


In [38]:
# Document 1
print(dt.iloc[0].title)
print(dt.iloc[0].abstract)

Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia
OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35

#### Parse the pdf
The articles mapped to each row of the metadata (cord_uid) are stored as separate .json files. The following script converts one article for each cord_uid into a Python dictionary and stores the dictionaries in pdf_json_list.

##### This step is probably optional, because the metadata already gives us the title, authors and abstract (the three most important things). Working with the full body of the articles may improve the results, but the computing time would increase considerably. In addition, the title, authors and abstract are more reliable in the metadata, since the metadata is provided directly by the journals, whereas the title, authors and abstract in the pdf_json corpus are obtained automatically by parsing PDF to JSON and may contain errors.
##### No joke, I run out of memory. My RAM was at 94% usage with the following script (obviously I could not finish it).
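One possible way around the memory problem is to read the parses one at a time instead of keeping all of them in a list. The generator below is only a sketch: the name `iter_parsed_articles` and the fields used in the demo files are assumptions for illustration, and the demo writes two tiny files to a temporary folder rather than touching the real corpus.

```python
import json
import os
import tempfile

def iter_parsed_articles(base_dir, relative_paths):
    """Yield one parsed article at a time so only one file is held in memory."""
    for rel_path in relative_paths:
        first_path = rel_path.split("; ")[0]   # keep only the first parse
        with open(os.path.join(base_dir, first_path), encoding="utf-8") as handle:
            yield json.load(handle)

# Demo on two throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.json", "b.json"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            json.dump({"paper_id": name, "body_text": []}, handle)
    paper_ids = [doc["paper_id"]
                 for doc in iter_parsed_articles(tmp, ["a.json; b.json", "b.json"])]

print(paper_ids)
```

Each dictionary can be processed and discarded inside the loop, so memory use stays roughly constant regardless of the corpus size.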

In [39]:
'''
pdf_json_list = []
t0 = time.time()
for row in dt.index :
    json_path = (dt.loc[row]['pdf_json_files'].split('; '))[0]
    json_file = open("../data/"+json_path) 
    full_text_dict = json.load(json_file)
    pdf_json_list.append(full_text_dict)
t1 = time.time()
'''

'\npdf_json_list = []\nt0 = time.time()\nfor row in dt.index :\n    json_path = (dt.loc[row][\'pdf_json_files\'].split(\'; \'))[0]\n    json_file = open("../data/"+json_path) \n    full_text_dict = json.load(json_file)\n    pdf_json_list.append(full_text_dict)\nt1 = time.time()\n'

Script to read the test queries (https://towardsdatascience.com/download-and-parse-trec-covid-data-8f9840686c37)

In [40]:
topics = {}
root = ET.parse("../queries/test_queries.xml").getroot()
for topic in root.findall("topic"):
    topic_number = int(topic.attrib["number"])
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text  # We only need the query: 
    #for question in topic.findall("question"):
     #   topics[topic_number]["question"] = question.text        
    #for narrative in topic.findall("narrative"):
     #   topics[topic_number]["narrative"] = narrative.text
print(topics[1].keys())

dict_keys(['query'])


In [41]:
print(topics)

{1: {'query': 'coronavirus origin'}, 2: {'query': 'coronavirus response to weather changes'}, 3: {'query': 'coronavirus immunity'}, 4: {'query': 'how do people die from the coronavirus'}, 5: {'query': 'animal models of COVID-19'}, 6: {'query': 'coronavirus test rapid testing'}, 7: {'query': 'serological tests for coronavirus'}, 8: {'query': 'coronavirus under reporting'}, 9: {'query': 'coronavirus in Canada'}, 10: {'query': 'coronavirus social distancing impact'}, 11: {'query': 'coronavirus hospital rationing'}, 12: {'query': 'coronavirus quarantine'}, 13: {'query': 'how does coronavirus spread'}, 14: {'query': 'coronavirus super spreaders'}, 15: {'query': 'coronavirus outside body'}, 16: {'query': 'how long does coronavirus survive on surfaces'}, 17: {'query': 'coronavirus clinical trials'}, 18: {'query': 'masks prevent coronavirus'}, 19: {'query': 'what alcohol sanitizer kills coronavirus'}, 20: {'query': 'coronavirus and ACE inhibitors'}, 21: {'query': 'coronavirus mortality'}, 22: 

In [42]:
for key in topics: 
    value = topics[key]
    print(key, value["query"])

1 coronavirus origin
2 coronavirus response to weather changes
3 coronavirus immunity
4 how do people die from the coronavirus
5 animal models of COVID-19
6 coronavirus test rapid testing
7 serological tests for coronavirus
8 coronavirus under reporting
9 coronavirus in Canada
10 coronavirus social distancing impact
11 coronavirus hospital rationing
12 coronavirus quarantine
13 how does coronavirus spread
14 coronavirus super spreaders
15 coronavirus outside body
16 how long does coronavirus survive on surfaces
17 coronavirus clinical trials
18 masks prevent coronavirus
19 what alcohol sanitizer kills coronavirus
20 coronavirus and ACE inhibitors
21 coronavirus mortality
22 coronavirus heart impacts
23 coronavirus hypertension
24 coronavirus diabetes
25 coronavirus biomarkers
26 coronavirus early symptoms
27 coronavirus asymptomatic
28 coronavirus hydroxychloroquine
29 coronavirus drug repurposing
30 coronavirus remdesivir
31 difference between coronavirus and flu
32 coronavirus subtyp

Script to read the relevance judgements (the information needed to evaluate our system). The round id is not needed and is therefore omitted. The relevance is also binarized.
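Binarizing here means that any positive judgement counts as relevant. The following plain-Python sketch shows the same parsing step on a few invented lines, assuming four space-separated fields per line (topic, round, cord_uid, relevance); the function name `read_binary_qrels` is hypothetical.

```python
# Invented sample lines in the assumed "topic round cord_uid relevance" layout.
SAMPLE_QRELS = """1 0.5 doc_a 2
1 1 doc_b 0
2 1 doc_c 1
"""

def read_binary_qrels(text):
    """Return {topic_id: {cord_uid: 0 or 1}}, ignoring the round id."""
    qrels = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        topic_id, _round_id, cord_uid, relevancy = line.split()
        # Any grade above 0 is treated as relevant.
        qrels.setdefault(int(topic_id), {})[cord_uid] = 1 if int(relevancy) > 0 else 0
    return qrels

qrels = read_binary_qrels(SAMPLE_QRELS)
print(qrels)
```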

In [43]:
relevance_data = pd.read_csv("../queries/relevance_judgements.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]
relevance_data = relevance_data.drop("round_id" ,axis = 1)
relevance_data['relevancy'] = relevance_data['relevancy'].replace(2, 1)  # Binarize: grade 2 becomes 1 (kept as an integer, not the string '1')
print(relevance_data)

       topic_id  cord_uid relevancy
0             1  005b2j4b         1
1             1  00fmeepz         1
2             1  010vptx3         1
3             1  0194oljo         1
4             1  021q9884         1
...         ...       ...       ...
69313        50  zvop8bxh         1
69314        50  zwf26o63         1
69315        50  zwsvlnwe         0
69316        50  zxr01yln         1
69317        50  zz8wvos9         1

[69318 rows x 3 columns]


With the metadata (and optionally the pdf_json parses), the test topics and the relevance judgements, we are ready to build and validate the system.

## 2. A simple VSM implementation
We have adapted the simple vector space model implementation for our code. 
#### Implementation

First, we install and import libraries 

In [14]:
# We first install the NLTK toolkit

In [44]:
pip install nltk 

Note: you may need to restart the kernel to use updated packages.


In [45]:
# We also need to download the NLTK data bundle

In [46]:
import nltk 

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     

True

In [47]:
#We now install the gensim package

In [48]:
pip install gensim 

Note: you may need to restart the kernel to use updated packages.


In [49]:
# All the required software is now installed. 

In [50]:
# We now import the functions provided by NLTK to perform tokenizing considering punctuation signs.
from nltk.tokenize import wordpunct_tokenize, regexp_tokenize
# Next, we import required functions to filter-out stopwords for the English language.
from nltk.corpus import stopwords
# Now we import the function that implements the Porter's stemming algorithm.
from nltk.stem import PorterStemmer

The first step is to preprocess each document in the collection. We write a function that receives one row of dt and returns a list containing all the STEMS of its title and abstract whose associated token is longer than 2 characters and is NOT an (English) stopword.
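To make the filtering rule concrete, here is a dependency-free sketch of the token filter only. It deliberately leaves out stemming and uses a tiny hypothetical stopword list, so it is not a replacement for the NLTK-based function; `filter_tokens` and `TOY_STOPWORDS` are names invented for this example.

```python
import re

# Tiny hypothetical stopword list; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "of", "and", "in", "with", "were"}

def filter_tokens(text):
    """Lowercase, split into alphanumeric runs, drop stopwords and tokens of length <= 2."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in TOY_STOPWORDS and len(tok) > 2]

kept = filter_tokens("Clinical features of 40 patients with culture-proven infections")
print(kept)
```

The stopwords "of" and "with" and the short token "40" are discarded; the remaining tokens are what the stemmer would receive.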

In [51]:
def preprocess_document(doc): # Each doc is one dataframe row. We only use its title and abstract: dt.iloc[i].title and dt.iloc[i].abstract
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    has_title = isinstance(doc.title, str)
    has_abstract = isinstance(doc.abstract, str)
    if not has_title and not has_abstract: # For empty documents without title and abstract
        return [""]
    tokens = [] # Start empty so that a row with an abstract but no title does not fail
    if has_title:
        tokens.extend(wordpunct_tokenize(doc.title))
    if has_abstract:
        tokens.extend(wordpunct_tokenize(doc.abstract))
    # clean keeps the lowercased tokens that are not in stopset, are longer than 2 characters and contain no "%"
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2 and "%" not in token]
    final = [stemmer.stem(word) for word in clean]
    return final

In [23]:
print(preprocess_document(dt.iloc[0])) # Print STEMS for the document 1
print(type(dt.iloc[0].title))

['clinic', 'featur', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'object', 'retrospect', 'chart', 'review', 'describ', 'epidemiolog', 'clinic', 'featur', 'patient', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'method', 'patient', 'posit', 'pneumonia', 'cultur', 'respiratori', 'specimen', 'januari', '1997', 'decemb', '1998', 'identifi', 'microbiolog', 'record', 'chart', 'patient', 'review', 'result', 'patient', 'identifi', 'requir', 'admiss', 'infect', 'commun', 'acquir', 'infect', 'affect', 'age', 'group', 'common', 'infant', 'pre', 'school', 'children', 'occur', 'year', 'round', 'common', 'fall', 'spring', 'three', 'quarter', 'patient', 'comorbid', 'twenti', 'four', 'isol', 'associ', 'pneumonia', 'upper', 'respiratori', 'tract', 'infect', 'bronchiol', 'cough', 'fever', 'malais', 'common', 'symptom', 'crepit', 'wheez', '

In [24]:
print(dt.iloc[29264])
print(dt.iloc[29264].title)
print(type(dt.iloc[29264].title))

cord_uid                                                   n06og3cw
sha                        8d35867e078939b7f20187322e41011cec8b8cb3
title                                                           NaN
abstract                                                        NaN
publish_time                                             2020-05-13
authors           De Coninck, David; d'Haenens, Leen; Matthijs, ...
journal                                               Public Health
pdf_json_files    document_parses/pdf_json/8d35867e078939b7f2018...
Name: 29264, dtype: object
nan
<class 'float'>


In [25]:
print(preprocess_document(dt.iloc[29264]))

['']


In [52]:
def preprocess_query(q):
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = wordpunct_tokenize(q)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

Once all documents in the collection have been preprocessed, we need to create a dictionary containing the mappings WORD_ID -> WORD. This dictionary is required to create the vector-based word representations.
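The idea behind the dictionary and the vectors built from it can be sketched without Gensim. The helpers below are hypothetical and only illustrate the mapping; Gensim may assign the identifiers in a different order.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(doc, token_to_id):
    """Return sorted (token_id, frequency) pairs; unknown tokens are skipped."""
    counts = Counter(token_to_id[tok] for tok in doc if tok in token_to_id)
    return sorted(counts.items())

# Two invented, already preprocessed documents.
toy_docs = [["coronaviru", "origin", "bat"], ["coronaviru", "mask", "mask"]]
vocab = build_vocabulary(toy_docs)
toy_bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_bows)
```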

In [53]:
from gensim import corpora

# Different words in the collection
def create_dictionary(docs):
    #print(preprocess_document(docs.iloc[79754]))
    # List all pre-processing documents
    pdocs = [preprocess_document(docs.iloc[i]) for i in range(docs.shape[0])]
    #print("pdocs: ", pdocs)
    # Build the dictionary
    dictionary = corpora.Dictionary(pdocs)
    # Save in a file
    dictionary.save('simple_vsm.dict')
    return dictionary

Let us call the create_dictionary function feeding it with the complete dt.

In [28]:
dictionary = create_dictionary(dt)
print(dictionary)

Dictionary(123368 unique tokens: ['1997', '1998', 'abdulaziz', 'acquir', 'admiss']...)


We have now built the dictionary containing the vocabulary that we will use for indexing. Next, we write a function that creates the bag-of-words representation of each document in the collection.

In [54]:
def docs2bows(allData, dictionary):
    docs = [preprocess_document(allData.iloc[i]) for i in range(allData.shape[0])]
    # We obtain the set of frequencies for each term
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    corpora.MmCorpus.serialize('simple_vsm_docs.mm', vectors)
    return vectors

Let us now generate the BOWs for the complete dt.

In [30]:
bows = docs2bows(dt, dictionary)

KeyboardInterrupt: 

In [None]:
print(bows)

These are pairs (word identifier, frequency). Let us now convert them into something a bit more readable.

In [None]:
for v in bows:
    tvec = [(dictionary[id], freq) for (id, freq) in v]
    print(tvec)

These are basically TF-weighted vectors. We now want to convert these vectors into their TF-IDF weighted counterparts. We need, however, to import the models module from Gensim.
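As a reference for what the conversion does, here is a plain-Python sketch of one common TF-IDF variant: raw term frequency times log2(N/df), followed by L2 normalization. To our understanding this corresponds to the default settings of Gensim's TfidfModel, but that is an assumption that should be checked against the Gensim documentation; the function name and the toy vectors are made up.

```python
import math

def tfidf_vectors(bows):
    """bows: list of [(term_id, tf), ...]. Returns L2-normalized TF-IDF vectors."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for term_id, _tf in bow:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    vectors = []
    for bow in bows:
        weighted = [(t, tf * math.log2(n_docs / doc_freq[t])) for t, tf in bow]
        weighted = [(t, w) for t, w in weighted if w > 0]  # terms present in every document get weight 0
        norm = math.sqrt(sum(w * w for _t, w in weighted))
        vectors.append([(t, w / norm) for t, w in weighted] if norm else [])
    return vectors

toy_bows = [[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 2)]]
toy_tfidf = tfidf_vectors(toy_bows)
print(toy_tfidf)
```

Term 0 appears in both toy documents, so it carries no weight; the rarer terms dominate each vector.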

In [55]:
from gensim import models

def create_TF_IDF_model(allData):
    dictionary = create_dictionary(allData)
    docs2bows(allData, dictionary)
    loaded_allData = corpora.MmCorpus('simple_vsm_docs.mm')
    tfidf = models.TfidfModel(loaded_allData)
    print("tfidf", tfidf)
    return tfidf, dictionary

Let us now create the TF-IDF model.

In [None]:
tfidfm = create_TF_IDF_model(dt)
print(tfidfm)

As can be seen, the function returns a tuple that contains the TF-IDF model and the associated dictionary. Let us now take a closer look at the TF-IDF model.

In [None]:
print(tfidfm[0].__dict__)

Finally, we create a function that, given the dt and a query, returns a document ranking sorted in descending order of relevance (according to the cosine measure).
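The cosine measure itself is easy to state in code. The sketch below ranks a few invented sparse vectors in plain Python; `cosine` and `rank_documents` are hypothetical helpers, not the Gensim calls used in the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term_id: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by decreasing cosine similarity."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

toy_doc_vecs = [{0: 1.0, 1: 1.0}, {2: 1.0}, {0: 1.0}]
toy_ranking = rank_documents({0: 1.0}, toy_doc_vecs)
print(toy_ranking)
```

The document that contains only the query term ranks first, the one that shares it with another term comes second, and the unrelated document scores 0.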

In [158]:
from operator import itemgetter
from gensim import similarities
 
    
def launch_query(allData, q, number, f=None, filename='simple_vsm_docs.mm'):
    tfidf, dictionary = create_TF_IDF_model(allData)
    loaded_allData = corpora.MmCorpus(filename)
    index = similarities.MatrixSimilarity(loaded_allData, num_features=len(dictionary))
    pq = preprocess_query(q)
    vq = dictionary.doc2bow(pq)
    qtfidf = tfidf[vq]
    sim = index[qtfidf]
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    
    print("QUERY:",q)
    for i in range(min(10, len(ranking))): # Top ten results, or fewer for small collections
        doc_id, score = ranking[i]
        print("[ Score = "+str(score)+" ] "+str(allData.iloc[doc_id].title)) # str() guards against missing (NaN) titles
        if f is not None: # Only write the run file when a file handle is given
            f.write(str(number)+" Q0 "+str(allData.iloc[doc_id].cord_uid)+" "+str(i+1)+" "+str(score)+" mySystem \n")
    """
    pos = 1
    for doc, score in ranking:
        if ( pos <=10 ): # First ten positions
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title); 
            f.write(str(number)+" Q0 "+str(allData.iloc[doc].cord_uid)+" "+str(pos)+" "+str(round(score,3))+" mySystem \n")
        else: 
            f.write(str(number)+" Q0 "+str(allData.iloc[doc].cord_uid)+" "+str(pos)+" "+str(round(score,3))+" mySystem \n")
            #break
        pos += 1"""
    return ranking

We can now launch any query we see fit against our newly created Information Retrieval engine.

We start with query 1.

In [None]:
#for key in topics: 
    #value = topics[key]
    #launch_query(dt, value["query"])


In [152]:
# We select all cord_uid in relevance_data for the query 1
cords_relevance_data = set(relevance_data.loc[relevance_data.topic_id==1, "cord_uid"])
print(cords_relevance_data)

{'vydgwa8o', 'wfh9pzb2', '7utfozx5', '1e28zj1d', 'tvbnv5gz', 'o9oxchq6', 'tb3yyq44', 'etmhbrfe', 'gjdlp10q', 'cwfujgya', 'erozilox', 'qopcs6jy', 'gy8d8285', 'xlqzprqn', 'wt9j5mvd', 'b6dsdiux', 'tj4pn38j', 'ljcr0xtm', 'sxbmd0df', 'w53u5ive', 'rgmurgbf', '4n6v5kfv', 'kjx0z1hc', 'ab757i3f', 't8m9i4vv', '7odpslba', '5l10cbp2', 'e0b8gnh8', 'ljqrxjvv', 'eju7wnb9', 'bfjdvfuq', '52lcpf0x', 'nw4wap1d', '1s6dlcer', 'oewvpv66', '1585stal', '85acs4lk', '6yfj4co0', 'aki28lhp', '5p8gkbi7', '7x3nq9cp', 'kx5hihnr', 'cdthfl5f', 'shvsnkxd', 'fby616ap', 'yvh2fzxt', 'jsyao6qu', '65b267ic', 'flc25wlz', 'o349msnm', 'bzldjzp3', 'z7a3g6e8', '2uwnamao', '98syj71y', 'm2c5bvuj', 'utn21ce1', '2kbi9drl', 'esq33kx4', 'jntrjg8u', 'jbtrdvhe', 't0cw7l2a', '8a1cia8s', 'zi0lc3lp', 'b2znv6pa', 'gytg4iku', 'jjraqr85', 'z0ni2jsr', 'ao7bkcv5', '08efpohc', '8pkrg0mx', 'l5q5wc06', '4ilpph77', 'mgdfwbfm', '1ag9jkk6', 'moz22sur', '59492sjb', 'yeih3tfo', 'chbegdex', '1sq2uvur', 's0893qap', 'ibn71fcd', 'jqgyln7b', '1ldynibm', 'gz

In [153]:
# We select all cord_uid that are in both (relevance_data and dt) for the query 1
dt_rel_data = pd.DataFrame()
aux = 0

for i in range(dt.cord_uid.shape[0]):
    #print(dt.cord_uid[i])
    if dt.cord_uid[i] in cords_relevance_data:
        #print(dt.cord_uid[i])
        dt_rel_data[aux] = dt.iloc[i]
        aux +=1
dt_rel_data = dt_rel_data.T
print(dt_rel_data)
print(len(cords_relevance_data))
        

      cord_uid                                                sha  \
0     sw4wtxdk           4faf1ac964c605b384dda60bc37df300766401b9   
1     6wu024ng           f2ab1be1bbd80c0f102714fdc90597af2739442c   
2     sbxqwfmy           c9b0389a55de2f9cbfe37049d1072e0984613923   
3     1rhy8td0           8a6809df45d5f80a822d68d3c305f7640e10234a   
4     t7rxmzvi           1a162c4dd45cad2c49168b2d6f2c350e47a3db09   
...        ...                                                ...   
1215  py6qu4tl  dca8ced82157924ed86c698a7dd482be81b4b266; ff9f...   
1216  4qenzjiu           fdf021cfe745daed338cce7eaa5e548581477ff4   
1217  fozglfc8           422bcfff056d118337d3941c0afcb8b888142182   
1218  gy0kfhy6           877f7dfd596cabc12cf7228ffb19cd6b663cea93   
1219  rcwck1y3           7db1d6af433a96af8d1dac0fdd078ea9d1980e9c   

                                                  title  \
0     NSs Encoded by Groundnut Bud Necrosis Virus Is...   
1     Comparative Efficacy of Hemagglutinin, Nucleop.

In [154]:
# Now we rank for the query 1
value = topics[1]
rank = launch_query(dt_rel_data, value["query"], 1)

tfidf TfidfModel(num_docs=1220, num_nnz=96453)
QUERY: coronavirus origin
[ Score = 0.58731437 ] Origin and Evolution of the 2019 Novel Coronavirus
[ Score = 0.55520093 ] Characteristics of Metazoan DNA Replication Origins
[ Score = 0.53614223 ] Bat origin of a new human coronavirus: there and back again
[ Score = 0.41098344 ] Commentary: Origin and evolution of pathogenic coronaviruses
[ Score = 0.41098344 ] Strategies to trace back the origin of COVID-19
[ Score = 0.37517485 ] A glimpse into the origins of genetic diversity in SARS-CoV-2
[ Score = 0.36360946 ] A phylogenomic data-driven exploration of viral origins and evolution
[ Score = 0.34331477 ] Experimental infection of a US spike-insertion deletion porcine epidemic diarrhea virus in conventional nursing piglets and cross-protection to the original US PEDV infection
[ Score = 0.33538014 ] Zoonotic origins of human coronavirus 2019 (HCoV-19 / SARS-CoV-2): why is this work important?
[ Score = 0.33410308 ] Tracking the origin of 

#### Performance evaluation -- All Queries

First, we rank the documents for every query.
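Each ranked document is written to the run file as one line with six fields: topic id, the literal Q0, document id, rank, score and a run tag. A tiny formatter makes the layout explicit; `format_run_line` and the toy ranking are invented for illustration.

```python
def format_run_line(topic_id, cord_uid, rank_position, score, tag="mySystem"):
    """Build one run-file line: topic Q0 docid rank score tag."""
    return f"{topic_id} Q0 {cord_uid} {rank_position} {score} {tag}"

# (cord_uid, score) pairs with invented values, already sorted by score.
toy_results = [("doc_a", 0.59), ("doc_b", 0.41)]
run_lines = [format_run_line(1, doc, pos, score)
             for pos, (doc, score) in enumerate(toy_results, start=1)]
print("\n".join(run_lines))
```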

In [159]:
all_rankings_TFIDF=[]
with open("../queries/trecRun_tfidf.txt","w") as f:
    for topic_id in sorted(topics):
        # We select all cord_uid in relevance_data for this query
        cords_relevance_data = set(relevance_data.loc[relevance_data.topic_id==topic_id, "cord_uid"])
        # We keep the rows of dt whose cord_uid is also in relevance_data for this query
        dt_rel_data = dt[dt.cord_uid.isin(cords_relevance_data)].reset_index(drop=True)
        # Now we rank for this query
        print("##############")
        print("QUERY NUMBER ", topic_id)
        value = topics[topic_id]
        rank = launch_query(dt_rel_data, value["query"], topic_id, f)
        all_rankings_TFIDF.append(rank)


##############
QUERY NUMBER  1
tfidf TfidfModel(num_docs=1220, num_nnz=96453)
QUERY: coronavirus origin
[ Score = 0.58731437 ] Origin and Evolution of the 2019 Novel Coronavirus
[ Score = 0.55520093 ] Characteristics of Metazoan DNA Replication Origins
[ Score = 0.53614223 ] Bat origin of a new human coronavirus: there and back again
[ Score = 0.41098344 ] Commentary: Origin and evolution of pathogenic coronaviruses
[ Score = 0.41098344 ] Strategies to trace back the origin of COVID-19
[ Score = 0.37517485 ] A glimpse into the origins of genetic diversity in SARS-CoV-2
[ Score = 0.36360946 ] A phylogenomic data-driven exploration of viral origins and evolution
[ Score = 0.34331477 ] Experimental infection of a US spike-insertion deletion porcine epidemic diarrhea virus in conventional nursing piglets and cross-protection to the original US PEDV infection
[ Score = 0.33538014 ] Zoonotic origins of human coronavirus 2019 (HCoV-19 / SARS-CoV-2): why is this work important?
[ Score = 0.334

##############
QUERY NUMBER  9
tfidf TfidfModel(num_docs=1235, num_nnz=98401)
QUERY: coronavirus in Canada
[ Score = 0.5790585 ] Genome Organization of Canada Goose Coronavirus, A Novel Species Identified in a Mass Die-off of Canada Geese
[ Score = 0.45560414 ] Climate Change and Health in Canada
[ Score = 0.41809103 ] Hospital-Based HTA and Know4Go at MEDICI in London, Ontario, Canada
[ Score = 0.4078517 ] Novel Coronavirus, Old Partisanship: COVID-19 Attitudes and Behaviours in the United States and Canada
[ Score = 0.40750474 ] Canada 2010: what should global health expect?
[ Score = 0.40750474 ] The Impact of COVID-19 on Diabetes Research in Canada
[ Score = 0.37199923 ] COVID-19 in Canada and the use of Personal Protective Equipment
[ Score = 0.36303407 ] The United States and Canada as a coupled epidemiological system: An example from hepatitis A
[ Score = 0.32216078 ] Effect of COVID-19 on the mental health care of older people in Canada
[ Score = 0.3080446 ] Human Bocavirus Inf

##############
QUERY NUMBER  17
tfidf TfidfModel(num_docs=976, num_nnz=84939)
QUERY: coronavirus clinical trials
[ Score = 0.64664865 ] Systematic review of the registered clinical trials for coronavirus disease 2019 (COVID-19)
[ Score = 0.6327981 ] Systematic Review of the Registered Clinical Trials of Coronavirus Disease 2019 (COVID-19)
[ Score = 0.613948 ] HUMAN CORONAVIRUS DATA FROM FOUR CLINICAL TRIALS OF MASKS AND RESPIRATORS
[ Score = 0.5923928 ] A Decade On: Systematic Review of ClinicalTrials.gov Infectious Disease Trials, 2007–2017
[ Score = 0.5908346 ] The race to find a SARS-CoV-2 drug can only be won by a few chosen drugs: a systematic review of registers of clinical trials of drugs aimed at preventing or treating COVID-19
[ Score = 0.57358706 ] Characteristics of COVID-19 Clinical Trials in China Based on the Registration Data on ChiCTR and ClinicalTrials.gov
[ Score = 0.57212317 ] A brief review of antiviral drugs evaluated in registered clinical trials for COVID-19
[ Sc

##############
QUERY NUMBER  24
tfidf TfidfModel(num_docs=918, num_nnz=69491)
QUERY: coronavirus diabetes
[ Score = 0.81461936 ] Coronavirus and diabetes: an update
[ Score = 0.5771639 ] Practical recommendations for the management of diabetes in patients with COVID-19
[ Score = 0.5562721 ] Diabetes management and specific considerations for patients with diabetes during coronavirus diseases pandemic: A scoping review
[ Score = 0.542721 ] Diabetes and COVID‐19: psychosocial consequences of the COVID‐19 pandemic in people with diabetes in Denmark—what characterizes people with high levels of COVID‐19‐related worries?
[ Score = 0.5393326 ] Coronavirus Disease 2019 and Diabetes: The Epidemic and the Korean Diabetes Association Perspective
[ Score = 0.53273576 ] COVID‐19 and diabetes
[ Score = 0.52262604 ] The burden of type 2 diabetes: are we doing enough?
[ Score = 0.4981807 ] Diabetes and Novel Coronavirus Infection: Implications for Treatment
[ Score = 0.49595004 ] The double burden of

##############
QUERY NUMBER  32
tfidf TfidfModel(num_docs=1220, num_nnz=100958)
QUERY: coronavirus subtypes
[ Score = 0.5011336 ] SFJ: An Implementation of Semantic Featherweight Java
[ Score = 0.4670424 ] In-Vitro Subtype-Specific Modulation of HIV-1 Trans-Activator of Transcription (Tat) on RNAi Silencing Suppressor Activity and Cell Death
[ Score = 0.43890134 ] Genotypes and subtypes of Cryptosporidium spp. in neonatal calves in Northern Ireland
[ Score = 0.42300546 ] Pathogenicity of three genetically diverse strains of PRRSV Type 1 in specific pathogen free pigs
[ Score = 0.38286084 ] Dating the time of viral subtype divergence
[ Score = 0.3757472 ] An unusual cluster of HIV-1 B/F recombinants in an Asian population
[ Score = 0.37399027 ] A Human Monoclonal Antibody with Neutralizing Activity against Highly Divergent Influenza Subtypes
[ Score = 0.35409623 ] Molecular characterization of Cryptosporidium isolates from diarrheal dairy calves in France
[ Score = 0.3523856 ] Infection

##############
QUERY NUMBER  40
tfidf TfidfModel(num_docs=916, num_nnz=76933)
QUERY: coronavirus mutations
[ Score = 0.5320487 ] Functional analysis of the stem loop S3 and S4 structures in the coronavirus 3′UTR
[ Score = 0.47644278 ] RdRp mutations are associated with SARS-CoV-2 genome evolution
[ Score = 0.4744976 ] Time Series Prediction of COVID-19 by Mutation Rate Analysis using Recurrent Neural Network-based LSTM Model
[ Score = 0.4637258 ] Recurrent mutations associated with isolation and passage of SARS coronavirus in cells from non‐human primates
[ Score = 0.45029032 ] Genome Sequencing of a Severe Acute Respiratory Syndrome Coronavirus 2 Isolate Obtained from a South African Patient with Coronavirus Disease 2019
[ Score = 0.44091716 ] Prediction of mutations engineered by randomness in H5N1 hemagglutinins of influenza A virus
[ Score = 0.43526208 ] Prediction of mutations engineered by randomness in H5N1 neuraminidases from influenza A virus
[ Score = 0.43191063 ] Prediction 

##############
QUERY NUMBER  48
tfidf TfidfModel(num_docs=491, num_nnz=41671)
QUERY: school reopening coronavirus
[ Score = 0.568277 ] Reopening schools after the COVID-19 lockdown
[ Score = 0.5622488 ] The impact of school reopening on the spread of COVID-19 in England
[ Score = 0.45147365 ] Expected impact of reopening schools after lockdown on COVID-19 epidemic in Ile-de-France
[ Score = 0.4018325 ] Should Schools Reopen Early or Late? – Transmission Dynamics of COVID-19 in Children
[ Score = 0.36745095 ] Shut and re-open: the role of schools in the spread of COVID-19 in Europe
[ Score = 0.36745095 ] Shut and re-open: the role of schools in the spread of COVID-19 in Europe
[ Score = 0.36648026 ] Cost Benefit Analysis of Limited Reopening Relative to a Herd Immunity Strategy or Shelter in Place for SARS-CoV-2 in the United States
[ Score = 0.3653023 ] Determining the optimal strategy for reopening schools, work and society in the UK: balancing earlier opening and the impact of test a

https://github.com/joaopalotti/trectools

In [162]:
pip install trectools

Note: you may need to restart the kernel to use updated packages.


In [164]:
# TREC_EVAL using document score
from trectools import TrecQrel, TrecRun, TrecEval
# A typical evaluation workflow

# We load a TrecRun object
r1 = TrecRun("../queries/trecRun_tfidf.txt")
#print(r1.topics()[:5]) # Shows the first 5 topics
r2 = TrecRun("../queries/trecRun2.txt") # Made-up placeholder run
#print(r2.topics()[:5]) # Shows the first 5 topics


# We load a TrecQrel object
relevance_data_qrels = "../queries/relevance_judgements.txt"
qrels = TrecQrel(relevance_data_qrels)

te = TrecEval(r1, qrels)
map_result = te.get_map(trec_eval = True ) # The result is the same as trec_eval
ndcg_result = te.get_ndcg(trec_eval=True)     

print("MAP: %.3f, NDCG: %.3f" % (map_result, ndcg_result)) # Only the top 10 results of each ranking are evaluated!


### r2 is made up in this case. We will only use this part with this VSM and the next one

result_r1 = r1.evaluate_run(qrels, per_query=True) 
result_r2 = r2.evaluate_run(qrels, per_query=True)

# Inspect for statistically significant differences between the two runs for  P_10 using two-tailed Student t-test
pvalue = result_r1.compare_with(result_r2, metric="P_10")

print("P-value for wrt P@10 between r1 and r2: %.3f" % (pvalue[1]))

MAP: 0.009, NDCG: 0.038
P-value for wrt P@10 between r1 and r2: 0.051
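
To make the reported numbers easier to read, the following is a minimal hand-written sketch of the textbook definitions of P@k and average precision (MAP is the mean of the average precision over all queries). The helper names `precision_at_k` and `average_precision` and the toy document ids are assumptions made for illustration; they are not part of trectools, whose handling of ties and ranking depth may differ.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents (retrieved or not)."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


# Toy example (hypothetical ids): relevant documents found at ranks 1 and 3, one never retrieved
ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d3", "d1", "d5"}
print(precision_at_k(ranking, relevant, k=2))
print(average_precision(ranking, relevant))
```

Because average precision divides by the total number of relevant documents, a run that only lists 10 documents per query can score a low MAP even when most of those 10 are relevant, which may partly explain the value above.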


## 3. VSM based on word co-occurrence implementation - WCSVSM
We applied a VSM based on word co-occurrence to our dataset.

The model was proposed in:

Chen, S., Chen, Y., Yuan, F., & Chang, X. (2020). Establishment of herbal prescription vector space model based on word co-occurrence. Journal of Supercomputing, 76(5), 3590–3601. https://doi.org/10.1007/s11227-018-2559-3

#### Implementation

First, we install and import libraries

In [105]:
pip install nltk 

Note: you may need to restart the kernel to use updated packages.


In [106]:
pip install gensim 

Note: you may need to restart the kernel to use updated packages.


In [107]:
import nltk 

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\Sandrus\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     

True

In [108]:
# All the required software is now installed. 

In [109]:
# We now import the functions provided by NLTK to perform tokenizing considering punctuation signs.
from nltk.tokenize import wordpunct_tokenize, regexp_tokenize
# Next, we import required functions to filter-out stopwords for the English language.
from nltk.corpus import stopwords
# Now we import the function that implements the Porter's stemming algorithm.
from nltk.stem import PorterStemmer

In [110]:
import numpy as np

We will reuse the following functions from the Simple VSM implementation.
Note: please run the next two defs first.
- preprocess_document(doc) -> returns a list with all the stems of one document. Input: one document
- preprocess_query(q) -> returns a list with all the stems of one query. Input: one query
- create_dictionary(docs) -> returns a dictionary with the mappings WORD_ID -> WORD. This dictionary is required to build the vector-based word representations. Input: all documents
- docs2bows(allData, dictionary) -> returns the bag-of-words representation of each document in the collection. Each bow is a list of (id, freq) pairs, and dictionary[id] returns the word for that id. These are the TF-weighted vectors (built later)
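
As a rough illustration of what the dictionary and the bag-of-words vectors contain, here is a plain-Python sketch that mimics the idea without gensim. The helpers `build_vocabulary` and `to_bow` and the toy stems are assumptions made for this example; gensim's `Dictionary` may assign different ids to the same words.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, frequency) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Two already-preprocessed (stemmed) toy documents
docs = [["viru", "spread", "viru"], ["school", "reopen", "viru"]]
token2id = build_vocabulary(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each bow only stores the words that actually occur in the document, which is why the later functions iterate over (id, freq) pairs instead of dense vectors.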



In [111]:
from gensim import corpora

# Different words in the collection
def create_dictionaryWCS(docs):
    print("-- create_dictionaryWCS")
    #print(preprocess_document(docs.iloc[79754]))
    # List all pre-processing documents
    pdocs = [preprocess_document(docs.iloc[i]) for i in range(docs.shape[0])]
    #print("pdocs: ", pdocs)
    # Build the dictionary
    dictionary = corpora.Dictionary(pdocs)
    # Save in a file
    dictionary.save('wcs_vsm.dict')
    return dictionary

In [112]:
def docs2bowsWCS(allData, dictionary):
    print("-- docs2bowsWCS")
    docs = [preprocess_document(allData.iloc[i]) for i in range(allData.shape[0])]
    # We obtain the set of frequencies for each term
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    corpora.MmCorpus.serialize('wcs_vsm_docs.mm', vectors)
    return vectors    

In [None]:
dic1 = create_dictionaryWCS(dt_rel_data)
vect1 = docs2bowsWCS(dt_rel_data,dic1)
print(vect1) 


In [None]:
id, freq = vect1[0][0]
print(id)
print(freq)

In [None]:
for v in vect1:
    tvec1 = [(dic1[id], freq) for (id, freq) in v]
    print(tvec1)

We will calculate the Mutual Information of each word pair, MI(word i, word j) = (frequency of word i and j)^2 / ((frequency of word i)*(frequency of word j)), and from it the weight of the pair (weight = TF x IDF x MI(word i, word j)). TODO: change this.

First, we use the extraction algorithm from the paper to extract the word pairs.
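
The sketch below works through one pair by hand. It assumes, as the implementation in this notebook does, that the co-occurrence frequency of a pair inside a document is approximated by the smaller of the two word frequencies; under that assumption the MI formula reduces to min/max. The helper names `pair_statistics` and `pair_weight` and the numbers are hypothetical, and the logarithms mirror the `math.log10` calls used later in `weight`.

```python
import math


def pair_statistics(freq_i, freq_j):
    """Return (TF, MI) for a word pair, approximating the co-occurrence frequency by the minimum."""
    tf = min(freq_i, freq_j)
    mi = tf ** 2 / (freq_i * freq_j)  # equals min(freq_i, freq_j) / max(freq_i, freq_j)
    return tf, mi


def pair_weight(tf, mi, n_docs, doc_freq):
    """Weight of a pair: log10(TF) * log10(N / DF) * MI."""
    return math.log10(tf) * math.log10(n_docs / doc_freq) * mi


# Word i occurs 4 times and word j occurs 8 times in the same document
tf, mi = pair_statistics(4, 8)
# The pair passes both thresholds used in this notebook (support >= 3, confidence > 0.3)
print(tf >= 3 and mi > 0.3)
# Suppose the pair appears in 10 of 100 documents
print(pair_weight(tf, mi, n_docs=100, doc_freq=10))
```

Note that a pair with TF = 1 always gets weight 0 because log10(1) = 0, which matters for short queries where every word usually appears once.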



In [None]:
# FIRST WE HAVE TO FIND THE SET R OF CO-OCCURRENCE PAIRS.
# TO DO SO WE COMPUTE THE TF OF EACH CO-OCCURRENCE PAIR AND ITS MI.
# AS TF WE TAKE THE MINIMUM OF THE TWO WORD FREQUENCIES, AND AS MI THE MIN/MAX RATIO OF THE TWO WORD FREQUENCIES.
# WE COMPUTE THE TF AND THE MI AT THE SAME TIME

In [122]:
# TF = Frequency of word co-occurrence
# Support (FREQ(W1,W2)) threshold >= 3

# The input is a single vector for one document, not the set of vectors for all the documents
def TF_MI_Pairs_cooccu_WCS(vector): # Co-occurrence frequency between 2 words: min(w1,w2)
    cooccurrence_frequency_MI = []
    for j in range(len(vector)): # Read each (id, frequency) pair
        idWord1, freqWord1 = vector[j]
        # Starting at j+1 guarantees that each unordered pair is added only once, as (j,h) with j<h
        for h in range(j+1,len(vector)): 
            idWord2, freqWord2 = vector[h] # Now we have both words of the pair with their frequencies
            freqMin = min(freqWord1, freqWord2) # Co-occurrence frequency, approximated by the minimum
            if freqMin >= 3 : # Threshold support >= 3
                MI = freqMin / max(freqWord1, freqWord2) # MI = min(freq1,freq2) / max(freq1,freq2)
                if MI > 0.3 :  # Threshold confidence > 0.3
                    cooccurrence_frequency_MI.append(((idWord1,idWord2),freqMin,MI))  
    
                
    return cooccurrence_frequency_MI # OUTPUT: List of tuples (co-occurrence, co-occurrence_frequency, co-occurrence_MI)


In [None]:
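# vect1 was built with dic1, so ids are looked up in dic1
print(dic1[964])
for doc_index in range(0,50):
    pairs = TF_MI_Pairs_cooccu_WCS(vect1[doc_index])
    print("PAIRS: ", pairs)
    for (id1,id2),freq,MI in pairs: # Distinct names avoid shadowing the outer loop variable
        print(id1,id2)
        print(dic1[id1],dic1[id2])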

In [114]:
# For all the vectors: 
def R_TF_IDF_MI_weight_cooccu_WCS(vectors): # Computes TF, DF and R (all the relevant pairs of the whole corpus)
    # R is implicit in the keys of document_frequency
    pairs_docs = []
    document_frequency = {} # Maps co-occurrence -> number of documents that contain it
    
    for v in vectors: 
        pairs = TF_MI_Pairs_cooccu_WCS(v)
        pairs_docs.append(pairs)
        for p,_,_ in pairs:
            # A dict lookup replaces the linear scan over the list of known pairs
            document_frequency[p] = document_frequency.get(p, 0) + 1
            
    # Now, we have the document frequencies needed for the IDF
    # List of [co-occurrence, document frequency of the co-occurrence], sorted by co-occurrence
    cooccurrence_document_frequency = sorted([p, df] for p, df in document_frequency.items())

    return cooccurrence_document_frequency, pairs_docs
       
       

In [None]:
cooccurrence_document_frequency, pairs_docs= R_TF_IDF_MI_weight_cooccu_WCS(vect1) # Reminder: vect1 holds the word-frequency vectors of the relevance_data documents for the first query

In [None]:
print(len(pairs_docs)*len(pairs_docs[0]))

In [None]:
print(cooccurrence_document_frequency)

In [None]:
print(pairs_docs)

In [115]:
def get_idf(cooccurrence,cooccurrence_document_frequency,number_of_documents) :
    for i,df in cooccurrence_document_frequency:
        assert(df != 0)
        if (cooccurrence == i): 
            # The idf is N/df, where N is the total number of documents and df the number of documents containing the co-occurrence
            return number_of_documents/df
        assert(df>0)

In [116]:
# WE NOW HAVE R, AS WELL AS THE TF, DF AND MI OF EACH WORD PAIR IN R FOR EACH DOC
# NEXT WE COMPUTE THE WEIGHT = TF*IDF*MI FOR EACH WORD PAIR IN EACH DOCUMENT
def weight(vectors):
    print("-- weighted document vectors")
    cooccurrence_document_frequency, pairs_docs = R_TF_IDF_MI_weight_cooccu_WCS(vectors)
        # R: all the final word pairs in all the documents (implicit in cooccurrence_document_frequency)
        # cooccurrence_document_frequency: for each pair in R, its DF stored as [pair, total] for the corpus of the query
        # pairs_docs: for each document, the final word pairs of R it contains, 
            # stored as tuples (pair, TF, MI)
    # Now, for each document and each word pair of that document, we compute the weight: 
    model =[]
    for document_index,document in enumerate(pairs_docs): # For each doc
        weight_i = {}
        for p,TF,MI in document:
            # We obtain the idf for the coocurrence
            IDF = get_idf(p,cooccurrence_document_frequency, len(pairs_docs))
            weight_i.update({p:math.log10(TF)*math.log10(IDF)*MI})
        model.append(weight_i)
    return model, cooccurrence_document_frequency, pairs_docs

In [None]:
we1, cooccurrence_document_frequency, pairs_docs = weight(vect1)

In [None]:
print(we1)

In [117]:
def get_relevant_cooccurrences(vq,cooccurrence_document_frequency,number_of_documents):
    relevant_cooccurrences_weighted = {}
    known_pairs = {e[0] for e in cooccurrence_document_frequency} # Built once instead of once per candidate pair
    for id1,freq1 in vq: # Read each (id, frequency) pair
        for id2,freq2 in vq:
            if (id1,id2) in known_pairs :
                TF = min(freq1,freq2)
                IDF = get_idf((id1,id2),cooccurrence_document_frequency, number_of_documents)
                MI = TF/max(freq1,freq2)
                # Note: log10(TF) is 0 when TF == 1, so pairs of words that appear once in the query get weight 0
                relevant_cooccurrences_weighted.update({(id1,id2) : math.log10(TF)*math.log10(IDF)*MI})
                
    return relevant_cooccurrences_weighted

In [None]:
print(get_relevant_cooccurrences([(63,2),(77,2)],cooccurrence_document_frequency,10))

In [None]:
# WE NOW HAVE THE WEIGHTS FOR THE RELEVANT WORD PAIRS; NEXT WE PROCESS THE QUERY AND EXTRACT ITS WORD PAIRS

def take_second(elem): # Used to sort the ranking by score
    return elem[1]

def launch_queryWCS(allData, q, number, f="null",  filename='wcs_vsm_docs.mm'):
    dictionary = create_dictionaryWCS(allData) # Build the dictionary
    
    vectors = docs2bowsWCS(allData, dictionary) # One row per document; each element of a row is (word id, freq)
    
    weight_docs, cooccurrence_document_frequency, pairs_docs = weight(vectors)
    print(list(weight_docs[0].keys())[0])
    print(dictionary[list(weight_docs[0].keys())[0][0]])
    print(dictionary[list(weight_docs[0].keys())[0][1]])
        # weight(vectors) computes:
            # R: the relevant pairs of the whole corpus
            # the weight of those pairs in each document
            # cooccurrence_document_frequency: the document frequency (used for the IDF) of each pair over the whole corpus
            # pairs_docs: the pairs of R that appear in each document
    print("-- preprocess_query")
    pq = preprocess_query(q) # Preprocess the query
    print("-- doc2bow_query")
    vq = dictionary.doc2bow(pq) # Map each word to its ID
    
    print("-- query_weighted_vector")
    # Obtain the relevant co-occurrences (the ones in R, implicit in cooccurrence_document_frequency) that are present in the query: 
    pairs_q = get_relevant_cooccurrences(vq,cooccurrence_document_frequency, len(pairs_docs))


    #print("pairs_q: ", pairs_q)
    
    print("-- launch the query")
    # For each document
    ranking = []
    for i,doc in enumerate(weight_docs) :
        # Find the cosine between the document and the query: i.e. the dot product of the vectors / mult. of the length of the vectors
        dot_product = 0
        document_vector_length = 0
        query_vector_length = 0
        for cooccurrence in doc.keys() :
            if cooccurrence in pairs_q.keys() :
                dot_product += doc[cooccurrence]*pairs_q[cooccurrence]
            document_vector_length += doc[cooccurrence]**2
        for cooccurrence in pairs_q.keys() :
            query_vector_length += pairs_q[cooccurrence]**2
        value = 0
        if query_vector_length == 0 or document_vector_length == 0 :
            value = 0.0
        else:
            value = abs(dot_product/(math.sqrt(document_vector_length)*math.sqrt(query_vector_length)))
        ranking.append((i,value))
    
    ranking.sort(key= lambda x: x[1], reverse=True)
    
    for i in range(min(10, len(ranking))) : # Avoids an IndexError when there are fewer than 10 documents
        print("[ Score = "+str(ranking[i][1])+" ] "+allData.iloc[ranking[i][0]].title)
        if f!="null":
            f.write(str(number)+" Q0 "+str(allData.iloc[ranking[i][0]].cord_uid)+" "+str(i+1)+" "+str(ranking[i][1])+" mySystem \n")
    """
    pos = 1
    for doc, score in ranking:
        if ( pos <=10 ): # First ten positions
            print("[ Score = "+str(ranking[i][1])+" ] "+allData.iloc[ranking[i][0]].title)
            if f!="null":
                f.write(str(number)+" Q0 "+str(allData.iloc[doc].cord_uid)+" "+str(pos)+" "+str(round(score,3))+" mySystem \n")
        else: 
            if f!="null":
                f.write(str(number)+" Q0 "+str(allData.iloc[doc].cord_uid)+" "+str(pos)+" "+str(round(score,3))+" mySystem \n")
            else:
                break
        pos += 1    """
    return ranking
            
    
'''   
    print("-- totalWeight")
    totalWeight = []
    for doc_number in range(len(allData)):  
        weight_doc = 0
        for pairsq in pairs_q:
            for w in weight_docs[doc_number]:
                p = w[0]
                i,j = p
                if pairsq == p or pairsq == (j,i):
                    weight_doc += w[1]
                    break                
        totalWeight.append((doc_number,weight_doc))
    #print("totalWeight: ", totalWeight)
    
    print("-- ranking")
    ranking = sorted(totalWeight, key =take_second,reverse=True )
    print("QUERY:",q)
    pos = 0
    for doc, score in ranking:
        """
        if(round(score,3)==0):
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title,"#########################"); 
        else:
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title); 
        """
        if ( pos <10 ): # First ten positions
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title); 
        else: 
            break
        pos += 1
    return ranking
    """              
    qtWeight = tWeight_Matrix[vq] # THE BUG STARTS HERE
    sim = index[qtWeight]
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    print("##############")
    print("QUERY:",q)
    pos = 0
    for doc, score in ranking:
        """"""
        if(round(score,3)==0):
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title,"#########################"); 
        else:
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title); 
        """"""
        if ( pos <10 ): # First ten positions
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title); 
        else: 
            break
        pos += 1
    return ranking"""
'''
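
The scoring step inside `launch_queryWCS` is a cosine similarity between two sparse vectors keyed by word pair. A standalone sketch of that step, assuming both vectors are plain dicts that map a pair of word ids to a weight, could look like this (the name `sparse_cosine` and the toy weights are hypothetical):

```python
import math


def sparse_cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product
    dot = sum(weight * vec_b[key] for key, weight in vec_a.items() if key in vec_b)
    norm_a = math.sqrt(sum(weight ** 2 for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight ** 2 for weight in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty or all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy document and query vectors keyed by (word id, word id)
doc = {(1, 2): 3.0, (1, 5): 4.0}
query = {(1, 2): 1.0}
print(sparse_cosine(doc, query))
```

Keeping this in a small function makes it easy to check the edge case in which the query contains no known word pair, where every document scores 0.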

Now we can launch any query we like against our newly created Information Retrieval engine.

We choose query 1.

In [None]:
# We select all cord_uid in relevance_data for the query 1
# A boolean mask avoids indexing the filtered Series by label
cords_relevance_data = set(relevance_data.loc[relevance_data.topic_id==1, "cord_uid"])

In [None]:
# We select the rows of dt whose cord_uid is also in relevance_data for the query 1
dt_rel_data = dt[dt.cord_uid.isin(cords_relevance_data)].reset_index(drop=True)

In [None]:
# Now we rank for the query 1
value = topics[1]
# rankWCS= launch_queryWCS(dt_rel_data, value["query"])
rank_WCS= launch_queryWCS(dt_rel_data, value["query"], 1)

#### Performance evaluation -- All Queries
First, we compute the ranking for every query.
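
Each ranking is written to a run file with one line per retrieved document, in the six-column layout `topic Q0 document rank score tag` that `launch_queryWCS` produces. The sketch below shows that layout in isolation; the helpers `format_run_line` and `run_lines` and the toy ids are assumptions made for illustration.

```python
def format_run_line(topic_id, doc_id, rank, score, tag="mySystem"):
    """One line of a TREC-style run file: topic, the literal Q0, document id, rank, score, run tag."""
    return f"{topic_id} Q0 {doc_id} {rank} {score} {tag}"


def run_lines(topic_id, ranking, doc_ids, depth=10):
    """ranking is a list of (row_index, score) sorted by descending score."""
    return [
        format_run_line(topic_id, doc_ids[row_index], position, score)
        for position, (row_index, score) in enumerate(ranking[:depth], start=1)
    ]


# Toy ranking over three documents with hypothetical ids
lines = run_lines(1, [(2, 0.9), (0, 0.4)], ["a", "b", "c"])
print("\n".join(lines))
```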

In [161]:
all_rankings_WCS=[]
# The with-block closes the file even if the loop is interrupted
with open("../queries/trecRun_WCS.txt","w") as f:
    for topic_id in range(1, len(topics)+1):
        # We select all cord_uid in relevance_data for the query number topic_id
        cords_relevance_data = set(relevance_data.loc[relevance_data.topic_id==topic_id, "cord_uid"])
        # We select the rows of dt whose cord_uid is also in relevance_data for the query number topic_id
        dt_rel_data = dt[dt.cord_uid.isin(cords_relevance_data)].reset_index(drop=True)
        # Now we rank for the query number topic_id
        print("##############")
        print("QUERY NUMBER ", topic_id)
        value = topics[topic_id]
        rank_WCS= launch_queryWCS(dt_rel_data, value["query"],topic_id,f)
        all_rankings_WCS.append(rank_WCS)


##############
QUERY NUMBER  1
-- create_dictionaryWCS
-- docs2bowsWCS


KeyboardInterrupt: 


In [None]:
# TREC_EVAL using document score
from trectools import TrecQrel, TrecRun, TrecEval
# A typical evaluation workflow

# We load a TrecRun object
r1 = TrecRun("../queries/trecRun_tfidf.txt")
#print(r1.topics()[:5]) # Shows the first 5 topics
r2 = TrecRun("../queries/trecRun_WCS.txt") # Same file name that was written when ranking all the queries
#print(r2.topics()[:5]) # Shows the first 5 topics


# We load a TrecQrel object
relevance_data_qrels = "../queries/relevance_judgements.txt"
qrels = TrecQrel(relevance_data_qrels)

te = TrecEval(r2, qrels)
map_result = te.get_map(trec_eval = True ) # The result is the same as trec_eval
ndcg_result = te.get_ndcg(trec_eval=True)     

print("MAP: %.3f, NDCG: %.3f" % (map_result, ndcg_result))


###

result_r1 = r1.evaluate_run(qrels, per_query=True) 
result_r2 = r2.evaluate_run(qrels, per_query=True)

# Inspect for statistically significant differences between the two runs for  P_10 using two-tailed Student t-test
pvalue = result_r1.compare_with(result_r2, metric="P_10")

print("P-value for wrt P@10 between r1 and r2: %.3f" % (pvalue[1]))

## 4. VSM combining the word co-occurrence implementation and the TF-IDF implementation

However, relying only on co-occurrences is a pitfall to avoid. Queries tend to be short, so co-occurrences will be infrequent in them. We therefore decided to linearly combine the ranking provided by the TF-IDF model with the one provided by the k-best co-occurrences model. 

$final\_ranking = \alpha * TFIDF\_ranking + (1-\alpha) * WCS\_ranking, \alpha \in [0,1]$

In other words, for each document in the ranking we compute a weighted sum (with weight $\alpha$) of its cosine similarities under the two ranking models.
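
As a small worked example of the formula, the sketch below combines two toy score lists. It assumes each ranking is a list of (document index, score) pairs and that a document missing from one ranking scores 0 there; the name `combine_scores` and the numbers are hypothetical.

```python
def combine_scores(scores_a, scores_b, alpha):
    """Return (doc, alpha * score_a + (1 - alpha) * score_b) pairs sorted by descending score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a = dict(scores_a)
    b = dict(scores_b)
    combined = [
        (doc, alpha * a.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0))
        for doc in set(a) | set(b)
    ]
    combined.sort(key=lambda item: item[1], reverse=True)
    return combined


# Toy cosine similarities from the two models
tfidf_scores = [(0, 0.8), (1, 0.2), (2, 0.5)]
wcs_scores = [(0, 0.0), (1, 0.9), (2, 0.5)]
final = combine_scores(tfidf_scores, wcs_scores, alpha=0.5)
print(final)
```

With $\alpha = 1$ the result reduces to the TF-IDF ranking and with $\alpha = 0$ to the co-occurrence ranking, so $\alpha$ can be tuned between the two extremes.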

In [None]:
# Launch the TF-IDF model for query 1
value = topics[1]
rank_tfidf= launch_query(dt_rel_data, value["query"])

In [None]:
print(rank_tfidf)

In [102]:
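# Linear combination function
def combine_ranks(rank1, rank2, alpha) :
    dict_rank = dict(rank1)
    rank_final = []
    for i,v in rank2 :
        # A document missing from rank1 contributes a score of 0 instead of raising a KeyError
        rank_final.append((i, alpha * dict_rank.get(i, 0.0) + (1-alpha)*v))
    rank_final.sort(key= lambda x: x[1], reverse=True)
    # Titles are looked up in the global dt_rel_data, so both rankings must index into it
    for i in range(min(10, len(rank_final))) :
        print("[ Score = "+str(rank_final[i][1])+" ] "+dt_rel_data.iloc[rank_final[i][0]].title)
    return rank_final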

In [103]:
combine_ranks(rank_tfidf, rank_WCS, alpha = 0.5)

NameError: name 'rank_tfidf' is not defined