# Building and Evaluating a COVID-19 oriented Information Retrieval Engine


## 1. Introduction
In this work, we are going to building and evaluating the performance of several custom Information Retrieval engines. For that, we will use the TREC-COVID19 corpus to recover relevant documents related to SARS-COV-2 and COVID-10 disease queries. 

Firstly, we carry out a state of the art to explore several ways to do vector space models. 

Then, we start with the process of the data  to preparing the dataset, queries, and relevance judgement data. 

After that,  we create the first Vector Space Model implementation. This implementation is the simple version of one VSM. Here, one TF-IDF model is used and the VSM provides a document ranking in descending order of relevance due to the cosine measure.

Nextly, we implement the VSM based on word co-occurrences. This implementation takes into account the Mutual Information, the TF and IDF, of two word co-occurrences. And, at the same as Simple VSM, provides a document ranking.

Afterwards, we combine the two previous VSM to create one ranking through a linearly combination of the two VSM. 
 
In all Vector Space Models one performance evaluation is done. 

And, at the end of this work, one conclusion is carry out.

## 2. State of the art
Before selecting a model to implement, we decided to review the recent literature concerning vector space information retrieval systems. First, we will describe the paradigm which we are dealing with, since information retrieval is such a broad task. Then, we will briefly summarize some interesting articles. Note that these articles are sometimes quite specific in its building and in its application. We aim to generalize and to explain how such systems would work in a vector space-based information retrieval system

### 2.1 Supervised vs Unsupervised learning
This distinction belongs more to the paradigm of Machine Learning and not all that much to Information retrieval, but some parallelism in the way of building models can be stablished between these two techniques.

In supervised learning, we train a Machine Learning model using labelled data. The data is therefore explicitly characterized by that label and the model is trained into distinguish between labels using the attributes of the data. The most common tasks are the ones of classification and regression. In unsupervised learning, the data is not labelled and therefore implicitly characterized. Usually unsupervised learning aims towards forming clusters, exploring in that way the data.

Our base model is a case of unsupervised learning training, since we only train our vector space model using a set of documents, but those are not a priori labelled as relevant or not according to any query. It is noteworthy that the test may be considered as supervised, since we have a series of relevance judgement and some interesting metrics can be obtained following the TREC guidelines. In the literature, we can find also examples of supervised information retrieval systems, such as "cite", that uses the relevance judgment or a relevant/irrelevant label to train the model using documents and queries too.

### 2.2 Co-occurrence
Our base model constructs a vector space in which each document d_i is transformed into a vector, whose component t_j represents the j-th word of the corpus and the value v_j of t_j represents the TF IDF metric for the term t_j in document d_i. An input query is transformed into a vector and the cosine similarity between the documents d_i of the corpus and the query q is studied, so that the more relevant queries are the one that maximizes the cosine similarity measure.

While the model is straight-forward, precise and easy to understand, this model ignores the existing correlation between words. Many models develop a word co-occurrence to account for this correlation. Chen et al. (2020) proposes a vector space model for herbal prescription, an ambit in which we can find correlation between two herbs being recommended together. The proposed article first extracts the most relevant co-occurrences of the corpus, exploring exhaustively the search space of all possible co-occurrences and discarding the ones below a certain support (frequency) and confidence (mutual information) threshold. Then each document d_i is transformed into a vector where each component c_j represents the j co-occurrence of the set R of relevant co-occurencies. The value v_j of c_j can be computed as v_j = TF * IDF * MI, i.e. the product of the term frequency, inverse document frequency and mutual information of c_j.

Although co-occurrence models are somehow deprecated comparing its performance to novel approaches, we decided to implement an adaption of the model of Chen et al. (2020) due to (1) it's theoretical simplicity, (2) it's explainability, giving us keys on how and when the model works and (3) the no-need to use specialized and complex libraries.


### 2.3 Description logics
Description logics is a formal knowledge widely used in IA, since it provides a platform to express ontologies (the Web Ontology Language is based on Description Logics). It is similar to L1, but it's knowledge representation power is slightly more limited.

DL-VSM (Boukhari and Omri, 2020) uses a Description Logics inference and a Vector Space Model in parallel to extract the important concepts of a corpus. It also uses the MeSH thesaurus, that contains a set of biomedical concepts, characterized by a preference term (the term representing that concept) and a set of non-preference terms (alternative term to represent the concept). Each term is composed of a set of words. I.e. concept > term > word in an abstraction level.

1) The Vector Space Model calculates the weight of each word in each document, using the BM25 metric instead of TFIDF. Next, the term weight in MeSH thesaurus is computed similarly and the similarity between each document and each MeSH term (a set of words describing the term) to weight each MeSH term and, finally, compute the weight of each concept for each document.

2) Description logics is used to describe each document as a conjunction of relation represented_by(w_i), where w_i is a word of the document. Each concept of MeSH is represented as a conjunction of relations described_by(t_i) where t_i is a term that describes the concept. An inference is carried out to identify the concepts relevant of each document.

The intersection between the concepts obtained with both approaches defines our model. Given an input query (in which a concept is extracted), the score of each document is the sum of the weight its concept words present in the documents calculated in (1).


### 2.4 Case Based Reasoning
Although not applied in the field of Information retrieval but in the one of costumer preference analysis, the VSM-CBR (Ke et al., 2020) supposes an interesting approach to this problem. The proposed algorithm analyses the different customer demand (which is in text format) and store it in a vector space, where each demand is represented by a vector. A k-means clustering algorithms clusters these demands, exploratory analyse each one of them and study future demands. Although clustering has been previously used, this articles highlight the exploratory analysis of those cluster

The interesting philosophy of this work is the clustering part. Instead building a model with |D| documents, we can cluster them according to the stems frequency and build a model using |C| clusters, with |C|<|D|. Furthermore, an exploratory analysis can be performed to label each clusters with a set of concepts and use these concepts to retrieve information, instead of the words of the document, which have a higher cardinality. The computational charge will be diminished, but the downside is that documents in the same clusters will be scored with the same relevance.

### 2.5 Neural Vector Space Models.
Neural Information retrieval has arose a lot of enthusiasm concerning the efficiency and efficacy of Deep Learning approaches (Marchesin et al., 2020). In the following paragraph, we describe the intuition of two of those models

1) The Deep Relevance Matching Model (Guo et al., 2020) is a supervised IR system that looks for exact matching (opposed to partial matching with stemming). It groups each pair of terms in the query and looks for matching in the documents. For each document, the matching values are input into the feedforward network and the output is the score of the aforementioned document given the query. Since it is a supervised model, the net should be trained with sample queries and documents.

2) The Neural Vector Space Model (Gysel et al., 2020) is closer to the approach that we have to take in this work, since it creates a model without evaluating the results with the queries until the model is constructed. The general idea is to train an artificial neural network that optimizes a function that minimizes the distance between n-grams (co-occurrences of size n) and the documents. The optimization of the n-grams provides a closer and simplified representation of the documents, giving more weight to the words that are discriminative between each other. This give as a result a simplified vector space and, in order to perform the ranking, the queries are transformed to n-grams and the cosine similarity is studied between the n-gram (query) and each document.

## 3. Proccesing data 
Before starting to implement Vector Space Models, we need to prepare the dataset to obtain only relevant data. Also, we need to treat our queries to obtain only the name of the queries (query.text). And we need to recover relevant judgment data from one txt. So, first of all, we are going to process all of our data.

#### Implementation

First, we install and import libraries.

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import json
import time
import xml.etree.ElementTree as ET
import math

#### Preparing the dataset

We download the data for this work from https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-07-16.tar.gz

More information about this data can be found at
https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html

We choose the metadata.csv to start to create our data set. 

In [3]:
dt = pd.read_csv("../data/metadata.csv")
print(dt)

        cord_uid                                       sha  \
0       ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1       02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2       ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3       2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4       9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...          ...                                       ...   
192504  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
192505  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
192506  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
192507  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
192508  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                      source_x  \
0                          PMC   
1                          PMC   
2                          PMC   
3                          PMC   
4                          PMC   
...                        ...   
192504           

  interactivity=interactivity, compiler=compiler, result=result)


We decided just to work with the papers of the PDF_JSON corpus. Therefore, the first step is to delete from the data frame the elements that are not in that folder. The number of examples is reduced from 192509 to 79755. Still, there are more documents in the pdf_json than in the data frame (over 84000), because many documents (PDF) in the corpus have the same cord_uid. Technically, the papers mapped into the same cord_uid are the same one, but with differences in the publication (if one article has been published by Elsevier and Springer, it will be mapped twice with the same cord_uid). Our take on the problem will be to consider just one of the documents associated with one cord_uid, instead of the full_list.

In [4]:
dt = dt[dt.pdf_json_files.notnull()]
dt = dt.reset_index(drop = True)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                     source_x  \
0                         PMC   
1                         PMC   
2                         PMC   
3                         PMC   
4                         PMC   
...                       ...   
79750            Medline; PMC   
797

Next, we drop the columns that will not add information to our information retrieval system and that do not help to map each example of the data frame with a document in the pdf_json corpus.

In our opinion, these columns are doi, source_x, pmcid, pubmed_id, license, mag_id, who_covidence_id, arxiv_id, pmc_json_files, url, and s2_id. 

We think only need the columns cord_uid, sha, title, abstract, publish_time, authors, journal, and pdf_json_files. But, it is possible that we do not need all the columns for all the VSM.

In [5]:
columns_to_delete = ["doi", "source_x", "pmcid", "pubmed_id", "license", "mag_id", "who_covidence_id", "arxiv_id", "pmc_json_files", "url", "s2_id"]
# dt_original = dt
dt = dt.drop(columns_to_delete, axis = 1)
print(dt)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                                                   title  \
0      Clinical features of culture-proven Mycoplasma...   
1      Nitric oxide: a pro-inflammatory mediator in l...   
2        Surfactant protein-D and pulmonary host defense   
3                   Role of

In [6]:
print(dt.shape[0])

79755


In [7]:
# Document 1 
print(dt.iloc[0])

cord_uid                                                   ug7v899j
sha                        d1aafb70c066a2068b02786f8929fd9c900897fb
title             Clinical features of culture-proven Mycoplasma...
abstract          OBJECTIVE: This retrospective chart review des...
publish_time                                             2001-07-04
authors                         Madani, Tariq A; Al-Ghamdi, Aisha A
journal                                              BMC Infect Dis
pdf_json_files    document_parses/pdf_json/d1aafb70c066a2068b027...
Name: 0, dtype: object


In [8]:
# Document 1
print(dt.iloc[0].title)
print(dt.iloc[0].abstract)

Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia
OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35

#### Parse the test_queries file

To read the test queries, we use the script that appears at https://towardsdatascience.com/download-and-parse-trec-covid-data-8f9840686c37

The file test_queries.txt contains 50 queries (topics). Each query has three parts: query, question, and narrative. We will only need the query part. 

In [9]:
topics = {}
root = ET.parse("../queries/test_queries.xml").getroot()
for topic in root.findall("topic"):
    topic_number = int(topic.attrib["number"])
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text  # We only need the query part
        
print(topics[1].keys())

dict_keys(['query'])


In [10]:
# We show all queries. Topics is a python dictionary 
print(topics)

{1: {'query': 'coronavirus origin'}, 2: {'query': 'coronavirus response to weather changes'}, 3: {'query': 'coronavirus immunity'}, 4: {'query': 'how do people die from the coronavirus'}, 5: {'query': 'animal models of COVID-19'}, 6: {'query': 'coronavirus test rapid testing'}, 7: {'query': 'serological tests for coronavirus'}, 8: {'query': 'coronavirus under reporting'}, 9: {'query': 'coronavirus in Canada'}, 10: {'query': 'coronavirus social distancing impact'}, 11: {'query': 'coronavirus hospital rationing'}, 12: {'query': 'coronavirus quarantine'}, 13: {'query': 'how does coronavirus spread'}, 14: {'query': 'coronavirus super spreaders'}, 15: {'query': 'coronavirus outside body'}, 16: {'query': 'how long does coronavirus survive on surfaces'}, 17: {'query': 'coronavirus clinical trials'}, 18: {'query': 'masks prevent coronavirus'}, 19: {'query': 'what alcohol sanitizer kills coronavirus'}, 20: {'query': 'coronavirus and ACE inhibitors'}, 21: {'query': 'coronavirus mortality'}, 22: 

In [11]:
# We show only queries names
for key in topics: 
    value = topics[key]
    print(key, value["query"])

1 coronavirus origin
2 coronavirus response to weather changes
3 coronavirus immunity
4 how do people die from the coronavirus
5 animal models of COVID-19
6 coronavirus test rapid testing
7 serological tests for coronavirus
8 coronavirus under reporting
9 coronavirus in Canada
10 coronavirus social distancing impact
11 coronavirus hospital rationing
12 coronavirus quarantine
13 how does coronavirus spread
14 coronavirus super spreaders
15 coronavirus outside body
16 how long does coronavirus survive on surfaces
17 coronavirus clinical trials
18 masks prevent coronavirus
19 what alcohol sanitizer kills coronavirus
20 coronavirus and ACE inhibitors
21 coronavirus mortality
22 coronavirus heart impacts
23 coronavirus hypertension
24 coronavirus diabetes
25 coronavirus biomarkers
26 coronavirus early symptoms
27 coronavirus asymptomatic
28 coronavirus hydroxychloroquine
29 coronavirus drug repurposing
30 coronavirus remdesivir
31 difference between coronavirus and flu
32 coronavirus subtyp

#### Parse the relevance_judgements file
Here, we implement the script to read the relevance judgment. This, is the information needed to evaluate our system. For that, the round_id is not needed and it is therefore omitted. Also, relevancy (0, 1 or 2) is binarized (1 or 0).

In [12]:
relevance_data = pd.read_csv("../queries/relevance_judgements.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]
relevance_data = relevance_data.drop("round_id" ,axis = 1)
relevance_data['relevancy'] = relevance_data['relevancy'].replace([2],'1')
print(relevance_data)

       topic_id  cord_uid relevancy
0             1  005b2j4b         1
1             1  00fmeepz         1
2             1  010vptx3         1
3             1  0194oljo         1
4             1  021q9884         1
...         ...       ...       ...
69313        50  zvop8bxh         1
69314        50  zwf26o63         1
69315        50  zwsvlnwe         0
69316        50  zxr01yln         1
69317        50  zz8wvos9         1

[69318 rows x 3 columns]








With all the metadata (and optionally json_pdf), test topics, and relevance judgment, we are prepared to build and validate the system.

## 4. A simple VSM implementation
We have adapted the simple vector space model implementation for our code. 

Here, we use one TF-IDF model and the VSM provides a document ranking in descending order of relevance due to the cosine measure.

On the TF-IDF model, each pair-document is created using the weight:
$$w_{t,d}=log(1+tf_{t,d})xlog_{10}(N/df_{t});$$
with: 

- t = term or word
    
- d = document
    
- $tf_{t,d}$ = document frequency of t. It is the number of times that t occurs in d 
- $df_{t}$ = inverse document frequency.


#### Implementation

In this subsection, we define all the python functions that we will use to create the VSM to ranking the documents with the queries.

First, we install and import libraries.

In [13]:
# We first install the NLTK toolkit

In [14]:
pip install nltk 

Note: you may need to restart the kernel to use updated packages.


In [15]:
#We now install the gensim package

In [16]:
pip install gensim 

Note: you may need to restart the kernel to use updated packages.


In [17]:
# We also need to download the NLTK data bundle

In [18]:
import nltk 

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     C:\Users\e

[nltk_data]    |   Package timit is already up-to-date!
[nltk_data]    | Downloading package toolbox to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package toolbox is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package udhr to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     C:\U

True

In [19]:
# All the required software is now installed. 

In [20]:
# We now import the functions provided by NLTK to perform tokenizing considering punctuation signs.
from nltk.tokenize import wordpunct_tokenize, regexp_tokenize
# Next, we import required functions to filter-out stopwords for the English language.
from nltk.corpus import stopwords
# Now we import the function that implements the Porter's stemming algorithm.
from nltk.stem import PorterStemmer

The first step is aimed at preprocessing each document in the collection. We write a function that receives one document and returns a list containing all STEMS in the document whose associated token is longer than 2 characters and is NOT an (English) stopword.

In [21]:
def preprocess_document(doc): # Each doc is each dt row. We will only use title and abstract: dt.iloc[i].title and dt.iloc[i].abstract
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    if type(doc.title) != str and type(doc.abstract) != str: # For empty documents without title and abstract
        final = [""]
    else:
        if type(doc.title) == str: 
            tokens = wordpunct_tokenize(doc.title)
        if type(doc.abstract) == str:
            tokens.extend(wordpunct_tokenize(doc.abstract))
        # clean saves the words (in lower case) that are not included in stopset
        clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2 and "%" not in token]
        final = [stemmer.stem(word) for word in clean]
    return final

In [22]:
print(preprocess_document(dt.iloc[0])) # Print STEMS for the document 1
print(type(dt.iloc[0].title))

['clinic', 'featur', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'object', 'retrospect', 'chart', 'review', 'describ', 'epidemiolog', 'clinic', 'featur', 'patient', 'cultur', 'proven', 'mycoplasma', 'pneumonia', 'infect', 'king', 'abdulaziz', 'univers', 'hospit', 'jeddah', 'saudi', 'arabia', 'method', 'patient', 'posit', 'pneumonia', 'cultur', 'respiratori', 'specimen', 'januari', '1997', 'decemb', '1998', 'identifi', 'microbiolog', 'record', 'chart', 'patient', 'review', 'result', 'patient', 'identifi', 'requir', 'admiss', 'infect', 'commun', 'acquir', 'infect', 'affect', 'age', 'group', 'common', 'infant', 'pre', 'school', 'children', 'occur', 'year', 'round', 'common', 'fall', 'spring', 'three', 'quarter', 'patient', 'comorbid', 'twenti', 'four', 'isol', 'associ', 'pneumonia', 'upper', 'respiratori', 'tract', 'infect', 'bronchiol', 'cough', 'fever', 'malais', 'common', 'symptom', 'crepit', 'wheez', '

In [23]:
print(dt.iloc[29264])
print(dt.iloc[29264].title)
print(type(dt.iloc[29264].title))

cord_uid                                                   n06og3cw
sha                        8d35867e078939b7f20187322e41011cec8b8cb3
title                                                           NaN
abstract                                                        NaN
publish_time                                             2020-05-13
authors           De Coninck, David; d'Haenens, Leen; Matthijs, ...
journal                                               Public Health
pdf_json_files    document_parses/pdf_json/8d35867e078939b7f2018...
Name: 29264, dtype: object
nan
<class 'float'>


In [24]:
print(preprocess_document(dt.iloc[29264]))

['']


Just like we need a function to preprocess each document, we need a function to preprocess each query to returns a list containing all STEMS in the query.

In [25]:
def preprocess_query(q):
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = wordpunct_tokenize(q)
    # clean saves the words (in lower case) that are not included in stopset
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2] 
    final = [stemmer.stem(word) for word in clean]
    return final

Once we have the function to preprocess documents and queries, we need to implement a function to create a dictionary containing the mappings 
$$WORD\_ID \ ->\ WORD$$
This dictionary is required to create vector-based word representations.

In [26]:
from gensim import corpora

# Dictionary: Different words (STEMS) in the collection (CORPUS, all documents) with their IDs
def create_dictionary(docs):
    # List all pre-processing documents
    pdocs = [preprocess_document(docs.iloc[i]) for i in range(docs.shape[0])]
    # Build the dictionary with corpora (gensim)
    dictionary = corpora.Dictionary(pdocs)
    # Save in a file
    dictionary.save('simple_vsm.dict')
    return dictionary

Now, we have built the dictionary containing the vocabulary that we will use for indexing. 

Next, we will write a function that creates the bag of words-based representation for each document in the collection.

One bag of words model is a vector representation that contains for each vector the frequency of the words associate with its word.

In [27]:
def docs2bows(allData, dictionary):
    docs = [preprocess_document(allData.iloc[i]) for i in range(allData.shape[0])]
    # We obtain the set of frequencies for each term
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    corpora.MmCorpus.serialize('simple_vsm_docs.mm', vectors)
    return vectors

With the funcion docs2bows we have basically TF-weighted vectors. 

We now want to convert these vectors into their TF-IDF weighted counterparts. We need, however, to import the models module from Gensim.

In [None]:
##############################
##############################
###### COPIAR LO DE ABAJO#####
##############################
##############################

### Cluster-VSM
Inspiring ourselves in the article by Ke et al. (2020) we decided to implement a VSM model in which the results are documents are grouped in clusters (using the k-means algorithm) and then, the tfidf model is built using the centroids of these clusters instead of the document. This relaxes the computational charge of doing query, while adding computational charge to the model building. Also, we lost some precision in the document scoring, since all the documents in the same query will get the same score. However, the bigger the number of clusters, the better the precision in scoring, but it will take more time to launch the query.

In [41]:
from gensim import models
from sklearn.cluster import KMeans
import numpy as np

def create_TF_IDF_model_cluster(data, filename = 'simple_vsm_docs_CBR.mm'):
    number_of_clusters = 200
    dictionary = create_dictionary(data) # Create the dictionary
    vectors = docs2bows(data, dictionary) # We will not really use this function for this case because we will use gensim models.
    
    # Extra step of cluster-VSM. We perform k-means and use the centroids as the vectors of our vector space model
    matrix = []
    for vector in vectors :
        matrix_row = [0]*len(dictionary)
        for i in vector :
            matrix_row[i[0]] = i[1]
        matrix.append(matrix_row)
    matrix = np.array(matrix)
    kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(matrix)
    cluster_centers = kmeans.cluster_centers_
    cluster_centers_vectors = []
    for row in cluster_centers :
        doc_vector = []
        for j,element in enumerate(row) :
            if element > 0 :
                doc_vector.append((j,element))
        cluster_centers_vectors.append(doc_vector)
    cluster_asignments = [[]]*number_of_clusters
    for i,cluster in enumerate(kmeans.labels_) :
        cluster_asignments[cluster].append(i)
            
    
    corpora.MmCorpus.serialize(filename, cluster_centers_vectors)
    loaded_allData = corpora.MmCorpus(filename)
    print(loaded_allData)
    tfidf = models.TfidfModel(loaded_allData)
    print("tfidf", tfidf)
    return tfidf, dictionary, cluster_asignments

Ideally, one would want to build the model using all the data. However, the k-means implementation is not optimized for a 80000x10000 matrix and it would work. Again, we are going to consider for each query it's document of its relevance judgment. This is not what one would normally do, but for purposes of exemplification and show the functioning, we consider it a small price to pay.

In [42]:
# Let us now create the TF-IDF model.
# ** THIS IS NOT NECESSARY. IT IS ONLY FOR TESTING. THIS TAKE PLENTY OF TIME.
#tfidfm,dictionary,clusters_asignments = create_TF_IDF_model(dt)
#print(tfidfm)

We finally create a function that given the dt and the topics provides a document ranking, according to the cosine measure, sorted in descending order of relevance (the better documents at the beginning). The function returns the ranking for all the documents but only shows the ten first positions. 

In [43]:
from operator import itemgetter
from gensim import similarities
 
    
def launch_query_cluster(allData, q, number,tfidf, dictionary, cluster_asignments, filename = 'simple_vsm_docs_CBR.mm', f="null"):
    #tfidf, dictionary, cluster_asignments = create_TF_IDF_model(allData) 
    loaded_allData = corpora.MmCorpus(filename)
    index = similarities.MatrixSimilarity(loaded_allData, num_features=len(dictionary))
    pq = preprocess_query(q)
    vq = dictionary.doc2bow(pq)
    qtfidf = tfidf[vq]
    sim = index[qtfidf]
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    ranking_real = []
    
    # We do a ranking with the documents of each cluster. Documents in the same cluster will receive the same score
    for i in ranking :
        cluster = i[0]
        ranking_real = ranking_real + [(e,i[1]) for e in cluster_asignments[cluster]]
        
        
    print("QUERY:",q)
    """
    
    for i in range(0,10) :
        print("[ Score = "+str(ranking[i][1])+" ] "+allData.iloc[ranking[i][0]].title)
        if f!="null":
            f.write(str(number)+" Q0 "+str(allData.iloc[ranking[i][0]].cord_uid)+" "+str(i+1)+" "+str(ranking[i][1])+" mySystem \n")
    """
    pos = 1
    for doc, score in ranking_real:
        if ( pos <=10 ): # First ten positions
            print("[ Score = " + "%.3f" % round(score,3) + " ] " + allData.iloc[doc].title); 
            if f!="null":
                f.write(str(number)+" Q0 "+str(allData.iloc[doc].cord_uid)+" "+str(pos)+" "+str(round(score,3))+" mySystem \n")
        else: 
            if f!="null":
                f.write(str(number)+" Q0 "+str(allData.iloc[doc].cord_uid)+" "+str(pos)+" "+str(round(score,3))+" mySystem \n")
            else:
                break
        pos += 1
    return ranking_real

And now we can launch any query we see fit to our newly created Information Retrieval engine.

To do a little test, we choose the query 1: "coronavirus origin".

We know that this program takes plenty of time if we use all the dt (pdf_json) dataset, to avoid that, we have decided to choose only the documents that appear in relevance_data (relevance_judgements.txt). With this strategy we can use the VSM in an acceptable time, and, also, we can evaluate our ranking with the relevance_data. 

In [45]:
import time
all_rankings_cluster = []
for topi in range(len(topics)):
    # We select all cord_uid in relevance_data for the query number topi
    cords_relevance_data = set()
    aux = 0
    for i in range(topi+1):
        aux += relevance_data.loc[relevance_data.topic_id==(i)].shape[0]
    for i in range(relevance_data.loc[relevance_data.topic_id==(topi+1)].shape[0]):
        cords_relevance_data.add(relevance_data.loc[relevance_data.topic_id==(topi+1)].cord_uid[aux+i])
    # We select all cord_uid that are in both (relevance_data and dt) for the query number topi
    dt_rel_data = pd.DataFrame()
    aux = 0
    for i in range(dt.cord_uid.shape[0]):
        if dt.cord_uid[i] in cords_relevance_data:
            dt_rel_data[aux] = dt.iloc[i]
            aux +=1
    dt_rel_data = dt_rel_data.T
    # Now we rank for the query number topi
    print("##############")
    print("QUERY NUMBER ", (topi+1))
    value = topics[(topi+1)]
    tfidfm,dictionary,clusters_asignments = create_TF_IDF_model_cluster(dt_rel_data)
    t0 = time.time()
    rank_cluster = launch_query_cluster(dt_rel_data, value["query"], topi+1,tfidfm, dictionary, clusters_asignments)
    t1 = time.time()
    all_rankings_cluster.append(rank_cluster)
    print(str(t1-t0)+"seconds")

'''
for k in range(1,51) :
    # We select all cord_uid in relevance_data for the query k
    cords_relevance_data = set()
    for i in range(relevance_data.loc[relevance_data.topic_id==k].shape[0]):
        cords_relevance_data.add(relevance_data.loc[relevance_data.topic_id==k].cord_uid[i])

    # We select all cord_uid that are in both (relevance_data and dt) for the query 1
    dt_rel_data = pd.DataFrame()
    aux = 0

    for i in range(dt.cord_uid.shape[0]):
        #print(dt.cord_uid[i])
        if dt.cord_uid[i] in cords_relevance_data:
            #print(dt.cord_uid[i])
            dt_rel_data[aux] = dt.iloc[i]
            aux +=1
    dt_rel_data = dt_rel_data.T
    
    value = topics[k]
    tfidfm,dictionary,clusters_asignments = create_TF_IDF_model(dt_rel_data)
    t0 = time.time()
    rank = launch_query(dt_rel_data, value["query"], 1,tfidfm, dictionary, clusters_asignments)
    t1 = time.time()
    print(str(t1-t0)+"seconds")
'''

##############
QUERY NUMBER  1
MmCorpus(200 documents, 8416 features, 48616 non-zero entries)
tfidf TfidfModel(num_docs=200, num_nnz=48616)
QUERY: coronavirus origin
[ Score = 0.508 ] NSs Encoded by Groundnut Bud Necrosis Virus Is a Bifunctional Enzyme
[ Score = 0.508 ] Comparative Efficacy of Hemagglutinin, Nucleoprotein, and Matrix 2 Protein Gene-Based Vaccination against H5N1 Influenza in Mouse and Ferret
[ Score = 0.508 ] Elevation of Intact and Proteolytic Fragments of Acute Phase Proteins Constitutes the Earliest Systemic Antiviral Response in HIV-1 Infection
[ Score = 0.508 ] Large-scale evolutionary surveillance of the 2009 H1N1 influenza A virus using resequencing arrays
[ Score = 0.508 ] Angiotensin-converting enzyme 2 autoantibodies: further evidence for a role of the renin-angiotensin system in inflammation
[ Score = 0.508 ] Diagnostic value of triggering receptor expressed on myeloid cells-1 and C-reactive protein for patients with lung infiltrates: an observational study


KeyboardInterrupt: 