In [1]:
import pandas as pd
import json
import time
import xml.etree.ElementTree as ET

df = pd.read_csv("../data/metadata.csv")
print(df)

        cord_uid                                       sha  \
0       ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1       02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2       ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3       2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4       9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...          ...                                       ...   
192504  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
192505  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
192506  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
192507  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
192508  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                      source_x  \
0                          PMC   
1                          PMC   
2                          PMC   
3                          PMC   
4                          PMC   
...                        ...   
192504           



#### Preparing the dataset
We decided to work only with the papers in the PDF_JSON corpus. The first step is therefore to remove from the dataframe the rows that have no file in that folder, which reduces the number of examples from 192509 to 79755. There are still more documents in pdf_json (over 84000) than rows in the dataframe, because many PDFs in the corpus share the same cord_uid. Technically, the papers mapped to the same cord_uid are the same paper, differing only in the publication (if an article has been published by both Elsevier and Springer, it is mapped twice with the same cord_uid). Our approach is to consider just one of the documents associated with each cord_uid instead of the full list.

In [2]:
df = df[df.pdf_json_files.notnull()]
df = df.reset_index(drop = True)
print(df)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                     source_x  \
0                         PMC   
1                         PMC   
2                         PMC   
3                         PMC   
4                         PMC   
...                       ...   
79750            Medline; PMC   
797
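
As a minimal sketch of the "one document per cord_uid" idea, the snippet below works on a toy dataframe rather than on metadata.csv. The rows, the file paths and the helper columns `n_files` and `pdf_json_first` are made up for illustration; the only thing taken from the notebook is that `pdf_json_files` may hold several paths separated by `"; "`.

```python
import pandas as pd

# Toy stand-in for metadata.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "cord_uid": ["aaa", "bbb", "ccc", "ddd"],
    "pdf_json_files": [
        "pdf_json/x1.json; pdf_json/x2.json",  # same paper, two parsed PDFs
        "pdf_json/x3.json",
        None,                                   # no parsed PDF: row is dropped
        "pdf_json/y1.json",
    ],
})

# Keep only the rows that point at a parsed PDF, as in the cell above.
toy = toy[toy.pdf_json_files.notnull()].reset_index(drop=True)

# Count the files listed per row; this is why files can outnumber rows.
toy["n_files"] = toy["pdf_json_files"].str.split("; ").str.len()

# Keep just the first path of each "; "-separated list.
toy["pdf_json_first"] = toy["pdf_json_files"].str.split("; ").str[0]

print(toy[["cord_uid", "n_files", "pdf_json_first"]])
```

Taking the first path is an arbitrary choice; any other rule (for example the last path) would satisfy the same goal of one file per cord_uid.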

Next, we drop the columns that add no information to our information retrieval system (such as the license or the DOI) and that do not help to map each row of the dataframe to a document in the pdf_json corpus.

In [3]:
columns_to_delete = ["doi", "source_x", "pmcid", "pubmed_id", "license", "mag_id", "who_covidence_id", "arxiv_id", "pmc_json_files", "url", "s2_id"]
# df_original = df
df = df.drop(columns_to_delete, axis = 1)
print(df)

       cord_uid                                       sha  \
0      ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb   
1      02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d   
2      ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927   
3      2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605   
4      9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32   
...         ...                                       ...   
79750  z4ro6lmh  203f36475be74229101548475d68352b939f8b5b   
79751  hi8k8wvb  9f1bc99798e8823e690697394dcb23533a45c60e   
79752  ma3ndg41  ffba777376718ef2a0dd74a8eab90e2bfacd240f   
79753  wh10285j  d521c5a2dcbd79a5be606fcf586b1e0448344172   
79754  pnl9th2c  c047bf76813106d4fd586e49164e7feddfbe352f   

                                                   title  \
0      Clinical features of culture-proven Mycoplasma...   
1      Nitric oxide: a pro-inflammatory mediator in l...   
2        Surfactant protein-D and pulmonary host defense   
3                   Role of
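
If the set of columns in metadata.csv changes between releases of the dataset, the drop above would raise a `KeyError` for any missing name. A hedged variant, shown on a toy dataframe with invented values, uses the `errors="ignore"` option of `DataFrame.drop` so that absent columns are skipped.

```python
import pandas as pd

# Toy dataframe; the values are invented for illustration.
toy = pd.DataFrame({
    "cord_uid": ["aaa"],
    "title": ["some title"],
    "doi": ["10.0000/example"],
    "license": ["cc-by"],
})

# "s2_id" is deliberately absent from the toy dataframe.
columns_to_delete = ["doi", "license", "s2_id"]

# errors="ignore" skips names that are not present instead of raising.
toy = toy.drop(columns=columns_to_delete, errors="ignore")
print(list(toy.columns))
```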

#### Parse the pdf
The articles mapped to each row of the metadata (cord_uid) are stored separately as .json files. The following script loads one article per cord_uid into a Python dictionary and stores the resulting dictionaries in pdf_json_list.

##### This step is probably optional, because the metadata already gives us the title, authors and abstract (the three most important fields). Working with the full body of the articles may improve the results, but the computation time would increase considerably. Moreover, the title, author and abstract information is more reliable in the metadata, since it is provided directly by the journals, whereas the same fields in the pdf_json corpus are obtained automatically by parsing PDF to JSON and may contain errors.
##### No joke, this overflows my memory: RAM usage reached 94% with the following script (so, obviously, I could not finish running it).

In [4]:
'''
pdf_json_list = []
t0 = time.time()
for row in df.index :
    json_path = (df.loc[row]['pdf_json_files'].split('; '))[0]
    json_file = open("../data/"+json_path) 
    full_text_dict = json.load(json_file)
    pdf_json_list.append(full_text_dict)
t1 = time.time()
'''

'\npdf_json_list = []\nt0 = time.time()\nfor row in df.index :\n    json_path = (df.loc[row][\'pdf_json_files\'].split(\'; \'))[0]\n    json_file = open("../data/"+json_path) \n    full_text_dict = json.load(json_file)\n    pdf_json_list.append(full_text_dict)\nt1 = time.time()\n'
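
If the full text is needed later, one possible way around the memory problem is to avoid building the whole list and to yield one parsed article at a time. The generator below is only a sketch under that assumption: `iter_pdf_json` and its `data_dir` parameter are hypothetical names, not part of the notebook, and it keeps the notebook's rule of taking the first path of each `"; "`-separated entry.

```python
import json
import os


def iter_pdf_json(path_fields, data_dir="../data/"):
    """Yield one parsed article (a dict) per entry, never holding them all."""
    for path_field in path_fields:
        # Same rule as the disabled cell above: use the first listed file.
        first_path = path_field.split("; ")[0]
        # The context manager closes each file as soon as it has been parsed.
        with open(os.path.join(data_dir, first_path), encoding="utf-8") as json_file:
            yield json.load(json_file)


# Hypothetical usage: process articles one by one instead of storing them.
# for article in iter_pdf_json(df["pdf_json_files"]):
#     ...  # extract only the fields that are needed, then discard the dict
```

Only one article is held in memory at a time as long as the loop body does not keep references to the yielded dictionaries.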

Script to read the test queries (https://towardsdatascience.com/download-and-parse-trec-covid-data-8f9840686c37)

In [5]:
topics = {}
root = ET.parse("../queries/test_queries.xml").getroot()
for topic in root.findall("topic"):
    topic_number = int(topic.attrib["number"])
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text        
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text
print(topics[1].keys())

dict_keys(['query', 'question', 'narrative'])
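
To make the expected file layout explicit, here is a small self-contained sketch that parses an inline XML string with the same structure the loop above relies on (`topic` elements with a `number` attribute and `query`, `question` and `narrative` children). The sample text is a placeholder, not a real test query, and the dict comprehension is just a compact alternative to the nested loops.

```python
import xml.etree.ElementTree as ET

# Placeholder topic; the text is invented, only the structure matters.
SAMPLE = """
<topics>
  <topic number="1">
    <query>example query</query>
    <question>example question?</question>
    <narrative>example narrative.</narrative>
  </topic>
</topics>
"""

root = ET.fromstring(SAMPLE)

# findtext returns the text of the first matching child, or None if absent.
sample_topics = {
    int(topic.attrib["number"]): {
        field: topic.findtext(field)
        for field in ("query", "question", "narrative")
    }
    for topic in root.findall("topic")
}

print(sample_topics[1]["query"])
```

One difference from the loop version: a topic with a missing field gets `None` for that key here, whereas the loops would leave the key out.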


Script to read the relevance judgements (the information needed to evaluate our system). The round id is not needed and is therefore dropped.

In [8]:
relevance_data = pd.read_csv("../queries/relevance_judgements.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]
relevance_data = relevance_data.drop("round_id" ,axis = 1)
print(relevance_data)

       topic_id  cord_uid  relevancy
0             1  005b2j4b          2
1             1  00fmeepz          1
2             1  010vptx3          2
3             1  0194oljo          1
4             1  021q9884          1
...         ...       ...        ...
69313        50  zvop8bxh          2
69314        50  zwf26o63          1
69315        50  zwsvlnwe          0
69316        50  zxr01yln          1
69317        50  zz8wvos9          1

[69318 rows x 3 columns]
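
As a rough illustration of how these judgements could feed an evaluation, the sketch below groups the relevant documents per topic and computes precision at k for a ranked list. Everything here is an assumption made for the example: the toy rows are invented, `precision_at_k` is a hypothetical helper, and treating any `relevancy` greater than 0 as relevant is one possible convention, not something the notebook states.

```python
import pandas as pd

# Toy judgements with the same columns as relevance_data; values are invented.
toy_qrels = pd.DataFrame({
    "topic_id": [1, 1, 1, 2],
    "cord_uid": ["a", "b", "c", "a"],
    "relevancy": [2, 0, 1, 1],
})

# Assumption: a document counts as relevant when relevancy > 0.
relevant = (
    toy_qrels[toy_qrels["relevancy"] > 0]
    .groupby("topic_id")["cord_uid"]
    .apply(set)
    .to_dict()
)


def precision_at_k(ranked_uids, relevant_uids, k):
    """Fraction of the top-k retrieved cord_uids that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_uids[:k]
    return sum(uid in relevant_uids for uid in top_k) / k


# Hypothetical ranking returned by a retrieval system for topic 1.
print(precision_at_k(["a", "b", "c"], relevant[1], k=2))
```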


With the metadata (and optionally the pdf_json files), the test topics and the relevance judgements, we are ready to build and validate the system.