In [46]:
!pip install pandas tqdm

Collecting tqdm
  Downloading https://files.pythonhosted.org/packages/4a/1c/6359be64e8301b84160f6f6f7936bbfaaa5e9a4eab6cbc681db07600b949/tqdm-4.45.0-py2.py3-none-any.whl (60kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.45.0
You are using pip version 19.0.3, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


# CORD-19 References

This notebook explores the quality of the references contained in the COVID-19 Open Research Dataset Challenge.
The questions it tries to answer are:

- How many papers does the dataset contain?
- How many papers in the dataset have references?
- How many references does each paper make?
- How many of those references point to papers inside the dataset, and how many point outside it?

In [6]:
# First, load the dataset.
# The dataset must be extracted into a `datasets` folder.

import pandas as pd


DATASET_FOLDER_PATH = "./datasets/CORD-19-research-challenge"
metadata_df = pd.read_csv(f"{DATASET_FOLDER_PATH}/metadata.csv", index_col="cord_uid")
metadata_df.head()

cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263.0,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001.0,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
le0ogx1s,,PMC,A new recruit for the army of the men of death,10.1186/gb-2003-4-7-113,PMC193621,12844350.0,no-cc,"The army of the men of death, in John Bunyan's...",2003-06-27,"Petsko, Gregory A",Genome Biol,,,False,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
fy4w7xz8,0104f6ceccf92ae8567a0102f89cbb976969a774,PMC,Association of HLA class I with severe acute r...,10.1186/1471-2350-4-9,PMC212558,12969506.0,no-cc,BACKGROUND: The human leukocyte antigen (HLA) ...,2003-09-12,"Lin, Marie; Tseng, Hsiang-Kuang; Trejaut, Jean...",BMC Med Genet,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
0qaoam29,5b68a553a7cbbea13472721cd1ad617d42b40c26,PMC,A double epidemic model for the SARS propagation,10.1186/1471-2334-3-19,PMC222908,12964944.0,no-cc,BACKGROUND: An epidemic of a Severe Acute Resp...,2003-09-10,"Ng, Tuen Wai; Turinici, Gabriel; Danchin, Antoine",BMC Infect Dis,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...


In [8]:
# Check the columns and the number of records in the metadata file
metadata_df.columns, len(metadata_df)

(Index(['sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license',
        'abstract', 'publish_time', 'authors', 'journal',
        'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse',
        'has_pmc_xml_parse', 'full_text_file', 'url'],
       dtype='object'),
 51078)

**There are 51,078 records**, each one representing a paper.
Besides the unique identifier `cord_uid`, the file has the following columns:

- `sha`: hash of the PDF
- `source_x`: source repository, e.g. biorxiv, Elsevier, etc.
- `title`: title of the paper
- `doi`, `pmcid`, `pubmed_id`, `Microsoft Academic Paper ID`, `WHO #Covidence`: identifiers associated with the paper
- `license`: usage license
- `abstract`: abstract in natural language
- `publish_time`: publication date
- `journal`: name of the journal, if the paper was published in one
- `authors`: authors in natural language
- `has_pdf_parse`: whether the paper's PDF has been parsed
- `has_pmc_xml_parse`: whether the paper's PubMed Central (PMC) XML has been parsed
- `full_text_file`: judging by the sample rows, the name of the subset that holds the paper's full text (e.g. `custom_license`, which matches a dataset folder name)
- `url`: link to the paper

Next, we review the folder and file structure of the dataset.
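
Because several identifier columns coexist, it can help to see how many rows actually fill each one. The following is a minimal sketch on a toy frame: the three rows are invented for illustration and are not taken from the dataset, and `identifier_coverage` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Identifier columns to inspect (a subset of the metadata columns)
ID_COLUMNS = ["sha", "doi", "pmcid", "pubmed_id"]


def identifier_coverage(df, columns=ID_COLUMNS):
    """Return the number of non-null values in each identifier column."""
    return df[columns].notna().sum()


# Toy rows, invented for illustration only
toy_df = pd.DataFrame(
    {
        "sha": ["a1", None, "c3"],
        "doi": ["10.1/x", "10.1/y", None],
        "pmcid": ["PMC1", "PMC2", "PMC3"],
        "pubmed_id": [None, None, 123.0],
    },
    index=pd.Index(["u1", "u2", "u3"], name="cord_uid"),
)

coverage = identifier_coverage(toy_df)
print(coverage)
```

On the real data, the same call would be `identifier_coverage(metadata_df)`.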

In [37]:
import json
import random
import os

def _sample_json_file(path):
    """Return the top-level keys of a randomly chosen JSON file in `path`."""
    for _, _, file_names in os.walk(path):
        sample_file_name = random.choice(file_names)
        file_path = os.path.join(path, sample_file_name)
        with open(file_path) as file:
            contents = json.load(file)
        return list(contents.keys())

def walk_dataset():
    """Print the file count of each leaf JSON folder and return the total."""
    cum_sum = 0
    for root, folders, files in os.walk(DATASET_FOLDER_PATH):
        num_folders = len(folders)
        num_files = len(files)
        if "json" in root and num_folders == 0 and num_files > 0:
            cum_sum += num_files
            print(f"{root}: {num_files} files")
    return cum_sum

walk_dataset()

./datasets/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json: 1625 files
./datasets/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pdf_json: 2490 files
./datasets/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pmc_json: 2217 files
./datasets/CORD-19-research-challenge/custom_license/custom_license/pdf_json: 26505 files
./datasets/CORD-19-research-challenge/custom_license/custom_license/pmc_json: 7802 files
./datasets/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json: 9524 files
./datasets/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pmc_json: 9148 files


59311

There are 59,311 files, which is 8,233 more than the number of records in the metadata CSV.
Most likely, some records in the metadata CSV have more than one file.

According to information found in the Kaggle community, there are at least two recommended procedures for loading the data.
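
One way to probe this hypothesis is to estimate, from the metadata alone, how many JSON files each row should have: one per hash listed in `sha` when `has_pdf_parse` is true, plus one when `has_pmc_xml_parse` is true. The sketch below assumes that multiple hashes in `sha` are separated by `"; "`. `expected_file_count` is a hypothetical helper, and the toy rows are invented for illustration.

```python
import pandas as pd


def expected_file_count(row):
    """Estimate how many JSON files a metadata row should have."""
    count = 0
    # One PDF parse per hash (assumption: hashes are separated by "; ")
    if row["has_pdf_parse"] and pd.notna(row["sha"]):
        count += len(row["sha"].split("; "))
    # At most one PMC parse per row
    if row["has_pmc_xml_parse"]:
        count += 1
    return count


# Toy rows, invented for illustration only
toy_df = pd.DataFrame(
    {
        "sha": ["a1", "b2; b3", None],
        "has_pdf_parse": [True, True, False],
        "has_pmc_xml_parse": [True, False, True],
    }
)

counts = toy_df.apply(expected_file_count, axis=1)
rows_with_several_files = int((counts > 1).sum())
print(counts.tolist(), rows_with_several_files)
```

If the sum of these estimates over `metadata_df` came close to the number of files on disk, that would support the explanation above.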

```python
# First procedure: start from the metadata rows.
# `pmc_files` and `pdf_files` map a PMC id / PDF sha to its JSON path.
def find_file_from_row(row, pmc_files, pdf_files):
    pmc_id = row["pmcid"]
    if pmc_id in pmc_files:
        return pmc_files[pmc_id]
    # pmcid is null or has no file: fall back to the PDF parse
    sha = row["sha"]
    if sha in pdf_files:
        return pdf_files[sha]
    return None

# Second procedure: start from the PDF files.
# `rows_by_sha` maps a PDF sha to its metadata row.
def find_file_from_pdf(sha, pdf_path, rows_by_sha, pmc_files):
    if sha not in rows_by_sha:
        return None  # PDF without a metadata row: skip it
    pmc_id = rows_by_sha[sha]["pmcid"]
    if pmc_id in pmc_files:
        return pmc_files[pmc_id]
    return pdf_path
```

In [38]:
# 1. Build the dictionaries of PMC files and PDF files
# 2. Iterate over the metadata file

def get_id_paths_dicts(cls):
    """
    Build a dictionary whose keys are the identifiers of one class of
    papers (PDF or PMC) and whose values are the paths to the
    associated files.
    """
    all_files = {}
    for root, folders, files in os.walk(DATASET_FOLDER_PATH):
        num_folders = len(folders)
        num_files = len(files)
        if cls in root and num_folders == 0 and num_files > 0:
            for file_name in files:
                _id = file_name.split(".")[0]
                all_files[_id] = os.path.join(root, file_name)
    return all_files

In [42]:
pdf_dict = get_id_paths_dicts("pdf")
pmc_dict = get_id_paths_dicts("pmc")
len(pdf_dict), len(pmc_dict)

(40144, 19167)

The dataset contains:

- 40,144 JSON files parsed from PDFs
- 19,167 JSON files parsed from PMC

Based on the above, the first procedure will be used to load the papers.

In [111]:
from tqdm.notebook import tqdm


class BasePaper:
    def __init__(self, metadata_row, file_path):
        self._metadata_row = metadata_row
        self._file_path = file_path
        self._file_contents = self._load_json_contents(file_path)
        
        self._title = metadata_row["title"]
        self._authors = metadata_row["authors"]
        self._publish_time = metadata_row["publish_time"]
        self._abstract = metadata_row["abstract"]
        self._bib_entries = self._file_contents["bib_entries"]
    
    @staticmethod
    def _load_json_contents(path):
        with open(path) as file:
            contents = json.load(file)
        return contents

    @property
    def title(self):
        return self._title

    @property
    def authors(self):
        return self._authors

    @property
    def publish_time(self):
        return self._publish_time

    @property
    def abstract(self):
        return self._abstract

    @property
    def bib_entries(self):
        # Taken from the JSON file; the metadata row has no such column
        return self._bib_entries
    
    
class PDFPaper(BasePaper):
    pass
        

class PMCPaper(BasePaper):
    pass

def load_papers(metadata_df):
    papers = []
    not_found = []
    # Pass `total` so the progress bar can show how far along it is
    for idx, row in tqdm(metadata_df.iterrows(), total=len(metadata_df)):
        pmc_id = row["pmcid"]
        shas = row["sha"]
        paper = None
        
        if pmc_id in pmc_dict:
            pmc_path = pmc_dict[pmc_id]
            paper = PMCPaper(row, pmc_path)
            
        if paper is None and pd.notna(shas):
            shas_splitted = shas.split("; ")
            for sha in shas_splitted:
                if sha in pdf_dict:
                    pdf_path = pdf_dict[sha]
                    paper = PDFPaper(row, pdf_path)
                    break
        
        if paper is None and (row["has_pdf_parse"] or row["has_pmc_xml_parse"]):
            not_found.append(idx)
        if paper is not None:
            papers.append(paper)
            
    return papers, not_found

papers, not_found = load_papers(metadata_df)
len(papers), len(not_found)


(38882, 0)

In [96]:
mask_1 = metadata_df["has_pdf_parse"]
mask_2 = metadata_df["has_pmc_xml_parse"]
(mask_1 | mask_2).sum()

38882

We are able to locate the JSON files of 38,882 records of the metadata file.
This is consistent with the fact that only 38,882 records of the metadata file have a parsed PDF or PMC.
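
Equal counts do not by themselves prove that the two sets of rows coincide. A stricter check compares identifiers rather than totals. The sketch below uses toy data; `loaded_ids` is a hypothetical stand-in for the `cord_uid` values of the loaded papers, which the notebook does not currently store.

```python
import pandas as pd

# Toy metadata, invented for illustration only
toy_df = pd.DataFrame(
    {
        "has_pdf_parse": [True, False, False],
        "has_pmc_xml_parse": [False, True, False],
    },
    index=pd.Index(["u1", "u2", "u3"], name="cord_uid"),
)

# Hypothetical: identifiers of the papers that were actually loaded
loaded_ids = {"u1", "u2"}

# Identifiers of the rows that claim to have a parse
parsed_mask = toy_df["has_pdf_parse"] | toy_df["has_pmc_xml_parse"]
parsed_ids = set(toy_df.loc[parsed_mask].index)

missing = parsed_ids - loaded_ids      # parsed according to metadata, not loaded
unexpected = loaded_ids - parsed_ids   # loaded, but not flagged as parsed
print(missing, unexpected)
```

Both sets being empty means the loaded papers are exactly the rows flagged as parsed.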

In [114]:
paper = papers[0]
paper.title, paper.authors, paper.abstract

('Airborne rhinovirus detection and effect of ultraviolet irradiation on detection by a semi-nested RT-PCR assay',
 'Myatt, Theodore A; Johnston, Sebastian L; Rudnick, Stephen; Milton, Donald K',
 'BACKGROUND: Rhinovirus, the most common cause of upper respiratory tract infections, has been implicated in asthma exacerbations and possibly asthma deaths. Although the method of transmission of rhinoviruses is disputed, several studies have demonstrated that aerosol transmission is a likely method of transmission among adults. As a first step in studies of possible airborne rhinovirus transmission, we developed methods to detect aerosolized rhinovirus by extending existing technology for detecting infectious agents in nasal specimens. METHODS: We aerosolized rhinovirus in a small aerosol chamber. Experiments were conducted with decreasing concentrations of rhinovirus. To determine the effect of UV irradiation on detection of rhinoviral aerosols, we also conducted experiments in which we ex

# TODO
- Match the natural-language titles of the references against the titles of the papers.
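
A possible starting point for this matching is exact comparison after normalizing titles. This is only a sketch under stated assumptions: `bib_entries` is assumed to be a dictionary mapping a reference id to a dictionary with a `"title"` key, the helpers `normalize_title` and `split_references` are hypothetical, and the reference entries are invented. Exact matching will miss typos and truncated titles, and the normalization drops non-ASCII characters, so a fuzzy comparison may be needed later.

```python
import re


def normalize_title(title):
    """Lowercase, replace non-alphanumeric runs with a space, and trim."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def split_references(bib_entries, known_titles):
    """Split reference ids into those inside and outside the dataset."""
    inside, outside = [], []
    for ref_id, entry in bib_entries.items():
        title = normalize_title(entry.get("title") or "")
        if title and title in known_titles:
            inside.append(ref_id)
        else:
            outside.append(ref_id)
    return inside, outside


# Two titles from the metadata sample shown earlier
dataset_titles = [
    "A double epidemic model for the SARS propagation",
    "Discovering human history from stomach bacteria",
]
known = {normalize_title(t) for t in dataset_titles}

# Invented reference entries (assumed structure)
toy_bib = {
    "BIBREF0": {"title": "A Double Epidemic Model for the SARS Propagation."},
    "BIBREF1": {"title": "Some paper that is not in the dataset"},
    "BIBREF2": {"title": ""},
}

inside, outside = split_references(toy_bib, known)
print(inside, outside)
```

With the real data, `known` would be built from `metadata_df["title"].dropna()` and `toy_bib` replaced by each paper's `bib_entries`.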