# Exploring CORD-19 References

This notebook explores how much information we can obtain from the citation references of the papers in the COVID-19 Open Research Dataset Challenge.

The goals are:

- Load the dataset.
    - How many papers can we parse?
- Work out how to extract the references.
- Count how many references are also in the CORD-19 dataset.
- Count how many fall outside it.

Remember to install requirements by running:

```bash
$ pip install -r requirements.txt
```

In [None]:
import pandas as pd
from fastprogress.fastprogress import progress_bar

from pathlib import Path

import json
import random
import os



## Loading the CORD-19 dataset

In [None]:
cord19_dataset_folder = "./datasets/CORD-19-research-challenge"

In [None]:
if Path(cord19_dataset_folder).exists():
    print('Good to go')
else:
    print(f'{cord19_dataset_folder} does not exist! Download it using 00_download.ipynb.')

Good to go


Loading `metadata.csv` file as a pandas `DataFrame`.

In [None]:
metadata_df = pd.read_csv(f"{cord19_dataset_folder}/metadata.csv", index_col="cord_uid")

What does the metadata look like?

In [None]:
file_in_metadata_count = len(metadata_df)

In [None]:
f'Total records loaded: {file_in_metadata_count}'

'Total records loaded: 57366'

In [None]:
metadata_df.columns

Index(['sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license',
       'abstract', 'publish_time', 'authors', 'journal',
       'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse',
       'has_pmc_xml_parse', 'full_text_file', 'url'],
      dtype='object')

In [None]:
metadata_df.head(3)

cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
zjufx4fo,b2897e1277f56641193a6db73825f707eed3e4c9,PMC,Sequence requirements for RNA strand transfer ...,10.1093/emboj/20.24.7220,PMC125340,11742998.0,unk,Nidovirus subgenomic mRNAs contain a leader se...,2001-12-17,"Pasternak, Alexander O.; van den Born, Erwin; ...",The EMBO Journal,,,True,True,custom_license,http://europepmc.org/articles/pmc125340?pdf=re...
ymceytj3,e3d0d482ebd9a8ba81c254cc433f314142e72174,PMC,"Crystal structure of murine sCEACAM1a[1,4]: a ...",10.1093/emboj/21.9.2076,PMC125375,11980704.0,unk,CEACAM1 is a member of the carcinoembryonic an...,2002-05-01,"Tan, Kemin; Zelus, Bruce D.; Meijers, Rob; Liu...",The EMBO Journal,,,True,True,custom_license,http://europepmc.org/articles/pmc125375?pdf=re...
wzj2glte,00b1d99e70f779eb4ede50059db469c65e8c1469,PMC,Synthesis of a novel hepatitis C virus protein...,10.1093/emboj/20.14.3840,PMC125543,11447125.0,no-cc,Hepatitis C virus (HCV) is an important human ...,2001-07-16,"Xu, Zhenming; Choi, Jinah; Yen, T.S.Benedict; ...",EMBO J,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...


Each paper is represented by a unique id called `cord_uid`.
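
The uniqueness claim is cheap to check before relying on it. The following is a minimal sketch, assuming the ids are available as any iterable of strings (for example `metadata_df.index`); `find_duplicate_ids` is a hypothetical helper and the ids in the example are made up.

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return a dict of ids that appear more than once, with their counts."""
    counts = Counter(ids)
    return {uid: n for uid, n in counts.items() if n > 1}

# Toy ids for illustration only; with the real data, pass metadata_df.index
sample_ids = ["id_a", "id_b", "id_c", "id_b"]
print(find_duplicate_ids(sample_ids))
```

An empty result means every id is unique.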

Meaning of columns:

- `sha`: PDF file hash
- `source_x`: source repository, e.g. Biorxiv, Elsevier, etc.
- `title`: paper title
- `doi`, `pmcid`, `pubmed_id`, `Microsoft Academic Paper ID`, `WHO #Covidence`: other document ids
- `license`: usage license
- `abstract`: plain text abstract
- `publish_time`: publish date
- `journal`: academic journal of publication, if applicable
- `authors`: authors in plain text
- `has_pdf_parse`: if PDF parsing is available
- `has_pmc_xml_parse`: if PubMed XML is available
- `full_text_file`: pointer to the source file in the dataset
- `url`: URL to paper online source

Next, we review the folder and file structure of the dataset.

In [None]:
def _sample_json_file(path):
    for _, _, file_names in os.walk(path):
        sample_file_name = random.choice(file_names)
        file_path = os.path.join(path, sample_file_name)
        with open(file_path) as file:
            contents = json.load(file)
        return list(contents.keys())

In [None]:
def walk_dataset():
    cum_sum = 0
    for root, folders, files in os.walk(cord19_dataset_folder):
        num_folders = len(folders)
        num_files = len(files)
        if "json" in root and num_folders == 0 and num_files > 0:
            cum_sum += num_files
            print(f"{root}: {num_files} files")
    return cum_sum

In [None]:
source_file_count = walk_dataset()

./datasets/CORD-19-research-challenge/custom_license/custom_license/pmc_json: 10615 files
./datasets/CORD-19-research-challenge/custom_license/custom_license/pdf_json: 31376 files
./datasets/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pmc_json: 2258 files
./datasets/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pdf_json: 2518 files
./datasets/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json: 2278 files
./datasets/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pmc_json: 9390 files
./datasets/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json: 9769 files


In [None]:
print(f'Files in metadata: {file_in_metadata_count}, source files: {source_file_count}')

Files in metadata: 57366, source files: 68204


In [None]:
print(f'We have {source_file_count-file_in_metadata_count} files without metadata.')

We have 10838 files without metadata.


The walk finds 68,204 files, 10,838 more than the number of records in the metadata CSV.
Most likely, some records in the metadata CSV have more than one file.
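
One way to probe this hypothesis is to count the records whose `sha` field lists several hashes. This is a minimal sketch, assuming multiple hashes are joined with `"; "` (the separator that `load_papers` splits on further down); `count_multi_file_records` is a hypothetical helper and the values are toy data.

```python
def count_multi_file_records(sha_values, sep="; "):
    """Count records whose sha field holds more than one hash."""
    return sum(
        1
        for value in sha_values
        if isinstance(value, str) and sep in value  # skips NaN / missing values
    )

# Toy values for illustration; with the real data, pass metadata_df["sha"]
sample = ["aaa", "bbb; ccc", float("nan"), "ddd; eee; fff"]
print(count_multi_file_records(sample))  # 2
```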

According to information found in the Kaggle community, there are at least two recommended procedures for loading the data.

```python
# Hypothetical helper names; pmc_files and pdf_files are assumed to map
# file ids (PMC id or PDF sha) to JSON paths.

# First procedure: start from each metadata row
def file_for_row(row, pmc_files, pdf_files):
    pmc_id = row["pmcid"]
    if pmc_id in pmc_files:
        return pmc_files[pmc_id]
    # pmcid is null or has no file: fall back to the PDF parse
    sha = row["sha"]
    if sha in pdf_files:
        return pdf_files[sha]
    return None

# Second procedure: start from each PDF file
def file_for_pdf(sha, pdf_path, rows_by_sha, pmc_files):
    if sha not in rows_by_sha:
        return None  # no metadata for this PDF: skip it
    row = rows_by_sha[sha]  # metadata row with the matching sha
    pmc_id = row["pmcid"]
    if pmc_id in pmc_files:
        return pmc_files[pmc_id]
    return pdf_path
```

In [None]:
# 1. Build dictionaries of PMC files and PDF files
# 2. Iterate over the metadata file

def get_id_paths_dicts(cls):
    """
    Build a dictionary whose keys are the identifiers of one class of
    papers (PDF or PMC) and whose values are the paths to the associated
    files.
    """
    all_files = {}
    for root, folders, files in os.walk(cord19_dataset_folder):
        num_folders = len(folders)
        num_files = len(files)
        if cls in root and num_folders == 0 and num_files > 0:
            for file_name in files:
                _id = file_name.split(".")[0]
                all_files[_id] = os.path.join(root, file_name)
    return all_files
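
`get_id_paths_dicts` keys each file by the part of its name before the first dot. The sketch below isolates that step; `file_id` is a hypothetical helper, and the two file-name shapes (`<sha>.json` and `<pmcid>.xml.json`) are assumptions about how the dataset names its files.

```python
def file_id(file_name):
    """Mirror the key derivation in get_id_paths_dicts: text before the first dot."""
    return file_name.split(".")[0]

# Assumed file-name shapes, for illustration only
print(file_id("b2897e1277f56641193a6db73825f707eed3e4c9.json"))
print(file_id("PMC125340.xml.json"))
```

Because everything after the first dot is discarded, two files of the same class that differ only in their extensions would map to the same key, and the later one would overwrite the earlier.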

In [None]:
pdf_dict = get_id_paths_dicts("pdf")
pmc_dict = get_id_paths_dicts("pmc")
len(pdf_dict), len(pmc_dict)

(45941, 22263)

The dataset contains:

- 45,941 PDF files with their corresponding JSONs
- 22,263 PMC files with their corresponding JSONs

Based on the information above, we follow the first procedure to load the papers.

In [None]:
class BasePaper:
    def __init__(self, metadata_row, file_path):
        self._metadata_row = metadata_row
        self._file_path = file_path
        self._file_contents = self._load_json_contents(file_path)
        
        self._referenced_by = []
        self._references = []
        
    @staticmethod
    def _load_json_contents(path):
        with open(path) as file:
            contents = json.load(file)
        return contents

    @property
    def title(self):
        return self._metadata_row["title"]
        
    @property
    def authors(self):
        return self._metadata_row["authors"]
        
    @property
    def publish_time(self):
        return self._metadata_row["publish_time"]
        
    @property
    def abstract(self):
        return self._metadata_row["abstract"]
        
    @property
    def bib_entries(self):
        return self._file_contents["bib_entries"]
    
    def register_reference(self, reference):
        self._references.append(reference)
        reference.register_referenced(self)
    
    def register_referenced(self, referenced):
        self._referenced_by.append(referenced)

In [None]:
class PDFPaper(BasePaper):
    pass

In [None]:
class PMCPaper(BasePaper):
    pass

In [None]:
def load_papers(metadata_df):
    papers = []
    not_found = []
    for idx, row in progress_bar(metadata_df.iterrows(), total=len(metadata_df)):
        pmc_id = row["pmcid"]
        shas = row["sha"]
        paper = None
        
        if pmc_id in pmc_dict:
            pmc_path = pmc_dict[pmc_id]
            paper = PMCPaper(row, pmc_path)
            
        if paper is None and pd.notna(shas):
            shas_splitted = shas.split("; ")
            for sha in shas_splitted:
                if sha in pdf_dict:
                    pdf_path = pdf_dict[sha]
                    paper = PDFPaper(row, pdf_path)
                    break
        
        if paper is None and (row["has_pdf_parse"] or row["has_pmc_xml_parse"]):
            not_found.append(idx)
        if paper is not None:
            papers.append(paper)
            
    return papers, not_found

In [None]:
papers, not_found = load_papers(metadata_df)

In [None]:
print(f'{len(papers)} papers found, {len(not_found)} not found.')

In [None]:
mask_1 = metadata_df["has_pdf_parse"]
mask_2 = metadata_df["has_pmc_xml_parse"]
print(f'There are {(mask_1 | mask_2).sum()} files with either parsed PDF or PMC XML.')

We are able to locate the JSON files for 38,882 records of the metadata file.
This is consistent with the fact that only 38,882 records of the metadata file have a parsed PDF or PMC XML.

In [None]:
paper = papers[0]
print('Title:', paper.title)
print('Authors:', paper.authors)
print(paper.abstract)

# Matching references to titles
Next, we match the natural-language titles in each paper's references against the titles of the papers in the dataset.
To do so, we build a dictionary whose keys are paper titles and whose values are the corresponding paper instances.
Then, to check whether a reference exists in the dataset, we test whether its title is a key in the dictionary.
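
The lookup can be illustrated on toy data. The sketch below is not the notebook's exact method: the notebook only lowercases titles, while `normalize_title` (a hypothetical helper) also strips punctuation and collapses whitespace, which is an assumption about what might improve the match rate. The titles and the `"paper-1"` style values are invented stand-ins for paper instances.

```python
import re

def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())

# Toy index: values stand in for paper instances
papers_by_title = {
    normalize_title("Coronavirus Replication: A Review."): "paper-1",
    normalize_title("Viral RNA synthesis"): "paper-2",
}

reference_titles = ["coronavirus replication - a review", "An unrelated title"]
for ref_title in reference_titles:
    # dict.get returns None when the reference is not in the dataset
    match = papers_by_title.get(normalize_title(ref_title))
    print(ref_title, "->", match)
```

Exact-key lookup keeps the matching linear in the number of references, at the cost of missing titles that differ by more than formatting.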

In [None]:
# Build the dictionary that maps lowercased titles to papers
paper_titles = {}
for paper in papers:
    title = paper.title
    if not isinstance(title, str):
        # Report papers whose title is not a string and skip them
        print(f"Skipping non-string title {title!r}")
        print(paper.authors, paper.abstract)
        continue
    paper_titles[title.lower()] = paper

For some reason, some titles are encoded as *floats*.
It will be important to check whether this is due to a programming error or to a problem in the dataset. One likely cause is that pandas reads empty CSV cells as NaN, which is a float.
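
The following sketch reproduces the symptom without the dataset, assuming the float titles are NaN values standing in for missing cells; `safe_lower` is a hypothetical helper.

```python
missing_title = float("nan")  # stand-in for the value of an empty CSV cell

print(isinstance(missing_title, float))
try:
    missing_title.lower()
except AttributeError as err:
    # Floats have no .lower() method, hence the error seen when indexing titles
    print("AttributeError:", err)

def safe_lower(title):
    """Return the lowercased title, or None when the title is not a string."""
    return title.lower() if isinstance(title, str) else None

print(safe_lower("Sequence Requirements"), safe_lower(missing_title))
```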

In [None]:
num_processed_refs = 0
num_succesfully_processed_refs = 0
for paper in progress_bar(papers):
    for _, ref in paper.bib_entries.items():
        ref_title = ref["title"].lower()
        if ref_title in paper_titles:
            paper.register_reference(paper_titles[ref_title])
            num_succesfully_processed_refs += 1
    num_processed_refs += len(paper.bib_entries)
num_processed_refs, num_succesfully_processed_refs

In [None]:
num_processed_refs, num_succesfully_processed_refs

Approximately 6.26% of the references point to papers inside the dataset.
The remaining 93.74% of the referenced papers are not in the dataset.
One strategy to mitigate this is to create special nodes for the papers outside the dataset, which better preserves the structure of the graph.
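
A minimal sketch of the special-node idea, using plain tuples rather than the notebook's paper classes. The `("internal", ...)` and `("external", ...)` node labels, the `build_edges` helper, and the toy titles are all assumptions made for illustration.

```python
def build_edges(bib_titles_by_paper, known_titles):
    """Return (source, target) edges; unknown targets become 'external' nodes."""
    edges = []
    for source_title, ref_titles in bib_titles_by_paper.items():
        source = ("internal", source_title.lower())
        for ref_title in ref_titles:
            key = ref_title.lower()
            kind = "internal" if key in known_titles else "external"
            edges.append((source, (kind, key)))
    return edges

# Toy data: two papers in the dataset, one cited paper outside it
known = {"paper a", "paper b"}
bibs = {"Paper A": ["Paper B", "Outside Paper"], "Paper B": ["Outside Paper"]}
for edge in build_edges(bibs, known):
    print(edge)
```

Both citations of "Outside Paper" end at the same external node, so its in-degree is preserved even though its full text is unavailable.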

In addition, other questions need to be answered:

- How many papers are referenced at least once within the dataset?
- Which are the most referenced papers in the dataset? Is there a long tail?
- Which are the most referenced papers outside the dataset? Is there a long tail?
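
The last question can be approached by tallying reference titles that do not match any paper in the dataset. This is a sketch under the same lowercase-title matching assumption used in this notebook; `count_external_references` is a hypothetical helper and the titles are toy data.

```python
from collections import Counter

def count_external_references(bib_titles_by_paper, known_titles):
    """Tally lowercased reference titles that are absent from the dataset."""
    counts = Counter()
    for ref_titles in bib_titles_by_paper.values():
        for ref_title in ref_titles:
            key = ref_title.lower()
            if key and key not in known_titles:  # ignore empty titles
                counts[key] += 1
    return counts

known = {"paper a"}
bibs = {
    "p1": ["Paper A", "Classic Study", ""],
    "p2": ["Classic Study", "Another Study"],
}
external = count_external_references(bibs, known)
print(external.most_common(1))
```

With the real data, the titles would come from each paper's `bib_entries`, and `known_titles` would be the keys of `paper_titles`.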

In [None]:
num_papers = len(papers)
num_references = 0
num_referenced_by = 0
for paper in papers:
    if len(paper._references) > 0:
        num_references += 1
    if len(paper._referenced_by) > 0:
        num_referenced_by += 1
num_papers, num_references, num_referenced_by

Of the 38,882 papers processed, at least 25,321 (65.12%) have at least one correctly linked reference, and at least 17,824 (45.84%) are referenced at least once.

In [None]:
papers_sorted = sorted(papers, key=lambda p: len(p._referenced_by), reverse=True)

In [None]:
def display_paper(paper):
    if isinstance(paper, list):
        for elem in paper:
            display_paper(elem)
            print("\n", end="")
    else:
        print(f"""Title: {paper.title}
Authors: {paper.authors}
Publish time: {paper.publish_time}
Linked references: {len(paper._references)}
Linked referenced by: {len(paper._referenced_by)}
Abstract: {paper.abstract}""")

In [None]:
# Most cited paper within the dataset
display_paper(papers_sorted[0])

In [None]:
display_paper(papers_sorted[0]._references)

Note that the dataset includes older papers, published years before the current pandemic.
Next, we look at the distribution of the number of references.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
num_references = sorted([len(paper._references) for paper in papers], reverse=True)
num_referenced_by = sorted([len(paper._referenced_by) for paper in papers], reverse=True)
fig = plt.figure(figsize=(16, 8))
ax = sns.lineplot(y=num_references, x=range(len(papers)))
ax.set(title="Correctly linked references made by each paper", yscale="log")
pass

In [None]:
fig = plt.figure(figsize=(16, 8))
ax = sns.lineplot(y=num_referenced_by, x=range(len(papers)))
ax.set(title="Correctly linked references received by each paper", yscale="log")
pass

As with many real-world phenomena, the number of correctly linked references per paper follows a heavy-tailed distribution.
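
One simple way to put a number on the heavy tail is the share of all links held by the top fraction of papers. The sketch below assumes a plain list of per-paper link counts; `top_share` is a hypothetical helper and the counts are toy values.

```python
def top_share(counts, fraction=0.1):
    """Fraction of all links held by the top `fraction` of papers."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    k = max(1, int(len(ordered) * fraction))  # always include at least one paper
    return sum(ordered[:k]) / total

# Toy counts; with the real data, use [len(p._referenced_by) for p in papers]
sample_counts = [50, 20, 5, 3, 2, 1, 1, 1, 0, 0]
print(f"Top 10% share: {top_share(sample_counts, 0.1):.2f}")
```

A value far above the chosen fraction indicates that a few papers concentrate most of the links.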

# PageRank

Next, we compute the PageRank score of each paper.
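
For intuition about what the score measures, here is a simplified power-iteration sketch in plain Python. It is not the networkx implementation used in this notebook: the `pagerank_sketch` name, the fixed iteration count, and the toy edges are assumptions, and the damping factor of 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def pagerank_sketch(edges, damping=0.85, iterations=100):
    """Approximate PageRank over (source, target) citation edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank

# Toy citation graph: a and b cite c, and c cites d
scores = pagerank_sketch([("a", "c"), ("b", "c"), ("c", "d")])
for node, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 3))
```

A paper scores highly when it is cited by papers that are themselves highly scored, which is why older foundational works tend to rise to the top.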

In [None]:
# Ref.: https://networkx.github.io/documentation/stable/index.html
!pip install networkx

In [None]:
import networkx as nx

def build_graph(papers):
    G = nx.DiGraph()

    # First add the nodes
    for paper in papers:
        G.add_node(paper)
    
    # Then, add the links
    for paper in papers:
        for referenced_paper in paper._references:
            G.add_edge(paper, referenced_paper)
    
    return G

In [None]:
G = build_graph(papers)
G

In [None]:
G.number_of_nodes(), G.number_of_edges()

In [None]:
pr = nx.pagerank(G)
pr

In [None]:
sorted_pr = {k: v for k, v in sorted(pr.items(), key=lambda item: item[1], reverse=True)}
sorted_pr

In [None]:
display_paper(list(sorted_pr.keys())[10:20])

Here we can see that the papers with the highest PageRank are heavily cited and are generally works from past decades that probably form the foundation of current research against COVID-19.

Next, we show the distribution of the PageRank scores.

In [None]:
import numpy as np

pr_values = np.array(list(sorted_pr.values()))

# Remove outliers: keep values within max_std standard deviations of the mean
pr_mean = np.mean(pr_values)
pr_std = np.std(pr_values)
pr_distance = abs(pr_values - pr_mean)
max_std = 1.5
pr_not_outlier = pr_distance < max_std * pr_std
pr_no_outliers = pr_values[pr_not_outlier]

len(pr_values), len(pr_no_outliers), len(pr_values) - len(pr_no_outliers)

In [None]:
fig = plt.figure(figsize=(16, 8))
ax = sns.distplot(pr_values, kde=False, rug=True)
ax.set(title="PageRank distribution of CORD-19 papers (with outliers)", yscale="log")
pass

In [None]:
fig = plt.figure(figsize=(16, 8))
ax = sns.distplot(pr_no_outliers, kde=False, rug=True)
ax.set(title="PageRank distribution of CORD-19 papers (without outliers)", yscale="log")
pass