# CORD-19 Updated Data Overview (13 May)

The 13 May data update makes some big changes to the CORD-19 dataset that Kagglers have been working with for several weeks.

This Notebook runs through some of the changes and shows you how to load and work with the data:

1. [**Metadata**](#Metadata)
2. [**Documents**](#Documents)
3. [**Tasks**](#Tasks)

In [None]:
import os
import pandas as pd
import json

# Increase max # of columns displayed when printing a DataFrame
pd.set_option('display.max_columns', 500)

In [None]:
INPUT_DIR = '../input/CORD-19-research-challenge'
TASK_DIR = os.path.join(INPUT_DIR, 'Kaggle')
DATA_DIR = os.path.join(INPUT_DIR, 'document_parses')

# Metadata

The metadata CSV is at `../input/CORD-19-research-challenge/metadata.csv`

In [None]:
# low_memory=False reads the file in one pass, which avoids the
# mixed-dtype warning raised for cols 1-4
meta = pd.read_csv(os.path.join(INPUT_DIR, 'metadata.csv'),
                   low_memory=False)

In [None]:
meta.shape

Fields are shown below. `title` and `abstract` are useful for quickly filtering by topic.

`pdf_json_files` holds the relative path(s) to the text extracted from the PDF, in JSON format.

`pmc_json_files` holds the relative path(s) to the enriched extract from [PubMed Central](https://en.wikipedia.org/wiki/PubMed_Central), also in JSON format.

In [None]:
for i, c in enumerate(meta.columns):
    print(f'{i:>2d} {c}')
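As a rough illustration of topic filtering on `title` and `abstract`, here is a minimal sketch. The helper name `filter_by_keyword` and the toy frame are assumptions made for the example; in the notebook you would pass `meta` instead.

```python
import pandas as pd


def filter_by_keyword(df, keyword, columns=('title', 'abstract')):
    """Return rows where any of the given text columns contains keyword (case-insensitive)."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        # na=False treats missing titles/abstracts as non-matches
        mask |= df[col].str.contains(keyword, case=False, na=False, regex=False)
    return df[mask]


# Toy stand-in for `meta`; the real frame has many more columns
toy_meta = pd.DataFrame({
    'title': ['Incubation period of SARS-CoV-2', 'Bat coronaviruses', None],
    'abstract': [None, 'We estimate the incubation period.', 'Vaccine trial results'],
})
incubation_papers = filter_by_keyword(toy_meta, 'incubation')
print(incubation_papers.index.tolist())
```

Plain substring matching (`regex=False`) is used so that keywords containing characters such as `(` or `+` do not need escaping.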

About half of the papers (~33.5k) have a PubMed Central (PMC) extract:

In [None]:
meta.pmc_json_files.notnull().value_counts()

~49k papers have PDF extracts:

In [None]:
meta.pdf_json_files.notnull().value_counts()
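The two counts above do not show how the extracts overlap. A minimal sketch of one way to check, using a toy frame with made-up file names in place of `meta`:

```python
import pandas as pd

# Toy stand-in for `meta`; the file names are hypothetical
toy_meta = pd.DataFrame({
    'pdf_json_files': ['a.json', None, 'c.json', None],
    'pmc_json_files': ['a.xml.json', 'b.xml.json', None, None],
})

has_pdf = toy_meta['pdf_json_files'].notnull()
has_pmc = toy_meta['pmc_json_files'].notnull()

# Count papers in each combination of PDF / PMC availability
coverage = {
    'both': int((has_pdf & has_pmc).sum()),
    'pdf_only': int((has_pdf & ~has_pmc).sum()),
    'pmc_only': int((~has_pdf & has_pmc).sum()),
    'neither': int((~has_pdf & ~has_pmc).sum()),
}
print(coverage)
```

Swapping `toy_meta` for `meta` should give the same breakdown for the full dataset.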

# Documents

Let's start by loading a single document referenced in the metadata:

In [None]:
meta.iloc[0]

In [None]:
meta.iloc[0].pmc_json_files

In [None]:
# Assumes the first row has a PMC extract; take the first path if several are listed
pmc_file = meta.iloc[0].pmc_json_files.split(';')[0]
with open(os.path.join(INPUT_DIR, pmc_file), 'rb') as f:
    pmc_extract = json.load(f)

You can now navigate this as a dictionary:

In [None]:
pmc_extract.keys()

In [None]:
pmc_extract['bib_entries']['BIBREF0']

Or convert it to a pandas Series for convenience (although some values will still be nested lists or dicts):

In [None]:
pmc_extract_df = pd.Series(pmc_extract)

In [None]:
pmc_extract_df

To load the full text of many papers into memory, you can loop over the metadata (this example uses only the first 1,000 papers with PDF extracts):

In [None]:
pdf_list = []
for pdf_file in meta[meta.pdf_json_files.notnull()].pdf_json_files[:1000]:
    # Some entries have multiple paths separated by ';'
    pdf_file = pdf_file.split(';')[0] # take the first one
    with open(os.path.join(INPUT_DIR, pdf_file), 'rb') as f:
        pdf_extract = json.load(f)
    pdf_list.append(pdf_extract)
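If you want every path from a multi-path cell rather than just the first, a small helper can do the splitting. This is a sketch only: `split_paths` is a hypothetical name, the example paths are invented, and it assumes that missing cells arrive as NaN (a float), which is how pandas usually represents them.

```python
def split_paths(value):
    """Split a ';'-separated metadata cell into a list of clean relative paths."""
    if not isinstance(value, str):
        # Missing cells come through pandas as float NaN
        return []
    return [part.strip() for part in value.split(';') if part.strip()]


# Hypothetical cell value with two paths
example = 'document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json'
print(split_paths(example))
print(split_paths(float('nan')))
```

Stripping whitespace matters for any path after the first, since the separator may be followed by a space.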

In [None]:
full_text_df = pd.DataFrame(pdf_list)

In [None]:
full_text_df.head()

Printing out an article shows how to access the text. We'll take the first article, `full_text_df.iloc[0]`, and print its relevant metadata and body text.

In [None]:
# First entry in the body_text list for first record
full_text_df.body_text[0][0]
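Each `body_text` entry is a dict with (at least) `section` and `text` keys, so paragraphs can be grouped by section. A minimal sketch on toy data follows; the helper name `text_by_section` is an assumption for the example.

```python
def text_by_section(body_text):
    """Group paragraph texts under their section heading, preserving first-seen order."""
    sections = {}
    for paragraph in body_text:
        sections.setdefault(paragraph.get('section', ''), []).append(paragraph['text'])
    return {name: '\n\n'.join(parts) for name, parts in sections.items()}


# Toy stand-in for full_text_df.body_text[0]
toy_body_text = [
    {'section': 'Introduction', 'text': 'First paragraph.'},
    {'section': 'Introduction', 'text': 'Second paragraph.'},
    {'section': 'Results', 'text': 'Third paragraph.'},
]
sections = text_by_section(toy_body_text)
print(list(sections))
```

Note that paragraphs sharing a section name are merged even when they are not adjacent, which may or may not be what you want for papers that reuse headings.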

In [None]:
# Useful libraries for printing formatted text into Notebooks
import html
from IPython.display import HTML, display

### Print out PDF extract

First we'll take a look at the PDF extract:

In [None]:
current_section = ''
output_html = ''
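# `sha` may list several hashes in one cell, so match by substring rather than equality
temp_meta = meta[meta.sha.str.contains(full_text_df.paper_id[0], na=False, regex=False)]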
title = temp_meta.title.values[0]
authors = temp_meta.authors.values[0]
doi = temp_meta.doi.values[0]
output_html += f'<h3>{html.escape(title)}</h3>'
output_html += f'{html.escape(authors)}<br>'
# Assumes `doi` holds a bare DOI, so prefix the resolver to get a working link
output_html += f'<a href="https://doi.org/{doi}">{doi}</a><br><br>'
for item in full_text_df.body_text[0]:
    section = item['section']
    if section != current_section:
        current_section = section
        output_html += f'<h4>{html.escape(section)}</h4><br>'
    output_html += html.escape(item['text'])
    output_html += '<br><br>'
display(HTML(output_html))

### Print out PMC extract

The PMC extract comes from PubMed Central and, in this case, is more accurate than the PDF extract. Note that `paper_id` in the PMC extracts joins with `pmcid` in metadata.csv, rather than with `sha`.

In [None]:
pmc_list = []
for pmc_file in meta[meta.pmc_json_files.notnull()].pmc_json_files[:1000]:
    # Some entries have multiple paths separated by ';'
    pmc_file = pmc_file.split(';')[0] # take the first one
    with open(os.path.join(INPUT_DIR, pmc_file), 'rb') as f:
        pmc_extract = json.load(f)
    pmc_list.append(pmc_extract)
full_pmc_df = pd.DataFrame(pmc_list)
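Because PMC extracts join on `pmcid` while PDF extracts join on `sha`, a single lookup helper can try both. The sketch below is illustrative only: `find_metadata` is a hypothetical name, the toy IDs are invented, and it assumes a `sha` cell may hold several hashes separated by `;`.

```python
import pandas as pd


def find_metadata(meta, paper_id):
    """Return metadata rows for a paper_id taken from either a PMC or a PDF extract."""
    by_pmcid = meta[meta['pmcid'] == paper_id]
    if not by_pmcid.empty:
        return by_pmcid
    # Assumption: `sha` may hold several ';'-separated hashes, so check each one
    sha_lists = meta['sha'].fillna('').str.split(';')
    mask = sha_lists.apply(lambda hashes: paper_id in [h.strip() for h in hashes])
    return meta[mask]


# Toy stand-in for `meta` with invented identifiers
toy_meta = pd.DataFrame({
    'pmcid': ['PMC1', None, 'PMC3'],
    'sha': ['abc', 'def; ghi', None],
    'title': ['Paper one', 'Paper two', 'Paper three'],
})
print(find_metadata(toy_meta, 'PMC3').title.tolist())
print(find_metadata(toy_meta, 'ghi').title.tolist())
```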

In [None]:
current_section = ''
output_html = ''
temp_meta = meta[meta.pmcid == full_pmc_df.paper_id[0]]
title = temp_meta.title.values[0]
authors = temp_meta.authors.values[0]
doi = temp_meta.doi.values[0]
output_html += f'<h3>{html.escape(title)}</h3>'
output_html += f'{html.escape(authors)}<br>'
# Assumes `doi` holds a bare DOI, so prefix the resolver to get a working link
output_html += f'<a href="https://doi.org/{doi}">{doi}</a><br><br>'
for item in full_pmc_df.body_text[0]:
    section = item['section']
    if section != current_section:
        current_section = section
        output_html += f'<h4>{html.escape(section)}</h4><br>'
    output_html += html.escape(item['text'])
    output_html += '<br><br>'
display(HTML(output_html))
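The PDF and PMC cells above share the same rendering logic, so it could be factored into one function. This is a sketch, not part of the original notebook: `render_article_html` is a hypothetical helper, the toy article is invented, and the link assumes `doi` is a bare DOI.

```python
import html


def render_article_html(title, authors, doi, body_text):
    """Build an HTML string with a heading per section and escaped paragraph text."""
    safe_doi = html.escape(str(doi))
    parts = [
        f'<h3>{html.escape(str(title))}</h3>',
        f'{html.escape(str(authors))}<br>',
        # Assumes a bare DOI, so it is prefixed with the doi.org resolver
        f'<a href="https://doi.org/{safe_doi}">{safe_doi}</a><br><br>',
    ]
    current_section = ''
    for item in body_text:
        section = item.get('section', '')
        if section != current_section:
            # Emit a heading only when the section changes
            current_section = section
            parts.append(f'<h4>{html.escape(section)}</h4>')
        parts.append(html.escape(item['text']) + '<br><br>')
    return ''.join(parts)


# Toy article standing in for one row of full_text_df / full_pmc_df
toy_body = [
    {'section': 'Introduction', 'text': 'Coronaviruses infect birds & mammals.'},
    {'section': 'Introduction', 'text': 'This is a second paragraph.'},
    {'section': 'Methods', 'text': 'We screened <100 samples.'},
]
page = render_article_html('A toy title', 'Doe, J.', '10.0000/example', toy_body)
print(page)
```

In a notebook, `display(HTML(page))` would render the result. Wrapping `title` and `authors` in `str()` also keeps the function from failing when those metadata fields are missing (NaN).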

# Tasks

Below is a quick run-through of the new task folder structure:

In [None]:
os.listdir(TASK_DIR)

In [None]:
sorted(os.listdir(os.path.join(TASK_DIR, 'target_tables')))
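To see more than one level of the folder at once, a small depth-limited walk can help. The sketch below builds a throwaway directory so it runs anywhere; `list_tree` is a hypothetical helper and the file names are invented. In the notebook you would call `list_tree(TASK_DIR)`.

```python
import os
import tempfile


def list_tree(root, max_depth=2):
    """Return relative paths under root, descending at most max_depth levels."""
    root = os.path.abspath(root)
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == '.' else rel.count(os.sep) + 1
        prefix = '' if rel == '.' else rel + os.sep
        entries.extend(prefix + d + os.sep for d in dirnames)
        entries.extend(prefix + f for f in sorted(filenames))
        if depth + 1 >= max_depth:
            dirnames[:] = []  # prune: do not descend any further
    return entries


# Build a small throwaway tree with invented names to demonstrate
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'target_tables', 'deep'))
    for path in (('readme.txt',), ('target_tables', 'a.csv'), ('target_tables', 'deep', 'x.csv')):
        with open(os.path.join(tmp, *path), 'w') as f:
            f.write('')
    tree = list_tree(tmp, max_depth=2)

print(tree)
```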