## About this notebook

In this notebook, I quickly explore the `biorxiv` subset of the papers. Because the papers are stored as nested JSON, their structure is too complex to analyze directly. I therefore explore the structure of those files and provide the following helper functions, which format the inner dictionaries of each file:
* `format_name(author)`
* `format_affiliation(affiliation)`
* `format_authors(authors, with_affiliation=False)`
* `format_body(body_text)`
* `format_bib(bibs)`

Feel free to reuse those functions for your own purpose! If you do, please leave a link to this notebook.

Throughout the EDA, I show you how to use each of those functions. At the end, I show you how to generate a clean version of the `biorxiv` subset as well as all the other datasets, which you can use directly by choosing this notebook as a data source ("File" -> "Add or upload data" -> "Kernel Output File" tab -> search for the name of this notebook).

### Update Log

* V9: First release.
* V10: Updated paths to include the [14k new papers](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137474).

In [1]:
import os
import json
from pprint import pprint
from copy import deepcopy

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

## Helper Functions

Unhide the cell below to find the definition of the following functions:
* `format_name(author)`
* `format_affiliation(affiliation)`
* `format_authors(authors, with_affiliation=False)`
* `format_body(body_text)`
* `format_bib(bibs)`

In [2]:
def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])


def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)
    
    return ", ".join(name_ls)

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body

def format_bib(bibs):
    if isinstance(bibs, dict):
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)
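To make the behavior of these helpers concrete, here is a minimal sketch that runs them on a tiny hand-written input. The author records, the location field names, and the section texts are hypothetical and only mimic the shape of the real files. The helpers are repeated in condensed form so that the snippet runs on its own.

```python
# Condensed copies of the helpers above, repeated so this snippet is standalone.
def format_name(author):
    return " ".join([author['first'], *author['middle'], author['last']])

def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(location.values())
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    names = []
    for author in authors:
        name = format_name(author)
        affiliation = format_affiliation(author['affiliation']) if with_affiliation else ""
        names.append(f"{name} ({affiliation})" if affiliation else name)
    return ", ".join(names)

def format_body(body_text):
    sections = {}
    for block in body_text:
        sections[block['section']] = sections.get(block['section'], "") + block['text']
    return "".join(f"{section}\n\n{text}\n\n" for section, text in sections.items())

# Hypothetical records, invented for illustration only.
authors = [
    {'first': 'Ada', 'middle': ['M'], 'last': 'Example',
     'affiliation': {'institution': 'Example University',
                     'location': {'settlement': 'Springfield', 'country': 'Nowhere'}}},
    {'first': 'Bo', 'middle': [], 'last': 'Sample', 'affiliation': {}},
]
body_text = [
    {'section': 'Introduction', 'text': 'First paragraph.'},
    {'section': 'Methods', 'text': 'How it was done.'},
    {'section': 'Introduction', 'text': 'Second paragraph.'},
]

plain = format_authors(authors)
detailed = format_authors(authors, with_affiliation=True)
body = format_body(body_text)

print(plain)       # Ada M Example, Bo Sample
print(detailed)    # Ada M Example (Example University, Springfield, Nowhere), Bo Sample
print(repr(body))
```

Note that `format_body` groups paragraphs by section name in order of first appearance, and joins paragraphs of the same section without any separator, so the two "Introduction" paragraphs above end up as `First paragraph.Second paragraph.`.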

Unhide the cell below to find the definition of the following functions:
* `load_files(dirname)`
* `generate_clean_df(all_files)`

In [3]:
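def load_files(dirname):
    filenames = os.listdir(dirname)
    raw_files = []

    for filename in tqdm(filenames):
        # os.path.join works whether or not dirname ends with a slash
        path = os.path.join(dirname, filename)
        # A context manager closes each file handle once it has been parsed
        with open(path, 'rb') as f:
            raw_files.append(json.load(f))
    
    return raw_files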

def generate_clean_df(all_files):
    cleaned_files = []
    
    for file in tqdm(all_files):
        features = [
            file['paper_id'],
            file['metadata']['title'],
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'], 
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            file['metadata']['authors'],
            file['bib_entries']
        ]

        cleaned_files.append(features)

    col_names = ['paper_id', 'title', 'authors',
                 'affiliations', 'abstract', 'text', 
                 'bibliography','raw_authors','raw_bibliography']

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
    
    return clean_df

## Biorxiv: Exploration

Let's first take a quick glance at the `biorxiv` subset of the data. We will also use this opportunity to load all of the JSON files into a list of **nested** dictionaries (each `dict` is an article).

In [4]:
biorxiv_dir = '/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/'
filenames = os.listdir(biorxiv_dir)
print("Number of articles retrieved from biorxiv:", len(filenames))

Number of articles retrieved from biorxiv: 1053


In [5]:
all_files = []

for filename in filenames:
    path = os.path.join(biorxiv_dir, filename)
    # Close each file handle once its JSON has been parsed
    with open(path, 'rb') as f:
        all_files.append(json.load(f))

In [6]:
file = all_files[0]
print("Dictionary keys:", file.keys())

Dictionary keys: dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])
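Since each article is a nested dictionary, printing only the top-level keys hides most of the structure. Below is a minimal sketch of a helper that lists keys and value types a few levels down. `describe_structure` is a hypothetical helper that is not part of this notebook, and the `sample` record is invented to mimic the shape of a real file.

```python
def describe_structure(obj, depth=0, max_depth=2):
    """Return indented lines naming each key and the type of its value."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{'  ' * depth}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(describe_structure(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Only the first element is inspected, assuming list items share a shape.
        lines.extend(describe_structure(obj[0], depth, max_depth))
    return lines

# Hypothetical record, invented for illustration only.
sample = {
    'paper_id': 'abc123',
    'metadata': {
        'title': 'A hypothetical title',
        'authors': [{'first': 'Ada', 'middle': [], 'last': 'Example'}],
    },
    'body_text': [{'section': 'Introduction', 'text': '...'}],
}

summary = describe_structure(sample)
print("\n".join(summary))
```

On a real file you could call `describe_structure(all_files[0])` and raise `max_depth` to look deeper.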


## Biorxiv: Generate CSV

In this section, I show you how to manually generate the CSV files. As you can see, it's now super simple because of the `format_` helper functions. In the next sections, I show you how to generate them in 3 lines using the `load_files` and `generate_clean_df` helper functions.

In [7]:
cleaned_files = []

for file in tqdm(all_files):
    features = [
        file['paper_id'],
        file['metadata']['title'],
        format_authors(file['metadata']['authors']),
        format_authors(file['metadata']['authors'], 
                       with_affiliation=True),
        format_body(file['abstract']),
        format_body(file['body_text']),
        format_bib(file['bib_entries']),
        file['metadata']['authors'],
        file['bib_entries']
    ]
    
    cleaned_files.append(features)

In [8]:
col_names = [
    'paper_id', 
    'title', 
    'authors',
    'affiliations', 
    'abstract', 
    'text', 
    'bibliography',
    'raw_authors',
    'raw_bibliography'
]

clean_df = pd.DataFrame(cleaned_files, columns=col_names)
clean_df.head()

| | paper_id | title | authors | affiliations | abstract | text | bibliography | raw_authors | raw_bibliography |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4602afcb8d95ebd9da583124384fd74299d20f5b | SPINT2 inhibits proteases involved in activati... | Marco R Straus, Jonathan T Kinder, Michal Sega... | Marco R Straus, Jonathan T Kinder (University ... | Abstract\n\nViruses possessing class I fusion ... | Introduction 9\n\nInfluenza-like illnesses (IL... | Protease inhibitors targeting coronavirus and ... | [{'first': 'Marco', 'middle': ['R'], 'last': '... | {'BIBREF1': {'ref_id': 'b1', 'title': 'Proteas... |
| 1 | 90b5ecf991032f3918ad43b252e17d1171b4ea63 | The role of absolute humidity on transmission ... | Wei Luo, Maimuna S Majumder, Diambo Liu, Canel... | Wei Luo (Boston Children's Hospital, 02215, Bo... | | Introduction\n\nSince December 2019, an increa... | A novel coronavirus from patients with pneumon... | [{'first': 'Wei', 'middle': [], 'last': 'Luo',... | {'BIBREF0': {'ref_id': 'b0', 'title': 'A novel... |
| 2 | d3c2e2839498c613ee95739dce7052109750362c | Long-Term Persistence of IgG Antibodies in SAR... | Xiaoqin Guo, Zhongmin Guo, Chaohui Duan, Zelia... | Xiaoqin Guo (Sun Yat-sen University, 510080, G... | Abstract\n\n23 BACKGROUND 24 The ongoing world... | \n\nCC-BY-ND 4.0 International license It is m... | The severe acute respiratory syndrome, J S Pei... | [{'first': 'Xiaoqin', 'middle': [], 'last': 'G... | {'BIBREF0': {'ref_id': 'b0', 'title': 'The sev... |
| 3 | bbd9d63dc2c733c763770f62205ef9adeceb0127 | Effects of temperature variation and humidity ... | Yueling Ma, Yadong Zhao, Jiangtao Liu, Xiaotao... | Yueling Ma (Lanzhou University, 730000, Lanzho... | Abstract\n\nObject Meteorological parameters a... | Introduction\n\nIn December 2019, a novel coro... | A new coronavirus associated with human respir... | [{'first': 'Yueling', 'middle': [], 'last': 'M... | {'BIBREF0': {'ref_id': 'b0', 'title': 'A new c... |
| 4 | da3aa20131ac2805c0d9e1b29f094683479ab5b7 | Ruler elements in chromatin remodelers set nuc... | Elisa Oberbeckmann, Vanessa Niebauer, Shinya W... | Elisa Oberbeckmann (Martinsried near to Munich... | Abstract\n\nArrays of regularly spaced nucleos... | \n\nArrays of regularly spaced nucleosomes dom... | The Snf2 homolog Fun30 acts as a homodimeric A... | [{'first': 'Elisa', 'middle': [], 'last': 'Obe... | {'BIBREF0': {'ref_id': 'b0', 'title': 'The Snf... |


In [None]:
clean_df.to_csv('biorxiv_clean.csv', index=False)
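One thing to keep in mind when reusing the CSV: the `raw_authors` and `raw_bibliography` columns hold lists and dicts, and a CSV cell can only store their string form. Below is a minimal sketch of recovering the structure, assuming the cells contain plain Python literals. It uses the standard-library `csv` module on a hypothetical one-row file so that it runs without the dataset. With pandas, I would expect the same `ast.literal_eval` call to work on the column after `pd.read_csv`.

```python
import ast
import csv
import io

# Hypothetical record, invented for illustration only.
raw_authors = [{'first': 'Ada', 'middle': [], 'last': 'Example'}]

# Write one row; csv.writer turns non-string values into text with str().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['paper_id', 'raw_authors'])
writer.writerow(['abc123', raw_authors])

# Read the row back: the list of dicts is now just a string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row['raw_authors']).__name__)

# ast.literal_eval safely parses Python literals without running arbitrary code.
restored = ast.literal_eval(row['raw_authors'])
print(restored[0]['last'])
```

`ast.literal_eval` raises an error on empty or malformed cells, so on real data you may want to guard the call.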

## Generate CSV: Custom (PMC), Commercial, Non-commercial licenses

In [9]:
pmc_dir = '/kaggle/input/CORD-19-research-challenge/custom_license/custom_license/'
pmc_files = load_files(pmc_dir)
pmc_df = generate_clean_df(pmc_files)
pmc_df.to_csv('clean_pmc.csv', index=False)
pmc_df.head()


In [None]:
comm_dir = '/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/'
comm_files = load_files(comm_dir)
comm_df = generate_clean_df(comm_files)
comm_df.to_csv('clean_comm_use.csv', index=False)
comm_df.head()

In [None]:
noncomm_dir = '/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/'
noncomm_files = load_files(noncomm_dir)
noncomm_df = generate_clean_df(noncomm_files)
noncomm_df.to_csv('clean_noncomm_use.csv', index=False)
noncomm_df.head()