# COVID-19 Workbook
In this workbook, I try to replicate some graphs others have made.
The preprocessing below follows xhlulu's notebook "EDA, parse JSON and generate clean CSV".

## Import libraries

In [1]:
# Clean the JSON files following xhlulu's biorxiv notebook (EDA, parse JSON and generate clean CSV), applied here to comm_use
import os
import json
from pprint import pprint
from copy import deepcopy

import numpy as np
import pandas as pd
#from tqdm.notebook import tqdm
from tqdm import tqdm

The cell below defines the following helper functions:

- format_name(author)
- format_affiliation(affiliation)
- format_authors(authors, with_affiliation=False)
- format_body(body_text)
- format_bib(bibs)

In [2]:
def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])


def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)
    
    return ", ".join(name_ls)

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body

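def format_bib(bibs):
    # bib_entries is a dict keyed by reference id; only its values are needed
    if isinstance(bibs, dict):
        bibs = list(bibs.values())
    # Copy so the caller's entries are not mutated below
    bibs = deepcopy(bibs)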
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)
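
`format_name` indexes `author['first']` and `author['last']` directly, so a record with a missing key would raise a `KeyError`, and an empty first name would leave a leading space in the result. The sketch below shows one possible defensive variant. The name `format_name_safe` is hypothetical and is not part of xhlulu's code.

```python
def format_name_safe(author):
    """Join the non-empty name parts of one author record (hypothetical helper)."""
    parts = [author.get('first', '')]
    parts.extend(author.get('middle', []))
    parts.append(author.get('last', ''))
    # Skip empty strings so no stray spaces appear in the result
    return " ".join(p for p in parts if p)


print(format_name_safe({'first': 'Caroline', 'middle': ['L'], 'last': 'Dahlberg'}))
print(format_name_safe({'first': '', 'middle': [], 'last': 'Geginat'}))
```

Whether this is preferable depends on the data: failing loudly on a malformed record can also be the desired behaviour.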

The cell below defines the following functions:

- load_files(dirname)
- generate_clean_df(all_files)

In [3]:
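def load_files(dirname):
    """Load every JSON file in dirname into a list of dicts."""
    filenames = os.listdir(dirname)
    raw_files = []

    for filename in tqdm(filenames):
        # os.path.join works whether or not dirname ends with a separator
        path = os.path.join(dirname, filename)
        # The context manager closes each file handle after parsing
        with open(path, 'rb') as f:
            raw_files.append(json.load(f))

    return raw_files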

def generate_clean_df(all_files):
    cleaned_files = []
    
    for file in tqdm(all_files):
        features = [
            file['paper_id'],
            file['metadata']['title'],
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'], 
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            file['metadata']['authors'],
            file['bib_entries']
        ]

        cleaned_files.append(features)

    col_names = ['paper_id', 'title', 'authors',
                 'affiliations', 'abstract', 'text', 
                 'bibliography','raw_authors','raw_bibliography']

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)

    return clean_df
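
`load_files` reads every article into one list, which is convenient but keeps the whole subset in memory. As a rough sketch of an alternative, the generator below yields one parsed article at a time and skips anything that is not a `.json` file. The name `iter_json_files` and the toy files are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


def iter_json_files(dirname):
    """Yield one parsed article at a time (hypothetical alternative to load_files)."""
    # sorted() gives a stable order; directory listing order is not guaranteed
    for path in sorted(Path(dirname).glob('*.json')):
        with path.open('rb') as f:
            yield json.load(f)


# Tiny demonstration on a temporary directory with two made-up articles
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        (Path(tmp) / f"paper{i}.json").write_text(json.dumps({'paper_id': f"id{i}"}))
    (Path(tmp) / "notes.txt").write_text("not an article")
    paper_ids = [article['paper_id'] for article in iter_json_files(tmp)]

print(paper_ids)
```

A generator like this can be passed to a loop that builds the cleaned rows directly, so the raw dictionaries do not all need to exist at once.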

## Comm_use: Exploration
Let's first take a quick glance at the comm_use subset of the data. We will also use this opportunity to load all of the JSON files into a list of nested dictionaries (each dict is an article).

In [4]:
x_dir = 'C:/Users/Renate/Documents/GitHub/Data-Projects/Kaggle - Covid-19/comm_use_subset/comm_use_subset/'
filenames = os.listdir(x_dir)
print("Number of articles retrieved from comm_use:", len(filenames))

Number of articles retrieved from comm_use: 9118


In [5]:
all_files = []

for filename in filenames:
    # Use a context manager so each file handle is closed after parsing
    with open(x_dir + filename, 'rb') as f:
        all_files.append(json.load(f))

In [6]:
file = all_files[0]
print("Dictionary keys:", file.keys())

Dictionary keys: dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])


## Comm_use: Abstract
The abstract is a list of paragraph dictionaries. For this first article it is empty.

In [7]:
pprint(file['abstract'])

[]
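
The empty list printed above means this article has no abstract paragraphs. To get a feel for how common that is, one could count such articles. The sketch below assumes each article stores its abstract as a list of dictionaries under the `'abstract'` key, as `format_body` expects; the helper name `count_empty_abstracts` and the toy records are made up for illustration.

```python
def count_empty_abstracts(articles):
    """Count articles whose 'abstract' list is missing or empty (hypothetical helper)."""
    return sum(1 for article in articles if not article.get('abstract'))


# Made-up records that mimic the structure seen in the dataset
toy_articles = [
    {'paper_id': 'a', 'abstract': []},
    {'paper_id': 'b', 'abstract': [{'section': 'Abstract', 'text': 'Some text.'}]},
    {'paper_id': 'c'},
]
print(count_empty_abstracts(toy_articles))
```

On the real data the call would be `count_empty_abstracts(all_files)`.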


## Comm_use: Body text
Let's find out what the body_text field looks like.

In [8]:
print("body_text type:", type(file['body_text']))
print("body_text length:", len(file['body_text']))
print("body_text keys:", file['body_text'][0].keys())

body_text type: <class 'list'>
body_text length: 1
body_text keys: dict_keys(['text', 'cite_spans', 'ref_spans', 'section'])


In [9]:
print(all_files[0]['metadata'].keys())

dict_keys(['title', 'authors'])


In [10]:
print(all_files[0]['metadata']['title'])

Supplementary Information An eco-epidemiological study of Morbilli-related paramyxovirus infection in Madagascar bats reveals host-switching as the dominant macro-evolutionary mechanism


In [11]:
authors = all_files[1]['metadata']['authors']
pprint(authors[:4])

[{'affiliation': {},
  'email': '',
  'first': 'Elisabetta',
  'last': 'Padovan',
  'middle': [],
  'suffix': ''},
 {'affiliation': {},
  'email': '',
  'first': 'Marina',
  'last': 'Cella',
  'middle': [],
  'suffix': ''},
 {'affiliation': {},
  'email': '',
  'first': 'Shahram',
  'last': 'Salek-Ardakani',
  'middle': [],
  'suffix': ''},
 {'affiliation': {'institution': '',
                  'laboratory': 'Istituto Nazionale di Genetica Molecolare '
                                '"Romeo ed Enrica Invernizzi" (INGM)',
                  'location': {'country': 'Italy', 'settlement': 'Milan'}},
  'email': 'geginat@ingm.org',
  'first': 'Jens',
  'last': 'Geginat',
  'middle': [],
  'suffix': ''}]


In [12]:
texts = [(di['section'], di['text']) for di in file['body_text']]
texts_di = {di['section']: "" for di in file['body_text']}
for section, text in texts:
    texts_di[section] += text

pprint(list(texts_di.keys()))

['']
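
The single empty key above means this article has one untitled section. For an article with several sections, the same grouping concatenates the paragraphs that share a section name and keeps the sections in first-seen order. A small sketch with made-up paragraphs (the helper name `group_by_section` is hypothetical):

```python
def group_by_section(body_text):
    """Concatenate paragraph texts per section, keeping first-seen section order."""
    grouped = {}
    for paragraph in body_text:
        section = paragraph['section']
        # dicts preserve insertion order (Python 3.7+), so sections stay in document order
        grouped[section] = grouped.get(section, "") + paragraph['text']
    return grouped


toy_body = [
    {'section': 'Introduction', 'text': 'First paragraph. '},
    {'section': 'Methods', 'text': 'How it was done.'},
    {'section': 'Introduction', 'text': 'Second paragraph.'},
]
sections = group_by_section(toy_body)
print(list(sections.keys()))
print(sections['Introduction'])
```

Note that paragraphs are joined without a separator, as in the notebook's own code, so any spacing between paragraphs has to come from the text itself.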


In [13]:
body = ""

for section, text in texts_di.items():
    body += section
    body += "\n\n"
    body += text
    body += "\n\n"

print(body[:3000])



- Figure S1 : Phylogeny of all sequences belonging to the UMRV phylogroup. - Table S4 : Bats cytochrome b sequences data set. -Table S5 : Test of host-parasite co-evolution using global fit methods ParaFit. Figure S1 . Phylogeny of all sequences belonging to the UMRV phylogroup. A global phylogeny of 308 partial L-gene sequences calculated in 10,000,000 iterations in MrBayes with the GTR + G + I evolutionary model and a 10% burn-in rooted with an Aquaparamyxovirus sequence (GenBank number EF646380). All Malagasy bat paramyxoviruses sequences obtained within this study fell within group of Unclassified Morbillivirus-Related viruses. Genbank accession numbers used for each virus genera are indicated in Table S6 . 




In [14]:

authors = all_files[1]['metadata']['authors']
pprint(authors[:3])


[{'affiliation': {},
  'email': '',
  'first': 'Elisabetta',
  'last': 'Padovan',
  'middle': [],
  'suffix': ''},
 {'affiliation': {},
  'email': '',
  'first': 'Marina',
  'last': 'Cella',
  'middle': [],
  'suffix': ''},
 {'affiliation': {},
  'email': '',
  'first': 'Shahram',
  'last': 'Salek-Ardakani',
  'middle': [],
  'suffix': ''}]


In [15]:
for author in authors:
    print("Name:", format_name(author))
    print("Affiliation:", format_affiliation(author['affiliation']))
    print()

Name: Elisabetta Padovan
Affiliation: 

Name: Marina Cella
Affiliation: 

Name: Shahram Salek-Ardakani
Affiliation: 

Name: Jens Geginat
Affiliation: Milan, Italy

Name: Giulia Nizzoli
Affiliation: Milan, Italy

Name: Moira Paroni
Affiliation: Milan, Italy

Name: Stefano Maglie
Affiliation: Milan, Italy

Name: Paola Larghi
Affiliation: Milan, Italy

Name: Steve Pascolo
Affiliation: University Hospital of Zurich, Zurich, Switzerland

Name: Sergio Abrignani
Affiliation: Milan, Italy
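
In the output above, Jens Geginat's affiliation shows only "Milan, Italy" although his record carries a laboratory name. This is because `format_affiliation` reads only the `institution` and `location` fields. If the laboratory should be kept, a variant along the following lines might work. The name `format_affiliation_with_lab` is hypothetical, and the key order inside `location` is an assumption.

```python
def format_affiliation_with_lab(affiliation):
    """Like format_affiliation, but also keeps the 'laboratory' field (hypothetical variant)."""
    parts = [
        affiliation.get('institution', ''),
        affiliation.get('laboratory', ''),
    ]
    parts.extend((affiliation.get('location') or {}).values())
    # Drop empty strings so missing fields do not leave dangling commas
    return ", ".join(p for p in parts if p)


# Affiliation record of Jens Geginat as printed earlier in this notebook
# (the key order inside 'location' is an assumption)
geginat = {
    'institution': '',
    'laboratory': 'Istituto Nazionale di Genetica Molecolare '
                  '"Romeo ed Enrica Invernizzi" (INGM)',
    'location': {'settlement': 'Milan', 'country': 'Italy'},
}
print(format_affiliation_with_lab(geginat))
print(repr(format_affiliation_with_lab({})))
```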



In [16]:
pprint(all_files[4]['metadata'], depth=4)

{'authors': [{'affiliation': {'institution': 'Tufts University',
                              'laboratory': '',
                              'location': {...}},
              'email': '',
              'first': 'Molly',
              'last': 'Hodul',
              'middle': [],
              'suffix': ''},
             {'affiliation': {'institution': 'Western Washington University',
                              'laboratory': '',
                              'location': {...}},
              'email': '',
              'first': 'Caroline',
              'last': 'Dahlberg',
              'middle': ['L'],
              'suffix': ''},
             {'affiliation': {'institution': 'Tufts University',
                              'laboratory': '',
                              'location': {...}},
              'email': 'peter.juo@tufts.edu',
              'first': 'Peter',
              'last': 'Juo',
              'middle': [],
              'suffix': ''},
             {'affiliation': {}

In [17]:
authors = all_files[4]['metadata']['authors']
print("Formatting without affiliation:")
print(format_authors(authors, with_affiliation=False))
print("\nFormatting with affiliation:")
print(format_authors(authors, with_affiliation=True))

Formatting without affiliation:
Molly Hodul, Caroline L Dahlberg, Peter Juo, Clive R Bramham, Carlos B Duarte, Angela M Mabb, Ivan Salazar

Formatting with affiliation:
Molly Hodul (Tufts University, Boston, MA, United States), Caroline L Dahlberg (Western Washington University, Bellingham, WA, United States), Peter Juo (Tufts University, Boston, MA, United States), Clive R Bramham, Carlos B Duarte, Angela M Mabb, Ivan Salazar


In [18]:
bibs = list(file['bib_entries'].values())
pprint(bibs[:2], depth=4)

[{'authors': [],
  'issn': '',
  'other_ids': {},
  'pages': '',
  'ref_id': 'b32',
  'title': 'NDV/HQ266603/Chicken/1992',
  'venue': '',
  'volume': '',
  'year': None},
 {'authors': [],
  'issn': '',
  'other_ids': {},
  'pages': '',
  'ref_id': 'b43',
  'title': 'MuV/FJ375177/Human/1987',
  'venue': '',
  'volume': '',
  'year': None}]


In [19]:
format_authors(bibs[1]['authors'], with_affiliation=False)

''

In [20]:
bib_formatted = format_bib(bibs[:5])
print(bib_formatted)

NDV/HQ266603/Chicken/1992, , , None; MuV/FJ375177/Human/1987, , , None; HeV/HM044321/Horse/2007, , , None; NDV/FJ751918/Chicken/1979, , , None; APMV4/EU877976/Duck/2006, , , None
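
The `, , , None` runs in the output above appear because `format_bib` joins all four fields even when authors, venue and year are empty. If a more compact string is wanted, the empty fields could be skipped. The sketch below is one way to do that; `format_bib_entry_compact` is a hypothetical helper, the second toy entry is invented, and `authors` is assumed to be an already formatted string.

```python
def format_bib_entry_compact(bib):
    """Join only the non-empty fields of one bibliography entry (hypothetical helper)."""
    fields = [bib.get(k) for k in ('title', 'authors', 'venue', 'year')]
    # Skip None, empty strings, and empty lists
    return ", ".join(str(f) for f in fields if f)


toy_bibs = [
    {'title': 'NDV/HQ266603/Chicken/1992', 'authors': '', 'venue': '', 'year': None},
    {'title': 'A made-up title', 'authors': 'A Author', 'venue': 'Some Journal', 'year': 2006},
]
print("; ".join(format_bib_entry_compact(b) for b in toy_bibs))
```

The trade-off is that the position of a field no longer tells you what it is, so the original fixed four-field layout may still be preferable for later parsing.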


In [21]:
cleaned_files = []

for file in tqdm(all_files):
    features = [
        file['paper_id'],
        file['metadata']['title'],
        format_authors(file['metadata']['authors']),
        format_authors(file['metadata']['authors'], 
                       with_affiliation=True),
        format_body(file['abstract']),
        format_body(file['body_text']),
        format_bib(file['bib_entries']),
        file['metadata']['authors'],
        file['bib_entries']
    ]
    
    cleaned_files.append(features)

100%|█████████████████████████████████████████████████████████████████████████████| 9118/9118 [00:34<00:00, 264.29it/s]


In [22]:
col_names = [
    'paper_id', 
    'title', 
    'authors',
    'affiliations', 
    'abstract', 
    'text', 
    'bibliography',
    'raw_authors',
    'raw_bibliography'
]

clean_df = pd.DataFrame(cleaned_files, columns=col_names)
clean_df.head()

| | paper_id | title | authors | affiliations | abstract | text | bibliography | raw_authors | raw_bibliography |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 000b7d1517ceebb34e1e3e817695b6de03e2fa78 | Supplementary Information An eco-epidemiologic... | Julien Mélade, Nicolas Wieseke 4#, Beza Ramazi... | Julien Mélade (2 rue Maxime Rivière, 97490 Sai... | | \n\n- Figure S1 : Phylogeny of all sequences b... | NDV/HQ266603/Chicken/1992, , , None; MuV/FJ375... | [{'first': 'Julien', 'middle': [], 'last': 'Mé... | {'BIBREF32': {'ref_id': 'b32', 'title': 'NDV/H... |
| 1 | 00142f93c18b07350be89e96372d240372437ed9 | immunity to pathogens taught by specialized hu... | Elisabetta Padovan, Marina Cella, Shahram Sale... | Elisabetta Padovan, Marina Cella, Shahram Sale... | Abstract\n\nDendritic cells (DCs) are speciali... | \n\niNTRODUCTiON Human beings are constantly e... | The dendritic cell system and its role in immu... | [{'first': 'Elisabetta', 'middle': [], 'last':... | {'BIBREF0': {'ref_id': 'b0', 'title': 'The den... |
| 2 | 0022796bb2112abd2e6423ba2d57751db06049fb | Public Health Responses to and Challenges for ... | Elvina Viennet, Scott A Ritchie, Craig R Willi... | Elvina Viennet (The Australian National Univer... | Abstract\n\nDengue has a negative impact in lo... | Introduction\n\nPathogens and vectors can now ... | The global distribution and burden of dengue, ... | [{'first': 'Elvina', 'middle': [], 'last': 'Vi... | {'BIBREF0': {'ref_id': 'b0', 'title': 'The glo... |
| 3 | 00326efcca0852dc6e39dc6b7786267e1bc4f194 | a section of the journal Frontiers in Pediatri... | Jan Hau Lee, Oguz Dursun, Phuc Huu Phan, Yek K... | Jan Hau Lee, Oguz Dursun, Phuc Huu Phan, Yek K... | Abstract\n\nFifteen years ago, United Nations ... | \n\nIn addition to preventative care and nutri... | Global, regional, and national levels of neona... | [{'first': 'Jan', 'middle': ['Hau'], 'last': '... | {'BIBREF0': {'ref_id': 'b0', 'title': 'Global,... |
| 4 | 00352a58c8766861effed18a4b079d1683fec2ec | MINI REVIEW Function of the Deubiquitinating E... | Molly Hodul, Caroline L Dahlberg, Peter Juo, C... | Molly Hodul (Tufts University, Boston, MA, Uni... | Abstract\n\nPosttranslational modification of ... | INTRODUCTION\n\nUbiquitination is a widely use... | Regulation of AMPA receptor trafficking and sy... | [{'first': 'Molly', 'middle': [], 'last': 'Hod... | {'BIBREF0': {'ref_id': 'b0', 'title': 'Regulat... |


In [23]:
clean_df.to_csv('C:/Users/Renate/Documents/GitHub/Data-Projects/Kaggle - Covid-19/comm_use_clean.csv', index=False)
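
One thing to keep in mind when reading this CSV back: the `raw_authors` and `raw_bibliography` columns hold Python lists and dicts, and a CSV file can only store their string form. After reloading, those cells are plain strings. Because the string uses Python syntax (single quotes, `None`), `ast.literal_eval` is a better fit than `json.loads` for restoring them. The sketch below reproduces the effect with the standard-library `csv` module and a made-up row, so it runs without the real file; it is an illustration, not part of the original pipeline.

```python
import ast
import csv
import io

raw_authors = [{'first': 'Molly', 'middle': [], 'last': 'Hodul', 'suffix': ''}]

# Write one row the way a CSV writer stores a non-string cell: as str(value)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['paper_id', 'raw_authors'])
writer.writerow(['toy_id', raw_authors])

# Read it back: the cell is now a plain string, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row['raw_authors']).__name__)

# ast.literal_eval safely parses Python literals (lists, dicts, strings, None)
restored = ast.literal_eval(row['raw_authors'])
print(restored == raw_authors)
```

With pandas, the same idea would be applying `ast.literal_eval` to the two raw columns after `pd.read_csv`.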