# COVID-19 Open Research Dataset Challenge (CORD-19)
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House

### Task: What do we know about COVID-19 risk factors?

---

### Information about the data

1. Metadata for papers from these sources is combined: CZI, PMC, bioRxiv/medRxiv (29500 records in total).
    - CZI: 1236 records
    - PMC: 27337 records
    - bioRxiv: 566 records
    - medRxiv: 361 records
2. 17K of the paper records have PDFs; the hashes of the PDFs are in 'sha'.<br>
3. For PMC-sourced papers, one paper's metadata can be associated with one or more PDFs/shas: a PDF/sha corresponding to the main article, and possibly additional PDFs/shas corresponding to supporting materials for the article.<br>
4. 13K of the PDFs were processed with full text ('has_full_text'=True).<br>
5. Various 'keys' are populated with the metadata:
    - 'pmcid': populated for all PMC paper records (27337 non-null)
    - 'doi': populated for all bioRxiv/medRxiv paper records and most of the other records (26357 non-null)
    - 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non-null)
    - 'pubmed_id': populated for some of the records
    - 'Microsoft Academic Paper ID': populated for some of the records
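
The counts listed above can be recomputed from the metadata file. The following is a minimal sketch using only the standard library; the helper name `summarize_metadata` and the three sample rows are illustrative assumptions standing in for the real CSV, while the column names `source_x`, `sha`, `doi` and `pmcid` are those of the metadata file.

```python
import csv
from collections import Counter
from io import StringIO


def summarize_metadata(rows, keys=("sha", "doi", "pmcid")):
    """Count records per source and non-empty values per key column."""
    rows = list(rows)
    per_source = Counter(row["source_x"] for row in rows)
    non_null = {key: sum(1 for row in rows if row.get(key)) for key in keys}
    return per_source, non_null


# Hypothetical three-row sample standing in for the real metadata CSV.
sample = StringIO(
    "sha,source_x,doi,pmcid\n"
    "aaa,PMC,10.1/x,PMC1\n"
    ",CZI,10.1/y,\n"
    "bbb,biorxiv,10.1/z,\n"
)
per_source, non_null = summarize_metadata(csv.DictReader(sample))
print(per_source)
print(non_null)
```

On the real file, the same function could be fed `csv.DictReader(open(path, newline=""))`, with `path` pointing at the metadata CSV.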

**Chan Zuckerberg Initiative (CZI)**<br>
**PubMed Central (PMC)** is a free digital repository that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.<br>
**bioRxiv** (pronounced "bio-archive") is an open access preprint repository for the biological sciences.<br>
**medRxiv** (pronounced "med-archive") is a preprint service for the medicine and health sciences and provides a free online platform for researchers to share, comment, and receive feedback on their work. Information among scientists spreads slowly, and often incompletely.

---

# Themes

## Transform the required themes

In [1]:
import pandas as pd
from io import StringIO

### Initial form of themes:
> 1. Data on potential risk factors
    - Smoking, pre-existing pulmonary disease
    - Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
    - Neonates and pregnant women
    - Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
2. Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
3. Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
4. Susceptibility of populations
5. Public health mitigation measures that could be effective for control

In [2]:
riskfac = StringIO("""Factor;Description
Pulmonary risk;Smoking, preexisting pulmonary disease
Infection risk;Coinfections determine whether coexisting respiratory or viral infections make the virus more transmissible or virulent and other comorbidities
Birth risk;Neonates and pregnant women
Socio-economic risk;Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences
Transmission;Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
Severity;Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
Susceptibility;Susceptibility of populations
mitig measures;Public health mitigation measures that could be effective for control
""")

# The lines are not indented, so the Factor values carry no leading spaces
rf_base = pd.read_csv(riskfac, sep=";")
rf_base

Unnamed: 0,Factor,Description
0,Pulmonary risk,"Smoking, preexisting pulmonary disease"
1,Infection risk,Coinfections determine whether coexisting resp...
2,Birth risk,Neonates and pregnant women
3,Socio-economic risk,Socio-economic and behavioral factors to under...
4,Transmission,"Transmission dynamics of the virus, including ..."
5,Severity,"Severity of disease, including risk of fatalit..."
6,Susceptibility,Susceptibility of populations
7,mitig measures,Public health mitigation measures that could b...


In [3]:
# Export the factors and their descriptions to CSV
rf_base.to_csv('2020-03-13/rf_base.csv', index=False)

In [4]:
import spacy

In [5]:
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)

## Load the themes

In [6]:
data = pd.read_csv('2020-03-13/rf_base.csv', delimiter=',', header=None, skiprows=1, names=['Factor','Description'])
# We only need the Description text column from the data
# .copy() avoids pandas' SettingWithCopyWarning when the index column is added
descp = data[:8][['Description']].copy()
descp['index'] = descp.index

In [7]:
descp

Unnamed: 0,Description,index
0,"Smoking, preexisting pulmonary disease",0
1,Coinfections determine whether coexisting resp...,1
2,Neonates and pregnant women,2
3,Socio-economic and behavioral factors to under...,3
4,"Transmission dynamics of the virus, including ...",4
5,"Severity of disease, including risk of fatalit...",5
6,Susceptibility of populations,6
7,Public health mitigation measures that could b...,7


## Preprocess the themes

In [8]:
# Loading Gensim and nltk libraries

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mikehatchi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
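# Preprocess text: lowercase and tokenize, then drop stopwords and tokens of 3 characters or fewer.
# The lemmatizing/stemming helper is kept below but disabled.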
"""
def lemmatize_stemming(text):
    stemmer = SnowballStemmer('english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
"""
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # result.append(lemmatize_stemming(token))
            result.append(token)
    return result

In [10]:
'''
Preview a document after preprocessing
'''
theme_num = 0
theme_sample = descp[descp['index'] == theme_num].values[0][0]

print("Original document: ")
words = theme_sample.split(' ')
print(words)
# Lemmatization is disabled in preprocess(), so the result is only tokenized
print("\n\nTokenized document: ")
print(preprocess(theme_sample))

Original document: 
['Smoking,', 'preexisting', 'pulmonary', 'disease']


Tokenized document: 
['smoking', 'preexisting', 'pulmonary', 'disease']


In [11]:
# descp = descp.dropna(subset=['Description'])
processed_docs = descp['Description'].map(preprocess)
# [processed_docs[i] for i in range(8)]

## Feature extraction

### Binary Encoding

In [12]:
factor = []
for i in processed_docs[:8]:
    vocab = set(i)
    factor.append(vocab)

for i in range(8):
    print(len(factor[i]), factor[i])

4 {'disease', 'smoking', 'pulmonary', 'preexisting'}
10 {'infections', 'transmissible', 'respiratory', 'coexisting', 'virus', 'determine', 'comorbidities', 'virulent', 'coinfections', 'viral'}
3 {'pregnant', 'neonates', 'women'}
8 {'economic', 'behavioral', 'virus', 'socio', 'impact', 'understand', 'factors', 'differences'}
14 {'basic', 'period', 'reproductive', 'transmission', 'serial', 'dynamics', 'number', 'environmental', 'virus', 'interval', 'modes', 'including', 'factors', 'incubation'}
11 {'high', 'fatality', 'disease', 'severity', 'patient', 'hospitalized', 'including', 'groups', 'risk', 'symptomatic', 'patients'}
2 {'populations', 'susceptibility'}
6 {'mitigation', 'health', 'effective', 'public', 'measures', 'control'}
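
The cell above collects the unique words of each theme. A binary encoding would additionally map every theme onto a shared vocabulary as a 0/1 vector. The following is a minimal plain-Python sketch of that step, not the notebook's own implementation; the function name `binary_encode` is a hypothetical helper, and only two of the theme sets are used to keep the example short.

```python
def binary_encode(docs):
    """Return the sorted shared vocabulary and one 0/1 vector per document."""
    vocabulary = sorted(set().union(*docs))
    vectors = [[1 if word in doc else 0 for word in vocabulary] for doc in docs]
    return vocabulary, vectors


# Two of the theme word sets printed above, used as a small example.
themes = [
    {"disease", "smoking", "pulmonary", "preexisting"},
    {"pregnant", "neonates", "women"},
]
vocabulary, vectors = binary_encode(themes)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so two themes can be compared position by position.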


---

# Retrieve metadata
sha / source_x / title / doi / pmcid / pubmed_id / authors / journal / Microsoft Academic Paper ID / WHO #Covidence

In [36]:
data = pd.read_csv("2020-03-13/all_sources_metadata_2020-03-13.csv")

In [45]:
def met_xiv(data, sha):
    # Print the metadata of every record whose 'sha' equals the given hash
    for _, row in data[data['sha'] == sha].iterrows():
        print("sha: {} - Source X: {} \n\nTitle:\n{} \n\ndoi:\n{} \n\npmcid:{} - pubmed_id:{} \n\nauthors:\n{} \n\njournal:\n{} - Microsoft Academic Paper ID:{} - WHO #Covidence: {}".format(
            row['sha'],
            row['source_x'],
            row['title'],
            row['doi'],
            row['pmcid'],
            row['pubmed_id'],
            row['authors'],
            row['journal'],
            row['Microsoft Academic Paper ID'],
            row['WHO #Covidence']))

In [47]:
sha = 'c630ebcdf30652f0422c3ec12a00b50241dc9bd9'
met_xiv(data, sha)
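
Since a PMC record can be associated with more than one PDF/sha, a lookup by hash may need to split the 'sha' field before comparing. The following is a minimal sketch under an explicit assumption: that multiple hashes are stored in one field separated by `;`, which the notebook does not confirm. The helper name `find_by_sha` and the sample records are hypothetical.

```python
def find_by_sha(records, sha, sep=";"):
    """Return the records whose 'sha' field contains the given hash."""
    matches = []
    for record in records:
        field = record.get("sha") or ""
        # Assumption: several hashes may share one field, separated by `sep`.
        hashes = [part.strip() for part in field.split(sep)]
        if sha in hashes:
            matches.append(record)
    return matches


# Hypothetical records; real rows would come from the metadata CSV.
records = [
    {"sha": "abc123", "title": "Paper A"},
    {"sha": "def456; 789aaa", "title": "Paper B"},
    {"sha": "", "title": "Paper C"},
]
print([r["title"] for r in find_by_sha(records, "789aaa")])
```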

---

# Xviv Pipeline

### Loading json file function

In [13]:
import os
import json

def load_xviv(path):
    # Print the id, title and authors' first names of one paper
    with open(path) as json_file:
        data = json.load(json_file)

    print("PAPER_ID: {}\n\nTITLE:\n{}".format(data['paper_id'],
                                              data['metadata']['title']))

    # 'authors' is a list of dicts stored under 'metadata'
    name_fst = [author['first'] for author in data['metadata']['authors']]
    print("{}".format(name_fst))

In [18]:
import os
import json

def load_xviv(path):
    
    input_file = os.path.join(path)
    with open(input_file) as json_file:
        data = json.load(json_file)
        
        paper_id = data['paper_id']
        text = data['body_text'][:]
        
        return paper_id, text
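
The `text` returned above is the raw `body_text` list rather than a single string. The following is a minimal sketch of flattening it, assuming each `body_text` entry is a dict with a `'text'` key (the structure is an assumption here; the notebook does not show it). The helper name `body_text_to_string` and the miniature document are hypothetical.

```python
import json


def body_text_to_string(paper):
    """Join the 'text' of every body_text paragraph into one string."""
    return "\n\n".join(paragraph["text"] for paragraph in paper["body_text"])


# Hypothetical miniature document mimicking the assumed structure.
raw = json.dumps({
    "paper_id": "example-id",
    "body_text": [
        {"text": "First paragraph.", "section": "Introduction"},
        {"text": "Second paragraph.", "section": "Methods"},
    ],
})
paper = json.loads(raw)
full_text = body_text_to_string(paper)
print(full_text)
```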

In [21]:
def print_it(doc):
    # doc is the (paper_id, text) tuple returned by load_xviv
    paper_id, text = doc
    print("PAPER_ID: {}\n\nTEXT:\n{}".format(paper_id, text))

In [22]:
doc = load_xviv('2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/0015023cc06b5362d332b3baf348d11567ca2fbb.json')
print_it(doc)

generator = ( item['value'] for item in test_data )

...

for i in generator:
    do_something(i)

### Retrieve text document

In [17]:
load_xviv('2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/0015023cc06b5362d332b3baf348d11567ca2fbb.json')

0015023cc06b5362d332b3baf348d11567ca2fbb


In [86]:
# Here `data` is the parsed JSON of one paper, as inside load_xviv
print("\n\nAUTHORS:\n{} {} \n\nTEXT:\n{}".format(data['metadata']['authors'][0]['first'],
                                                 data['metadata']['authors'][0]['last'],
                                                 data['body_text'][:]))

---

# Phrase Matching

### Defining Patterns

In [21]:
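# factor[0] is a set of words; Matcher needs token patterns (spaCy v3 signature; in v2 use add(key, None, *patterns))
m_tool.add('Pulmonary', [[{'LOWER': w}] for w in sorted(factor[0])])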
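
To see what single-word matching against a theme amounts to, here is a plain-Python stand-in that does not use spaCy's `Matcher`: it reports which words of each theme occur in a piece of text. The function name `theme_hits` and the example sentence are hypothetical, and the tokenization is a simple lowercase regular expression rather than spaCy's tokenizer.

```python
import re


def theme_hits(text, themes):
    """Map each theme name to the sorted theme words found in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {name: sorted(words & tokens) for name, words in themes.items()}


themes = {
    "Pulmonary": {"disease", "smoking", "pulmonary", "preexisting"},
    "Birth": {"pregnant", "neonates", "women"},
}
# Hypothetical sentence, used only to exercise the function.
sentence = "Smoking history and pulmonary disease were recorded for pregnant women."
hits = theme_hits(sentence, themes)
print(hits)
```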

In [14]:
factor = {}
for i, doc in enumerate(processed_docs[:8]):
    # Map each theme index to its set of unique tokens
    factor[i] = set(doc)

for i in range(8):
    print(len(factor[i]), factor[i])

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(binary=True)
for i in range(8):
    # Each theme word is treated as its own one-word document
    vec.fit(factor[i])
    print(sorted(vec.vocabulary_))

# Sentences coordinates >> abstract