# COVID-19 Open Research Dataset Challenge (CORD-19)
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House

### Task: What do we know about COVID-19 risk factors?

---

### Information about the data

1. Metadata for papers from these sources is combined: CZI, PMC, bioRxiv/medRxiv (29,500 records in total).
    - CZI 1236 records
    - PMC 27337
    - bioRxiv 566
    - medRxiv 361
2. 17K of the paper records have PDFs, and the hashes of those PDFs are in 'sha'.<br>
3. For PMC-sourced papers, one paper's metadata can be associated with one or more PDFs/shas: one PDF/sha corresponding to the main article, and possibly additional PDFs/shas corresponding to supporting materials for the article.<br>
4. 13K of the PDFs were processed into full text ('has_full_text'=True).<br>
5. Various 'keys' are populated with the metadata:
    - 'pmcid': populated for all PMC paper records (27337 non null)
    - 'doi': populated for all BioRxiv/MedRxiv paper records and most of the other records (26357 non null)
    - 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non null)
    - 'pubmed_id': populated for some of the records
    - 'Microsoft Academic Paper ID': populated for some of the records
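
A quick way to check non-null counts like the ones listed above is to tally the populated cells per column. This is a minimal sketch using only the standard library; the two sample rows and their values are hypothetical placeholders, only the column names mirror the metadata file, and `count_populated` is a name introduced here for illustration.

```python
import csv
from io import StringIO

# Hypothetical two-row sample; only the column names mirror the metadata file.
sample = StringIO(
    "sha,source_x,doi,pmcid,WHO #Covidence\n"
    "abc123,CZI,10.1000/example-1,,#0001\n"
    "def456,PMC,,PMC0000001,\n"
)

def count_populated(handle):
    """Count non-empty cells per column in a CSV file-like object."""
    counts = {}
    for row in csv.DictReader(handle):
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value and value.strip():
                counts[column] += 1
    return counts

populated = count_populated(sample)
print(populated)
```
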
---
- Commercial use subset (includes PMC content) -- 9118 full text (new: 128), 183Mb
- Non-commercial use subset (includes PMC content) -- 2353 full text (new: 385), 41Mb
- Custom license subset -- 16959 full text (new: 15533), 345Mb
- bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 885 full text (new: 110), 14Mb
- Metadata file -- 60Mb

---
**Chan Zuckerberg Initiative (CZI)**<br>
**PubMed Central (PMC)** is a free digital repository that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.<br>
**BioRxiv** (pronounced "bio-archive") is an open access preprint repository for the biological sciences.<br>
**medRxiv** (pronounced "med-archive") is a preprint service for the medicine and health sciences that provides a free online platform for researchers to share, comment on, and receive feedback on their work. Information among scientists spreads slowly, and often incompletely.

[More details on dataset](https://pages.semanticscholar.org/coronavirus-research)

---

# Themes aka patterns to match
These are the themes we have to find among the 29,500 paper records.

## Transform the required themes

In [1]:
import pandas as pd
from io import StringIO

### Initial form of the themes:
> 1. Data on potential risk factors
    - Smoking, pre-existing pulmonary disease
    - Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
    - Neonates and pregnant women
    - Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
2. Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
3. Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
4. Susceptibility of populations
5. Public health mitigation measures that could be effective for control

### Keeping a record of the themes by exporting them as .csv

In [40]:
riskfac = StringIO("""Factor;Description
    Pulmonary risk;Smoking, preexisting pulmonary disease
    Infection risk;Coinfections determine whether coexisting respiratory or viral infections make the virus more transmissible or virulent and other comorbidities
    Birth risk;Neonates and pregnant women
    Socio-economic risk;Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences
    Transmission;Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
    Severity;Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
    Susceptibility;Susceptibility of populations
    mitig measures;Public health mitigation measures that could be effective for control
    """)

rf_base = pd.read_csv(riskfac, sep=";")
rf_base

Unnamed: 0,Factor,Description
0,Pulmonary risk,"Smoking, preexisting pulmonary disease"
1,Infection risk,Coinfections determine whether coexisting resp...
2,Birth risk,Neonates and pregnant women
3,Socio-economic risk,Socio-economic and behavioral factors to under...
4,Transmission,"Transmission dynamics of the virus, including ..."
5,Severity,"Severity of disease, including risk of fatalit..."
6,Susceptibility,Susceptibility of populations
7,mitig measures,Public health mitigation measures that could b...


In [41]:
# Export the factors and their descriptions so they can be reloaded later.
rf_base.to_csv(r'2020-03-13/rf_base.csv', index=False)

### Loading the themes

In [15]:
data = pd.read_csv('2020-03-13/rf_base.csv', delimiter=',', header=None, skiprows=1, names=['Factor','Description'])
# We only need the Description column from the data
descp = data[:8][['Description']].copy()  # copy so that adding a column does not write into a slice of `data`
descp['index'] = descp.index
descp

Unnamed: 0,Description,index
0,"Smoking, preexisting pulmonary disease",0
1,Coinfections determine whether coexisting resp...,1
2,Neonates and pregnant women,2
3,Socio-economic and behavioral factors to under...,3
4,"Transmission dynamics of the virus, including ...",4
5,"Severity of disease, including risk of fatalit...",5
6,Susceptibility of populations,6
7,Public health mitigation measures that could b...,7


## Preprocess the themes

In [16]:
# Loading Gensim and nltk libraries

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mikehatchi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Preview of themes after preprocessing

In [122]:
import scispacy
import spacy
from spacy import displacy

nlp = spacy.load("en_core_sci_sm")

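def patternizing(dataF):
    """Return the entities scispaCy finds in a theme description of `dataF`."""
    for i in range(8):
        theme_sample = dataF[dataF['index'] == i].values[0][0]

        text = theme_sample
        # print(theme_sample)

        doc = nlp(text)
        # print(list(doc.sents))

        # NOTE: this return sits inside the loop, so only theme 0 is processed.
        return doc.ents

        # displacy.render(next(doc.sents), style='dep', jupyter=True)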


In [123]:
# patternizing(descp)

In [124]:
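def convert(tup, di):
    # setdefault only writes the 'Lower' key if it is missing,
    # so the dict keeps the first entity and ignores the rest.
    for b in tup:
        di.setdefault("Lower", b)
    return di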

In [125]:
tups = patternizing(descp)
dictio = {}
print(convert(tups, dictio))

{'Lower': Smoking}


In [96]:
processed_docs = descp['Description'].map(patternizing)

TypeError: string indices must be integers
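
The `TypeError` above arises because `Series.map` passes each description string to `patternizing`, which then tries to index that string as if it were a DataFrame (`dataF['index']`). A minimal sketch of one way around it is a helper that takes a single string plus a spaCy-style pipeline. `extract_entities` and `fake_nlp` are hypothetical names; `fake_nlp` is only a stand-in so the example runs without the `en_core_sci_sm` model, and it does not imitate what that model would actually return.

```python
from types import SimpleNamespace

def extract_entities(text, nlp):
    """Return the entity texts that the pipeline `nlp` finds in one string."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

def fake_nlp(text):
    """Stand-in for a spaCy pipeline: treats each comma-separated chunk as an entity."""
    chunks = [chunk.strip() for chunk in text.split(",") if chunk.strip()]
    return SimpleNamespace(ents=[SimpleNamespace(text=chunk) for chunk in chunks])

themes = ["Smoking, preexisting pulmonary disease", "Neonates and pregnant women"]
entities_per_theme = [extract_entities(theme, fake_nlp) for theme in themes]
print(entities_per_theme)
```

With the real model loaded as `nlp`, the same helper could be applied per row, for example `descp['Description'].map(lambda t: extract_entities(t, nlp))`.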

In [31]:
factor = []
for i in processed_docs[:8]:
    vocab = set(i)
    factor.append(vocab)

for i in range(8):
    print(len(factor[i]), factor[i])

4 {'preexisting', 'smoking', 'pulmonary', 'disease'}
10 {'infections', 'virulent', 'determine', 'respiratory', 'comorbidities', 'transmissible', 'coinfections', 'virus', 'viral', 'coexisting'}
3 {'neonates', 'women', 'pregnant'}
8 {'differences', 'socio', 'economic', 'understand', 'behavioral', 'impact', 'virus', 'factors'}
14 {'dynamics', 'serial', 'environmental', 'basic', 'number', 'transmission', 'reproductive', 'interval', 'period', 'modes', 'virus', 'including', 'incubation', 'factors'}
11 {'patient', 'fatality', 'disease', 'severity', 'groups', 'hospitalized', 'high', 'including', 'symptomatic', 'patients', 'risk'}
2 {'susceptibility', 'populations'}
6 {'mitigation', 'health', 'public', 'effective', 'measures', 'control'}


In [9]:
# a function for preprocessing text
"""
def lemmatize_stemming(text):
    stemmer = SnowballStemmer('english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
"""
def preprocess(text):
    """Tokenize `text`, dropping stopwords and tokens of three characters or fewer."""
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # result.append(lemmatize_stemming(token))
            result.append(token)
    return result

In [10]:
'''
Preview a document after preprocessing
'''
theme_num = 0
theme_sample = descp[descp['index'] == theme_num].values[0][0]

print("Original document: ")
words = theme_sample.split(' ')
print(words)
print("\n\nTokenized and filtered document: ")
print(preprocess(theme_sample))

Original document: 
['Smoking,', 'preexisting', 'pulmonary', 'disease']


Tokenized and filtered document: 
['smoking', 'preexisting', 'pulmonary', 'disease']


In [11]:
# descp = descp.dropna(subset=['Description'])
processed_docs = descp['Description'].map(preprocess)
# [processed_docs[i] for i in range(8)]

## Feature extraction

In [12]:
factor = []
for i in processed_docs[:8]:
    vocab = set(i)
    factor.append(vocab)

for i in range(8):
    print(len(factor[i]), factor[i])

4 {'disease', 'smoking', 'pulmonary', 'preexisting'}
10 {'infections', 'transmissible', 'respiratory', 'coexisting', 'virus', 'determine', 'comorbidities', 'virulent', 'coinfections', 'viral'}
3 {'pregnant', 'neonates', 'women'}
8 {'economic', 'behavioral', 'virus', 'socio', 'impact', 'understand', 'factors', 'differences'}
14 {'basic', 'period', 'reproductive', 'transmission', 'serial', 'dynamics', 'number', 'environmental', 'virus', 'interval', 'modes', 'including', 'factors', 'incubation'}
11 {'high', 'fatality', 'disease', 'severity', 'patient', 'hospitalized', 'including', 'groups', 'risk', 'symptomatic', 'patients'}
2 {'populations', 'susceptibility'}
6 {'mitigation', 'health', 'effective', 'public', 'measures', 'control'}
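
One simple way to use these vocabularies, sketched here as an assumption rather than as this notebook's method, is to score a text by how many of a theme's words it contains. `theme_overlap` is a hypothetical helper and the sample sentence is made up.

```python
import re

def theme_overlap(text, vocabulary):
    """Return the theme words found in `text` and the fraction of the vocabulary covered."""
    if not vocabulary:
        return [], 0.0
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = tokens & set(vocabulary)
    return sorted(hits), len(hits) / len(vocabulary)

pulmonary = {'disease', 'smoking', 'pulmonary', 'preexisting'}
sample_text = "Smoking history and chronic pulmonary disease were recorded."  # made-up sentence
hits, coverage = theme_overlap(sample_text, pulmonary)
print(hits, coverage)
```

A coverage threshold for deciding that a paper belongs to a theme would have to be tuned on real articles.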


---

# Xiv Pipeline

### Article text retrieval

In [13]:
import os
import json
import glob

# Glob patterns for the JSON files in each subset folder
biomed_path = "2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/*.json"  # bio and med archive
commu_path = "2020-03-13/comm_use_subset/comm_use_subset/*.json"
noncom_path = "2020-03-13/noncomm_use_subset/noncomm_use_subset/*.json"
pmc_path = "2020-03-13/pmc_custom_license/pmc_custom_license/*.json"

# json file access function
def data_access(path):
    """Return the full body text of the first JSON file matching `path`."""
    for file_name in glob.glob(path):
        with open(os.path.normpath(file_name)) as json_file:
            data = json.load(json_file)
        # paper_id = data['paper_id']

        # Join every body_text paragraph into one string.
        text = "\n".join(item['text'] for item in data['body_text'])

        # Returns inside the loop, so only one paper is read per call.
        return text    # paper_id

# data_access(biomed_path)
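
To walk through every paper in a folder instead of a single file, a generator is one option. This is a minimal sketch: `iter_papers` is a hypothetical name, the `paper_id` and `body_text` keys are the ones referenced in the cell above, and the demo file is throwaway content created only so the example runs without the dataset.

```python
import glob
import json
import os
import tempfile

def iter_papers(pattern):
    """Yield (paper_id, full_text) for every JSON file matching the glob pattern."""
    for file_name in sorted(glob.glob(pattern)):
        with open(file_name, encoding="utf-8") as json_file:
            paper = json.load(json_file)
        paragraphs = [item["text"] for item in paper.get("body_text", [])]
        yield paper.get("paper_id"), "\n".join(paragraphs)

# Demo on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as folder:
    demo = {
        "paper_id": "demo-id",
        "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
    }
    with open(os.path.join(folder, "demo.json"), "w", encoding="utf-8") as handle:
        json.dump(demo, handle)
    papers = list(iter_papers(os.path.join(folder, "*.json")))

print(papers)
```

Because it yields one paper at a time, the generator avoids holding every article in memory at once.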

---

# Phrase matching

In [43]:
import scispacy
import spacy
# SpaCy models for biomedical text processing >> ScispaCY package
# https://allenai.github.io/scispacy/

nlp = spacy.load('en_core_sci_sm') # for classic english: en_core_web_sm
from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)

### Defining Patterns

In [21]:
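# adding patterns to the Matcher
# Each pattern is a list of token dicts; match every theme word case-insensitively.
pulmonary_patterns = [[{'LOWER': word}] for word in sorted(factor[0])]

# spaCy v3 signature; spaCy v2 expects m_tool.add('Pulmonary', None, *pulmonary_patterns)
m_tool.add('Pulmonary', pulmonary_patterns)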
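
spaCy's token `Matcher` takes each pattern as a list of token-attribute dictionaries, not a set of words. The sketch below prepares such patterns for several themes at once. `build_theme_patterns` and the theme labels are hypothetical, and the registration shown in the trailing comment assumes the spaCy v3 form of `Matcher.add`.

```python
def build_theme_patterns(theme_vocabularies):
    """Map each theme name to a list of single-token, case-insensitive patterns."""
    return {
        name: [[{"LOWER": word.lower()}] for word in sorted(words)]
        for name, words in theme_vocabularies.items()
    }

# Two theme vocabularies from the feature-extraction step, keyed by hypothetical labels.
theme_vocabularies = {
    "Birth": {'pregnant', 'neonates', 'women'},
    "Susceptibility": {'populations', 'susceptibility'},
}
theme_patterns = build_theme_patterns(theme_vocabularies)
print(theme_patterns["Birth"])

# Assumed registration with spaCy v3:
#   for name, patterns in theme_patterns.items():
#       m_tool.add(name, patterns)
```
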

# Applying Matcher to the Document

In [None]:
article = nlp(data_access(biomed_path))

---

# Retrieve metadata
sha / source_x / title / doi / pmcid / pubmed_id / authors / journal / Microsoft Academic Paper ID / WHO #Covidence

In [62]:
data = pd.read_csv("2020-03-13/all_sources_metadata_2020-03-13.csv")

In [63]:
data.head(5)

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,CZI,Angiotensin-converting enzyme 2 (ACE2) as a SA...,10.1007/s00134-020-05985-9,,32125455.0,cc-by-nc,,2020,"Zhang, Haibo; Penninger, Josef M.; Li, Yimin; ...",Intensive Care Med,2002765000.0,#3252,True
1,53eccda7977a31e3d0f565c884da036b1e85438e,CZI,Comparative genetic analysis of the novel coro...,10.1038/s41421-020-0147-1,,,cc-by,,2020,"Cao, Yanan; Li, Lin; Feng, Zhimin; Wan, Shengq...",Cell Discovery,3003431000.0,#1861,True
2,210a892deb1c61577f6fba58505fd65356ce6636,CZI,Incubation Period and Other Epidemiological Ch...,10.3390/jcm9020538,,,cc-by,The geographic spread of 2019 novel coronaviru...,2020,"Linton, M. Natalie; Kobayashi, Tetsuro; Yang, ...",Journal of Clinical Medicine,3006065000.0,#1043,True
3,e3b40cc8e0e137c416b4a2273a4dca94ae8178cc,CZI,Characteristics of and Public Health Responses...,10.3390/jcm9020575,,32093211.0,cc-by,"In December 2019, cases of unidentified pneumo...",2020,"Deng, Sheng-Qun; Peng, Hong-Juan",J Clin Med,177663100.0,#1999,True
4,92c2c9839304b4f2bc1276d41b1aa885d8b364fd,CZI,Imaging changes in severe COVID-19 pneumonia,10.1007/s00134-020-05976-w,,32125453.0,cc-by-nc,,2020,"Zhang, Wei",Intensive Care Med,3006643000.0,#3242,False


In [64]:
# data.info()

In [65]:
def met_xiv(data, sha):
    for i, tracksha in enumerate(data['sha']):
        if tracksha == sha:
            print("Title\n{} \n\nAuthors\n{} \n\nSource: {} \n\nPaper ID: {}\n\ndoi: {} \n\npmcid: {} - pubmed_id: {} \n\nJournal: {}\n\n-linked to-\n\nMicrosoft Academic Paper ID: {} \n\nWHO #Covidence: {}".format(
                data['title'][i],
                data['authors'][i],
                data['source_x'][i],
                data['sha'][i],
                data['doi'][i],
                data['pmcid'][i],
                data['pubmed_id'][i],
                data['journal'][i],
                data['Microsoft Academic Paper ID'][i],
                data['WHO #Covidence'][i]))

In [66]:
met_xiv(data, '0015023cc06b5362d332b3baf348d11567ca2fbb')

Title
The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 

Authors
Ward, J. C. J.; Lasecka-Dykes, L.; Neil, C.; Adeyemi, O.; Gold, S.; McLean, N.; Wright, C.; Herod, M. R.; Kealy, D.; Warner, E.; King, D. P.; Tuthill, T. J.; Rowlands, D. J.; Stonehouse, N. J. 

Source: biorxiv 

Paper ID: 0015023cc06b5362d332b3baf348d11567ca2fbb

doi: doi.org/10.1101/2020.01.10.901801 

pmcid: nan - pubmed_id: nan 

Journal: nan

-linked to-

Microsoft Academic Paper ID: nan 

WHO #Covidence: nan
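
Because a PMC record can carry more than one sha, an exact comparison against the whole `sha` cell may miss a paper. The sketch below splits the cell before comparing. The `;` separator, the helper name `find_by_sha`, and the sample rows are assumptions for illustration, so the separator should be checked against the real metadata file.

```python
import csv
from io import StringIO

def find_by_sha(rows, sha, separator=";"):
    """Return the first row whose 'sha' cell contains `sha` (a cell may list several hashes)."""
    for row in rows:
        cell = row.get("sha") or ""
        if sha in [part.strip() for part in cell.split(separator)]:
            return row
    return None

# Made-up rows; hashes and titles are placeholders, not real records.
sample = StringIO(
    "sha,source_x,title\n"
    "aaa111; bbb222,PMC,Paper with supplementary PDF\n"
    "ccc333,biorxiv,Single-PDF preprint\n"
)
rows = list(csv.DictReader(sample))
match = find_by_sha(rows, "bbb222")
print(match["title"] if match else "not found")
```
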


---


In [14]:
factor = {}
for i in processed_docs[:8]:
    vocab[i] = set(i)
    factor.append(vocab)

for i in range(8):
    print(len(factor[i]), factor[i])

TypeError: 'set' object does not support item assignment
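
The `TypeError` comes from `vocab[i] = set(i)`: `vocab` is still the set left over from the earlier cell, and sets do not support item assignment. `factor`, now a dict, also has no `append` method. Below is a minimal sketch of what the cell appears to be aiming for, assuming the goal is a dict keyed by theme index. The two token lists are copied from the preprocessing output shown earlier.

```python
# Token lists for the first and third themes, as produced by preprocess().
processed_docs = [
    ['smoking', 'preexisting', 'pulmonary', 'disease'],
    ['neonates', 'pregnant', 'women'],
]

# Map each theme index to its set of unique tokens.
factor = {index: set(tokens) for index, tokens in enumerate(processed_docs)}

for index, vocab in factor.items():
    print(len(vocab), sorted(vocab))
```
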

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(binary=True)
for i in range(8):
    vec.fit(factor[i])
    print([w for w in sorted(vec.vocabulary_.keys())])

KeyError: 0
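
The `KeyError: 0` follows from the previous cell: `factor` was rebound to an empty dict before that cell failed, so `factor[0]` no longer exists. To illustrate what a binary bag-of-words over a theme vocabulary looks like, here is a plain-Python stand-in. It is not scikit-learn's `CountVectorizer`, and `fit_vocabulary`, `binary_vector`, and the sample sentence are introduced here only for illustration.

```python
import re

def fit_vocabulary(words):
    """Map each word to a column index, in alphabetical order."""
    return {word: column for column, word in enumerate(sorted(set(words)))}

def binary_vector(text, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in `text`."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    vector = [0] * len(vocabulary)
    for word, column in vocabulary.items():
        if word in tokens:
            vector[column] = 1
    return vector

vocabulary = fit_vocabulary({'pregnant', 'neonates', 'women'})
vector = binary_vector("Outcomes in pregnant women", vocabulary)  # made-up sentence
print(sorted(vocabulary), vector)
```
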

# Sentences coordinates >> abstract