# COVID-19 Open Research Dataset Challenge (CORD-19)
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House

### Task: What do we know about COVID-19 risk factors?

---

### Information about the data

1. Metadata for papers from these sources is combined: CZI, PMC, bioRxiv/medRxiv
(29500 records in total):
    - CZI 1236 records
    - PMC 27337
    - bioRxiv 566
    - medRxiv 361
2. 17K of the paper records have PDFs; the hashes of the PDFs are in 'sha'.<br>
3. For PMC-sourced papers, one paper's metadata can be associated with one or more PDFs/shas: one PDF/sha corresponding to the main article, and possibly additional PDFs/shas corresponding to supporting materials for the article.<br>
4. Full text was extracted from 13K of the PDFs ('has_full_text'=True).<br>
5. Various 'keys' are populated with the metadata:
    - 'pmcid': populated for all PMC paper records (27337 non null)
    - 'doi': populated for all bioRxiv/medRxiv paper records and most of the other records (26357 non null)
    - 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non null)
    - 'pubmed_id': populated for some of the records
    - 'Microsoft Academic Paper ID': populated for some of the records
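
Point 3 means that the 'sha' value of a single record may hold several hashes. A minimal sketch of how such a value could be split, assuming the hashes are separated by a semicolon; the separator and the helper name `split_shas` are assumptions, not something stated in the list above:

```python
def split_shas(sha_field):
    """Return the list of PDF hashes stored in one 'sha' metadata value."""
    # Missing values (None, or NaN when read with pandas) mean "no PDF".
    if not isinstance(sha_field, str):
        return []
    # Assumption: several hashes are separated by ';' with optional spaces.
    return [part.strip() for part in sha_field.split(";") if part.strip()]


print(split_shas("abc123; def456"))  # main article plus one supplement
print(split_shas(None))              # record without a PDF
```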
---
- Commercial use subset (includes PMC content) -- 9118 full text (new: 128), 183Mb
- Non-commercial use subset (includes PMC content) -- 2353 full text (new: 385), 41Mb
- Custom license subset -- 16959 full text (new: 15533), 345Mb
- bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 885 full text (new: 110), 14Mb
- Metadata file -- 60Mb

---
**Chan Zuckerberg Initiative (CZI)**<br>
**PubMed Central (PMC)** is a free digital repository that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.<br>
**bioRxiv** (pronounced "bio-archive") is an open-access preprint repository for the biological sciences.<br>
**medRxiv** (pronounced "med-archive") is a preprint service for the medicine and health sciences that provides a free online platform for researchers to share, comment, and receive feedback on their work. Information among scientists spreads slowly, and often incompletely.

[More details on dataset](https://pages.semanticscholar.org/coronavirus-research)

---

# Patternizing the themes
These are the themes we have to find among the 29500 records.

## Transform the required themes

In [1]:
import pandas as pd
from io import StringIO

### Initial form of the themes:
> What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?<br>
<br>Specifically, we want to know what the literature reports about:
1. Data on potential risk factors
    - Smoking, pre-existing pulmonary disease
    - Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
    - Neonates and pregnant women
    - Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
2. Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
3. Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
4. Susceptibility of populations
5. Public health mitigation measures that could be effective for control

### Keeping a record of the themes by exporting them as .csv

In [2]:
riskfac = StringIO("""Factor;Description
    Pulmonary risk;Smoking, preexisting pulmonary disease
    Infection risk;Coinfections determine whether coexisting respiratory or viral infections make the virus more transmissible or virulent and other comorbidities
    Birth risk;Neonates and pregnant women
    Socio-economic risk;Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences
    Transmission;Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
    Severity;Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
    Susceptibility;Susceptibility of populations
    mitig measures;Public health mitigation measures that could be effective for control
    """)

rf_base = pd.read_csv(riskfac, sep=";")
rf_base

| | Factor | Description |
|---|---|---|
| 0 | Pulmonary risk | Smoking, preexisting pulmonary disease |
| 1 | Infection risk | Coinfections determine whether coexisting resp... |
| 2 | Birth risk | Neonates and pregnant women |
| 3 | Socio-economic risk | Socio-economic and behavioral factors to under... |
| 4 | Transmission | Transmission dynamics of the virus, including ... |
| 5 | Severity | Severity of disease, including risk of fatalit... |
| 6 | Susceptibility | Susceptibility of populations |
| 7 | mitig measures | Public health mitigation measures that could b... |


In [3]:
# export the factors and descriptions to keep a record of them
rf_base.to_csv(r'2020-03-13/rf_base.csv', index=False)

### Loading the themes

In [4]:
data = pd.read_csv('2020-03-13/rf_base.csv', delimiter=',', header=None, skiprows=1, names=['Factor','Description'])
# We only need the Description column from the data
descp = data[:8][['Description']].copy()  # copy so that adding a column does not touch `data`
descp['index'] = descp.index
descp

| | Description | index |
|---|---|---|
| 0 | Smoking, preexisting pulmonary disease | 0 |
| 1 | Coinfections determine whether coexisting resp... | 1 |
| 2 | Neonates and pregnant women | 2 |
| 3 | Socio-economic and behavioral factors to under... | 3 |
| 4 | Transmission dynamics of the virus, including ... | 4 |
| 5 | Severity of disease, including risk of fatalit... | 5 |
| 6 | Susceptibility of populations | 6 |
| 7 | Public health mitigation measures that could b... | 7 |


## Tokenization of the themes with [ScispaCy package](https://allenai.github.io/scispacy/)

### Preview of themes after preprocessing
ScispaCy provides spaCy models for biomedical text processing, made by the Allen Institute for AI ([AI2](https://alleninstitute.org)).

### Defining Patterns

In [5]:
import scispacy
import spacy
from spacy import displacy


nlp = spacy.load("en_core_sci_md")  # "en_core_sci_md": larger biomedical vocabulary, with word vectors

def patternizing(dataF):
    # Print the entities ScispaCy finds in each of the 8 theme descriptions
    for i in range(8):
        # First column of the row whose 'index' equals i, i.e. the Description text
        text = dataF[dataF['index'] == i].values[0][0]

        doc = nlp(text)

        print(doc.ents)

patternizing(descp)

(Smoking, pulmonary disease)
(Coinfections, coexisting, respiratory, viral infections, virus, transmissible, virulent, comorbidities)
(Neonates, pregnant women)
(Socio-economic, behavioral factors, economic impact, virus)
(Transmission, dynamics, virus, basic reproductive number, incubation, serial interval, modes, transmission, environmental factors)
(Severity of disease, risk, fatality, symptomatic, hospitalized, patients, high-risk, patient)
(Susceptibility, populations)
(Public health mitigation, measures, effective, control)


In [17]:
def pRnize(dataF, indice):
    mastlist = []
    n_cov2 = ['covid19', 'covid-19',
              'Covid19', 'Covid-19',
              'COVID19', 'COVID-19',
              'Sars-Cov-2', 'Sars-CoV-2', 'Sars-COV-2', 'Sars-cov-2',
              'SARS-Cov-2', 'SARS-CoV-2', 'SARS-COV-2', 'SARS-cov-2',
              'sars-Cov-2', 'sars-CoV-2', 'sars-COV-2', 'sars-cov-2',
              'Sars Cov-2', 'Sars CoV-2', 'Sars COV-2', 'Sars cov-2',
              'SARS Cov-2', 'SARS CoV-2', 'SARS COV-2', 'SARS cov-2',
              'sars Cov-2', 'sars CoV-2', 'sars COV-2', 'sars cov-2',
              'Sars-Cov 2', 'Sars-CoV 2', 'Sars-COV 2', 'Sars-cov 2',
              'SARS-Cov 2', 'SARS-CoV 2', 'SARS-COV 2', 'SARS-cov 2',
              'sars-Cov 2', 'sars-CoV 2', 'sars-COV 2', 'sars-cov 2',
              'Sars Cov 2', 'Sars CoV 2', 'Sars COV 2', 'Sars cov 2',
              'SARS Cov 2', 'SARS CoV 2', 'SARS COV 2', 'SARS cov 2',
              'sars Cov 2', 'sars CoV 2', 'sars COV 2', 'sars cov 2',
              'Sars Cov2', 'Sars CoV2', 'Sars COV2', 'Sars cov2',
              'SARS Cov2', 'SARS CoV2', 'SARS COV2', 'SARS cov2',
              'sars Cov2', 'sars CoV2', 'sars COV2', 'sars cov2',]
    
    for i in range(8):
        factor = []
        theme_sample = dataF[dataF['index'] == i].values[0][0]

        text = theme_sample

        doc = nlp(text) 
        
        for item in doc.ents:
            vocab = str(item).lower().strip('()')
            
            factor.append(vocab)
        
        for name in n_cov2:
            factor.append(name)
        mastlist.append(factor)
    return mastlist[indice]

#pRnize(descp, 0)

In [16]:
for j in (pRnize(descp, i) for i in range(8)):
    print(j)

['smoking', 'pulmonary disease', 'covid19', 'covid-19', 'Covid19', 'Covid-19', 'COVID19', 'COVID-19', 'Sars-Cov-2', 'Sars-CoV-2', 'Sars-COV-2', 'Sars-cov-2', 'SARS-Cov-2', 'SARS-CoV-2', 'SARS-COV-2', 'SARS-cov-2', 'sars-Cov-2', 'sars-CoV-2', 'sars-COV-2', 'sars-cov-2', 'Sars Cov-2', 'Sars CoV-2', 'Sars COV-2', 'Sars cov-2', 'SARS Cov-2', 'SARS CoV-2', 'SARS COV-2', 'SARS cov-2', 'sars Cov-2', 'sars CoV-2', 'sars COV-2', 'sars cov-2', 'Sars-Cov 2', 'Sars-CoV 2', 'Sars-COV 2', 'Sars-cov 2', 'SARS-Cov 2', 'SARS-CoV 2', 'SARS-COV 2', 'SARS-cov 2', 'sars-Cov 2', 'sars-CoV 2', 'sars-COV 2', 'sars-cov 2', 'Sars Cov 2', 'Sars CoV 2', 'Sars COV 2', 'Sars cov 2', 'SARS Cov 2', 'SARS CoV 2', 'SARS COV 2', 'SARS cov 2', 'sars Cov 2', 'sars CoV 2', 'sars COV 2', 'sars cov 2', 'Sars Cov2', 'Sars CoV2', 'Sars COV2', 'Sars cov2', 'SARS Cov2', 'SARS CoV2', 'SARS COV2', 'SARS cov2', 'sars Cov2', 'sars CoV2', 'sars COV2', 'sars cov2']
['coinfections', 'coexisting', 'respiratory', 'viral infections', 'vir
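
The long list of spelling variants above can also be expressed as one case-insensitive pattern. A minimal sketch using Python's `re` module; the regular expression and the helper name `mentions_covid` are illustrative assumptions, not part of the original notebook, and the pattern is slightly broader than the list (it also accepts "covid 19" and "sarscov2"):

```python
import re

# One case-insensitive pattern for "covid19"/"covid-19" and for
# "sars cov 2" written with a hyphen, a space, or nothing between the parts.
COVID_RE = re.compile(r"\b(?:covid[- ]?19|sars[- ]?cov[- ]?2)\b", re.IGNORECASE)


def mentions_covid(text):
    """Return every COVID-19 / SARS-CoV-2 mention found in text."""
    return COVID_RE.findall(text)


print(mentions_covid("Early SARS-CoV-2 data; Covid19 risk; sars cov2 spread."))
```

Compiling the pattern once and reusing it avoids appending the same 66 strings to every theme's keyword list.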

**Because of the ScispaCy model, 'pulmonary disease' is recognized as a single entity, even though it spans two tokens.**

In [82]:
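# Helper whose repr is its bare name, so that patterns print as {LOWER: ...}.
# Note that spaCy's Matcher itself expects the plain string key 'LOWER'.
class token(object):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name
    def __str__(self):
        return self.name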

In [86]:
def pRnize(dataF, indice):
    mastlist = []
    vocab = {}
    for i in range(8):
        factor = []
        theme_sample = dataF[dataF['index'] == i].values[0][0]

        text = theme_sample

        doc = nlp(text) 
        
        for item in doc.ents:
            vocab = [{token('LOWER'): str(item).lower().strip('()')}]
            
            
            factor.append(vocab)
        mastlist.append(factor)
    return mastlist[indice]

pRnize(descp, 0)

[[{LOWER: 'smoking'}], [{LOWER: 'pulmonary disease'}]]

In [87]:
for j in (pRnize(descp, i) for i in range(8)):
    print(j)

[[{LOWER: 'smoking'}], [{LOWER: 'pulmonary disease'}]]
[[{LOWER: 'coinfections'}], [{LOWER: 'coexisting'}], [{LOWER: 'respiratory'}], [{LOWER: 'viral infections'}], [{LOWER: 'virus'}], [{LOWER: 'transmissible'}], [{LOWER: 'virulent'}], [{LOWER: 'comorbidities'}]]
[[{LOWER: 'neonates'}], [{LOWER: 'pregnant women'}]]
[[{LOWER: 'socio-economic'}], [{LOWER: 'behavioral factors'}], [{LOWER: 'economic impact'}], [{LOWER: 'virus'}]]
[[{LOWER: 'transmission'}], [{LOWER: 'dynamics'}], [{LOWER: 'virus'}], [{LOWER: 'basic reproductive number'}], [{LOWER: 'incubation'}], [{LOWER: 'serial interval'}], [{LOWER: 'modes'}], [{LOWER: 'transmission'}], [{LOWER: 'environmental factors'}]]
[[{LOWER: 'severity of disease'}], [{LOWER: 'risk'}], [{LOWER: 'fatality'}], [{LOWER: 'symptomatic'}], [{LOWER: 'hospitalized'}], [{LOWER: 'patients'}], [{LOWER: 'high-risk'}], [{LOWER: 'patient'}]]
[[{LOWER: 'susceptibility'}], [{LOWER: 'populations'}]]
[[{LOWER: 'public health mitigation'}], [{LOWER: 'measures'}], [{L
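
spaCy's `Matcher` compares each pattern dict against a single token, so a multi-word entity such as 'pulmonary disease' needs one dict per token. A minimal sketch of that conversion, assuming a plain whitespace split is close enough to the tokenizer's output (it may not be for hyphenated terms such as 'high-risk'); `entity_to_pattern` is a hypothetical helper, not part of the notebook:

```python
def entity_to_pattern(entity_text):
    """Turn an entity string into a token-by-token Matcher pattern."""
    # Assumption: splitting on whitespace approximates spaCy's tokenization.
    return [{"LOWER": word} for word in entity_text.lower().split()]


entities = ["Smoking", "pulmonary disease"]
patterns = [entity_to_pattern(e) for e in entities]
print(patterns)
```

The keys here are plain strings, which is the form the `Matcher` accepts.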

### Patterns are ready!

---

# Phrase matching

In [9]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_sci_md")
m_tool = Matcher(nlp.vocab)

In [20]:
smoking = [{'LOWER': 'smoking'}] 
pulmonary_dis = [{'LOWER': 'pulmonary'}, {'LOWER': 'disease'}]
lung = [{'LOWER': 'lung'}]

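# spaCy v2 call signature; from spaCy v3 on, the patterns are passed as one list:
# m_tool.add('Pulmonary', [smoking, pulmonary_dis, lung])
m_tool.add('Pulmonary', None, smoking, pulmonary_dis, lung)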

In [21]:
text = nlp(u'Chronic obstructive pulmonary disease (COPD) is a type of obstructive lung disease characterized by long-term breathing problems and poor airflow. The main symptoms include shortness of breath and cough with sputum production. COPD is a progressive disease, meaning it typically worsens over time. Smoking or smoke is a problem too.')

In [22]:
phrase_matches = m_tool(text)
print(phrase_matches)

[(626681949659846149, 2, 4), (626681949659846149, 12, 13), (626681949659846149, 49, 50)]


In [23]:
for match_id, start, end in phrase_matches:
    string_id = nlp.vocab.strings[match_id]  
    span = text[start:end]                   
    print(match_id, string_id, start, end, span.text)

626681949659846149 Pulmonary 2 4 pulmonary disease
626681949659846149 Pulmonary 12 13 lung
626681949659846149 Pulmonary 49 50 Smoking
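
For readers on a recent spaCy release, the same kind of match can be sketched with the list-based `Matcher.add` signature. This example assumes spaCy v3 is installed and uses `spacy.blank("en")` (tokenizer only) instead of `en_core_sci_md`, so token boundaries may differ from the ScispaCy output on other texts; the shortened sentence is illustrative:

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only English pipeline; no model download is needed.
nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)

# Each inner list is one pattern; each dict describes one token.
patterns = [
    [{"LOWER": "smoking"}],
    [{"LOWER": "pulmonary"}, {"LOWER": "disease"}],
    [{"LOWER": "lung"}],
]
matcher.add("Pulmonary", patterns)

doc = nlp_blank("Chronic obstructive pulmonary disease is a type of "
                "obstructive lung disease. Smoking is a problem too.")

# Sort by position so the result does not depend on the matcher's order.
found = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
for start, end, text in found:
    print(start, end, text)
```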


# Applying Matcher to the Document

In [215]:
import os
import json
import glob

# paths to the JSON files in each folder
biomed_path = "2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/"  # bio and med archive
commu_path = "2020-03-13/comm_use_subset/comm_use_subset/"
noncom_path = "2020-03-13/noncomm_use_subset/noncomm_use_subset/"
pmc_path = "2020-03-13/pmc_custom_license/pmc_custom_license/"

biomed_fo = "2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/*.json"  # bio and med archive
commu_fo = "2020-03-13/comm_use_subset/comm_use_subset/*.json"
noncom_fo = "2020-03-13/noncomm_use_subset/noncomm_use_subset/*.json"
pmc_fo = "2020-03-13/pmc_custom_license/pmc_custom_license/*.json"

# JSON file access function: maps each paper_id to its full body text
def data_access(path):
    d_acc = {}
    for i in glob.glob(path):
        # load one JSON file
        with open(os.path.normpath(i)) as json_file:
            data = json.load(json_file)
        paper_id = data['paper_id']

        # join all body paragraphs; assigning inside a loop over
        # data['body_text'] would keep only the last paragraph
        d_acc[paper_id] = ' '.join(item['text'] for item in data['body_text'])
    return d_acc
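
Each JSON file stores the body as a list of paragraph objects under 'body_text'. A minimal sketch of collecting the whole body of one record, using a small made-up record in place of a real file; the record contents and the helper name `full_body` are illustrative assumptions:

```python
import json

# Made-up record with the same keys the loader above reads.
raw = json.dumps({
    "paper_id": "example-id",
    "body_text": [
        {"text": "Smoking was reported as a risk factor."},
        {"text": "Patients with lung disease were included."},
    ],
})


def full_body(record):
    """Join every paragraph of a record into one string."""
    return " ".join(item["text"] for item in record["body_text"])


record = json.loads(raw)
papers = {record["paper_id"]: full_body(record)}
print(papers)
```

Keeping one string per paper means a later matcher call sees the whole article rather than a single paragraph.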

In [216]:
# data_access(biomed_fo)

In [224]:
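def text_param(text_path, m_tool):
    # Run the matcher over the body text of every paper found under text_path.
    # Iterating over .items() gives the text; iterating over the dict alone
    # would only give the paper ids.
    for paper_id, body in data_access(text_path).items():
        article = nlp(body)

        phrase_matches = m_tool(article)
        print(paper_id, phrase_matches)

In [225]:
text_param(pmc_fo, m_tool)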

In [130]:
phrase_matches = m_tool(article)
print(phrase_matches)

[]


In [None]:
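from spacy.matcher import PhraseMatcher
from termcolor import cprint  # assumption: cprint comes from the termcolor package

nlp = spacy.load("en_core_web_sm")

def match_it(theme, xviv):
    phMatch = PhraseMatcher(nlp.vocab)

    # key_per_theme and fact_name are not defined in the cells shown here;
    # the lookup is expected to return the list of keywords for the theme
    p = key_per_theme(fact_name, descp)[theme]

    patterns = [nlp(i) for i in p]

    # spaCy v2 signature; from spaCy v3 on: phMatch.add(theme, patterns)
    phMatch.add(theme, None, *patterns)

    doc = nlp(xviv)

    mat = phMatch(doc)

    for match_id, start, end in mat:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        # the match plus the 8 tokens that follow it, for context
        spant = doc[start:end + 8]

        print("\n\033[34mTHEME:\033[00m", string_id, "-\033[32mKEYWORDS\033[00m:", span.text)
        cprint(spant.text, 'grey', attrs=['bold'], end='')
        print()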
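
The slice `doc[start:end+8]` keeps the matched keywords plus the eight tokens after them. The same idea on a plain token list, as a sketch; `context_window` and its `after` parameter are hypothetical names introduced for illustration:

```python
def context_window(tokens, start, end, after=8):
    """Return the matched tokens plus up to `after` following tokens."""
    # Python slices clamp at the end of the list, so no bounds check is needed.
    return tokens[start:end + after]


tokens = "Smoking is a risk factor for severe disease".split()
print(context_window(tokens, 0, 1, after=3))
```

Because the slice is clamped, a match near the end of a document simply yields a shorter window.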