# COVID-19 Open Research Dataset Challenge (CORD-19)
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House

### Task: What do we know about COVID-19 risk factors?

---

### About the data

**Information** 
1. **Metadata** for papers from these sources are combined: CZI, PMC, BioRxiv/MedRxiv.
(total records 29500)
    - **CZI** 1236 records
    - **PMC** 27337
    - **bioRxiv** 566
    - **medRxiv** 361
2. 17K of the paper records have PDFs and the hash of the PDFs are in 'sha'<br>
3. For PMC sourced papers, one paper's metadata can be associated with one or more PDFs/shas under that paper - a PDF/sha correponding to the main article, and possibly additional PDF/shas corresponding to supporting materials for the article.<br>
4. 13K of the PDFs were processed with fulltext ('has_full_text'=True)<br>
5. Various 'keys' are populated with the metadata:
    - 'pmcid': populated for all PMC paper records (27337 non null)
	- 'doi': populated for all BioRxiv/MedRxiv paper records and most of the other records (26357 non null)
	- 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non null)
	- 'pubmed_id': populated for some of the records
	- 'Microsoft Academic Paper ID': populated for some of the records
---
Glossary:<br>
**Chan Zuckerberg Initiative (CZI)**<br>
**PubMed Central (PMC)** is a free digital repository that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.<br>
**BioRxiv** (pronounced "bio-archive") is an open access preprint repository for the biological sciences<br>
**medRxiv. medRxiv** (pronounced med archive) is a preprint service for the medicine and health sciences and provides a free online platform for researchers to share, comment, and receive feedback on their work. Information among scientists spreads slowly, and often incompletely.

---
**The provided data are organized as followed**<br>
- Commercial use subset (includes PMC content) -- 9118 full text (new: 128), 183Mb
- Non-commercial use subset (includes PMC content) -- 2353 full text (new: 385), 41Mb
- Custom license subset -- 16959 full text (new: 15533), 345Mb
- bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 885 full text (new: 110), 14Mb
- Metadata file -- 60Mb

  #### [More details on dataset](https://pages.semanticscholar.org/coronavirus-research)

---

# The Appproach

This is a straight forward approach that used Allen Institute For AI SciSpacy model.<br>
- Step 1: transform the quoted factors (see. initial brief) in patterns via SciSpaCy. They will be called "Theme".<br>
- Step 2: then, match the Themes among the 29500 articles (XIV) due to SpaCy model. Here, we retrieve THEME, KEYWORD, PAPER_ID.<br>
- Step 3: from there TITLE, AUTHORS, SOURCE, ... are available if the data are available in the Metadata document.
<br><br>
1. Pros:
- provide quote, paper id, title, authors from article when the wanted 'Theme' is detected,
- straight forward approach especially in the rush context,
- use of models from Allen Institute of AI,
- prune the volume of documents when searching a specific topic.
<br><br>
2. Cons:
- basic approach: does not provide text summarization nor sentiment analysis,
- when matching/extraction actions are done, some manual actions are required.

## Patternizing the themes
This is what we have to find among the 29 500 archives.

## Transform the required themes

In [92]:
import pandas as pd
from io import StringIO

### Initial form of the themes:
> What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?<br>
<br>Specifically, we want to know what the literature reports about:
1. Data on potential risks factors
    - Smoking, pre-existing pulmonary disease
    - Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
    - Neonates and pregnant women
    - Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
2. Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
3. Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
4. Susceptibility of populations
5. Public health mitigation measures that could be effective for control

### Keeping track of record of the themes by exporting them as .csv

In [93]:
riskfac = StringIO("""Factor;Description
    Pulmonary;Smoking, preexisting pulmonary disease
    Infection;Coinfections determine whether coexisting respiratory or viral infections make the virus more transmissible or virulent and other comorbidities
    Birth;Neonates and pregnant women
    Socio-eco;Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences
    Transmission;Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
    Severity;Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
    Susceptibility;Susceptibility of populations
    Mitig-measures;Public health mitigation measures that could be effective for control
    """)

rf_base = pd.read_csv(riskfac, sep= ";")
rf_base

Unnamed: 0,Factor,Description
0,Pulmonary,"Smoking, preexisting pulmonary disease"
1,Infection,Coinfections determine whether coexisting resp...
2,Birth,Neonates and pregnant women
3,Socio-eco,Socio-economic and behavioral factors to under...
4,Transmission,"Transmission dynamics of the virus, including ..."
5,Severity,"Severity of disease, including risk of fatalit..."
6,Susceptibility,Susceptibility of populations
7,Mitig-measures,Public health mitigation measures that could b...


In [94]:
# exporting factors and description to save it.
rf_base.to_csv(r'2020-03-13/rf_base.csv', index = False)

### Loading the themes' descriptions

In [95]:
data = pd.read_csv('2020-03-13/rf_base.csv', delimiter=',', header=None, skiprows=1, names=['Factor','Description'])
# We only need the Headlines text column from the data
descp = data[:8][['Description']];
descp['index'] = descp.index
descp

Unnamed: 0,Description,index
0,"Smoking, preexisting pulmonary disease",0
1,Coinfections determine whether coexisting resp...,1
2,Neonates and pregnant women,2
3,Socio-economic and behavioral factors to under...,3
4,"Transmission dynamics of the virus, including ...",4
5,"Severity of disease, including risk of fatalit...",5
6,Susceptibility of populations,6
7,Public health mitigation measures that could b...,7


### Loading the themes' designation

In [96]:
fact_name = data[:8][['Factor']];
fact_name['index'] = fact_name.index
fact_name

Unnamed: 0,Factor,index
0,Pulmonary,0
1,Infection,1
2,Birth,2
3,Socio-eco,3
4,Transmission,4
5,Severity,5
6,Susceptibility,6
7,Mitig-measures,7


## Tokenization of the themes with [ScispaCy package](https://allenai.github.io/scispacy/)

### Preview of themes after preprocessing
Sci-SpaCy are models for biomedical text processing made by Allen Institute [AI2](https://alleninstitute.org)

### Defining Patterns

In [97]:
import scispacy
import spacy
from spacy import displacy


nlp = spacy.load("en_core_sci_md") # "en_core_sci_md" larger biodmedical vocab. word vector

def patternizing(dataF):
    for i in range(8):
        theme_sample = dataF[dataF['index'] == i].values[0][0]
        
        text = theme_sample
        # print(theme_sample)

        doc = nlp(text)
       
        # print(list(doc.sents))
        # print(doc.ents)
        
        displacy.render(next(doc.sents), style='ent', jupyter=True)
patternizing(descp)

**Due to Sci-SpaCy model, for e.g.:'pulmonary disease' is considered as a token.** This is not the case with "basic" SpaCy.

### Patterns into lists and then related to their name Theme.

In [98]:
nlp = spacy.load("en_core_sci_md")

def pRnize(dataF, indice):
    
    mastlist = []
    """
    n_cov2 = ['covid19', 'covid-19',
              'Covid19', 'Covid-19',
              'COVID19', 'COVID-19',
              'Sars-Cov-2', 'Sars-CoV-2', 'Sars-COV-2', 'Sars-cov-2',
              'SARS-Cov-2', 'SARS-CoV-2', 'SARS-COV-2', 'SARS-cov-2',
              'sars-Cov-2', 'sars-CoV-2', 'sars-COV-2', 'sars-cov-2',
              'Sars Cov-2', 'Sars CoV-2', 'Sars COV-2', 'Sars cov-2',
              'SARS Cov-2', 'SARS CoV-2', 'SARS COV-2', 'SARS cov-2',
              'sars Cov-2', 'sars CoV-2', 'sars COV-2', 'sars cov-2',
              'Sars-Cov 2', 'Sars-CoV 2', 'Sars-COV 2', 'Sars-cov 2',
              'SARS-Cov 2', 'SARS-CoV 2', 'SARS-COV 2', 'SARS-cov 2',
              'sars-Cov 2', 'sars-CoV 2', 'sars-COV 2', 'sars-cov 2',
              'Sars Cov 2', 'Sars CoV 2', 'Sars COV 2', 'Sars cov 2',
              'SARS Cov 2', 'SARS CoV 2', 'SARS COV 2', 'SARS cov 2',
              'sars Cov 2', 'sars CoV 2', 'sars COV 2', 'sars cov 2',
              'Sars Cov2', 'Sars CoV2', 'Sars COV2', 'Sars cov2',
              'SARS Cov2', 'SARS CoV2', 'SARS COV2', 'SARS cov2',
              'sars Cov2', 'sars CoV2', 'sars COV2', 'sars cov2',]
    """
    for i in range(8):
        factor = []
        theme_sample = dataF[dataF['index'] == i].values[0][0]

        text = theme_sample

        doc = nlp(text) 
        
        for item in doc.ents:
            vocab = str(item).lower().strip('()')
            factor.append(vocab)
        
        #for name in n_cov2:
         #   factor.append(name)
        mastlist.append(factor)
    return mastlist[indice]

# To test unquote
#pRnize(descp, 0)

**Here none of the COVID-19 designation are not been include in the patterns.<br>
If needed, just quote the list and the loop related to n_cov2.**

### Dictionnary {Theme: patterns} 

In [99]:
def key_per_theme(dataF, word):
    dic = {}
    for i in range(8):
        factor = dataF[dataF['index'] == i].values[0][0]
        wordy = pRnize(word, i)
    
        dic[factor.strip()] = wordy
    return dic

#### Preview of Themes

In [100]:
key_per_theme(fact_name, descp)
#for bo in key_per_theme(fact_name, descp):
#    print(bo)
#    print(key_per_theme(fact_name, descp)[bo])

{'Pulmonary': ['smoking', 'pulmonary disease'],
 'Infection': ['coinfections',
  'coexisting',
  'respiratory',
  'viral infections',
  'virus',
  'transmissible',
  'virulent',
  'comorbidities'],
 'Birth': ['neonates', 'pregnant women'],
 'Socio-eco': ['socio-economic',
  'behavioral factors',
  'economic impact',
  'virus'],
 'Transmission': ['transmission',
  'dynamics',
  'virus',
  'basic reproductive number',
  'incubation',
  'serial interval',
  'modes',
  'transmission',
  'environmental factors'],
 'Severity': ['severity of disease',
  'risk',
  'fatality',
  'symptomatic',
  'hospitalized',
  'patients',
  'high-risk',
  'patient'],
 'Susceptibility': ['susceptibility', 'populations'],
 'Mitig-measures': ['public health mitigation',
  'measures',
  'effective',
  'control']}

### Patterns are ready!

---

# Retrieve the text from articles (xiv)
recall that the xiv are dispatched among different sources: Biomed, Commercial, ...

In [101]:
import os
import json
import glob

In [102]:
# json file access function
def data_access(path):
    d_acc = {}
    for i in glob.glob(path):
        # link = os.path.normpath(i)
        # print(link)
        
        # loading json file function
        with open(os.path.normpath(i)) as json_file:
            data = json.load(json_file)
            paper_id = data['paper_id']
            
            # text = [item['text'] for item in data['body_text']]
            for item in data['body_text']:
                text = (item['text'])
                
                d_acc[paper_id] = text
                
    return d_acc

In [103]:
# json files'path from each folder

# path if needed to check just one article.
biomed_path = "2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/"  # bio and med archive
commu_path = "2020-03-13/comm_use_subset/comm_use_subset/"
noncom_path = "2020-03-13/noncomm_use_subset/noncomm_use_subset/"
pmc_path = "2020-03-13/pmc_custom_license/pmc_custom_license/"

# path if needed to check over all the folder.
biomed_fo = "2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/*.json"  # bio and med archive
commu_fo = "2020-03-13/comm_use_subset/comm_use_subset/*.json"
noncom_fo = "2020-03-13/noncomm_use_subset/noncomm_use_subset/*.json"
pmc_fo = "2020-03-13/pmc_custom_license/pmc_custom_license/*.json"


In [111]:
#xiv = data_access(biomed_fo)
#for i in xiv:
#    print(i)
    # match_it('Pulmonary', xiv[i])

### Data extraction from Metadata doc

This additional infos on article that are included into the Metadata doc.<br>
sha / source_x / title / doi / pmcid / pubmed_id / authors / journal / Microsot A.P ID / Who # covidence

In [105]:
metadata = pd.read_csv("2020-03-13/all_sources_metadata_2020-03-13.csv")
# metadata.head(5)

In [106]:
def met_xiv(metadata, sha):
    
    sha = str(sha)
        
    for i, tracksha in enumerate(metadata['sha']):
        if tracksha == sha:
            print("Title\n{} \n\nAuthors\n{} \n\nSource: {} \n\nPaper ID: {}\n\ndoi: {} \n\npmcid: {} - pubmed_id: {} \n\nJournal: {}\n\n-linked to-\n\nMicrosoft Academic Paper ID: {} \n\nWHO #Covidence: {}".format(
                                                                                                                                                                                                 metadata['title'][i],
                                                                                                                                                                                                 metadata['authors'][i],
                                                                                                                                                                                                 metadata['source_x'][i],
                                                                                                                                                                                                 metadata['sha'][i],
                                                                                                                                                                                                 metadata['doi'][i],
                                                                                                                                                                                                 metadata['pmcid'][i],
                                                                                                                                                                                                 metadata['pubmed_id'][i],
                                                                                                                                                                                                 
                                                                                                                                                                                                 metadata['journal'][i],
                                                                                                                                                                                                 
                                                                                                                                                                                                 metadata['Microsoft Academic Paper ID'][i],
                                                                                                                                                                                                 metadata['WHO #Covidence'][i]))
# met_xiv(metadata, '0015023cc06b5362d332b3baf348d11567ca2fbb')

---

# Matching Part

### Match the theme within a folder of xiv's.
display Theme / Keywords / Paper_id / Quote

In [107]:
from IPython.core.display import display, HTML
# from termcolor import colored, cprint 

from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher

In [108]:
nlp = spacy.load("en_core_sci_md")

def match_it(theme, xiv):
    
    phMatch = PhraseMatcher(nlp.vocab)
    
    article = data_access(xiv)
    
    
    p = key_per_theme(fact_name, descp)[theme]
    if len(p)==0:
        print('no patterns')
        
    patterns = [nlp(i) for i in p]

    phMatch.add(theme, None, *patterns)

    for num_id in article:
        paper_id = num_id
        
        doc = nlp(article[num_id])
        
        mat = phMatch(doc)
        # print(mat)
        
        for match_id, start, end in mat:
            string_id = nlp.vocab.strings[match_id]
            if len(string_id) == 0:
                print("No Result")
            else:
                span = doc[start:end]
                spant = doc[(start) : (end+20)]

                print("\nTHEME: \033[34m{}\033[00m - KEYWORDS: \033[32m{}\033[00m\n\nQUOTE: \033[0;37;40m{}\033[00m\n\nPAPER_ID:{}".format(string_id,
                                                                                                                     span.text, spant.text, paper_id))

            
                print()


### These are the exact spelling of the themes that has to be used. Each of them contains the keywords we had within the initial briefing.

- Pulmonary
- Infection
- Birth
- Socio-eco
- Transmission
- Severity
- Susceptibility
- mitig-measures

### These are the exact spelling of the Xiv folders path which contains the 29500 articles.

- biomed_fo: bioRxiv/medRxiv subset (pre-prints that are not peer reviewed)
- commu_fo: Commercial use subset (includes PMC content)
- noncom_fo: Non-commercial use subset (includes PMC content)
- pmc_fo: Custom license subset


### Make the query of the desired theme over folder of xiv.

In [109]:
match_it('Susceptibility', biomed_fo)


THEME: [34mSusceptibility[00m - KEYWORDS: [32msusceptibility[00m

QUOTE: [0;37;40msusceptibility or infectiousness. That means that we are not able to investigate the impact of school closures or the impact[00m

PAPER_ID:05d99c07db59b6948e39bfa62c2cbbf62944059a


THEME: [34mSusceptibility[00m - KEYWORDS: [32mpopulations[00m

QUOTE: [0;37;40mpopulations were closed after Jan. 23rd, 2020, using these values for 2019 correctly describes the return of the population[00m

PAPER_ID:0b282573f5c63c943021c10ca39a1ed21acfb429


THEME: [34mSusceptibility[00m - KEYWORDS: [32msusceptibility[00m

QUOTE: [0;37;40msusceptibility, we are effectively generating a matrix R i,j which determines the distribution of secondary confirmed cases in[00m

PAPER_ID:2536b83acf84368d7c13be81fe07aa0575115da7


THEME: [34mSusceptibility[00m - KEYWORDS: [32mpopulations[00m

QUOTE: [0;37;40mpopulations were determined by TCID50 and those of ribavirin-treated populations were normalized to mock-tr

### Additional info about the paper found.

PAPER_ID:564f8823050b52b5f5c36638ac1ae07557963f36 seems to match a lot.<br>
Here is more info about it:

In [110]:
met_xiv(metadata,'564f8823050b52b5f5c36638ac1ae07557963f36')

Title
Characterisation of the faecal virome of captive and wild Tasmanian devils using virus-like particles metagenomics and meta-transcriptomics 

Authors
Chong, R.; Shi, M.; Grueber, C. E.; Holmes, E. C.; Hogg, C.; Belov, K.; Barrs, V. R. 

Source: biorxiv 

Paper ID: 564f8823050b52b5f5c36638ac1ae07557963f36

doi: doi.org/10.1101/443457 

pmcid: nan - pubmed_id: nan 

Journal: nan

-linked to-

Microsoft Academic Paper ID: nan 

WHO #Covidence: nan


---