# COVID-19 Open Research Dataset Challenge (CORD-19)
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House

### Task: What do we know about COVID-19 risk factors?

---

### Information about the data

(1) Metadata for papers from these sources is combined: CZI, PMC, bioRxiv/medRxiv (29,500 records in total).
	- CZI 1236 records
	- PMC 27337
	- bioRxiv 566
	- medRxiv 361
(2) 17K of the paper records have PDFs, and the hash of each PDF is stored in 'sha'.<br>
(3) For PMC-sourced papers, one paper's metadata can be associated with one or more PDFs/shas: a PDF/sha corresponding to the main article, and possibly additional PDFs/shas corresponding to supporting materials for the article.<br>
(4) 13K of the PDFs were processed into full text ('has_full_text'=True).<br>
(5) Various 'keys' are populated with the metadata:
	- 'pmcid': populated for all PMC paper records (27337 non null)
	- 'doi': populated for all BioRxiv/MedRxiv paper records and most of the other records (26357 non null)
	- 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non null)
	- 'pubmed_id': populated for some of the records
	- 'Microsoft Academic Paper ID': populated for some of the records
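
The counts listed above can be checked directly against the metadata frame. The following is a minimal sketch, not the notebook's own code: `summarize_metadata` is a hypothetical helper, the `toy` frame is made-up stand-in data so the snippet runs without the CSV, and the use of `;` as the separator between multiple hashes in 'sha' is an assumption to verify against the real file.

```python
import pandas as pd

def summarize_metadata(df):
    """Return (records per source, non-null count per identifier column)."""
    id_cols = ["sha", "doi", "pmcid", "pubmed_id",
               "Microsoft Academic Paper ID", "WHO #Covidence"]
    # Only look at identifier columns that actually exist in this frame
    present = [c for c in id_cols if c in df.columns]
    per_source = df["source_x"].value_counts()
    non_null = df[present].notna().sum()
    return per_source, non_null

# Hypothetical stand-in rows; replace `toy` with the real `data` frame
toy = pd.DataFrame({
    "source_x": ["CZI", "PMC", "PMC", "biorxiv"],
    "sha": ["a1", "b2; c3", None, "d4"],
    "doi": ["10.1/x", None, "10.1/y", "10.1/z"],
    "pmcid": [None, "PMC1", "PMC2", None],
    "WHO #Covidence": ["#1", None, None, None],
})

per_source, non_null = summarize_metadata(toy)
print(per_source)
print(non_null)

# Number of PDFs/shas per paper, ASSUMING ';' separates multiple hashes
sha_counts = toy["sha"].dropna().str.split(";").str.len()
print(sha_counts)
```

Running `summarize_metadata(data)` on the full frame should reproduce the per-source totals and the non-null counts quoted in items (1) and (5), if the file matches the description.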

**Chan Zuckerberg Initiative (CZI)**<br>
**PubMed Central (PMC)** is a free digital repository that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.<br>
**bioRxiv** (pronounced "bio-archive") is an open access preprint repository for the biological sciences.<br>
**medRxiv** (pronounced "med-archive") is a preprint service for the medical and health sciences that provides a free online platform for researchers to share, comment on, and receive feedback on their work. Information among scientists spreads slowly and often incompletely.

---

In [6]:
import pandas as pd

In [7]:
data = pd.read_csv("2020-03-13/all_sources_metadata_2020-03-13.csv")
data.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,CZI,Angiotensin-converting enzyme 2 (ACE2) as a SA...,10.1007/s00134-020-05985-9,,32125455.0,cc-by-nc,,2020,"Zhang, Haibo; Penninger, Josef M.; Li, Yimin; ...",Intensive Care Med,2002765000.0,#3252,True
1,53eccda7977a31e3d0f565c884da036b1e85438e,CZI,Comparative genetic analysis of the novel coro...,10.1038/s41421-020-0147-1,,,cc-by,,2020,"Cao, Yanan; Li, Lin; Feng, Zhimin; Wan, Shengq...",Cell Discovery,3003431000.0,#1861,True
2,210a892deb1c61577f6fba58505fd65356ce6636,CZI,Incubation Period and Other Epidemiological Ch...,10.3390/jcm9020538,,,cc-by,The geographic spread of 2019 novel coronaviru...,2020,"Linton, M. Natalie; Kobayashi, Tetsuro; Yang, ...",Journal of Clinical Medicine,3006065000.0,#1043,True
3,e3b40cc8e0e137c416b4a2273a4dca94ae8178cc,CZI,Characteristics of and Public Health Responses...,10.3390/jcm9020575,,32093211.0,cc-by,"In December 2019, cases of unidentified pneumo...",2020,"Deng, Sheng-Qun; Peng, Hong-Juan",J Clin Med,177663100.0,#1999,True
4,92c2c9839304b4f2bc1276d41b1aa885d8b364fd,CZI,Imaging changes in severe COVID-19 pneumonia,10.1007/s00134-020-05976-w,,32125453.0,cc-by-nc,,2020,"Zhang, Wei",Intensive Care Med,3006643000.0,#3242,False


In [9]:
data.shape

(29500, 14)

In [11]:
"""
# gathering only the non-numerical type
cat_col = [cat for cat in data.dtypes.index if data.dtypes[cat]=='object']

# printing the frequencies for each category
for col in cat_col:
    print('\nFrequency of categories within {}'.format(col))
    print(data[col].value_counts())
"""

"\n# gathering only the non-numerical type\ncat_col = [cat for cat in data.dtypes.index if data.dtypes[cat]=='object']\n\n# printing the frequencies for each category\nfor col in cat_col:\n    print('\nFrequency of categories within {}'.format(col))\n    print(data[col].value_counts())\n"

## Load the dataset

In [15]:
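'''
Load the dataset from the CSV and keep only the titles in 'data_text'
'''
import pandas as pd
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False,
# which was removed in pandas 2.0
data = pd.read_csv('2020-03-13/all_sources_metadata_2020-03-13.csv', on_bad_lines='skip')
# We only need the title column from the data; copy() avoids a
# SettingWithCopyWarning when the 'index' column is added below
data_text = data[['title']].copy()
data_text['index'] = data_text.index

documents = data_text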

In [16]:
'''
Get the total number of documents
'''
print(len(documents))

29500


## Data Preprocessing
Tokenization (split the text into lowercase words) / words of 3 or fewer characters removed / stopwords removed / lemmatized (third-person and past-tense verbs changed to present tense) / stemmed (words reduced to their root form).

In [17]:
'''
Loading Gensim and nltk libraries
'''
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [18]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mikehatchi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [21]:
'''
Functions that perform the preprocessing steps on the entire dataset
'''
# Create the stemmer and lemmatizer once instead of on every call
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, filter, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # Drop stopwords and tokens of 3 or fewer characters
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [23]:
'''
Preview a document after preprocessing
'''
document_num = 0
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
# Split on single spaces only, so punctuation stays attached to the words
print(doc_sample.split(' '))
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['Angiotensin-converting', 'enzyme', '2', '(ACE2)', 'as', 'a', 'SARS-CoV-2', 'receptor:', 'molecular', 'mechanisms', 'and', 'potential', 'therapeutic', 'target']


Tokenized and lemmatized document: 
['angiotensin', 'convert', 'enzym', 'sar', 'receptor', 'molecular', 'mechan', 'potenti', 'therapeut', 'target']


In [25]:
documents = documents.dropna(subset=['title'])

In [26]:
processed_docs = documents['title'].map(preprocess)

In [29]:
processed_docs[:30]

0     [angiotensin, convert, enzym, sar, receptor, m...
1     [compar, genet, analysi, novel, coronavirus, n...
2     [incub, period, epidemiolog, characterist, nov...
3     [characterist, public, health, respons, corona...
4                [imag, chang, sever, covid, pneumonia]
5     [updat, estim, risk, transmiss, novel, coronav...
6     [real, time, forecast, ncov, epidem, china, fe...
7     [retract, chines, medic, staff, request, inter...
8     [covid, outbreak, diamond, princess, cruis, sh...
9     [distinct, role, sialosid, protein, receptor, ...
10    [month, coronavirus, diseas, covid, epidem, ch...
11    [effect, airport, screen, detect, travel, infe...
12    [genom, detect, coronavirus, type, tool, rapid...
13    [case, index, patient, caus, tertiari, transmi...
14    [emerg, novel, coronavirus, ncov, need, rapid,...
15               [coronavirus, ncov, epidem, hindsight]
16    [nonstructur, protein, like, associ, evolut, n...
17    [pathogen, ncov, quick, overview, comparis

## Bag of words on the dataset

In [31]:
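# Map every distinct token in the processed titles to an integer id
dictionary = gensim.corpora.Dictionary(processed_docs)
# Preview the first 31 (id, token) pairs; items() is the Python 3 idiom
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 30:
        break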

0 angiotensin
1 convert
2 enzym
3 mechan
4 molecular
5 potenti
6 receptor
7 sar
8 target
9 therapeut
10 analysi
11 compar
12 coronavirus
13 differ
14 genet
15 ncov
16 novel
17 popul
18 avail
19 case
20 characterist
21 data
22 epidemiolog
23 incub
24 infect
25 period
26 public
27 right
28 statist
29 truncat
30 china
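
To make the id listing above concrete, here is a small plain-Python sketch of the bag-of-words idea: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs, which is the shape of output that gensim's `Dictionary.doc2bow` produces. This is an illustration only, not gensim's implementation; `build_vocab` and `to_bow` are hypothetical helper names, the first two documents are taken from the processed titles shown earlier, and the third is made up to show a repeated token.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, document by document."""
    token2id = {}
    for doc in docs:
        # New tokens within a document are added in alphabetical order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

docs = [
    ["imag", "chang", "sever", "covid", "pneumonia"],   # processed title 4
    ["coronavirus", "ncov", "epidem", "hindsight"],     # processed title 15
    ["covid", "epidem", "covid"],                       # made-up example
]

token2id = build_vocab(docs)
bow = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bow[2])  # [(1, 2), (6, 1)]: 'covid' twice, 'epidem' once
```

With gensim itself, the equivalent step would be `dictionary.doc2bow(doc)` applied to each entry of `processed_docs`.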
