# What do we know about COVID-19 risk factors?
#### This notebook tries to answer the question with the help of Word2Vec embeddings.<br> The notebook has the following structure:
1. Analysing and filtering the metadata
2. Generating a dataframe with the articles
3. Preprocessing the text
4. Word2Vec

In [1]:
import pandas as pd
import numpy as np
import json
import glob

## 1. Analysing and filtering the metadata
Load the metadata.csv file into a pandas dataframe.

In [2]:
metadata_path = './data/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'WHO #Covidence': str,
    'sha': str
})
meta_df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,8q5ondtn,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,"Overall, James C.",American Heart Journal,,,False,False,custom_license,https://doi.org/10.1016/0002-8703(72)90077-4
1,pzfd0e50,,Elsevier,Coronaviruses in Balkan nephritis,10.1016/0002-8703(80)90355-5,,6243850.0,els-covid,,1980-03-31,"Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;...",American Heart Journal,,,False,False,custom_license,https://doi.org/10.1016/0002-8703(80)90355-5
2,22bka3gi,,Elsevier,Cigarette smoking and coronary heart disease: ...,10.1016/0002-8703(80)90356-7,,7355701.0,els-covid,,1980-03-31,"Friedman, Gary D",American Heart Journal,,,False,False,custom_license,https://doi.org/10.1016/0002-8703(80)90356-7
3,zp9k1k3z,aecbc613ebdab36753235197ffb4f35734b5ca63,Elsevier,Clinical and immunologic studies in identical ...,10.1016/0002-9343(73)90176-9,,4579077.0,els-covid,"Abstract Middle-aged female identical twins, o...",1973-08-31,"Brunner, Carolyn M.; Horwitz, David A.; Shann,...",The American Journal of Medicine,,,True,False,custom_license,https://doi.org/10.1016/0002-9343(73)90176-9
4,cjuzul89,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285.0,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,"Garibaldi, Richard A.",The American Journal of Medicine,,,False,False,custom_license,https://doi.org/10.1016/0002-9343(85)90361-4


The next step is to filter the entries of the metadata file.
First, we keep only the entries for which the full text is available. This is done by removing every row
that has '*False*' in both **'has_pdf_parse'** and **'has_pmc_xml_parse'**.

After that, all papers published before **2019-08-01** are removed, since they will almost certainly not contain anything about COVID-19.

As a last step, we drop every row which has a NaN in the column **'sha'**.
This filtering reduces the number of rows by more than a factor of 10 (from 47298 to 3990).

In [3]:
print('Number of rows before filtering:',meta_df.shape[0])

# remove rows which do not have the full text
meta_df = meta_df.loc[meta_df['has_pdf_parse'] | meta_df['has_pmc_xml_parse'], :]

# remove papers which were published before 2019-08-01
meta_df.drop(meta_df[pd.to_datetime(meta_df['publish_time']) < '2019-08-01'].index,inplace=True)

# remove entries which do not have a sha
meta_df.dropna(subset=['sha'],inplace=True)
print('Number of rows after filtering:',meta_df.shape[0])

Number of rows before filtering: 47298
Number of rows after filtering: 3990


#### Get the path of all JSON files.

In [4]:
all_json = glob.glob('./data/**/*.json', recursive=True)
print('Number of JSON files found:',len(all_json))

Number of JSON files found: 52097


#### A helper class to load the JSON files for later use.

In [5]:
class JSONFileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.body_text = []
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.body_text = '\n'.join(self.body_text)
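To make the expected file layout concrete, here is a small self-contained sketch that writes a toy JSON file with the two fields the class reads (`paper_id` and a `body_text` list of objects with a `text` key) and loads it in the same way. The function name `read_paper` and the sample content are made up for illustration; the real files contain more fields than shown here.

```python
import json
import os
import tempfile


def read_paper(file_path):
    """Return (paper_id, body_text) using the same fields as JSONFileReader."""
    with open(file_path, encoding='utf-8') as file:
        content = json.load(file)
    # join all body paragraphs into one newline-separated string
    body_text = '\n'.join(entry['text'] for entry in content['body_text'])
    return content['paper_id'], body_text


# toy document (hypothetical content, not taken from the dataset)
sample = {
    'paper_id': 'abc123',
    'body_text': [{'text': 'First paragraph.'}, {'text': 'Second paragraph.'}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.json')
    with open(sample_path, 'w', encoding='utf-8') as file:
        json.dump(sample, file)
    paper_id, body_text = read_paper(sample_path)

print(paper_id)
print(body_text)
```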

## 2. Generate dataframe with articles

In [6]:
dict_ = {'paper_id': [], 'title': [], 'authors': [],
         'publish_time': [], 'journal': [], 'body_text': [], 'abstract': []}
skipped = 0
for idx, entry in enumerate(all_json):
    if idx % max(1, len(all_json) // 10) == 0:  # max() avoids a modulo by zero for fewer than 10 files
        print(f'Processing index: {idx} of {len(all_json)}')
    content = JSONFileReader(entry)
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        skipped +=1
        continue
    
    dict_['paper_id'].append(content.paper_id)
    dict_['title'].append(meta_data['title'].values[0])
    dict_['authors'].append(meta_data['authors'].values[0])
    dict_['publish_time'].append(meta_data['publish_time'].values[0])
    dict_['journal'].append(meta_data['journal'].values[0])
    dict_['body_text'].append(content.body_text)
    dict_['abstract'].append(meta_data['abstract'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'title', 'authors', 'publish_time',
                                        'journal', 'body_text', 'abstract'])
print('\n'+ 'Number of papers skipped:',skipped)
df_covid.head()

Processing index: 0 of 52097
Processing index: 5209 of 52097
Processing index: 10418 of 52097
Processing index: 15627 of 52097
Processing index: 20836 of 52097
Processing index: 26045 of 52097
Processing index: 31254 of 52097
Processing index: 36463 of 52097
Processing index: 41672 of 52097
Processing index: 46881 of 52097
Processing index: 52090 of 52097

Number of papers skipped: 48275


Unnamed: 0,paper_id,title,authors,publish_time,journal,body_text,abstract
0,c96cd1e79d3f1ea12887e2b4d9b3347b9ee07137,Effects of ubiquitin-proteasome inhibitor on t...,"Zhang, Hui; Yu, Jingbin; Sun, Hu; Zhao, Yunhe;...",2019-08-14,Exp Ther Med,"Viral myocarditis (VMC), as one of the most co...",Effects of ubiquitin-proteasome system (UPS) i...
1,4bbb0c59babc718f67953fae032dad6ae0d7aeb1,Genome Detective Coronavirus Typing Tool for r...,"Cleemput, S.; Dumon, W.; Fonseca, V.; Karim, W...",2020,"Bioinformatics (Oxford, England)",We are currently faced with a potential global...,"SUMMARY: Genome Detective is a web-based, user..."
2,c439e1028bc93d94f4eba1855c6b1eade676f732,Acute Respiratory Distress Syndrome as an Orga...,"Chang, Jae C.",2019-11-28,Clin Appl Thromb Hemost,Acute respiratory distress syndrome (ARDS) is ...,Acute respiratory distress syndrome (ARDS) is ...
3,58be092086c74c58e9067121a6ba4836468e7ec3,The Author's Response: Case of the Index Patie...,"Lim, Jaegyun; Jeon, Seunghyun; Shin, Hyun Youn...",2020,J Korean Med Sci,"administration reduces viral load. Also, the a...",
4,6bd9a619434c6c57efe77c253b97c31d98369b02,Mycoplasma pneumoniaeassociated transverse mye...,"Salloum, Shafee; Goenka, Ajay; Ey, Elizabeth",2019-09-12,Clin Pract,"and brachioradialis, but normal on the left si...",Acute transverse myelitis is a rare spinal cor...
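The loop above scans the whole metadata frame once per JSON file. One alternative is to build a dictionary from sha to metadata row up front, so that each lookup becomes a single dictionary access. The sketch below uses plain dictionaries in place of DataFrame rows; `build_sha_index` is a hypothetical helper, and the handling of several shas stored in one field and separated by `;` is an assumption about the metadata file that should be checked against the actual data.

```python
def build_sha_index(rows):
    """Map every sha to its metadata row.

    Assumption: a 'sha' field may hold several hashes separated by ';'.
    """
    index = {}
    for row in rows:
        sha_field = row.get('sha')
        if not sha_field:
            continue  # rows without a sha cannot be matched to a file
        for sha in sha_field.split(';'):
            sha = sha.strip()
            if sha:
                index.setdefault(sha, row)  # keep the first row seen for each sha
    return index


# toy metadata rows (hypothetical values)
rows = [
    {'sha': 'aaa111', 'title': 'Paper A'},
    {'sha': 'bbb222; ccc333', 'title': 'Paper B'},
    {'sha': None, 'title': 'Paper without full text'},
]
sha_index = build_sha_index(rows)

# look up the paper_id values that would be read from the JSON files
paper_ids = ['ccc333', 'zzz999']
matched = [sha_index[p]['title'] for p in paper_ids if p in sha_index]
skipped = sum(1 for p in paper_ids if p not in sha_index)
print(matched, skipped)
```

With the filtered `meta_df` (rows with a NaN sha already dropped), `meta_df.to_dict('records')` would yield rows in this shape.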


#### Remove duplicates and NaNs

In [7]:
print('Before removing duplicates:',df_covid.shape[0])
df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True)
df_covid.dropna(subset=['body_text','abstract','publish_time'],inplace=True)
print('After removing duplicates:',df_covid.shape[0])

Before removing duplicates: 3822
After removing duplicates: 3213


#### Save the dataframe

In [8]:
df_covid.to_csv('./data/covid.csv',index=False)

## 3. Preprocess text

In [9]:
df_covid = pd.read_csv('./data/covid.csv')

#### Helper function for preprocessing the text
The following steps are performed:
- convert to lower case
- tokenize
- remove stop words and punctuation
- lemmatize

In [10]:
import nltk
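nltk.download('wordnet')
# word_tokenize and stopwords also need the 'punkt' and 'stopwords' resources,
# which are assumed to be installed already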
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) 

def preprocess(text):
    # convert to lower case
    lower = text.lower()
    
    # tokenize text
    word_tokens = word_tokenize(lower) 
    
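    # remove stopwords and punctuation
    text = [w for w in word_tokens if w not in stop_words and w not in string.punctuation]
    text = ' '.join(text)
    # NOTE: lemmatize() expects a single word; applied to the whole joined string
    # it leaves the text effectively unchanged (e.g. plurals are kept)
    return WordNetLemmatizer().lemmatize(text)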

[nltk_data] Downloading package wordnet to /home/ander/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
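Note that `WordNetLemmatizer.lemmatize` works on one word at a time, so lemmatizing has to happen per token, before the tokens are joined back into a string. A minimal sketch of that pattern follows; `lemmatize_tokens` and the toy lookup table are hypothetical stand-ins so that the example runs without any NLTK data.

```python
def lemmatize_tokens(tokens, lemmatize_word):
    """Apply a single-word lemmatizer to each token and join the result."""
    return ' '.join(lemmatize_word(token) for token in tokens)


# toy single-word lemmatizer (hypothetical lookup table standing in for WordNet)
toy_lemmas = {'organisms': 'organism', 'peptides': 'peptide'}


def toy_lemmatize(word):
    return toy_lemmas.get(word, word)


tokens = ['found', 'organisms', 'antimicrobial', 'peptides']
cleaned = lemmatize_tokens(tokens, toy_lemmatize)
print(cleaned)
```

With NLTK available, `WordNetLemmatizer().lemmatize` can be passed as `lemmatize_word` instead of the toy function.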


#### Apply the preprocessing step to the columns 'body_text' and 'abstract'.

In [11]:
df_covid['body_text_cleaned']=df_covid['body_text'].apply(preprocess)
df_covid['abstract_cleaned']=df_covid['abstract'].apply(preprocess)

#### As a next step we only want to keep articles which contain either the word covid or corona

In [12]:
def search(text):
    if text.find('covid') == -1 and text.find('corona') == -1:
        return False
    return True

print('# of rows before cleaning:',df_covid.shape[0])

df_covid['found_covid'] = df_covid['body_text_cleaned'].apply(search)
#df_covid = df_covid[df_covid['found_covid'] == True]
df_covid.drop(df_covid[df_covid['found_covid'] == False].index, inplace=True)
df_covid.drop(['found_covid'], axis=1, inplace=True)

print('# of rows after cleaning:',df_covid.shape[0])

# of rows before cleaning: 3213
# of rows after cleaning: 2408
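A plain substring test for 'corona' also matches words such as 'coronary', which have nothing to do with coronaviruses. If that is a concern, a regular expression can be stricter. The sketch below is one possible variant, not what this notebook uses; the function name `mentions_covid` and the exact pattern are assumptions for illustration.

```python
import re

# match 'covid' anywhere, 'corona' as a whole word, or any word containing
# 'coronavir' (e.g. 'coronavirus', 'gammacoronavirus'); 'coronary' is not matched
COVID_PATTERN = re.compile(r'covid|\bcorona\b|coronavir')


def mentions_covid(text):
    """Return True if the lower-cased text mentions covid or a coronavirus."""
    return COVID_PATTERN.search(text.lower()) is not None


examples = [
    'aim diagnose covid-19 earlier',
    'ibv belongs genus gammacoronavirus',
    'cigarette smoking coronary heart disease',
]
results = [mentions_covid(t) for t in examples]
print(results)
```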


#### Sort rows by date
The article in the first row has publish_time = 2019-08-01, which is exactly the cutoff date of the filter. The unfiltered data therefore almost certainly contains older articles that mention corona or covid as well.

In [13]:
df_covid['publish_time'] = pd.to_datetime(df_covid.publish_time)
df_covid.sort_values(by=['publish_time'],inplace=True)
df_covid.head()

Unnamed: 0,paper_id,title,authors,publish_time,journal,body_text,abstract,body_text_cleaned,abstract_cleaned
2808,7203ff6238ce039aad6fbcbceab04051a59ae0e2,Human Antimicrobial Peptides as Therapeutics f...,"Ahmed, Aslaa; Siman-Tov, Gavriella; Hall, Gran...",2019-08-01,Viruses,"Found in virtually all organisms, antimicrobia...",Successful in vivo infection following pathoge...,found virtually organisms antimicrobial peptid...,successful vivo infection following pathogen e...
309,91f08a1a02ed51a781b1265d345b60cc63ff32fe,Toward a quantification of risks at the nexus ...,"Pruvot, Mathieu; Khammavong, Kongsy; Milavong,...",2019-08-01,Science of The Total Environment,• Bushmeat trade in Lao PDR is considerable an...,Abstract Trade of bushmeat and other wildlife ...,• bushmeat trade lao pdr considerable likely e...,abstract trade bushmeat wildlife human consump...
1302,a8ff3b5e3a206c9d3d567b87b63099413c363b55,Canine babesiosis among working dogs of organi...,"Mittal, Mitesh; Kundu, Krishnendu; Chakravarti...",2019-08-01,Preventive Veterinary Medicine,Canine babesiosis (or piroplasmosis) is a sign...,Abstract Canine babesiosis is a serious diseas...,canine babesiosis piroplasmosis significant po...,abstract canine babesiosis serious disease amo...
2471,045b111f0f2584890e9271399aa93c917a496662,Etiology and Risk Factors for Mortality in an ...,"Aston, Stephen J.; Ho, Antonia; Jary, Hannah; ...",2019-08-01,Am J Respir Crit Care Med,"Globally, pneumonia is the commonest infectiou...",Rationale: In the context of rapid antiretrovi...,globally pneumonia commonest infectious cause ...,rationale context rapid antiretroviral therapy...
922,31047a34b6e7cfe9ebd619ed6f518b5e983fd650,Emergence and genetic analysis of variant path...,"Rohaim, Mohammed A.; El Naggar, Rania F.; Hamo...",2019-08-01,Virus Genes,. IBV belongs to genus Gammacoronavirus within...,Infectious bronchitis virus (IBV) affects both...,ibv belongs genus gammacoronavirus within coro...,infectious bronchitis virus ibv affects vaccin...


#### Save just the text in a dataframe
Concatenate the body_text and abstract into one column

In [14]:
# join with a space so the last body word and the first abstract word do not merge
df_covid['text']=df_covid['body_text_cleaned']+' '+df_covid['abstract_cleaned']
df_text = df_covid['text']
print(df_text)
df_text.to_csv('./data/text.csv',index=False)

2808    found virtually organisms antimicrobial peptid...
309     • bushmeat trade lao pdr considerable likely e...
1302    canine babesiosis piroplasmosis significant po...
2471    globally pneumonia commonest infectious cause ...
922     ibv belongs genus gammacoronavirus within coro...
                              ...                        
833     zoonotic diseases serious threats health varia...
816     digestive respiratory tracts continuously expo...
778     medical interventions dramatically increased l...
350     aim diagnose covid-19 earlier improve treatmen...
1348    emerging infectious disease defined infectious...
Name: text, Length: 2408, dtype: object


## 4. Word2Vec

In [15]:
df = pd.read_csv('./data/text.csv')
df.head()

Unnamed: 0,text
0,found virtually organisms antimicrobial peptid...
1,• bushmeat trade lao pdr considerable likely e...
2,canine babesiosis piroplasmosis significant po...
3,globally pneumonia commonest infectious cause ...
4,ibv belongs genus gammacoronavirus within coro...


In [16]:
from nltk.stem import SnowballStemmer
from nltk import sent_tokenize, word_tokenize
import string
translator = str.maketrans('','',string.punctuation)
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def get_sentences(doc):
    sentences = []
    
    for raw in sent_tokenize(doc):
        raw2 = [i for i in raw.translate(translator).lower().split() if i not in stop and len(i) < 10]
        raw3 = [stemmer.stem(t) for t in raw2]
        sentences.append(raw3)
    return sentences
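One consequence of the `len(i) < 10` condition is worth keeping in mind: every token with ten or more characters is discarded before stemming, and that includes 'coronavirus' (11 characters). The sketch below reproduces only the punctuation, stop word and length filtering on a toy sentence; the stop word set is a small hypothetical subset, and stemming is left out so that the example runs without NLTK.

```python
import string

translator = str.maketrans('', '', string.punctuation)
toy_stop = {'the', 'is', 'a', 'of'}  # hypothetical subset of the English stop words


def filter_tokens(sentence, max_len=10):
    """Strip punctuation, lower-case, drop stop words and tokens of max_len or more characters."""
    words = sentence.translate(translator).lower().split()
    return [w for w in words if w not in toy_stop and len(w) < max_len]


kept = filter_tokens('The coronavirus SARS-CoV-2 is a pathogen of bats.')
print(kept)
```

Removing punctuation also fuses hyphenated terms into one token ('SARS-CoV-2' becomes 'sarscov2'), which is presumably why fused stems show up in the vocabulary of the trained model.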

In [17]:
###
# Word2Vec in gensim
###

# word2vec requires sentences as input
sentences = []
for doc in df['text']:
    sentences += get_sentences(doc)
from random import shuffle
shuffle(sentences) # stream in sentences in random order

# train the model
from gensim.models import Word2Vec
# printing the full sentence list is avoided here: it exceeds the notebook's IOPub data rate limit
w2v = Word2Vec(sentences,  # list of tokenized sentences
               workers = 8, # Number of threads to run in parallel
               size=300,  # Word vector dimensionality (gensim < 4.0; renamed vector_size in 4.0)
               min_count =  20, # Minimum word count  
               window = 5, # Context window size      
               sample = 1e-3, # Downsample setting for frequent words
               )

# done training: L2-normalize the vectors in place and drop the raw ones to save memory
# (the model cannot be trained further afterwards; init_sims is deprecated in gensim >= 4.0)
w2v.init_sims(replace=True)

w2v.save('./data/w2v-vectors.pkl')



#### List words which are similar to 'virus'

In [18]:
w2v.wv.most_similar('virus')

[('viral', 0.5912830829620361),
 ('avian', 0.5117582082748413),
 ('cov', 0.46821874380111694),
 ('pathogen', 0.46049946546554565),
 ('measl', 0.45975935459136963),
 ('norovirus', 0.45921483635902405),
 ('iav', 0.45544669032096863),
 ('hadv', 0.4476735591888428),
 ('host', 0.42003196477890015),
 ('mump', 0.41783708333969116)]
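The scores returned by `most_similar` are cosine similarities between word vectors. As a reminder of what that number means, here is a small pure-Python sketch; the three-dimensional vectors are made up for illustration and have nothing to do with the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# toy vectors (hypothetical values)
vec_a = [1.0, 0.0, 0.0]
vec_b = [1.0, 1.0, 0.0]
vec_c = [0.0, 0.0, 2.0]

sim_ab = cosine_similarity(vec_a, vec_b)  # vectors at a 45 degree angle
sim_ac = cosine_similarity(vec_a, vec_c)  # orthogonal vectors
print(round(sim_ab, 4), sim_ac)
```

A value close to 1 means the two words appear in very similar contexts, while a value near 0 means their contexts are largely unrelated.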

#### List words which are similar to 'risk'
The stems 'elder' and 'worker' appear in the list, which means they occur in contexts similar to those of 'risk' in this corpus.
This makes sense since COVID-19 is more dangerous for older people.

In [19]:
w2v.wv.most_similar('risk')

[('burden', 0.6000455021858215),
 ('chanc', 0.589929461479187),
 ('incid', 0.5848392248153687),
 ('highrisk', 0.5432121157646179),
 ('awar', 0.5369606614112854),
 ('elder', 0.5215969085693359),
 ('worker', 0.4926151633262634),
 ('prioriti', 0.4844847321510315),
 ('hazard', 0.4776591360569),
 ('smoke', 0.4734295904636383)]

#### Words similar to 'bat'
The word 'pangolin' is the nearest neighbour of 'bat', with a cosine similarity of about 0.77.

In [20]:
w2v.wv.most_similar('bat')

[('pangolin', 0.7696573734283447),
 ('sarslik', 0.7137672901153564),
 ('sarsrcov', 0.6897789239883423),
 ('affini', 0.6727837324142456),
 ('zxc21', 0.6595544815063477),
 ('malayan', 0.650221586227417),
 ('ratg13', 0.646531343460083),
 ('mammal', 0.6325733661651611),
 ('hedgehog', 0.6316180229187012),
 ('civet', 0.6265983581542969)]