## What do we know about COVID-19 risk factors?
##### COVID-19 Open Research Dataset Challenge (CORD-19)

**Task Details**  
What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?  

**Specifically, we want to know what the literature reports about:**  

* Data on potential risks factors  
* Smoking, pre-existing pulmonary disease  
* Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities  
* Neonates and pregnant women  
* Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.  
* Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors  
* Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups  
* Susceptibility of populations  
* Public health mitigation measures that could be effective for control  

In [10]:
import os 
import pandas as pd
import json
from tqdm import tqdm
import re
import matplotlib.pyplot as plt
import heapq

import re
import nltk

# Get a list of stopwords from nltk
stopwords = nltk.corpus.stopwords.words("english")

In [2]:
dirs = ['biorxiV_medrxiv', 'comm_use_subset', 'custom_license', 'noncomm_use_subset']

docs = []
for d in dirs:
    print(d)
    for file in tqdm(os.listdir(f"{d}/{d}")):
        filepath = f"{d}/{d}/{file}"
        j = json.load(open(filepath,'rb'))
        title = j['metadata']['title']
        try: 
            abstract = j['abstract'][0]['text']
        except:
            abstract = ''
            
        fulltext = ''
        for text in j['body_text']:
            fulltext += text['text'] + "\n\n"
        docs.append([title, abstract, fulltext])

  9%|▉         | 80/885 [00:00<00:01, 796.89it/s]

biorxiV_medrxiv


100%|██████████| 885/885 [00:01<00:00, 869.08it/s]
  1%|          | 60/9118 [00:00<00:15, 597.48it/s]

comm_use_subset


100%|██████████| 9118/9118 [00:15<00:00, 582.29it/s]
  0%|          | 62/16959 [00:00<00:27, 611.40it/s]

custom_license


100%|██████████| 16959/16959 [00:30<00:00, 564.40it/s]
  3%|▎         | 64/2353 [00:00<00:03, 637.02it/s]

noncomm_use_subset


100%|██████████| 2353/2353 [00:03<00:00, 624.61it/s]


In [3]:
df = pd.DataFrame(docs, columns = ['title', 'abstract', 'fulltext'])
df.head()

Unnamed: 0,title,abstract,fulltext
0,Multimerization of HIV-1 integrase hinges on c...,New anti-AIDS treatments must be continually d...,"In the absence of a curative treatment, the hi..."
1,Time-varying transmission dynamics of Novel Co...,Rationale: Several studies have estimated basi...,"Eighteen years ago, severe acute respiratory s..."
2,p53 is not necessary for DUX4 pathology,Summary Statement: DUX4 is thought to mediate ...,Facioscapulohumeral muscular dystrophy (FSHD) ...
3,Virological assessment of hospitalized cases o...,"emerged in late 2019 1,2 . Initial outbreaks i...",Pharyngeal virus shedding was very high during...
4,Potential impact of seasonal forcing on a SARS...,A novel coronavirus (SARS-CoV-2) first detecte...,(2.2 with 90% high density interval 1.4-3.8 (R...


In [180]:
fulltexts = df['fulltext'].values

In [158]:
def clean_text(text):
    # Removing Square Brackets and Extra Spaces
    text = re.sub(r'\[[0-9]*\]',' ', text)
    text = re.sub(r'\s+',' ', text)
 
    text = re.sub(r'\{\{[\s\S]*?\}\}', '', text)

    # Remove doi links
    #text = re.sub(r'^https://$', '',text)
    return text

def clean_spchar_digs(text):
    # Removing special characters and digits
    text = re.sub('[^a-zA-Z]', ' ', text )
    text = re.sub(r'\s+', ' ', text)
    
    return text

def word_freq(formatted_text):
    #creates a dictionary of words as keys and frequency as values
    word_frequencies = {}
    for word in nltk.word_tokenize(formatted_text):
        if word not in stopwords:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1

    maximum_frequency = max(word_frequencies.values())
    #divides the values by the maximum frequency
    for word in word_frequencies.keys():
        word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
    
    return word_frequencies

def sent_scores(sentence_list, word_frequencies):
    #uses the word frequencies to score the sentences by adding up the scores
    #of the words that make up the sentence
    sentence_scores = {}
    for sent in sentence_list:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies.keys():
                if len(sent.split(' ')) <60: #limits sentence to less than 60 words
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word]
                    else:
                        sentence_scores[sent] += word_frequencies[word]
                    
    return sentence_scores

def get_summary(dirty_text):
    text = clean_text(dirty_text)
    formatted_text = clean_spchar_digs(text)

    sentence_list = nltk.sent_tokenize(text)

    word_frequencies = word_freq(formatted_text) 
    sentence_scores = sent_scores(sentence_list,word_frequencies)
    
    
    summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get) #first value is number highest scoring sentences to print
    summary = '\n\n '.join(summary_sentences)
    return summary

In [182]:
def get_summary_improved(searchlist):
    #get summary where all you have to do is provide the words you are searching for in a list
    covid_alias = ['CoV', 'COVID', 'Covid', 'corona virus', 'coronavirus', 'Coronavirus', 'Corona virus'] #depending on here the results w
    desired_sents = {}
    covid_sents = {}
    for text in fulltexts:
        for sentence in text.split('. '):
            for i in searchlist:
                if i.lower() in sentence.lower(): #using .lower changes the results dramatically
                    if sentence not in desired_sents.keys():
                        desired_sents[sentence] = sentence 
            for j in covid_alias:
                if j in sentence:
                    if sentence not in covid_sents.keys():
                        covid_sents[sentence] = sentence
    desired_sents = set(desired_sents.keys())
    covid_sents = set(covid_sents.keys())
    desired_sents = list(desired_sents.intersection(covid_sents))
    
    desired_text = ''
    for x in desired_sents:
        desired_text += ' ' + x
    text = clean_text(desired_text)
    formatted_text = clean_spchar_digs(text)

    sentence_list = nltk.sent_tokenize(text)

    word_frequencies = word_freq(formatted_text) 
    sentence_scores = sent_scores(sentence_list,word_frequencies)
    
    
    summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get) #first value is number highest scoring sentences to print
    summary = '\n\n '.join(summary_sentences)
    return summary

### Risk Factors

In [169]:
# riskfactors = ''
# for text in fulltexts:
#     for sentence in text.split('. '):
#         if ('risk factors' in sentence or 'high risk' in sentence or 'susceptibility' in sentence) and ('CoV' in sentence or 'Covid' in sentence or 'COVID' in sentence or 'corona virus' in sentence):
#             riskfactors += ' ' + sentence
            
# summary_rfactors = get_summary(riskfactors)

slist = ['risk factors', 'high risk', 'at risk', 'susceptib']
summary_rfactors = get_summary_improved(slist))

Our study results indicate that pigs and llamas are susceptible to MERS-CoV infection Primary hamster tracheobronchial epithelial cell cultures support higher virus replication in comparison to similar murine cultures, suggesting an increased susceptibility of hamster cells to SARS-CoV infection As calves are more susceptible to infection, the cows become exposed to virus repeatedly, increasing their antibody titer.

 It is noteworthy that OL cells of human origin and C6 cells of rat origin showed similar susceptibility to SARS-CoV infection Cats recovering from coronavirus infection will shed virus in their faeces and are a potential risk to other susceptible cats Positive binding was detected for both antigens in alpaca and dromedary (data not shown).

 Women were more susceptible to SARS (M: F = 0.52: 1), unlike the COVID-19 outbreak in Wuhan 14 Cats recovering from coronavirus infection will shed virus in their faeces and potentially put other susceptible cats at risk In contrast, 

### Pregnancy

In [133]:
# preg  = ''
# for text in fulltexts:
#     for sentence in text.split('. '):
#         if ('pregnant women' in sentence) and ('CoV' in sentence or 'Covid' in sentence or 'COVID' in sentence or 'corona virus' in sentence or 'coronavirus' in sentence):
#             preg  += ' ' + sentence
            
# summary_preg = get_summary(preg)
# print(summary_preg)

In [166]:
slist = ['pregnant women']
summary_preg = get_summary_improved(slist))

We report clinical data from nine pregnant women with laboratory-confirmed COVID-19 pneumonia The largest case series of pregnant women who had SARS comes from Wong and coworkers in China Other medications, such as interferons, have been proposed for use in future SARS outbreaks, but use of these medications in pregnant women may also be of concern.

 The clinical outcomes among pregnant women with SARS in Hong Kong were worse than those occurring in infected women who were not pregnant Laboratory investigations on admission found lower counts of WBC, neutrophils, CRP, and ALT in pregnant women, compared to the non-pneumonia controls.

 Twelve pregnant women were known to be infected with SARS in Hong Kong, and most information regarding the course of SARS in pregnancy comes from this cohort To date, none of previous studies have compared maternal and neonatal outcomes of pregnant women with COVID-19 pneumonia to those without pneumonia, to investigate the adverse effects of COVID-19 i

### Mitigation

In [69]:
# miti  = ''
# for text in fulltexts:
#     for sentence in text.split('. '):
#         if ('mitigation measure' in sentence) and ('CoV' in sentence or 'Covid' in sentence or 'COVID' in sentence or 'corona virus' in sentence):
#             miti  += ' ' + sentence
            
# summary_miti = get_summary(miti)

### Smoking 

In [188]:
smoke  = ''

for text in fulltexts:
    for sentence in text.split('. '):
        if ('smoking' in sentence or 'pulmonary' in sentence) and ('CoV' in sentence or 'Covid' in sentence or 'COVID' in sentence or 'corona virus' in sentence):
            smoke  += ' ' + sentence
                       

In [189]:
summary_smoke = get_summary(smoke)

### Comorbidities 

In [190]:
comor  = ''
for text in fulltexts:
    for sentence in text.split('. '):
        if ('morbidities' in sentence or 'coinfection' in sentence or 'underlying' in sentence) and ('CoV' in sentence or 'Covid' in sentence or 'COVID' in sentence or 'corona virus' in sentence):
            comor  += ' ' + sentence

In [191]:
summary_comor = get_summary(comor)

In [178]:
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

print(color.BOLD + 'Risk Factor Summary' + color.END)
print()
print(summary_rfactors)
print()
print(color.BOLD + 'Pregnancy' + color.END)
print()
print(summary_preg)
print()
print(color.BOLD + 'Mitigation' + color.END)
print()
print(summary_miti)
print()
print(color.BOLD + 'Smoking' + color.END)
print()
print(summary_smoke)
print()
print(color.BOLD + 'Comorbidities' + color.END)
print()
print(summary_comor)
print()


[1mRisk Factor Summary[0m

Public bulk RNA-seq dataset analysis NA-seq profile data of 13 organs including 695 para-carcinoma normal tissues as control from public TCGA were obtained for our analysis, and Fig These findings indicated that oral cavity could be regarded as potentially high risk for 2019-nCov infectious susceptibility.

 Person who had a close (within 1 m) but short (< 15 min) contact with a confirmed case, or a distant (> 1 m) but prolonged contact in public settings, or any contact in private settings that does not match with the moderate/high risk of exposure criteria.

 Contacts are asked to measure their body temperature twice a day and check for clinical symptoms In case of occurrence of symptoms like fever, cough or dyspnoea, contacts are asked to wear a surgical mask, isolate themselves and immediately contact the emergency hotline (SAMU-centre 15) indicating that they are contacts of a confirmed COVID-19 case.

 Fever, cough, and dyspnoea were the most common s