#### Task Details

What do we know about vaccines and therapeutics? <br> 
What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?

In [1]:
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# NLP
import spacy
import scispacy
from scispacy.umls_linking import UmlsEntityLinker
from scispacy.abbreviation import AbbreviationDetector 
from negspacy.negation import Negex

In [2]:
meta_data = pd.read_csv('../data/raw/all_sources_metadata_2020-03-13.csv')

In [3]:
meta_data.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,CZI,Angiotensin-converting enzyme 2 (ACE2) as a SA...,10.1007/s00134-020-05985-9,,32125455.0,cc-by-nc,,2020,"Zhang, Haibo; Penninger, Josef M.; Li, Yimin; ...",Intensive Care Med,2002765000.0,#3252,True
1,53eccda7977a31e3d0f565c884da036b1e85438e,CZI,Comparative genetic analysis of the novel coro...,10.1038/s41421-020-0147-1,,,cc-by,,2020,"Cao, Yanan; Li, Lin; Feng, Zhimin; Wan, Shengq...",Cell Discovery,3003431000.0,#1861,True
2,210a892deb1c61577f6fba58505fd65356ce6636,CZI,Incubation Period and Other Epidemiological Ch...,10.3390/jcm9020538,,,cc-by,The geographic spread of 2019 novel coronaviru...,2020,"Linton, M. Natalie; Kobayashi, Tetsuro; Yang, ...",Journal of Clinical Medicine,3006065000.0,#1043,True
3,e3b40cc8e0e137c416b4a2273a4dca94ae8178cc,CZI,Characteristics of and Public Health Responses...,10.3390/jcm9020575,,32093211.0,cc-by,"In December 2019, cases of unidentified pneumo...",2020,"Deng, Sheng-Qun; Peng, Hong-Juan",J Clin Med,177663100.0,#1999,True
4,92c2c9839304b4f2bc1276d41b1aa885d8b364fd,CZI,Imaging changes in severe COVID-19 pneumonia,10.1007/s00134-020-05976-w,,32125453.0,cc-by-nc,,2020,"Zhang, Wei",Intensive Care Med,3006643000.0,#3242,False


Look at the abstracts and find the ones that contain lemmas of "vaccine" or "therapy".<br>
Assumption: if a paper discusses our topic, these words appear in its abstract.

Preprocessing: Replace all medical abbreviations with their official full form (e.g. COVID-19 = coronavirus disease 2019). Lemmatize?
Processing: Find all papers that mention "vaccine" or "therapy". Within this subgroup, visualize clusters of mentioned compounds, or even a cluster of papers stating that "no drug has been found".
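As a rough baseline for the processing step, the keyword search can be sketched with a plain regular expression before any NLP model is involved. The pattern and the name `mentions_topic` are illustrative assumptions, and the sample sentences are toy data, not rows from the dataset.

```python
import re

# Hypothetical pattern: word stems standing in for the lemmas "vaccine" and "therapy".
TOPIC_PATTERN = re.compile(r"\b(vaccin\w*|therap\w*)", re.IGNORECASE)

def mentions_topic(text):
    """Return True if the text contains a vaccine- or therapy-related word."""
    return bool(TOPIC_PATTERN.search(text))

# Toy sentences used only to illustrate the filter.
sample_abstracts = [
    "No vaccine is currently available.",
    "We forecast the cumulative number of cases in Hubei.",
    "Monoclonal antibody therapies are reviewed.",
]
flags = [mentions_topic(a) for a in sample_abstracts]
print(flags)  # [True, False, True]
```

On the real data the same function could presumably be applied with `abstracts.abstract.apply(mentions_topic)`. Stem matching is cruder than lemmatization, but it is cheap enough to run over every abstract.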

In [4]:
meta_data.abstract[6]

'The initial cluster of severe pneumonia cases that triggered the 2019-nCoV epidemic was identified in Wuhan, China in December 2019. While early cases of the disease were linked to a wet market, human-to-human transmission has driven the rapid spread of the virus throughout China. The Chinese government has implemented containment strategies of city-wide lockdowns, screening at airports and train stations, and isolation of suspected patients; however, the cumulative case count keeps growing every day. The ongoing outbreak presents a challenge for modelers, as limited data are available on the early growth trajectory, and the epidemiological characteristics of the novel coronavirus are yet to be fully elucidated. We use phenomenological models that have been validated during previous outbreaks to generate and assess short-term forecasts of the cumulative number of confirmed reported cases in Hubei province, the epicenter of the epidemic, and for the overall trajectory in China, excludi

In [5]:
meta_data.shape

(29500, 14)

In [6]:
abstracts = meta_data[~meta_data.abstract.isna()]

In [7]:
abstracts.shape

(26553, 14)
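A boolean-mask filter like the one above returns an object that pandas may treat as a slice of `meta_data`. A minimal sketch of an alternative, using a toy frame (the names `toy` and `with_abstract` are made up for illustration), takes an explicit copy so that later column assignments are unambiguous:

```python
import pandas as pd

# Toy stand-in for meta_data; the real frame has 14 columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "abstract": ["text one", None, "text three"],
})

# dropna(subset=...) keeps rows that have an abstract; .copy() makes the result independent.
with_abstract = toy.dropna(subset=["abstract"]).copy()
with_abstract["n_chars"] = with_abstract["abstract"].str.len()

print(with_abstract.shape)  # (2, 3)
```

The filtered frame keeps the original index labels (here 0 and 2), so label-based selections such as `.loc[:1000]` and positional ones such as `.iloc[:1000]` pick different rows.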

In [8]:
nlp = spacy.load("en_core_sci_sm")

In [9]:
doc = nlp(abstracts.abstract.iloc[1])  # second row with an abstract ('abstract' is column 7)

In [10]:
print(list(doc.sents)[3])

As of 18 February 2020, the number of confirmed cases had reached 75,199 with 2009 fatalities.


In [11]:
print(doc.ents)

(cases, unidentified pneumonia, history, exposure, Huanan Seafood Market, Wuhan, Hubei Province, coronavirus, SARS-CoV-2, disease, Human-to-human transmission, disease, COVID-19, World Health Organization, WHO, country, world, cases, fatalities, COVID-19, case-fatality rate, cases, Severe Acute Respiratory Syndrome, SARS, Middle East Respiratory Syndrome, MERS, symptom, fatality, released, official reports, fever, cough, short of breath, chest tightness/pain, comorbidities, fatality, cases, hypertension, diabetes, coronary heart disease, cerebral infarction, chronic bronchitis, virus, pathogenesis, disease, therapeutic drug, Chinese Government, level-1, public health, disease, speed, development, vaccines, drugs, treatment, defeat, COVID-19)


In [42]:
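# Lemmatize each token; spaCy 2 uses '-PRON-' as the lemma of pronouns, so keep the original text there.
out_sent = [w.lemma_ if w.lemma_ != '-PRON-' else w.text for w in doc]
out_sent = ' '.join(out_sent)
out_sent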

'a novel coronavirus ( 2019-ncov ) originate in Wuhan , China present a potential respiratory viral pandemic to the world population . current effort be focus on containment and quarantine of infected individual . ultimately , the outbreak could be control with a protective vaccine to prevent 2019-ncov infection . while vaccine research should be pursue intensely , there exist today no therapy to treat 2019-ncov upon infection , despite an urgent need to find option to help these patient and preclude potential death . herein , I review the potential option to treat 2019-ncov in patient , with an emphasis on the necessity for speed and timeliness in develop new and effective therapy in this outbreak . I consider the option of drug repurposing , develop neutralize monoclonal antibody therapy , and an oligonucleotide strategy target the viral rna genome , emphasize the promise and pitfall of these approach . finally , I advocate for the fast strategy to develop a treatment now , which cou

In [None]:
# abstracts['abstract_preprocessed']

In [20]:
# Work on an independent copy so the assignment below does not trigger SettingWithCopyWarning.
abstracts = abstracts.copy()
# .loc[:1000] is label-based: only rows with an index label up to 1000 get entities, the rest stay NaN.
abstracts['Entities'] = abstracts.loc[:1000, 'abstract'].apply(lambda x: nlp(x).ents)


In [21]:
abstracts.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,Entities
2,210a892deb1c61577f6fba58505fd65356ce6636,CZI,Incubation Period and Other Epidemiological Ch...,10.3390/jcm9020538,,,cc-by,The geographic spread of 2019 novel coronaviru...,2020,"Linton, M. Natalie; Kobayashi, Tetsuro; Yang, ...",Journal of Clinical Medicine,3006065000.0,#1043,True,"((geographic), (COVID-19), (infections), (epic..."
3,e3b40cc8e0e137c416b4a2273a4dca94ae8178cc,CZI,Characteristics of and Public Health Responses...,10.3390/jcm9020575,,32093211.0,cc-by,"In December 2019, cases of unidentified pneumo...",2020,"Deng, Sheng-Qun; Peng, Hong-Juan",J Clin Med,177663100.0,#1999,True,"((cases), (unidentified, pneumonia), (history)..."
5,0df0d5270a9399cf4e23c0cdd877a80616a9725e,CZI,An updated estimation of the risk of transmiss...,10.1016/j.idm.2020.02.001,,,cc-by-nc-nd,The basic reproduction number of an infectious...,2020,"Tang, Biao; Bragazzi, Nicola Luigi; Li, Qian; ...",Infectious Disease Modelling,3006029000.0,#729,True,"((infectious, agent), (infections), (case), (c..."
6,f24242580be243d5fc3f432915d86af6854bb8b7,CZI,Real-time forecasts of the 2019-nCoV epidemic ...,10.1016/j.idm.2020.02.002,,,cc-by-nc-nd,The initial cluster of severe pneumonia cases ...,2020,"Roosa, K.; Lee, Y.; Luo, R.; Kirpich, A.; Roth...",Infectious Disease Modelling,3006029000.0,#865,True,"((cluster), (severe), (pneumonia), (cases), (e..."
8,e1b336d8be1a4c0ccc5a1bf41e48b3b004d3ece1,CZI,COVID-19 outbreak on the Diamond Princess crui...,10.1093/jtm/taaa030,,,cc-by-nc,Cruise ships carry a large number of people in...,2020,"Rocklöv, J.; Sjödin, H.; Wilder-Smith, A.",Journal of Travel Medicine,3006304000.0,#2926,True,"((Cruise), (ships), (people), (confined), (spa..."


In [22]:
# Flag rows whose entity list contains the substring 'vaccine' (this also matches "vaccines").
vaccine_idx = ['vaccine' in str(ents) for ents in abstracts.Entities]

In [23]:
sum(vaccine_idx)

55

In [None]:
# Same substring test for therapy-related entities ('therap' also matches "therapies" and "therapeutic").
therapy_idx = ['therap' in str(ents) for ents in abstracts.Entities]

In [24]:
abstracts.Entities.loc[vaccine_idx]

3      ((cases), (unidentified, pneumonia), (history)...
12     ((SUMMARY), (Genome, Detective), (web-based), ...
14     ((Novel), (Coronavirus), (pathogen), (identifi...
39     ((Viral, diseases), (morbidity), (mortality), ...
44     ((cluster), (cases), (pneumonia), (cause), (Wu...
54     ((beginning), (emergence), (COVID-19), (outbre...
75     ((coronavirus), (Wuhan), (China), (respiratory...
79     ((Wuhan, Health, Commission), (cluster), (atyp...
88     ((Background), (Middle, East, Respiratory, Syn...
96     ((Rapid), (diagnostics), (vaccines), (therapeu...
115    ((coronaviruses), (CoVs), (isolated), (humans)...
118    ((Launched), (Davos), (funding), (sovereign, i...
139    ((Middle, East, Respiratory, Syndrome, coronav...
164    ((years), (editorial, note), (unseasonable), (...
198    ((Chinese), (scientists), (genetic, informatio...
199    ((primary, plan), (editorial), (month), (comme...
200    ((writing), (commentary), (coronavirus), (COVI...
212    ((Abstract), (novel), (o
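One way to move from this selection toward the "clusters of mentioned compounds" idea is to count how many abstracts mention each entity. The sketch below uses plain strings as a stand-in for the spaCy spans; the variable names and the toy lists are assumptions for illustration.

```python
from collections import Counter

# Toy stand-in: entity strings per abstract (the real values are spaCy spans).
entities_per_abstract = [
    ["vaccines", "drugs", "COVID-19"],
    ["vaccine", "monoclonal antibody therapy", "COVID-19"],
    ["drug repurposing", "vaccine"],
]

# Count in how many abstracts each (lower-cased) entity appears.
counts = Counter(
    ent.lower()
    for ents in entities_per_abstract
    for ent in set(ents)  # count each entity at most once per abstract
)
print(counts.most_common(3))
```

Note that "vaccine" and "vaccines" are counted separately here. Lemmatizing the entity text first would merge them.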

In [83]:
abstracts.abstract.iloc[np.where(vaccine_idx)[0][2]]

'Novel Coronavirus (2019-nCoV) is an emerging pathogen that was first identified in Wuhan, China in late December 2019. This virus is responsible for the ongoing outbreak that causes severe respiratory illness and pneumonia-like infection in humans. Due to the increasing number of cases in China and outside China, the WHO declared coronavirus as a global health emergency. Nearly 35,000 cases were reported and at least 24 other countries or territories have reported coronavirus cases as early on as February. Inter-human transmission was reported in a few countries, including the United States. Neither an effective anti-viral nor a vaccine is currently available to treat this infection. As the virus is a newly emerging pathogen, many questions remain unanswered regarding the virus&rsquo;s reservoirs, pathogenesis, transmissibility, and much more is unknown. The collaborative efforts of researchers are needed to fill the knowledge gaps about this new virus, to develop the proper diagnosti

In [33]:
np.where(vaccine_idx)[0][4]

28

In [35]:
abstracts.iloc[np.where(vaccine_idx)[0][6],7]

'A novel coronavirus (2019-nCoV) originating in Wuhan, China presents a potential respiratory viral pandemic to the world population. Current efforts are focused on containment and quarantine of infected individuals. Ultimately, the outbreak could be controlled with a protective vaccine to prevent 2019-nCoV infection. While vaccine research should be pursued intensely, there exists today no therapy to treat 2019-nCoV upon infection, despite an urgent need to find options to help these patients and preclude potential death. Herein, I review the potential options to treat 2019-nCoV in patients, with an emphasis on the necessity for speed and timeliness in developing new and effective therapies in this outbreak. I consider the options of drug repurposing, developing neutralizing monoclonal antibody therapy, and an oligonucleotide strategy targeting the viral RNA genome, emphasizing the promise and pitfalls of these approaches. Finally, I advocate for the fastest strategy to develop a trea

In [29]:
# Add the negation detector only once; re-running this cell otherwise raises
# ValueError [E007] because 'Negex' already exists in the pipeline.
if 'Negex' not in nlp.pipe_names:
    negex = Negex(nlp)
    nlp.add_pipe(negex, last=True)

In [36]:
doc = nlp(abstracts.iloc[np.where(vaccine_idx)[0][6],7])
for e in doc.ents:
    print(e.text, e._.negex)

coronavirus False
Wuhan False
China False
respiratory False
viral pandemic False
world population False
containment False
quarantine False
infected False
individuals False
outbreak False
protective vaccine False
infection False
vaccine research False
treat True
infection True
options True
patients True
potential True
death True
I review False
options False
treat False
patients False
emphasis False
necessity False
speed False
timeliness False
effective False
therapies False
outbreak False
options False
drug repurposing False
neutralizing False
monoclonal antibody therapy False
oligonucleotide False
targeting False
viral RNA genome False
pitfalls False
approaches False
I advocate False
treatment False
resistant False
mutations False
virus False
future False
proposal False
biologic False
blocks False
entry False
soluble False
version False
viral receptor False
angiotensin-converting enzyme 2 False
ACE2 False
fused False
immunoglobulin Fc domain False
ACE2-Fc False
neutralizing antibody Fa
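The negation flags can be used to separate affirmed mentions from negated ones, for example to find abstracts stating that no therapy exists. A minimal sketch, assuming the entities have already been reduced to `(text, flag)` pairs such as `(e.text, e._.negex)`; the pairs below are copied from the output above.

```python
# Stand-in for (entity text, negation flag) pairs such as (e.text, e._.negex).
flagged = [
    ("protective vaccine", False),
    ("vaccine research", False),
    ("treat", True),
    ("monoclonal antibody therapy", False),
]

# Split the entities by their negation flag.
affirmed = [text for text, is_negated in flagged if not is_negated]
negated_entities = [text for text, is_negated in flagged if is_negated]

print(affirmed)
print(negated_entities)
```

In the output above, the negated entities ("treat", "infection", "options", ...) all come from the sentence containing "no therapy", so the negation scope appears to extend over the rest of that sentence and the flags should be read with some care.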

In [45]:
# Token-level inspection, kept for reference:
# for token in doc:
#     print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
#           token.shape_, token.is_alpha, token.is_stop)

In [49]:
abstracts.abstract

2        The geographic spread of 2019 novel coronaviru...
3        In December 2019, cases of unidentified pneumo...
5        The basic reproduction number of an infectious...
6        The initial cluster of severe pneumonia cases ...
8        Cruise ships carry a large number of people in...
                               ...                        
29134    The outbreak of a novel betacoronavirus (SARS-...
29135    The outbreak of coronavirus disease (COVID-19)...
29136    Since December 2019, coronavirus disease 2019 ...
29137    Recent studies have demonstrated that SARS-CoV...
29138    Tilorone is a 50-year-old synthetic small-mole...
Name: abstract, Length: 26553, dtype: object

In [44]:
# Detect abbreviations and their definitions; results are exposed as doc._.abbreviations.
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

Abbreviation 	 Definition


In [66]:
print("Abbreviation", "\t", "Definition")
for i in range(100):
    doc = nlp(abstracts.iloc[i, 7])  # column 7 is 'abstract'
    # Print each abbreviation, its token span, and its long form.
    for abrv in doc._.abbreviations:
        print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

Abbreviation 	 Definition
WHO 	 (57, 58) World Health Organization
SARS 	 (113, 114) Severe Acute Respiratory Syndrome
MERS 	 (121, 122) Middle East Respiratory Syndrome
CoVs 	 (252, 253) Coronaviruses
CoVs 	 (2, 3) Coronaviruses
CoVs 	 (321, 322) Coronaviruses
CoVs 	 (231, 232) Coronaviruses
MERS-CoV 	 (106, 107) East respiratory syndrome coronavirus
SARS 	 (97, 98) severe acute respiratory syndrome
SARSr-CoV 	 (164, 165) SARS-related coronavirus
COVID-19 	 (66, 67) coronavirus disease 2019
COVID-19 	 (21, 22) coronavirus disease 2019
COVID-19 	 (9, 10) coronavirus disease 2019
WHO 	 (13, 14) World Health Organization
HRCT 	 (16, 17) high-resolution computed tomography
HRCT 	 (72, 73) high-resolution computed tomography
GGO 	 (95, 96) ground-glass opacity
PDCoV 	 (180, 181) Porcine delta coronavirus
PDCoV 	 (119, 120) Porcine delta coronavirus
PDCoV 	 (4, 5) Porcine delta coronavirus
lncRNAs 	 (29, 30) non-coding RNAs
lncRNAs 	 (107, 108) non-coding RNAs
lncRNAs 	 (115, 116) non-codin

COVID-19 	 (45, 46) Coronavirus disease 2019
COVID-19 	 (65, 66) Coronavirus disease 2019
COVID-19 	 (8, 9) Coronavirus disease 2019
SARS-CoV-2 	 (16, 17) acute respiratory syndrome coronavirus 2
sCAP 	 (23, 24) severe community-acquired pneumonia
NCP 	 (54, 55) novel coronavirus pneumonia
TCGA 	 (95, 96) Cancer Genome Atlas
TCGA 	 (61, 62) Cancer Genome Atlas
CoVs 	 (261, 262) coronaviruses
CoVs 	 (234, 235) coronaviruses
CoVs 	 (3, 4) coronaviruses
CoVs 	 (19, 20) coronaviruses
CoVs 	 (328, 329) coronaviruses
SARS-CoV 	 (202, 203) acute respiratory syndrome coronavirus
SARS-CoV 	 (68, 69) acute respiratory syndrome coronavirus
SARS-CoV 	 (142, 143) acute respiratory syndrome coronavirus
SARS-CoV 	 (28, 29) acute respiratory syndrome coronavirus
SARS-CoV 	 (245, 246) acute respiratory syndrome coronavirus
MERS-CoV 	 (247, 248) East respiratory syndrome coronavirus
MERS-CoV 	 (90, 91) East respiratory syndrome coronavirus
MERS-CoV 	 (37, 38) East respiratory syndrome coronavirus
MERS-C
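The detected pairs could feed the preprocessing step that replaces abbreviations with their full form. The sketch below is one possible approach using only the standard library; `build_expander` and the small `long_forms` dictionary are hypothetical, with the long forms taken from the output above.

```python
import re

def build_expander(long_forms):
    """Return a function that replaces known abbreviations with their long forms."""
    # Longest abbreviations first so that "SARS-CoV" would win over "SARS".
    keys = sorted(long_forms, key=len, reverse=True)
    # The lookarounds stop a match inside a longer hyphenated token such as "SARS-CoV-2".
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w-])"
    )

    def expand(text):
        return pattern.sub(lambda m: long_forms[m.group(1)], text)

    return expand

long_forms = {
    "WHO": "World Health Organization",
    "COVID-19": "coronavirus disease 2019",
    "SARS": "severe acute respiratory syndrome",
}
expand = build_expander(long_forms)
expanded = expand("The WHO named the disease COVID-19; SARS-CoV-2 is the virus.")
print(expanded)
```

Some detected long forms in the output above look truncated (for example "East respiratory syndrome coronavirus" for MERS-CoV), so a mapping built automatically from the detector would probably need a manual review before use.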

In [67]:
nlp = spacy.load("en_core_sci_sm")
linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)

# Each entity is linked to UMLS with a score (currently just char-3gram matching).
doc = nlp(abstracts.abstract.iloc[1])
entity = doc.ents[1]  # arbitrary entity, chosen only as an example
for umls_ent in entity._.umls_ents:
    print(linker.umls.cui_to_entity[umls_ent[0]])

https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linking_model/tfidf_vectors_sparse.npz not found in cache, downloading to /var/folders/nt/qt5rc4mn5cx7ylghyhs2jhsh0000gn/T/tmp6sfew20t


KeyboardInterrupt: 