# **Decode COVID19 with Genome Analysis**

**Problem Statement:**
You are one of the researchers responding to the White House Office of Science and Technology Policy's call to conduct advanced research on COVID-19. You are working with the CDC, which has led a coordinated effort to set up a machine-readable dataset.

**About the Dataset:**
The dataset represents the most extensive machine-readable coronavirus literature collection available for data and text mining to date, with over 29,000 articles, more than 13,000 of which have full text.

Using the **CORD-NER dataset** and Knowledge Graph, determine and map out the details of the SARS-CoV-2 genome to assist understanding of the emergence, evolution and diagnosis of this deadly virus.



In [1]:
#importing libraries


from tqdm import tqdm
import re


# **1. Importing the Dataset**

In [2]:
# Import the pandas library
import pandas as pd
# Load the first 10,000 records of the JSON Lines file with pandas
full_data = pd.read_json('/kaggle/input/cordner2020/CORD-NER-full.json', nrows=10000, lines=True)
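
In `pd.read_json`, `lines=True` treats the file as JSON Lines (one JSON object per line), and `nrows` caps how many of those lines are read. The sketch below mimics that behaviour with the standard library only, to show what "the first N records" means. The helper name `read_json_lines` and the three-record sample are made up for illustration; this is not how pandas is implemented.

```python
import io
import json
from itertools import islice


def read_json_lines(handle, nrows=None):
    """Parse at most `nrows` records from a JSON Lines stream."""
    lines = (line for line in handle if line.strip())  # skip blank lines
    return [json.loads(line) for line in islice(lines, nrows)]


# Hypothetical sample; the real file has many more fields per record.
sample = io.StringIO(
    '{"id": 0, "title": "a"}\n'
    '{"id": 1, "title": "b"}\n'
    '{"id": 2, "title": "c"}\n'
)
records = read_json_lines(sample, nrows=2)
print(records)
```

Reading only a prefix of the file keeps memory use and spaCy processing time manageable, at the cost of analysing a sample rather than the full collection.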

In [3]:
full_data.columns

Index(['id', 'source', 'doi', 'pmcid', 'pubmed_id', 'publish_time', 'authors',
       'journal', 'title', 'abstract', 'body', 'entities'],
      dtype='object')

# **2. NER Extraction from Text**

In [4]:
# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from spacy import displacy


In [5]:
nlp = spacy.load('en_core_web_lg')
nlp_sm = spacy.load('en_core_web_sm')

Here we'll visualize the entities extracted from some of the text in the dataframe loaded above.

Extract the abstract section

Most scientific papers contain an abstract, which summarizes the main observations and results of the study. To reduce the amount of data to analyze, it can be useful to focus on the abstract instead of searching the full text of the paper.

In [6]:
doc = nlp(full_data["abstract"][100])

In [7]:
# Check how many tokens the document contains
len(doc)

255

In [8]:
# Look at the document-level attributes
dir(doc)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_context',
 '_get_array_attrs',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'from_json',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set

In [9]:
# Tokens in a document can be accessed by their index:
print(doc[5])
dir(doc[5])

in


['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

In [10]:
# Visualize the named entities with spaCy's displaCy
displacy.render(doc, style='ent', jupyter=True)

As we can see, various entities are highlighted above, such as **"GPE"**, which represents **'Countries, cities, states'**, and **"DATE"**, which represents **dates**.
There are also some errors we can spot: for example, **"2019-nCoV"** is labeled as **"DATE"**.
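
Misclassifications like the virus name being tagged as a date can be patched after the fact with a small rule-based pass over the predicted entities. The sketch below is one possible approach, not part of the original pipeline. It assumes the entities are available as `(text, label)` pairs and that the entity text is the full virus name. The `VIRUS` label and the name pattern are hypothetical choices.

```python
import re

# Assumed pattern for common names of the virus; extend as needed.
VIRUS_PATTERN = re.compile(r"^(2019-nCoV|SARS-CoV-2|COVID-19)$", re.IGNORECASE)


def relabel_virus_mentions(entities, new_label="VIRUS"):
    """Return (text, label) pairs with virus names moved to `new_label`."""
    return [
        (text, new_label if VIRUS_PATTERN.match(text) else label)
        for text, label in entities
    ]


# Hand-written predictions that mirror the kind of error seen above.
predicted = [("Wuhan", "GPE"), ("2019-nCoV", "DATE"), ("5 February 2020", "DATE")]
print(relabel_virus_mentions(predicted))
```

Genuine dates keep their `DATE` label because they do not match the pattern.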

In [11]:
spacy.explain("GPE")

'Countries, cities, states'

In [12]:
spacy.explain("CARDINAL")

'Numerals that do not fall under another type'

In [13]:
just_text = full_data['abstract']
docs = list(tqdm(nlp.pipe(just_text), total=len(just_text)))

100%|██████████| 10000/10000 [05:41<00:00, 29.33it/s]


**Let's take a closer look at what spaCy is doing when it performs named entity recognition**

In [14]:
[(i.text, i.ent_iob_ + "-" + i.ent_type_) for i in doc[0:30]]

[('The', 'O-'),
 ('outbreak', 'O-'),
 ('of', 'O-'),
 ('pneumonia', 'O-'),
 ('originating', 'O-'),
 ('in', 'O-'),
 ('Wuhan', 'B-GPE'),
 (',', 'O-'),
 ('China', 'B-GPE'),
 (',', 'O-'),
 ('has', 'O-'),
 ('generated', 'O-'),
 ('24,500', 'B-CARDINAL'),
 ('confirmed', 'O-'),
 ('cases', 'O-'),
 (',', 'O-'),
 ('including', 'O-'),
 ('492', 'B-CARDINAL'),
 ('deaths', 'O-'),
 (',', 'O-'),
 ('as', 'O-'),
 ('of', 'O-'),
 ('5', 'B-DATE'),
 ('February', 'I-DATE'),
 ('2020', 'I-DATE'),
 ('.', 'O-'),
 ('The', 'O-'),
 ('virus', 'O-'),
 ('(', 'O-'),
 ('2019', 'B-DATE')]
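
In the IOB scheme shown above, `B` marks the first token of an entity, `I` marks a token inside one, and `O` marks a token outside any entity. As a rough sketch, the tags can be regrouped into entity spans with plain Python. The function name `iob_to_spans` is made up here, and joining tokens with a single space is a simplification of how spaCy reconstructs span text.

```python
def iob_to_spans(tagged_tokens):
    """Group (text, "IOB-LABEL") pairs into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for text, tag in tagged_tokens:
        iob, _, ent_type = tag.partition("-")
        if iob == "B":                    # a new entity starts here
            if current:
                spans.append((" ".join(current), label))
            current, label = [text], ent_type
        elif iob == "I" and current:      # continue the open entity
            current.append(text)
        else:                             # "O": close any open entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:                           # flush an entity that ends the input
        spans.append((" ".join(current), label))
    return spans


tagged = [("of", "O-"), ("5", "B-DATE"), ("February", "I-DATE"),
          ("2020", "I-DATE"), (".", "O-"), ("Wuhan", "B-GPE")]
print(iob_to_spans(tagged))
```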

In [15]:
len(docs)

10000

In [16]:
from collections import Counter

# Collect every GPE (geopolitical entity) mention across all abstracts
all_gpe = []
for d in docs:
    gpes = [ent.text for ent in d.ents if ent.label_ == "GPE"]
    all_gpe.extend(gpes)

Counter(all_gpe).most_common(15)

[('China', 1541),
 ('Wuhan', 389),
 ('Hong Kong', 198),
 ('DC', 169),
 ('US', 152),
 ('Taiwan', 148),
 ('the United States', 145),
 ('UK', 134),
 ('Japan', 134),
 ('Thailand', 128),
 ('Beijing', 115),
 ('West Africa', 115),
 ('Canada', 112),
 ('Korea', 109),
 ('USA', 102)]
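
The counts above split one country across several surface forms (`'US'`, `'the United States'`, `'USA'`). One possible way to merge them is a small alias table applied before ranking. The table and the canonical names below are assumptions made for this sketch, and the sample counts are copied from the output above.

```python
from collections import Counter

# Assumed alias table; the canonical names are a choice made for this sketch.
ALIASES = {
    "US": "United States",
    "USA": "United States",
    "the United States": "United States",
    "UK": "United Kingdom",
}


def merge_aliases(counts, aliases=ALIASES):
    """Sum the counts of all surface forms that map to the same canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


counts = Counter({"China": 1541, "US": 152, "the United States": 145,
                  "UK": 134, "USA": 102})
merged = merge_aliases(counts)
print(merged.most_common(3))
```

After merging, the United States totals 152 + 145 + 102 = 399 mentions, which would rank it second in this sample rather than fifth.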

In [17]:
all_GPE = []
for d in docs:
    for ent in d.ents:
        if ent.label_ != "GPE":
            continue
        if re.search("origin|case|transmission", ent.sent.text):
            all_GPE.append(ent.text)

Counter(all_GPE).most_common(10)

[('China', 348),
 ('Wuhan', 131),
 ('West Africa', 36),
 ('Hubei', 34),
 ('the United States', 32),
 ('Taiwan', 28),
 ('Saudi Arabia', 28),
 ('UK', 27),
 ('Hong Kong', 26),
 ('US', 26)]
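
The pattern `"origin|case|transmission"` matches substrings, so it also fires on words such as "casein" or "original". A stricter variant with word boundaries might look like the sketch below. The list of inflections is an assumption and the example sentences are invented.

```python
import re

LOOSE = re.compile("origin|case|transmission")
# Assumed stricter variant: whole words only, with a few chosen inflections.
STRICT = re.compile(
    r"\b(origin(s|ated|ating)?|cases?|transmission)\b", re.IGNORECASE
)

sentences = [
    "The first case was reported in Wuhan.",
    "Levels of casein were measured in Canada.",
    "The original protocol was developed in Japan.",
]
for s in sentences:
    # Columns: loose match, strict match, sentence
    print(bool(LOOSE.search(s)), bool(STRICT.search(s)), s)
```

Only the first sentence passes the strict filter, while all three pass the loose one.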

# **Dependency parses**
Named entity recognition is useful for identifying named entities in isolation or in the context of other terms or concepts. NER on its own tells us little about the relationships between named entities. Often, the relationship between entities is the interesting piece of information for applied researchers, and we can get at that relationship by using the grammar of the sentence.

Dependency parses are a way of representing the syntax or grammar of a sentence. For example, a dependency parse might identify that a particular word is a noun, and specifically that it is the subject of the sentence's main verb.

While this isn't strictly speaking information extraction (although it is structured prediction), having access to a dependency parse can be very valuable in extracting information from documents.

First, let's look at how a dependency parse encodes grammatical information by using spaCy's dependency visualizer.

In [18]:
doc = nlp(full_data["abstract"][100])
sent = list(doc.sents)[1]
displacy.render(sent, style="dep", jupyter=True)

In [19]:
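print(doc)
tok = doc[8]  # "China" (doc[7] is the comma that precedes it)
print(tok)

def loc_to_verb(tok):
    """Return the nearest ancestor verb of `tok` with its direct object and adjectival modifiers."""
    verb_phrase = []
    # Walk up the dependency tree and stop at the first verb.
    for i in tok.ancestors:
        if i.pos_ == "VERB":
            verb_phrase.append(i)
            # Add the verb's direct object(s), if any.
            verb_phrase.extend([j for j in i.children if j.dep_ == "dobj" and tok in i.subtree])
            break

    # Add adjectival modifiers; tokens appended here are visited by the loop too.
    for i in verb_phrase:
        for j in i.children:
            if j.dep_ == "amod":
                verb_phrase.append(j)

    # Restore document order before joining the tokens back into text.
    new_list = sorted(verb_phrase, key=lambda x: x.i)
    return ''.join([i.text_with_ws for i in new_list]).strip()


loc_to_verb(tok)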

The outbreak of pneumonia originating in Wuhan, China, has generated 24,500 confirmed cases, including 492 deaths, as of 5 February 2020. The virus (2019-nCoV) has spread elsewhere in China and to 24 countries, including South Korea, Thailand, Japan and USA. Fortunately, there has only been limited human-to-human transmission outside of China. Here, we assess the risk of sustained transmission whenever the coronavirus arrives in other countries. Data describing the times from symptom onset to hospitalisation for 47 patients infected early in the current outbreak are used to generate an estimate for the probability that an imported case is followed by sustained human-to-human transmission. Under the assumptions that the imported case is representative of the patients in China, and that the 2019-nCoV is similarly transmissible to the SARS coronavirus, the probability that an imported case is followed by sustained human-to-human transmission is 0.41 (credible interval [0.27, 0.55]). Howev

'originating'
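
The ancestor walk that `loc_to_verb` performs can be illustrated without spaCy on a hand-written toy parse. In the sketch below, each token stores the index of its head, and the function climbs from a token towards the root until it meets a verb. The heads and part-of-speech tags are assumptions written by hand for illustration; they are not spaCy output.

```python
# Toy parse of "pneumonia originating in Wuhan, China".
# Each entry is (text, part_of_speech, index_of_head); the root points to itself.
TOY_PARSE = [
    ("pneumonia", "NOUN", 0),
    ("originating", "VERB", 0),
    ("in", "ADP", 1),
    ("Wuhan", "PROPN", 2),
    (",", "PUNCT", 3),
    ("China", "PROPN", 3),
]


def ancestors(parse, i):
    """Yield indices from the token's head up to the root."""
    while parse[i][2] != i:
        i = parse[i][2]
        yield i


def nearest_verb(parse, i):
    """Return the text of the closest ancestor tagged VERB, or None."""
    for j in ancestors(parse, i):
        if parse[j][1] == "VERB":
            return parse[j][0]
    return None


print(nearest_verb(TOY_PARSE, 5))  # index 5 is "China"
```

Under these assumed heads, both "Wuhan" and "China" climb to the same verb, which matches the result returned by `loc_to_verb` above.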

We can then use our function to identify all the actions related to a single city, Wuhan.

In [20]:
wuhan_covid = []

# Collect the verb phrase attached to every mention of Wuhan
for token in doc:
    if token.text == "Wuhan":
        wuhan_covid.append(loc_to_verb(token))

sorted(set(wuhan_covid))

['originating']

In [21]:
doc = nlp(full_data["abstract"][100])
displacy.render(doc, style="ent", jupyter=True)

In [22]:
# Clear cached memory to avoid out-of-memory errors
import gc
import torch

def report_gpu():
    print(torch.cuda.list_gpu_processes())
    gc.collect()
    torch.cuda.empty_cache()

# **Question-Answering**
One popular QA training dataset is SQuAD 2.0, and we can download a transformer model that has already been fine-tuned on it from the Hugging Face model repository.

In [23]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

hugg = pipeline('question-answering', model=model_name, tokenizer=model_name)

In [25]:
QA_input = {
    'question': "Where do 2019-ncov originated?",
    'context': doc.text
}
res = hugg(QA_input)

print(res)

{'score': 0.0001072749073500745, 'start': 184, 'end': 189, 'answer': 'China'}


In [26]:
QA_input = {
    'question': "How 2019-ncov spread?",
    'context': doc.text
}
res = hugg(QA_input)

print(res)

{'score': 0.007484417874366045, 'start': 171, 'end': 209, 'answer': 'elsewhere in China and to 24 countries'}


In [27]:
QA_input = {
    'question': "How corona virus evolved?",
    'context': doc.text
}
res = hugg(QA_input)

print(res)

{'score': 2.3563768536405405e-07, 'start': 844, 'end': 848, 'answer': 'SARS'}


In [28]:
QA_input = {
    'question': "How dangerous is corona virus?",
    'context': doc.text
}
res = hugg(QA_input)

print(res)

{'score': 0.00035880590439774096, 'start': 374, 'end': 396, 'answer': 'sustained transmission'}


In [29]:
QA_input = {
    'question': "Which city the first case originated from?",
    'context': doc.text
}
res = hugg(QA_input)

print(res)

{'score': 7.412996637867764e-05, 'start': 41, 'end': 46, 'answer': 'Wuhan'}


In [30]:
QA_input = {
    'question': "How to prevent from corona virus?",
    'context': doc.text
}
res = hugg(QA_input)

print(res)

{'score': 0.00011674649431370199, 'start': 1076, 'end': 1096, 'answer': 'intense surveillance'}
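
All of the scores returned above are below 0.01, which suggests the model has little confidence in these answers. A minimal sketch of how such results could be screened before use is shown below. The threshold of 0.1 is an arbitrary assumption, the helper names are made up, and the result dictionaries are copied from the outputs above. The second helper checks that `start` and `end` are character offsets into the context.

```python
def confident_answers(results, threshold=0.1):
    """Keep only QA results whose score reaches `threshold`."""
    return [r for r in results if r["score"] >= threshold]


def answer_matches_span(result, context):
    """Check that start/end index the answer text inside the context."""
    return context[result["start"]:result["end"]] == result["answer"]


# Opening words of the abstract used as context in the cells above.
context = ("The outbreak of pneumonia originating in Wuhan, China, "
           "has generated 24,500 confirmed cases")
results = [
    {"score": 7.412996637867764e-05, "start": 41, "end": 46, "answer": "Wuhan"},
    {"score": 0.007484417874366045, "start": 171, "end": 209,
     "answer": "elsewhere in China and to 24 countries"},
]
print(confident_answers(results))
print(answer_matches_span(results[0], context))
```

With this threshold, neither answer is kept, so the extracted spans are better read as weak hints than as reliable answers.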
