# Text Search Model for CDC - NLP:

### Problem Statement:

You are a data scientist at the Centers for Disease Control and Prevention (CDC). You will play a critical part in developing a new plan, launched by the Department of Health and Human Services, to prevent, control, and respond to future pandemics. Your goal is to build an efficient, smart search engine for fast access to the CDC’s document database. To do this, you need to research and develop strategies to aggregate and quickly search historical, and likely unstructured, text data about earlier pandemics so that it can be easily accessed when needed.

In [1]:
import spacy 
import json

In [4]:
# instantiating spacy model
model = spacy.load('en_core_web_sm')

In [5]:
# loading the JSON file
with open('/content/drive/MyDrive/data.json', 'r') as infile:
    summaries = json.load(infile)
print(summaries[0].keys())

dict_keys(['title', 'text', 'url'])


## Exploring the Data:

In [6]:
text = summaries[1]['text']

In [7]:
# lower-casing the data in the text
# exploring the attributes of each token
text_tokenized = model(text.lower())
for token in text_tokenized[:5]:
    print(type(token),
          token.text, token.pos_,
          token.dep_)

<class 'spacy.tokens.token.Token'> hiv PROPN nmod
<class 'spacy.tokens.token.Token'> / SYM punct
<class 'spacy.tokens.token.Token'> aids PROPN nsubjpass
<class 'spacy.tokens.token.Token'> , PUNCT punct
<class 'spacy.tokens.token.Token'> or CCONJ cc


In [10]:
# tokens the parser left without a dependency label (here, only whitespace)
unclassified_tokens = [
    (token.lemma_, token.dep_)
    for token in text_tokenized
    if token.dep_ == ''
]
unclassified_tokens

[(' ', ''), ('\n', ''), ('\n', '')]

## Removing stopwords and punctuation:

In [11]:
tokens_without_sw = [
    word for word in text_tokenized
    if not word.is_stop and not word.is_punct
]
tokens_without_sw[:10]

[hiv,
 aids,
 human,
 immunodeficiency,
 virus,
 considered,
 authors,
 global,
 pandemic,
 currently]

## Lemmatizing and tokenizing the texts:

In [13]:
# lemmatizing, and dropping tokens with no dependency label (whitespace)
token_lemmas = [
    token.lemma_ for token in tokens_without_sw
    if token.dep_
]
token_lemmas[:10]

['hiv',
 'aids',
 'human',
 'immunodeficiency',
 'virus',
 'consider',
 'author',
 'global',
 'pandemic',
 'currently']

In [14]:
def tokenizer(document):
    """Lower-case, drop stopwords and punctuation, then lemmatize."""
    text_lowercased = model(document.lower())
    tokens_without_stopwords = [
        word for word in text_lowercased
        if not word.is_stop and not word.is_punct
    ]
    # tokens with an empty dep_ are whitespace, so they are skipped
    token_lemmatized = [
        token.lemma_ for token in tokens_without_stopwords
        if token.dep_
    ]
    return token_lemmatized

In [15]:
for doc in summaries:
    doc['tokenized_text'] = tokenizer(doc['text'])

In [16]:
summaries[0]['tokenized_text']

['pandemic',
 'greek',
 'πᾶν',
 'pan',
 'δῆμος',
 'demos',
 'people',
 'epidemic',
 'infectious',
 'disease',
 'spread',
 'large',
 'region',
 'instance',
 'multiple',
 'continent',
 'worldwide',
 'affect',
 'substantial',
 'number',
 'people',
 'widespread',
 'endemic',
 'disease',
 'stable',
 'number',
 'infected',
 'people',
 'pandemic',
 'widespread',
 'endemic',
 'disease',
 'stable',
 'number',
 'infected',
 'people',
 'recurrence',
 'seasonal',
 'influenza',
 'generally',
 'exclude',
 'occur',
 'simultaneously',
 'large',
 'region',
 'globe',
 'spread',
 'worldwide',
 'human',
 'history',
 'number',
 'pandemic',
 'disease',
 'smallpox',
 'tuberculosis',
 'fatal',
 'pandemic',
 'record',
 'history',
 'black',
 'death',
 'know',
 'plague',
 'kill',
 'estimate',
 '75–200',
 'million',
 'people',
 '14th',
 'century',
 'term',
 'later',
 'pandemic',
 'include',
 '1918',
 'influenza',
 'pandemic',
 'spanish',
 'flu',
 'current',
 'pandemic',
 'include',
 'covid-19',
 'sars',
 'cov-2',

In [17]:
with open('summaries.json', 'w') as outfile:
    json.dump(summaries, outfile)

In [18]:
import itertools
from collections import Counter
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [20]:
with open('/content/summaries.json', 'r') as infile:
    summaries = json.load(infile)

## Building a Vocabulary corpus:

In [22]:
# collecting the tokenized text of every document (a list of lists):
tokenized_texts = [i['tokenized_text'] for i in summaries]

# flattening the list of lists
vocab = list(itertools.chain(*tokenized_texts))

# removing duplicates
vocab = list(set(vocab))
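
The iteration order of a `set` of strings can differ between Python sessions, so the position of each token in `vocab` is only stable if the saved list is reloaded rather than rebuilt. A minimal sketch of one way to get a reproducible order by sorting, using a made-up `toy_tokenized_texts` list in place of the real corpus:

```python
import itertools

# Hypothetical toy data standing in for summaries[i]['tokenized_text']
toy_tokenized_texts = [
    ['pandemic', 'disease', 'spread'],
    ['disease', 'virus'],
]

# flatten the list of lists, drop duplicates, and sort so that every
# token keeps the same position from one run to the next
toy_vocab = sorted(set(itertools.chain.from_iterable(toy_tokenized_texts)))
print(toy_vocab)  # ['disease', 'pandemic', 'spread', 'virus']
```

Sorting costs little at this scale and makes the column order of any vector built from the vocabulary predictable.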

In [24]:
with open('vocab.json','w') as outfile:
    json.dump(vocab, outfile)

In [25]:
# counting how often each token appears in each document
docs_token_counter = []
for doc in summaries:
    doc_tokenized = doc['tokenized_text']
    docs_token_counter.append(Counter(doc_tokenized))

## Counting the documents that contain each token:

In [26]:
# document frequency: in how many documents does each token occur?
number_docs_with_token = {}
for token in vocab:
    count_docs = sum(1 for doc in docs_token_counter if token in doc)
    number_docs_with_token[token] = count_docs

In [30]:
number_docs_with_token['pandemic']

17
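
The loop in cell 26 scans every document once per vocabulary token. As a sketch of an alternative, the same counts can be gathered in a single pass by feeding each document's distinct tokens into one `Counter`. The `toy_docs` data below is invented for illustration:

```python
from collections import Counter

# Hypothetical tokenized documents (not the real corpus)
toy_docs = [
    ['pandemic', 'disease', 'pandemic'],
    ['disease', 'virus'],
    ['pandemic', 'virus', 'virus'],
]

# set(doc) keeps each token once per document, so the Counter ends up
# holding the number of documents that contain each token
toy_doc_freq = Counter()
for doc in toy_docs:
    toy_doc_freq.update(set(doc))

print(toy_doc_freq['pandemic'])  # 2
```

Repeated tokens inside one document (such as the second `'pandemic'` in the first toy document) do not inflate the count, because only the set of distinct tokens is added.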

## Computing the TF-IDF of each document:

In [31]:
for i, doc in enumerate(docs_token_counter):
    # total number of tokens in this document
    doc_length = len(summaries[i]['tokenized_text'])
    tfidf_vec = []
    for token in vocab:

        # computing the term frequency per document
        tf = doc[token] / doc_length

        # computing the log of the inverse document frequency
        idf = np.log(len(summaries) / number_docs_with_token[token])

        tfidf = tf * idf
        tfidf_vec.append(tfidf)

    # add the tfidf vector to the document's dictionary
    summaries[i]['tf_idf'] = tfidf_vec
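
To make the arithmetic concrete, here is a small worked sketch of the same formula (term frequency times the log of the inverse document frequency) on an invented two-document corpus. The `toy_docs` data and the `toy_tfidf` helper are hypothetical, and the helper assumes the token occurs in at least one document, as every vocabulary token does:

```python
import math
from collections import Counter

# Hypothetical tokenized documents (not the real corpus)
toy_docs = [
    ['pandemic', 'influenza', 'pandemic', 'spread'],
    ['pandemic', 'virus'],
]

def toy_tfidf(token, doc, docs):
    # term frequency: share of the document's tokens equal to `token`
    tf = Counter(doc)[token] / len(doc)
    # inverse document frequency: log(N / number of docs containing token)
    # (assumes the token appears in at least one document)
    n_containing = sum(1 for d in docs if token in d)
    idf = math.log(len(docs) / n_containing)
    return tf * idf

# 'pandemic' is in both documents, so idf = log(2 / 2) = 0
print(toy_tfidf('pandemic', toy_docs[0], toy_docs))
# 'influenza' is in one of the two documents: (1 / 4) * log(2)
print(toy_tfidf('influenza', toy_docs[0], toy_docs))
```

With this unsmoothed formula, a token that appears in every document always scores zero, however often it occurs.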

In [32]:
with open('summaries.json','w') as json_file:
    json.dump(summaries, json_file)

In [39]:
query = 'highest pandemic casualty'

In [51]:
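def vectorizer(query, vocab=vocab):
    # tokenize the query the same way the documents were tokenized
    query_tokens = tokenizer(query)
    query_token_counter = Counter(query_tokens)
    # total number of query tokens; at least 1 to avoid dividing by zero
    # when the query contains only stopwords or punctuation
    n_tokens = max(len(query_tokens), 1)
    query_vec = []
    for token in vocab:
        # term frequency relative to all query tokens, matching the
        # definition used for the documents
        tf = query_token_counter[token] / n_tokens
        idf = np.log(len(summaries) / number_docs_with_token[token])
        tfidf = tf * idf
        query_vec.append(tfidf)

    return query_vec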

## Searching Documents using Scikit-Learn:

In [55]:
# building a search function
def search_tfidf(query, docs):

    # vectorize the query
    query_vec = vectorizer(query)
    query_arr = np.array(query_vec).reshape(1, -1)
    rankings = []
    for doc in docs:
        doc_arr = np.array(doc['tf_idf']).reshape(1, -1)
        # cosine similarity between the query and document vectors
        rank = cosine_similarity(query_arr, doc_arr)[0][0]
        # keep only documents that share at least one term with the query
        if rank > 0:
            rankings.append({'title': doc['title'], 'rank': rank})
    # most similar documents first
    return sorted(rankings,
                  key=lambda k: k['rank'],
                  reverse=True)
rankings = search_tfidf(query, summaries)

In [56]:
search_tfidf('ebola', summaries)

[{'rank': 0.11754261855142299, 'title': 'Plague of Cyprian'},
 {'rank': 0.071125289564604, 'title': 'Science diplomacy and pandemics'}]
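
The `rank` values come from cosine similarity, which compares the direction of two vectors and ignores their length. A minimal pure-Python sketch of that calculation follows. The `toy_cosine_similarity` helper and both vectors are hypothetical, and the zero-vector handling is an assumption of this sketch:

```python
import math

def toy_cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # treat an all-zero vector as having no similarity to anything
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# hypothetical TF-IDF vectors over a three-token vocabulary
query_vec = [0.0, 0.5, 0.0]
doc_vec = [0.3, 0.4, 0.0]
print(toy_cosine_similarity(query_vec, doc_vec))  # approximately 0.8
```

Because only direction matters, multiplying the query vector by any positive constant leaves the score, and therefore the ranking, unchanged.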

In [58]:
# Let's check whether the article 'Plague of Cyprian' contains the word "ebola"
for s in summaries:
    if s["title"] == 'Plague of Cyprian':
        print(s["text"])

The Plague of Cyprian was a pandemic that afflicted the Roman Empire about from AD 249 to 262. The plague is thought to have caused widespread manpower shortages for food production and the Roman army, severely weakening the empire during the Crisis of the Third Century. Its modern name commemorates St. Cyprian, bishop of Carthage, an early Christian writer who witnessed and described the plague. The agent of the plague is highly speculative because of sparse sourcing, but suspects have included smallpox, pandemic influenza and viral hemorrhagic fever (filoviruses) like the Ebola virus.


## Building an inverted index:


In [59]:
inverted_index = {}
for i, word in enumerate(vocab):
    inverted_index[word] = []
    for doc in summaries:
        # for each word in the corpus vocabulary, list all articles
        # it occurs in and the word's TF-IDF score for each article
        if doc['tf_idf'][i] != 0:
            inverted_index[word].append((doc['title'], doc['tf_idf'][i]))

# Now you have a lookup table of all articles that contain a particular keyword.
# Let's request the list of articles with the word "coronavirus" in them.
inverted_index["coronavirus"]

[('COVID-19 pandemic', 0.05749582125920263)]
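
The index in cell 59 is derived from the dense `tf_idf` vectors, which hold a score for every vocabulary token in every document. As a sketch, assuming the scores were instead available as sparse per-document dictionaries (the `toy_scores` structure below is hypothetical), an index of the same shape could be built with `defaultdict`:

```python
from collections import defaultdict

# Hypothetical sparse TF-IDF scores: title -> {token: score}
toy_scores = {
    'Doc A': {'virus': 0.2, 'plague': 0.1},
    'Doc B': {'virus': 0.05},
}

# map each token to the (title, score) pairs of the documents containing it
toy_index = defaultdict(list)
for title, scores in toy_scores.items():
    for token, score in scores.items():
        # skip zero scores, as the dense version does
        if score != 0:
            toy_index[token].append((title, score))

print(toy_index['virus'])  # [('Doc A', 0.2), ('Doc B', 0.05)]
```

This visits only the tokens each document actually contains instead of every vocabulary entry.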

In [60]:
# check if 'coronavirus' is in the article
for s in summaries:
    if s['title'] =='COVID-19 pandemic':
        print(s['text'])

The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, China. The outbreak was declared a Public Health Emergency of International Concern in January 2020, and a pandemic in March 2020. As of 17 October 2020, more than 39.5 million cases have been confirmed, with more than 1.1 million deaths attributed to COVID-19.

Common symptoms include fever, cough, fatigue, breathing difficulties, and loss of smell. Complications may include pneumonia and acute respiratory distress syndrome. The incubation period is typically around five days but may range from one to 14 days. There are several vaccine candidates in development, although none have proven their safety and efficacy. There is no known specific antiviral medication, so primary treatment is currently symptomatic.
Recommended preventive m

## Searching the inverted index:

In [62]:
from collections import defaultdict

In [66]:
# Reuse the tokenizer from Milestone 1 to tokenize search queries

def tokenizer(document):
    text_lowercased = model(document.lower())
    tokens_without_stopwords = [word for word 
                     in text_lowercased 
                     if not word.is_stop 
                     and not word.is_punct
                     and len(word.dep_.strip())!=0]   
    
    token_lemmatized = [token.lemma_ 
               for token
               in tokens_without_stopwords]
    
    return token_lemmatized

In [67]:
def search(query, index=inverted_index):
    query_tokens = tokenizer(query)
    # collect (title, score) pairs for every query token;
    # tokens missing from the index contribute nothing
    matches = []
    for token in query_tokens:
        matches.extend(index.get(token, []))

    # sum the TF-IDF scores per article
    output = defaultdict(float)
    for title, score in matches:
        output[title] += score

    # highest total score first
    return sorted(output.items(),
                  key=lambda x: x[1], reverse=True)
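
To see how the per-token scores combine, here is a sketch of the same summing logic on a hand-made index. The `toy_index` contents and the `toy_search` helper are hypothetical, and the query tokens are passed in directly instead of going through spaCy:

```python
from collections import defaultdict

# Hypothetical inverted index: token -> [(title, TF-IDF score), ...]
toy_index = {
    'ebola': [('Doc A', 0.25)],
    'virus': [('Doc A', 0.125), ('Doc B', 0.5)],
}

def toy_search(query_tokens, index):
    totals = defaultdict(float)
    for token in query_tokens:
        # tokens missing from the index are skipped
        for title, score in index.get(token, []):
            totals[title] += score
    # highest total score first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(toy_search(['ebola', 'virus', 'unknown'], toy_index))
# [('Doc B', 0.5), ('Doc A', 0.375)]
```

In this toy data, 'Doc B' matches only one query token yet still outranks 'Doc A', which matches both, because the scores are simply added.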

In [69]:
title, score = search(query='world health organiZation')[0]
for s in summaries:
    if s['title'] ==title:
        print(s['text'])

The Johns Hopkins Center for Health Security (abbreviated CHS; previously the UPMC Center for Health Security, the Center for Biosecurity of UPMC, and the Johns Hopkins Center for Civilian Biodefense Strategies) is an independent, nonprofit organization of the Johns Hopkins Bloomberg School of Public Health, and part of the Environmental Health and Engineering department. It is concerned with the areas of health consequences from epidemics and disasters as well as averting biological weapons development, and implications of biosecurity for the bioeconomy. It is a think tank that does policy research and gives policy recommendations to the United States government as well as the World Health Organization and the UN Biological Weapons Convention.


In [70]:
search(query='Ebola virus')

[('Virus', 0.06746676589985189),
 ('Plague of Cyprian', 0.0634287152349009),
 ('Crimson Contagion', 0.0339553131009123),
 ('Viral load', 0.03386619154421699),
 ('Disease X', 0.031470777995967494),
 ('Swine influenza', 0.028050041257275376),
 ('Science diplomacy and pandemics', 0.027286695292144007),
 ('HIV/AIDS in Yunnan', 0.022837201731587032),
 ('HIV/AIDS', 0.013653988336874786),
 ('Spanish flu', 0.012903018978346673),
 ('Epidemiology of HIV/AIDS', 0.005973619897382719),
 ('COVID-19 pandemic', 0.005060007442488892)]

In [75]:
for s in summaries:
    if s["title"] == 'Crimson Contagion':
        print(s["text"])

Crimson Contagion was a joint exercise conducted from January to August 2019, in which numerous national, state and local, private and public organizations in the US participated, in order to test the capacity of the federal government and twelve states to respond to a severe pandemic of influenza originating in China.
The simulation, which was conducted months prior to the start of the COVID-19 pandemic, involves a scenario in which tourists returning from China spread a respiratory virus in the United States, beginning in Chicago. In less than two months the virus had infected 110 million Americans, killing more than half a million. The report issued at the conclusion of the exercise outlines the government's limited capacity to respond to a pandemic, with federal agencies lacking the funds, coordination, and resources to facilitate an effective response to the virus.
