# Milestone 2: TF-IDF Search using Cosine Similarity

Import spaCy and load its small English model, along with the JSON, filesystem, and cosine-similarity utilities used below.

In [1]:
import spacy
import json
import os
from sklearn.metrics.pairwise import cosine_similarity
nlp = spacy.load('en_core_web_sm')

Load the tokenized data produced in Milestone 1 from its JSON file and create the output directory for this milestone.

In [2]:
with open('./outputs/milestone-1/tokenized-data.json') as f:
    data = json.load(f)

# Ensure that the directory ./outputs/milestone-2 exists
os.makedirs('./outputs/milestone-2', exist_ok=True)

Build the vocabulary (named `corpus` in the code): the set of unique tokens across all documents in the JSON file. Then save it as a JSON file.

In [3]:
# Collect every unique token across all documents
corpus = {token for item in data for token in item['tokenized_text']}

with open('./outputs/milestone-2/corpus.json', 'w') as f:
    json.dump(list(corpus), f)

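Define a function that builds the vector for a given text input over the vocabulary. As written, each entry is the raw count of a vocabulary token in the text (the term-frequency part); no inverse-document-frequency weighting is applied, although the notebook keeps the name `tf_idf` for these vectors.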

In [4]:
def build_tf_idf_vector(corpus, text_input):
    # Start with a zero count for every vocabulary token
    tf_idf_vector = dict.fromkeys(corpus, 0)
    # Tokenize the input text with spaCy
    tokens = nlp(text_input)
    # Count occurrences of each token that is in the vocabulary;
    # tokens outside the vocabulary are ignored
    for token in tokens:
        if token.text in tf_idf_vector:
            tf_idf_vector[token.text] += 1
    return tf_idf_vector
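
The function above stops at term counts. A minimal sketch of how an IDF weight could be layered on top is shown below. The helper names `compute_idf` and `apply_idf`, the toy documents, and the smoothed formula `ln((1 + N) / (1 + df)) + 1` are assumptions made for illustration; they are not part of this notebook.

```python
import math

def compute_idf(tokenized_docs):
    """Return a smoothed IDF weight for every token seen in tokenized_docs.

    tokenized_docs is a list of token lists (hypothetical input shape).
    Uses idf(t) = ln((1 + N) / (1 + df(t))) + 1, one common smoothed variant.
    """
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        # Count each token at most once per document
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def apply_idf(count_vector, idf):
    """Scale a {token: count} vector by IDF; tokens without a weight get 0."""
    return {t: count * idf.get(t, 0.0) for t, count in count_vector.items()}

# Toy example: "flu" appears in both documents, so it gets the lowest weight
docs = [["flu", "virus"], ["flu", "pandemic"]]
idf = compute_idf(docs)
weighted = apply_idf({"flu": 2, "virus": 1, "pandemic": 0}, idf)
```

Tokens that occur in every document receive the smallest weight, so they contribute less to the similarity score than rarer, more distinctive tokens.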

Compute the vector for each document in the data and store it under the `tf_idf` key.

In [5]:
for item in data:
    item['tf_idf'] = build_tf_idf_vector(corpus, item['text'])

Define a function that searches for a query in the data using cosine similarity.

In [6]:
def search(query, data):
    # Build the query vector over the same vocabulary as the documents
    query_tf_idf = build_tf_idf_vector(corpus, query)
    query_vector = [list(query_tf_idf.values())]
    search_data = []
    # Compute the cosine similarity between the query and each document
    for item in data:
        similarity = cosine_similarity([list(item['tf_idf'].values())], query_vector)
        search_data.append({
            'title': item['title'],
            'text': item['text'],
            'url': item['url'],
            'similarity': similarity[0][0]
        })
    # Sort the search results by similarity, highest first
    search_data = sorted(search_data, key=lambda x: x['similarity'], reverse=True)
    return search_data
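
To make the scoring step concrete, here is a small pure-Python sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. The function name `cosine` and the toy vectors are hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch rather than a statement about the scikit-learn implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption for this sketch: an all-zero vector is similar to nothing
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [1, 1, 0]   # counts over a three-token vocabulary
doc_a = [2, 2, 0]   # same direction as the query, just longer
doc_b = [1, 0, 3]   # shares one token with the query
doc_c = [0, 0, 5]   # shares no token with the query

scores = [cosine(query, d) for d in (doc_a, doc_b, doc_c)]
```

Because the measure depends only on direction, `doc_a` scores about 1.0 even though its counts are twice the query's, while `doc_c` shares no token with the query and scores 0.0.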

Example of a search using the query "Spanish flu".

In [9]:
search_results = search("Spanish flu", data)
for result in search_results:
    print(result['title'], result['similarity'])

Swine influenza 0.4800153607373193
Spanish flu 0.35478743759344955
Pandemic 0.08980265101338746
Epidemiology of HIV/AIDS 0.0
Antonine Plague 0.0
Basic reproduction number 0.0
Bills of mortality 0.0
Cholera 0.0
COVID-19 pandemic 0.0
Crimson Contagion 0.0
Disease X 0.0
Event 201 0.0
HIV/AIDS 0.0
HIV/AIDS in Yunnan 0.0
Pandemic prevention 0.0
Pandemic Severity Assessment Framework 0.0
Pandemic severity index 0.0
Plague of Cyprian 0.0
PREDICT (USAID) 0.0
1929–1930 psittacosis pandemic 0.0
Science diplomacy and pandemics 0.0
Superspreader 0.0
Targeted immunization strategies 0.0
Unified Victim Identification System 0.0
Viral load 0.0
Virus 0.0
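
Most documents in the output above score 0.0, so it can help to show only the actual matches. The helper below is a hypothetical sketch: the name `top_results` and its parameters are assumptions, it expects a list already sorted by similarity as `search` returns it, and the sample scores are rounded from the output above.

```python
def top_results(results, k=5, min_similarity=0.0):
    """Keep at most k results whose similarity is strictly above min_similarity.

    Assumes results is already sorted from highest to lowest similarity.
    """
    matches = [r for r in results if r['similarity'] > min_similarity]
    return matches[:k]

# Sample entries with scores rounded from the search output
sample = [
    {'title': 'Swine influenza', 'similarity': 0.48},
    {'title': 'Spanish flu', 'similarity': 0.35},
    {'title': 'Pandemic', 'similarity': 0.09},
    {'title': 'Cholera', 'similarity': 0.0},
]

for result in top_results(sample, k=2):
    print(result['title'], result['similarity'])
```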


Save the data after adding the tf-idf vectors of each entry.

In [8]:
with open('./outputs/milestone-2/tf-idf-data.json', 'w') as f:
    json.dump(data, f)