### 1. Create a new Jupyter Notebook and load all relevant Python libraries

In [14]:
import json
from transformers import DistilBertTokenizer, DistilBertModel
import faiss
import torch
import numpy as np

### 2. Open the provided JSON file called sentences.json. It contains a list of strings (sentences).



In [2]:
# Read the file as UTF-8 explicitly: the sentences contain non-ASCII characters
with open('data/sentences.json', 'r', encoding='utf-8') as j:
    sentences = json.load(j)

sentences

['A pandemic is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people.',
 'The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century.',
 'Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AIDS.',
 'As of 2018, approximately 37.9 million people are infected with HIV globally.',
 'Cholera is an infection of the small intestine by some strains of the bacterium Vibrio cholerae.',
 'Classic cholera symptom is large amounts of watery diarrhea that lasts a few days. Vomiting and muscle cramps may also occur. Diarrhea can be so severe that it leads within hours to severe dehydration and electrolyte imbalance.',
 'The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome 
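
The loading step above can be made a little more defensive. The following is a minimal sketch, not part of the original notebook: the helper name `load_sentences` and the throwaway demo file are assumptions for illustration. It checks that the JSON really holds a non-empty list of strings before anything is encoded.

```python
import json
import os
import tempfile

def load_sentences(path: str) -> list:
    """Load a JSON file and check that it holds a non-empty list of strings."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list) or not data:
        raise ValueError("expected a non-empty JSON list")
    if not all(isinstance(item, str) for item in data):
        raise ValueError("expected every list item to be a string")
    return data

# Demonstration on a throwaway file (hypothetical data), so the sketch
# runs without data/sentences.json being present.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(["first sentence", "second sentence"], f)
    demo = load_sentences(demo_path)

print(len(demo))
```

In the notebook itself, the call would presumably be `load_sentences('data/sentences.json')`.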

### 3. Use the AutoTokenizer and AutoModel classes from the Transformers library to load a pre-trained model, along with the appropriate tokenizer.

In [4]:
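# DistilBertTokenizer and DistilBertModel are the model-specific classes for this
# checkpoint; AutoTokenizer and AutoModel would resolve to the DistilBERT classes too
# (AutoTokenizer returns the fast tokenizer variant by default).
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')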

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Because the pretrained DistilBERT model serves as the encoder and its hidden states have 768 dimensions, the FAISS index must be created with a vector size of 768.

### 4. Create an empty inverted index with FAISS.

In [6]:
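# IndexFlatIP performs exact (brute-force) inner-product search over 768-dimensional vectors.
# Note that it is a flat index, not an inverted-file index such as faiss.IndexIVFFlat.
# Wrapping it in IndexIDMap lets us attach our own integer ids to the vectors in step 7.
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))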

### 5. Write an encoder function that takes a string as input and returns a dense PyTorch tensor

In [11]:
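def encode(document: str) -> torch.Tensor:
    """Return a 768-dim vector: the mean of DistilBERT's last-layer token embeddings."""
    # truncation=True keeps long inputs within the model's maximum sequence length
    tokens = tokenizer(document, return_tensors='pt', truncation=True)
    with torch.no_grad():  # inference only, so skip gradient tracking
        hidden = model(**tokens)[0].squeeze(0)  # shape: (seq_len, 768)
    return hidden.mean(dim=0)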
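
The encoder above handles one document at a time, so its input never contains padding tokens and a plain mean over the token axis is safe. If several documents were encoded together in one padded batch, the padding positions would need to be excluded from the average. Below is a minimal NumPy sketch of such a masked mean. The helper `masked_mean`, the array names `hidden` and `mask`, and the toy values are all assumptions for illustration and do not appear in the notebook.

```python
import numpy as np

def masked_mean(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring positions where mask == 0.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)     # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # (batch, 1), avoids division by zero
    return summed / counts

# Toy batch: two sequences, three token slots, two dimensions.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
], dtype=np.float32)
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(hidden, mask)
print(pooled)
```

The padded slot in the first sequence does not affect its average, whereas an unmasked mean would be dominated by it.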

### 6. Build a list of vector representations, one per document, with the reusable encoder function you created in step 5.



In [12]:
vectors = [encode(doc) for doc in sentences]
vectors[0].size()

torch.Size([768])

### 7. Populate the empty FAISS index with the output vectors

In [15]:
# FAISS expects a float32 matrix of shape (n, 768) and int64 ids
index.add_with_ids(
    np.stack([t.numpy() for t in vectors]).astype(np.float32),
    np.arange(len(sentences), dtype=np.int64)
)

### 8. Build a search function that accepts a string query, encodes it, searches for similar documents in the index, and returns the top k results (top 5 in the example below) with their scores

In [16]:
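def search(query: str, k: int = 5):
    """Return the k most similar sentences to the query as (sentence, score) pairs."""
    encoded_query = encode(query).unsqueeze(dim=0).numpy()
    # index.search returns (scores, ids), each of shape (1, k)
    scores, ids = index.search(encoded_query, k)
    # FAISS pads the result with id -1 when fewer than k vectors are indexed
    return [(sentences[i], s) for i, s in zip(ids[0], scores[0]) if i != -1]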

In [21]:
search('What are the geographic areas with the highest transmission of malaria?', 5)

[('A pandemic is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people.',
  58.686043),
 ('As of 2018, approximately 37.9 million people are infected with HIV globally.',
  58.561726),
 ('The death toll of Spanish Flu is estimated to have been somewhere between 17 million and 50 million, and possibly as high as 100 million, making it one of the deadliest pandemics in human history.',
  54.462517),
 ('The Spanish flu, also known as the 1918 flu pandemic, was an unusually deadly influenza pandemic caused by the H1N1 influenza A virus.',
  53.605446),
 ('Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AIDS.', 53.51365)]
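
The scores above are raw inner products, so they depend on the length of each vector as well as its direction. One common variant, not used in this notebook, is to L2-normalize both the document vectors and the query so that the inner product equals cosine similarity. FAISS offers `faiss.normalize_L2` for this. The sketch below illustrates the effect with plain NumPy and an exhaustive search that stands in for `IndexFlatIP`. The helper names and toy vectors are assumptions for illustration.

```python
import numpy as np

def l2_normalize(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)

def top_k_inner_product(query: np.ndarray, docs: np.ndarray, k: int):
    """Exhaustive inner-product search, returning (scores, ids) sorted best-first."""
    scores = docs @ query
    ids = np.argsort(-scores)[:k]
    return scores[ids], ids

# Toy vectors: doc 0 is long but only partly aligned with the query,
# doc 1 is short but points in exactly the same direction.
docs = np.array([[10.0, 10.0], [0.0, 1.0], [1.0, 0.0]], dtype=np.float32)
query = np.array([0.0, 2.0], dtype=np.float32)

# Raw inner product favors the long vector.
raw_scores, raw_ids = top_k_inner_product(query, docs, k=2)

# After normalization the ranking reflects direction only (cosine similarity).
cos_scores, cos_ids = top_k_inner_product(
    l2_normalize(query[None, :])[0], l2_normalize(docs), k=2
)
print(raw_ids, cos_ids)
```

With raw inner products the long vector ranks first. After normalization the vector that shares the query's direction ranks first, with a score of 1.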