### 1. Import libraries

In [1]:
import json
from transformers import DistilBertTokenizer, DistilBertModel
from sentence_transformers import SentenceTransformer
import faiss
import torch
import numpy as np


### 2. Load data

In [2]:
with open('data/data.json', 'r') as j:
    sentences = json.load(j)

sentences[0]

{'title': 'Pandemic',
 'text': 'A pandemic (from Greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. A widespread endemic disease with a stable number of infected people is not a pandemic. Widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.\nThroughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century. The term was not used yet but was for later pandemics including the 1918 influenza pandemic (Spanish flu). Current pandemics include COVID-19 (S

### 3. Creating Embeddings

In [23]:
model = SentenceTransformer('all-MiniLM-L6-v2')

for sentence in sentences:
    sentence["embeddings"] = model.encode(sentence["text"], device='cuda', convert_to_numpy=False)

In [24]:
# Check the shape of a single embedding
sentences[2]["embeddings"].shape

torch.Size([384])

Note that this model (`all-MiniLM-L6-v2`) returns a vector of only 384 dimensions, unlike the original Transformer, which uses $d_{model}=512$.

### 4. Create and populate a FAISS index with those embeddings.

In [27]:
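# Exact (brute-force) inner-product index over 384-d vectors, wrapped so that custom ids can be attached
index = faiss.IndexIDMap(faiss.IndexFlatIP(384))

index.add_with_ids(
    # FAISS expects a float32 array of shape (n, 384)
    np.stack([sentence["embeddings"].cpu().numpy() for sentence in sentences]).astype(np.float32),
    # ids must be int64; each id is the item's position in `sentences`
    np.arange(len(sentences), dtype=np.int64)
)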
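
To make explicit what a flat inner-product index computes, here is a minimal NumPy sketch of the same brute-force search. The function name `flat_ip_search` and the toy 2-d vectors are hypothetical and only for illustration. The sketch also L2-normalizes the vectors, in which case the inner product equals cosine similarity; whether the real embeddings are already normalized is an assumption worth checking for the model in use.

```python
import numpy as np

def flat_ip_search(vectors, ids, query, k):
    """Brute-force inner-product search: return the k best (scores, ids), best first."""
    scores = vectors @ query           # one inner product per stored vector
    order = np.argsort(-scores)[:k]    # positions of the k largest scores
    return scores[order], ids[order]

# Toy 2-d "embeddings" (hypothetical data), L2-normalized row by row.
vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]], dtype=np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
ids = np.array([10, 11, 12], dtype=np.int64)

# Normalized query, so every score below is a cosine similarity.
query = np.array([3.0, 1.0], dtype=np.float32)
query /= np.linalg.norm(query)

scores, top_ids = flat_ip_search(vectors, ids, query, k=2)
print(top_ids, scores)
```

An exact index like this scans every stored vector for each query, which is fine for a small corpus but grows linearly with its size.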

### 5. Write a search function

In [34]:
def search(query: str, k=1):
    # Encode the query and add a batch dimension: FAISS expects shape (1, 384)
    encoded_query = model.encode(query, device="cuda", convert_to_numpy=False).unsqueeze(dim=0).cpu().numpy()
    print(encoded_query.shape)
    # index.search returns (scores, ids), each of shape (1, k)
    scores, ids = index.search(encoded_query, k)
    results = [sentences[_id] for _id in ids[0]]
    return list(zip(results, scores[0]))
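
One edge case is worth a sketch. As far as I know, when `k` exceeds the number of indexed items, FAISS pads the returned ids with `-1`, and indexing a Python list with `-1` silently returns its last element. The helper below, `pair_results`, is a hypothetical name, and the arrays are made-up stand-ins for the output of `index.search`. It shows one way to drop the padded slots before pairing documents with scores.

```python
import numpy as np

def pair_results(documents, scores, ids):
    """Pair each returned id with its document and score, skipping -1 padding."""
    return [
        (documents[int(i)], float(s))
        for s, i in zip(scores[0], ids[0])
        if i != -1
    ]

# Hypothetical search output for k=3 over a 2-document corpus.
documents = [{"title": "A"}, {"title": "B"}]
scores = np.array([[0.9, 0.4, 0.0]], dtype=np.float32)  # last score is a placeholder
ids = np.array([[1, 0, -1]], dtype=np.int64)            # -1 marks an empty slot

pairs = pair_results(documents, scores, ids)
print(pairs)
```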

### 6. Testing

In [39]:
results = search("Which diseases can be transmitted by animals?", k=10)
for result, score in results:
    print(result["text"], score)
    print("---")

(1, 384)
Swine influenza is an infection caused by any one of several types of swine influenza viruses. Swine influenza virus (SIV) or swine-origin influenza virus (S-OIV) is any strain of the influenza family of viruses that is endemic in pigs. As of 2009, the known SIV strains include influenza C and the subtypes of  influenza A known as H1N1, H1N2, H2N1, H3N1, H3N2, and H2N3.
Swine influenza virus is common throughout pig populations worldwide. Transmission of the virus from pigs to humans is not common and does not always lead to human flu, often resulting only in the production of antibodies in the blood. If transmission does cause human flu, it is called zoonotic swine flu. People with regular exposure to pigs are at increased risk of swine flu infection.
Around the mid-20th century, identification of influenza subtypes became possible, allowing accurate diagnosis of transmission to humans. Since then, only 50 such transmissions have been confirmed. These strains of swine flu rar