We have a questions file and a sentences file. We take one question at a time from the questions file and convert it into its semantic representation (an embedding); the candidate answers from the sentences file are converted in the same way. Then, for a given question, we find the k closest answers among those sentences. k can be any integer from 1 up to the number of sentences; 5 is a common choice.

In [1]:
from transformers import AutoModel, AutoTokenizer

In [2]:
# "distilbert-base-uncased" is the pretrained base variant of the DistilBERT model and tokenizer; "uncased" means the text is lowercased before tokenization.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
import json

In [4]:
file_handle = open("sentences.json")

In [5]:
sentences_data = json.load(file_handle)

In [6]:
sentences_data

['A pandemic is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people.',
 'The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century.',
 'Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AIDS.',
 'As of 2018, approximately 37.9 million people are infected with HIV globally.',
 'Cholera is an infection of the small intestine by some strains of the bacterium Vibrio cholerae.',
 'Classic cholera symptom is large amounts of watery diarrhea that lasts a few days. Vomiting and muscle cramps may also occur. Diarrhea can be so severe that it leads within hours to severe dehydration and electrolyte imbalance.',
 'The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome 

In [7]:
file_handle.close()
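The open / load / close steps above can also be written with a context manager, which closes the file even if `json.load` raises. This is a minimal sketch, not the notebook's own code: `load_json_list` and the temporary demo file are hypothetical names used only for illustration.

```python
import json
import os
import tempfile


def load_json_list(path):
    """Read a JSON file and return its contents (hypothetical helper)."""
    # The with-block closes the file automatically, even on errors.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demo on a throwaway file so the sketch runs without sentences.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "sentences_demo.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump(["first sentence", "second sentence"], handle)
    loaded = load_json_list(demo_path)
```

In the notebook, the same helper could be called as `load_json_list("sentences.json")`.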

In [8]:
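# A document can be an answer or a question.
# This function tokenizes one document, runs it through the model, and returns the last hidden state
# as a (tokens, 768) matrix, detached from the autograd graph and with the batch dimension squeezed out.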
def encoder(document):
    
    return model(**tokenizer(document,return_tensors='pt'))[0].detach().squeeze()

In [9]:
# This list will hold one (tokens, 768) embedding matrix per document.
documents_embeddings = []

for document in sentences_data:
    
    documents_embeddings.append(encoder(document))

In [10]:
for embedding in documents_embeddings:
    print(embedding.size())

torch.Size([35, 768])
torch.Size([37, 768])
torch.Size([25, 768])
torch.Size([18, 768])
torch.Size([24, 768])
torch.Size([55, 768])
torch.Size([57, 768])
torch.Size([24, 768])
torch.Size([27, 768])
torch.Size([35, 768])
torch.Size([43, 768])


In [11]:
import torch

In [12]:
# We need one fixed-size vector per document, not a matrix whose height depends on the token count,
# so we average each matrix over its tokens to get a single 768-dimensional vector.
# DistilBERT is used here through PyTorch, so the encoded representations it returns are PyTorch tensors.
docs2vecs = []

for embedding in documents_embeddings:
    docs2vecs.append(torch.mean(embedding,dim=0))
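To make the averaging step concrete, here is a minimal NumPy sketch of mean pooling on a toy matrix. The `mean_pool` helper and its optional `attention_mask` argument are assumptions added for illustration; the notebook encodes one document at a time, so it has no padding and needs no mask. A mask only matters if documents are batched and padded to a common length.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask=None):
    """Average a (tokens, dim) matrix into one (dim,) vector.

    attention_mask is a hypothetical optional 0/1 array of shape (tokens,);
    positions marked 0 (padding) are left out of the average.
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float32)
    if attention_mask is None:
        # Same operation as torch.mean(embedding, dim=0) in the notebook.
        return token_embeddings.mean(axis=0)
    mask = np.asarray(attention_mask, dtype=np.float32)[:, None]
    # Sum only the real tokens and divide by how many there are.
    return (token_embeddings * mask).sum(axis=0) / max(mask.sum(), 1.0)


# Toy example: 3 "tokens" with 2 dimensions; the last row plays the role of padding.
toy = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
plain = mean_pool(toy)                             # averages all three rows
masked = mean_pool(toy, attention_mask=[1, 1, 0])  # averages only the first two rows
```

With the mask, the padding row no longer pulls the average away from the real tokens.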

In [13]:
import faiss
import numpy as np

In [14]:
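# Create an empty index: FAISS (Facebook AI Similarity Search) requires the vectors to be stored in an index structure.
# IndexFlatIP compares vectors by inner product over 768 dimensions; IndexIDMap lets us attach our own ids to them.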
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))

In [15]:
ids = np.arange(len(sentences_data), dtype='int64') # One int64 id per sentence: 0, 1, 2, ...

In [16]:
# Each document goes into its own row of the index.
# We convert every document vector (a PyTorch tensor) to a NumPy array, collect them in a list,
# turn the list into one NumPy matrix, and add that matrix to the index together with the ids.
index.add_with_ids((np.array([doc_vec.numpy() for doc_vec in docs2vecs])),ids)

We don't need to write the k-nearest-neighbour algorithm ourselves: the index already provides a search that returns results in that form. That is why we stored the answers in an index structure, so that running a k-nearest-neighbour search over them is straightforward.
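For intuition, a flat inner-product index performs an exact, exhaustive comparison of the query against every stored vector. The sketch below reproduces that idea in plain NumPy on toy 2-dimensional vectors; `brute_force_search` is a hypothetical helper, not part of FAISS, and it ignores the performance optimisations a real index brings.

```python
import numpy as np


def brute_force_search(doc_vectors, query_vector, k):
    """Return (scores, ids) of the k documents with the largest inner product."""
    doc_vectors = np.asarray(doc_vectors, dtype=np.float32)
    query_vector = np.asarray(query_vector, dtype=np.float32)
    scores = doc_vectors @ query_vector   # one inner product per document
    k = min(k, len(scores))               # k cannot exceed the number of documents
    top_ids = np.argsort(-scores)[:k]     # ids sorted by descending score
    return scores[top_ids], top_ids


# Toy data: three 2-dimensional "documents" and one query.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
top_scores, top_ids = brute_force_search(docs, [1.0, 0.5], k=2)
```

Here the third document scores highest (inner product 3.0), followed by the first (1.0), so `top_ids` lists them in that order.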

In [17]:
# Accept a query and return the k documents closest to it, each paired with its score.
def search(query,k):
    
    encoded_query = encoder(query) # Matrix of shape (tokens, 768).
    
    encoded_query = torch.mean(encoded_query,dim=0) # Average over the tokens to get a single 768-dimensional vector.
    
    encoded_query = encoded_query.unsqueeze(dim=0).numpy() # Reshape to (1, 768) and convert to a NumPy array, the format the index expects.
    
    k_nn_docs = index.search(encoded_query,k) # Returns (scores, ids) for the k nearest documents, each of shape (1, k).
    
    semantic_similarity_scores = k_nn_docs[0][0] # Inner-product scores between the query and each returned document.
    # The higher the score, the stronger the similarity; the lower the score, the weaker the similarity.
    
    results = [sentences_data[idx] for idx in k_nn_docs[1][0]] # Look up the sentence text for each returned id.
    
    return list(zip(results,semantic_similarity_scores)) # Pair each sentence with its similarity score.

In [18]:
query_file_handle = open("questions.json")

In [19]:
query_data = json.load(query_file_handle)

In [21]:
query_data

['How many people have died during Black Death?',
 'Which diseases can be transmitted by animals?',
 'Connection between climate change and a likelihood of a pandemic',
 'What is an example of a latent virus',
 'Viruses in nanotechnology',
 'Giant viruses classification',
 'What are the notable pandemic prevention organizations?',
 'How many leprosy outbreaks are known to happen?',
 'What are the geographic areas with the highest transmission of malaria?',
 'How to prevent the spread of viral infections?']

Testing

In [32]:
query_data[2]

'Connection between climate change and a likelihood of a pandemic'

In [33]:
search(query_data[2],1)

[('A pandemic is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people.',
  60.540653)]
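The score of about 60.54 is a raw inner product, so it depends on the lengths of the vectors as well as on the angle between them. If scores in the range [-1, 1] are preferred, one common option (not used in this notebook) is to L2-normalize the document and query vectors before indexing and searching, which makes the inner product equal to cosine similarity. Below is a minimal NumPy sketch on toy vectors; `l2_normalize` is a hypothetical helper written for illustration.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each vector (last axis) to unit length; eps guards against division by zero."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Toy data: the first document points in the same direction as the query.
docs = l2_normalize([[3.0, 4.0], [10.0, 0.0]])
query = l2_normalize([6.0, 8.0])
cosine_scores = docs @ query  # inner product of unit vectors = cosine similarity
```

After normalization the first document scores 1.0 regardless of its original length, while the second scores 0.6.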