[Reference link for detailed DPR documentation](https://huggingface.co/docs/transformers/en/model_doc/dpr)

In [1]:
!pip install transformers faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0


In [7]:
from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
import faiss
from tqdm import tqdm
import numpy as np


### **Initialize the question and context encoders**


In [3]:
ques_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
con_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### **Initialize the tokenizers**

In [8]:
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.


### Again, I was dumb enough to pass the raw question and context strings straight to the encoders without tokenizing them first... So, TOKENIZE before ENCODING.

In [10]:
contexts = [
    "Albert Einstein was a theoretical physicist known for his theory of relativity.",
    "Marie Curie was a physicist and chemist famous for her work on radioactivity.",
    "Isaac Newton formulated the laws of motion and universal gravitation."
]

questions = [
    "Who was a physicist known for his theory of relativity?",
    "What was Marie Curie's field of study?",
    "Who formulated the laws of motion?"
]

# Pad to the longest text in each batch and truncate anything beyond 50 tokens
con_inputs = context_tokenizer(contexts, return_tensors="pt", padding=True, truncation=True, max_length=50)
ques_inputs = question_tokenizer(questions, return_tensors="pt", padding=True, truncation=True, max_length=50)


In [13]:
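# pooler_output holds one 768-dim vector per text; detach() drops the autograd graph before the NumPy conversion
con_embeddings = con_encoder(**con_inputs).pooler_output.detach().numpy()
ques_embeddings = ques_encoder(**ques_inputs).pooler_output.detach().numpy()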
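
Faiss's Python API expects 2-D, C-contiguous `float32` arrays. The DPR outputs above are already `float32`, but embeddings that come from elsewhere (for example a `float64` array loaded from disk) may not be. A minimal sketch of a guard, where `as_faiss_matrix` is a hypothetical helper name and the toy values are made up:

```python
import numpy as np

def as_faiss_matrix(embeddings):
    # Hypothetical helper: coerce to C-contiguous float32
    arr = np.ascontiguousarray(embeddings, dtype=np.float32)
    # Promote a single vector of shape (dim,) to a batch of shape (1, dim)
    if arr.ndim == 1:
        arr = arr[None, :]
    return arr

# Toy float64 inputs, not real DPR embeddings
toy_batch = as_faiss_matrix(np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float64))
toy_single = as_faiss_matrix([0.5, 0.6])
```

Passing every array through a helper like this before `index.add` or `index.search` avoids dtype and shape errors when embeddings come from mixed sources.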

### **Use Faiss for similarity search - metric used: L2 (Euclidean) distance**

In [25]:
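# Exact (brute-force) L2 index; the argument is the embedding dimension
index = faiss.IndexFlatL2(con_embeddings.shape[1])
# Store the context embeddings; row i corresponds to contexts[i]
index.add(con_embeddings)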
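
`IndexFlatL2` ranks contexts by squared Euclidean distance, which is not the same as cosine similarity or inner product. As a rough illustration of how the three can disagree, the NumPy sketch below ranks a few made-up 2-D vectors against a query (toy values, not real DPR embeddings; the function names are hypothetical). In Faiss itself, cosine search is typically done by normalising the vectors with `faiss.normalize_L2` and adding them to an `IndexFlatIP`.

```python
import numpy as np

def l2_ranking(query, vectors):
    # Squared Euclidean distance: smaller means closer
    dists = ((vectors - query) ** 2).sum(axis=1)
    return np.argsort(dists)

def ip_ranking(query, vectors):
    # Inner product: larger means more similar
    return np.argsort(-(vectors @ query))

def cosine_ranking(query, vectors):
    # Cosine similarity is the inner product of unit-length vectors
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.argsort(-(v @ q))

# Toy data chosen so that each metric picks a different top hit
toy_vectors = np.array([[2.0, 1.0], [3.0, 3.0], [0.9, 0.6]], dtype=np.float32)
toy_query = np.array([1.0, 0.5], dtype=np.float32)

print("L2 order:    ", l2_ranking(toy_query, toy_vectors))
print("IP order:    ", ip_ranking(toy_query, toy_vectors))
print("Cosine order:", cosine_ranking(toy_query, toy_vectors))
```

With these toy values, L2 prefers the vector closest in space, inner product prefers the longest vector, and cosine prefers the vector pointing in the same direction as the query.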

In [56]:
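def retrieve(question_embedding, index):
  # Faiss expects a 2-D batch of queries, so wrap the single vector; k=1 returns only the nearest context
  distances, indices = index.search(np.array([question_embedding]), 1)
  # indices has shape (1, 1); map the hit back to its text in the global contexts list
  return contexts[indices[0][0]]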
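
`retrieve` returns only the single nearest context and discards the distances. A possible top-k variant is sketched below. To keep the snippet runnable without Faiss, it uses a small hypothetical `BruteForceL2Index` class that mimics the `search(x, k)` return shape of a flat index: `(distances, indices)`, each of shape `(n_queries, k)`. The name `retrieve_top_k` and the toy data are also illustrative, not part of the notebook.

```python
import numpy as np

class BruteForceL2Index:
    """Hypothetical stand-in for a flat L2 index, for illustration only."""

    def __init__(self):
        self.vectors = None

    def add(self, vectors):
        vectors = np.asarray(vectors, dtype=np.float32)
        self.vectors = vectors if self.vectors is None else np.vstack([self.vectors, vectors])

    def search(self, queries, k):
        queries = np.asarray(queries, dtype=np.float32)
        # Pairwise squared L2 distances, shape (n_queries, n_vectors)
        dists = ((queries[:, None, :] - self.vectors[None, :, :]) ** 2).sum(axis=2)
        order = np.argsort(dists, axis=1)[:, :k]
        return np.take_along_axis(dists, order, axis=1), order

def retrieve_top_k(question_embedding, index, texts, k=2):
    # Returns (text, distance) pairs, nearest first
    distances, indices = index.search(np.asarray([question_embedding]), k)
    return [(texts[i], float(d)) for i, d in zip(indices[0], distances[0])]

# Toy 2-D vectors standing in for context embeddings
toy_texts = ["ctx A", "ctx B", "ctx C"]
toy_index = BruteForceL2Index()
toy_index.add([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

top = retrieve_top_k([0.9, 1.2], toy_index, toy_texts, k=2)
for text, dist in top:
    print(f"{text}: squared L2 distance {dist:.2f}")
```

Passing `texts` explicitly instead of reading a global makes the function reusable across corpora, and returning the distances lets the caller filter out weak matches.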

In [57]:
def generate(ques, context):
  return f"Question: '{ques}'\nAnswer: {context}"

In [58]:
res = []

# Each question embedding is paired with its original question text
for q, ques in zip(ques_embeddings, questions):
  best_context = retrieve(q, index)
  answer = generate(ques, best_context)
  res.append(answer)
  # answer already contains the question, so print it as-is
  print(answer)

Question: 'Who was a physicist known for his theory of relativity?'
Answer: Albert Einstein was a theoretical physicist known for his theory of relativity.
Question: 'What was Marie Curie's field of study?'
Answer: Marie Curie was a physicist and chemist famous for her work on radioactivity.
Question: 'Who formulated the laws of motion?'
Answer: Isaac Newton formulated the laws of motion and universal gravitation.
