***IMPORTING REQUIRED LIBRARIES***

In [40]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from sentence_transformers import SentenceTransformer, util
import torch

***LOADING TOKENIZER AND MODEL FOR QUESTION ANSWERING***

In [41]:
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


***LOADING SENTENCE BERT MODEL FOR DENSE RETRIEVAL***

In [42]:
retriever = SentenceTransformer('all-MiniLM-L6-v2')

***EXAMPLE DOCUMENT CORPUS***

In [56]:
# Note: the first two passages are identical apart from the trailing period.
corpus = [
    "Dense retrieval refers to a method in information retrieval where documents and queries are represented as dense vectors, often generated using pre-trained neural networks like BERT or Sentence-BERT.",
    "Dense retrieval refers to a method in information retrieval where documents and queries are represented as dense vectors, often generated using pre-trained neural networks like BERT or Sentence-BERT",
    "FAISS (Facebook AI Similarity Search) is a library developed by Facebook for efficient similarity search and clustering of dense vectors. It is highly optimized for performance and can handle large-scale datasets.",
]
# Encode the corpus (not the undefined name `documents`) as a tensor,
# so it has the same type and device as the question embeddings
corpus_embeddings = retriever.encode(corpus, convert_to_tensor=True)
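
The retrieval step later in this notebook ranks passages by cosine similarity and rejects matches that score below 0.3. As a rough illustration of that logic only, here is a minimal pure-Python sketch. The helper names `cosine_similarity` and `retrieve` and the 3-dimensional toy vectors are made up for this example; the real embeddings from `all-MiniLM-L6-v2` have 384 dimensions and are compared with `util.pytorch_cos_sim`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def retrieve(query_vec, passage_vecs, threshold=0.3):
    """Return (index, score) of the best passage, or (None, score) if below threshold."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_idx]
    if best_score < threshold:
        return None, best_score
    return best_idx, best_score

# Toy "embeddings", invented for illustration
toy_passages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hit = retrieve([0.9, 0.1, 0.0], toy_passages)   # close to the first passage
miss = retrieve([0.0, 0.0, 1.0], toy_passages)  # orthogonal to both passages
print(hit, miss)
```

In this sketch the first query is accepted with passage index 0, while the second is rejected because its best score falls below the threshold, which mirrors how an off-topic question such as "What is Cricket?" is meant to be filtered out.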

***EXAMPLE QUESTIONS AND ENCODING THEM***

In [57]:
questions = ["What is Dense Retrieval?", "Explain BERT's functionality", "What is Cricket?"]
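
The loop in the next cell picks the answer span by taking the argmax of the start logits and the argmax of the end logits independently, which can produce an end position that comes before the start. One common alternative, sketched here under stated assumptions and not what this notebook does, is to score every span with start <= end and keep the best one. The function name `best_span`, the `max_answer_len` cap, and the logit values are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end_exclusive, score) for the best span with start <= end."""
    best = None
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e + 1, score)  # e + 1 makes the end usable as a slice bound
    return best

# Made-up logits for a 6-token input
start_logits = [0.1, 0.2, 0.3, 5.0, 0.4, 0.1]
end_logits = [0.2, 6.0, 0.1, 0.3, 4.0, 0.2]

# Independent argmax: the end position lands before the start position
naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
naive_end = max(range(len(end_logits)), key=end_logits.__getitem__)

# Joint search: the best valid span covers tokens 3 and 4
span = best_span(start_logits, end_logits)
print(naive_start, naive_end, span)
```

The joint search costs more than two argmax calls, but it always returns a well-formed span, so the "no valid answer" case would have to be detected some other way, for example by thresholding the span score.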

In [58]:
# Iterate over each question
for question in questions:
    # Generate the embedding for the question
    question_embedding = retriever.encode(question, convert_to_tensor=True)

    # Compute cosine similarities between the question and the corpus
    similarities = util.pytorch_cos_sim(question_embedding, corpus_embeddings)[0]

    # Find the index of the most relevant passage (as a plain int for list indexing)
    most_relevant_idx = int(torch.argmax(similarities))
    passage = corpus[most_relevant_idx]

    # Print the similarity value for debugging
    print(f"Cosine Similarity for Question: '{question}' with Passage: '{passage}' is {similarities[most_relevant_idx]}")

    # If the best cosine similarity is below the 0.3 threshold, treat the question as unanswerable from this corpus
    if similarities[most_relevant_idx] < 0.3:
        print(f"Question: {question}")
        print("Sorry, I couldn't find a relevant passage.\n")
        continue

    # Tokenize the question and the retrieved passage as one pair (special tokens are added by default)
    inputs = tokenizer(question, passage, return_tensors="pt")

    # Get the input_ids as a plain Python list
    input_ids = inputs["input_ids"].tolist()[0]

    # Get the start and end logits from the model (no gradients are needed for inference)
    with torch.no_grad():
        outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Get the positions with the highest scores for start and end
    start_index = int(torch.argmax(start_logits))
    end_index = int(torch.argmax(end_logits)) + 1  # argmax is the last answer token; add 1 to get an exclusive slice bound

    # An end bound at or before the start position means no valid answer span was found
    # (the similarity threshold was already checked above, so it is not repeated here)
    if end_index <= start_index:
        print(f"Question: {question}")
        print("Sorry, I couldn't find the answer.\n")
    else:
        # Convert the token ids to string to extract the answer
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[start_index:end_index]))

        # Output the question, retrieved passage, and predicted answer
        print(f"Question: {question}")
        print(f"Retrieved Passage: {passage}")
        print(f"Answer: {answer}\n")

Cosine Similarity for Question: 'What is Dense Retrieval?' with Passage: 'Dense retrieval refers to a method in information retrieval where documents and queries are represented as dense vectors, often generated using pre-trained neural networks like BERT or Sentence-BERT' is 0.7185027003288269
Question: What is Dense Retrieval?
Retrieved Passage: Dense retrieval refers to a method in information retrieval where documents and queries are represented as dense vectors, often generated using pre-trained neural networks like BERT or Sentence-BERT
Answer: a method in information retrieval where documents and queries are represented as dense vectors

Cosine Similarity for Question: 'Explain BERT's functionality' with Passage: 'Dense retrieval refers to a method in information retrieval where documents and queries are represented as dense vectors, often generated using pre-trained neural networks like BERT or Sentence-BERT.' is 0.4228471517562866
Question: Explain BERT's functionality
Retriev