Implementing DPR using Transformer's(Hugging face) library. 
1. We will load Pre-Trained Models
2. We will encode Passage and query
3. Retrieve the most relevant passage based on semantic similarity

In [9]:
from PyPDF2 import PdfReader

# Function to read a PDF file and extract text
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

# Split the text into passages of given length
def split_text_into_passages(text, passage_length=100):
    words = text.split()
    passages = []
    
    for i in range(0, len(words), passage_length):
        passages.append(" ".join(words[i:i+passage_length]))
    return passages

# List to store all passages
p = []

# Paths to the PDF files
pdf_path_1 = r"microsoft-annual-report.pdf"
pdf_path_2 = r"Agthia-Annual-Report-2023-EN.pdf"
pdf_path_3 = r"ADNOC Distribution Q3 23 Financial Statements_Eng.pdf"

# Extract text from all three PDFs
pdf_text_1 = extract_text_from_pdf(pdf_path_1)
pdf_text_2 = extract_text_from_pdf(pdf_path_2)
pdf_text_3 = extract_text_from_pdf(pdf_path_3)

# Combine all extracted text
combined_text = pdf_text_1 + " " + pdf_text_2 + " " + pdf_text_3

# Split the combined text into passages
passages = split_text_into_passages(combined_text, passage_length=100)

# Store the first 100 passages in list p
for i, passage in enumerate(passages[:100]):
    p.append(passage)

# p now contains the first 100 passages from the combined PDF texts


In [10]:
from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer

#Load Query Encoder and Tokenizer
query_encoder=DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
query_tokenizer=DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

passage_encoder =DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
passage_tokenizer=DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRCon

: 

In [11]:

passage_inputs = passage_tokenizer(p,return_tensors='pt',padding=True,truncation=True)
passage_vectors= passage_encoder(**passage_inputs).pooler_output

print(passage_vectors.shape)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [None]:
query= input("What's your query?")
query_input=query_tokenizer(query,return_tensors='pt')
query_vector=query_encoder(**query_input).pooler_output

#Similarty search with Query and passage
import torch
similarity_scores=torch.matmul(query_vector,passage_vectors.T)
top_k=torch.topk(similarity_scores,k=2)

#Display the top passages
for idx in top_k.indices[0]:
    print(passages[idx])

repurchase program commenced in November 2021, following completion of the program approved on September 18, 2019, has no expiration date, and may be terminated at any time. As of June 30, 2023, $22.3 billion remained of this $60.0 billion share repurchase program. We repurchased the following shares of common stock under the share repurchase programs: All repurchases were made using cash resources. Shares repurchased during fiscal year 2023 and the fourth and third quarters of fiscal year 2022 were under the share repurchase program approved on September 14, 2021. Shares repurchased during the second quarter of fiscal year 2022 were
a range of products and services. Growth depends on our ability to reach new users, add value to our core product set, and continue to expand our product and service offerings into new markets. Office Consumer revenue is mainly affected by the percentage of customers that buy Office with their new devices and the continued shift from Office licensed on-pre