# Building a Retrieval-Augmented Generation (RAG) Pipeline with Dense Passage Retrieval (DPR) in Python






## Description
The goal of this experiment is to implement a simple Retrieval-Augmented Generation (RAG) pipeline that answers user queries in two stages: Dense Passage Retrieval (DPR) fetches the most relevant documents, and a generative model produces an answer from them.



## Setup

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd "YOUR-PATH-HERE"

In [4]:
%%capture
# Install necessary libraries (a cell magic such as %%capture must be the first line of the cell)
!pip install transformers faiss-cpu datasets sentence-transformers


In [48]:
# Import Libraries
import pandas as pd
import numpy as np
import torch
import faiss
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import pipeline
import re

In [6]:
# Load documents. Example dataset
documents = [
    "Stars appear to twinkle because their light passes through Earth’s atmosphere before reaching our eyes.",
    "The air in the atmosphere is always moving, and this bends the light in different directions.",
    "This bending makes the star’s light seem to change in brightness and position, creating the twinkling effect.",
    "If you were to see a star from space, where there is no atmosphere, it wouldn’t twinkle at all!",
    "Planets, however, usually do not twinkle as much because they appear larger in the sky and their light is more stable."
]

doc_df = pd.DataFrame(documents, columns=["text"])

In [8]:
%%capture
# Create document embeddings
# (a cell magic such as %%capture must be the first line of the cell)

# Load DPR context encoder and tokenizer
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")

# Generate embeddings for documents
def generate_embeddings(documents):
    inputs = ctx_tokenizer(documents, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embeddings = ctx_encoder(**inputs).pooler_output
    return embeddings

document_embeddings = generate_embeddings(doc_df["text"].tolist())

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenize
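
The `generate_embeddings` function above encodes every document in a single forward pass, which works for five short sentences but may exhaust memory on a larger corpus. A minimal sketch of one way to split the inputs into batches follows; `iter_batches` and the batch size of 2 are illustrative assumptions, not part of the original notebook.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical usage with placeholder strings standing in for real passages.
sample_docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
batches = list(iter_batches(sample_docs, batch_size=2))
print([len(batch) for batch in batches])
```

In the notebook, each batch could be passed to `generate_embeddings` and the resulting tensors joined with `torch.cat` before indexing.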

In [9]:
# Index the Embeddings Using FAISS
# Convert embeddings to numpy
document_embeddings_np = document_embeddings.numpy()

# Create FAISS index
dimension = document_embeddings_np.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product for similarity
index.add(document_embeddings_np)

print(f"Number of documents indexed: {index.ntotal}")

Number of documents indexed: 5
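
To make the indexing step concrete, the sketch below reproduces in plain Python what an exact inner-product search computes: score every stored vector against the query with a dot product and keep the `k` highest. The function name `top_k_inner_product` and the three-dimensional toy vectors are assumptions for illustration; FAISS carries out the same kind of exhaustive comparison in optimized native code over the DPR embeddings.

```python
from typing import List, Sequence, Tuple


def top_k_inner_product(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors with the largest dot product."""
    scores = [
        (i, sum(q * v for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    # Sort by score, highest first, then keep the first k entries.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy vectors, not real embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
]
toy_query = [1.0, 2.0, 0.0]
print(top_k_inner_product(toy_query, toy_vectors, k=2))
```

A flat index trades speed for exactness: every query is compared against every stored vector, which is acceptable for a handful of documents.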


In [57]:
%%capture
# Retrieve the top-k most relevant documents for a user query
# (a cell magic such as %%capture must be the first line of the cell)

# Load DPR question encoder and tokenizer
query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")
query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")

# Function to retrieve documents
def retrieve_documents(query, top_k=2):
    """Return the top_k (text, score) pairs for the query, best match first."""
    # Encode the query
    # Note: truncation=True only takes effect when a maximum length is known;
    # pass max_length (e.g. 512 for BERT-based encoders) to enforce it
    inputs = query_tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        query_embedding = query_encoder(**inputs).pooler_output.numpy()

    # Search FAISS index; with IndexFlatIP the returned "distances" are inner-product scores
    distances, indices = index.search(query_embedding, top_k)
    results = [(doc_df["text"].iloc[i], distances[0][j]) for j, i in enumerate(indices[0])]
    return results

Some weights of the model checkpoint at facebook/dpr-question_encoder-multiset-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [58]:
# Example query
query = "Why do stars twinkle (also called stellar scintillation) in the night sky?"
retrieved_docs = retrieve_documents(query)
print("Retrieved Documents:", retrieved_docs)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Retrieved Documents: [('Planets, however, usually do not twinkle as much because they appear larger in the sky and their light is more stable.', 74.23694), ('Stars appear to twinkle because their light passes through Earth’s atmosphere before reaching our eyes.', 72.949875)]
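
The scores above are raw inner products, so they depend on vector length as well as direction, which is why they are not confined to the range from -1 to 1. The sketch below uses toy two-dimensional vectors (not DPR embeddings) to show how inner-product and cosine rankings can differ. All names in it are illustrative assumptions, and whether normalization would change the ranking for this query is not something the notebook tests.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product after scaling both vectors to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


toy_query = [1.0, 0.0]
long_off_axis = [3.0, 4.0]   # large norm, points partly away from the query
short_on_axis = [2.0, 0.0]   # smaller norm, same direction as the query

# Inner product favors the longer vector; cosine favors the aligned one.
print(inner_product(toy_query, long_off_axis), inner_product(toy_query, short_on_axis))
print(cosine_similarity(toy_query, long_off_axis), cosine_similarity(toy_query, short_on_axis))
```

With FAISS, the equivalent change would be to call `faiss.normalize_L2` on both the document and query arrays before `index.add` and `index.search`. DPR encoders are trained with an inner-product objective, so this is an option to evaluate rather than a guaranteed improvement.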


In [59]:
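# Generate an Answer Using a Language Model

# Load a generative model
# Note: facebook/bart-large is a pretrained checkpoint without task-specific fine-tuning
generator = pipeline("text2text-generation", model="facebook/bart-large")

# Generate an answer
def generate_answer(query, retrieved_docs):
    # Combine the text of the top-k (text, score) pairs into one context string
    context = " ".join(doc[0] for doc in retrieved_docs)
    # The prompt holds only the retrieved context; the query itself is not passed to the model
    prompt = context

    answer = generator(prompt, max_length=50, num_beams=3)[0]["generated_text"]

    # Format and display the results
    # (display_result is defined in the next cell and must exist before this function is called)
    display_result(query, answer)

    return answer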

Device set to use cuda:0


In [60]:
# Function to format and display results
def display_result(query, answer):
    print("\n=== Query ===")
    print(query)  # Print the retriever query

    print("\n=== Answer ===")
    sentences = re.split(r'(?<=\.) ', answer)
    for sentence in sentences:
        print(sentence.strip())  # Print each sentence on a new line

In [61]:
# Example usage
answer = generate_answer(query, retrieved_docs)


=== Query ===
Why do stars twinkle (also called stellar scintillation) in the night sky?

=== Answer ===
Planets, however, usually do not twinkle as much because they appear larger in the sky and their light is more stable.
Stars appear to twinkle because their light passes through Earth’s atmosphere before reaching our eyes.
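
The answer above repeats the retrieved passages word for word: the prompt contains only the context, and `facebook/bart-large` is a pretrained checkpoint that has not been fine-tuned to follow instructions. A minimal sketch of a prompt that also carries the question is shown below. The function name `build_prompt` and the template wording are assumptions, and an instruction-tuned text-to-text model would likely be needed for the template to have the intended effect.

```python
from typing import List, Tuple


def build_prompt(query: str, retrieved_docs: List[Tuple[str, float]]) -> str:
    """Combine the retrieved passages and the question into one prompt string."""
    # Each retrieved item is a (text, score) pair, as returned by retrieve_documents.
    context = "\n".join(f"- {text}" for text, _score in retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical inputs shaped like the notebook's retrieval results.
example_docs = [("Passage one.", 0.9), ("Passage two.", 0.8)]
example_prompt = build_prompt("What do the passages say?", example_docs)
print(example_prompt)
```

Inside `generate_answer`, the returned string would take the place of `prompt` before it is passed to `generator`.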
