This walkthrough covers Retrieval-Augmented Generation (RAG) step by step: encoding documents, chunking them, encoding queries, and retrieving the most relevant documents. Each step comes with a short explanation and a code snippet. RAG combines retrieval and generation in a single framework, which is particularly useful for open-domain question answering.

In [1]:
!pip install sentence-transformers
!pip install faiss-cpu
!pip install nltk



After installing nltk, you need to download the necessary tokenizer models for sentence tokenization.

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## How to Encode Documents?
To encode documents, you typically use a pre-trained language model that generates embeddings. For this example, we'll use the sentence-transformers library, which provides pre-trained models for encoding sentences or documents into dense vectors.

SentenceTransformer: This class from the sentence-transformers library loads pre-trained models.
Model: We load all-MiniLM-L6-v2, a small and efficient model that maps each input text to a 384-dimensional vector and is suitable for various NLP tasks.
Documents: A list of sample documents that will be converted into embeddings.
Embeddings: The model encodes each document into a dense vector that captures its semantic meaning. The result is one row per document, and these embeddings can then be used for similarity comparisons.

In [3]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample documents
docs = [
    "Document 1 content goes here.",
    "Document 2 content goes here.",
    "Document 3 content goes here."
]

# Encode documents
doc_embeddings = model.encode(docs)
print(doc_embeddings)



[[-0.0293765   0.00104907  0.02164329 ...  0.06537817  0.11168744
   0.02347772]
 [-0.01070729  0.02644592  0.03445313 ...  0.06088343  0.10327022
   0.01467254]
 [-0.01308329 -0.0193058  -0.0367191  ...  0.02381936  0.0993048
   0.03798515]]
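
To make "similarity comparisons" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. It uses plain Python lists and tiny made-up vectors rather than real model output, and the function name `cosine_similarity` is an illustrative choice, not part of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (made up for illustration)
vec_a = [1.0, 0.0, 1.0]
vec_b = [2.0, 0.0, 2.0]  # same direction as vec_a
vec_c = [0.0, 1.0, 0.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: no overlap
```

Values near 1 indicate vectors pointing in the same direction, while values near 0 indicate unrelated directions.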


## How to Chunk Documents?

Chunking documents is crucial for managing long documents and improving retrieval performance. You can split documents into smaller, manageable chunks, such as paragraphs or sentences.

nltk.tokenize: The nltk library provides various tools for text processing, including tokenization.

sent_tokenize: This function splits the text into sentences.

Chunks: The long document is divided into sentences, making it easier to process and retrieve relevant parts during the RAG process.

In [4]:
from nltk.tokenize import sent_tokenize

# Example long document
long_doc = "Long document content. This is a second sentence. And another one."

# Chunking into sentences
chunks = sent_tokenize(long_doc)
print(chunks)


['Long document content.', 'This is a second sentence.', 'And another one.']
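
Single sentences are often too short to carry enough context on their own. One possible refinement, sketched below, groups consecutive sentences into slightly larger chunks that overlap by a sentence. The helper `chunk_sentences` and its `chunk_size` and `overlap` parameters are hypothetical names chosen for this illustration; it takes a list of sentences such as the one `sent_tokenize` returns.

```python
def chunk_sentences(sentences, chunk_size=2, overlap=1):
    """Group sentences into chunks of `chunk_size` that share `overlap` sentences."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(sentences):
            break  # this window already reached the last sentence
    return chunks

sentences = ['Long document content.', 'This is a second sentence.', 'And another one.']
for chunk in chunk_sentences(sentences, chunk_size=2, overlap=1):
    print(chunk)
```

Overlap trades a larger index for a lower chance that a relevant passage is split across a chunk boundary.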


## How to Retrieve Documents?
To retrieve documents, we'll use a similarity search technique. We'll employ the faiss library for efficient similarity search.

FAISS: A library for efficient similarity search and clustering of dense vectors.

Index: A flat FAISS index (IndexFlatL2) is created. It performs an exact search using L2 (Euclidean) distance, so a smaller distance means a more similar vector.

Query Embedding: The query is encoded into an embedding using the same model used for document encoding.

Retrieve Documents: The function searches the index to find the most similar document embeddings to the query embedding, returning the indices of the top-k similar documents.

In [5]:
import faiss
import numpy as np

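# Convert embeddings to a float32 numpy array (the dtype FAISS expects)
doc_embeddings_np = np.asarray(doc_embeddings, dtype="float32")

# Create a flat (exact) L2 index and add the document embeddings
index = faiss.IndexFlatL2(doc_embeddings_np.shape[1])
index.add(doc_embeddings_np)

# Retrieve the indices of the k most similar documents for a given query
def retrieve_documents(query_embedding, k=2):
    # FAISS expects a 2-D float32 array of shape (n_queries, dim)
    query_vec = np.asarray([query_embedding], dtype="float32")
    # D holds squared L2 distances, I holds the matching document indices
    D, I = index.search(query_vec, k)
    return I[0]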

# Example query
query = "Content related to Document 1."
query_embedding = model.encode([query])[0]

# Retrieve top 2 documents
retrieved_doc_indices = retrieve_documents(query_embedding)
print(retrieved_doc_indices)


[0 1]
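
Conceptually, a flat L2 index compares the query with every stored vector and keeps the k closest. The sketch below mimics that behavior in plain Python on toy 2-dimensional vectors. `l2_search` is a hypothetical helper written for illustration, not the FAISS API, and it leaves out the optimizations FAISS performs.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2_search(vectors, query, k=2):
    """Return (distances, indices) of the k nearest vectors, closest first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy data (made up for illustration)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
toy_query = [0.9, 0.1]

distances, indices = l2_search(toy_vectors, toy_query, k=2)
print(indices)    # positions of the two nearest vectors, closest first
print(distances)  # their squared L2 distances
```

Because every stored vector is compared with the query, the cost grows linearly with the number of documents. That is fine for small collections like the one in this walkthrough.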


## How to Encode Queries?
Encoding queries is similar to encoding documents. You use the same model to transform the query into an embedding.

Query: The input query that we want to use for document retrieval.

Query Embedding: The query is encoded into a dense vector using the pre-trained model, capturing its semantic meaning. This embedding is used for similarity search against the document embeddings.

In [6]:
# Encode query
query = "Content related to Document 1."
query_embedding = model.encode(query)
print(query_embedding)


[-4.75145094e-02  3.59434485e-02  3.54594877e-03  6.62644254e-03
  7.26331249e-02  1.29147712e-02  1.41177066e-02  5.51050343e-02
  2.52027940e-02  1.02072926e-02  3.91226001e-02  9.00198817e-02
  2.45006811e-02  3.10012698e-03 -4.61227261e-02 -1.05967140e-02
  1.95167921e-02 -5.77711985e-02 -2.63882540e-02  7.54581913e-02
  3.18428800e-02  6.12635091e-02  3.83324586e-02  1.45608187e-02
  1.61190499e-02  5.77918440e-02 -1.28156140e-01 -8.18322133e-03
  6.16700873e-02 -6.64430484e-02  6.15298115e-02  3.54829952e-02
  3.27663980e-02  9.34434496e-03  5.32094799e-02  5.65565117e-02
  6.34200796e-02 -5.00394963e-02  2.43208129e-02  7.03696385e-02
 -1.09096961e-02 -2.48694215e-02 -3.67329083e-02  1.51401311e-02
  2.54734270e-02  2.62852833e-02 -4.34894264e-02  4.04231437e-02
 -8.82335752e-03  8.67691040e-02 -4.92651612e-02  2.95495670e-02
 -9.05500352e-02  1.31976018e-02  3.84256877e-02  5.45462966e-02
 -4.28279750e-02  7.64782354e-02 -3.78435664e-02 -2.06588358e-02
  1.31438579e-02  2.10527

## When to Retrieve?
Retrieval typically happens before generation begins. The goal is to find relevant documents that help the generation model produce a coherent and accurate response.

Timing: Retrieval is performed after encoding the query but before the generation step. It ensures that the response generation model has access to relevant information from the documents.

Retrieved Documents: The actual content of the retrieved documents based on their indices. These documents are used as input to the generation model.

In [7]:
# Assume query is encoded as shown above
# Retrieve documents
retrieved_doc_indices = retrieve_documents(query_embedding)
retrieved_docs = [docs[idx] for idx in retrieved_doc_indices]
print(retrieved_docs)


['Document 1 content goes here.', 'Document 2 content goes here.']
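
This walkthrough does not show how the retrieved text reaches the generation model. One common pattern is to place it in the prompt ahead of the question, and the sketch below assumes that pattern. The `build_prompt` helper and the prompt wording are hypothetical, and the call to an actual generation model is left out.

```python
def build_prompt(query, retrieved_docs):
    """Combine retrieved documents and the user query into one prompt string."""
    # Number each document so the model (and the reader) can tell them apart
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

example_docs = ['Document 1 content goes here.', 'Document 2 content goes here.']
prompt = build_prompt("Content related to Document 1.", example_docs)
print(prompt)
```

The ordering mirrors the timing described above: the query is encoded, documents are retrieved, and only then is the input for the generation step assembled.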


## How and What to Retrieve?
The retrieval process involves fetching the top-k most similar documents based on the similarity scores. You retrieve the actual document content or the relevant chunks.

Retrieve Content: This function maps the retrieved indices to the actual document content.

Relevant Chunks: You can retrieve entire documents or specific chunks, depending on your application needs.

Retrieved Documents: The final step is to print or use the retrieved documents in the subsequent generation step.

In [8]:
# Retrieve the actual document content
def retrieve_document_content(indices, docs):
    return [docs[idx] for idx in indices]

# Retrieve documents
retrieved_docs = retrieve_document_content(retrieved_doc_indices, docs)
print(retrieved_docs)


['Document 1 content goes here.', 'Document 2 content goes here.']
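
When chunks rather than whole documents are indexed, it helps to remember which document each chunk came from. The sketch below shows one way to keep that link. The names `build_chunk_records` and `naive_split` are hypothetical, and `naive_split` is a crude stand-in for `sent_tokenize` so the example runs without NLTK.

```python
def naive_split(text):
    """Crude sentence splitter on periods (stand-in for sent_tokenize)."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]

def build_chunk_records(docs, splitter):
    """Split each document and record the source document id of every chunk."""
    records = []
    for doc_id, doc in enumerate(docs):
        for chunk in splitter(doc):
            records.append({"doc_id": doc_id, "text": chunk})
    return records

sample_docs = ["First doc. It has two sentences.", "Second doc."]
records = build_chunk_records(sample_docs, naive_split)

# Pretend a similarity search over chunk embeddings returned these positions
retrieved_chunk_indices = [1, 2]
hits = [records[i] for i in retrieved_chunk_indices]
for hit in hits:
    print(hit["doc_id"], hit["text"])
```

In this setup the embeddings would be computed over `records[i]["text"]`, so an index position maps directly to a chunk and, through `doc_id`, back to its source document.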
