# Retrieval Augmented Generation (RAG) Pipeline for Document Q&A

This notebook demonstrates a basic Retrieval Augmented Generation (RAG) pipeline. RAG pairs a large language model (LLM) with a retrieval system: passages relevant to the user's query are fetched from a document store and passed to the model as context, so the answer is grounded in the source documents rather than in the model's training data alone.

In this pipeline, we will:
1.  Load and preprocess documents.
2.  Split documents into smaller chunks.
3.  Generate embeddings for these chunks.
4.  Store the embeddings in a vector database (Pinecone).
5.  Given a user query, retrieve relevant chunks from the database.
6.  Use a language model (Groq) to generate an answer based on the retrieved context.
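
The retrieval half of these steps (3 to 5) can be illustrated without any external services. The sketch below is a toy stand-in, not the pipeline built in this notebook: a bag-of-words count vector replaces the embedding model and an in-memory list replaces Pinecone. The names `embed`, `cosine`, and `retrieve` are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, store, top_k=2):
    """Return the texts of the top_k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:top_k]]

# "Index" three chunks, then retrieve the best match for a query.
chunks = [
    "Pinecone stores vectors and supports similarity search.",
    "Chunking splits long documents into smaller pieces.",
    "Groq serves large language models over an API.",
]
store = [{"text": c, "vector": embed(c)} for c in chunks]
top = retrieve("How are long documents split into pieces?", store, top_k=1)
print(top)
```

A real embedding model matches on meaning rather than on shared words, but the flow is the same: embed the chunks once, embed the query, rank by similarity, and hand the best chunks to the LLM.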

# **SETTING UP**

In [None]:
# Installing necessary dependencies for the RAG pipeline.
# pinecone: for the vector database
# sentence-transformers: for generating embeddings
# pymupdf: for loading PDF documents
# langchain: a framework for developing applications powered by language models
# langchain-community: community contributed LangChain components
# langdetect: for detecting the language of the text
# torch: a deep learning framework, used by sentence-transformers
# groq: for the Language Model (LLM) used for generating answers
!pip install pinecone sentence-transformers pymupdf langchain -q
!pip install -U langchain-community langdetect -q
!pip install torch -q
!pip install groq -q

In [None]:
# Importing necessary libraries and modules.
import os
import re
import unicodedata

import torch
from groq import Groq
# Document loaders live in langchain-community; the old `langchain.document_loaders` path is deprecated.
from langchain_community.document_loaders import PyMuPDFLoader, TextLoader
# Text splitters ship in the `langchain-text-splitters` package, installed alongside LangChain.
from langchain_text_splitters import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langdetect import detect
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

In [None]:
# Setting up Pinecone connection.
# Replace the placeholder with your actual Pinecone API key (or set PINECONE_API_KEY beforehand).
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "Your Pinecone API Key")
index_name = "index1" # Name of an existing Pinecone index

# Initialize Pinecone client.
# Current versions of the `pinecone` package need only the API key here; the cloud region
# (e.g., "us-east-1") is chosen when the index is created, not when connecting.
pinecone_client = Pinecone(api_key=PINECONE_API_KEY)

# Connect to the specified Pinecone index.
index_bge = pinecone_client.Index(name=index_name)

# Set Pinecone API key as an environment variable (sometimes required by libraries).
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY

In [None]:
# Initialize Groq client for the Language Model (LLM).
# Replace with your actual Groq API Key.
GROQ_API_KEY = "Your Groq API Key"
client = Groq(api_key=GROQ_API_KEY)

In [None]:
# Determine the device to use for the Sentence Transformer model (GPU if available, otherwise CPU).
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the Sentence Transformer model (BAAI/bge-m3) for generating embeddings.
# bge-m3 is a multilingual model that produces 1024-dimensional dense vectors, so the
# Pinecone index must be created with dimension 1024.
models = {
    "bge-m3": SentenceTransformer("BAAI/bge-m3", device=device)
}

In [None]:
# --- Helper Functions ---

# Function to clean text: normalize Unicode, remove extra whitespace.
def clean_text(text):
    text = unicodedata.normalize("NFKC", text) # Normalize Unicode characters
    text = re.sub(r'\s+', ' ', text) # Replace multiple whitespace characters with a single space
    text = text.strip() # Remove leading/trailing whitespace
    return text

# Function to preprocess documents: clean text, detect language, add metadata.
def preprocess_documents(documents, filename):
    cleaned_docs = []
    for i, doc in enumerate(documents):
        text = clean_text(doc.page_content) # Clean the document content
        try:
            lang = detect(text) # Detect the language of the text
        except Exception:
            lang = "unknown" # Detection fails on empty or non-linguistic text

        doc.page_content = text # Update document content with cleaned text
        doc.metadata["language"] = lang # Add detected language to metadata
        doc.metadata["source"] = filename # Add source filename to metadata
        doc.metadata["chunk_index"] = i # Index of this loaded document (page) within the file

        if "page" not in doc.metadata:
            doc.metadata["page"] = i # Fall back to the index when the loader sets no page (e.g., TextLoader)

        cleaned_docs.append(doc)
    return cleaned_docs

# Function to upload embeddings to Pinecone.
def upload_embeddings(chunks, model_name, model, index, chunking_type):
    batch_size = 200 # Define batch size for uploading vectors to Pinecone
    for i in range(0, len(chunks), batch_size):
        batch_chunks = chunks[i:i + batch_size]
        texts = [c.page_content for c in batch_chunks]
        # Encode the text chunks into embeddings using the specified model.
        embeddings = model.encode(texts, show_progress_bar=False)

        vectors = []
        # Prepare vectors for upserting to Pinecone.
        for j, embedding in enumerate(embeddings):
            chunk_id = f"{model_name}_{chunking_type}_chunk_{i + j}" # Create a unique ID for each vector
            metadata = batch_chunks[j].metadata.copy() # Copy existing metadata
            metadata["text"] = batch_chunks[j].page_content # Add the original text chunk to metadata
            metadata["chunking"] = chunking_type # Add chunking type to metadata
            vectors.append((chunk_id, embedding.tolist(), metadata)) # Create the vector tuple (ID, embedding, metadata)

        # Upsert (insert or update) the vectors to the Pinecone index.
        index.upsert(vectors=vectors)
        print(f"✅ Uploaded {len(vectors)} vectors to {model_name} ({chunking_type})")
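
To see what the `clean_text` helper does to typical PDF-extraction residue, here is a small self-contained check. It re-declares the helper so that it runs on its own, and the sample string is made up for illustration: it contains a ligature, a non-breaking space, full-width letters, and irregular whitespace.

```python
import re
import unicodedata

def clean_text(text):
    """Same logic as the notebook helper: NFKC-normalize, then collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# \ufb01 is the "fi" ligature, \u00a0 a non-breaking space,
# and \uff46\uff49\uff4e\uff45 spells "fine" in full-width letters.
raw = "  The \ufb01rst\u00a0result\n\nwas   \uff46\uff49\uff4e\uff45. "
cleaned = clean_text(raw)
print(cleaned)
```

NFKC folds compatibility characters into their plain equivalents, which helps the same word embed the same way regardless of how the source file encoded it.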

# **Documents**

In [None]:
# Define the list of documents to be processed.
# Make sure these files exist in your Colab environment or Google Drive.
docs = ['sample1.pdf', 'sample2.txt']

# **Chunking**
This section demonstrates different methods for splitting documents into smaller pieces (chunks). Chunking is essential for managing the input size for embedding models and LLMs, and for improving the relevance of retrieved information.
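
The role of `chunk_size` and `chunk_overlap` is easiest to see in a stripped-down splitter. The function below is a minimal sketch under simplifying assumptions: it cuts purely by character position, whereas LangChain's splitters also try to break on separators such as paragraphs and sentences. `chunk_with_overlap` is a hypothetical name used only here.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.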

In [None]:
# Initialize lists to store chunks from all documents using different strategies.
char_chunks_all = [] # Chunks created using fixed-length character splitting
token_chunks_all = [] # Chunks created using token-based splitting
sentence_chunks_all = [] # Chunks created using sentence-aware splitting

# --- Chunking Strategies ---
# The splitters are created once, before the loop, so the tokenizer model behind the
# token splitter is not reloaded for every file.

# Character chunking: splits on paragraph, line, and word boundaries into chunks of
# at most 500 characters, with 100 characters of overlap.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Token-based chunking: splits text into chunks of a fixed number of tokens with overlap.
# By default this splitter counts tokens with the "sentence-transformers/all-mpnet-base-v2"
# tokenizer, not the bge-m3 tokenizer, so the counts are only approximate for bge-m3.
token_splitter = SentenceTransformersTokenTextSplitter(tokens_per_chunk=256, chunk_overlap=50)

# Sentence-aware chunking: prefers sentence separators as split points, aiming to keep sentences intact.
sentence_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400, # Target chunk size
    chunk_overlap=80, # Overlap between chunks
    separators=["\n\n", "\n", ". ", "? ", "! ", " "] # Split points, tried in this order
)

# Process each document in the list.
for doc_path in docs:
    filename = os.path.basename(doc_path) # Get the filename from the path

    # Load the document based on its file extension.
    if doc_path.endswith(".txt"):
        loader = TextLoader(doc_path)
    elif doc_path.endswith(".pdf"):
        loader = PyMuPDFLoader(doc_path)
    else:
        print(f"Skipping unsupported file: {doc_path}")
        continue # Skip to the next file if the extension is not supported

    # Load the raw documents.
    raw_docs = loader.load()
    # Preprocess the documents (clean text, add metadata).
    documents = preprocess_documents(raw_docs, filename)

    # Apply each strategy and add the resulting chunks to the combined lists.
    char_chunks_all.extend(char_splitter.split_documents(documents))
    token_chunks_all.extend(token_splitter.split_documents(documents))
    sentence_chunks_all.extend(sentence_splitter.split_documents(documents))


# Print the total number of chunks generated by each strategy.
print("✅ Total char chunks:", len(char_chunks_all))
print("✅ Total token chunks:", len(token_chunks_all))
print("✅ Total sentence chunks:", len(sentence_chunks_all))

# **Upload Embeddings**
This section handles the process of generating embeddings for the document chunks and uploading them to the Pinecone vector database.
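
The `upload_embeddings` helper derives each vector ID from the chunk's position in the list, so uploading a different set of documents later with the same model and chunking type reuses those IDs and overwrites the earlier vectors. One possible alternative, sketched below, is to derive the ID from the chunk's content. The ID format and the names `make_chunk_id` and `batched` are assumptions for illustration, not part of the notebook or of the Pinecone API.

```python
import hashlib

def make_chunk_id(source, chunking_type, text):
    """Build a stable ID from the source file, chunking strategy, and chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{source}:{chunking_type}:{digest}"

def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"chunk number {n}" for n in range(5)]
ids = [make_chunk_id("sample2.txt", "char", t) for t in texts]
batches = list(batched(ids, batch_size=2))
print([len(b) for b in batches])
```

With content-based IDs, re-uploading an unchanged document replaces each vector with itself, while chunks from a new document get new IDs. The trade-off is that two identical chunks from the same file collapse into a single vector.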

In [None]:
# Upload embeddings for the chunks to Pinecone.
# The upload_embeddings function handles batching and upserting to the index.

# Upload token chunks embeddings.
upload_embeddings(token_chunks_all, "bge-m3", models["bge-m3"], index_bge, chunking_type="token")

# Upload char chunks embeddings.
upload_embeddings(char_chunks_all, "bge-m3", models["bge-m3"], index_bge, chunking_type="char")

# Upload sentence chunks embeddings.
upload_embeddings(sentence_chunks_all, "bge-m3", models["bge-m3"], index_bge, chunking_type="sentence")

# **Querying & Answer**
This section covers how a user query is processed to find relevant information in the Pinecone index and then used by a Language Model to generate an answer.
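
Between retrieval and generation, the retrieved chunks have to be assembled into a single context string. The sketch below shows one way to do that with a size cap and source labels. The match dictionaries mirror the shape this notebook reads from Pinecone results (`score` plus `metadata` with `text` and `source`), but `build_context` and the `max_chars` budget are hypothetical additions.

```python
def build_context(matches, max_chars=2000):
    """Join retrieved chunk texts, best score first, within a rough character budget."""
    parts, used = [], 0
    for match in sorted(matches, key=lambda m: m["score"], reverse=True):
        meta = match["metadata"]
        block = f"[{meta.get('source', 'N/A')}] {meta['text']}"
        if used + len(block) > max_chars:
            break  # Stop before exceeding the budget (separators are not counted)
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)

matches = [
    {"score": 0.42, "metadata": {"text": "Second-best chunk.", "source": "sample2.txt"}},
    {"score": 0.87, "metadata": {"text": "Best chunk.", "source": "sample1.pdf"}},
]
context = build_context(matches, max_chars=40)
print(context)
```

Capping the context keeps the prompt within the model's input limit, and the source labels let the model (and the reader) trace a statement back to a file.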

In [None]:
# --- Querying and Answer Generation Functions ---

# Function to search Pinecone for relevant chunks based on a query.
def search_pinecone(query, model, index, top_k=10, chunking_type=None):
    try:
        query_lang = detect(query) # Detect the language of the query
    except Exception:
        query_lang = None # Detection can fail on very short or non-linguistic queries

    # Encode the query into an embedding using the specified model.
    query_embedding = model.encode([query], convert_to_numpy=True)[0]

    # Build the metadata filter. Results are restricted to the query's language
    # only when detection succeeded; otherwise all languages are searched.
    query_filter = {}
    if query_lang:
        query_filter["language"] = {"$eq": query_lang}

    # If a chunking type is specified, add it to the filter to search only within that type of chunks.
    if chunking_type:
        query_filter["chunking"] = {"$eq": chunking_type}

    # Perform the similarity search in the Pinecone index.
    results = index.query(
        vector=query_embedding.tolist(), # The query embedding
        top_k=top_k, # The number of top similar results to retrieve
        include_metadata=True, # Include the metadata stored with each vector
        filter=query_filter or None # Language and chunking filter; None when no condition applies
    )

    # Print the details of the retrieved matches.
    print("--- Pinecone Search Results ---")
    for i, match in enumerate(results['matches']):
        print(f"\n🔹 Match #{i+1}")
        print(f"Score: {match['score']:.4f}") # Similarity score
        print(f"Language: {match['metadata'].get('language', 'unknown')}") # Language of the chunk
        print(f"Chunking: {match['metadata'].get('chunking', 'N/A')}") # Chunking strategy used
        print(f"Source: {match['metadata'].get('source', 'N/A')}") # Source document
        print(f"Text: {match['metadata']['text']}") # The actual text of the chunk
    print("-----------------------------")

    return results # Return the search results

# Function to generate an answer using Groq based on retrieved context and the original query.
def generate_answer_groq(matches, query):
    # Combine the text from the retrieved chunks to form the context for the LLM.
    context = "\n\n".join([match['metadata']['text'] for match in matches])

    # Use the Groq client to get a completion from the LLM.
    response = client.chat.completions.create(
        model="llama3-70b-8192", # LLM model ID; Groq retires model IDs over time, so check its current model list if this one is rejected
        messages=[
            # System message to instruct the LLM on how to behave.
            {"role": "system", "content": "You are an expert environmental researcher. Based on the following context extracted from scientific papers, provide a clear, well-structured, and thoughtful answer to the question below. Avoid bullet points unless necessary. Combine the information across sources, avoid redundancy, and make the answer sound like it was written by a human expert synthesizing multiple studies."},
            # User message containing the context and the question.
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )

    return response.choices[0].message.content # Return the generated answer

In [None]:
# --- Example Usage ---

# Define the user query.
query = "Add your query here"

# Search Pinecone for relevant chunks based on the query.
# search_pinecone embeds the query itself, so no separate encoding step is needed here.
# We search across all chunking types by not specifying 'chunking_type'.
results = search_pinecone(query, model=models["bge-m3"], index=index_bge, top_k=5, chunking_type=None)

# Generate an answer using Groq based on the retrieved chunks.
answer = generate_answer_groq(results['matches'], query)

# Print the final generated answer.
print("\n💡 Final Answer:", answer)

# Suggestions & Project Ideas

Here are some ideas and directions to help guide you:

---

## Project Ideas

- **Build a Custom Chatbot:**  
  Use your own notes, textbooks, or research papers as the document source and create a chatbot that answers questions about them.

- **Domain-Specific Q&A:**  
  Try using this pipeline for other domains—medicine, law, history, or any subject you’re interested in.

- **Web App Interface:**  
  Deploy your RAG pipeline as a web app using Streamlit or Gradio.

- **Experiment with Different Models:**  
  Swap out the embedding model or LLM (try OpenAI, Gemini, or open-source models) and compare results.

- **Multilingual Support:**  
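  The pipeline already tags chunks with a detected language and uses a multilingual embedding model (bge-m3). Try cross-lingual retrieval, where the query and the documents are in different languages.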

---

## What to Learn Next

- **Prompt Engineering:**  
  Learn how to craft better prompts to get more accurate and useful responses from LLMs.

- **Fine-tuning LLMs:**  
  Explore how to fine-tune language models on your own data for improved performance.

- **Evaluation Techniques:**  
  Study how to evaluate the quality and relevance of LLM-generated answers.

- **Vector Databases:**  
  Dive deeper into how vector search works and try other vector DBs.

- **LLM Internals:**  
  Learn about the architecture of transformers, attention mechanisms, and how LLMs are trained.
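
As a starting point for the evaluation topic above, retrieval quality can be measured separately from answer quality. The sketch below computes a hit rate at k: the fraction of test queries for which at least one known-relevant chunk appears among the top k results. It assumes you have hand-labeled which chunk IDs are relevant to each query; `hit_rate_at_k` and the sample IDs are hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k retrieved IDs include at least one relevant ID."""
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        if set(ranked_ids[:k]) & relevant.get(query_id, set()):
            hits += 1
    return hits / len(retrieved) if retrieved else 0.0

# Ranked chunk IDs returned for each test query (best first).
retrieved = {
    "q1": ["c3", "c7", "c1"],
    "q2": ["c9", "c2", "c4"],
}
# Hand-labeled relevant chunk IDs for each test query.
relevant = {"q1": {"c1"}, "q2": {"c5"}}

print(hit_rate_at_k(retrieved, relevant, k=3))
print(hit_rate_at_k(retrieved, relevant, k=2))
```

Running the same labeled queries against each chunking strategy gives a simple, repeatable way to compare them.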

---