<h1 style="text-align: center; font-size: 50px;"> Retrieval-Augmented Generation (RAG) with Agentic Workflow </h1>

This notebook implements a **Retrieval-Augmented Generation (RAG)** pipeline with an **Agentic Workflow**, using a local **Llama 2** model and **ChromaDB** for intelligent question-answering.  

Before answering, the system asks the model whether the query needs additional context and retrieves document chunks only when it does, so that answers to document-specific questions are grounded in the source text.

### Key Features:
- **Llama 2 Model** for high-quality text generation.
- **PDF Document Processing** to extract relevant information.
- **ChromaDB Vector Store** for efficient semantic search.
- **Dynamic Context Retrieval** to improve answer accuracy.
- **Two Answering Modes**:
  - With RAG (Retrieves relevant document content before responding).
  - Without RAG (Directly generates responses).

# Notebook Overview

- Imports
- Configurations
- Verify Assets
- Model Setup
- Loading and Processing the PDF Document
- Splitting the Document into Chunks
- Initializing the Embedding Model
- Computing Embeddings for Document Chunks
- Storing Document Embeddings in ChromaDB
- Implementing Vector Search Tool
- Context Need Assessment
- Answer Generation with Agentic RAG
- Answer Generation Without RAG

# Imports

In [1]:
# Install required packages if you have not installed them already
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Standard Libraries
import os
import logging                  
import warnings   
from pathlib import Path

# Numerical Computing
import numpy as np
import torch

# Hugging Face Hub for Model Download
from huggingface_hub import hf_hub_download

# LangChain Components
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

# Sentence Transformers for Embeddings
from sentence_transformers import SentenceTransformer

# ChromaDB for Vector Storage
import chromadb
from chromadb.utils import embedding_functions

# Transformers 
import transformers
import sentence_transformers

# Configurations

In [3]:
# Suppress Python warnings
warnings.filterwarnings("ignore")

In [4]:
# Create logger
logger = logging.getLogger("notebook_logger")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", 
                              datefmt="%Y-%m-%d %H:%M:%S")  

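# Attach the handler only once so re-running this cell does not duplicate log lines
if not logger.handlers:
    stream_handler = logging.StreamHandler()
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)
logger.propagate = False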

In [5]:
LOCAL_MODEL_PATH = "/home/jovyan/datafabric/llama2-7b/ggml-model-f16-Q5_K_M.gguf"
PDF_PATH = "../data/AIStudioDoc.pdf"
# Define text splitting parameters
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
# Define the embedding model name
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
# Define Chroma database path and collection name
CHROMA_DB_PATH = "./chroma_db"
COLLECTION_NAME = "document_embeddings"

In [6]:
logger.info('Notebook execution started.')

2025-04-27 15:58:24 - INFO - Notebook execution started.


# Verify Assets

In [7]:
def log_asset_status(asset_path: str, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")


# Check and log status for Local model
log_asset_status(
    asset_path=LOCAL_MODEL_PATH,
    asset_name="Local Llama model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio if you want to use local model."
)

2025-04-27 15:58:24 - INFO - Local Llama model is properly configured. 
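
The check above logs a message for a single asset and then continues either way. Below is a minimal sketch of a stricter variant that collects every missing asset up front, so the model file and the PDF could be checked together. `find_missing_assets` is a hypothetical helper and the paths in the usage lines are placeholders, not the notebook's real assets.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_assets(assets: Dict[str, str]) -> List[str]:
    """Return the names of assets whose paths do not exist on disk."""
    return [name for name, path in assets.items() if not Path(path).exists()]


# Placeholder paths for illustration only (assumption, not the notebook's assets)
missing = find_missing_assets({
    "Local Llama model": "/path/that/does/not/exist/model.gguf",
    "Current directory": ".",
})
print("Missing assets:", missing)
```

In the notebook, a non-empty result could be used to stop early instead of failing later when the model is loaded.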


# 🔧 Model Setup

In [8]:
# Initialize the model with the local path and GPU acceleration
llm = LlamaCpp(
    model_path=LOCAL_MODEL_PATH,
    temperature=0.2,
    max_tokens=2000,
    n_ctx=4096,
    top_p=1.0,
    verbose=False,
    n_gpu_layers=30,  # Offload 30 model layers to the GPU
    n_batch=1024,     # Number of prompt tokens processed per batch
    f16_kv=True,      # Enable half-precision for key/value cache
    use_mlock=True,   # Lock memory to prevent swapping
    use_mmap=True     # Utilize memory mapping for faster loading
)
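
With llama.cpp, the prompt and the generated tokens share one context window, so `n_ctx=4096` and `max_tokens=2000` leave roughly 2,096 tokens for the prompt, including any retrieved context. The sketch below shows that arithmetic. The helper names are hypothetical, and the ratio of 4 characters per token is a rough assumption rather than the model's real tokenizer.

```python
def remaining_prompt_tokens(n_ctx: int, max_tokens: int) -> int:
    """Tokens left for the prompt after reserving room for the completion."""
    return max(n_ctx - max_tokens, 0)


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


budget = remaining_prompt_tokens(n_ctx=4096, max_tokens=2000)
prompt = "Here is additional context from our document:\n" + "x" * 10000

print(budget)                             # 2096
print(estimate_tokens(prompt) <= budget)  # False: this prompt would not fit
```

A check like this is most useful before sending a prompt that contains several retrieved chunks, since long chunks can exhaust the window.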

# 📄 Loading and Processing the PDF Document

To enable context-aware question-answering, we load a **PDF document**, extract its content, and split it into manageable chunks for efficient retrieval.

In [9]:
# --- Load the PDF Document ---

# Report which file is being loaded
print(f"Loading PDF from: {PDF_PATH}")

# Load the PDF document (PyPDFLoader returns one Document per page)
pdf_loader = PyPDFLoader(PDF_PATH)
documents = pdf_loader.load()

print(f"Successfully loaded {len(documents)} document(s) from the PDF.")

Loading PDF from: ../data/AIStudioDoc.pdf
Successfully loaded 8 document(s) from the PDF.


# ✂️ Splitting the Document into Chunks

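Since large documents are difficult to process in full, we split the text into **overlapping chunks** with a target size of **500 characters** and an overlap of **50 characters**. These chunks will later be embedded and stored in ChromaDB. Note that `CharacterTextSplitter` only splits on its separator (a blank line, `\n\n`, by default), so a page without blank lines stays a single chunk even when it exceeds the target size. This would explain why the 8 pages below yield exactly 8 chunks.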

In [10]:
# --- Split the PDF Content into Manageable Chunks ---

# Initialize the text splitter
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

# Split the PDF content into chunks
docs = text_splitter.split_documents(documents)

print(f"Successfully split PDF into {len(docs)} text chunks.")

Successfully split PDF into 8 text chunks.
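
To make the roles of chunk size and overlap concrete, here is a minimal sketch of a fixed-width sliding window. It is not LangChain's separator-based algorithm. `sliding_window_chunks` is a hypothetical helper that cuts purely by character position.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea as the 50-character overlap configured above: text near a boundary appears in both neighbouring chunks.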


# 🔍 Initializing the Embedding Model

To convert text into numerical representations for efficient similarity search, we use **all-MiniLM-L6-v2** from `sentence-transformers`.

In [11]:
# --- Initialize the Embedding Model ---

# Load the embedding model
embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)

print(f"Successfully loaded embedding model: {EMBEDDING_MODEL_NAME}")

Successfully loaded embedding model: all-MiniLM-L6-v2


# 🧠 Computing Embeddings for Document Chunks

Each chunk is converted into a **vector representation** using our embedding model. This allows us to perform **semantic similarity searches** later.

In [12]:
# --- Compute Embeddings for Each Text Chunk ---

# Extract text content from each chunk
doc_texts = [doc.page_content for doc in docs]

# Compute embeddings for the extracted text chunks
document_embeddings = embedding_model.encode(doc_texts, convert_to_numpy=True)

# Display the result
print("Successfully computed embeddings for each text chunk.")
print(f"Embeddings Shape: {document_embeddings.shape}")

Successfully computed embeddings for each text chunk.
Embeddings Shape: (8, 384)
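
Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses toy 3-dimensional vectors in place of the real 384-dimensional ones; the vectors and the helper name are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a query and two document chunks
query_vec = [1.0, 0.0, 0.0]
chunk_a = [0.9, 0.1, 0.0]   # points in nearly the same direction as the query
chunk_b = [0.0, 1.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query_vec, chunk_a) > cosine_similarity(query_vec, chunk_b))  # True
```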


# 🗄️ Storing Document Embeddings in ChromaDB

We initialize **ChromaDB**, a high-performance **vector database**, and store our computed embeddings to enable efficient retrieval of relevant text chunks.

In [13]:
# --- Initialize and Populate the Chroma Vector Database ---

# Initialize Chroma client
chroma_client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

# Upsert all chunks into the Chroma collection in a single batch.
# Unlike add(), upsert() overwrites records whose IDs already exist,
# so re-running this cell does not trigger duplicate-ID warnings.
collection.upsert(
    ids=[str(i) for i in range(len(doc_texts))],  # Chroma requires string IDs
    embeddings=document_embeddings.tolist(),
    metadatas=[{"text": text} for text in doc_texts]
)

print("Successfully populated Chroma database with document embeddings.")


Successfully populated Chroma database with document embeddings.
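
Positional IDs such as `"0"` and `"1"` are reused on every run, so records written to a persisted collection can collide with, or silently stand in for, chunks from an earlier version of the document. One alternative is sketched below, assuming content-derived IDs are acceptable. `chunk_id` is a hypothetical helper, and truncating the digest to 16 hex characters is an arbitrary choice.

```python
import hashlib


def chunk_id(source: str, text: str) -> str:
    """Derive a deterministic ID from the source name and the chunk text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]  # shortened for readability (assumption)


first = chunk_id("AIStudioDoc.pdf", "example chunk text")
second = chunk_id("AIStudioDoc.pdf", "example chunk text")
other = chunk_id("AIStudioDoc.pdf", "a different chunk")

print(first == second)  # True: same content always maps to the same ID
print(first == other)   # False: different content gets a different ID
```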


# 🔎 Implementing Vector Search Tool

To retrieve relevant text passages from the database, we define a **vector search function** that finds the most relevant chunks based on a user query.

In [14]:
# --- Define the Vector Search Tool ---
def vector_search_tool(query: str) -> str:
    """
    Searches the Chroma database for relevant text chunks based on the query.
    Computes the query embedding, retrieves the top 5 most relevant text chunks,
    and returns them as a formatted string.
    """
    # Compute the query embedding
    query_embedding = embedding_model.encode(query, convert_to_numpy=True).tolist()
    
    # Define the number of nearest neighbors to retrieve
    TOP_K = 5
    
    # Perform the search in the Chroma database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=TOP_K
    )
    
    # Retrieve and format the corresponding text chunks
    retrieved_chunks = [metadata["text"] for metadata in results["metadatas"][0]]
    return "\n\n".join(retrieved_chunks)
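
Conceptually, the query above is a nearest-neighbour search over the stored vectors. The brute-force sketch below mirrors what `vector_search_tool` returns, using toy 2-dimensional vectors. Squared L2 distance is an assumption about the collection's metric, since the notebook does not set one explicitly, and `top_k_chunks` is a hypothetical helper.

```python
import heapq
from typing import List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunk_texts: Sequence[str],
    k: int,
) -> List[str]:
    """Return the texts of the k chunks closest to the query, nearest first."""
    scored = [(squared_l2(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    nearest = heapq.nsmallest(k, scored)  # smallest distance = most similar
    return [chunk_texts[i] for _, i in nearest]


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
texts = ["a", "b", "c"]
result = top_k_chunks([0.9, 0.9], vectors, texts, k=2)
print(result)                 # ['b', 'a']
print("\n\n".join(result))    # same joining convention as vector_search_tool
```

A vector database avoids this linear scan by using an approximate index, but the ranking it aims for is the same.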

# 🤖 Context Need Assessment

Instead of always retrieving context, we first ask the model whether the query **requires external document context**. Its YES/NO reply routes the query either through retrieval or straight to generation, and this model-driven routing step is what makes the workflow agentic.

In [15]:
# --- Define the Meta-Evaluation Function ---
def needs_context(query: str) -> bool:
    """
    Determines if additional context from an external document is required to generate an accurate and detailed answer.
    Returns True if context is needed (response contains "YES"), False otherwise.

    Args:
        query (str): The user's query to evaluate.

    Returns:
        bool: True if external context is required, False otherwise.
    """
    meta_prompt = (
        "Based on the following query, decide if additional context from an external document is needed "
        "to generate an accurate and detailed answer. Have a tendency to use an external document if the query is not a very familiar topic. If in doubt, assume context is required and answer 'YES'.\n"
        "Answer with a single word: YES if additional context from an external document would be helpful to answer the query, "
        "or NO if not. Do not say anything other than YES or NO.\n"
        f"Query: {query}\n"
        "Answer:"
    )
    meta_response = llm.invoke(meta_prompt)
    print("Meta Response (is external document retrieval necessary?):", meta_response)
    return "YES" in meta_response.upper()


# --- Define the Main Answer Generation Function with RAG (Retrieve and Generate) ---
def generate_answer_with_agentic_rag(query: str) -> str:
    """
    Generates a detailed and accurate answer to the user's query by using context when needed.
    If additional context is required, it is retrieved from the vector store and included in the prompt.
    If not, the answer is generated using the query alone.

    Args:
        query (str): The user's query to answer.

    Returns:
        str: The generated answer based on the query.
    """
    if needs_context(query):
        # Retrieve additional context from the vector store
        context = vector_search_tool(query)
        
        # Construct the enriched prompt with the additional context
        enriched_prompt = (
            "Here is additional context from our document:\n"
            f"{context}\n\n"
            f"Based on this context and the query: {query}\n"
            "Please provide a detailed and accurate answer.\n"
            "Answer:"
        )
        final_response = llm.invoke(enriched_prompt)
    else:
        # Generate an answer using the original query directly
        direct_prompt = (
            "Please provide a detailed and accurate answer to the following query:\n"
            f"{query}\n"
            "Answer:"
        )
        final_response = llm.invoke(direct_prompt)
    
    return final_response


# --- Define the Answer Generation Function without RAG ---
def generate_answer_without_rag(query: str) -> str:
    """
    Generates a detailed and accurate answer to the user's query without using any additional context from external documents.
    
    Args:
        query (str): The user's query to answer.

    Returns:
        str: The generated answer based on the query.
    """
    direct_prompt = (
        "Please provide a detailed and accurate answer to the following query:\n"
        f"{query}\n"
        "Answer:"
    )
    final_response = llm.invoke(direct_prompt)
    
    return final_response
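
The substring test in `needs_context` treats any reply containing the letters "YES" as a yes, including a reply such as "No, although YES might also work". Below is a sketch of a stricter parser, assuming the first standalone YES or NO word should decide and that ambiguous replies should fall back to retrieval. `parse_yes_no` is a hypothetical helper.

```python
import re


def parse_yes_no(response: str, default: bool = True) -> bool:
    """Return True for YES, False for NO, based on the first standalone match.

    Falls back to `default` when the reply contains neither word.
    """
    match = re.search(r"\b(YES|NO)\b", response.upper())
    if match is None:
        return default
    return match.group(1) == "YES"


print(parse_yes_no(" YES"))                             # True
print(parse_yes_no("No."))                              # False
print(parse_yes_no("No, although YES might also work")) # False: first word wins
print(parse_yes_no("I cannot say"))                     # True: falls back to default
```

Defaulting to `True` matches the prompt's instruction to assume context is required when in doubt.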

# 💡 Answer Generation with Agentic RAG

If the model decides that additional context is needed, the workflow retrieves **relevant document chunks** from ChromaDB and includes them in the prompt.

In [16]:
query = "What are the key features of Z by HP AI Studio?"
print("User Query:", query)
final_answer = generate_answer_with_agentic_rag(query)
print("\nFinal Answer:")
print(final_answer)

User Query: What are the key features of Z by HP AI Studio?
Meta Response (is external document retrieval necessary?):  YES

Final Answer:
 Based on the provided context, Z by HP AI Studio is a standalone application designed for data scientists and engineers that offers several key features to enhance their productivity and collaboration. Here are some of the key features of Z by HP AI Studio:
1. Data Connectors: Z by HP AI Studio allows users to connect to multiple data-stores across local and cloud networks, making it easier to access the correct data and packages wherever they are.
2. Local Computation: The platform enables users to perform all their computations locally without interruption, providing a more manageable development, data, and model environment.
3. Monitoring: AI Studio runs the tools users select natively, allowing them to use all their favorite DS applications directly from the application. Users can view the memory, cores, and GPU required for each workspace to r
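
The routing logic can be exercised without loading the model by passing the decision, retrieval, and generation steps in as plain functions. The sketch below does that with stub callables. `answer_with_routing` and the shortened prompt templates are illustrative assumptions, not the notebook's exact prompts.

```python
from typing import Callable


def answer_with_routing(
    query: str,
    needs_context: Callable[[str], bool],
    retrieve: Callable[[str], str],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context only when needs_context says so, then generate an answer."""
    if needs_context(query):
        context = retrieve(query)
        prompt = f"Context:\n{context}\n\nQuery: {query}\nAnswer:"
    else:
        prompt = f"Query: {query}\nAnswer:"
    return generate(prompt)


# Stub callables: the "model" simply echoes the prompt it receives
echo = lambda prompt: prompt
with_context = answer_with_routing("q", lambda q: True, lambda q: "CTX", echo)
without_context = answer_with_routing("q", lambda q: False, lambda q: "CTX", echo)

print("CTX" in with_context)     # True: retrieval branch was taken
print("CTX" in without_context)  # False: direct branch was taken
```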

# ⚡ Answer Generation Without RAG

Here we answer the same query without RAG to show how the two answers differ.

In [17]:
query = "What are the key features of Z by HP AI Studio?"
print("User Query:", query)
final_answer = generate_answer_without_rag(query)
print("\nFinal Answer:")
print(final_answer)

User Query: What are the key features of Z by HP AI Studio?

Final Answer:

Z by HP AI Studio is an all-in-one AI development platform designed to help developers build, train, and deploy AI models more efficiently. Here are some of its key features:
1. User-friendly Interface: Z by HP AI Studio provides a user-friendly interface that makes it easy for developers to build, train, and deploy AI models without requiring extensive knowledge of AI or machine learning.
2. Pre-built Models: The platform comes with pre-built models for various tasks such as image classification, object detection, and natural language processing. Developers can use these models as a starting point and customize them to suit their needs.
3. AutoML: Z by HP AI Studio offers automated machine learning (AutoML) capabilities that enable developers to train AI models without writing any code. They can simply select the type of model they want to build, choose the data they want to use, and the platform will automati

In [18]:
logger.info('Notebook execution completed.')

2025-04-27 16:02:58 - INFO - Notebook execution completed.


Built with ❤️ using [**Z by HP AI Studio**](https://zdocs.datascience.hp.com/docs/aistudio/overview).