# RAG (Retrieval-Augmented Generation) with Agentic AI Demo (LlamaStack 0.3.0)

## Overview

This notebook demonstrates a RAG (Retrieval-Augmented Generation) system using **LlamaStack 0.3.0**, which combines:
- **Document Retrieval**: Using vector databases to search through ingested documents
- **Agentic AI**: Using ReAct (Reasoning + Acting) agents that can use multiple tools
- **Multi-Tool Workflows**: Combining RAG, web search, and custom tools for comprehensive question answering

## Approach & Architecture

### Why RAG?
RAG addresses the limitation of LLMs having static knowledge by:
1. **Retrieval**: Finding relevant information from a knowledge base (vector database)
2. **Augmentation**: Adding retrieved context to the prompt
3. **Generation**: Using the LLM to generate answers based on the augmented context
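
The three steps can be sketched without any external service. This is a toy illustration, not the notebook's pipeline: the word-overlap score stands in for vector search, the passages are made-up placeholders, and the function names (`retrieve`, `augment`) are hypothetical. The generation step is omitted; the resulting prompt is what would be sent to the LLM.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query: str, passage: str) -> int:
    """Toy relevance score: number of distinct query tokens found in the passage."""
    passage_tokens = set(tokenize(passage))
    return sum(1 for tok in set(tokenize(query)) if tok in passage_tokens)

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Retrieval: return the k passages with the highest overlap score."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]

def augment(query: str, context_passages: list[str]) -> str:
    """Augmentation: place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Placeholder knowledge base (invented sentences, for illustration only)
knowledge_base = [
    "The bank reported quarterly results in October.",
    "The government holds a majority of the bank's shares.",
    "Branches are open on weekdays.",
]

question = "Who holds the majority of shares?"
top = retrieve(question, knowledge_base, k=1)
prompt = augment(question, top)
print(prompt)
```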

### Why Agentic AI?
Traditional RAG only searches documents. Agentic AI enables:
- **Tool Selection**: Automatically choosing the right tool (RAG, web search, custom tools)
- **Multi-Step Reasoning**: Breaking down complex queries into steps
- **Dynamic Information**: Accessing real-time data (stock prices, web search)
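
Tool selection can be pictured as a dispatch step between the query and a set of callable tools. The sketch below is a deliberately simplified stand-in: in a real agent the LLM makes the choice, whereas here a keyword rule (`choose_tool`, a hypothetical helper) plays that role, and both tools are stubs that return placeholder strings.

```python
from typing import Callable

def search_documents(query: str) -> str:
    """Stub for a RAG lookup over ingested documents."""
    return f"[documents] passages matching '{query}'"

def get_stock_price(query: str) -> str:
    """Stub for a real-time data tool."""
    return f"[live-data] price lookup for '{query}'"

# Registry mapping tool names to callables
TOOLS: dict[str, Callable[[str], str]] = {
    "search_documents": search_documents,
    "get_stock_price": get_stock_price,
}

def choose_tool(query: str) -> str:
    """Stand-in for the LLM's reasoning step: pick a tool name from the query text."""
    realtime_words = ("latest", "current", "today", "price")
    if any(word in query.lower() for word in realtime_words):
        return "get_stock_price"
    return "search_documents"

def run_agent_step(query: str) -> tuple[str, str]:
    """One reason-then-act step: choose a tool, call it, return the observation."""
    tool_name = choose_tool(query)
    observation = TOOLS[tool_name](query)
    return tool_name, observation

print(run_agent_step("What is the latest stock price?"))
print(run_agent_step("Summarise the quarterly notes"))
```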

### System Components
1. **LlamaStack Client**: A single entry point to LLM services, vector databases, and agents
2. **Vector Database (Milvus)**: Stores document embeddings for semantic search
3. **Docling**: Advanced PDF extraction with OCR capabilities
4. **ReAct Agents**: Intelligent agents that reason and act using tools
5. **Custom Tools**: Domain-specific functions (e.g., Yahoo Finance for stock data)

---

In [1]:
# Install notebook dependencies (LlamaStack 0.3.0)
# Will take a while to download and install numerous dependencies. 
# Wait until it finishes before proceeding
%pip install llama_stack_client==0.3.0 docling rich

Collecting docling
  Using cached docling-2.66.0-py3-none-any.whl.metadata (11 kB)
Collecting docling-core<3.0.0,>=2.50.1 (from docling-core[chunking]<3.0.0,>=2.50.1->docling)
  Using cached docling_core-2.57.0-py3-none-any.whl.metadata (7.8 kB)
Collecting docling-parse<5.0.0,>=4.7.0 (from docling)
  Using cached docling_parse-4.7.2-cp312-cp312-macosx_14_0_arm64.whl.metadata (10 kB)
Collecting docling-ibm-models<4,>=3.9.1 (from docling)
  Using cached docling_ibm_models-3.10.3-py3-none-any.whl.metadata (7.3 kB)
Collecting filetype<2.0.0,>=1.2.0 (from docling)
  Using cached filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting pypdfium2!=4.30.1,<5.0.0,>=4.30.0 (from docling)
  Using cached pypdfium2-4.30.0-py3-none-macosx_11_0_arm64.whl.metadata (48 kB)
Collecting pydantic-settings<3.0.0,>=2.3.0 (from docling)
  Using cached pydantic_settings-2.12.0-py3-none-any.whl.metadata (3.4 kB)
Collecting huggingface_hub<1,>=0.23 (from docling)
  Using cached huggingface_hub-0.36.0-py3

In [2]:
# Verify the installed version of llama_stack_client (0.3.0)
# This ensures we're using the correct version for compatibility
import llama_stack_client

print(llama_stack_client.__version__)

0.3.0


In [3]:
# Python stdlib imports
import os
import json
from datetime import date, datetime, timedelta
import re
import logging

# Suppress verbose and noisy HTTP logs
logging.getLogger("httpx").setLevel(logging.WARNING)

# LlamaStack imports (0.3.0 API)
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
from llama_stack_client.types import Document  # Updated import path for 0.3.0
from llama_stack_client.lib.agents.react.agent import ReActAgent
from llama_stack_client.lib.agents.react.tool_parser import ReActOutput
from llama_stack_client.lib.agents.client_tool import client_tool
from llama_stack_client.lib.agents.event_logger import EventLogger

# Docling imports
from docling.document_converter import DocumentConverter

# pretty printing
import rich



## Setting up Configurations

### LLM Sampling Parameters

These parameters control how the LLM generates responses:

- **temperature**: Controls randomness (0.0 = deterministic, 1.0+ = more creative)
  - Lower values (0.1-0.3): More focused, deterministic responses
  - Higher values (0.7-1.0): More creative, diverse responses
  - This notebook sets 0.3, which keeps answers focused on the retrieved context

- **top_p** (nucleus sampling): Probability mass threshold for token selection
  - Only considers tokens whose cumulative probability is within top_p
  - 0.95 means considering tokens that make up 95% of probability mass
  - Works with temperature to control diversity

- **max_tokens**: Maximum number of tokens in the generated response
  - Prevents excessively long outputs
  - 512 tokens is roughly 350-400 English words (about 0.75 words per token)
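
To make the interaction between `temperature` and `top_p` concrete, here is a small pure-Python sketch over a made-up four-token vocabulary. It is an illustration of the general technique, not LlamaStack's or any model server's implementation; the logits and helper names are assumptions. Temperature 0 is not handled here because it would divide by zero, which is why the notebook switches to a greedy strategy in that case.

```python
import math

def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_scaled = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - max_scaled) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most probable tokens until their cumulative mass reaches top_p, then renormalise."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical next-token logits
logits = {"bank": 2.0, "branch": 1.0, "river": 0.0, "banana": -3.0}

sharp = apply_temperature(logits, 0.3)   # low temperature concentrates mass on "bank"
flat = apply_temperature(logits, 1.0)    # higher temperature spreads mass out
nucleus = top_p_filter(flat, 0.95)       # drops the unlikely tail token

print(round(sharp["bank"], 3), round(flat["bank"], 3))
print(sorted(nucleus))
```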

In [5]:
# Temperature: Controls randomness in LLM output
# 0.0 = deterministic (always same output for same input)
# 0.1-0.3 = focused, mostly consistent output (this notebook uses 0.3)
# 0.7 = balanced creativity and consistency
# 1.0+ = highly creative/variable outputs
temperature = 0.3

# Configure sampling strategy based on temperature
if temperature > 0.0:
    # Top-p (nucleus sampling): Only consider tokens whose cumulative probability 
    # is within the top_p threshold (0.95 = 95% probability mass)
    # This provides more focused sampling than pure temperature
    top_p = float(os.getenv("TOP_P", 0.95))
    # Top-p strategy: Uses both temperature and top_p for controlled randomness
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    # Greedy strategy: Always selects the most probable token (deterministic)
    strategy = {"type": "greedy"}

# Maximum tokens in the generated response
# 512 tokens is roughly 350-400 English words; prevents excessively long outputs
max_tokens = 512

# Sampling parameters dictionary
# Will be passed to LlamaStack Agents/Inference APIs to control text generation
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

## Initializing LlamaStack Client and Selecting Models

### Client Setup
The LlamaStackClient connects to the LlamaStack service endpoint, which provides:
- LLM inference services
- Vector store management (0.3.0: `vector_stores` API)
- Agent orchestration
- Tool execution

### Key API Changes in 0.3.0

| Old (0.2.x) | New (0.3.0) |
|-------------|-------------|
| `client.vector_dbs.register()` | `client.vector_stores.create()` |
| `vector_db.identifier` | `vector_store.id` |
| `client.inference.chat_completion()` | `client.chat.completions.create()` |
| `completion.completion_message.content` | `completion.choices[0].message.content` |

### Model Selection
We need two types of models:
 1. **LLM Model**: For text generation (e.g., Granite-3.3-8B-Instruct)
 2. **Embedding Model**: For converting text to vectors (e.g., granite-embedding-125m)
    - **Embedding Dimension**: Size of the vector space (e.g., 768 dimensions)
    - Used for semantic similarity search in vector stores
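
Semantic similarity search typically compares embedding vectors with a measure such as cosine similarity. The sketch below uses hand-written 3-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds of dimensions, and the metric actually used by a given vector store depends on its configuration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up numbers, not produced by any model)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "shareholding": [0.9, 0.1, 0.8],
    "branch-hours": [0.0, 1.0, 0.1],
}

# The best match is the document whose vector points in the most similar direction
best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best)
```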

In [None]:
# LlamaStack service URL (in-cluster)
LLAMASTACK_URL = "http://llama-stack-dist-service.competitor-analysis.svc.cluster.local:8321"

# For access from Notebooks external to the cluster, use the route URL instead (oc get route -n competitor-analysis)
# LLAMASTACK_URL = "https://llama-stack-ext-competitor-analysis.apps.ocp.sx7qw.sandbox2219.opentlc.com/"

# Vector DB name (logical identifier used by LlamaStack)
VECTOR_DB_NAME = "agentic-rag-db"

# Initialize client
client = LlamaStackClient(
    base_url=LLAMASTACK_URL,
    timeout=600.0
)

# Test connection by listing models
models = client.models.list()
    
rich.print(models)

In [20]:
# List the vector stores that already exist on the server
remaining = list(client.vector_stores.list())
print(f"Existing vector stores: {[getattr(vs, 'name', vs.id) for vs in remaining]}")

Existing vector stores: ['competitor-docs']


In [21]:
# Get the main inference model and embedding model
model_id = next(m.identifier for m in models if m.model_type == "llm")
embedding_model = next(m for m in models if m.model_type == "embedding")
embedding_model_id = embedding_model.identifier
embedding_dimension = int(embedding_model.metadata["embedding_dimension"])

# LlamaStack 0.3.0: Check if vector store already exists, otherwise create it
# WARNING: The create method will create a new vector store for every call
# Run this once, and for subsequent experiments in the same notebook, use the existing store ID
existing_stores = list(client.vector_stores.list())
existing_store = next((vs for vs in existing_stores if getattr(vs, 'name', None) == VECTOR_DB_NAME), None)

if existing_store:
    vector_db_id = existing_store.id
    rich.print(f"[green]Using existing vector store:[/green] {vector_db_id}")
else:
    # Create new vector store (0.3.0 OpenAI-compatible API)
    # Note: In 0.3.0, the embedding model is configured server-side via provider settings
    vector_db = client.vector_stores.create(
        name=VECTOR_DB_NAME,
        metadata={
            "embedding_model": embedding_model_id,
            "embedding_dimension": embedding_dimension,
            "provider_id": "milvus-remote"
        }
    )
    # IMPORTANT: Need to use vector store 'id' instead of logical name for ingestion and queries
    vector_db_id = vector_db.id
    rich.print(f"[yellow]Created new vector store:[/yellow] {vector_db_id}")

In [22]:
rich.print(f"Using inference model: {model_id}")
rich.print(f"Using embedding model: [red]{embedding_model_id}[/red] with dimension: {embedding_dimension}")
rich.print(f"Using vector store with ID: [red]{vector_db_id}[/red]")

## Document Ingestion using Docling

### Why Docling?
Docling is an advanced document converter that provides:
- **Intelligent PDF Parsing**: Extracts text, tables, and structure
- **OCR Capabilities**: Handles scanned documents and images
- **Table Extraction**: Preserves table structure and formatting
- **Better than Basic Extractors**: Maintains document hierarchy and context

### Document Sources
As an example, we will ingest Indian Bank financial documents from their official website:
- Financial results
- Presentations
- Notes and disclosures

> WARNING: Listing URLs manually like this should only be done during development and testing. For bulk ingestion of documents, use the KFP pipeline approach outlined in the previous notebook.

In [23]:
# URLs of sample Indian Bank financial documents to ingest
# These PDFs contain financial results, presentations, and notes
urls = [
     "https://indianbank.bank.in/wp-content/uploads/2025/10/Notes-forming-part-of-Reviewed-Financial-Results-for-September-2025.pdf",
     "https://indianbank.bank.in/wp-content/uploads/2025/10/Presentation-September-2025.pdf",
     "https://indianbank.bank.in/wp-content/uploads/2025/10/Reviewed-Financial-Results-Consolidated.pdf"
]

## Docling-Powered Document Ingestion (Two-Phase Approach)

To avoid HTTP timeout issues with large documents, we split ingestion into two phases:

### Phase 1: PDF → Markdown Conversion
- Download PDFs and convert using Docling
- Save markdown files locally to `/tmp/markdown/`
- This is CPU/GPU intensive, but it runs locally, so the conversion itself is not subject to gateway timeouts

### Phase 2: Markdown → Vector DB Ingestion  
- Read saved markdown files one by one
- Insert into vector DB with extended timeout
- Smaller, independent operations are more reliable

### Chunking Strategy
- **chunk_size_in_tokens**: Tokens per stored chunk (256 in Option A, 512 in Option B)
  - Balances context size with retrieval precision
  - Smaller chunks = more precise matches
  - Larger chunks = more context per match
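
The trade-off is easier to see with a small chunker. This sketch splits on whitespace, which only approximates the model tokenizer used server-side, and the `overlap` parameter is an assumption added for illustration; it is not a parameter used elsewhere in this notebook.

```python
def chunk_by_tokens(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace tokens.

    Consecutive chunks share `overlap` tokens so that a sentence cut at a
    boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_by_tokens(sample, chunk_size=4))             # fewer, non-overlapping chunks
print(chunk_by_tokens(sample, chunk_size=4, overlap=1))  # neighbouring chunks share one token
```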

> **NOTE**: This two-phase approach matches the pattern used in the KFP pipeline (`component_embed.py`) which processes pre-converted markdown files.

In [24]:
import os

# Create local directory for markdown files
markdown_dir = "/tmp/markdown"
os.makedirs(markdown_dir, exist_ok=True)

print("=" * 70)
print("PHASE 1: CONVERT PDFs TO MARKDOWN (Docling)")
print("=" * 70)
print(f"Output directory: {markdown_dir}")
print(f"Documents to process: {len(urls)}")
print()

# Phase 1: Convert all PDFs to Markdown and save locally
# This separates the compute-intensive conversion from network operations
converted_files = []

for idx, pdf_url in enumerate(urls, 1):
    print(f"\n[{idx}/{len(urls)}] Converting: {pdf_url.split('/')[-1]}")
    
    try:
        # Initialize docling converter
        converter = DocumentConverter()
        result = converter.convert(pdf_url)
        text_content = result.document.export_to_markdown()
        
        # Generate filename from URL
        filename = pdf_url.split('/')[-1].replace('.pdf', '.md')
        filepath = os.path.join(markdown_dir, filename)
        
        # Save markdown to local file
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(text_content)
        
        file_size = len(text_content)
        print(f"  [OK] Converted: {file_size:,} characters")
        print(f"  [OK] Saved to: {filepath}")
        
        converted_files.append({
            'path': filepath,
            'name': filename,
            'source': pdf_url,
            'size': file_size
        })
        
    except Exception as e:
        print(f"  [ERROR] Conversion failed: {e}")
        import traceback
        traceback.print_exc()

print(f"\n{'=' * 70}")
print(f"PHASE 1 COMPLETE: {len(converted_files)}/{len(urls)} files converted")
print("=" * 70)

PHASE 1: CONVERT PDFs TO MARKDOWN (Docling)
Output directory: /tmp/markdown
Documents to process: 3


[1/3] Converting: Notes-forming-part-of-Reviewed-Financial-Results-for-September-2025.pdf


INFO:docling.datamodel.document:detected formats: [<InputFormat.PDF: 'pdf'>]
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
INFO:docling.models.auto_ocr_model:Auto OCR model selected ocrmac.
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.pipeline.base_pipeline:Processing document Notes-forming-part-of-Reviewed-Financial-Results-for-September-2025.pdf
INFO:docling.document_converter:Finished converting document Notes-forming-part-of-Reviewed-Financial-Results-for-September-2025.pdf in 12.57 sec.


  [OK] Converted: 18,064 characters
  [OK] Saved to: /tmp/markdown/Notes-forming-part-of-Reviewed-Financial-Results-for-September-2025.md

[2/3] Converting: Presentation-September-2025.pdf


INFO:docling.datamodel.document:detected formats: [<InputFormat.PDF: 'pdf'>]
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
INFO:docling.models.auto_ocr_model:Auto OCR model selected ocrmac.
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.pipeline.base_pipeline:Processing document Presentation-September-2025.pdf
INFO:docling.document_converter:Finished converting document Presentation-September-2025.pdf in 57.92 sec.


  [OK] Converted: 77,560 characters
  [OK] Saved to: /tmp/markdown/Presentation-September-2025.md

[3/3] Converting: Reviewed-Financial-Results-Consolidated.pdf


INFO:docling.datamodel.document:detected formats: [<InputFormat.PDF: 'pdf'>]
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
INFO:docling.models.auto_ocr_model:Auto OCR model selected ocrmac.
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.pipeline.base_pipeline:Processing document Reviewed-Financial-Results-Consolidated.pdf
INFO:docling.document_converter:Finished converting document Reviewed-Financial-Results-Consolidated.pdf in 9.85 sec.


  [OK] Converted: 14,048 characters
  [OK] Saved to: /tmp/markdown/Reviewed-Financial-Results-Consolidated.md

PHASE 1 COMPLETE: 3/3 files converted


## Phase 2: Vector DB Ingestion Options

Choose ONE of the following approaches:

### Option A: Chunked Ingestion (Recommended for Large Documents)
- Splits large documents into ~20KB chunks
- Each chunk is inserted separately
- Avoids gateway timeout issues
- **Use this if you have large PDF documents (>30KB markdown)**

### Option B: Simple Direct Ingestion (Original Approach)
- Converts and inserts in one step without saving to disk
- Simpler code, faster for small documents
- **Use this only for small documents (<30KB) or if you have extended gateway timeouts**
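
The size rule above can be written as a small helper. This is only a sketch: `plan_ingestion` is a hypothetical name, sizes are measured in characters (as the notebook's own output does), and the insert count is a lower bound because Option A splits at paragraph boundaries and may therefore produce more pieces.

```python
import math

MAX_CHARS_PER_INSERT = 20_000  # payload size used by Option A in this notebook
DIRECT_INGEST_LIMIT = 30_000   # ~30KB cut-off quoted above, measured in characters

def plan_ingestion(markdown_text: str) -> dict:
    """Pick an ingestion option from the document size and estimate the number of inserts."""
    size = len(markdown_text)
    if size > DIRECT_INGEST_LIMIT:
        # Lower bound only: boundary-aware splitting can yield more pieces
        return {"option": "A", "min_inserts": math.ceil(size / MAX_CHARS_PER_INSERT)}
    return {"option": "B", "min_inserts": 1}

small_plan = plan_ingestion("x" * 14_048)
large_plan = plan_ingestion("x" * 77_560)
print(small_plan, large_plan)
```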


### Option A: Chunked Ingestion (from saved markdown files)


In [26]:
import time

print("=" * 70)
print("PHASE 2: INGEST MARKDOWN INTO VECTOR DB")
print("=" * 70)
print(f"Vector DB ID: {vector_db_id}")
print(f"Files to ingest: {len(converted_files)}")
print()

# Configuration for chunking large documents
# The gateway timeout is typically 30-60 seconds, so we need smaller payloads
MAX_CHARS_PER_INSERT = 20000  # ~20KB chunks to avoid gateway timeout
CHUNK_SIZE_TOKENS = 256       # Smaller chunks = faster embedding

def split_content(content: str, max_chars: int) -> list:
    """Split content into smaller chunks, trying to break at paragraph boundaries."""
    if len(content) <= max_chars:
        return [content]
    
    chunks = []
    remaining = content
    
    while remaining:
        if len(remaining) <= max_chars:
            chunks.append(remaining)
            break
        
        # Try to split at paragraph boundary (double newline)
        split_point = remaining[:max_chars].rfind('\n\n')
        if split_point < max_chars // 2:
            # No good paragraph break, try single newline
            split_point = remaining[:max_chars].rfind('\n')
        if split_point < max_chars // 2:
            # No good break point, just split at max_chars
            split_point = max_chars
        
        chunks.append(remaining[:split_point])
        remaining = remaining[split_point:].lstrip()
    
    return chunks

# Phase 2: Read markdown files and insert into vector DB
# Split large documents into smaller chunks to avoid gateway timeout
successful_files = 0
failed_files = 0
total_chunks = 0

for idx, file_info in enumerate(converted_files, 1):
    print(f"\n[{idx}/{len(converted_files)}] Ingesting: {file_info['name']}")
    print(f"  Total size: {file_info['size']:,} characters")
    
    try:
        # Read the markdown content
        with open(file_info['path'], 'r', encoding='utf-8') as f:
            content = f.read()
        
        # Split into smaller chunks if needed
        content_chunks = split_content(content, MAX_CHARS_PER_INSERT)
        print(f"  Split into {len(content_chunks)} chunk(s)")
        
        chunk_success = 0
        chunk_fail = 0
        
        for chunk_idx, chunk_content in enumerate(content_chunks, 1):
            chunk_id = f"{file_info['name']}_part{chunk_idx}" if len(content_chunks) > 1 else file_info['name']
            print(f"    Chunk {chunk_idx}/{len(content_chunks)}: {len(chunk_content):,} chars...", end=" ")
            
            try:
                # Create Document object for this chunk
                document = Document(
                    document_id=chunk_id,
                    content=chunk_content,
                    mime_type="text/markdown",
                    metadata={
                        "source": file_info['source'],
                        "filename": file_info['name'],
                        "chunk": chunk_idx,
                        "total_chunks": len(content_chunks)
                    }
                )
                
                # Insert with timeout
                client.tool_runtime.rag_tool.insert(
                    documents=[document],
                    vector_db_id=vector_db_id,
                    chunk_size_in_tokens=CHUNK_SIZE_TOKENS,
                    timeout=120.0  # 2 minutes per chunk
                )
                
                print("[OK]")
                chunk_success += 1
                total_chunks += 1
                
                # Small delay between chunks to avoid overwhelming the server
                time.sleep(0.5)
                
            except Exception as chunk_error:
                print(f"[FAIL] {str(chunk_error)[:50]}")
                chunk_fail += 1
        
        if chunk_fail == 0:
            print(f"  [OK] All {chunk_success} chunks ingested successfully")
            successful_files += 1
        else:
            print(f"  [PARTIAL] {chunk_success} chunks OK, {chunk_fail} failed")
            if chunk_success > 0:
                successful_files += 1  # Count as partial success
            else:
                failed_files += 1
        
    except Exception as e:
        print(f"  [ERROR] File processing failed: {e}")
        import traceback
        traceback.print_exc()
        failed_files += 1

print(f"\n{'=' * 70}")
print("PHASE 2 COMPLETE: INGESTION SUMMARY")
print("=" * 70)
print(f"Files processed: {len(converted_files)}")
print(f"Successful: {successful_files}")
print(f"Failed: {failed_files}")
print(f"Total chunks ingested: {total_chunks}")
print(f"Chunk size: {MAX_CHARS_PER_INSERT:,} chars / {CHUNK_SIZE_TOKENS} tokens")
print("=" * 70)

if failed_files > 0:
    print(f"\n[WARNING] {failed_files} files had issues!")
else:
    print(f"\n[SUCCESS] All files ingested successfully!")


PHASE 2: INGEST MARKDOWN INTO VECTOR DB
Vector DB ID: vs_1dccb591-51db-4402-bb51-fc910d68bd5e
Files to ingest: 3


[1/3] Ingesting: Notes-forming-part-of-Reviewed-Financial-Results-for-September-2025.md
  Total size: 18,064 characters
  Split into 1 chunk(s)
    Chunk 1/1: 18,064 chars... [OK]
  [OK] All 1 chunks ingested successfully

[2/3] Ingesting: Presentation-September-2025.md
  Total size: 77,560 characters
  Split into 5 chunk(s)
    Chunk 1/5: 19,983 chars... [OK]
    Chunk 2/5: 17,668 chars... [OK]
    Chunk 3/5: 16,673 chars... [OK]
    Chunk 4/5: 17,549 chars... [OK]
    Chunk 5/5: 5,679 chars... [OK]
  [OK] All 5 chunks ingested successfully

[3/3] Ingesting: Reviewed-Financial-Results-Consolidated.md
  Total size: 14,048 characters
  Split into 1 chunk(s)
    Chunk 1/1: 14,048 chars... [OK]
  [OK] All 1 chunks ingested successfully

PHASE 2 COMPLETE: INGESTION SUMMARY
Files processed: 3
Successful: 3
Failed: 0
Total chunks ingested: 7
Chunk size: 20,000 chars / 256 tokens

### Option B: Simple Direct Ingestion (Original Approach)

⚠️ **Warning**: This approach may time out for large documents. Use Option A for documents >30KB.


In [None]:
# OPTION B: Simple Direct Ingestion (Original Approach)
# Converts PDFs and inserts directly without saving to intermediate files
# WARNING: May time out for large documents (>30KB)

print("=" * 70)
print("SIMPLE INGESTION: CONVERT AND INSERT DIRECTLY")
print("=" * 70)
print(f"Vector DB ID: {vector_db_id}")
print(f"Documents to process: {len(urls)}")
print()

successful = 0
failed = 0

for idx, pdf_url in enumerate(urls, 1):
    filename = pdf_url.split('/')[-1]
    print(f"\n[{idx}/{len(urls)}] Processing: {filename}")
    
    try:
        # Step 1: Convert PDF to Markdown using Docling
        print(f"  Converting with Docling...")
        converter = DocumentConverter()
        result = converter.convert(pdf_url)
        text_content = result.document.export_to_markdown()
        print(f"  [OK] Converted: {len(text_content):,} characters")
        
        # Step 2: Create Document object
        document = Document(
            document_id=filename,
            content=text_content,
            mime_type="text/markdown",
            metadata={"source": pdf_url}
        )
        
        # Step 3: Insert into vector DB
        print(f"  Inserting into vector DB...")
        client.tool_runtime.rag_tool.insert(
            documents=[document],
            vector_db_id=vector_db_id,
            chunk_size_in_tokens=512,
            timeout=300.0  # 5 minutes timeout
        )
        
        print(f"  [OK] Successfully ingested")
        successful += 1
        
    except Exception as e:
        print(f"  [ERROR] Failed: {e}")
        import traceback
        traceback.print_exc()
        failed += 1

print(f"\n{'=' * 70}")
print("INGESTION COMPLETE")
print("=" * 70)
print(f"Total: {len(urls)}")
print(f"Successful: {successful}")
print(f"Failed: {failed}")
print("=" * 70)


### a. Manual RAG Search

**Approach**: Direct control over retrieval and generation steps.

**Process**:
1. Query the vector database for relevant chunks
2. Format retrieved chunks as context
3. Build a prompt with query + context
4. Call LLM to generate answer

**Advantages**:
- Full control over retrieval parameters
- Customizable prompt templates
- Easy to debug and inspect intermediate steps

**Use Cases**: When you need fine-grained control over the RAG pipeline

#### Step 1: Retrieving Relevant Chunks

**Query Configuration Parameters**:
- **query_generator_config**: How to process the query
  - `type: "default"`: Standard query processing
  - `separator: " "`: String used to join multi-part query content into a single search string
- **max_tokens_in_context**: Maximum total tokens from retrieved chunks (4096)
- **max_chunks**: Number of chunks to retrieve (5)
- **chunk_template**: Format for each chunk in the response
- **mode: "vector"**: Use vector similarity search (semantic search)
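
One way to picture how `chunk_template`, `max_chunks`, and `max_tokens_in_context` combine is the sketch below. It is an illustration of the idea, not LlamaStack's server-side logic: the `Chunk` dataclass and `format_results` helper are hypothetical, and tokens are approximated by whitespace-separated words.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Minimal stand-in for a retrieved chunk."""
    content: str
    metadata: dict = field(default_factory=dict)

def format_results(chunks: list[Chunk], chunk_template: str,
                   max_chunks: int, max_tokens_in_context: int) -> str:
    """Render up to max_chunks chunks with the template, stopping at the token budget."""
    parts = []
    used_tokens = 0
    for index, chunk in enumerate(chunks[:max_chunks], start=1):
        n_tokens = len(chunk.content.split())  # rough token estimate
        if used_tokens + n_tokens > max_tokens_in_context:
            break
        # str.format resolves {chunk.content} through attribute access
        parts.append(chunk_template.format(index=index, chunk=chunk, metadata=chunk.metadata))
        used_tokens += n_tokens
    return "".join(parts)

template = "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n"
retrieved = [
    Chunk("alpha beta", {"source": "a.md"}),
    Chunk("gamma", {"source": "b.md"}),
    Chunk("delta", {"source": "c.md"}),
]

context_text = format_results(retrieved, template, max_chunks=2, max_tokens_in_context=4096)
print(context_text)
```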

In [27]:
# User query about Indian Bank shareholding
query = "As per the documents, tell me about Percentage of shares held by Government of India in Indian bank "

# Query the vector database for relevant document chunks
# This performs semantic search to find chunks most similar to the query
response = client.tool_runtime.rag_tool.query(
        vector_db_ids=[vector_db_id],  # Which vector database(s) to search
        content=query,  # The user's query/question
        query_config={
            # Query generation configuration
            "query_generator_config": {
                "type": "default",  # Standard query processing
                "separator": " "  # String used to join multi-part query content
            },
            # Maximum total tokens from all retrieved chunks
            # Prevents exceeding LLM context window limits
            "max_tokens_in_context": 4096,
            # Maximum number of chunks to retrieve
            # More chunks = more context but potentially less focused
            "max_chunks": 5,
            # Template for formatting each retrieved chunk
            # {index}: Chunk number, {chunk.content}: Text content, {metadata}: Document metadata
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
            # Search mode: "vector" uses semantic similarity search
            # Alternative: "keyword" for keyword-based search
            "mode": "vector"
        },
    )
rich.print(response)

#### Step 2: Complete RAG Pipeline Function

This function combines retrieval and generation:
1. **Retrieve**: Get relevant chunks from vector DB
2. **Format**: Combine chunks into context string
3. **Augment**: Add context to prompt
4. **Generate**: Call LLM with augmented prompt
5. **Return**: Final answer text

**Prompt Engineering**:
- Instructs LLM to only use provided context
- Handles cases where answer isn't in context
- Clear separation between question and context

In [28]:
def rag_pipeline(question: str) -> str:
    """
    Complete RAG pipeline: Retrieve relevant chunks and generate answer.
    
    Args:
        question: User's question to answer
        
    Returns:
        Final answer text generated by LLM based on retrieved context
    """
    # Step 1: Retrieve relevant chunks via RAG tool
    # This performs semantic search in the vector database
    response = client.tool_runtime.rag_tool.query(
        vector_db_ids=[vector_db_id],
        content=question,
        query_config={
            "query_generator_config": {
                "type": "default",
                "separator": " "
            },
            "max_tokens_in_context": 4096,
            "max_chunks": 5,
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
            "mode": "vector"
        },
    )

    # 2. Extract plain text from retrieved chunks
    #    (response.content is a list of content items; each item has .text)
    rag_text_chunks = []
    for item in response.content:
        # Depending on the client version, this may be item.text or item["text"]
        rag_text_chunks.append(str(item.text))

    context = "\n\n".join(rag_text_chunks)

    # 3. Build a prompt that includes both the question and the retrieved context
    prompt = f"""You are a question-answering assistant.
Answer the question ONLY using the context provided. 
If the answer is not in the context, respond with 'I don't know'.

<question>
{question}
</question>

<context>
{context}
</context>
"""

    # 4. Ask the LLM to generate an answer using that context (LlamaStack 0.3.0 API)
    completion = client.chat.completions.create(
        model=model_id,   # use your registered model id here
        messages=[{"role": "user", "content": prompt}],
    )

    # 5. Return the answer text (OpenAI-compatible response structure)
    return completion.choices[0].message.content


# Test the RAG pipeline
answer = rag_pipeline(question=query)
rich.print(answer)

### b. Using File Search API

**Approach**: Simplified RAG via LlamaStack's Responses API.

**Process**:
1. Single API call handles retrieval + generation
2. LlamaStack manages chunking, retrieval, and prompt construction
3. Returns final answer directly

**Advantages**:
- Simpler code (one API call)
- Less configuration needed
- Built-in optimizations

**Use Cases**: When you want a working RAG answer quickly, without hand-tuning retrieval parameters or prompt templates
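
A Responses-style result usually exposes the final answer through an `output_text` convenience attribute. If that attribute is missing or empty, the text can be collected from the `output` items instead. The helper below is a defensive sketch that assumes the OpenAI-compatible shape (message items whose content parts have type `output_text`); `extract_output_text` is a hypothetical name, and the fake response built with `SimpleNamespace` only exists so the example runs without a server.

```python
from types import SimpleNamespace

def extract_output_text(response) -> str:
    """Return the answer text from a Responses-style object, tolerating missing fields."""
    text = getattr(response, "output_text", None)
    if text:
        return text

    parts = []
    for item in getattr(response, "output", None) or []:
        if getattr(item, "type", None) != "message":
            continue  # skip tool-call items such as file search results
        content = getattr(item, "content", None) or []
        if isinstance(content, str):
            parts.append(content)
            continue
        for part in content:
            if getattr(part, "type", None) == "output_text":
                parts.append(part.text)
    return "".join(parts)

# Fake response object for demonstration (assumed shape, placeholder text)
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(type="file_search_call"),
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(type="output_text", text="Placeholder answer.")],
        ),
    ]
)

answer_text = extract_output_text(fake_response)
print(answer_text)
```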

In [29]:
# Same query as before
query = "As per the documents, tell me about Percentage of shares held by Government of India in Indian bank "

# Use LlamaStack's Responses API for simplified RAG
# This API handles retrieval + generation in a single call
response = client.responses.create(
    model=model_id,  # LLM model to use for generation
    input=query,  # User's question
    tools=[
        {
            "type": "file_search",  # Built-in RAG tool type
            # vector_store_ids: Which vector databases to search
            # The API will automatically:
            # 1. Retrieve relevant chunks
            # 2. Format them as context
            # 3. Generate answer using LLM
            "vector_store_ids": [vector_db_id],
        }
    ],
)
# Extract the output text from the response
print("Responses API result:", getattr(response, "output_text", response))

Responses API result: Based on the documents provided, the Government of India holds 73.84% of the shares in Indian Bank. This information can be found in the 'Notes forming part of Reviewed Financial Results' issued by Indian Bank for the quarter ending September 30, 2025 (). Additionally, this shareholding pattern remains consistent from the year preceding through to June 30, 2025, and as per the latest information, it remains 73.84% as of September 30, 2025. 

In summary, as per Indian Bank's reviewed financial results and shareholding details, the Government of India owns 73.84% of the bank's shares.


### c. Using RAG Agent

**Approach**: Agent-based RAG with file_search tool (LlamaStack 0.3.0 API).

**Process**:
1. Create an Agent with `file_search` tool (OpenAI-compatible format)
2. Agent automatically searches vector stores for relevant documents
3. Agent reasons about the query and generates answer using retrieved context

**Advantages**:
- Agent can reason about when to use RAG
- Can combine with other tools (web_search, custom functions)
- More flexible and extensible
- Uses OpenAI-compatible tool format

**Use Cases**: When building complex systems that need multiple tools and reasoning

In [33]:
# User query
query = "As per the documents, tell me about Percentage of shares held by Government of India in Indian bank"

def agent_qa(user_question: str) -> str:
    """
    RAG using Agent with built-in file_search tool.
    
    In LlamaStack 0.3.0, the Agent uses OpenAI-compatible tools format.
    The file_search tool automatically:
    - Searches vector stores for relevant chunks
    - Retrieves and ranks relevant documents
    - Provides context to the LLM for answer generation
    
    Args:
        user_question: User's question to answer
        
    Returns:
        Final answer from the agent
    """
    # Create an Agent with file_search capability (0.3.0 API)
    # Uses OpenAI-compatible tool format instead of builtin::rag/knowledge_search
    agent = Agent(
        client,  # LlamaStack client
        model=model_id,  # LLM model for the agent
        # Instructions guide the agent's behavior
        # "Answer strictly based on retrieved documents" prevents hallucination
        instructions="You are a helpful assistant. Answer strictly based on retrieved documents.",
        tools=[
            {
                # file_search: OpenAI-compatible RAG tool (replaces builtin::rag/knowledge_search)
                # Automatically searches vector stores and retrieves relevant chunks
                "type": "file_search",
                # vector_store_ids: Which vector stores to search
                "vector_store_ids": [vector_db_id]
            }
        ],
    )
    
    # Create a session for this conversation
    # Sessions maintain conversation history and context
    session_id = agent.create_session("web-session")
    
    # Create a turn (one interaction) in the conversation
    response = agent.create_turn(
        messages=[
            {
                "role": "user",  # User message
                "content": user_question,  # The question
            }
        ],
        session_id=session_id,  # Associate with this session
        stream=False,  # Get complete response (not streaming)
    )

    # Extract the response text (0.3.0 API uses output_text property)
    # ResponseObject.output_text combines all output text from the response
    return response.output_text

# Test the agent-based RAG
answer = agent_qa(user_question=query)
rich.print(answer)

### Limitations of Pure RAG

**Problem**: RAG only searches ingested documents. It cannot answer questions about:
- Real-time information (current stock prices, latest news)
- Information not in the document corpus
- Dynamic data that changes frequently

**Example**: Asking about "latest stock price" when documents only contain historical financial data.

**Solution**: In the next notebook, you will combine RAG with other tools (web search, APIs) using Agentic AI.

In [34]:
# Example query that RAG cannot answer (requires real-time data)
query = "can you tell me about Indian bank's stock latest price?"

# This will fail or give an incomplete answer because:
# 1. Documents contain historical financial data, not real-time prices
# 2. Stock prices change constantly and aren't in static documents
# 3. RAG can only retrieve from ingested documents
rag_pipeline(question=query)

"I don't know. The provided context does not contain the latest stock price for Indian Bank."