# Build a self-RAG agent with IBM Granite LLMs: A practical guide

[Large language models (LLMs)](https://www.ibm.com/think/topics/large-language-models) have remarkable text generation and reasoning abilities but often produce factual inaccuracies or [hallucinations](https://www.ibm.com/think/topics/ai-hallucinations) due to their reliance on internal knowledge. [Retrieval augmented generation](https://www.ibm.com/think/topics/retrieval-augmented-generation) (RAG) based solutions aim to resolve this by injecting external documents into the model’s context. However, traditional [RAG approaches](https://www.ibm.com/think/topics/rag-techniques) retrieve a fixed number of passages regardless of their necessity or quality, leading to redundancy, inefficiency, and inconsistent factual grounding. 

The self-RAG framework provides a practical solution to this problem. It retrieves information on demand by using special control tokens that decide when and how to perform retrieval during generation. Unlike agentic or multi-agent approaches that coordinate multiple models or components, self-RAG is a model-centric framework in which a single model, trained end-to-end, manages retrieval decisions, generation, and critique internally. Its self-critique process is a structured step where the model evaluates both its own output and the quality of the retrieved information, allowing it to adapt its retrieval behavior through self-reflection tokens. The result is more efficient, factual, and controllable text generation. This method was introduced in the paper *Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection* (2024), which explores how [fine-tuning](https://www.ibm.com/think/topics/fine-tuning) LLMs for self-evaluation can improve factual consistency in [natural language processing](https://www.ibm.com/think/topics/natural-language-processing) (NLP) tasks.


## How self-RAG works

The workflow of self-RAG is orchestrated by special reflection tokens that the model generates alongside its text output, making the entire inference process dynamic and controllable. When additional information is needed, a retriever component fetches relevant external passages. A single LLM takes on both the generator and critic roles: it uses reflection tokens to evaluate and refine its own generation during inference. This architecture represents a broader trend in [artificial intelligence](https://www.ibm.com/think/artificial-intelligence) (AI) toward models capable of introspection and dynamic reasoning, bridging advances in prompt engineering and long-form generation.

### 1. On-demand retrieval

The LLM first generates a retrieval token to determine whether external factual information is necessary for the query. If the model concludes that retrieval is not necessary, it skips the remaining retrieval-based steps and continues with standard generation. If the retrieval token is decoded as “yes”, a retriever is called to fetch a set of relevant passages from an external knowledge base. This step makes sure that retrieval occurs only when its expected utility is high.

### 2. Passage retrieval and generation

If [retrieval](https://www.ibm.com/think/topics/information-retrieval) is required, the retriever fetches relevant passages from an external knowledge base. The LLM processes the input together with each retrieved passage in parallel and generates a candidate text continuation for each passage.
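
The control flow of steps 1 and 2 can be sketched in a few lines. This is only an illustration under stated assumptions: `wants_retrieval`, `retrieve`, and `generate` are hypothetical callables that stand in for the model's retrieval-token decision, the retriever, and the generator. They are not names from the self-RAG paper or from any library.

```python
from typing import Callable, List


def candidate_continuations(
    query: str,
    wants_retrieval: Callable[[str], bool],  # hypothetical: decodes the retrieval token
    retrieve: Callable[[str], List[str]],    # hypothetical: returns relevant passages
    generate: Callable[[str, str], str],     # hypothetical: (query, passage) -> text
) -> List[str]:
    """Return one continuation per passage, or a single ungrounded continuation."""
    if not wants_retrieval(query):
        # Retrieval token decoded as "no": fall back to standard generation.
        return [generate(query, "")]
    # Retrieval token decoded as "yes": one candidate continuation per passage.
    return [generate(query, passage) for passage in retrieve(query)]


# Toy stand-ins so the sketch runs without any model.
demo = candidate_continuations(
    "What does the guideline cover?",
    wants_retrieval=lambda q: q.endswith("?"),
    retrieve=lambda q: ["passage A", "passage B"],
    generate=lambda q, p: f"answer grounded in [{p}]" if p else "answer from memory",
)
print(demo)
```

Each candidate is later judged with the critique tokens described in the next step, so producing one continuation per passage keeps the evidence for every candidate traceable.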

### 3. Generate and reflect on retrieved passages

For each segment generated, the model concurrently generates special critique tokens that are embedded directly within the output sequence. These tokens are not separate evaluations; they appear as part of the generated sequence and help the model check its own work as it goes:

**ISREL (Relevance):** Assesses the usefulness of the retrieved passage.

**ISSUP (Support/Factuality):** Evaluates if the generated text segment has whole, partial, or no factual support from the source material.

**ISUSE (Utility):** Evaluates the created segment's overall quality, usefulness, and structure.
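
Because the critique tokens are embedded in the output, downstream code has to separate them from the answer text. The following is a minimal sketch that assumes an inline tag format such as `[ISREL: relevant]`. That format, the pattern, and the function name `split_reflection_tokens` are assumptions made for illustration; the paper's fine-tuned model uses its own special-token vocabulary.

```python
import re
from typing import Dict, Tuple

# Assumed tag format for illustration: "[ISREL: relevant] [ISSUP: partially] [ISUSE: 4]".
TOKEN_PATTERN = re.compile(r"\[(ISREL|ISSUP|ISUSE):\s*([^\]]+)\]", re.IGNORECASE)


def split_reflection_tokens(segment: str) -> Tuple[str, Dict[str, str]]:
    """Separate the plain text of a segment from its inline critique tokens."""
    tokens = {
        name.upper(): value.strip().lower()
        for name, value in TOKEN_PATTERN.findall(segment)
    }
    # Remove the tags, then collapse the whitespace they leave behind.
    text = re.sub(r"\s+", " ", TOKEN_PATTERN.sub("", segment)).strip()
    return text, tokens


text, tokens = split_reflection_tokens(
    "Sponsors must document risk assessments. [ISREL: Relevant] [ISSUP: partially] [ISUSE: 4]"
)
print(text)
print(tokens)
```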


### 4. Inference

During inference, reflection tokens are used to decide whether to retrieve information. This lets the model adjust to different tasks, such as retrieving less for creative activities and more for factual ones. When generating text, reflection tokens also help the model adhere to particular guidelines. They either provide clear boundaries or guide word choice, which makes the model's responses more flexible and appropriate for various contexts.
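
One way to picture this control is a retrieval threshold combined with a weighted critique score for ranking candidate segments. The sketch below follows the spirit of the paper's inference-time scoring, but the function names, weights, and probabilities are illustrative assumptions, not values from the paper or from this tutorial's code.

```python
from typing import Dict


def should_retrieve(p_yes: float, p_no: float, threshold: float = 0.5) -> bool:
    """Retrieve when the normalized probability of the 'yes' retrieval token exceeds the threshold."""
    total = p_yes + p_no
    return total > 0 and (p_yes / total) > threshold


def segment_score(gen_prob: float, critique: Dict[str, float], weights: Dict[str, float]) -> float:
    """Generation probability plus a weighted sum of critique scores, each assumed to lie in [0, 1]."""
    return gen_prob + sum(weights[name] * critique.get(name, 0.0) for name in weights)


# A lower threshold retrieves more often (factual tasks);
# a higher threshold retrieves less often (creative tasks).
factual_mode = should_retrieve(0.4, 0.6, threshold=0.3)
creative_mode = should_retrieve(0.4, 0.6, threshold=0.7)

# Illustrative weights: support and relevance count more than general utility.
weights = {"ISREL": 1.0, "ISSUP": 1.0, "ISUSE": 0.5}
candidates = {
    "a": segment_score(0.6, {"ISREL": 0.9, "ISSUP": 0.2, "ISUSE": 0.8}, weights),
    "b": segment_score(0.5, {"ISREL": 0.8, "ISSUP": 0.9, "ISUSE": 0.6}, weights),
}
best = max(candidates, key=candidates.get)
print(factual_mode, creative_mode, best)
```

Raising the weight on ISSUP favors well-supported segments even when their raw generation probability is lower, which is how the same model can be steered toward stricter factual grounding without retraining.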

### 5. Training the self-RAG

During training, reflection tokens are inserted into the [training data](https://www.ibm.com/think/topics/training-data) based on evaluations made by a separate critic model. This approach keeps self-RAG training efficient: the generator learns to judge its own outputs and to decide when it actually needs to look up information. As a result, the model becomes better at producing accurate, controlled, and high-quality responses.

In the experiments reported in the paper, self-RAG outperforms many standard retrieval-augmented and instruction-tuned baselines across various tasks, including open-domain [question answering](https://www.ibm.com/think/topics/question-answering), reasoning, and fact verification. It improves factuality and citation accuracy by using self-reflection tokens and on-demand retrieval, and the paper reports that it outperforms ChatGPT on several of these tasks.

In this tutorial, you'll learn how to build a robust self-reflective [RAG agent](https://www.ibm.com/think/topics/agentic-rag) by using [IBM Granite® models](https://www.ibm.com/granite) on watsonx and [LangGraph](https://www.ibm.com/think/topics/langgraph). Similar frameworks and tools, such as [ChatGPT](https://www.ibm.com/think/topics/chatgpt), Llama 2, LlamaIndex or LangChain, also enable complex RAG flows. However, this tutorial focuses on the multi-modal models available through IBM. These models understand both text and images, and their enterprise-grade design supports secure deployment, governance, and scalability. These features make Granite well-suited for building reliable, production-ready RAG systems that can handle complex data and maintain high standards of trust and performance.


## Use case: Building a self-RAG query agent over multi-modal documents

This tutorial demonstrates how to build a self-RAG agent designed to answer complex, multi-faceted queries over internal knowledge bases that include both text and visual data. The agent analyzes PDF documents that include technical guidelines and survey data. The tutorial guides you through implementing the self-RAG algorithm, which:

**Creates a multi-modal knowledge base:** Uses a language model (granite-3-3-8b-instruct) and vision LLM (granite-vision-3.3-2B) to extract text and images from PDFs, generate descriptive captions, and create embeddings for both text and image data to enable semantic retrieval.

**Generates and reflects:** Creates an answer segment, adds reflection tokens (such as ISREL, ISSUP, and ISUSE), and evaluates its own output quality and factual accuracy.

**Executes self-correction:** The LangGraph workflow extends the standard self-RAG approach by using a critique score derived from reflection tokens to guide its next steps. When the score is low, the agent requests stronger context and retrieves more relevant information before generating the next segment, helping produce a higher-quality final output.

**Provides segmented answers:** Provides thorough and traceable responses by generating complex answers in a sequence of factually validated chunks.


## Prerequisites

You need an [IBM Cloud® account](https://cloud.ibm.com/registration?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-trial) to create a [watsonx.ai®](https://www.ibm.com/products/watsonx-ai?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-implement-xgboost-in-python&cm_sp=ibmdev-_-developer-_-product) project. Ensure that you have access to both your watsonx API Key and Project ID.

## Steps

### Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.

1.	Log in to [watsonx.ai](https://dataplatform.cloud.ibm.com/registration/stepone?context=wx&apps=all) by using your IBM Cloud account.

2.	Create a [watsonx.ai project](https://www.ibm.com/docs/en/watsonx/saas?topic=projects-creating-project). You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.

3.	Create a [Jupyter Notebook](https://www.ibm.com/docs/en/watsonx/saas?topic=editor-creating-managing-notebooks).

This step opens a notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the [IBM Granite Community](https://github.com/ibm-granite-community). This tutorial is also available on [GitHub](https://github.com/IBM/ibmdotcom-tutorials).

**Note:** You can run the multi-modal self-RAG tutorial entirely on a local CPU system. This is achievable by adapting the setup to use local resources instead of remote cloud services. You can initialize the Granite instruct model (the 3.2 2B version) directly from Hugging Face using the appropriate `transformers` steps. For data handling, save your PDF files directly on your local system and easily read them into your Jupyter Notebook environment using their local file path, completely bypassing the need for IBM Cloud Object Storage. To handle the complex reasoning and self-critique, the larger remote Granite 3.3-8B model can be replaced by a powerful open-source LLM hosted locally using a dedicated server setup. This setup requires installing specific local Python dependencies, such as `langgraph`, `faiss-cpu`, `sentence-transformers`, and `pymupdf` for the vector store, RAG logic, embeddings, and PDF parsing, respectively. Models can be configured for efficient CPU operation by explicitly setting the device to "cpu" and adjusting the floating-point data type to manage memory usage and prevent crashes common with large models on typical desktop hardware.

### Step 2. Set up watsonx.ai runtime service and API key

1.	Create a [watsonx.ai Runtime](https://cloud.ibm.com/catalog/services/watsonxai-runtime) service instance (choose the Lite plan, which is a free instance).

2.	Generate an application programming interface [(API) Key](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-authentication.html).

3.	Associate the watsonx.ai Runtime service to the project that you created in [watsonx.ai](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/assoc-services.html?context=cpdaas).


### Step 3. Installation of the packages

To build and orchestrate this multi-modal self-reflective RAG agent, we need a broad set of libraries. Install `langgraph` to define the core state machine that orchestrates the self-correction loop based on critique ratings. To integrate IBM Granite LLMs and [embeddings](https://www.ibm.com/think/topics/embedding) from the watsonx platform, install `langchain-ibm` and `ibm-watsonx-ai`. For fast retrieval, install `faiss-cpu`, which provides the index behind the vector store. We use the deep learning library `torch` and the [Hugging Face](https://www.ibm.com/think/topics/hugging-face) `transformers` library to load and run the granite-vision-3.3-2B model. To extract and process the text and images from our PDF documents, we use `pymupdf` and `pillow`. Lastly, `ibm-cos-sdk` is included to access raw data from Cloud Object Storage.

In [None]:
# Install packages

!pip install -U "transformers>=4.50.0" "huggingface_hub>=0.26.2" \
  torch torchvision torchaudio \
  langgraph faiss-cpu Pillow requests tqdm pymupdf pydantic \
  langchain-ibm langchain-community langchain-text-splitters \
  ibm-watsonx-ai ibm-cos-sdk sentence-transformers

print("Required packages installed.")

**Note:** No GPU is required, but execution can be slower on CPU-based systems. 

### Step 4. Import required libraries

Next, import all the necessary modules to set up the fundamental tools for managing the multi-modal components, processing documents, coordinating the RAG [workflow](https://www.ibm.com/think/topics/agentic-workflows), and connecting to IBM watsonx.

In [None]:
# Core Libraries Import

import os
import getpass
import torch
import re
from pathlib import Path
from typing import List, Dict, Any, TypedDict
import gc

# LangGraph / LangChain Core
from langgraph.graph import StateGraph, END, START
from langchain_core.documents import Document
from langchain_ibm import WatsonxLLM, WatsonxEmbeddings
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames

# Vector store + text utilities
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# General utils
from tqdm import tqdm
from PIL import Image
import fitz # PyMuPDF for PDFs
import numpy as np
import io # Ensure io is imported

print("Core libraries imported successfully.")
# Set up device for Vision Model
HF_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"PyTorch device: {HF_DEVICE}")

Core libraries imported successfully.
PyTorch device: cuda


**Multi-modal context:** This tutorial uses a vision model and libraries like `fitz` to process both text and visual data into a unified context. This surpasses simple text-based RAG by enabling the agent to retrieve richer information and provide highly accurate answers derived from complex documents.

**Self-correction loop:** The system utilizes LangGraph (StateGraph) to build a self-reflective RAG agent. This allows the LLM to critique its own output for relevance and accuracy, and then automatically initiate a correction cycle by querying the vector store or refining the prompt, minimizing hallucinations.

**Production-ready integration:** The tutorial combines enterprise LLMs (such as Granite) accessed via an external [API](https://www.ibm.com/think/topics/api) (or Hugging Face, depending on the setup) with efficient vector storage (FAISS) and streamlined RAG logic, showing how the pieces fit together for real-world deployment.

### Step 5. Load watsonx credentials

This step prepares your environment to securely connect to the IBM watsonx platform so that you can use the hosted Granite LLMs and embeddings.

In [None]:
# Load Watsonx Credentials

WML_URL = "https://us-south.ml.cloud.ibm.com"

# Securely input Watsonx credentials
WML_API_KEY = getpass.getpass("Enter Watsonx API Key: ")
PROJECT_ID = input("Enter Watsonx Project ID: ")

# Set environment variables for langchain-ibm
os.environ["WATSONX_APIKEY"] = WML_API_KEY
os.environ["WATSONX_PROJECT_ID"] = PROJECT_ID

print(" Watsonx credentials loaded.")

Enter Watsonx API Key:  ········
Enter Watsonx Project ID:  4d09cb34-ffa1-4097-be01-79e1ac1f5173


 Watsonx credentials loaded.
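
If you rerun the notebook often, typing the credentials each time becomes tedious. A possible variation, sketched below, reads each value from an environment variable first and prompts only when it is missing. The helper name `get_credential` is hypothetical; the environment variable names are the ones set in the cell above.

```python
import os
from getpass import getpass
from typing import Callable


def get_credential(env_name: str, prompt: str, ask: Callable[[str], str] = getpass) -> str:
    """Return the value of env_name if it is set, otherwise prompt for it."""
    value = os.environ.get(env_name, "").strip()
    if not value:
        # By default getpass is used, so the typed value is not echoed.
        value = ask(prompt).strip()
    if not value:
        raise ValueError(f"{env_name} is required but was not provided.")
    return value


# Usage in the notebook (commented out so the sketch does not block on input):
# WML_API_KEY = get_credential("WATSONX_APIKEY", "Enter Watsonx API Key: ")
# PROJECT_ID = get_credential("WATSONX_PROJECT_ID", "Enter Watsonx Project ID: ", ask=input)
```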


### Step 6. Initialize models

This critical step configures the three distinct models required for our multi-modal self-RAG agent.

In [None]:
# Initialize Models

from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download

# LLM: Granite-3-3-8B-Instruct (Generator & Critic)
qa_llm = WatsonxLLM(
    model_id="ibm/granite-3-3-8b-instruct",
    url=WML_URL,
    apikey=WML_API_KEY,
    project_id=PROJECT_ID,
    params={
        GenTextParamsMetaNames.MAX_NEW_TOKENS: 512, 
        GenTextParamsMetaNames.TEMPERATURE: 0.1,   
        GenTextParamsMetaNames.TOP_P: 0.9,
        GenTextParamsMetaNames.REPETITION_PENALTY: 1.05,
    },
)
print("Granite-3-3-8B-Instruct initialized for reasoning, QA, and self-critique.")

# Embedding Model: Granite-embedding-278m-multilingual
embeddings_model = WatsonxEmbeddings(
    model_id="ibm/granite-embedding-278m-multilingual",
    url=WML_URL,
    apikey=WML_API_KEY,
    project_id=PROJECT_ID,
)
print("Granite-embedding-278m-multilingual initialized for retrieval.")

# Vision Model: Granite-Vision-3.3-2B
try:
    print("Loading Granite Vision model in Bfloat16 for memory efficiency...")
    vision_model_id = "ibm-granite/granite-vision-3.3-2b"

    hf_processor = AutoProcessor.from_pretrained(vision_model_id)
    
    
    hf_vision_model = AutoModelForVision2Seq.from_pretrained(
        vision_model_id, 
        torch_dtype=torch.bfloat16 # <--- Saves ~50% VRAM on model weights
    ).to(HF_DEVICE)
    hf_vision_model.eval()

    print("Granite-Vision-3.3-2B initialized successfully with Bfloat16.")
except Exception as e:
    print(f"Vision model load failed: {e}")

print(f"Device available: {HF_DEVICE}")
print("All Watsonx + Vision models ready.")

Granite-3-3-8B-Instruct initialized for reasoning, QA, and self-critique.
Granite-embedding-278m-multilingual initialized for retrieval.
Loading Granite Vision model in Bfloat16 for memory efficiency...


Granite-Vision-3.3-2B initialized successfully with Bfloat16.
Device available: cuda
All Watsonx + Vision models ready.


This configuration will:

Initialize the **granite-3-3-8B-instruct** model to function as both the primary generator and the self-critic by producing the reflection tokens (ISREL, ISSUP, and ISUSE). For the self-critique loop, the parameters (a low temperature of 0.1 and a mild repetition penalty) are chosen to favor factual, stable, near-deterministic answers.

Initialize the **granite-embedding-278m-multilingual** model. This model generates the textual embeddings essential for efficient semantic search and retrieval in the FAISS vector store.

Load the **granite-vision-3.3-2B** model locally using the transformers library. This model creates text captions for images extracted from PDF documents.

### Step 7. PDF data retrieval from cloud object storage

This step securely retrieves the source dataset from IBM Cloud Object Storage into the memory of your execution environment. This is necessary before any text splitting or multi-modal analysis can begin. We have uploaded two PDF files to a storage bucket for this tutorial.

In [None]:
# PDF retrieval from IBM Cloud Object Storage
import io
from botocore.client import Config
import ibm_boto3

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='your_api_key_id',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.direct.us-south.cloud-object-storage.appdomain.cloud')

bucket = 'bucket_key'
pdf_keys = [
    'ICH_E6(R3)_Guideline.pdf',
    'inspection_survey.pdf'
]

def read_cos_pdf(bucket, key):
    """Read a PDF from IBM COS into bytes (streamed in chunks)."""
    print(f" Downloading {key} ...")
    response = cos_client.get_object(Bucket=bucket, Key=key)
    body = response['Body']
    data = io.BytesIO()
    while True:
        chunk = body.read(10 * 1024 * 1024)  # 10 MB chunks
        if not chunk:
            break
        data.write(chunk)
    data.seek(0)
    print(f" Finished downloading {key} ({data.getbuffer().nbytes / (1024*1024):.2f} MB)")
    return data.read()

# Loop through all PDFs and download
pdf_files = {}
for key in pdf_keys:
    pdf_files[key] = read_cos_pdf(bucket, key)

print(f" All {len(pdf_files)} PDFs downloaded successfully.")

 Downloading ICH_E6(R3)_Guideline.pdf ...
 Finished downloading ICH_E6(R3)_Guideline.pdf (0.79 MB)
 Downloading inspection_survey.pdf ...
 Finished downloading inspection_survey.pdf (2.22 MB)
 All 2 PDFs downloaded successfully.
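
If you follow the local setup described in the note in Step 1, the same `pdf_files` dictionary can be built from files on disk instead of Cloud Object Storage. The sketch below is one way to do that; the helper name `load_local_pdfs` and the folder path are hypothetical.

```python
from pathlib import Path
from typing import Dict


def load_local_pdfs(folder: str) -> Dict[str, bytes]:
    """Read every *.pdf file in `folder` into memory, keyed by file name."""
    pdfs: Dict[str, bytes] = {}
    # sorted() keeps the processing order stable across runs.
    for path in sorted(Path(folder).glob("*.pdf")):
        pdfs[path.name] = path.read_bytes()
    return pdfs


# Example with a hypothetical folder; the result has the same shape as the
# dictionary produced by the Cloud Object Storage cell above:
# pdf_files = load_local_pdfs("./data")
```

Note that the glob pattern is case-sensitive on most Linux systems, so a file named `REPORT.PDF` would be skipped.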


### Step 8. Multi-modal PDF parsing and captioning

This step is crucial for transforming our raw PDF documents into a multi-modal, searchable knowledge base for the self-RAG agent. 

In [None]:
# Multi-Modal PDF Parsing and Captioning
import os
import pickle
import io
from typing import List
from PIL import Image
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import fitz # PyMuPDF
import torch
import gc

def extract_and_caption_pdf(filename: str, pdf_content: bytes) -> List[Document]:
    """Extracts text and images from in-memory PDF content, captions images, and returns LangChain Documents."""
    print(f"\nProcessing {filename}...", flush=True)
    
    # Open the PDF from the in-memory byte stream
    doc = fitz.open(stream=pdf_content, filetype="pdf")
    all_content = []
    
    # 1. Extract text chunks (1,000 characters each, with a 200-character overlap)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    for i, page in enumerate(doc):
        text = page.get_text()
        chunks = text_splitter.split_text(text)
        for j, chunk in enumerate(chunks):
            doc_metadata = {"source": filename, "page": i + 1, "chunk_id": f"P{i+1}-T{j}"}
            all_content.append(Document(page_content=chunk, metadata=doc_metadata))

    # 2. Extract and Caption Images
    if 'inspection_survey' in filename.lower():
        print(f"  -> {filename} identified as image-containing. Beginning image extraction...", flush=True)
        
        for i, page in enumerate(doc):
            image_list = page.get_images(full=True)
            for j, img_info in enumerate(image_list):
                try:
                    xref = img_info[0]
                    base_image = doc.extract_image(xref)
                    image_bytes = base_image["image"]
                    
                    # Defensive Image Loading and Normalization
                    img_stream = io.BytesIO(image_bytes)
                    image = Image.open(img_stream)
                    
                    # Convert to RGB to fix 'Unable to infer channel dimension' errors
                    if image.mode != 'RGB':
                        image = image.convert('RGB')
                    
                    # Memory Optimization (Resizing)
                    MAX_DIM = 1024
                    if max(image.size) > MAX_DIM:
                        image.thumbnail((MAX_DIM, MAX_DIM), Image.Resampling.LANCZOS)
                        
                    # --- Captioning ---
                    print(f"    -> Captioning image {j+1} on page {i+1}...", flush=True)
                    
                    conversation = [
                        {
                            "role": "user",
                            "content": [
                                {"type": "image", "image": image},
                                {"type": "text", "text": "Describe this image, chart, or diagram in detail. Summarize its key findings or data points."},
                            ],
                        },
                    ]
                    
                    # Apply chat template and generate
                    inputs = hf_processor.apply_chat_template(
                        conversation,
                        add_generation_prompt=True,
                        tokenize=True,
                        return_dict=True,
                        return_tensors="pt"
                    ).to(HF_DEVICE)
                    
                    # Use Bfloat16 for input tensors to match the model's dtype
                    if hf_vision_model.dtype == torch.bfloat16:
                        inputs = {k: v.to(torch.bfloat16) if v.is_floating_point() else v for k, v in inputs.items()}
                        
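                    output = hf_vision_model.generate(**inputs, max_new_tokens=256)
                    # Decode only the newly generated tokens so the prompt text is not stored in the caption
                    prompt_len = inputs["input_ids"].shape[1]
                    caption = hf_processor.decode(output[0][prompt_len:], skip_special_tokens=True).strip()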

                    # Create document from caption
                    caption_doc = f"IMAGE CAPTION (Source: {filename}, Page {i+1}, Image {j+1}): {caption}"
                    img_metadata = {"source": filename, "page": i + 1, "chunk_id": f"P{i+1}-I{j}", "type": "image_caption"}
                    all_content.append(Document(page_content=caption_doc, metadata=img_metadata))
                    
                    # Aggressive Memory Clearing
                    del inputs
                    del output
                    torch.cuda.empty_cache() 
                    gc.collect() 

                except Exception as e:
                    print(f"    Error processing image on page {i+1}, image {j+1}: {e}", flush=True)
                    # Clear memory even on error
                    torch.cuda.empty_cache()
                    gc.collect()
                    continue
                    
    return all_content

# Execution of the Multi-modal Parsing (Caching Logic Added)

CACHE_FILE = 'multimodal_documents_cache.pkl'
all_documents = []

if os.path.exists(CACHE_FILE):
    # Load from Cache
    print(f"\nCache file found: {CACHE_FILE}. Loading documents from cache...", flush=True)
    try:
        with open(CACHE_FILE, 'rb') as f:
            all_documents = pickle.load(f)
        print("Documents successfully loaded from cache. Skipping multi-modal parsing.", flush=True)
    except Exception as e:
        # Fallback if the cache file is corrupted: discard it and fall through to full parsing
        print(f"Error loading cache file: {e}. Attempting to run full parsing.", flush=True)
        os.remove(CACHE_FILE) # Delete bad cache
        all_documents = []
else:
    print(f"\nCache file not found. Running Multi-Modal PDF Parsing and Captioning...", flush=True)

if not all_documents:
    # Run Expensive Parsing and Save to Cache (cache missing or unreadable)
    
    # Assuming 'pdf_files' dictionary is populated from your COS retrieval step
    for filename, content in pdf_files.items():
        all_documents.extend(extract_and_caption_pdf(filename, content))
        
    print(f"\nFinished parsing. Total documents created: {len(all_documents)}", flush=True)
    
    # Save the results
    try:
        with open(CACHE_FILE, 'wb') as f:
            pickle.dump(all_documents, f)
        print(f"Successfully saved all {len(all_documents)} documents to {CACHE_FILE}.", flush=True)
    except Exception as e:
        print(f"WARNING: Could not save cache file {CACHE_FILE}: {e}", flush=True)


print(f"\nTotal documents (text chunks + image captions) available: {len(all_documents)}", flush=True)


Cache file not found. Running Multi-Modal PDF Parsing and Captioning...

Processing ICH_E6(R3)_Guideline.pdf...

Processing inspection_survey.pdf...
  -> inspection_survey.pdf identified as image-containing. Beginning image extraction...
    -> Captioning image 1 on page 1...
    -> Captioning image 2 on page 1...
    -> Captioning image 3 on page 1...
    -> Captioning image 4 on page 1...
    -> Captioning image 5 on page 1...
    -> Captioning image 6 on page 1...
    -> Captioning image 7 on page 1...
    -> Captioning image 8 on page 1...
    -> Captioning image 9 on page 1...
    -> Captioning image 1 on page 2...
    -> Captioning image 2 on page 2...
    -> Captioning image 3 on page 2...
    -> Captioning image 4 on page 2...
    -> Captioning image 1 on page 3...
    -> Captioning image 2 on page 3...
    -> Captioning image 3 on page 3...
    -> Captioning image 1 on page 4...
    -> Captioning image 2 on page 4...
    -> Captioning image 3 on page 4...
    -> Captioning im

This parsing will:

•	Define a function that uses `fitz` to pull both text and embedded image bytes from the PDF documents, a task that simple text readers often fail at.

•	Pass the extracted images and a descriptive prompt to the locally loaded Granite vision model, which is the key step for multi-modality. By converting images into descriptive text captions, we make visual information searchable through the standard text embedding model. This mechanism ensures that the agent is not "blind" to non-textual context and improves the completeness of the knowledge base.

•	Implement caching logic to store the results, preventing the time-consuming and computationally demanding multi-modal captioning process from having to be repeated. Storing the processed knowledge base speeds up development and repeated execution.

•	Ensure the final knowledge base gives the self-reflective agent full context that includes both textual and visual data. This is the main objective of the entire process, giving the later self-reflective retrieval the foundation it needs to be precise and well-founded.
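
Before indexing, it can help to check how many text chunks and image captions each PDF produced. The helper below is a small sketch for that purpose. It relies only on the metadata written by the parsing cell above, where image captions carry `"type": "image_caption"` and text chunks have no `type` key; the function name `summarize_documents` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def summarize_documents(documents: Iterable[Any]) -> Dict[Tuple[str, str], int]:
    """Count documents per (source, kind); kind defaults to 'text' when no type is set."""
    counts: Counter = Counter()
    for doc in documents:
        meta = doc.metadata
        counts[(meta.get("source", "unknown"), meta.get("type", "text"))] += 1
    return dict(counts)


# Usage in the notebook, after the parsing cell has run:
# for (source, kind), n in sorted(summarize_documents(all_documents).items()):
#     print(f"{source:35s} {kind:15s} {n}")
```

A source that reports zero image captions when you expected some is a quick signal that image extraction or captioning failed silently for that file.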

### Step 9. Indexing and retriever setup

This step completes the preparation of the multi-modal knowledge base by indexing all processed document chunks into an efficient, searchable vector store, which forms the basis for the agent's initial retrieval capability.

In [None]:
# Indexing and Retriever Setup 
from langchain_ibm import WatsonxEmbeddings 
from langchain_community.vectorstores import FAISS

print("\n Starting Vector Store Creation ", flush=True)

try:
    # Create the FAISS Vector Store
    vectorstore = FAISS.from_documents(
        documents=all_documents,
        embedding=embeddings_model
    )
    print(f"Vector Store created successfully with {len(all_documents)} documents.", flush=True)

    # Create the Retriever
    # We set 'k=5' to retrieve the top 5 most similar documents for any given query.
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
    print("Retriever configured (k=5). Ready for RAG.", flush=True)

except Exception as e:
    # This captures errors like embedding failures.
    print(f"Vector Store creation failed: {e}", flush=True)


--- Starting Vector Store Creation ---
Vector Store created successfully with 753 documents.
Retriever configured (k=5). Ready for RAG.


This configuration plays a key role in preparing the retrieval layer for the self-RAG workflow:

•	It builds a high-efficiency vector store using FAISS, which is well known for its speed and scalability when handling dense vector indexes. This ensures that similarity searches run quickly, which is critical for maintaining a responsive RAG pipeline.

•	It transforms the multi-modal knowledge base into vector representations, allowing the retriever to match user queries by meaning rather than relying on exact keyword overlap.

•	It controls how much context is delivered by retrieving the top five most similar documents (k=5) for each query, balancing precision and relevance within the model’s context window.

•	It establishes a single, consistent knowledge source that the self-RAG agent can depend on for factual grounding, an essential element of any trustworthy retrieval augmented system.
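
To make the role of `k` concrete, here is a small pure-Python sketch of top-k retrieval by cosine similarity. It does not reflect how FAISS works internally. The three-dimensional vectors, the chunk names, and the `top_k` helper are all made up for illustration; real embeddings come from the embedding model and have far more dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: List[float], index: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index: chunk name -> embedding vector.
toy_index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.7, 0.3, 0.1],
    "chunk-c": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, toy_index, k=2)
print([doc_id for doc_id, _ in results])  # ['chunk-a', 'chunk-b']
```

A larger `k` gives the model more evidence to draw on but also more text to fit into the prompt, which is the trade-off that `k=5` settles in this tutorial.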

### Step 10. LangGraph state and core self-RAG logic

This step sets up the two main parts of the self-RAG workflow: the agent state, which tracks the entire process, and the LangGraph node functions, which implement the flexible, self-correcting logic.

In [None]:
# LangGraph state and core self-RAG logic

from typing import TypedDict, List
from langchain_core.documents import Document
from langgraph.graph import StateGraph, END
import re
# Assumed objects: qa_llm, retriever, calculate_score (defined below)


# Define the Agent State (Schema)
class AgentState(TypedDict):
    """Represents the state of the Self-RAG agent."""
    query: str                       # The original user query
    retrieved_docs: List[Document]   # Documents retrieved from the vector store
    generation_history: List[str]    # History of generated segments
    critique_score: float            # The critique score of the last generated segment
    segment_count: int               # Counter for generated segments
    finish_generation: bool          # Flag to stop the generation loop


# LangGraph Node Functions (SELF-RAG Logic)
MAX_SEGMENTS = 10 
SCORE_THRESHOLD = 2.5 # Defined here for convenience, but also used in evaluate_critique

def calculate_score(isrel_val: str, issup_val: str, isuse_val: int) -> float:
    """Calculates the combined weighted score for a segment (Soft Constraint)."""
    W_ISSUP = 3.0  
    W_ISREL = 1.5  
    W_ISUSE = 0.5  

    score_rel = 1.0 if isrel_val == "Relevant" else 0.0
    
    if "Fully Supported" in issup_val:
        score_sup = 1.0
    elif "Partially" in issup_val:
        score_sup = 0.5
    else:
        score_sup = 0.0

    score_use = (isuse_val - 1) / 4.0  
    
    total_score = (W_ISREL * score_rel) + (W_ISSUP * score_sup) + (W_ISUSE * score_use)
    return total_score


def initial_decision(state: AgentState) -> AgentState:
    """Initial decision on whether to retrieve based on the query type."""
    query = state["query"]
    
    prompt = f"""
    You are an expert self-reflecting LLM. Your task is to determine if external knowledge is required to answer the following query accurately.
    - If knowledge is required, output the token: <|Retrieve=Yes|>
    - If the query is open-ended or based on common knowledge, output the token: <|Retrieve=No|>

    Query: "{query}"

    Decision Token:
    """
    
    response = qa_llm.invoke(prompt)
    
    if "<|Retrieve=Yes|>" in response:
        print("Decision: Retrieval required.", flush=True) 
        return {"query": query, "retrieved_docs": [], "critique_score": 0.0, "segment_count": 0, "finish_generation": False}
    else:
        print("Decision: No retrieval required for initial generation.", flush=True) 
        return {"query": query, "retrieved_docs": [Document(page_content="No documents retrieved.")], "critique_score": 0.0, "segment_count": 0, "finish_generation": False}


def retrieve_docs(state: AgentState) -> AgentState:
    """Retrieves documents based on the current query or the last generated segment."""
    query = state["query"]
    
    if state.get("generation_history"):
        search_query = state["generation_history"][-1]
    else:
        search_query = query
        
    print(f"Retrieving documents for: '{search_query[:50]}...'", flush=True) 
    docs = retriever.invoke(search_query)
    
    return {"retrieved_docs": docs}


def generate_segment(state: AgentState) -> AgentState:
    """Generates the next answer segment and self-reflects using critique tokens."""
    query = state["query"]
    history = state.get("generation_history", [])
    
    docs_context = "\n---\n".join([f"Source ({d.metadata.get('chunk_id')}): {d.page_content}" for d in state["retrieved_docs"]])
    history_context = "\n".join(history)
    
    prompt = f"""
    You are a SELF-RAG agent using the IBM Granite model. Your goal is to generate one accurate, concise segment of an answer.
    
    INSTRUCTION: Generate a comprehensive, multi-segment answer to the user's query.
    1. CONTEXT: Use the provided document segments (which include text and image captions) to answer the question accurately.
    2. SEGMENTATION: Only use the <|END|> token when the answer is fully comprehensive and detailed, and you have no more relevant information to add.
    3. REFLECTION: After generating the segment, immediately append these key-value reflection tokens:
        - ISREL: <|ISREL=Relevant|> or <|ISREL=Irrelevant|>
        - ISSUP: <|ISSUP=Fully Supported|> or <|ISSUP=Partially Supported|> or <|ISSUP=No Support|>
        - ISUSE: <|ISUSE=N|> (where N is the overall quality/utility score from 1 to 5, 5 is best).
        
    CURRENT QUERY: "{query}"
    
    HISTORY SO FAR: "{history_context}"
    
    RETRIEVED CONTEXT (Multi-Modal: text chunks and image captions):
    {docs_context}
    
    ---
    
    Generate the NEXT SEGMENT and REFLECTION TOKENS. End the entire generation with <|END|> if the answer is complete.
    """
    
    print(f"Generating Segment {state['segment_count'] + 1}...", flush=True) 
    full_response = qa_llm.invoke(prompt)

    # Parse the reflection tokens; a missing token falls back to the most pessimistic value.
    isrel = re.search(r"<\|ISREL=(.+?)\|>", full_response)
    issup = re.search(r"<\|ISSUP=(.+?)\|>", full_response)
    isuse = re.search(r"<\|ISUSE=(\d+)\|>", full_response)
    
    isrel_val = isrel.group(1).strip() if isrel else "Irrelevant"
    issup_val = issup.group(1).strip() if issup else "No Support"
    # Clamp the utility score to the documented 1-5 range.
    isuse_val = min(max(int(isuse.group(1)), 1), 5) if isuse else 1

    # Strip every complete <|...|> control token, including its value, from the segment.
    # Removing only the token prefixes and "|>" would leave stray text such as "Relevant" or "<|END".
    segment = re.sub(r"<\|[^|]*\|>", "", full_response).strip()
    
    new_history = history + [segment]
    
    print(f"  -> ISREL: {isrel_val}, ISSUP: {issup_val}, ISUSE: {isuse_val}", flush=True) 
    
    return {
        "generation_history": new_history,
        "segment_count": state["segment_count"] + 1,
        "finish_generation": "<|END|>" in full_response,
        "critique_score": calculate_score(isrel_val, issup_val, isuse_val), 
        "retrieved_docs": state["retrieved_docs"] 
    }


def evaluate_critique(state: AgentState) -> str:
    """Conditional edge function to determine the next step based on critique score."""
    score = state["critique_score"]
    segment_count = state["segment_count"]
    is_finished = state["finish_generation"]
    
    if is_finished or segment_count >= MAX_SEGMENTS:
        return "end"
    
    if score < SCORE_THRESHOLD:
        print(f"Critique: Low score ({score:.2f}) observed. FORCING RE-RETRIEVAL for next segment.", flush=True) 
        return "retrieve"  
    
    print(f"Critique: High score ({score:.2f}) observed. Continuing generation.", flush=True)
    return "continue"


def finalize_answer(state: AgentState) -> AgentState:
    """Compiles the final answer."""
    final_answer = "\n".join(state["generation_history"])
    print("\n--- FINAL ANSWER ---", flush=True) 
    print(final_answer, flush=True)
    return state


# Build and Compile the LangGraph Workflow

print("\n--- Building and Compiling LangGraph Workflow ---", flush=True)

workflow = StateGraph(AgentState)

# Add Nodes (Function Calls)
workflow.add_node("initial_decision", initial_decision)
workflow.add_node("retrieve_docs", retrieve_docs)
workflow.add_node("generate_segment", generate_segment)
workflow.add_node("finalize_answer", finalize_answer)


# Define Edges (Flow Control)
workflow.set_entry_point("initial_decision")

# Edge 1: Decide between retrieval or initial generation
workflow.add_conditional_edges(
    "initial_decision", 
    lambda state: "retrieve" if not state["retrieved_docs"] else "generate",
    {
        "retrieve": "retrieve_docs",
        "generate": "generate_segment",
    }
)

# Edge 2: After retrieval, always generate a segment
workflow.add_edge("retrieve_docs", "generate_segment")


# Edge 3: The core loop - Evaluate the critique score to determine the next action
workflow.add_conditional_edges(
    "generate_segment",
    evaluate_critique, 
    {
        "retrieve": "retrieve_docs",
        "continue": "generate_segment",
        "end": "finalize_answer",
    }
)

# Edge 4: End the workflow
workflow.add_edge("finalize_answer", END)


# Compile the Graph
app = workflow.compile()
print("LangGraph workflow compiled successfully (object named 'app').", flush=True)


--- Building and Compiling LangGraph Workflow ---
LangGraph workflow compiled successfully (object named 'app').


This code serves several purposes:

•	The agent keeps a core memory that stores its evolving response, the evidence it has retrieved, and internal feedback. This memory helps the agent's logic to dynamically improve its reasoning by storing context across various steps.

•	Before producing any segments, the agent first decides whether external knowledge is needed. During generation, if the existing context is deemed insufficient, the agent seeks stronger, more supportive information so that the generated response stays accurate and pertinent.

•	Alongside each generated segment, the model issues internal reflection tokens that immediately quantify the output's relevance, factual support, and overall quality. These critical signals are then combined into a single critique score, giving the agent an objective, measurable way to judge its own performance.

•	Based on the critique score (together with the end-of-answer flag and the segment limit), the agent then decides whether to retrieve again, continue generating, or finalize its answer. This iterative process makes the system more resilient, pushing it to correct poorly supported generations and maintain factual precision over multiple reasoning rounds.
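
As a worked example of how reflection tokens turn into a routing decision, the sketch below restates the scoring formula and threshold from this step in compact form and applies them to three invented model responses. The sample responses and the `parse_and_route` helper are hypothetical; only the weights (3.0, 1.5, 0.5), the threshold of 2.5, and the token format come from the code in this step. The end-of-answer flag and the segment limit are left out for brevity.

```python
import re
from typing import Tuple

W_ISSUP, W_ISREL, W_ISUSE = 3.0, 1.5, 0.5  # weights used in calculate_score
SCORE_THRESHOLD = 2.5                      # below this, retrieve again


def parse_and_route(response: str) -> Tuple[float, str]:
    """Extract reflection tokens, compute the weighted score, pick the next step."""
    isrel = re.search(r"<\|ISREL=(.+?)\|>", response)
    issup = re.search(r"<\|ISSUP=(.+?)\|>", response)
    isuse = re.search(r"<\|ISUSE=(\d+)\|>", response)

    # Missing tokens fall back to the most pessimistic value.
    rel = 1.0 if isrel and isrel.group(1).strip() == "Relevant" else 0.0
    sup_text = issup.group(1) if issup else "No Support"
    sup = 1.0 if "Fully Supported" in sup_text else 0.5 if "Partially" in sup_text else 0.0
    use = (int(isuse.group(1)) - 1) / 4.0 if isuse else 0.0

    score = W_ISREL * rel + W_ISSUP * sup + W_ISUSE * use
    return score, ("retrieve" if score < SCORE_THRESHOLD else "continue")


# Invented responses, used only to exercise the formula.
samples = [
    "The guideline covers GCP. <|ISREL=Relevant|><|ISSUP=Fully Supported|><|ISUSE=5|>",
    "It also covers pricing. <|ISREL=Relevant|><|ISSUP=No Support|><|ISUSE=5|>",
    "Remote inspections fell. <|ISREL=Relevant|><|ISSUP=Partially Supported|><|ISUSE=3|>",
]
routes = [parse_and_route(sample) for sample in samples]
for score, route in routes:
    print(f"{score:.2f} -> {route}")
# 5.00 -> continue
# 2.00 -> retrieve
# 3.25 -> continue
```

The second sample shows why factual support carries the largest weight: a segment rated relevant and highly useful still falls below the threshold when the retrieved context does not support it, so the agent retrieves again.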

### Step 11. Run the self-RAG workflow

This last step runs the entire self-RAG workflow from end to end.

In [None]:
# Execute the LangGraph Workflow

# 1. Define the Query
# This query is designed to require information from both documents.
user_query = "What is the primary purpose of the ICH E6(R3) Guideline and what are the key findings from the EFPIA 2024 inspection survey regarding remote inspections?"

# 2. Define the Initial Input State
# The generation_history must start empty.
inputs = {
    "query": user_query, 
    "generation_history": []
}

print(f"\n--- STARTING LANGGRAPH EXECUTION ---", flush=True)
print(f"Query: {user_query}\n", flush=True)

# 3. Stream the Execution
# This loop runs the graph and prints the state update after each node completes.
for step in app.stream(inputs):
    # Print the name of the node that just executed and its resulting state
    print(step, flush=True)
    print("\n--- NODE TRANSITION ---", flush=True) 

print(f"--- LANGGRAPH EXECUTION COMPLETE ---", flush=True)


--- STARTING LANGGRAPH EXECUTION ---
Query: What is the primary purpose of the ICH E6(R3) Guideline and what are the key findings from the EFPIA 2024 inspection survey regarding remote inspections?

Decision: Retrieval required.
{'initial_decision': {'query': 'What is the primary purpose of the ICH E6(R3) Guideline and what are the key findings from the EFPIA 2024 inspection survey regarding remote inspections?', 'retrieved_docs': [], 'critique_score': 0.0, 'segment_count': 0, 'finish_generation': False}}

--- NODE TRANSITION ---
Retrieving documents for: 'What is the primary purpose of the ICH E6(R3) Guid...'
{'retrieve_docs': {'retrieved_docs': [Document(id='399b5b94-41dd-460a-84ad-69d9da64aedb', metadata={'source': 'inspection_survey.pdf', 'page': 30, 'chunk_id': 'P30-T0'}, page_content='30\nEFPIA ANNUAL INSPECTION SURVEY - 2024 DATA - PUBLIC VERSION\nA strong basis of an inspection using physical \npresence by a strengthened domestic inspectorate\nCollaboration, Reliance, Recognit

In [None]:
# Final Answer Extraction and Review


final_state = None
# 1. Reuse the same query and inputs as in the previous cell
user_query = "What is the primary purpose of the ICH E6(R3) Guideline and what are the key findings from the EFPIA 2024 inspection survey regarding remote inspections?"
inputs = {"query": user_query, "generation_history": []}


print("\n--- RE-RUNNING EXECUTION FOR FINAL EXTRACTION ---", flush=True)

for step in app.stream(inputs):
    for key, value in step.items():
        # The key tells us which node just ran (e.g., 'finalize_answer')
        # The value is the state output of that node
        if key == "finalize_answer":
            final_state = value 
        elif key == END:
            # If the END node is hit, the graph is finished
            final_state = value 

# 2. Extract and Format the Final Answer
if final_state and "generation_history" in final_state:
    # Join all generated segments into one cohesive answer
    final_answer = "\n".join(final_state["generation_history"]).strip()

    print("\n==============================================", flush=True)
    print(" RAG PIPELINE COMPLETE", flush=True)
    print("==============================================", flush=True)
    print(f"USER QUERY:\n{user_query}\n", flush=True)
    print(f"FINAL GENERATED ANSWER ({final_state['segment_count']} segments):", flush=True)
    print("----------------------------------------------", flush=True)
    print(final_answer, flush=True)
    print("----------------------------------------------", flush=True)
else:
    print("\n EXECUTION FAILED or final state was not captured.", flush=True)
    print(f"Last recorded state: {final_state}", flush=True)


--- RE-RUNNING EXECUTION FOR FINAL EXTRACTION ---
Decision: Retrieval required.
Retrieving documents for: 'What is the primary purpose of the ICH E6(R3) Guid...'
Generating Segment 1...
  -> ISREL: Relevant, ISSUP: Fully Supported, ISUSE: 5

--- FINAL ANSWER ---
SEGMENT: The ICH E6(R3) Guideline primarily focuses on good clinical practice for design and conduct of clinical trials on medicinal products. It aims to harmonize these practices across different regions to ensure the protection of human subjects involved in clinical trials and the quality and integrity of the data generated. Regarding remote inspections, the EFPIA 2024 inspection survey reveals that while there is a trend of fewer remote inspections in the EU/EEA post-pandemic, the US shows no clear trend, with a slight decrease. The survey also highlights the potential for minimizing increased efforts through strategies like utilizing local inspectorates as leads, leveraging different time zones for document reviews, and pr

The `.stream()` method runs the compiled graph, represented by the `app` object. Once the agent either completes its multi-segment answer or hits the maximum number of segments, it produces the final output for the user question.

The initial state, which contains the detailed `user_query`, is passed in through the `inputs` dictionary.

As the graph streams, each iteration of the loop reports one node’s update, in the order determined by the graph’s routing logic. Every node’s output is printed as it runs, letting us watch the agent refine its reasoning in real time and build its multi-part response, ultimately ending with a well-supported final answer. The second cell reruns the full self-RAG workflow and watches the streaming state updates until the `finalize_answer` (or `END`) update appears. It then pulls the generated segments from that final state and joins them into a grounded final answer.
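
The capture pattern can be illustrated without running the model. In the sketch below, `fake_stream` is a hypothetical stand-in that yields dictionaries in the same `{node_name: node_output}` shape as the step dictionaries printed in the logs above. It is not LangGraph itself, and the segment text is invented.

```python
from typing import Any, Dict, Iterator


def fake_stream() -> Iterator[Dict[str, Dict[str, Any]]]:
    """Yield one {node_name: node_output} dict per step, mimicking the printed logs."""
    yield {"initial_decision": {"retrieved_docs": [], "segment_count": 0}}
    yield {"retrieve_docs": {"retrieved_docs": ["doc-1", "doc-2"]}}
    yield {"generate_segment": {"generation_history": ["First segment."], "segment_count": 1}}
    yield {"finalize_answer": {"generation_history": ["First segment."], "segment_count": 1}}


final_state = None
for step in fake_stream():
    for node_name, node_output in step.items():
        # In this tutorial's graph, finalize_answer returns the whole state,
        # so its output is the one worth keeping.
        if node_name == "finalize_answer":
            final_state = node_output

final_answer = "\n".join(final_state["generation_history"]).strip()
print(final_answer)
```

Filtering on the node name avoids accidentally keeping a partial update from an earlier node, such as the `retrieve_docs` output, which holds documents but no generated text.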


In [None]:

# 1. Define the query and inputs for the graph execution
user_query = "What does the ICH ER6 guideline say about Quality Assurance and Quality Control?"
inputs = {"query": user_query, "generation_history": []}

# 2. Rerun stream and capture the final state
print("--- Rerunning stream to answer the new query ---", flush=True)

final_state = None
# This loop runs the graph and prints the state update after each node completes.
for step in app.stream(inputs):
    print(step, flush=True)
    print("\n--- NODE TRANSITION ---\n", flush=True)
    # Capture the last yielded state (which contains the compiled final history)
    for key, value in step.items():
        if key != END:
            final_state = value

# 3. Extract and Format the Final Answer
if final_state and "generation_history" in final_state:
    # Join all generated segments (which should now be clean)
    final_answer_text = "\n".join(final_state["generation_history"]).strip()
    
    # Run a final cleanup pass
    final_answer_text = re.sub(r'\s*(Relevant|Irrelevant)\s*(Fully Supported|Partially Supported|No Support)\s*\d', '', final_answer_text).strip()
    final_answer_text = final_answer_text.replace("<|END", "").strip()

    # 4. Present the Results
    print("\n\n#####################################################", flush=True)
    print("            FINAL SELF-RAG ANSWER                 ", flush=True)
    print("#####################################################\n", flush=True)

    print("--- ANSWER ---", flush=True)
    print(final_answer_text, flush=True)
    print("\n#####################################################", flush=True)

else:
    print("\n EXECUTION FAILED or final state was not captured.", flush=True)


--- Rerunning stream to answer the new query ---
Decision: Retrieval required.
{'initial_decision': {'query': 'What does the ICH ER6 guideline say about Quality Assurance and Quality Control?', 'retrieved_docs': [], 'critique_score': 0.0, 'segment_count': 0, 'finish_generation': False}}

--- NODE TRANSITION ---

Retrieving documents for: 'What does the ICH ER6 guideline say about Quality ...'
{'retrieve_docs': {'retrieved_docs': [Document(id='7908cddc-7be4-46e8-b8dd-1109af8da414', metadata={'source': 'ICH_E6(R3)_Guideline.pdf', 'page': 84, 'chunk_id': 'P84-T0'}, page_content='ICH E6(R3) Guideline \n \n77 \n \nQuality Assurance (QA) \n \nAll those planned and systematic actions that are established to ensure that the trial is performed \nand the data are generated, documented (recorded) and reported in compliance with GCP and \nthe applicable regulatory requirement(s). \n \nQuality Control (QC) \n \nThe operational techniques and activities undertaken to verify that the requirements for

In [None]:
# USER QUERY: According to EFPIA 2024 data on multiple inspections at manufacturing sites, which countries recorded the highest inspection counts per site, and what does this reveal about their regulatory significance?
# 1. Define the query and inputs for the graph execution
user_query = "According to EFPIA 2024 data on multiple inspections at manufacturing sites, which countries recorded the highest inspection counts per site, and what does this reveal about their regulatory significance?"
inputs = {"query": user_query, "generation_history": []}

# 2. Rerun stream and capture the final state
print("--- Rerunning stream to answer the new combined query ---", flush=True)

final_state = None
# This loop runs the graph and prints the state update after each node completes.
for step in app.stream(inputs):
    print(step, flush=True)
    print("\n--- NODE TRANSITION ---\n", flush=True)
    # Capture the last yielded state (which contains the compiled final history)
    for key, value in step.items():
        if key != END:
            final_state = value

# 3. Extract and Format the Final Answer
if final_state and "generation_history" in final_state:
    # Join all generated segments (which should now be clean)
    final_answer_text = "\n".join(final_state["generation_history"]).strip()
    
    # Run a final cleanup pass
    final_answer_text = re.sub(r'\s*(Relevant|Irrelevant)\s*(Fully Supported|Partially Supported|No Support)\s*\d', '', final_answer_text).strip()
    final_answer_text = final_answer_text.replace("<|END", "").strip()

    # 4. Present the Results
    print("\n\n#####################################################", flush=True)
    print("            FINAL SELF-RAG ANSWER                  ", flush=True)
    print("#####################################################\n", flush=True)

    print("--- ANSWER ---", flush=True)
    print(final_answer_text, flush=True)
    print("\n#####################################################", flush=True)

else:
    print("\n EXECUTION FAILED or final state was not captured.", flush=True)

--- Rerunning stream to answer the new combined query ---
Decision: Retrieval required.
{'initial_decision': {'query': 'According to EFPIA 2024 data on multiple inspections at manufacturing sites, which countries recorded the highest inspection counts per site, and what does this reveal about their regulatory significance?', 'retrieved_docs': [], 'critique_score': 0.0, 'segment_count': 0, 'finish_generation': False}}

--- NODE TRANSITION ---

Retrieving documents for: 'According to EFPIA 2024 data on multiple inspectio...'
{'retrieve_docs': {'retrieved_docs': [Document(id='6f221e35-3bae-4e4d-a47f-8e6df714b254', metadata={'source': 'inspection_survey.pdf', 'page': 44, 'chunk_id': 'P44-T0'}, page_content='44\nEFPIA ANNUAL INSPECTION SURVEY - 2024 DATA - PUBLIC VERSION\nMultiple inspections at one manufacturing site (6 and more)\nINSPECTIONS AT MANUFACTURING SITES\n.\n.\nUS\n• AMA\n• Tanzania\n• Brazil\n• Russia\n• Türkiye\nGermany\n• US-FDA (2)\n• Belarus\n• Türkiye\nFrance\n• US-FDA\n• 

The self-reflective retrieval augmented generation setup in this tutorial offers major advantages over standard RAG, mainly in terms of reliability and efficiency. Its biggest strength is improved factual accuracy and traceability, made possible by the Granite LLM running its own self-critiques with reflection tokens. These critiques produce a score that guides the workflow and enables adaptive retrieval: the model only pulls new context when a segment isn’t well supported. This approach also makes it easier to work with complex, multi-modal documents, since image captions can be added to the vector store. The result is a more trustworthy, flexible query agent that segments its answers and checks them against the knowledge base before giving the final result.