<h1 style="text-align: center; font-size: 50px;">Multimodal RAG Chatbot with LangChain and vLLM</h1>

Retrieval-Augmented Generation (RAG) is an architectural approach that grounds large language model (LLM) responses in your own data. In this example, we use LangChain, an orchestrator for language pipelines, to build an assistant that loads information from an ADO wiki and uses it to answer user questions. We'll leverage torch and transformers for multimodal model support in Python. We'll also use the MLflow platform to evaluate and trace the LLM responses (in `register-workflow.ipynb`).

# Notebook Overview
- Configuring the Environment
- Data Loading & Cleaning
- Creation of Chunks
- Setup Embeddings & Vector Store
- Retrieval Function
- Model Setup & Chain Creation

## Step 0: Configuring the Environment

In this step, we import all the libraries and internal components required to run the RAG pipeline, including modules for wiki cloning, embedding generation, vector storage, and multimodal generation with vLLM.


Our Local GenAI workspace image comes with many of the libraries needed for RAG pre-installed. In our case, we only need to add extra support for multimodal processing.

In [1]:
import time
import os 
from pathlib import Path
import sys
import logging

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

# Create logger
logger = logging.getLogger("multimodal_rag_logger")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S") 
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
logger.propagate = False

In [2]:
start_time = time.time()  

logger.info('Notebook execution started.')

2025-08-05 02:56:44 - INFO - Notebook execution started.


In [3]:
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [4]:
# === Standard Library Imports ===
import gc
import hashlib
import json
import math
import shutil
import warnings
from collections import defaultdict
from pathlib import Path
from statistics import mean
from typing import Any, Dict, List, Optional, TypedDict

# === Third-Party Library Imports ===
import mlflow
import numpy as np
import torch
from IPython.display import display, Markdown
from langchain_core.embeddings import Embeddings
from langchain.schema.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from PIL import Image as PILImage
from rank_bm25 import BM25Okapi
from transformers import AutoImageProcessor, AutoTokenizer
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

# === Project-Specific Imports ===
from src.components import SemanticCache, SiglipEmbeddings
from src.wiki_pages_clone import orchestrate_wiki_clone
from src.utils import (
    configure_hf_cache,
    multimodal_rag_asset_status,
    load_config,
    load_secrets,
    load_mm_docs_clean,
    display_images,
)

2025-08-05 03:02:02.832750: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-08-05 03:02:02.846118: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754362922.861433     603 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754362922.866025     603 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1754362922.878492     603 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

INFO 08-05 03:02:06 [__init__.py:244] Automatically detected platform cuda.


In [5]:
warnings.filterwarnings("ignore")
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

Using device: cuda


In [7]:
print(f"--- Container PyTorch version: {torch.__version__}")
print(f"--- Container CUDA available: {torch.cuda.is_available()}")

--- Container PyTorch version: 2.7.0+cu126
--- Container CUDA available: True



### Verify Assets

In [8]:
CONFIG_PATH = "../configs/config.yaml"
SECRETS_PATH = "../configs/secrets.yaml"

LOCAL_MODEL_PATH: Path = Path("/home/jovyan/datafabric/Qwen2.5-VL-7B-Instruct-GPTQ-Int4")
CONTEXT_DIR: Path = Path("../data/context")             
CHROMA_DIR: Path = Path("../data/chroma_store")     
CACHE_DIR: Path = CHROMA_DIR / "semantic_cache"
MANIFEST_PATH: Path = CHROMA_DIR / "manifest.json"

IMAGE_DIR = CONTEXT_DIR / "images"
WIKI_METADATA_DIR = CONTEXT_DIR / "wiki_flat_structure.json"

DEMO_FOLDER = "../demo"

CHROMA_DIR.mkdir(parents=True, exist_ok=True)
CACHE_DIR.mkdir(parents=True, exist_ok=True)

multimodal_rag_asset_status(
    local_model_path=LOCAL_MODEL_PATH,
    config_path=CONFIG_PATH,
    secrets_path=SECRETS_PATH,
    wiki_metadata_dir=WIKI_METADATA_DIR,
    context_dir=CONTEXT_DIR,
    chroma_dir=CHROMA_DIR,
    cache_dir=CACHE_DIR,
    manifest_path=MANIFEST_PATH
)

2025-08-05 03:02:08 - INFO - Local Model is properly configured. 
2025-08-05 03:02:08 - INFO - Config is properly configured. 
2025-08-05 03:02:08 - INFO - Secrets is properly configured. 
2025-08-05 03:02:08 - INFO - wiki_flat_structure.json is not properly configured. Place JSON Wiki Pages in data/
2025-08-05 03:02:08 - INFO - CONTEXT is properly configured. 
2025-08-05 03:02:08 - INFO - CHROMA is properly configured. 
2025-08-05 03:02:08 - INFO - CACHE is properly configured. 
2025-08-05 03:02:08 - INFO - MANIFEST is not properly configured. Please check if the MANIFEST path was properly configured in your project on AI Studio.


### Config Loading

In this section, we load configuration parameters from the YAML file in the configs folder.

- **config.yaml**: Contains non-sensitive configuration parameters like model sources and URLs

In [9]:
config = load_config(CONFIG_PATH)

### Config HuggingFace Caches

In the next cell, we configure the HuggingFace cache so that downloaded models are persisted locally, even after the workspace is closed. Built-in support for this is a desired future feature for AI Studio and the GenAI addon.

In [None]:
# Configure HuggingFace cache
configure_hf_cache()
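
The body of `configure_hf_cache` lives in `src/utils` and is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it simply points the Hugging Face libraries at a persistent folder through the `HF_HOME` environment variable (the function name `point_hf_cache_at` and any path passed to it are hypothetical):

```python
import os
from pathlib import Path


def point_hf_cache_at(cache_root: str) -> Path:
    """Hypothetical helper: create `cache_root` and expose it through HF_HOME.

    HF_HOME is generally resolved when the Hugging Face libraries are imported,
    so it should be set before importing `transformers` or `huggingface_hub`.
    """
    root = Path(cache_root).expanduser()
    root.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(root)
    return root
```

Choosing a folder on a persistent volume, rather than `/tmp`, is what lets the downloaded weights survive a workspace restart.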

In [11]:
%%time

# Initialize HuggingFace Embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    cache_folder="/tmp/hf_cache"
)

CPU times: user 2.56 s, sys: 2.36 s, total: 4.92 s
Wall time: 22.1 s


## Step 1: Data Loading & Cleaning

`wiki_flat_structure.json` is a custom JSON metadata file for the ADO Wiki data. It has a flat structure, with keys for the file path, the Markdown content, and a list of images for each page. A separate image folder holds the images referenced by the Markdown pages. We scrape this data directly from ADO and clean it up where necessary.
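
To make the flat structure concrete, here is a small sketch of what one record could look like and how it might be parsed. The key names `path`, `content`, and `images`, the `WikiPage` class, and the sample pages are assumptions for illustration; the real file may use different names.

```python
import json
from dataclasses import dataclass, field


@dataclass
class WikiPage:
    path: str                                        # wiki file path of the page
    content: str                                     # raw Markdown of the page
    images: list[str] = field(default_factory=list)  # image file names referenced by the page


def parse_wiki_pages(raw_json: str) -> list[WikiPage]:
    """Parse a flat JSON array of page records into WikiPage objects."""
    return [
        WikiPage(rec["path"], rec["content"], list(rec.get("images", [])))
        for rec in json.loads(raw_json)
    ]


# Two made-up pages: one with an image, one without.
sample = json.dumps([
    {"path": "Onboarding/Setup.md", "content": "# Setup\nInstall the CLI.", "images": ["cli.png"]},
    {"path": "FAQ.md", "content": "# FAQ"},
])
pages = parse_wiki_pages(sample)
```

Because every page is a self-contained record, the loader can turn each one into a single document and carry the image list along as metadata.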

- **secrets.yaml**: Freemium users store sensitive data such as API keys in `secrets.yaml`.
- **AIS Secrets Manager**: Premium (paid) users can instead configure the API key in the secrets manager under the `Project Setup` tab.

In [12]:
%%time

ADO_PAT = os.getenv("AIS_ADO_TOKEN")
if not ADO_PAT:
    logger.info("Environment variable not found... Secrets Manager not properly set. Falling to secrets.yaml.")
    try:
        secrets = load_secrets(SECRETS_PATH)
        ADO_PAT = secrets.get('AIS_ADO_TOKEN')
    except Exception as e:
        # A missing or malformed secrets file would not raise NameError, so catch broadly here.
        logger.error(f"Could not load the token from {SECRETS_PATH}: {e}")

try:
    orchestrate_wiki_clone(
        pat=ADO_PAT,
        config=config,
        output_dir=CONTEXT_DIR
    )
    logger.info("✅ Wiki data preparation step completed successfully.")

except Exception as e:
    logger.error(f"Halting notebook execution due to a critical error in the wiki preparation step: {e}")
    raise  # re-raise so that execution actually stops

2025-08-05 03:02:30 - INFO - Environment variable not found... Secrets Manager not properly set. Falling to secrets.yaml.
2025-08-05 03:02:30 - INFO - Starting ADO Wiki clone process...
2025-08-05 03:02:30 - INFO - Cloning wiki 'Phoenix-DS-Platform.wiki' to temporary directory: /tmp/tmpd9whd777
2025-08-05 03:02:44 - INFO - Scanning for Markdown files...
2025-08-05 03:02:45 - INFO - → Found 575 Markdown pages.
2025-08-05 03:02:45 - INFO - Copying referenced images to ../data/context/images...
2025-08-05 03:02:56 - INFO - → 792 unique images copied.
2025-08-05 03:02:56 - INFO - Assembling flat JSON structure...
2025-08-05 03:02:56 - INFO - ✅ Wiki data successfully cloned to ../data/context
2025-08-05 03:02:56 - INFO - Cleaned up temporary directory: /tmp/tmpd9whd777
2025-08-05 03:02:56 - INFO - ✅ Wiki data preparation step completed successfully.


CPU times: user 1.2 s, sys: 1.36 s, total: 2.56 s
Wall time: 25.5 s


In [13]:
%%time

WIKI_METADATA_DIR   = Path(WIKI_METADATA_DIR)
IMAGE_DIR = Path(IMAGE_DIR)

mm_raw_docs = load_mm_docs_clean(WIKI_METADATA_DIR, Path(IMAGE_DIR))

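def log_stage(name: str, docs: List[Document]):
    # Note: the value logged as "avg_tokens" is the mean character count, not a token count.
    avg_chars = sum(len(d.page_content) for d in docs) / max(len(docs), 1)
    logger.info(f"{name}: {len(docs)} docs, avg_tokens={avg_chars:.0f}")
log_stage("Docs loaded", mm_raw_docs)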

2025-08-05 03:02:57 - INFO - Docs loaded: 575 docs, avg_tokens=3138


CPU times: user 67.5 ms, sys: 81.6 ms, total: 149 ms
Wall time: 1.52 s


## Step 2: Creation of Chunks

Here, we split the loaded documents into chunks, so we have smaller and more specific texts to add to our vector database. 

We first split on Markdown headers, and then within each section we further split by the configured chunk size. Each chunk is prefixed with its page name, which keeps it tied to its source page.
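
The two-stage idea can be sketched in plain Python. This is a simplified stand-in, not LangChain's algorithm: `split_on_headers`, `window`, and `chunk_page` are hypothetical helpers, and the fixed-width window ignores the separator-aware splitting that `RecursiveCharacterTextSplitter` performs.

```python
import re


def split_on_headers(markdown: str) -> list[str]:
    """Split Markdown into sections, starting a new one at each '#' or '##' line."""
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,2} ", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def window(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size chars; neighbours share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_page(title: str, markdown: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Header split first, size split second; every chunk is prefixed with the page title."""
    return [
        f"{title}\n\n{piece.strip()}"
        for section in split_on_headers(markdown)
        for piece in window(section, chunk_size, overlap)
    ]


page = "# Install\nRun the installer and accept the defaults.\n## Verify\nCheck the version."
chunks = chunk_page("Setup Guide", page)
```

Splitting on headers first means a chunk never straddles two sections, and the overlap keeps a sentence that falls on a window boundary visible in both neighbouring chunks.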

In [14]:
%%time

def chunk_documents(
    docs,
    chunk_size: int = 1200,
    overlap: int = 200,
) -> list[Document]:
    """
    1) Split each wiki page on Markdown headers (#, ## …) to keep logical
       sections together.
    2) Recursively break long sections to <= `chunk_size` chars with `overlap`.
    3) Prefix every chunk with its page-title and store the title in metadata.
    """
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "title"), ("##", "section")]
    )
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
    )

    all_chunks: list[Document] = []
    for doc in docs:
        page_title = Path(doc.metadata["source"]).stem.replace("-", " ")

        # 1. section-level split (returns list[Document])
        section_docs = header_splitter.split_text(doc.page_content)

        for section in section_docs:
            # 2. size‑based split inside each section
            tiny_texts = recursive_splitter.split_text(section.page_content)

            for idx, tiny in enumerate(tiny_texts):
                all_chunks.append(
                    Document(
                        page_content=f"{page_title}\n\n{tiny.strip()}",
                        metadata={
                            "title": page_title,
                            "source": doc.metadata["source"],
                            # The header splitter stores "##" headings under the "section" key.
                            "section_header": section.metadata.get("section", ""),
                            "chunk_id": idx,
                        },
                    )
                )
    if all_chunks:
        avg_len = int(mean(len(c.page_content) for c in all_chunks))
        logger.info(
            "Chunking complete: %d docs → %d chunks (avg %d chars)",
            len(docs),
            len(all_chunks),
            avg_len,
        )
    else:
        logger.warning("Chunking produced zero chunks for %d docs", len(docs))

    return all_chunks

splits = chunk_documents(mm_raw_docs)

2025-08-05 03:02:57 - INFO - Chunking complete: 575 docs → 2678 chunks (avg 721 chars)


CPU times: user 133 ms, sys: 3.89 ms, total: 137 ms
Wall time: 136 ms


## Step 3: Setup Embeddings & Vector Store
Here we set up SigLIP for image embeddings and transform the cleaned text chunks into embeddings stored in Chroma. Chroma persists its data to local disk, so the index does not have to be rebuilt on every run.

### Setup Text ChromaDB

In [15]:
%%time

# 1) TEXT store
def _current_manifest() -> Dict[str, str]:
    """
    Returns a dictionary mapping every context JSON file to its SHA256 content hash.
    This allows detecting changes in file content, not just filenames.
    """
    manifest = {}
    json_files = sorted(CONTEXT_DIR.rglob("*.json"))

    for file_path in json_files:
        try:
            with open(file_path, "rb") as f:
                file_bytes = f.read()
                file_hash = hashlib.sha256(file_bytes).hexdigest()
                manifest[str(file_path.resolve())] = file_hash
        except IOError as e:
            logger.error(f"Could not read file {file_path} for hashing: {e}")
    return manifest

def _needs_rebuild() -> bool:
    """
    Determines if the ChromaDB needs to be rebuilt.
    A rebuild is needed if:
    1. The Chroma directory or manifest file doesn't exist.
    2. The manifest is unreadable.
    3. The stored file hashes in the manifest do not match the current file hashes.
    """
    if not CHROMA_DIR.exists() or not MANIFEST_PATH.exists():
        logger.info("Chroma directory or manifest not found. A rebuild is required.")
        return True
    try:
        old_manifest = json.loads(MANIFEST_PATH.read_text())
    except Exception as e:
        logger.warning(f"Could not read manifest file. A rebuild is required. Error: {e}")
        return True

    current_manifest = _current_manifest()
    if old_manifest != current_manifest:
        logger.info("Data content has changed. A rebuild is required.")
        return True

    return False

def _save_manifest(manifest: Dict[str, str]) -> None:
    """Saves the current data manifest (mapping file paths to hashes) to disk."""
    CHROMA_DIR.mkdir(parents=True, exist_ok=True)
    MANIFEST_PATH.write_text(json.dumps(manifest, indent=2))

def _build_text_db() -> Chroma:
    collection = "mm_text"
    # The rebuild check is now done outside this function.
    # We check if the directory exists. If not, we build.
    if not CHROMA_DIR.exists() or not (CHROMA_DIR / "chroma.sqlite3").exists():
        logger.info("Creating new text context index in %s ...", CHROMA_DIR)
        chroma = Chroma.from_documents(
            documents          = splits,
            embedding          = embeddings,
            collection_name    = collection,
            persist_directory  = str(CHROMA_DIR),
        )
        return chroma

    logger.info("Loading existing Chroma index from %s", CHROMA_DIR)
    return Chroma(
        collection_name   = collection,
        persist_directory = str(CHROMA_DIR),
        embedding_function= embeddings,
    )
    
# Check if a rebuild is needed and wipe the old DB if so.
# This ensures both the text and image databases are rebuilt from scratch.
if _needs_rebuild():
    logger.warning("REBUILDING: Wiping old ChromaDB store at %s", CHROMA_DIR)
    if CHROMA_DIR.exists():
        shutil.rmtree(CHROMA_DIR)
    # Save the new manifest immediately after deciding to rebuild
    _save_manifest(_current_manifest())

# Now, initialize your databases. They will be created fresh if they were just deleted.
text_db = _build_text_db()
CACHE_DIR.mkdir(parents=True, exist_ok=True)

2025-08-05 03:02:57 - INFO - Chroma directory or manifest not found. A rebuild is required.
2025-08-05 03:02:58 - INFO - Creating new text context index in ../data/chroma_store ...


CPU times: user 1min, sys: 1.95 s, total: 1min 2s
Wall time: 1min 2s


### Setup Image ChromaDB

In [16]:
%%time

#  Helper: walk all docs once and gather *unique* image vectors + metadata
def _collect_image_vectors():
    """
    Scans every wiki page for image references and returns three parallel lists:
        img_paths : list[str]   → full file-system paths (for SigLIP)
        img_ids   : list[str]   → unique key per (page, image) pair
        img_meta  : list[dict]  → {"source": wiki_page, "image": file_name}
    Runs in < 1s even for thousands of docs.
    """
    img_paths, img_ids, img_meta = [], [], []
    seen = set()

    for doc in mm_raw_docs:                         # raw wiki pages
        src = doc.metadata["source"]
        for name in doc.metadata.get("images", []): # list[str]
            img_id = f"{src}::{name}"
            if img_id in seen:
                continue                            # de‑dupe
            seen.add(img_id)

            img_paths.append(str(IMAGE_DIR / name))
            img_ids.append(img_id)
            img_meta.append({"source": src, "image": name})

    return img_paths, img_ids, img_meta

siglip_embeddings = SiglipEmbeddings("google/siglip2-base-patch16-224", DEVICE)

# 2) IMAGE store
image_db = Chroma(
    collection_name    = "mm_image",
    persist_directory  = str(CHROMA_DIR),   # SAME dir as text db
    embedding_function = siglip_embeddings, # SigLIP wrapper from src.components
)

# Populate vectors *only* if it is empty
if not image_db._collection.count():
    img_paths, img_ids, img_meta = _collect_image_vectors()
    image_db.add_texts(texts=img_paths, metadatas=img_meta, ids=img_ids)
    image_db.persist()
    logger.info("Indexed %d unique images.", len(img_paths))
else:
    logger.info("Loaded existing image index (%d vectors).",
                image_db._collection.count())

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.

2025-08-05 03:04:59 - INFO - Indexed 806 unique images.


CPU times: user 29.8 s, sys: 16.5 s, total: 46.4 s
Wall time: 59.1 s


### Setup Memory Store

In [17]:
# Initialize the semantic cache
semantic_cache = SemanticCache(persist_directory=CACHE_DIR, embedding_function=embeddings)
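
`SemanticCache` is a project component whose implementation is not shown here. The general idea of a semantic cache can be sketched as follows, assuming it returns a stored answer whenever a new query's embedding is close enough to a cached one. `ToySemanticCache`, `toy_embed`, and the bag-of-words embedding are illustrative stand-ins; only the 0.92 threshold is taken from how the cache is queried later in this notebook.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (a real cache would use a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Return a stored answer when a new query is similar enough to a cached one."""

    def __init__(self) -> None:
        self._entries: list[tuple[Counter, str]] = []

    def set(self, query: str, answer: str) -> None:
        self._entries.append((toy_embed(query), answer))

    def get(self, query: str, threshold: float = 0.92) -> Optional[str]:
        q = toy_embed(query)
        best = max(self._entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= threshold:
            return best[1]
        return None


cache = ToySemanticCache()
cache.set("how do I reset my password", "Use the self-service portal.")
hit = cache.get("How do I reset my password")     # same words, different casing
miss = cache.get("where is the build pipeline")   # unrelated wording
```

A high threshold keeps the cache conservative: only near-duplicate questions skip the full retrieval and generation pipeline.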

## Step 4: Retrieval Function

This code implements a hybrid retrieval process that combines two powerful search techniques to find the most relevant text documents and associated images.

1.  **Initial Recall (Hybrid Search)**: The system runs two complementary searches:
    * **Dense Search**: A vector similarity search against `text_db` (ChromaDB) to find semantically related documents.
    * **Sparse Search**: A keyword-based search using a `BM25` index to find documents with exact term matches.

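2.  **Fusion (RRF)**: The results from both searches are combined into a single ranked list using **Reciprocal Rank Fusion (RRF)**. Each document receives a score of 1 / (k + rank) from every list it appears in (k = 60 here), and the scores are summed. The only parameter is k, and the raw dense and sparse scores never need to be calibrated against each other.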

3.  **Image Retrieval**: Using the top text documents from the fused list, the system performs a targeted search in the `image_db` to find images that are on the same source pages, ensuring contextual relevance.
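
The fusion step can be illustrated with a small worked example. This sketch uses plain strings in place of documents; the `rrf` function and the `doc_*` identifiers are made up for illustration, while the scoring rule, summing 1 / (k + rank) with k = 60, matches the helper defined in the next cell.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists that contain it."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]    # order returned by the vector search
sparse = ["doc_b", "doc_d", "doc_a"]   # order returned by the keyword search
fused = rrf([dense, sparse])
```

Here `doc_b` ends up first because it sits near the top of both lists (1/62 + 1/61), narrowly ahead of `doc_a` (1/61 + 1/63), while documents found by only one search fall to the bottom.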




In [18]:
# This is necessary because the chunking process can sometimes create identical chunks.
unique_docs_map = {doc.page_content: doc for doc in splits}
unique_splits = list(unique_docs_map.values())

logger.info(f"De-duplicated {len(splits)} chunks down to {len(unique_splits)} unique chunks.")

# Now, build the BM25 index and the final doc_map using only the unique documents.
# This ensures the index and the search corpus are perfectly aligned.
corpus = [doc.page_content for doc in unique_splits]
# Lowercase the corpus so it matches the lowercased query tokens used at search time.
tokenized_corpus = [doc.lower().split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
doc_map = {doc.page_content: doc for doc in unique_splits}

# %%
# Helper function for Reciprocal Rank Fusion
def reciprocal_rank_fusion(
    results: list[list[Document]], k: int = 60
) -> list[tuple[Document, float]]:
    """Performs RRF on multiple lists of ranked documents."""
    ranked_lists = [
        {doc.page_content: (doc, i + 1) for i, doc in enumerate(res)}
        for res in results
    ]
    rrf_scores = defaultdict(float)
    all_docs = {}
    for ranked_list in ranked_lists:
        for content, (doc, rank) in ranked_list.items():
            rrf_scores[content] += 1 / (k + rank)
            if content not in all_docs:
                all_docs[content] = doc
    fused_results = [
        (all_docs[content], rrf_scores[content])
        for content in sorted(rrf_scores, key=rrf_scores.get, reverse=True)
    ]
    return fused_results


def retrieve_mm(
    query: str,
    text_db: Chroma,
    image_db: Chroma,
    bm25_index: BM25Okapi,
    doc_map: dict,
    k_text: int = 3,
    k_img: int = 2,
    recall_k: int = 20,
) -> dict[str, Any]:
    """
    Performs hybrid search for text and retrieves contextually relevant images.
    """
    # 1. Hybrid Search for Text
    dense_hits = text_db.similarity_search(query, k=recall_k)
    tokenized_query = query.lower().split(" ")
    sparse_texts = bm25_index.get_top_n(tokenized_query, list(doc_map.keys()), n=recall_k)
    sparse_hits = [doc_map[text] for text in sparse_texts]

    if not dense_hits and not sparse_hits:
        return {"docs": [], "scores": [], "images": []}

    fused_results = reciprocal_rank_fusion([dense_hits, sparse_hits])
    final_docs = [doc for doc, score in fused_results[:k_text]]
    final_scores = [score for doc, score in fused_results[:k_text]]

    # 2. Retrieve Relevant Images
    retrieved_images = []
    if final_docs:
        # Get the source pages of the top text results
        final_sources = list(set(d.metadata["source"] for d in final_docs))

        # Perform a vector search for images, filtered by the relevant sources
        # The image_db's embedding function (SigLIP) will automatically handle the text query.
        image_hits = image_db.similarity_search(
            query,
            k=k_img,
            filter={"source": {"$in": final_sources}}
        )
        # The `page_content` of an image document is its path/name
        retrieved_images = [img.page_content for img in image_hits]

    return {
        "docs": final_docs,
        "scores": final_scores,
        "images": retrieved_images,
    }

2025-08-05 03:05:00 - INFO - De-duplicated 2678 chunks down to 2667 unique chunks.


## Step 5: Model Setup & Chain Creation

In this section, we set up our local Large Language Model (LLM) and integrate it into a Question Answering (QA) pipeline. We use `Qwen2.5-VL-7B-Instruct` as our multimodal model, which can process both text and images. This setup is encapsulated in the `QwenVLMM` class, a wrapper around vLLM that handles caching, retrieval, prompt assembly, and generation.

### Cleanup Previous Embeddings

In [None]:
logger.info("✅ Embeddings and vector stores are ready. Offloading embedding models to free up VRAM.")

# Explicitly delete the objects to free memory
del embeddings
del siglip_embeddings
gc.collect()


# For PyTorch, also release cached CUDA memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()

2025-08-05 03:05:00 - INFO - ✅ Embeddings and vector stores are ready. Offloading embedding models to free up VRAM.


### QwenVLMM QA Wrapper

In [20]:
%%time

class QwenVLMM:
    """
    Multimodal QA wrapper around the quantized Qwen2.5-VL model using vLLM.
    Requires:
      * `vllm` installed and importable.
      * `qwen_vl_utils.process_vision_info` for multimodal image handling.
      * HuggingFace transformers for tokenizer / image processor.
      * External retrieval function (e.g., `retrieve_mm`) and a `SemanticCache`-like cache.
    Expects the quantized safetensors model `RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8`
    to be accessible (vLLM will pull it from HuggingFace).
    """

    def __init__(
        self,
        cache,
        text_db,
        image_db,
        bm25_index,
        doc_map: dict,
        model_name: str = "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
        base_for_tokenizer: str = "Qwen/Qwen2.5-VL-7B-Instruct",
        device: str = "cuda",
    ):
        self.cache = cache
        self.text_db = text_db
        self.image_db = image_db
        self.bm25_index = bm25_index
        self.doc_map = doc_map

        self.model_name = model_name
        self.base_for_tokenizer = base_for_tokenizer
        self.device = device

        self.tok = None
        self.image_processor = None
        self.llm = None  # vLLM instance

        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
        self._load()

    # ---------- public function ----------
    def generate(self, query: str, force_regenerate: bool = False, **retrieval_kwargs) -> dict:
        """
        Run retrieval, prompt assembly, and model generation via vLLM.
        """
        # 1. Cache check
        if not force_regenerate:
            cached_result = self.cache.get(query, threshold=0.92)
            if cached_result:
                logger.info(f"SEMANTIC CACHE HIT for query: '{query}'")
                return cached_result
        if force_regenerate:
            logger.info(f"Forced regeneration for query: '{query}'. Clearing old cache entry.")
            self.cache.delete(query)
        logger.info(f"CACHE MISS for query: '{query}'. Running full pipeline.")

        if self.llm is None or self.tok is None:
            return {"reply": "Error: model not initialised.", "used_images": []}

        # 2. Retrieval
        hits = retrieve_mm(
            query,
            text_db=self.text_db,
            image_db=self.image_db,
            bm25_index=self.bm25_index,
            doc_map=self.doc_map,
            **retrieval_kwargs
        )
        docs = hits.get("docs", [])
        images = hits.get("images", [])

        if not docs and not images:
            return {
                "reply": "Based on the provided context, I cannot answer this question.", 
                "used_images": [],
                "retrieved_sources": {"text_documents": [], "images": []},
            }

        # Limit number of images to reduce memory usage
        if len(images) > 2:
            logger.warning(f"Limiting images from {len(images)} to 2 to save memory")
            images = images[:2]

        # 3. Build prompt
        context_str = "\n\n".join(
            f"<source_document name=\"{d.metadata.get('source', 'unknown')}\">\n{d.page_content}\n</source_document>"
            for d in docs
        )

        system_prompt = """You are a Multimodal RAG Assistant. Your task is to answer the user's query using ONLY the provided context from retrieved documents and images.
            
            **Instructions:**
            1. **Analyze Context:** Carefully examine the retrieved images and text documents provided in the context.
            2. **Answer Directly:** Provide a clear, comprehensive answer to the user's query by synthesizing information from both text and image sources.
            3. **Stay Focused:** Do not include unnecessary sections or verbose explanations. Answer the question directly and concisely.
            4. **No Hallucination:** Use ONLY the information provided in the context. Do not make up facts or add information not present in the retrieved materials.
            
            **Output Format:**
            - If the context is relevant: Provide a direct answer using the retrieved context.
            - If the context is irrelevant: Respond with "The provided context does not contain relevant information to answer the query."
            """
                    
        # Build user content with proper image placeholders for Qwen2.5-VL
        if images:
            # Use the standard Qwen2.5-VL image token format, one placeholder per image
            image_tokens = "<|vision_start|><|image_pad|><|vision_end|>" * len(images)
            
            user_content = f"""{image_tokens}

            <context>
            {context_str}
            </context>
            
            <user_query>
            {query}
            </user_query>"""
        else:
            user_content = f"""<context>
            {context_str}
            </context>
            
            <user_query>
            {query}
            </user_query>"""

        # Use chat template if available
        if hasattr(self.tok, 'apply_chat_template') and self.tok.chat_template:
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_content}
            ]
            try:
                prompt_string = self.tok.apply_chat_template(
                    messages, 
                    tokenize=False, 
                    add_generation_prompt=True
                )
            except Exception as e:
                logger.warning(f"Chat template failed: {e}, using fallback")
                prompt_string = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_content}<|im_end|>\n<|im_start|>assistant\n"
        else:
            # Fallback to manual template
            prompt_string = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_content}<|im_end|>\n<|im_start|>assistant\n"

        # 4. Generation via vLLM
        try:
            self._clear_cuda()

            # More conservative sampling parameters
            sampling_params = SamplingParams(
                temperature=0.0,     # Deterministic
                top_p=1.0,
                max_tokens=2048,
            )

            if images:
                # Process images with size limit
                pil_images = []
                for i, img_path in enumerate(images):
                    try:
                        img = PILImage.open(img_path).convert("RGB")
                        # Resize large images to save memory
                        if img.size[0] > 512 or img.size[1] > 512:
                            img.thumbnail((512, 512), PILImage.Resampling.LANCZOS)
                        pil_images.append(img)
                        logger.info(f"Processed image {i+1}: {img_path}")
                    except Exception as e:
                        logger.warning(f"Failed to process image {img_path}: {e}")
                        continue
                
                if not pil_images:
                    logger.warning("No images successfully processed, proceeding text-only")
                    request_payload = {"prompt": prompt_string}
                else:
                    request_payload = {
                        "prompt": prompt_string,
                        "multi_modal_data": {
                            "image": pil_images
                        },
                    }
            else:
                request_payload = {"prompt": prompt_string}
            
            output_list = self.llm.generate(request_payload, sampling_params=sampling_params)
            
            if output_list and output_list[0].outputs:
                reply = output_list[0].outputs[0].text.strip()
            else:
                reply = "Error: no output from LLM."

            self._clear_cuda()
            
            # Prepare retrieved sources for programmatic return
            retrieved_sources = {
                "text_documents": [
                    {
                        "source": d.metadata.get('source', 'unknown'),
                        # Truncate long chunks to a 500-character preview
                        "content": (
                            d.page_content[:500] + "..."
                            if len(d.page_content) > 500
                            else d.page_content
                        ),
                        "metadata": d.metadata
                    }
                    for d in docs
                ],
                "images": [
                    {
                        "path": img_path,
                        "filename": os.path.basename(img_path)
                    }
                    for img_path in images
                ]
            }

            if reply == "The provided context does not contain relevant information to answer the query.":
                images = []
            
            result = {
                "reply": reply, 
                "used_images": images,
                "retrieved_sources": retrieved_sources,
            }
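            # Skip caching the placeholder reply used when vLLM returns no output
            if reply != "Error: no output from LLM.":
                self.cache.set(query, result)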
            return result

        except RuntimeError as e:
            msg = str(e).lower()
            if "cuda" in msg or "out of memory" in msg:
                logger.warning("CUDA error – resetting model: %s", e)
                self._reset()
                error_reply = "I ran into a GPU memory error – please try again."
            else:
                logger.error("Runtime error: %s", e)
                error_reply = f"Error: {e}"
            return {
                "reply": error_reply, 
                "used_images": images,
                "retrieved_sources": {"text_documents": [], "images": []},
            }

    # ---------- internal helpers ----------

    def _load(self):
        """Load tokenizer, image_processor, & vLLM model."""
        logger.info("Loading Qwen2.5-VL via vLLM...")
        gc.collect()
        self._clear_cuda()
    
        # Tokenizer & image processor (base model)
        self.tok = AutoTokenizer.from_pretrained(
            self.base_for_tokenizer, trust_remote_code=True
        )
        if self.tok.pad_token is None:
            self.tok.pad_token = self.tok.eos_token
    
        self.image_processor = AutoImageProcessor.from_pretrained(
            self.base_for_tokenizer, trust_remote_code=True, use_fast=True
        )
    
        # Load the GPTQ-quantized safetensors model with vLLM
        self.llm = LLM(
            model=self.model_name,
            quantization="gptq",
            gpu_memory_utilization=0.70,    # Leave headroom for image tensors
            max_model_len=4096,
            enforce_eager=True,
            limit_mm_per_prompt={"image": 2},  # No more than 2 images
            disable_custom_all_reduce=True,
            tensor_parallel_size=1,
            dtype="float16",
        )

        logger.info("vLLM model loaded.")


    def _reset(self):
        """Free everything and reload on error."""
        logger.warning("Resetting QwenVLMM model …")
        try:
            del self.llm, self.tok, self.image_processor
        except Exception:
            pass
        self.llm = self.tok = self.image_processor = None
        gc.collect()
        self._clear_cuda()
        time.sleep(1)
        self._load()

    @staticmethod
    def _clear_cuda():
        try:
            import torch

            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()
        except ImportError:
            pass


# Initialize the multimodal LLM
mm = QwenVLMM(
    cache=semantic_cache,
    text_db=text_db,
    image_db=image_db,
    bm25_index=bm25,
    doc_map=doc_map,
    model_name=str(LOCAL_MODEL_PATH),  # quantized safetensors model
    base_for_tokenizer="Qwen/Qwen2.5-VL-7B-Instruct",  # for tokenizer / image processor
    device="cuda" if torch.cuda.is_available() else "cpu",
)

2025-08-05 03:05:01 - INFO - Loading Qwen2.5-VL via vLLM...



INFO 08-05 03:05:13 [config.py:841] This model supports multiple tasks: {'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 08-05 03:05:13 [config.py:1472] Using max model len 4096
INFO 08-05 03:05:13 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/home/jovyan/datafabric/Qwen2.5-VL-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/home/jovyan/datafabric/Qwen2.5-VL-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backe



INFO 08-05 03:05:51 [default_loader.py:272] Loading weights took 34.98 seconds
INFO 08-05 03:05:52 [model_runner.py:1203] Model loading took 6.5939 GiB and 35.508739 seconds


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


INFO 08-05 03:06:45 [worker.py:294] Memory profiling takes 52.84 seconds
INFO 08-05 03:06:45 [worker.py:294] the current vLLM instance can use total_gpu_memory (24.00GiB) x gpu_memory_utilization (0.70) = 16.80GiB
INFO 08-05 03:06:45 [worker.py:294] model weights take 6.59GiB; non_torch_memory takes 0.02GiB; PyTorch activation peak memory takes 7.51GiB; the rest of the memory reserved for KV Cache is 2.68GiB.
INFO 08-05 03:06:45 [executor_base.py:113] # cuda blocks: 3139, # CPU blocks: 4681
INFO 08-05 03:06:45 [executor_base.py:118] Maximum concurrency for 4096 tokens per request: 12.26x
INFO 08-05 03:06:46 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 54.17 seconds


2025-08-05 03:06:46 - INFO - vLLM model loaded.


CPU times: user 1min 16s, sys: 25.4 s, total: 1min 41s
Wall time: 1min 45s
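
The memory figures in the vLLM log above can be reproduced by hand, which helps when tuning `gpu_memory_utilization` or `max_model_len`. The following is a minimal sketch with the numbers copied from that log. The 16-token KV-cache block size is an assumption (it is not printed in this log), and the variable names are illustrative only.

```python
# Figures copied from the vLLM log above (all in GiB)
total_gpu_memory = 24.00
gpu_memory_utilization = 0.70
model_weights = 6.59
non_torch_memory = 0.02
activation_peak = 7.51

# Memory vLLM is allowed to use, and what remains for the KV cache
budget = total_gpu_memory * gpu_memory_utilization
kv_cache = budget - model_weights - non_torch_memory - activation_peak

# Assumption: each KV-cache block holds 16 tokens
block_size = 16
cuda_blocks = 3139
max_model_len = 4096

# How many full-length (4096-token) requests fit in the KV cache at once
max_concurrency = cuda_blocks * block_size / max_model_len

print(f"budget:      {budget:.2f} GiB")
print(f"kv cache:    {kv_cache:.2f} GiB")
print(f"concurrency: {max_concurrency:.2f}x")
```

Under that block-size assumption, the three printed values agree with the 16.80GiB, 2.68GiB, and 12.26x reported in the log.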


## Step 6: Test Generation and Outputs

In [None]:
question = "What are the AI Blueprints Repository best practices?"
results = mm.generate(question, force_regenerate=True)
print("--- MODEL CONTEXT ---")
print(results["retrieved_sources"])

print("\n--- MODEL RESPONSE ---")
display(Markdown(results["reply"]))
print("----------------------\n")

display_images(results["used_images"])
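
Printing `results["retrieved_sources"]` directly produces one long dictionary that is hard to scan. As a sketch only, a small helper could print one line per retrieved item instead. `summarize_sources` is a hypothetical name that is not defined elsewhere in this notebook, and the sample dictionary is made-up data that mirrors the structure returned by `generate`.

```python
def summarize_sources(retrieved_sources: dict, max_chars: int = 80) -> list[str]:
    """Return one short line per retrieved text chunk and image."""
    lines = []
    for doc in retrieved_sources.get("text_documents", []):
        # Collapse newlines and repeated spaces, then cut to a short preview
        snippet = " ".join(doc["content"].split())[:max_chars]
        lines.append(f"[text] {doc['source']}: {snippet}")
    for img in retrieved_sources.get("images", []):
        lines.append(f"[image] {img['filename']}")
    return lines


# Hypothetical sample with the same keys as results["retrieved_sources"]
example_sources = {
    "text_documents": [
        {"source": "docs/example.md", "content": "Example\n\ncontent  here", "metadata": {}}
    ],
    "images": [{"path": "images/example.png", "filename": "example.png"}],
}

for line in summarize_sources(example_sources):
    print(line)
```

In the cells of this step, `print(results["retrieved_sources"])` could then be swapped for a loop over `summarize_sources(results["retrieved_sources"])`.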

In [22]:
%%time

question1 = "What is the capital of paris?"
results = mm.generate(question1, force_regenerate=True)
print("--- MODEL CONTEXT ---")
print(results["retrieved_sources"])

print("\n--- MODEL RESPONSE ---")
display(Markdown(results["reply"]))
print("----------------------\n")

display_images(results["used_images"])

2025-08-05 03:07:03 - INFO - Forced regeneration for query: 'What is the capital of paris?'. Clearing old cache entry.
2025-08-05 03:07:03 - INFO - CACHE MISS for query: 'What is the capital of paris?'. Running full pipeline.
2025-08-05 03:07:03 - INFO - Processed image 1: ../data/context/images/image-21f1f2bf-917f-4a1c-af46-3f71d885aad6.png
2025-08-05 03:07:03 - INFO - Processed image 2: ../data/context/images/image-68e01678-f7a5-4be7-9284-6355854edb03.png



--- MODEL CONTEXT ---
{'text_documents': [{'source': 'Dogfooding-Experiences/How-to-add-a-bug-for-dogfooding.md', 'content': "How to add a bug for dogfooding\n\n![image.png](/.attachments/image-68e01678-f7a5-4be7-9284-6355854edb03.png)\nYou should now see a blank bug template:\n![image.png](/.attachments/image-68354047-38d8-4ca9-9e6e-d942698b6dc0.png)  \n`\nIMPORTANT ADO WILL NOT AUTO SAVE. Remember to hit the save button a lot. Do not close with out saving. The save button is in the upper right hand corner.  \n`  \n3. Fill in the basics:(to make sure we have basic info you can't save unless you get rid of all the fields with ...", 'metadata': {'chunk_id': 3, 'section_header': '', 'source': 'Dogfooding-Experiences/How-to-add-a-bug-for-dogfooding.md', 'title': 'How to add a bug for dogfooding'}}, {'source': 'Development/RFCs/Private-Projects.md', 'content': 'Private Projects\n\n![groups-details.png](/.attachments/groups-details-297d6616-fa4f-4e69-b451-2dc37d366219.png)', 'metadata': {'c

The provided context does not contain relevant information to answer the query.

----------------------

▶ No images to display.
CPU times: user 1.21 s, sys: 88.3 ms, total: 1.29 s
Wall time: 1.16 s
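
No images are shown for this query because `generate` clears `used_images` when the reply equals the refusal sentence exactly. An exact string comparison would miss a reply that differs only in casing, spacing, or trailing punctuation. The following is a minimal sketch of a more tolerant check; `NO_CONTEXT_REPLY`, `_normalize`, and `is_refusal` are hypothetical names and are not part of the `QwenVLMM` class.

```python
NO_CONTEXT_REPLY = (
    "The provided context does not contain relevant information to answer the query."
)


def _normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(text.lower().split()).rstrip(".!")


def is_refusal(reply: str) -> bool:
    """Return True when the reply is the refusal sentence, ignoring case and spacing."""
    return _normalize(reply) == _normalize(NO_CONTEXT_REPLY)


print(is_refusal("  " + NO_CONTEXT_REPLY + "\n"))  # extra whitespace is ignored
print(is_refusal("Paris is a city."))              # any other answer is not a refusal
```

This still relies on the system prompt instructing the model to answer with that sentence, so it is a tolerance tweak and not a general relevance detector.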


In [None]:
%%time

question2 = "What is ITG, STG, and Prod?"
results = mm.generate(question2, force_regenerate=True)
print("--- MODEL CONTEXT ---")
print(results["retrieved_sources"])

print("\n--- MODEL RESPONSE ---")
display(Markdown(results["reply"]))
print("----------------------\n")

display_images(results["used_images"])

In [24]:
end_time: float = time.time()
elapsed_time: float = end_time - start_time
elapsed_minutes: int = int(elapsed_time // 60)
elapsed_seconds: float = elapsed_time % 60

logger.info(f"⏱️ Total execution time: {elapsed_minutes}m {elapsed_seconds:.2f}s")
logger.info("✅ Notebook execution completed successfully.")

2025-08-05 03:07:07 - INFO - ⏱️ Total execution time: 10m 23.51s
2025-08-05 03:07:07 - INFO - ✅ Notebook execution completed successfully.


Built with ❤️ using Z by HP AI Studio.