<h1 style="text-align: center; font-size: 50px;"> 🤖 MLflow Registration for Multimodal RAG with Local Chroma Cache</h1>

# MLflow Model Service

In this section, we deploy a RAG-based chatbot as a service. The service exposes a REST API endpoint that lets users query the knowledge base with natural-language questions, use any Azure DevOps wiki as the knowledge base, and manage conversation history. It encapsulates the functionality developed in this notebook: the document retrieval system, RAG-based question answering, and MLflow integration for observability and evaluation. To reduce processing time, this service stores the embedding databases locally.
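
As a rough illustration of what a request to such an endpoint might look like, the sketch below builds JSON bodies in the `dataframe_split` format accepted by MLflow's scoring server. The column names (`command`, `query`, `payload`, `force_regenerate`) and the `config` / `secrets` payload keys follow the model signature and `predict` method defined later in this notebook; the helper name `build_request`, the example question, and the placeholder token are assumptions made for illustration only.

```python
import json


def build_request(command, **fields):
    """Build a `dataframe_split` request body (hypothetical helper).

    Only fields that are actually set are sent, so optional columns
    are simply omitted instead of being passed as null.
    """
    row = {"command": command}
    row.update({name: value for name, value in fields.items() if value is not None})
    return {"dataframe_split": {"columns": list(row), "data": [list(row.values())]}}


# A question for the chatbot (example text is made up).
query_body = build_request(
    "query",
    query="How do I configure the workspace?",
    force_regenerate=False,
)

# A knowledge-base refresh. The payload column is a JSON string, because the
# service parses it with json.loads; the token value is a placeholder.
update_body = build_request(
    "update_kb",
    payload=json.dumps({"config": {}, "secrets": {"AIS_ADO_TOKEN": "<token>"}}),
)

print(json.dumps(query_body, indent=2))
```

Each body would then be sent as the JSON payload of a POST request to the deployed model's scoring route, whose exact URL depends on how the service is hosted.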

## Step 0: Imports and Environment Setup

In [1]:
import time
import os 
from pathlib import Path
import sys
import logging

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

# Create logger
logger = logging.getLogger("multimodal_rag_logger")
logger.setLevel(logging.INFO)
if not logger.handlers:
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)
logger.propagate = False

In [2]:
start_time = time.time()  

logger.info("Notebook execution started.")

2025-08-05 00:49:10 - INFO - Notebook execution started.


In [3]:
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [4]:
# === Standard Library Imports ===
import gc
import json
import base64
import tempfile
import threading
import shutil
import warnings
import hashlib
from typing import Any, Dict, List, Optional, TypedDict
from collections import defaultdict

# === Third-Party Library Imports ===
from IPython.display import display, Image, Markdown
from rank_bm25 import BM25Okapi
import mlflow
import pandas as pd
import torch
from langchain.schema.document import Document
from langchain.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from mlflow.models.signature import ModelSignature
from mlflow.tracking import MlflowClient
from mlflow.types import ColSpec, DataType, Schema
from PIL import Image as PILImage
from sentence_transformers import SentenceTransformer
from transformers import AutoImageProcessor, AutoTokenizer, SiglipModel, SiglipProcessor
from vllm import LLM, SamplingParams

# === Project-Specific Imports ===
# Add the project root to the system path to allow importing from 'src'
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from src.components import SemanticCache, SiglipEmbeddings
from src.wiki_pages_clone import orchestrate_wiki_clone
from src.local_genai_judge import LocalGenAIJudge
from src.utils import (
    configure_hf_cache,
    multimodal_rag_asset_status,
    load_config,
    load_secrets,
    load_mm_docs_clean,
)

2025-08-05 00:49:19.081531: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-08-05 00:49:19.091308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754354959.101627    4724 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754354959.104670    4724 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1754354959.113605    4724 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

INFO 08-05 00:49:20 [__init__.py:244] Automatically detected platform cuda.


In [5]:
warnings.filterwarnings("ignore")
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

Using device: cuda


## Step 1: Configurations

### Verify Assets

In [7]:
CONFIG_PATH = "../configs/config.yaml"
SECRETS_PATH = "../configs/secrets.yaml"

LOCAL_MODEL = "/home/jovyan/datafabric/Qwen2.5-VL-7B-Instruct-GPTQ-Int4-1"
CONTEXT_DIR: Path = Path("../data/context")
CHROMA_DIR: Path = Path("../data/chroma_store")     
CACHE_DIR: Path = CHROMA_DIR / "semantic_cache"
MANIFEST_PATH: Path = CHROMA_DIR / "manifest.json"

IMAGE_DIR = CONTEXT_DIR / "images"
WIKI_METADATA_DIR = CONTEXT_DIR / "wiki_flat_structure.json"

CHROMA_DIR.mkdir(parents=True, exist_ok=True)
CACHE_DIR.mkdir(parents=True, exist_ok=True)

multimodal_rag_asset_status(
    local_model_path=LOCAL_MODEL,
    config_path=CONFIG_PATH,
    secrets_path=SECRETS_PATH,
    wiki_metadata_dir=WIKI_METADATA_DIR,
    context_dir=CONTEXT_DIR,
    chroma_dir=CHROMA_DIR,
    cache_dir=CACHE_DIR,
    manifest_path=MANIFEST_PATH
)

2025-08-05 00:49:22 - INFO - Local Model is properly configured. 
2025-08-05 00:49:22 - INFO - Config is properly configured. 
2025-08-05 00:49:22 - INFO - Secrets is properly configured. 
2025-08-05 00:49:22 - INFO - wiki_flat_structure.json is properly configured. 
2025-08-05 00:49:22 - INFO - CONTEXT is properly configured. 
2025-08-05 00:49:22 - INFO - CHROMA is properly configured. 
2025-08-05 00:49:22 - INFO - CACHE is properly configured. 
2025-08-05 00:49:22 - INFO - MANIFEST is properly configured. 


In [8]:
config = load_config(CONFIG_PATH)

### Configure Hugging Face Caches

In the next cell, we configure the Hugging Face cache so that every model downloaded from Hugging Face is persisted locally, even after the workspace is closed. This is a desired future feature for AI Studio and the GenAI addon.
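
The implementation of `configure_hf_cache` lives in `src.utils` and is not shown in this notebook. A minimal sketch of the general idea, assuming it amounts to pointing the `HF_HOME` environment variable at a directory that survives workspace restarts, could look like the following. The function name `configure_hf_cache_sketch` and the demo directory are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def configure_hf_cache_sketch(cache_root: str) -> Path:
    """Point Hugging Face libraries at a chosen cache directory (illustrative only)."""
    cache_dir = Path(cache_root).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # HF_HOME is the root directory Hugging Face libraries use for cached files.
    # It generally needs to be set before those libraries are imported.
    os.environ["HF_HOME"] = str(cache_dir)
    return cache_dir


# Demo only: a real deployment would pass a path on persistent storage
# rather than a location under the system temp directory.
demo_dir = configure_hf_cache_sketch(os.path.join(tempfile.gettempdir(), "hf_cache_demo"))
print(f"Hugging Face cache root: {demo_dir}")
```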

In [9]:
configure_hf_cache()

In [10]:
# Initialize HuggingFace Embeddings
txt_embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    cache_folder="/tmp/hf_cache"
)

### MLflow Configuration

In [11]:
MODEL_NAME = "AIStudio-Multimodal-Chatbot-Model"
RUN_NAME = f"Register_{MODEL_NAME}"
EXPERIMENT_NAME = "AIStudio-Multimodal-Chatbot-Experiment"

# Set MLflow tracking URI and experiment
# This should be configured for your environment, e.g., a remote server or local file path
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "/phoenix/mlflow"))
mlflow.set_experiment(experiment_name=EXPERIMENT_NAME)

logger.info(f"Using MLflow tracking URI: {mlflow.get_tracking_uri()}")
logger.info(f"Using MLflow experiment: '{EXPERIMENT_NAME}'")

2025-08-05 00:49:24 - INFO - Using MLflow tracking URI: /phoenix/mlflow
2025-08-05 00:49:24 - INFO - Using MLflow experiment: 'AIStudio-Multimodal-Chatbot-Experiment'


## Step 2: MLflow Model Setup
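
The service defined in this step rebuilds its vector stores only when the source data changes, which it detects by comparing a SHA-256 hash of `wiki_flat_structure.json` against a saved `manifest.json`. The standalone sketch below mirrors that check with the standard library only. The function names `build_manifest` and `needs_reindex` and the sample file contents are hypothetical, and unlike the service this sketch does not treat an empty manifest as an error.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(data_file: Path) -> dict:
    """Map the file name to the SHA-256 hex digest of its bytes."""
    if not data_file.exists():
        return {}
    return {data_file.name: hashlib.sha256(data_file.read_bytes()).hexdigest()}


def needs_reindex(data_file: Path, manifest_file: Path) -> bool:
    """Return True when the current hash differs from the stored manifest."""
    current = build_manifest(data_file)
    previous = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    return current != previous


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "wiki_flat_structure.json"
    manifest = Path(tmp) / "manifest.json"

    data.write_text('{"pages": ["Home"]}')
    first_check = needs_reindex(data, manifest)    # no manifest saved yet

    manifest.write_text(json.dumps(build_manifest(data)))
    second_check = needs_reindex(data, manifest)   # content unchanged

    data.write_text('{"pages": ["Home", "Setup"]}')
    third_check = needs_reindex(data, manifest)    # content changed

print(first_check, second_check, third_check)
```

Hashing a single metadata file keeps the check cheap, at the cost of missing changes that do not touch that file.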

In [None]:
class MultimodalRagModel(mlflow.pyfunc.PythonModel):
    """
    An MLflow PythonModel for a dynamic, updatable Multimodal RAG pipeline
    using Qwen-VL and a persistent ChromaDB cache.

    This model acts as a service with two main commands:
    - 'update_kb': Incrementally updates the vector knowledge base from a source.
    - 'query': Answers a user's question using the RAG pipeline, with
      built-in semantic caching and self-evaluation.
    """

    # ==========================================================================
    # 1. Inner Class for the RAG Generation Pipeline
    # ==========================================================================
    class QwenVLRAGPipeline:
        """Minimal, self-contained multimodal QA wrapper for Qwen-VL with vLLM."""
        def __init__(self, llm: "LLM", tok: "AutoTokenizer", image_processor: "AutoImageProcessor", device: str, cache: Any, text_db: Chroma, image_db: Chroma, bm25_index: Optional[BM25Okapi], doc_map: dict):
            self.llm = llm
            self.tok = tok
            self.image_processor = image_processor
            self.device = device
            self.cache = cache
            self.text_db = text_db
            self.image_db = image_db
            self.bm25_index = bm25_index
            self.doc_map = doc_map

        @staticmethod
        def _reciprocal_rank_fusion(results: list[list[Document]], k: int = 60) -> list[tuple[Document, float]]:
            """Performs Reciprocal Rank Fusion on multiple ranked lists of documents."""
            ranked_lists = [{doc.page_content: (doc, i + 1) for i, doc in enumerate(res)} for res in results]
            rrf_scores = defaultdict(float)
            all_docs = {}
            # Iterate through each ranked list and calculate RRF scores
            for ranked_list in ranked_lists:
                for content, (doc, rank) in ranked_list.items():
                    rrf_scores[content] += 1 / (k + rank)
                    if content not in all_docs:
                        all_docs[content] = doc
            fused_results = [(all_docs[content], rrf_scores[content]) for content in sorted(rrf_scores, key=rrf_scores.get, reverse=True)]
            return fused_results

        def _retrieve_mm(self, query: str, k_text: int = 3, k_img: int = 2, recall_k: int = 20) -> Dict[str, Any]:
            """Retrieves relevant documents and images using dense and sparse retrieval."""
            dense_hits = self.text_db.similarity_search(query, k=recall_k)
            sparse_hits = []
            
            # If BM25 index is available, perform sparse retrieval
            if self.bm25_index and list(self.doc_map.keys()):
                tokenized_query = query.lower().split(" ")
                sparse_texts = self.bm25_index.get_top_n(tokenized_query, list(self.doc_map.keys()), n=recall_k)
                sparse_hits = [self.doc_map[text] for text in sparse_texts]

            if not dense_hits and not sparse_hits:
                return {"docs": [], "scores": [], "images": []}

            # Combine dense and sparse hits, ensuring no duplicates
            fused_results = self._reciprocal_rank_fusion([dense_hits, sparse_hits])
            final_docs = [doc for doc, score in fused_results[:k_text]]
            final_scores = [score for doc, score in fused_results[:k_text]]

            # Retrieve images based on the sources of the final documents
            retrieved_images = []
            if final_docs and self.image_db:
                final_sources = list(set(d.metadata["source"] for d in final_docs))
                image_hits = self.image_db.similarity_search(query, k=k_img, filter={"source": {"$in": final_sources}})
                retrieved_images = [img.page_content for img in image_hits]

            return {"docs": final_docs, "scores": final_scores, "images": retrieved_images}

        def generate(self, query: str, force_regenerate: bool = False, **retrieval_kwargs) -> Dict[str, Any]:
            """
            Generates a response using the Qwen-VL RAG pipeline, with semantic caching.
            """
            start_gen_time = time.time()
            
            # Check if the query is already cached
            if not force_regenerate and self.cache:
                cached_result = self.cache.get(query, threshold=0.92)
                if cached_result:
                    logger.info(f"SEMANTIC CACHE HIT for query: '{query}'")
                    cached_result.setdefault("generation_time_seconds", 0.0)
                    return cached_result
            
            # On forced regeneration, delete any existing cache entry for this query
            if force_regenerate and self.cache:
                logger.info(f"Forced regeneration for query: '{query}'. Clearing old cache entry.")
                self.cache.delete(query)

            logger.info(f"CACHE MISS for query: '{query}'. Running full pipeline.")

            # --- Document & Image Retrieval ---
            hits = self._retrieve_mm(query, **retrieval_kwargs)
            docs, images = hits["docs"], hits["images"]
            
            if not docs and not images:
                return {"reply": "Based on the provided context, I cannot answer this question.", "used_images": [], "generation_time_seconds": 0.0}

            # Limit the number of images to 2 for memory efficiency
            if len(images) > 2:
                logger.warning(f"Limiting images from {len(images)} to 2 to save memory")
                images = images[:2]
                
            context_str = "\n\n".join(
                f"<source_document name=\"{d.metadata.get('source', 'unknown')}\">\n{d.page_content}\n</source_document>"
                for d in docs
            )
            
            system_prompt = """You are a Multimodal RAG Assistant. Your task is to answer the user's query using ONLY the provided context...""" # Your full prompt
            
            if images:
                image_tokens = "".join(["<|vision_start|><|image_pad|><|vision_end|>" for _ in images])
                user_content = f"{image_tokens}\n\n<context>\n{context_str}\n</context>\n\n<user_query>\n{query}\n</user_query>"
            else:
                user_content = f"<context>\n{context_str}\n</context>\n\n<user_query>\n{query}\n</user_query>"

            messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_content}]
            prompt_string = self.tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

            try:
                # ... Image processing and vLLM call ...
                pil_images = []
                if images:
                    # ... (image processing logic with 512x512 thumbnail) ...
                    for i, img_path in enumerate(images):
                        try:
                            img = PILImage.open(img_path).convert("RGB")
                            if img.width > 512 or img.height > 512:
                                img.thumbnail((512, 512), PILImage.Resampling.LANCZOS)
                            pil_images.append(img)
                        except Exception as e:
                            logger.warning(f"Failed to process image {img_path}: {e}")
                
                request_payload = {"prompt": prompt_string}
                if pil_images:
                    request_payload["multi_modal_data"] = {"image": pil_images}

                output_list = self.llm.generate([request_payload], SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048))
                reply = output_list[0].outputs[0].text.strip() if output_list and output_list[0].outputs else "Error: no output from LLM."
                
                end_gen_time = time.time()
                
                if reply == "The provided context does not contain relevant information to answer the query.":
                    images = []
                
                result_dict = {"reply": reply, "used_images": images, "generation_time_seconds": end_gen_time - start_gen_time}

                # --- Semantic Caching ---
                if self.cache:
                    self.cache.set(query, result_dict)
                
                return result_dict

            except Exception as e:
                logger.error(f"Qwen-VL generation failed: {e}", exc_info=True)
                return {"reply": f"Error during generation: {e}", "used_images": images, "generation_time_seconds": 0.0}
        
        def _clear_cuda(self):
            if torch.cuda.is_available():
                gc.collect()
                torch.cuda.empty_cache()

    # ==========================================================================
    # 2. MLflow `pyfunc` Life-cycle and Service Methods
    # ==========================================================================
    def load_context(self, context: mlflow.pyfunc.PythonModelContext) -> None:
        """Initializes the service, loads models, and sets up persistent storage."""
        
        logger.info("--- Initializing Dynamic MultimodalRAG Service (Qwen-VL) ---")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        # Setup persistent storage paths
        self.persistent_storage_path = Path("/tmp/multimodal_rag_service_data")
        self.persistent_storage_path.mkdir(parents=True, exist_ok=True)
        self.chroma_store_path = self.persistent_storage_path / "chroma_store"
        self.cache_dir = self.persistent_storage_path / "semantic_cache"
        self.manifest_path = self.persistent_storage_path / "manifest.json"
        logger.info(f"Service data will be managed at: {self.persistent_storage_path}")

        # Load embedding models from artifacts
        e5_model_path = context.artifacts["e5_model_dir"]
        siglip_model_path = context.artifacts["siglip_model_dir"]
        self.text_embed_model = HuggingFaceEmbeddings(model_name=e5_model_path, model_kwargs={"device": self.device})
        self.siglip_embed_model = SiglipEmbeddings(model_id=siglip_model_path, device=self.device)

        # Load Qwen-VL model using vLLM
        if self.device == "cuda":
            model_path = Path(context.artifacts["local_model_dir"]).resolve()
            base_model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
            self.tok = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
            self.image_processor = AutoImageProcessor.from_pretrained(base_model_name, trust_remote_code=True)
            self.llm = LLM(
                model=str(model_path),
                quantization="gptq",
                gpu_memory_utilization=0.80,
                max_model_len=4096,
                enforce_eager=True,
                limit_mm_per_prompt={"image": 2},
                disable_custom_all_reduce=True,
                tensor_parallel_size=1,
                dtype="float16",
            )
            self.judge = LocalGenAIJudge(llm=self.llm, tokenizer=self.tok)
            logger.info("Qwen-VL LLM and Judge loaded successfully.")
        else:
            self.llm, self.tok, self.image_processor, self.judge = None, None, None, None
            logger.error("vLLM requires a CUDA device. LLM and Judge are disabled.")
        
        # Initialize placeholders for RAG components
        self.text_db, self.image_db, self.bm25_index, self.doc_map, self.cache, self.mm_llm = [None] * 6
        logger.info("--- Service initialized. Ready for commands. ---")

    def predict(self, context: mlflow.pyfunc.PythonModelContext, model_input: pd.DataFrame) -> pd.DataFrame:
        """Handles 'update_kb' and 'query' commands."""
        command = model_input["command"].iloc[0]
        logger.info(f"Received command: '{command}'")

        # if the command is "update_kb", update the knowledge base
        if command == "update_kb":
            payload = json.loads(model_input["payload"].iloc[0])
            result = self.update_knowledge_base(config=payload["config"], secrets=payload["secrets"])
            return pd.DataFrame([result])
    
        # if the command is "query", process the query
        elif command == "query":
            if self.mm_llm is None:
                logger.info("First query received. Initializing RAG pipeline from persistent storage...")
                self._initialize_rag_pipeline()
            
            query = model_input["query"].iloc[0]
            force_regenerate = model_input.get("force_regenerate", pd.Series([False])).iloc[0]
            response_dict = self.mm_llm.generate(query, force_regenerate=force_regenerate)
            
            # Convert image paths to Base64 for JSON response
            image_paths = response_dict.get("used_images", [])
            base64_images = []
            for path in image_paths:
                try:
                    with open(path, "rb") as img_file:
                        encoded_string = base64.b64encode(img_file.read()).decode('utf-8')
                        base64_images.append(encoded_string)
                except FileNotFoundError:
                    logger.warning(f"Image file not found: {path}")
            response_dict["used_images"] = json.dumps(base64_images)
    
            # Score the reply for faithfulness and relevance with the local LLM judge
            # (retrieval is re-run here to rebuild the context string)
            if self.judge:
                retrieved_info = self.mm_llm._retrieve_mm(query)
                context_str = "\n\n".join(d.page_content for d in retrieved_info["docs"])
                eval_df = pd.DataFrame([{"questions": query, "result": response_dict["reply"], "source_documents": context_str}])
                response_dict["faithfulness"] = self.judge.evaluate_faithfulness(eval_df).iloc[0]
                response_dict["relevance"] = self.judge.evaluate_relevance(eval_df).iloc[0]
            else:
                response_dict["faithfulness"], response_dict["relevance"] = None, None
            
            return pd.DataFrame([response_dict])
    
        else:
            return pd.DataFrame([{"status": "error", "message": f"Unknown command: {command}"}])

    def update_knowledge_base(self, config: dict, secrets: dict) -> dict:
        """Checks for source data changes and rebuilds the knowledge base if necessary."""
        logger.info("Starting knowledge base update check...")
        processed_data_dir = self.persistent_storage_path / "processed_data"
        processed_data_dir.mkdir(parents=True, exist_ok=True)

        try:
            # Use a temporary directory for the initial clone to ensure atomicity
            with tempfile.TemporaryDirectory() as temp_dir_str:
                temp_path = Path(temp_dir_str)

                # Now, pass the Path object to the cloning function
                orchestrate_wiki_clone(pat=secrets['AIS_ADO_TOKEN'], config=config, output_dir=temp_path)

                # Create a manifest for the cloned data
                current_manifest = self._create_json_manifest(temp_path)
                if not current_manifest:
                    return {"status": "error", "message": "Cloning step failed to produce a manifest file."}

                old_manifest = json.loads(self.manifest_path.read_text()) if self.manifest_path.exists() else {}

                # If the data hasn't changed, no need to re-index
                if current_manifest == old_manifest:
                    logger.info("Knowledge base is already up-to-date.")
                    # Initialize the pipeline if it's the first run
                    if not self.mm_llm:
                        self._initialize_rag_pipeline()
                    return {"status": "success", "message": "Knowledge base is already up-to-date."}

                # If data has changed, perform a full re-index
                logger.info("Knowledge base has changed. Performing a full re-index.")
                if self.chroma_store_path.exists():
                    shutil.rmtree(self.chroma_store_path)
                self.chroma_store_path.mkdir()

                # Copy the successfully cloned data from the temp directory to the permanent location
                shutil.copytree(temp_path, processed_data_dir, dirs_exist_ok=True)

            # Process data from the permanent location
            image_dir = processed_data_dir / "images"
            wiki_metadata_path = processed_data_dir / "wiki_flat_structure.json"
            all_raw_docs = load_mm_docs_clean(wiki_metadata_path, image_dir)
            all_chunks = self._chunk_docs(all_raw_docs)

            # Index text chunks
            if all_chunks:
                logger.info(f"Indexing {len(all_chunks)} text chunks...")
                Chroma.from_documents(
                    documents=all_chunks, embedding=self.text_embed_model,
                    persist_directory=str(self.chroma_store_path), collection_name="mm_text"
                )

            # Index images
            img_paths, img_ids, img_meta = self._collect_image_vectors(all_raw_docs, image_dir)
            if img_paths:
                logger.info(f"Indexing {len(img_paths)} images...")
                image_db = Chroma(
                    collection_name="mm_image", persist_directory=str(self.chroma_store_path),
                    embedding_function=self.siglip_embed_model,
                )
                image_db.add_texts(texts=img_paths, metadatas=img_meta, ids=img_ids)
                image_db.persist()

            # Clear the old semantic cache and save the new manifest
            if self.cache_dir.exists():
                shutil.rmtree(self.cache_dir)
            self.manifest_path.write_text(json.dumps(current_manifest, indent=2))
            
            # Re-initialize the full RAG pipeline with the new data
            self._initialize_rag_pipeline()

            return {"status": "success", "message": "Knowledge base rebuilt successfully."}
        except Exception as e:
            logger.error(f"Knowledge base update failed: {e}", exc_info=True)
            return {"status": "error", "message": str(e)}

    # ==========================================================================
    # 3. Helper and Class Methods
    # ==========================================================================
    def _initialize_rag_pipeline(self):
        """Initializes all RAG components from the persistent storage."""
        if not self.chroma_store_path.exists():
            raise FileNotFoundError("ChromaDB not found. Run 'update_kb' first.")
        
        # Setup chromadbs
        logger.info("Loading RAG components from persistent storage...")
        self.text_db = Chroma(collection_name="mm_text", persist_directory=str(self.chroma_store_path), embedding_function=self.text_embed_model)
        self.image_db = Chroma(collection_name="mm_image", persist_directory=str(self.chroma_store_path), embedding_function=self.siglip_embed_model)
    
        # Load all documents from the text database and deduplicate them
        all_docs_data = self.text_db.get(include=["documents", "metadatas"])
        if not all_docs_data['ids']:
            unique_splits = []
        else:
            all_docs = [Document(page_content=txt, metadata=meta) for txt, meta in zip(all_docs_data['documents'], all_docs_data['metadatas'])]
            unique_splits = list({doc.page_content: doc for doc in all_docs}.values())

        # Create BM25 index from unique text splits
        corpus = [doc.page_content for doc in unique_splits]
        # Lowercase the corpus tokens so they match the lowercased query tokens in _retrieve_mm
        self.bm25_index = BM25Okapi([doc.lower().split(" ") for doc in corpus]) if corpus else None
        self.doc_map = {doc.page_content: doc for doc in unique_splits}
    
        # Initialize the semantic cache
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.cache = SemanticCache(persist_directory=str(self.cache_dir), embedding_function=self.text_embed_model)
        
        if not self.llm:
            raise RuntimeError("LLM not loaded (requires CUDA). Cannot initialize RAG pipeline.")
        
        # Initialize the multimodal RAG pipeline
        self.mm_llm = self.QwenVLRAGPipeline(
            llm=self.llm, tok=self.tok, image_processor=self.image_processor, device=self.device, cache=self.cache,
            text_db=self.text_db, image_db=self.image_db,
            bm25_index=self.bm25_index, doc_map=self.doc_map
        )
        logger.info("✅ RAG pipeline fully initialized with Qwen-VL.")

    def _create_json_manifest(self, context_dir: Path) -> Dict[str, str]:
        """Creates a manifest by hashing 'wiki_flat_structure.json'."""
        manifest = {}
        json_file = context_dir / "wiki_flat_structure.json"
        if json_file.exists():
            file_bytes = json_file.read_bytes()
            manifest[json_file.name] = hashlib.sha256(file_bytes).hexdigest()
        return manifest
    
    def _chunk_docs(self, docs: List[Document]) -> List[Document]:
        """Chunks raw documents and assigns unique IDs."""
        # This method chunks documents based on headers and recursive character splitting.
        header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "title"), ("##", "section")])
        recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
        all_chunks: list[Document] = []
        for doc in docs:
            page_title = Path(doc.metadata["source"]).stem.replace("-", " ")
            section_docs = header_splitter.split_text(doc.page_content)
            doc_chunk_counter = 0
            for section in section_docs:
                tiny_texts = recursive_splitter.split_text(section.page_content)
                for tiny in tiny_texts:
                    chunk_metadata = {
                        "title": page_title,
                        "source": doc.metadata["source"],
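                        # The header splitter stores headers under the names given above ("title", "section")
                        "section_header": section.metadata.get("section") or section.metadata.get("title", ""),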
                        "chunk_id": doc_chunk_counter,
                    }
                    all_chunks.append(Document(page_content=f"{page_title}\n\n{tiny.strip()}", metadata=chunk_metadata))
                    doc_chunk_counter += 1
        return all_chunks
        
    def _collect_image_vectors(self, mm_raw_docs: List[Document], image_dir: Path) -> tuple:
        """Collects paths, IDs, and metadata for unique images."""
        # This method collects image paths, IDs, and metadata from the raw documents.
        img_paths, img_ids, img_meta = [], [], []
        seen = set()
        for doc in mm_raw_docs:
            src = doc.metadata["source"]
            for name in doc.metadata.get("images", []):
                img_id = f"{src}::{name}"
                if img_id in seen:
                    continue
                seen.add(img_id)
                img_paths.append(str(image_dir / name))
                img_ids.append(img_id)
                img_meta.append({"source": src, "image": name})
        return img_paths, img_ids, img_meta

    @classmethod
    def log_model(cls, model_name: str, local_model: str) -> None:
        """Logs the model service to MLflow."""
        logger.info(f"--- Logging '{model_name}' Service to MLflow ---")
        # This method logs the model service to MLflow, including artifacts and signatures.
        with tempfile.TemporaryDirectory() as temp_dir:
            temp_path = Path(temp_dir)
            e5_path = temp_path / "e5-large-v2"
            SentenceTransformer("intfloat/e5-large-v2").save(str(e5_path))
            
            siglip_path = temp_path / "siglip2-base-patch16-224"
            SiglipModel.from_pretrained("google/siglip2-base-patch16-224").save_pretrained(siglip_path)
            SiglipProcessor.from_pretrained("google/siglip2-base-patch16-224").save_pretrained(siglip_path)
            
            artifacts = {
                "local_model_dir": local_model,
                "e5_model_dir": str(e5_path),
                "siglip_model_dir": str(siglip_path),
            }
            
            input_schema = Schema([
                ColSpec(DataType.string, "command"),
                ColSpec(DataType.string, "query", required=False),
                ColSpec(DataType.string, "payload", required=False),
                ColSpec(DataType.boolean, "force_regenerate", required=False)
            ])
            output_schema = Schema([
                ColSpec(DataType.string, "reply", required=False),
                ColSpec(DataType.string, "used_images", required=False),
                ColSpec(DataType.double, "generation_time_seconds", required=False),
                ColSpec(DataType.double, "faithfulness", required=False),
                ColSpec(DataType.double, "relevance", required=False),
                ColSpec(DataType.string, "status", required=False),
                ColSpec(DataType.string, "message", required=False),
            ])
            signature = ModelSignature(inputs=input_schema, outputs=output_schema)

            mlflow.pyfunc.log_model(
                artifact_path=model_name,
                python_model=cls(),
                artifacts=artifacts,
                pip_requirements="../requirements.txt",
                signature=signature,
                code_paths=["../src"],
            )
        logger.info(f"✅ Successfully logged '{model_name}' service.")
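
The signature above describes a command-style interface: every request row carries a `command`, and the remaining columns are optional. A minimal sketch of how a caller might assemble such rows is shown below. The helper `build_request` and its validation rules are illustrative assumptions, not part of the service; only the column names and the two commands (`query`, `update_kb`) come from this notebook.

```python
import json
from typing import Any, Dict, Optional

# Column names taken from the input schema; everything else here is hypothetical.
INPUT_COLUMNS = ("command", "query", "payload", "force_regenerate")


def build_request(
    command: str,
    query: Optional[str] = None,
    payload: Optional[Dict[str, Any]] = None,
    force_regenerate: bool = False,
) -> Dict[str, Any]:
    """Return one request row whose keys match the logged input schema."""
    if not command:
        raise ValueError("'command' is the only required column and must be non-empty")
    if command == "query" and not query:
        raise ValueError("a 'query' command needs a non-empty 'query' string")
    row: Dict[str, Any] = {"command": command, "force_regenerate": force_regenerate}
    if query is not None:
        row["query"] = query
    if payload is not None:
        # The schema types 'payload' as a string, so nested data is JSON-encoded.
        row["payload"] = json.dumps(payload)
    return row


query_row = build_request("query", query="What is in the knowledge base?")
update_row = build_request("update_kb", payload={"config": {}, "secrets": {}})
```

Each row can then be wrapped as `pd.DataFrame([row])` and passed to `predict`, which is the shape the later cells use.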

## Step 3: Start Run, Log & Register Model

In [13]:
%%time

# --- Start MLflow Run and Log the Model ---
try:
    with mlflow.start_run(run_name=RUN_NAME) as run:
        run_id = run.info.run_id
        logger.info(f"Started MLflow run: {run_id}")

        # Use the class method to log the model and its artifacts
        MultimodalRagModel.log_model(model_name=MODEL_NAME, local_model=LOCAL_MODEL)

        model_uri = f"runs:/{run_id}/{MODEL_NAME}"
        logger.info(f"Registering model from URI: {model_uri}")
        
        # Register the model in the MLflow Model Registry
        mlflow.register_model(model_uri=model_uri, name=MODEL_NAME)
        logger.info(f"✅ Successfully registered model '{MODEL_NAME}'")

except FileNotFoundError as e:
    logger.error("A required file or directory was not found. Please ensure the project structure is correct.")
    logger.error(f"Details: {e}")
except Exception as e:
    logger.error(f"An unexpected error occurred during the MLflow run: {e}", exc_info=True)

2025-08-05 00:49:25 - INFO - Started MLflow run: 4bc2abcd34464fa28f185b1b29941ae0
2025-08-05 00:49:25 - INFO - --- Logging 'AIStudio-Multimodal-Chatbot-Model' Service to MLflow ---


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


2025-08-05 00:51:28 - INFO - ✅ Successfully logged 'AIStudio-Multimodal-Chatbot-Model' service.
2025-08-05 00:51:28 - INFO - Registering model from URI: runs:/4bc2abcd34464fa28f185b1b29941ae0/AIStudio-Multimodal-Chatbot-Model
2025-08-05 00:51:28 - INFO - ✅ Successfully registered model 'AIStudio-Multimodal-Chatbot-Model'
CPU times: user 3.77 s, sys: 25.8 s, total: 29.6 s
Wall time: 2min 3s


Registered model 'AIStudio-Multimodal-Chatbot-Model' already exists. Creating a new version of this model...
Created version '5' of model 'AIStudio-Multimodal-Chatbot-Model'.


In [14]:
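# --- Retrieve the latest version from the Model Registry ---
try:
    client = MlflowClient()
    # Newly registered versions have no stage assigned, which MLflow reports as "None".
    # Note: registry stages (and get_latest_versions) are deprecated in recent MLflow releases.
    versions = client.get_latest_versions(MODEL_NAME, stages=["None"])
    if not versions:
        raise RuntimeError(f"No registered versions found for model '{MODEL_NAME}'.")

    latest_version = versions[0]
    logger.info(f"Found latest version '{latest_version.version}' for model '{MODEL_NAME}'.")
    # `source` is the artifact location this version was registered from
    model_uri_registry = latest_version.source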

except Exception as e:
    logger.error(f"Failed to retrieve model from registry: {e}", exc_info=True)
    model_uri_registry = None

2025-08-05 00:51:28 - INFO - Found latest version '5' for model 'AIStudio-Multimodal-Chatbot-Model'.
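
Because `get_latest_versions` groups its results by stage, an alternative is to list all versions of the model and pick the highest version number. The sketch below shows only that selection step on stand-in objects. `FakeVersion` is a hypothetical placeholder for the entries a registry query (for example `MlflowClient.search_model_versions`) would return, and it assumes each entry exposes `version` and `source`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeVersion:
    """Hypothetical stand-in for a registry entry with `version` and `source`."""
    version: str
    source: str


def pick_latest(versions: Iterable[FakeVersion]) -> Optional[FakeVersion]:
    """Return the entry with the highest numeric version, or None if empty."""
    # Versions may arrive as strings, so compare them as integers ("10" > "9").
    return max(versions, key=lambda v: int(v.version), default=None)


candidates = [
    FakeVersion("9", "/artifacts/run_a/model"),
    FakeVersion("10", "/artifacts/run_b/model"),
    FakeVersion("4", "/artifacts/run_c/model"),
]
latest = pick_latest(candidates)
```

Comparing as integers matters here: a plain string comparison would rank version "9" above version "10".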


In [15]:
if model_uri_registry:
    try:
        logger.info(f"Loading model from: {model_uri_registry}")
        # This step will trigger the load_context method and initialize the vLLM engine
        loaded_model = mlflow.pyfunc.load_model(model_uri=model_uri_registry)
        logger.info("✅ Successfully loaded model from registry.")
    except Exception as e:
        logger.error(f"Failed to load model from registry URI: {e}", exc_info=True)
        loaded_model = None
else:
    logger.warning("Skipping model loading due to previous errors.")
    loaded_model = None

2025-08-05 00:51:28 - INFO - Loading model from: /phoenix/mlflow/594178897322281329/4bc2abcd34464fa28f185b1b29941ae0/artifacts/AIStudio-Multimodal-Chatbot-Model
2025-08-05 00:51:28 - INFO - --- Initializing Dynamic MultimodalRAG Service (Qwen-VL) ---
2025-08-05 00:51:28 - INFO - Service data will be managed at: /tmp/multimodal_rag_service_data


INFO 08-05 00:51:48 [config.py:841] This model supports multiple tasks: {'generate', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 08-05 00:51:48 [config.py:1472] Using max model len 4096
INFO 08-05 00:51:48 [gptq_marlin.py:174] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
INFO 08-05 00:51:48 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.


2025-08-05 00:51:50.899674: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-08-05 00:51:50.907052: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754355110.915721    4922 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754355110.918213    4922 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1754355110.924729    4922 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

INFO 08-05 00:51:52 [__init__.py:244] Automatically detected platform cuda.
INFO 08-05 00:51:53 [core.py:526] Waiting for init message from front-end.
INFO 08-05 00:51:53 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='/phoenix/mlflow/594178897322281329/4bc2abcd34464fa28f185b1b29941ae0/artifacts/AIStudio-Multimodal-Chatbot-Model/artifacts/Qwen2.5-VL-7B-Instruct-GPTQ-Int4-1', speculative_config=None, tokenizer='/phoenix/mlflow/594178897322281329/4bc2abcd34464fa28f185b1b29941ae0/artifacts/AIStudio-Multimodal-Chatbot-Model/artifacts/Qwen2.5-VL-7B-Instruct-GPTQ-Int4-1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_confi

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:09<00:09,  9.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:19<00:00, 10.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:19<00:00,  9.99s/it]



INFO 08-05 00:52:15 [default_loader.py:272] Loading weights took 17.45 seconds
INFO 08-05 00:52:16 [gpu_model_runner.py:1801] Model loading took 6.5934 GiB and 18.128455 seconds
INFO 08-05 00:52:16 [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.


You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


INFO 08-05 00:52:25 [gpu_worker.py:232] Available KV cache memory: 0.60 GiB
INFO 08-05 00:52:25 [kv_cache_utils.py:716] GPU KV cache size: 11,280 tokens
INFO 08-05 00:52:25 [kv_cache_utils.py:720] Maximum concurrency for 4,096 tokens per request: 2.75x
INFO 08-05 00:52:25 [core.py:172] init engine (profile, create kv cache, warmup model) took 9.39 seconds
2025-08-05 00:52:26 - INFO - Qwen-VL LLM and Judge loaded successfully.
2025-08-05 00:52:26 - INFO - --- Service initialized. Ready for commands. ---
2025-08-05 00:52:26 - INFO - ✅ Successfully loaded model from registry.


## Step 4: Populate the Knowledge Base and Display Results

In [16]:
%%time
if loaded_model:
    logger.info("Populating the model's knowledge base using the 'update_kb' command...")
    try:
        config = load_config(CONFIG_PATH)
        secrets = load_secrets(SECRETS_PATH)
        
        # 1. Construct the payload with credentials and config
        update_payload_dict = {
            "config": config,
            "secrets": {"AIS_ADO_TOKEN": secrets.get('AIS_ADO_TOKEN')}
        }
        
        # 2. Create the DataFrame for the 'update_kb' command
        update_df = pd.DataFrame([{
            "command": "update_kb",
            "payload": json.dumps(update_payload_dict)
        }])
        
        # 3. Send the command to the model to build/verify the database
        update_status = loaded_model.predict(update_df)
        logger.info(f"Update Status: {update_status['message'].iloc[0]}")

    except Exception as e:
        logger.error(f"Failed to update knowledge base: {e}", exc_info=True)
else:
    logger.warning("Skipping knowledge base update because the model was not loaded.")

2025-08-05 00:52:26 - INFO - Populating the model's knowledge base using the 'update_kb' command...
2025-08-05 00:52:26 - INFO - Received command: 'update_kb'
2025-08-05 00:52:26 - INFO - Starting knowledge base update check...
2025-08-05 00:52:26 - INFO - Starting ADO Wiki clone process...
2025-08-05 00:52:26 - INFO - Cloning wiki 'Phoenix-DS-Platform.wiki' to temporary directory: /tmp/tmplomlu0ok
2025-08-05 00:52:41 - INFO - Scanning for Markdown files...
2025-08-05 00:52:41 - INFO - → Found 575 Markdown pages.
2025-08-05 00:52:41 - INFO - Copying referenced images to /tmp/tmpbqg13flv/images...
2025-08-05 00:52:42 - INFO - → 792 unique images copied.
2025-08-05 00:52:42 - INFO - Assembling flat JSON structure...
2025-08-05 00:52:42 - INFO - ✅ Wiki data successfully cloned to /tmp/tmpbqg13flv
2025-08-05 00:52:42 - INFO - Cleaned up temporary directory: /tmp/tmplomlu0ok
2025-08-05 00:52:42 - INFO - Knowledge base is already up-to-date.
2025-08-05 00:52:42 - INFO - Loading RAG component
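
The log above reports that the knowledge base is already up-to-date, so the embeddings are not rebuilt. How the service decides this is not shown in this notebook. One common approach, sketched here purely as an assumption, is to fingerprint the freshly cloned content and compare it with a digest stored alongside the cache. The function names and the page structure are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(pages: List[Dict[str, Any]]) -> str:
    """Return a stable SHA-256 digest of the page list."""
    # sort_keys keeps the serialization stable regardless of dict key order.
    canonical = json.dumps(pages, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_rebuild(pages: List[Dict[str, Any]], cached_digest: str) -> bool:
    """True when the freshly cloned content differs from the cached digest."""
    return fingerprint(pages) != cached_digest


pages_v1 = [{"title": "Home", "content": "Welcome"}]
pages_v2 = [{"title": "Home", "content": "Welcome!"}]
cached = fingerprint(pages_v1)
```

With a check like this, an unchanged wiki costs one clone and one hash, while any edited page triggers a rebuild.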

In [17]:
def display_results(query: str, result_df: pd.DataFrame):
    """Helper to neatly print the query, reply, and display Base64 images."""
    if result_df.empty:
        print("Received an empty result.")
        return

    row = result_df.iloc[0]
    reply = row.get("reply", "No reply generated.")
    used_images_json = row.get("used_images", "[]")
    gen_time = row.get("generation_time_seconds", 0)
    faithfulness = row.get("faithfulness", -1)
    relevance = row.get("relevance", -1)

    try:
        base64_images = json.loads(used_images_json)
    except (json.JSONDecodeError, TypeError):
        base64_images = []

    # Display the output
    print("---" * 20)
    print(f"❓ Query:\n{query}\n")
    print("🤖 Reply:")
    display(Markdown(reply))
    
    print(f"\n📊 Faithfulness: {faithfulness:.4f} | Relevance: {relevance:.4f}")
    print(f"⏱️ Generation Time: {gen_time:.2f}s\n")

    if base64_images:
        print(f"🖼️ Displaying {len(base64_images)} retrieved image(s):")
        for b64_string in base64_images:
            display(Image(data=base64.b64decode(b64_string), width=400))
    else:
        print("▶ No images were retrieved for this query.")
    print("---" * 20 + "\n")
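
`display_results` expects the `used_images` column to hold a JSON array of Base64 strings. The round trip can be sketched in isolation as follows. The helper names and the placeholder bytes are assumptions for illustration; the fallback to an empty list mirrors the `except` branch in the helper above.

```python
import base64
import json
from typing import List


def encode_images(raw_images: List[bytes]) -> str:
    """Pack raw image bytes into a JSON array of Base64 strings."""
    return json.dumps([base64.b64encode(img).decode("ascii") for img in raw_images])


def decode_images(used_images_json: str) -> List[bytes]:
    """Unpack the JSON array; fall back to an empty list on malformed input."""
    try:
        encoded = json.loads(used_images_json)
    except (json.JSONDecodeError, TypeError):
        return []
    return [base64.b64decode(item) for item in encoded]


fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8  # placeholder bytes, not a real image
column_value = encode_images([fake_png])
```

Encoding images as text is what lets them travel through a string-typed column in the model signature.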


In [None]:
all_results = []
if loaded_model:
    logger.info("Running sample inference with the loaded model...")
    sample_queries = [
        "What are the AI Blueprints Repository best practices?",
        "What are some feature flags that I can enable in AIStudio?",
        "How do I manually clean my environment without hooh?",
    ]

    for query in sample_queries:
        try:
            input_payload = pd.DataFrame([{
                "command": "query",
                "query": query,
                "force_regenerate": False
            }])
            
            result_df = loaded_model.predict(input_payload)
            result_df['query'] = query
            display_results(query, result_df)
            all_results.append(result_df)

        except Exception as e:
            logger.error(f"Prediction failed for query '{query}': {e}", exc_info=True)

    final_results_df = pd.concat(all_results, ignore_index=True) if all_results else pd.DataFrame()
else:
    logger.warning("Skipping sample inference because the model was not loaded.")
    final_results_df = pd.DataFrame()

## Step 5: Log Hallucination & Relevance Evaluations to MLflow

In [19]:
# Check if the model was loaded, run_id exists, AND we have results to log
if loaded_model and 'run_id' in locals() and not final_results_df.empty:
    logger.info(f"--- Reopening original run ({run_id}) to log evaluations ---")
    
    # Reopen the existing run using its ID
    with mlflow.start_run(run_id=run_id) as run:
        logger.info("Successfully reopened run. Logging metrics and artifacts...")

        # Calculate average scores from the DataFrame of results
        avg_faithfulness = final_results_df["faithfulness"].astype(float).mean()
        avg_relevance = final_results_df["relevance"].astype(float).mean()

        # Log the average scores as summary metrics
        mlflow.log_metrics({
            "avg_faithfulness": avg_faithfulness,
            "avg_relevance": avg_relevance
        })

        # Log the detailed results as a table artifact for inspection
        mlflow.log_table(
            data=final_results_df[['query', 'reply', 'faithfulness', 'relevance']], 
            artifact_file="inline_evaluation_results.json"
        )
        
        logger.info("✅ Successfully logged metrics and artifacts.")
else:
    logger.warning("Skipping evaluation logging: model not loaded, run_id not found, or no results were generated.")


2025-08-05 00:53:08 - INFO - --- Reopening original run (4bc2abcd34464fa28f185b1b29941ae0) to log evaluations ---
2025-08-05 00:53:08 - INFO - Successfully reopened run. Logging metrics and artifacts...
2025-08-05 00:53:08 - INFO - ✅ Successfully logged metrics and artifacts.
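
The averages above weight every row equally. If a judge call fails and leaves a score missing, it may be preferable to exclude that row rather than let it skew the summary metric. The sketch below assumes valid scores are non-negative and that `None`, `NaN`, or a `-1` sentinel mark a missing score. These conventions are assumptions, since the judge's failure behavior is not shown here.

```python
import math
from statistics import fmean
from typing import Iterable, Optional


def mean_valid(scores: Iterable[Optional[float]]) -> float:
    """Average only usable scores; return NaN when none are usable."""
    valid = [
        s for s in scores
        # Skip None, NaN, and negative sentinels such as -1.
        if s is not None and not math.isnan(s) and s >= 0
    ]
    return fmean(valid) if valid else math.nan


faithfulness_scores = [0.9, None, 0.7, -1.0]
avg = mean_valid(faithfulness_scores)
```

Returning NaN for an empty input keeps the "no usable scores" case distinguishable from a genuine average of zero.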


In [20]:
end_time: float = time.time()
elapsed_time: float = end_time - start_time
elapsed_minutes: int = int(elapsed_time // 60)
elapsed_seconds: float = elapsed_time % 60

logger.info(f"⏱️ Total execution time: {elapsed_minutes}m {elapsed_seconds:.2f}s")
logger.info("✅ Notebook execution completed.")

2025-08-05 00:53:08 - INFO - ⏱️ Total execution time: 3m 58.02s
2025-08-05 00:53:08 - INFO - ✅ Notebook execution completed.


Built with ❤️ using Z by HP AI Studio.