<h1 style="text-align: center; font-size: 50px;"> 🤖 MLflow Registration for Multimodal RAG (No Local DB Cache)
</h1>

# MLflow Model Service

In this section, we demonstrate how to deploy a RAG-based chatbot service. The service provides a REST API endpoint through which users can query the knowledge base with natural-language questions, upload new documents to the knowledge base, and manage conversation history, all with built-in safeguards against sensitive information and toxicity. It encapsulates the functionality developed in this notebook, including the document retrieval system, the RAG-based question answering capabilities, and the MLflow integration for observability and evaluation. It also demonstrates how to use our ChatbotService from the src/service directory.
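
As a rough illustration of what a query to this service might look like: the model registered in Step 2 declares two string input columns, `query` and `payload`, where `payload` is a JSON string holding `config` and `secrets`. The sketch below builds a request body in MLflow's `dataframe_records` serving format using only the standard library. The helper name, the config contents, and the token value are placeholders for illustration; only the `AIS_ADO_TOKEN` key name comes from this notebook.

```python
import json


def build_request_body(query: str, config: dict, secrets: dict) -> str:
    """Serialize one query into an MLflow `dataframe_records` request body."""
    # The model's `payload` column is itself a JSON string, so it is encoded
    # once here and then again as part of the outer request body.
    payload = json.dumps({"config": config, "secrets": secrets})
    record = {"query": query, "payload": payload}
    return json.dumps({"dataframe_records": [record]})


# Placeholder values: replace with the contents of your own config and secrets.
body = build_request_body(
    query="How do I create a new project in AI Studio?",
    config={"placeholder_key": "placeholder_value"},
    secrets={"AIS_ADO_TOKEN": "<your-personal-access-token>"},
)
print(body)
```

With MLflow 2.x model serving, a body like this would typically be sent as a POST to the `/invocations` route with `Content-Type: application/json`.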

## Step 0: Imports and Environment Setup

In [1]:
import time
import os 
from pathlib import Path
import sys
import logging

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

# Create logger
logger = logging.getLogger("multimodal_rag_logger")
logger.setLevel(logging.INFO)
if not logger.handlers:
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)
logger.propagate = False

In [2]:
start_time = time.time()  

logger.info("Notebook execution started.")

2025-08-01 15:38:54 - INFO - Notebook execution started.


In [3]:
%pip install -r ../requirements.txt --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
galileo-protect 0.15.1 requires galileo-core<3.0.0,>=2.17.0, but you have galileo-core 3.60.0 which is incompatible.
galileo-observe 1.13.2 requires galileo-core<3.0.0,>=2.20.0, but you have galileo-core 3.60.0 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [4]:
# === Standard Library Imports ===
import base64
import gc
import hashlib
import json
import math
import os
import shutil
import tempfile
import warnings
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Optional, TypedDict

# === Third-Party Library Imports ===
import mlflow
import numpy as np
import pandas as pd
import torch
from IPython.display import display, Image, Markdown
from langchain_core.embeddings import Embeddings
from langchain.schema.document import Document
from langchain.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from mlflow.models.signature import ModelSignature
from mlflow.tracking import MlflowClient
from mlflow.types import ColSpec, DataType, Schema, TensorSpec
from PIL import Image as PILImage
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer
from transformers import pipeline, AutoImageProcessor, AutoModel, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, SiglipModel, SiglipProcessor

# === Project-Specific Imports ===
# Add the project root to the system path to allow importing from 'src'
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from src.components import SemanticCache, SiglipEmbeddings
from src.wiki_pages_clone import orchestrate_wiki_clone
from src.local_genai_judge import LocalGenAIJudge
from src.utils import (
    configure_hf_cache,
    multimodal_rag_asset_status,
    load_config,
    load_secrets,
    load_mm_docs_clean,
)

2025-08-01 15:40:38.424514: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-08-01 15:40:38.436008: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754062838.446286     307 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754062838.449207     307 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1754062838.457081     307 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [5]:
warnings.filterwarnings("ignore")

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

Using device: cuda


## Step 1: Configurations

### Verify Assets

In [7]:
CONFIG_PATH = "../configs/config.yaml"
SECRETS_PATH = "../configs/secrets.yaml"

LOCAL_MODEL = "/home/jovyan/datafabric/InternVL3-8B-Instruct"
CONTEXT_DIR: Path = Path("../data/context")
CHROMA_DIR: Path = Path("../data/chroma_store")     
CACHE_DIR: Path = CHROMA_DIR / "semantic_cache"
MANIFEST_PATH: Path = CHROMA_DIR / "manifest.json"

IMAGE_DIR = CONTEXT_DIR / "images"
WIKI_METADATA_DIR = CONTEXT_DIR / "wiki_flat_structure.json"

CHROMA_DIR.mkdir(parents=True, exist_ok=True)
CACHE_DIR.mkdir(parents=True, exist_ok=True)

multimodal_rag_asset_status(
    local_model_path=LOCAL_MODEL,
    config_path=CONFIG_PATH,
    secrets_path=SECRETS_PATH,
    wiki_metadata_dir=WIKI_METADATA_DIR,
    context_dir=CONTEXT_DIR,
    chroma_dir=CHROMA_DIR,
    cache_dir=CACHE_DIR,
    manifest_path=MANIFEST_PATH
)

2025-08-01 15:40:40 - INFO - Local Model is properly configured. 
2025-08-01 15:40:40 - INFO - Config is properly configured. 
2025-08-01 15:40:40 - INFO - Secrets is properly configured. 
2025-08-01 15:40:40 - INFO - wiki_flat_structure.json is properly configured. 
2025-08-01 15:40:40 - INFO - CONTEXT is properly configured. 
2025-08-01 15:40:40 - INFO - CHROMA is properly configured. 
2025-08-01 15:40:40 - INFO - CACHE is properly configured. 
2025-08-01 15:40:40 - INFO - MANIFEST is properly configured. 


In [8]:
config = load_config(CONFIG_PATH)

### Configure HuggingFace Caches

In the next cell, we configure the HuggingFace cache so that downloaded models persist locally, even after the workspace is closed. This is a desired future feature for AI Studio and the GenAI addon.

In [9]:
configure_hf_cache()
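
The implementation of `configure_hf_cache` lives in `src/utils` and is not shown in this notebook. As a hedged sketch of the general idea, the Hugging Face libraries read the `HF_HOME` environment variable to decide where downloads are stored, so pointing it at a directory that survives workspace restarts avoids downloading models again. The helper name and the directory below are illustrative assumptions, not the project's actual code.

```python
import os
import tempfile
from pathlib import Path


def point_hf_cache_at(cache_dir: Path) -> Path:
    """Create cache_dir and make Hugging Face libraries use it (hypothetical helper)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # HF_HOME should be set before transformers / huggingface_hub are imported,
    # because those libraries read it when they are first loaded.
    os.environ["HF_HOME"] = str(cache_dir)
    return cache_dir


# A temp directory keeps this example free of side effects; a real setup
# would use a path on persistent storage instead.
demo_dir = point_hf_cache_at(Path(tempfile.mkdtemp()) / "hf_cache")
print(os.environ["HF_HOME"])
```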

In [10]:
# Initialize HuggingFace Embeddings
txt_embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    cache_folder="/tmp/hf_cache"
)

### MLflow Configuration

In [11]:
MODEL_NAME = "AIStudio-Multimodal-Chatbot-Model"
RUN_NAME = f"Register_{MODEL_NAME}"
EXPERIMENT_NAME = "AIStudio-Multimodal-Chatbot-Experiment"

# Set MLflow tracking URI and experiment
# This should be configured for your environment, e.g., a remote server or local file path
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "/phoenix/mlflow"))
mlflow.set_experiment(experiment_name=EXPERIMENT_NAME)

logger.info(f"Using MLflow tracking URI: {mlflow.get_tracking_uri()}")
logger.info(f"Using MLflow experiment: '{EXPERIMENT_NAME}'")

2025/08/01 15:41:04 INFO mlflow.tracking.fluent: Experiment with name 'AIStudio-Multimodal-Chatbot-Experiment' does not exist. Creating a new experiment.


2025-08-01 15:41:04 - INFO - Using MLflow tracking URI: /phoenix/mlflow
2025-08-01 15:41:04 - INFO - Using MLflow experiment: 'AIStudio-Multimodal-Chatbot-Experiment'


## Step 2: MLflow Model Setup

In [12]:
class MultimodalRagModel(mlflow.pyfunc.PythonModel):
    """
    An MLflow PythonModel for a stateless, in-memory Multimodal RAG pipeline.

    This model acts as a service that performs the entire RAG pipeline for
    each 'query' request. It does not store any knowledge base data persistently.

    - 'query': Fetches data from a source (e.g., ADO Wiki) using credentials
      provided in the request, builds an in-memory vector store, answers the
      user's question, and then discards the data.
    """

    # ==========================================================================
    # 1. Inner Class for the RAG Generation Pipeline
    # ==========================================================================
    class InternVLMM:
        """Minimal, self-contained multimodal QA wrapper."""
        def __init__(self, model: AutoModel, tok: AutoTokenizer, image_processor: AutoImageProcessor, device: str, text_db: Chroma, image_db: Chroma, bm25_index: Optional[BM25Okapi], doc_map: dict):
            self.model = model
            self.tok = tok
            self.image_processor = image_processor
            self.device = device
            self.text_db = text_db
            self.image_db = image_db
            self.bm25_index = bm25_index
            self.doc_map = doc_map
        
        @staticmethod
        def _reciprocal_rank_fusion(results: list[list[Document]], k: int = 60) -> list[tuple[Document, float]]:
            """Performs Reciprocal Rank Fusion on multiple ranked lists of documents."""
            ranked_lists = [ {doc.page_content: (doc, i + 1) for i, doc in enumerate(res)} for res in results]
            rrf_scores = defaultdict(float)
            all_docs = {}
            for ranked_list in ranked_lists:
                for content, (doc, rank) in ranked_list.items():
                    rrf_scores[content] += 1 / (k + rank)
                    if content not in all_docs: all_docs[content] = doc
            fused_results = [(all_docs[content], rrf_scores[content]) for content in sorted(rrf_scores, key=rrf_scores.get, reverse=True)]
            return fused_results

        def _retrieve_mm(self, query: str, k_text: int = 3, k_img: int = 4, recall_k: int = 20) -> Dict[str, Any]:
            """Retrieves relevant documents and images based on the query using both dense and sparse retrieval methods."""
            dense_hits = self.text_db.similarity_search(query, k=recall_k)
            
            # Sparse retrieval with BM25 always runs alongside dense retrieval when an index is available
            sparse_hits = []
            if self.bm25_index and list(self.doc_map.keys()):
                tokenized_query = query.lower().split(" ")
                sparse_texts = self.bm25_index.get_top_n(tokenized_query, list(self.doc_map.keys()), n=recall_k)
                sparse_hits = [self.doc_map[text] for text in sparse_texts]

            if not dense_hits and not sparse_hits:
                return {"docs": [], "scores": [], "images": []}

            # Perform Reciprocal Rank Fusion on the hits
            fused_results = self._reciprocal_rank_fusion([dense_hits, sparse_hits])
            final_docs = [doc for doc, score in fused_results[:k_text]]
            final_scores = [score for doc, score in fused_results[:k_text]]

            # Retrieve images based on the sources of the final documents
            retrieved_images = []
            if final_docs and self.image_db:
                final_sources = list(set(d.metadata["source"] for d in final_docs))
                image_hits = self.image_db.similarity_search(query, k=k_img, filter={"source": {"$in": final_sources}})
                retrieved_images = [img.page_content for img in image_hits]

            return {"docs": final_docs, "scores": final_scores, "images": retrieved_images}

        def generate(self, query: str, **retrieval_kwargs) -> Dict[str, Any]:
            """Generates a response to the user's query using the multimodal RAG pipeline."""
            start_gen_time = time.time()
            
            # Retrieve relevant documents and images based on the query
            hits = self._retrieve_mm(query, **retrieval_kwargs)
            docs, images = hits["docs"], hits["images"]
            if not docs and not images:
                return {"reply": "Based on the provided context, I cannot answer this question.", "used_images": [], "generation_time_seconds": 0.0}

            # Prepare the context string for the system prompt
            context_str = "\n\n".join(f"<source_document name=\"{d.metadata.get('source', 'unknown')}\">\n{d.page_content}\n</source_document>" for d in docs)
            
            system_prompt = """You are an AI Studio Expert Assistant. Your task is to answer the user's query based ONLY on the context provided. 
            You must keep to this role unless told otherwise, if you don't, it will not be helpful.
            
                **Instructions:**
                1.  **Analyze Context:** First, analyze the user's images (if any) and the text in the `<context>` block.
                2.  **Synthesize Answer:** Answer the user's query directly, synthesizing information from the context.
                3.  **Cite Sources:** List all source documents you used in a `Source Documents` section.
                4.  **Handle Missing Information:** If the answer is not in the context, respond with this exact phrase: "Based on the provided context, I cannot answer this question."
                5.  **Do not Hallucinate:** Do not hallucinate or make up factual information.
                
                **Output Format:**
                Your response must follow this exact markdown structure and nothing else. Do not add any other commentary.
                
                ### Visual Analysis
                (Analyze the user's images here.)
                
                ### Synthesized Answer
                (Your answer to the user's query goes here.)
                
                ### Source Documents
                (List the sources here, like [`source-file-name.md`].)
            """
            user_content = f"<context>\n{context_str}\n</context>\n\n<user_query>\n{query}\n</user_query>"
            conversation = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_content}]
            prompt = self.tok.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

            try:
                self._clear_cuda()
                # Process images if provided
                pixel_values = self._process_images(images) if images else None
                # Generate the response using the model
                reply = self.model.chat(
                    self.tok, pixel_values, prompt,
                    generation_config=dict(
                        do_sample=False,
                        max_new_tokens=1024,
                        repetition_penalty=1.1,
                        pad_token_id=self.tok.pad_token_id,
                        eos_token_id=self.tok.eos_token_id
                    )
                )
                self._clear_cuda()
                end_gen_time = time.time()
                return {"reply": reply, "used_images": images, "generation_time_seconds": end_gen_time - start_gen_time}
            except RuntimeError as e:
                logger.error("InternVL generation failed: %s", e)
                return {"reply": f"Error during generation: {e}", "used_images": images, "generation_time_seconds": 0.0}

        def _process_images(self, image_paths: List[str]):
            """Processes a list of image paths into pixel values for the model."""
            if not image_paths: return None
            try:
                pil_images = [PILImage.open(p).convert("RGB") for p in image_paths]
                processed_data = self.image_processor(images=pil_images, return_tensors="pt")
                return processed_data['pixel_values'].to(device=self.device, dtype=next(self.model.parameters()).dtype)
            except Exception as e:
                logger.error("Image processing failed: %s", e)
                return None

        def _clear_cuda(self):
            """Clears CUDA memory to prevent OOM errors during generation."""
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()

    # ==========================================================================
    # 2. MLflow `pyfunc` Life-cycle and Service Methods
    # ==========================================================================
    def load_context(self, context: mlflow.pyfunc.PythonModelContext) -> None:
        """Initializes the model and loads all necessary components into memory."""
        logger.info("--- Initializing Stateless MultimodalRAG Service ---")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        
        # Load all models ONCE and store them in memory.
        model_path = Path(context.artifacts["local_model_dir"]).resolve()
        e5_model_path = context.artifacts["e5_model_dir"]
        siglip_model_path = context.artifacts["siglip_model_dir"]
        
        logger.info("Loading text embedding model (E5)...")
        self.text_embed_model = HuggingFaceEmbeddings(model_name=e5_model_path, model_kwargs={"device": self.device})
        
        logger.info("Loading image embedding model (SigLIP)...")
        self.siglip_embed_model = SiglipEmbeddings(model_id=siglip_model_path, device=self.device)

        logger.info("Loading main LLM (InternVL)...")
        self.tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        if self.tok.pad_token is None: self.tok.pad_token = self.tok.eos_token
        self.image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
        q_cfg = BitsAndBytesConfig(
            load_in_4bit=True, 
            bnb_4bit_quant_type="nf4", 
            bnb_4bit_compute_dtype=torch.bfloat16, 
            bnb_4bit_use_double_quant=True) if self.device == "cuda" else None
        self.model = AutoModel.from_pretrained(model_path, 
           quantization_config=q_cfg, 
           torch_dtype=(torch.bfloat16 if self.device == "cuda" else torch.float32),
           low_cpu_mem_usage=True,
           use_flash_attn=False,
           trust_remote_code=True,
           device_map="auto" if self.device == "cuda" else None).eval()
        
        logger.info("Loading evaluation judge...")
        self.judge = LocalGenAIJudge(model=self.model, tokenizer=self.tok)
        
        logger.info("--- Service initialized with all models loaded. Ready for queries. ---")

    def _build_transient_kb(self, config: dict, secrets: dict, temp_path: Path) -> Dict[str, Any]:
        """Fetches, processes, and indexes data entirely in memory from a given temp path."""
        logger.info("Cloning wiki to temporary directory...")
        orchestrate_wiki_clone(pat=secrets['AIS_ADO_TOKEN'], config=config, output_dir=temp_path)
        
        image_dir = temp_path / "images"
        wiki_metadata_path = temp_path / "wiki_flat_structure.json"
        
        if not wiki_metadata_path.exists():
            raise FileNotFoundError("Cloning failed: 'wiki_flat_structure.json' not found.")
        
        all_raw_docs = load_mm_docs_clean(wiki_metadata_path, image_dir)
        all_chunks = self._chunk_docs(all_raw_docs)

        # 1. Build In-Memory Text Vector Store
        text_db = Chroma.from_documents(documents=all_chunks, embedding=self.text_embed_model)
        logger.info(f"Built in-memory text vector store with {len(all_chunks)} chunks.")

        # 2. Build In-Memory Image Vector Store
        img_paths, img_ids, img_meta = self._collect_image_vectors(all_raw_docs, image_dir)
        image_db = Chroma(collection_name="temp_mm_image", embedding_function=self.siglip_embed_model)
        if img_paths:
            image_db.add_texts(texts=img_paths, metadatas=img_meta, ids=img_ids)
            logger.info(f"Built in-memory image vector store with {len(img_paths)} images.")

        # 3. Build In-Memory BM25 Index
        unique_splits = list({doc.page_content: doc for doc in all_chunks}.values())
        corpus = [doc.page_content for doc in unique_splits]
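        # Lowercase the corpus tokens so they match the lowercased query tokens used in _retrieve_mm
        bm25_index = BM25Okapi([doc.lower().split(" ") for doc in corpus]) if corpus else None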
        doc_map = {doc.page_content: doc for doc in unique_splits}
    
        return {"text_db": text_db, "image_db": image_db, "bm25_index": bm25_index, "doc_map": doc_map}

    def predict(self, context: mlflow.pyfunc.PythonModelContext, model_input: pd.DataFrame) -> pd.DataFrame:
        """Processes a query using the stateless RAG pipeline.
        This method builds a transient knowledge base in a temporary directory,
        runs the RAG pipeline, and returns the response.
        """
        logger.info("Received new query. Starting stateless RAG process.")
        pipeline_start_time = time.time()
        
        query = model_input["query"].iloc[0]
        payload_str = model_input["payload"].iloc[0]
        payload = json.loads(payload_str)
        transient_kb = None
        rag_pipeline = None

        # Create a temporary directory that lasts for the ENTIRE prediction.
        with tempfile.TemporaryDirectory() as temp_dir:
            try:
                # Step 1: Build the entire knowledge base in the temp directory.
                temp_path = Path(temp_dir)
                transient_kb = self._build_transient_kb(
                    config=payload["config"], 
                    secrets=payload["secrets"], 
                    temp_path=temp_path
                )
                
                # Step 2: Instantiate the RAG pipeline with the transient KB.
                rag_pipeline = self.InternVLMM(
                    model=self.model, tok=self.tok, image_processor=self.image_processor,
                    device=self.device, **transient_kb
                )

                # Step 3: Generate the response.
                response_dict = rag_pipeline.generate(query)
                
                # Step 4: Perform self-evaluation.
                retrieved_info = rag_pipeline._retrieve_mm(query)
                context_str = "\n\n".join(d.page_content for d in retrieved_info["docs"])
                eval_df = pd.DataFrame([{"questions": query, "result": response_dict["reply"], "source_documents": context_str}])
                response_dict["faithfulness"] = self.judge.evaluate_faithfulness(eval_df).iloc[0]
                response_dict["relevance"] = self.judge.evaluate_relevance(eval_df).iloc[0]

                # Step 5: Encode images to Base64 for the response.
                image_paths = response_dict.get("used_images", [])
                base64_images = []
                for path in image_paths:
                    try:
                        # The path is now guaranteed to exist within this 'with' block
                        with open(path, "rb") as img_file:
                            base64_images.append(base64.b64encode(img_file.read()).decode('utf-8'))
                    except FileNotFoundError:
                        logger.warning(f"Image file not found at temp path during encoding: {path}")
                response_dict["used_images"] = json.dumps(base64_images)
                pipeline_end_time = time.time()
                response_dict["total_pipeline_time_seconds"] = pipeline_end_time - pipeline_start_time

                return pd.DataFrame([response_dict])

            except Exception as e:
                logger.error(f"Stateless RAG pipeline failed: {e}", exc_info=True)
                return pd.DataFrame([{"status": "error", "message": str(e)}])
            
            finally:
                # Step 6: CRITICAL - Cleanup all transient objects and clear memory.
                logger.info("Cleaning up transient KB objects and VRAM...")
                del transient_kb
                del rag_pipeline
                gc.collect()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                    torch.cuda.synchronize()
                logger.info("Cleanup complete.")
        # The temporary directory is automatically deleted HERE, after everything is done.
    
    # ==========================================================================
    # 3. Helper and Class Methods
    # ==========================================================================
    def _chunk_docs(self, docs: List[Document]) -> List[Document]:
        """Takes a list of raw docs and performs chunking with unique IDs per doc."""
        header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "title"), ("##", "section")])
        recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
        all_chunks: list[Document] = []
        
        for doc in docs:
            page_title = Path(doc.metadata["source"]).stem.replace("-", " ")
            section_docs = header_splitter.split_text(doc.page_content)
            doc_chunk_counter = 0
            for section in section_docs:
                tiny_texts = recursive_splitter.split_text(section.page_content)
                for tiny in tiny_texts:
                    chunk_metadata = {"title": page_title, "source": doc.metadata["source"], "section_header": section.metadata.get("header", ""),"chunk_id": doc_chunk_counter}
                    all_chunks.append(Document(page_content=f"{page_title}\n\n{tiny.strip()}", metadata=chunk_metadata))
                    doc_chunk_counter += 1
        return all_chunks
        
    def _collect_image_vectors(self, mm_raw_docs: List[Document], image_dir: Path):
        """Scans raw docs and returns paths, IDs, and metadata for unique images."""
        img_paths, img_ids, img_meta = [], [], []
        seen = set()
        for doc in mm_raw_docs:
            src = doc.metadata["source"]
            for name in doc.metadata.get("images", []):
                img_id = f"{src}::{name}"
                if img_id in seen: continue
                seen.add(img_id)
                img_path = image_dir / name
                # Add only if the image file actually exists to prevent errors later
                if img_path.is_file():
                    img_paths.append(str(img_path))
                    img_ids.append(img_id)
                    img_meta.append({"source": src, "image": name})
        return img_paths, img_ids, img_meta

    @classmethod
    def log_model(cls, model_name: str, local_model: str) -> None:
        """Logs the Multimodal RAG model to MLflow with all necessary artifacts."""
        logger.info(f"--- Logging '{model_name}' Service to MLflow ---")
        with tempfile.TemporaryDirectory() as temp_dir:
            temp_path = Path(temp_dir)
            e5_path = temp_path / "e5-large-v2"
            SentenceTransformer("intfloat/e5-large-v2").save(str(e5_path))
            
            siglip_path = temp_path / "siglip2-base-patch16-224"
            SiglipModel.from_pretrained("google/siglip2-base-patch16-224").save_pretrained(siglip_path)
            SiglipProcessor.from_pretrained("google/siglip2-base-patch16-224").save_pretrained(siglip_path)
            
            artifacts = {"local_model_dir": local_model, "e5_model_dir": str(e5_path), "siglip_model_dir": str(siglip_path)}
            
            input_schema = Schema([
                ColSpec(DataType.string, "query"),
                ColSpec(DataType.string, "payload"), # JSON string with config and secrets
                ColSpec(DataType.boolean, "force_regenerate", required=False)
            ])
            output_schema = Schema([
                ColSpec(DataType.string, "reply", required=False),
                ColSpec(DataType.string, "used_images", required=False),
                ColSpec(DataType.double, "total_pipeline_time_seconds", required=False),
                ColSpec(DataType.double, "generation_time_seconds", required=False),
                ColSpec(DataType.double, "faithfulness", required=False),
                ColSpec(DataType.double, "relevance", required=False),
                ColSpec(DataType.string, "status", required=False),
                ColSpec(DataType.string, "message", required=False),
            ])
            signature = ModelSignature(inputs=input_schema, outputs=output_schema)

            mlflow.pyfunc.log_model(
                artifact_path=model_name, python_model=cls(), artifacts=artifacts,
                pip_requirements="../requirements.txt", signature=signature, code_paths=["../src"]
            )
        logger.info(f"✅ Successfully logged '{model_name}' service and cleaned up.")
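
To make the fusion step in `_reciprocal_rank_fusion` concrete: each document receives a score of `1 / (k + rank)` from every ranked list it appears in, and the scores are summed. A document ranked reasonably well by both retrievers can therefore outrank one that only a single retriever found. The sketch below reproduces that scoring with plain strings standing in for `Document` objects; the file names are made up for illustration.

```python
from collections import defaultdict


def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists with Reciprocal Rank Fusion; higher score is better."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        # Ranks are 1-based, matching the `i + 1` used in the class method.
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


dense_hits = ["setup.md", "gpu.md", "faq.md"]       # hypothetical dense results
sparse_hits = ["faq.md", "setup.md", "billing.md"]  # hypothetical BM25 results

fused = rrf([dense_hits, sparse_hits])
for name, score in fused:
    print(f"{name}: {score:.5f}")
```

Here `setup.md` (ranks 1 and 2) and `faq.md` (ranks 3 and 1) end up ahead of `gpu.md` and `billing.md`, which each appear in only one list.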

## Step 3: Start Run, Log & Register Model

In [13]:
%%time

# --- Start MLflow Run and Log the Model ---
try:
    with mlflow.start_run(run_name=RUN_NAME) as run:
        run_id = run.info.run_id
        logger.info(f"Started MLflow run: {run_id}")

        # Use the class method to log the model and its artifacts
        MultimodalRagModel.log_model(model_name=MODEL_NAME, local_model=LOCAL_MODEL)

        model_uri = f"runs:/{run_id}/{MODEL_NAME}"
        logger.info(f"Registering model from URI: {model_uri}")
        
        # Register the model in the MLflow Model Registry
        mlflow.register_model(model_uri=model_uri, name=MODEL_NAME)
        logger.info(f"✅ Successfully registered model '{MODEL_NAME}'")

except FileNotFoundError as e:
    logger.error("Error: A required file or directory was not found. Please ensure the project structure is correct.")
    logger.error(f"Details: {e}")
except Exception as e:
    logger.error(f"An unexpected error occurred during the MLflow run: {e}", exc_info=True)

2025-08-01 15:41:04 - INFO - Started MLflow run: 1838f240c3dc43c5a436199007f4e1f1
2025-08-01 15:41:04 - INFO - --- Logging 'AIStudio-Multimodal-Chatbot-Model' Service to MLflow ---


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


2025-08-01 15:45:56 - INFO - ✅ Successfully logged 'AIStudio-Multimodal-Chatbot-Model' service and cleaned up.
2025-08-01 15:45:56 - INFO - Registering model from URI: runs:/1838f240c3dc43c5a436199007f4e1f1/AIStudio-Multimodal-Chatbot-Model
2025-08-01 15:45:56 - INFO - ✅ Successfully registered model 'AIStudio-Multimodal-Chatbot-Model'
CPU times: user 6.15 s, sys: 48.6 s, total: 54.7 s
Wall time: 4min 52s


Successfully registered model 'AIStudio-Multimodal-Chatbot-Model'.
Created version '1' of model 'AIStudio-Multimodal-Chatbot-Model'.


In [14]:
# --- Retrieve the latest version from the Model Registry ---
try:
    client = MlflowClient()
    versions = client.get_latest_versions(MODEL_NAME, stages=["None"])
    if not versions:
        raise RuntimeError(f"No registered versions found for model '{MODEL_NAME}'.")
    
    latest_version = versions[0]
    logger.info(f"Found latest version '{latest_version.version}' for model '{MODEL_NAME}' in stage '{latest_version.current_stage}'.")
    model_uri_registry = latest_version.source

except Exception as e:
    logger.error(f"Failed to retrieve model from registry: {e}", exc_info=True)
    model_uri_registry = None # Ensure variable exists


2025-08-01 15:45:56 - INFO - Found latest version '1' for model 'AIStudio-Multimodal-Chatbot-Model' in stage 'None'.
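
The cell above resolves the model through the version's `source` path, which points directly into the artifact store. As a hedged alternative, MLflow also accepts registry URIs of the form `models:/<name>/<version>`, which do not depend on the store's filesystem layout. The sketch below only builds such a URI; the helper name `registry_model_uri` is hypothetical and not part of this notebook.

```python
def registry_model_uri(model_name: str, version) -> str:
    """Build a `models:/<name>/<version>` URI (hypothetical helper)."""
    if not model_name:
        raise ValueError("model_name must be a non-empty string")
    return f"models:/{model_name}/{version}"


# Example using the model name and version reported in the log output above.
example_uri = registry_model_uri("AIStudio-Multimodal-Chatbot-Model", 1)
print(example_uri)
```

The resulting string could be passed as `model_uri` to `mlflow.pyfunc.load_model` in place of `latest_version.source`.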


In [None]:
if model_uri_registry:
    try:
        logger.info(f"Loading model from: {model_uri_registry}")
        loaded_model = mlflow.pyfunc.load_model(model_uri=model_uri_registry)
        logger.info("✅ Successfully loaded model from registry.")
    except Exception as e:
        logger.error(f"Failed to load model from registry URI: {e}", exc_info=True)
        loaded_model = None
else:
    logger.warning("Skipping model loading due to previous errors.")
    loaded_model = None

2025-08-01 15:45:56 - INFO - Loading model from: /phoenix/mlflow/224888605984399430/1838f240c3dc43c5a436199007f4e1f1/artifacts/AIStudio-Multimodal-Chatbot-Model
2025-08-01 15:45:56 - INFO - --- Initializing Stateless MultimodalRAG Service ---
2025-08-01 15:45:56 - INFO - Loading text embedding model (E5)...
2025-08-01 15:46:03 - INFO - Loading image embedding model (SigLIP)...
2025-08-01 15:46:14 - INFO - Loading main LLM (InternVL)...
FlashAttention2 is not installed.



## Step 4: Display Results

In [None]:
def display_results(query: str, result_df: pd.DataFrame):
    """Helper to neatly print the query, reply, and display Base64 images."""
    if result_df.empty:
        print("Received an empty result.")
        return

    # Extract results from the DataFrame
    row = result_df.iloc[0]
    reply = row.get("reply", "No reply generated.")
    # This field now contains a JSON string of a list of Base64 strings
    used_images_json = row.get("used_images", "[]")
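    # Cast to float so the format specifiers below also work for string-typed scores
    gen_time = float(row.get("generation_time_seconds", 0))
    total_pipeline_time_seconds = float(row.get("total_pipeline_time_seconds", 0))
    faithfulness = float(row.get("faithfulness", 0))
    relevance = float(row.get("relevance", 0))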

    # Safely parse the JSON string of Base64 images
    base64_images = []
    try:
        # Use json.loads to parse the string into a list
        base64_images = json.loads(used_images_json)
    except (json.JSONDecodeError, TypeError):
        print("Warning: Could not parse image data from the API response.")

    # Display the output
    print("---" * 20)
    print(f"❓ Query:\n{query}\n")
    print("🤖 Reply:")
    display(Markdown(reply))  # Render markdown for better formatting

    print(f"\n📊 Faithfulness: {faithfulness:.4f} | Relevance: {relevance:.4f}")
    print(f"⏱️ Generation Time: {gen_time:.2f}s")
    print(f"⏱️ Total Pipeline Time: {total_pipeline_time_seconds:.2f}s\n")

    if base64_images and isinstance(base64_images, list):
        print(f"🖼️ Displaying {len(base64_images)} retrieved image(s):")
        for b64_string in base64_images:
            try:
                # Decode the Base64 string into bytes
                image_bytes = base64.b64decode(b64_string)
                # Display the image directly from the bytes data
                display(Image(data=image_bytes, width=400))
            except Exception as e:
                print(f"  - Could not decode or display an image: {e}")
    else:
        print("▶ No images were retrieved for this query.")
    print("---" * 20 + "\n")
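
The `used_images` field is handled above as a JSON string holding a list of Base64 strings. The following is a minimal sketch of that decoding step in isolation, using only the standard library. The helper name `decode_used_images` and the sample bytes are hypothetical, and strict validation (`validate=True`) is an assumption rather than what `display_results` does.

```python
import base64
import json


def decode_used_images(used_images_json):
    """Return a list of raw image bytes; skip entries that fail to decode."""
    try:
        encoded = json.loads(used_images_json)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(encoded, list):
        return []
    images = []
    for item in encoded:
        try:
            # validate=True rejects characters outside the Base64 alphabet
            images.append(base64.b64decode(item, validate=True))
        except (ValueError, TypeError):
            continue
    return images


# Illustrative input: one valid entry and one malformed entry.
sample = json.dumps([
    base64.b64encode(b"fake-image-bytes").decode("ascii"),
    "not base64!",
])
decoded = decode_used_images(sample)
print(len(decoded))
```

Separating the parsing from the display logic makes it possible to check the decoding without a notebook front end.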

In [None]:
all_results = []

if loaded_model:
    logger.info("--- Running sample inference with the loaded model ---")
    
    try:
        # 1. Load config and secrets ONCE before making queries.
        config = load_config(CONFIG_PATH)
        ADO_PAT = os.getenv("AIS_ADO_TOKEN")
        if not ADO_PAT:
            logger.info("Environment variable not found. Falling back to secrets.yaml.")
            secrets = load_secrets(SECRETS_PATH)
            ADO_PAT = secrets.get('AIS_ADO_TOKEN')

        # 2. Construct the payload dictionary that will be sent with each request.
        request_payload_dict = {
            "config": config,
            "secrets": {"AIS_ADO_TOKEN": ADO_PAT}
        }
        
        sample_queries = [
            "What are the AI Blueprints Repository best practices?",
            "What are some feature flags that I can enable in AIStudio?",
            "How do I manually clean my environment without hooh?",
        ]

        # 3. For each query, send BOTH the query and the payload.
        for query in sample_queries:
            logger.info(f"Processing query: '{query}'...")
            
            # Create the DataFrame that matches the model's signature
            prediction_df = pd.DataFrame([{
                "query": query,
                "payload": json.dumps(request_payload_dict),
                "force_regenerate": False 
            }])
            
            result_df = loaded_model.predict(prediction_df)
            result_df['query'] = query
            display_results(query, result_df)

            all_results.append(result_df)

    except Exception as e:
        logger.error(f"Prediction failed: {e}", exc_info=True)

    # Combine all individual results into one DataFrame
    if all_results:
        final_results_df = pd.concat(all_results, ignore_index=True)
    else:
        final_results_df = pd.DataFrame()

else:
    logger.warning("Skipping sample inference because the model was not loaded.")
    # Ensure the variable exists for the evaluation step below
    final_results_df = pd.DataFrame()
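
Each request row in the cell above carries the configuration and secrets as a JSON string in the `payload` column, with the token taken from the environment first and from the secrets file second. A minimal standard-library sketch of that pattern follows. The helpers `resolve_token` and `build_request_row` are hypothetical, the fallback dictionary stands in for whatever `load_secrets` returns, and all values are placeholders.

```python
import json
import os


def resolve_token(fallback_secrets, env_var="AIS_ADO_TOKEN"):
    """Prefer the environment variable; fall back to a secrets mapping."""
    token = os.getenv(env_var)
    if token:
        return token
    return fallback_secrets.get(env_var)


def build_request_row(query, config, token, force_regenerate=False):
    """Build one input row: query, JSON-encoded payload, regenerate flag."""
    payload = {"config": config, "secrets": {"AIS_ADO_TOKEN": token}}
    return {
        "query": query,
        "payload": json.dumps(payload),
        "force_regenerate": force_regenerate,
    }


# Illustrative values only; real config and secrets come from the project files.
row = build_request_row(
    "What are the AI Blueprints Repository best practices?",
    config={"example_key": "example_value"},
    token=resolve_token({"AIS_ADO_TOKEN": "placeholder-token"}),
)
print(sorted(row))
```

A list of such rows could be passed to `pd.DataFrame` to produce the input shape used in the cell above.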

## Step 5: Log Hallucinations & Relevance Evaluations to MLflow

In [None]:
if loaded_model and 'run_id' in locals() and not final_results_df.empty:
    logger.info(f"--- Reopening original run ({run_id}) to log pre-computed evaluations ---")

    # The results_df already contains the scores from the `predict` calls
    results_df = final_results_df
    
    # Reopen the existing run using its ID
    with mlflow.start_run(run_id=run_id) as run:
        logger.info("Successfully reopened existing run. Logging metrics and artifacts...")

        # Calculate average scores from the DataFrame
        avg_faithfulness = results_df["faithfulness"].astype(float).mean()
        avg_relevance = results_df["relevance"].astype(float).mean()
        avg_pipeline_time = results_df["total_pipeline_time_seconds"].astype(float).mean()

        # Log the average scores as metrics
        mlflow.log_metrics({
            "avg_faithfulness": avg_faithfulness,
            "avg_relevance": avg_relevance,
            "avg_pipeline_time_seconds": avg_pipeline_time,
        })

        # Log the detailed results as a table artifact
        mlflow.log_table(
            data=results_df[['query', 'reply', 'faithfulness', 'relevance', 'total_pipeline_time_seconds']], 
            artifact_file="stateless_evaluation_results.json"
        )
        
        logger.info("✅ Successfully logged metrics and artifacts to the original model run.")

else:
    logger.warning("Skipping evaluation logging because the model was not loaded, run_id was not found, or no results were generated.")
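
The averages above are computed with `.astype(float).mean()`, which raises if a score cannot be converted. As a hedged sketch of a more tolerant alternative, the helper below averages only the values that convert cleanly. The name `mean_score` and the sample rows are hypothetical, and skipping bad values (rather than failing) is an assumption about the desired behavior.

```python
from statistics import fmean


def mean_score(rows, key):
    """Average the numeric values under `key`, skipping missing or invalid ones."""
    values = []
    for row in rows:
        try:
            values.append(float(row[key]))
        except (KeyError, TypeError, ValueError):
            continue
    return fmean(values) if values else None


# Made-up rows: a string score, a float score, a missing value, and an empty row.
sample_rows = [
    {"faithfulness": "0.5"},
    {"faithfulness": 1.0},
    {"faithfulness": None},
    {},
]
avg = mean_score(sample_rows, "faithfulness")
print(avg)
```

Returning `None` when no value is usable lets the caller decide whether to skip logging that metric.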

In [None]:
end_time: float = time.time()
elapsed_time: float = end_time - start_time
elapsed_minutes: int = int(elapsed_time // 60)
elapsed_seconds: float = elapsed_time % 60

logger.info(f"⏱️ Total execution time: {elapsed_minutes}m {elapsed_seconds:.2f}s")
logger.info("✅ Notebook execution completed.")

Built with ❤️ using Z by HP AI Studio.