<h1 style="text-align: center; font-size: 50px;">Multimodal RAG Chatbot with LangChain and MLflow Evaluation</h1>

Retrieval-Augmented Generation (RAG) is an architectural approach that improves large language model (LLM) applications by grounding their answers in customized data. In this example, we use LangChain, an orchestrator for language pipelines, to build an assistant that retrieves text and images from a set of wiki pages and uses them to answer user questions. We also use MLflow to evaluate the LLM responses.

# Notebook Overview
- Imports
- Configurations
- Verify Assets
- Memory Store
- Data Loading & Cleaning
- Creation of Chunks
- Setup Embeddings & Vector Store
- Retrieval
- Model Setup
- Chain Creation
- Model Service

# Imports

Our Local GenAI workspace image already ships with most of the libraries needed to work with RAG. In our case, we only need to add the connector for PDF documents.

In [1]:
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# === Standard Library Imports ===
import base64
import io
import json
import logging
import mimetypes
import os
import sys
import time
import warnings
from collections import defaultdict
from copy import deepcopy
from datetime import datetime
from pathlib import Path
from statistics import mean
from typing import Any, Dict, List, Optional, TypedDict

# === Third-Party Library Imports ===
from IPython.display import Image, display
from PIL import Image as PILImage
import mlflow
import pandas as pd
import torch
from chromadb.config import Settings
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import JSONLoader, WebBaseLoader
from langchain.memory import ConversationBufferMemory
from langchain.schema import StrOutputParser
from langchain.schema.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain.vectorstores import Chroma
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document as CoreDocument
from langchain_core.embeddings import Embeddings
from langchain_core.output_parsers import StrOutputParser as CoreStrOutputParser
from langchain_core.runnables import Runnable, RunnableLambda, RunnablePassthrough
from tqdm import tqdm
from transformers import SiglipModel, SiglipProcessor
from langchain.chains.question_answering import load_qa_chain
from sentence_transformers import CrossEncoder
from langchain.llms import LlamaCpp

# === ML Inference Backends ===
from llama_cpp import Llama

# Add the project root (one level up from the current working directory) to sys.path so that `src` is importable
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

# === Project-Specific Imports ===
from src.local_genai_judge import LocalGenAIJudge
from src.utils import (
    configure_hf_cache,
    load_config,
    mlflow_evaluate_setup,
)


# Configurations

In [3]:
warnings.filterwarnings("ignore")

In [4]:
# Create logger
logger = logging.getLogger("multimodal_rag_logger")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S") 
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
logger.propagate = False

In [5]:
CONFIG_PATH = "../configs/config.yaml"

CONTEXT_DIR: Path = Path("../data/context")             
CHROMA_DIR: Path = Path("../data/chroma_store")     
MEMORY_PATH: Path = Path("../data/memory/memory.json")
MANIFEST_PATH: Path = CHROMA_DIR / "manifest.json"

IMAGE_DIR = CONTEXT_DIR / "images"
WIKI_METADATA_DIR = CONTEXT_DIR / "wiki_flat_structure.json"

MLFLOW_EXPERIMENT_NAME = "AIStudio-Multimodal-Chatbot-Experiment"
MLFLOW_RUN_NAME = "AIStudio-Multimodal-Chatbot-Run"
MLFLOW_MODEL_NAME = "AIStudio-Multimodal-Chatbot-Model"

LOCAL_MODEL_PATH = "/home/jovyan/datafabric/llama3.1-8b-instruct/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"
INTERNVL_MODEL_PATH = "/home/jovyan/datafabric/InternVL3-8B-Instruct-Q8_0-1/InternVL3-8B-Instruct-Q8_0.gguf"
MM_PROJ_PATH = "/home/jovyan/datafabric/mmproj-InternVL3-8B-Instruct-Q8_0-1/mmproj-InternVL3-8B-Instruct-Q8_0.gguf"

DEMO_FOLDER = "../demo"

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [7]:
start_time = time.time()  

logger.info('Notebook execution started.')

2025-07-16 23:21:36 - INFO - Notebook execution started.



# Verify Assets

In [8]:
# If folders do not exist, create them automatically
for _dir in (CHROMA_DIR, MEMORY_PATH.parent):
    _dir.mkdir(parents=True, exist_ok=True)

if not MEMORY_PATH.exists():
    MEMORY_PATH.write_text("{}", encoding="utf-8")

In [9]:
def log_asset_status(asset_path: str | Path, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str | Path): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        # A missing asset is a problem, so surface it as a warning rather than info
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")

log_asset_status(
    asset_path=CONFIG_PATH,
    asset_name="Config",
    success_message="",
    failure_message="Please check if config.yaml was properly configured in your project on AI Studio."
)

log_asset_status(
    asset_path=INTERNVL_MODEL_PATH,
    asset_name="Local InternVL-8B model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio if you want to use local model.")

log_asset_status(
    asset_path=MM_PROJ_PATH,
    asset_name="Vision projector (.gguf)",
    success_message="",
    failure_message="Download mmproj-InternVL3-8B-Instruct-Q8_0.gguf")

log_asset_status(
    asset_path=WIKI_METADATA_DIR,
    asset_name="wiki_flat_structure.json",
    success_message="",
    failure_message="Place the JSON wiki pages in data/context/")

log_asset_status(
    asset_path=CONTEXT_DIR,
    asset_name="CONTEXT",
    success_message="",
    failure_message="Please check if CONTEXT path was downloaded correctly in your project on AI Studio."
)

log_asset_status(
    asset_path=CHROMA_DIR,
    asset_name="CHROMA",
    success_message="",
    failure_message="Please check if CHROMA path was downloaded correctly in your project on AI Studio."
)

log_asset_status(
    asset_path=MEMORY_PATH,
    asset_name="MEMORY",
    success_message="",
    failure_message="Please check if the MEMORY path was properly configured in your project on AI Studio."
)


log_asset_status(
    asset_path=MANIFEST_PATH,
    asset_name="MANIFEST",
    success_message="",
    failure_message="Please check if the MANIFEST path was properly configured in your project on AI Studio."
)

2025-07-16 23:21:36 - INFO - Config is properly configured. 
2025-07-16 23:21:36 - INFO - Local InternVL-8B model is properly configured. 
2025-07-16 23:21:36 - INFO - Vision projector (.gguf) is properly configured. 
2025-07-16 23:21:36 - INFO - wiki_flat_structure.json is properly configured. 
2025-07-16 23:21:36 - INFO - CONTEXT is properly configured. 
2025-07-16 23:21:36 - INFO - CHROMA is properly configured. 
2025-07-16 23:21:36 - INFO - MEMORY is properly configured. 
2025-07-16 23:21:36 - INFO - MANIFEST is properly configured. 


# Memory Store

In [10]:
class SimpleKVMemory:
    """Very small persistent key‑value store (JSON on disk)."""
    def __init__(self, file_path: Path) -> None:
        self.file_path = file_path
        self._store    = self._load()

    # Public ------------
    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value
        self._dump()

    # Private -----------
    def _load(self) -> Dict[str, str]:
        if self.file_path.exists():
            try:
                with self.file_path.open("r", encoding="utf-8") as f:
                    return json.load(f)
            except Exception as exc:
                logger.warning("Memory load failed (%s). Starting fresh.", exc)
        return {}

    def _dump(self) -> None:
        self.file_path.parent.mkdir(parents=True, exist_ok=True)
        with self.file_path.open("w", encoding="utf-8") as f:
            json.dump(self._store, f, ensure_ascii=False, indent=2)

memory = SimpleKVMemory(MEMORY_PATH)


In [11]:
class RAGState(TypedDict, total=False):
    topic            : str
    query            : str
    answer           : Optional[str]
    retrieved_docs   : List[Document]
    images           : List[str]
    from_memory      : Optional[bool]
    messages         : List[Dict[str, Any]]   # full chat‑history if you need it

## Configuration of HuggingFace caches

In the next cell, we configure the HuggingFace cache so that all models downloaded from the Hub are persisted locally, even after the workspace is closed. Built-in support for this is a desired future feature for AI Studio and the GenAI add-on.
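
`configure_hf_cache` lives in `src/utils`, and its body is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it only needs to point the Hugging Face libraries at a persistent folder through the `HF_HOME` environment variable (the function name `set_hf_cache_dir` and the default path are hypothetical):

```python
import os
from pathlib import Path


def set_hf_cache_dir(cache_dir: str = "~/hf_cache") -> Path:
    """Hypothetical helper: create a cache folder and expose it via HF_HOME."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    # HF_HOME is read when huggingface_hub is imported, so it should be set
    # before importing transformers or sentence-transformers.
    os.environ["HF_HOME"] = str(path)
    return path
```

The project's own helper may do more than this (or something different); the sketch only illustrates the general idea of redirecting the cache.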

In [12]:
# Configure HuggingFace cache
configure_hf_cache()

In [13]:
# Initialize HuggingFace Embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    cache_folder="/tmp/hf_cache"
)

## Configuration and Secrets Loading

Configuration parameters and API keys live in separate YAML files. This separation helps maintain security by keeping sensitive information (API keys) apart from general configuration settings. The cell below loads only `config.yaml`.

- **config.yaml**: Contains non-sensitive configuration parameters like model sources and URLs
- **secrets.yaml**: Contains sensitive API keys for services like Galileo and HuggingFace

In [14]:
config = load_config(CONFIG_PATH)

# Data Loading & Cleaning

We load the wiki pages from `wiki_flat_structure.json`, and while doing so we:
* remove any image name that
  - is empty / `None`
  - contains invalid characters (e.g. the `==image_0==` placeholders)
  - has an extension not in {png, jpg, jpeg, webp, gif}
  - points to a file that does **not** exist in `data/context/images/`
* log how many images were discarded (and, at debug level, the first 20 of them) so we can fix the parser later.
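
To make the filtering rules concrete, here is a small standalone sketch of the per-image check. The helper name `image_ref_problem` and the sample file names in the usage note are made up for illustration; the loader actually used by the notebook is defined in the next cell.

```python
from pathlib import Path
from typing import Optional

VALID_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}


def image_ref_problem(name: Optional[str], img_dir: Path) -> Optional[str]:
    """Return the reason an image reference is rejected, or None if it is usable."""
    if not name:
        return "empty"
    ext = Path(name).suffix.lower()
    if ext not in VALID_EXTS:
        # Placeholders such as "==image_0==" have no suffix and end up here.
        return f"ext {ext!r}"
    if not (img_dir / name).is_file():
        return "missing file"
    return None
```

For example, `image_ref_problem("==image_0==", some_dir)` is rejected by the extension check because `Path("==image_0==").suffix` is an empty string.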

In [15]:
VALID_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}

WIKI_METADATA_DIR   = Path(WIKI_METADATA_DIR)
IMAGE_DIR = Path(IMAGE_DIR)

def load_mm_docs_clean(json_path: Path, img_dir: Path) -> List[Document]:
    """
    Load wiki Markdown + image references from *json_path*.
    • Filters out images with bad extensions or missing files.
    • Logs the first 20 broken refs.
    • Returns a list[Document] where metadata = {source, images}
    """
    bad_imgs, docs = [], []

    rows = json.loads(json_path.read_text("utf-8"))
    for row in rows:
        images_ok = []
        for name in row.get("images", []):
            if not name:                                     # empty / placeholder
                bad_imgs.append((row["path"], name, "empty"))
                continue
            ext = Path(name).suffix.lower()
            if ext not in VALID_EXTS:                       # unsupported ext
                bad_imgs.append((row["path"], name, f"ext {ext}"))
                continue
            img_path = img_dir / name
            if not img_path.is_file():                      # missing on disk
                bad_imgs.append((row["path"], name, "missing file"))
                continue
            images_ok.append(name)

        docs.append(
            Document(
                page_content=row["content"],
                metadata={"source": row["path"], "images": images_ok},
            )
        )

    # ---- summary logging ----------------------------------------------------
    if bad_imgs:
        logger.warning("⚠️ %d broken image refs filtered out", len(bad_imgs))
        for src, name, reason in bad_imgs[:20]:
            logger.debug("  » %s → %s (%s)", src, name or "<EMPTY>", reason)
    else:
        logger.info("✅ no invalid image refs found")

    return docs

mm_raw_docs = load_mm_docs_clean(WIKI_METADATA_DIR, IMAGE_DIR)


def log_stage(name: str, docs: List[Document]) -> None:
    """Log how many documents a stage produced and their average length in characters."""
    avg_chars = mean(len(d.page_content) for d in docs) if docs else 0
    logger.info(f"{name}: {len(docs)} docs, avg_chars={avg_chars:.0f}")


log_stage("Docs loaded", mm_raw_docs)

2025-07-16 23:21:42 - INFO - Docs loaded: 558 docs, avg_chars=3082


# Creation of Chunks
Here, we split the loaded documents into chunks, so we have smaller and more specific texts to add to our vector database.
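
As a minimal illustration of what `chunk_size` and `overlap` mean, the sketch below cuts a string into fixed windows that share a few characters. It is a simplification for intuition only: `RecursiveCharacterTextSplitter` additionally prefers to break on natural separators (paragraphs, lines, spaces), and the function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1200, overlap: int = 200) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` chars that share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


pieces = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 200-character overlap plays at the notebook's real chunk size.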

In [16]:
def chunk_documents(
    docs,
    chunk_size: int = 1200,
    overlap: int = 200,
) -> list[Document]:
    """
    1) Split each wiki page on Markdown headers (#, ## …) to keep logical
       sections together.
    2) Recursively break long sections to <= `chunk_size` chars with `overlap`.
    3) Prefix every chunk with its page‑title and store the title in metadata.
    """
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "title"), ("##", "section")]
    )
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
    )

    all_chunks: list[Document] = []
    for doc in docs:
        page_title = Path(doc.metadata["source"]).stem.replace("-", " ")

        # 1️⃣ section‑level split (returns list[Document])
        section_docs = header_splitter.split_text(doc.page_content)

        for section in section_docs:
            # 2️⃣ size‑based split inside each section
            tiny_texts = recursive_splitter.split_text(section.page_content)

            for idx, tiny in enumerate(tiny_texts):
                all_chunks.append(
                    Document(
                        page_content=f"{page_title}\n\n{tiny.strip()}",
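                        metadata={
                            "title": page_title,
                            "source": doc.metadata["source"],
                            # Keys match headers_to_split_on: "section" for ##, "title" for #
                            "section_header": section.metadata.get("section")
                            or section.metadata.get("title", ""),
                            "chunk_id": idx,
                        },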
                    )
                )

    if all_chunks:
        avg_len = int(mean(len(c.page_content) for c in all_chunks))
        logger.info(
            "Chunking complete: %d docs → %d chunks (avg %d chars)",
            len(docs),
            len(all_chunks),
            avg_len,
        )
    else:
        logger.warning("Chunking produced zero chunks for %d docs", len(docs))

    return all_chunks

splits = chunk_documents(mm_raw_docs)

2025-07-16 23:21:42 - INFO - Chunking complete: 558 docs → 2574 chunks (avg 713 chars)


# Setup Embeddings & Vector Store
Here we set up SigLIP for image embeddings and store the embeddings of our cleaned text chunks in Chroma.

In [17]:
# ---------------------------------------------------------------------------
#  Helper: walk all docs once and gather *unique* image vectors + metadata
# ---------------------------------------------------------------------------
def _collect_image_vectors():
    """
    Scans every wiki page for image references and returns three parallel lists:
        img_paths : list[str]   → full file‑system paths (for SigLIP)
        img_ids   : list[str]   → unique key per (page, image) pair
        img_meta  : list[dict]  → {"source": wiki_page, "image": file_name}
    Runs in < 1 s even for thousands of docs.
    """
    img_paths, img_ids, img_meta = [], [], []
    seen = set()

    for doc in mm_raw_docs:                         # raw wiki pages
        src = doc.metadata["source"]
        for name in doc.metadata.get("images", []): # list[str]
            img_id = f"{src}::{name}"
            if img_id in seen:
                continue                            # de‑dupe
            seen.add(img_id)

            img_paths.append(str(IMAGE_DIR / name))
            img_ids.append(img_id)
            img_meta.append({"source": src, "image": name})

    return img_paths, img_ids, img_meta


In [18]:
# --- Image-embedding helper (define once, before the vector-store code) ---
class SiglipEmbeddings(Embeddings):
    def __init__(self,
                 model_id: str = "google/siglip2-base-patch16-224",
                 device: str | None = None):
        from transformers import SiglipModel, SiglipProcessor
        import torch, PIL.Image as PILImage
        self.device    = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.model     = SiglipModel.from_pretrained(model_id).to(self.device)
        self.processor = SiglipProcessor.from_pretrained(model_id)
        self.torch     = torch
        self.PILImage  = PILImage

    # ---- private helpers ---------------------------------------------------
    def _embed_text(self, txts):
        inp = self.processor(text=txts, return_tensors="pt",
                             padding=True, truncation=True).to(self.device)
        with self.torch.no_grad():
            return self.model.get_text_features(**inp).cpu().numpy()

    def _embed_imgs(self, paths):
        imgs = [self.PILImage.open(p).convert("RGB") for p in paths]
        inp  = self.processor(images=imgs, return_tensors="pt").to(self.device)
        with self.torch.no_grad():
            return self.model.get_image_features(**inp).cpu().numpy()

    # ---- LangChain API -----------------------------------------------------
    def embed_documents(self, docs):          # list[str]  (image paths)
        return self._embed_imgs(docs).tolist()

    def embed_query(self, txt):               # single str  (textual query)
        return self._embed_text([txt])[0].tolist()

siglip_embeddings = SiglipEmbeddings()


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [19]:
# ── 1) TEXT store ────────────────────────────────────────────────────────────
def _current_manifest() -> List[str]:
    """Returns a sorted list of every JSON context file we index."""
    return sorted(str(p.resolve()) for p in CONTEXT_DIR.rglob("*.json"))

def _needs_rebuild() -> bool:
    if not CHROMA_DIR.exists() or not MANIFEST_PATH.exists():
        return True
    try:
        old = json.loads(MANIFEST_PATH.read_text())
    except Exception:
        return True
    return old != _current_manifest()

def _save_manifest(manifest: List[str]) -> None:
    CHROMA_DIR.mkdir(parents=True, exist_ok=True)
    MANIFEST_PATH.write_text(json.dumps(manifest, indent=2))

def _build_text_db() -> Chroma:
    collection = "mm_text"
    if _needs_rebuild():
        logger.info("Re‑indexing text context → %s …", CHROMA_DIR)
        chroma = Chroma.from_documents(
            documents        = splits,            # chunks created in the previous section
            embedding        = embeddings,
            collection_name  = collection,
            persist_directory= str(CHROMA_DIR),
        )
        _save_manifest(_current_manifest())
        return chroma
    logger.info("Loading existing Chroma index from %s", CHROMA_DIR)
    return Chroma(
        collection_name   = collection,
        persist_directory = str(CHROMA_DIR),
        embedding_function= embeddings,
    )

text_db = _build_text_db()

# ── 2) IMAGE store ───────────────────────────────────────────────────────────
image_db = Chroma(
    collection_name    = "mm_image",
    persist_directory  = str(CHROMA_DIR),   # SAME dir as text db
    embedding_function = siglip_embeddings, # SigLIP wrapper defined above
)

# Populate vectors *only* if this looks empty -------------------------------
if not image_db._collection.count():
    img_paths, img_ids, img_meta = _collect_image_vectors()
    image_db.add_texts(texts=img_paths, metadatas=img_meta, ids=img_ids)
    image_db.persist()
    logger.info("Indexed %d unique images.", len(img_paths))
else:
    logger.info("Loaded existing image index (%d vectors).",
                image_db._collection.count())



2025-07-16 23:21:48 - INFO - Loading existing Chroma index from ../data/chroma_store
2025-07-16 23:21:48 - INFO - Loaded existing image index (738 vectors).


# Retrieval

We transform the texts and images into embeddings and store them in a vector database. This lets us run a similarity search against the user's query and retrieve the most relevant documents and images.
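
The idea behind similarity search can be sketched with a few hand-written vectors. The toy store and its three-dimensional "embeddings" below are invented for illustration, and a real vector store such as Chroma relies on an index rather than this linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], store: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_store = {
    "feature-flags": [0.9, 0.1, 0.0],
    "deployment": [0.1, 0.9, 0.1],
    "cleanup": [0.0, 0.2, 0.9],
}
hits = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(hits)
```

The query vector points almost exactly along the `feature-flags` entry, so that document ranks first.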

In [None]:

_cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def retrieve_mm(
    query: str,
    k_txt: int = 2,
    k_img: int = 8,
    fetch_k: int = 20,
    hybrid_weights: Optional[Dict[str, float]] = None,
    boost_slug: float = 0.1,
) -> Dict[str, Any]:
    """
    1) Coarse recall: top `fetch_k` docs + init_scores
    2) Cross-encoder rerank: rerank_scores
    3) Build hybrid_score = w1*init + w2*rerank + optional slug boost
    4) Sort by hybrid_score, then pick top-k_txt **without repeating sources**
    5) Image retrieval with $in filter
    """
    # Avoid a mutable default argument; fall back to the original weights
    weights = hybrid_weights or {"init": 0.6, "rerank": 0.4}

    # 1) Coarse recall
    docs_and_init = text_db.similarity_search_with_score(query, k=fetch_k)
    if not docs_and_init:
        return {"docs": [], "images": [], "scores": []}
    docs, distances = zip(*docs_and_init)
    # Chroma returns a distance (lower = closer). Assuming the default
    # non-negative distance metric, map it to a similarity in (0, 1] so it
    # can be mixed with the rerank score (higher = better).
    init_scores = [1.0 / (1.0 + dist) for dist in distances]

    # 2) Rerank
    pairs = [(query, d.page_content) for d in docs]
    rerank_scores = _cross_encoder.predict(pairs)

    # 3) Compute hybrid scores (+ slug boost)
    slug = query.lower().replace(" ", "-")
    hybrid_scores = []
    for d, init, rerank in zip(docs, init_scores, rerank_scores):
        score = weights["init"] * init + weights["rerank"] * rerank
        if slug in d.metadata.get("source", "").lower():
            score += boost_slug
        hybrid_scores.append(score)

    # 4) Deduplicate & select top‑k_txt
    scored_docs = list(zip(docs, hybrid_scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    selected: List[tuple[Document, float]] = []
    seen_sources = set()
    for doc, score in scored_docs:
        src = doc.metadata.get("source")
        if src in seen_sources:
            continue
        seen_sources.add(src)
        selected.append((doc, score))
        if len(selected) >= k_txt:
            break

    if not selected:
        return {"docs": [], "images": [], "scores": []}
    selected_docs, final_scores = zip(*selected)

    # 5) Image retrieval
    sources = [d.metadata["source"] for d in selected_docs]
    q_emb = siglip_embeddings.embed_query(query)
    img_hits = image_db.similarity_search_by_vector(
        q_emb,
        k=k_img * 2,
        filter={"source": {"$in": sources}},
    )
    images = [img.page_content for img in img_hits[:k_img]]

    return {
        "docs": list(selected_docs),
        "images": images,
        "scores": list(final_scores),
    }
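
The scoring and de-duplication steps of `retrieve_mm` can be tried in isolation with made-up numbers. This is a simplified sketch: it assumes both scores are already "higher is better" and on a comparable scale, and the helper name `pick_top_sources`, the page `Setup.md`, and all score values are hypothetical.

```python
from typing import Dict, List, Tuple


def pick_top_sources(
    candidates: List[Dict],
    k: int = 2,
    w_init: float = 0.6,
    w_rerank: float = 0.4,
) -> List[Tuple[str, float]]:
    """Blend two scores per candidate, then keep the best chunk of each source."""
    scored = [
        (c["source"], w_init * c["init"] + w_rerank * c["rerank"])
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    selected, seen = [], set()
    for source, score in scored:
        if source in seen:
            continue  # a higher-scoring chunk from this page is already selected
        seen.add(source)
        selected.append((source, score))
        if len(selected) >= k:
            break
    return selected


toy = [
    {"source": "Feature-Flags.md", "init": 0.8, "rerank": 0.9},
    {"source": "Feature-Flags.md", "init": 0.7, "rerank": 0.8},
    {"source": "Setup.md", "init": 0.5, "rerank": 0.4},
]
picked = pick_top_sources(toy)
print(picked)
```

The second `Feature-Flags.md` chunk scores higher than `Setup.md`, yet it is skipped so that the final context covers two different pages.
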

In [21]:
query = "What are some feature flags that i can enable in AIStudio?"

results = retrieve_mm(query)

# --- text context -------------------------------------------------
for i, doc in enumerate(results["docs"], 1):
    print(f"\n▶ Doc {i}  •  {doc.metadata['source']}")
    print(doc.page_content[:700], "…")

# --- images -------------------------------------------------------
print("\n▶ Images")
for p in results["images"]:
    print(p)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



▶ Doc 1  •  Development/Feature-Flags/Using-Feature-Flags-in-your-code.md
Using Feature Flags in your code

Feature flags are simple string flags that can be used to enable or disable features in the application. This is useful for rolling out features gradually, or for enabling features for specific users. We want to use it today for the following things: Azure and GCP datasets; and the integration with NVIDIA.  
##Back End
Feature flags location: `business/sys/featureflags/flags.go`  
Implementation-wise, there is no need to do anything fancy. Use our old friend, `if`:  
```go
datasetTypes := []string{"aws", "local"}
if featureflags.Has("azure") {
datasetTypes = append(datasetTypes, "azure")
}
if featureflags.Has("gcp") {
datasetTypes = append(datasetTypes, "gcp …

▶ Doc 2  •  Feature-Flags.md
Feature Flags

Enables shared and restricted project creating in AIS.
**For Cloud Only:** Set Header with `{"private-project-key": "e16873af-6fbf-427d-967f-b433503fccfd"}`  
[features.yaml](ht

# Model Setup

In this notebook, we provide three different options for loading the model:
 * **local**: loads the InternVL3-8B-Instruct Q8_0 model from the asset downloaded to the project
 * **hugging-face-local**: downloads a DeepSeek model from Hugging Face and runs it locally
 * **hugging-face-cloud**: accesses the Mistral model through the Hugging Face cloud API (requires a HuggingFace API key saved in secrets.yaml)

This choice can be set in the config.yaml file. The model deployed in the bottom cells of this notebook loads the choice from the config file. The cell below loads the **local** option directly through `LlamaCpp`.
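
The exact key that config.yaml uses for this choice is not shown in the notebook. As a sketch only, assuming a key named `model_source` (hypothetical), the selection could be validated like this before any model is loaded:

```python
from typing import Any, Dict

SUPPORTED_SOURCES = ("local", "hugging-face-local", "hugging-face-cloud")


def resolve_model_source(config: Dict[str, Any], default: str = "local") -> str:
    """Read the (hypothetical) `model_source` key and reject unknown values."""
    source = config.get("model_source", default)
    if source not in SUPPORTED_SOURCES:
        raise ValueError(
            f"Unknown model_source {source!r}; expected one of {SUPPORTED_SOURCES}"
        )
    return source
```

Failing early on an unknown value avoids discovering a typo in the config only after a multi-gigabyte model has started loading.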

In [22]:

llm_mm = LlamaCpp(
    model_path=INTERNVL_MODEL_PATH,
    n_gpu_layers=-1,
    n_ctx=32768,
    n_batch=256,
    f16_kv=True,
    verbose=False,
    # pass any extra args down into llama-cpp-python
    model_kwargs={"mmproj_path": MM_PROJ_PATH},
)

llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility


# Chain Creation
In this part, we define a pipeline that receives a question, retrieves the matching context documents and images, formats them into a prompt, and uses the InternVL3 model to answer the question based on the provided context. The output is then parsed into a string for easy reading.
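
Stripped of LangChain, the pipeline is a composition of four steps: retrieve, format, generate, parse. The sketch below uses stand-in functions (`fake_retriever` and `fake_llm` are placeholders, not the notebook's retriever or model) to show only the data flow.

```python
from typing import Callable, List


def build_prompt(docs: List[str], query: str) -> str:
    """Join the retrieved passages and append the user question."""
    context = "\n\n".join(docs)
    return f"<context>\n{context}\n</context>\n\nUser query:\n{query}"


def make_chain(
    retriever: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose retrieve -> format -> generate -> parse into one callable."""
    def chain(query: str) -> str:
        docs = retriever(query)
        prompt = build_prompt(docs, query)
        return str(llm(prompt)).strip()  # "parse" step: plain, trimmed string
    return chain


def fake_retriever(query: str) -> List[str]:
    # Placeholder for the real retrieval step
    return ["Feature flags live in flags.go."]


def fake_llm(prompt: str) -> str:
    # Placeholder model: reports how many prompt lines it received
    return f"  (model received {len(prompt.splitlines())} prompt lines)  "


toy_chain = make_chain(fake_retriever, fake_llm)
answer = toy_chain("Where are the feature flags defined?")
print(answer)
```

In the notebook, the same roles are played by `retrieve_mm`, the prompt-building code, the `LlamaCpp` model, and `StrOutputParser`.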

In [29]:
SYSTEM_PROMPT = """
You are **AI Studio DevOps Assistant**. Everything you know for this turn is inside
the <context> block. Follow *all* rules below:
1. **Answer comprehensively from the context.**
   - Provide detailed, thorough responses using all relevant information
   - If the answer is missing, write: "I don't know based on the provided context."
   - Never invent facts or rely on outside knowledge.
2. **Be detailed and structured.**
   - For procedures, provide complete numbered steps with explanations
   - Quote file paths / commands in back‑ticks
   - Include relevant examples and details from the context
3. **Use all available information.**
   - Draw from multiple documents when relevant
   - Synthesize information to provide complete answers
"""


def _b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Text-only prompt builder: images are referenced by count, not embedded
def build_multimodal_prompt(inp: dict) -> str:
    """Build a prompt that references images without embedding them."""
    context = "\n\n".join(d.page_content for d in inp["docs"])
    
    # Start with system prompt and context
    prompt = f"{SYSTEM_PROMPT}\n{context}\n\nUser query:\n{inp['query']}"
    
    # Add image references (not the actual base64 data)
    if inp["images"]:
        prompt += f"\n\n[{len(inp['images'])} image(s) provided for analysis]"
    
    return prompt

In [None]:


def resize_image_for_llm(img_path, max_size=256, quality=60):
    """Aggressively resize image to reduce tokens"""
    img = PILImage.open(img_path)
    
    # Resize to much smaller size
    img.thumbnail((max_size, max_size))
    
    # Convert to JPEG with heavy compression
    buffer = io.BytesIO()
    if img.mode in ('RGBA', 'LA', 'P'):
        img = img.convert('RGB')
    img.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    
    return base64.b64encode(buffer.read()).decode()

def call_multimodal_llm(inp: dict) -> str:
    """Call InternVL with the retrieved context; images are referenced in the text."""
    # Wrap the retrieved text in <context> tags, as SYSTEM_PROMPT expects
    context = "<context>\n" + "\n\n".join(d.page_content for d in inp["docs"]) + "\n</context>"

    # Display previews
    for img_path in inp["images"]:
        display(Image(filename=img_path, width=350))

    if not inp["images"]:
        # Text-only fallback
        text_prompt = f"{SYSTEM_PROMPT}\n{context}\n\nUser query:\n{inp['query']}"
        return llm_mm.invoke(text_prompt)

    # Chat messages carry plain-string content
    user_text = f"{context}\n\nUser query:\n{inp['query']}"

    # Add image references in text (not as separate objects)
    for i, _ in enumerate(inp["images"], 1):
        user_text += f"\n\n[Image {i} provided]"

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text}
    ]

    try:
        # Try chat completion first
        response = llm_mm.client.create_chat_completion(
            messages=messages,
            max_tokens=1000,
            temperature=0.7
        )
        return response["choices"][0]["message"]["content"]
    except Exception as e:
        logger.warning("Chat completion failed: %s", e)

        try:
            prompt = f"{SYSTEM_PROMPT}\n{context}\n\nUser query:\n{inp['query']}\n\n"

            # Add compressed images as inline base64
            for img_path in inp["images"]:
                compressed_b64 = resize_image_for_llm(img_path)
                prompt += f"<img>{compressed_b64}</img>\n"

            return llm_mm.invoke(prompt)
        except Exception as e2:
            logger.warning("Compressed image fallback failed: %s", e2)
            # Last resort: text-only with image count
            text_prompt = f"{SYSTEM_PROMPT}\n{context}\n\nUser query:\n{inp['query']}\n\n[Note: {len(inp['images'])} images were provided but could not be processed due to size constraints]"
            return llm_mm.invoke(text_prompt)

mm_chain = (
    {
      "query":   RunnablePassthrough(),
      "results": RunnableLambda(lambda q: retrieve_mm(q)),
    }
    | RunnableLambda(lambda d: {
          "docs":     d["results"]["docs"],
          "images":   d["results"]["images"],
          "query":    d["query"],  # Keep as 'query' to match call_multimodal_llm
      })
    | RunnableLambda(call_multimodal_llm)
    | StrOutputParser()
)

In [None]:
# ✅ Quick Test

question = "What are the ai blueprints best practices?"
print(mm_chain.invoke(question))


In [None]:
question2 = "What are some feature flags that i can enable in AIStudio?"
print(mm_chain.invoke(question2))

In [None]:
question3 = "How do i manually clean my environment without hooh?"
print(mm_chain.invoke(question3))

In [None]:
question4 = "How do I test a config after i sign it?"
print(mm_chain.invoke(question4))

Built with ❤️ using Z by HP AI Studio.