<h1 style="text-align: center; font-size: 50px;">Multimodal RAG Chatbot with LangChain and MLflow Evaluation</h1>

Retrieval-Augmented Generation (RAG) is an architectural approach that can enhance the effectiveness of large language model (LLM) applications by grounding them in customized data. In this example, we use LangChain, an orchestrator for language pipelines, to build a multimodal assistant that loads wiki pages (text and image references) from a local JSON file and uses them to answer user questions. We also use MLflow to register the model and evaluate the LLM responses.

# Notebook Overview
- Imports
- Configurations
- Verify Assets
- Data Loading
- Creation of Chunks
- Retrieval
- Model Setup
- Chain Creation
- Model Service 

# Imports

By using our Local GenAI workspace image, many of the libraries needed to work with RAG come pre-installed. In our case, we just need to add the connector for working with PDF documents.

In [1]:
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# === Standard Library Imports ===
from typing import List
from datetime import datetime
import warnings
from pathlib import Path
import os
import sys
import logging
import json

# === Data handling ===
import pandas as pd

# === MLflow integration ===
import mlflow

# Define the relative path to the 'core' directory (one level up from current working directory)
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
# === Import ChatbotService from project core ===
from core.chatbot_service.chatbot_service import ChatbotService

# === Third-Party Imports ===
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough, RunnableLambda
from langchain.schema.document import Document
from langchain.document_loaders import WebBaseLoader, JSONLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
import promptquality as pq
import torch

from transformers import SiglipProcessor, SiglipModel
from llama_cpp import Llama
# The parent directory was already added to sys.path above, which also makes 'src' importable.

# === Project-Specific Imports (from src) ===
from src.local_genai_judge import LocalGenAIJudge
from src.utils import (
    load_config_and_secrets,
    configure_proxy,
    initialize_llm,
    configure_hf_cache,
    mlflow_evaluate_setup,
)

2025-07-10 20:14:55.547704: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-07-10 20:14:55.921847: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752178496.061168     373 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752178496.105455     373 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1752178496.402931     373 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

# Configurations

In [3]:
warnings.filterwarnings("ignore")

In [4]:
# Create logger
logger = logging.getLogger("multimodal_rag_logger")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", 
                              datefmt="%Y-%m-%d %H:%M:%S") 

stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
logger.propagate = False

In [22]:
CONFIG_PATH = "../configs/config.yaml"
SECRETS_PATH = "../configs/secrets.yaml"
DATA_PATH = "../data"
IMAGE_DIR = os.path.join(DATA_PATH, "images")  # PNG/JPGs
MM_JSON = os.path.join(DATA_PATH, "wiki_flat_structure.json")

MLFLOW_EXPERIMENT_NAME = "AIStudio-Multimodal-Chatbot-Experiment"
MLFLOW_RUN_NAME = "AIStudio-Multimodal-Chatbot-Run"

INTERNVL_MODEL_PATH = "/home/jovyan/datafabric/InternVL3-8B-InstructQ8_0/InternVL3-8B-Instruct-Q8_0.gguf"
MM_PROJ_PATH = "/home/jovyan/datafabric/mmproj-InternVL3-8B-InstructQ8_0/mmproj-InternVL3-8B-Instruct-Q8_0.gguf"

DEMO_FOLDER = "../demo"
MLFLOW_MODEL_NAME = "AIStudio-Multimodal-Chatbot-Model"

In [23]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [24]:
from llama_cpp import llama_supports_gpu_offload
print("GPU off-load supported:", llama_supports_gpu_offload())

GPU off-load supported: False


In [25]:
logger.info('Notebook execution started.')

2025-07-10 20:25:19 - INFO - Notebook execution started.


## Configuration of HuggingFace caches

In the next cell, we configure the HuggingFace cache so that all models downloaded from HuggingFace are persisted locally, even after the workspace is closed. This is a desired future feature for AI Studio and the GenAI addon.
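
The internals of `configure_hf_cache` live in `src/utils` and are not shown here. As a rough sketch of the idea, assuming the helper works by pointing the Hugging Face libraries at a persistent directory, the standard `HF_HOME` environment variable can be set before any model is loaded. The function name `set_hf_cache_dir` is hypothetical.

```python
import os
from pathlib import Path


def set_hf_cache_dir(cache_dir: str) -> Path:
    """Create cache_dir if needed and point Hugging Face libraries at it."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    # HF_HOME is the root cache location read by huggingface_hub; set it
    # before importing transformers or sentence-transformers.
    os.environ["HF_HOME"] = str(path)
    return path
```

The variable only takes effect for libraries imported after it is set, so a call like this belongs at the very top of the notebook.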

In [26]:
# Configure HuggingFace cache
configure_hf_cache()

In [27]:
# Initialize HuggingFace Embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    cache_folder="/tmp/hf_cache"
)

2025-07-10 20:25:19,460 - INFO - Use pytorch device_name: cuda:0
2025-07-10 20:25:19,461 - INFO - Load pretrained SentenceTransformer: intfloat/e5-large-v2


## Configuration and Secrets Loading

In this section, we load configuration parameters and API keys from separate YAML files. This separation helps maintain security by keeping sensitive information (API keys) separate from configuration settings.

- **config.yaml**: Contains non-sensitive configuration parameters like model sources and URLs
- **secrets.yaml**: Contains sensitive API keys for services like Galileo and HuggingFace

In [28]:
config, secrets = load_config_and_secrets(CONFIG_PATH, SECRETS_PATH)

# Verify Assets

In [29]:
def log_asset_status(asset_path: str, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        # Missing assets are logged as warnings so they stand out from routine INFO messages
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")

log_asset_status(
    asset_path=CONFIG_PATH,
    asset_name="Config",
    success_message="",
    failure_message="Please check if the config.yaml was properly configured in your project on AI Studio."
)

log_asset_status(
    asset_path=SECRETS_PATH,
    asset_name="Secrets",
    success_message="",
    failure_message="Please check if the secrets.yaml was properly configured in your project on AI Studio."
)

log_asset_status(
    asset_path=INTERNVL_MODEL_PATH,
    asset_name="Local InternVL-8B model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio if you want to use local model.")

log_asset_status(
    asset_path=MM_PROJ_PATH,
    asset_name="Vision projector (.gguf)",
    success_message="",
    failure_message="Download mmproj-InternVL3-8B-Instruct-Q8_0.gguf")

log_asset_status(
    asset_path=MM_JSON,
    asset_name="wiki_flat_structure.json",
    success_message="",
    failure_message="Place JSON Wiki Pages in data/")

2025-07-10 20:25:23 - INFO - Config is properly configured. 
2025-07-10 20:25:23 - INFO - Secrets is properly configured. 
2025-07-10 20:25:23 - INFO - Local InternVL-8B model is properly configured. 
2025-07-10 20:25:23 - INFO - Vision projector (.gguf) is properly configured. 
2025-07-10 20:25:23 - INFO - wiki_flat_structure.json is properly configured. 


# Data Loading

In this step, we load the wiki pages from a local JSON file (`wiki_flat_structure.json`) and wrap each entry in a LangChain `Document`, keeping the source path and any image references as metadata.

In [30]:
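def load_mm_docs(json_path: str) -> List[Document]:
    """Load wiki pages from a JSON list of rows with "path", "content" and optional "images" keys."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    docs = []
    for row in data:
        # Keep the source path and any image references as metadata for later retrieval
        meta = {"source": row["path"], "images": row.get("images", [])}
        docs.append(Document(page_content=row["content"], metadata=meta))
    return docs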

mm_raw_docs = load_mm_docs(MM_JSON)
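
`load_mm_docs` reads `row["path"]` and `row["content"]` without a default, so a malformed entry raises a `KeyError` partway through loading. A small pre-check can report all bad rows up front. This is only a sketch: `find_invalid_rows` is a hypothetical helper, and the sample records are made up for illustration.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("path", "content")  # keys that load_mm_docs reads without a default


def find_invalid_rows(data: List[Dict[str, Any]]) -> List[int]:
    """Return the indices of rows whose required keys are missing or not strings."""
    bad = []
    for i, row in enumerate(data):
        if not all(isinstance(row.get(k), str) for k in REQUIRED_KEYS):
            bad.append(i)
    return bad


sample = [
    {"path": "wiki/a.md", "content": "Page A", "images": ["a.png"]},
    {"path": "wiki/b.md"},                      # missing "content"
    {"content": "orphan text", "images": []},   # missing "path"
]
print(find_invalid_rows(sample))  # [1, 2]
```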

# Creation of Chunks
Here, we split the loaded documents into chunks, so we have smaller and more specific texts to add to our vector database.

In [31]:
# === Initialize text splitter ===
# - chunk_size: Maximum number of characters per text chunk.
# - chunk_overlap: Number of overlapping characters between chunks.

def chunk_documents(docs, chunk_size=600, overlap=100):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n## ", "\n# ", "\n\n", ".", "!", "?"]
    )
    splits = splitter.split_documents(docs)
    return splits

splits = chunk_documents(mm_raw_docs)

def log_stage(name: str, docs: List[Document]):
    # Note: "avg_tokens" is the mean character count per chunk (len of page_content), not a token count
    logger.info(f"{name}: {len(docs)} docs, avg_tokens={sum(len(d.page_content) for d in docs)/len(docs):.0f}")

# e.g. after splits
log_stage("Chunks created", splits)

2025-07-10 20:25:23 - INFO - Chunks created: 4089 docs, avg_tokens=412
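
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to break on the listed separators, which this sketch ignores. The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 600, overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size characters; consecutive windows share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


parts = sliding_chunks("abcdefghij", chunk_size=4, overlap=1)
print(parts)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is what the overlap buys: a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.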


# Retrieval

We transform the texts into embeddings and store them in a vector database. This allows us to run similarity searches and retrieve the most relevant documents.
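
Conceptually, similarity search ranks stored vectors by their closeness to the query vector. The sketch below does this with cosine similarity on toy 2-D vectors in plain Python. The notebook itself delegates this work to Chroma, whose index and distance metric are configured separately; the vectors, ids and function names here are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored ids most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = {"setup": [1.0, 0.0], "testing": [0.0, 1.0], "deploy": [1.0, 1.0]}
print([doc_id for doc_id, _ in top_k([0.9, 0.1], store)])  # ['setup', 'deploy']
```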

In [32]:
from chromadb.config import Settings
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata
from copy import deepcopy
from langchain_core.embeddings import Embeddings
from PIL import Image

CHROMA_SETTINGS = Settings(anonymized_telemetry=False)
clean_chunks = filter_complex_metadata(deepcopy(splits))

# TEXT collection  (e5-large embeddings)  ― already have `clean_chunks`

text_db = Chroma.from_documents(
    documents=clean_chunks,
    embedding=embeddings,
    collection_name="wiki_text_mm",
    client_settings=CHROMA_SETTINGS,
)
logger.info("Text collection ready: %d vectors", text_db._collection.count())

2025-07-10 20:25:23,677 - ERROR - Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
2025-07-10 20:25:23,679 - ERROR - Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
2025-07-10 20:27:45 - INFO - Text collection ready: 8178 vectors


In [33]:
class SiglipEmbeddings(Embeddings):
    def __init__(self,
                 model_id: str = "google/siglip2-base-patch16-224",
                 device: str | None = None):
        self.device   = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.model    = SiglipModel.from_pretrained(model_id).to(self.device)
        self.processor = SiglipProcessor.from_pretrained(model_id)

    # --- private helpers ---------------------------------------------------
    def _embed_text(self, texts: List[str]):
        inp = self.processor(text=texts, return_tensors="pt",
                             padding=True, truncation=True).to(self.device)
        with torch.no_grad():
            return self.model.get_text_features(**inp).cpu().numpy()

    def _embed_imgs(self, paths: List[str]):
        imgs = [Image.open(p).convert("RGB") for p in paths]
        inp  = self.processor(images=imgs, return_tensors="pt").to(self.device)
        with torch.no_grad():
            return self.model.get_image_features(**inp).cpu().numpy()

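    # --- LangChain Embeddings interface -----------------------------------
    def embed_documents(self, docs: List[str]) -> List[List[float]]:
        # "Documents" in this collection are image file paths, so they go through the image encoder
        return self._embed_imgs(docs).tolist()

    def embed_query(self, text: str) -> List[float]:
        # Queries are plain text and go through the text encoder of the same model
        return self._embed_text([text])[0].tolist()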


In [34]:
siglip_embeddings = SiglipEmbeddings()
image_db = Chroma(
    collection_name="wiki_image_mm",
    embedding_function=siglip_embeddings,
    client_settings=CHROMA_SETTINGS,
)


2025-07-10 20:27:50,190 - ERROR - Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
2025-07-10 20:27:50,192 - ERROR - Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


# Model Setup

In this notebook, we provide three different options for loading the model:
 * **local**: loads the llama3.1-8b-instruct-Q8_0 model from the asset downloaded to the project
 * **hugging-face-local**: downloads a DeepSeek model from Hugging Face and runs it locally
 * **hugging-face-cloud**: accesses the Mistral model through the Hugging Face cloud API (requires a HuggingFace API key saved in secrets.yaml)

This choice is set in the config.yaml file, and the model deployed in the final cells of this notebook reads it from there. Note that the cells below leave this config-driven loader (`initialize_llm`) commented out and instead load the multimodal InternVL3-8B-Instruct GGUF model directly through `llama_cpp`.

In [35]:
# model_source = config["model_source"]

In [36]:
%%time


llm_mm = Llama(
    model_path   = INTERNVL_MODEL_PATH,   # 8-B model .gguf
    mmproj_path  = MM_PROJ_PATH,          # projector .gguf  ← the key line
    chat_format  = "internvl",
    n_gpu_layers = -1,
    n_ctx        = 8192,
    n_batch      = 256,
    f16_kv       = True,
    verbose      = False,
)


# llm = initialize_llm(model_source, secrets)

llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility


CPU times: user 2.09 s, sys: 28.2 s, total: 30.3 s
Wall time: 27min 47s


# Chain Creation
In this part, we define a pipeline that receives a question, retrieves the relevant context (text chunks and, in the multimodal chain, images), formats the context documents, and uses the chat model to answer the question based on the provided context. The output is then formatted as a string for easy reading.

In [37]:
# -------------------------------------------------
# ### ⬇ NEW  Multimodal Chain Creation ------------
# -------------------------------------------------
import base64
from typing import Dict

from langchain_core.runnables import Runnable

def _b64(path: str) -> str:
    """Read a file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def retrieve_mm(query: str, k_txt=4, k_img=4):
    # text: MMR search over the e5 text collection
    txt_docs = text_db.max_marginal_relevance_search(query, k=k_txt, fetch_k=20)
    # image: SigLIP text-to-image search; assumes image_db was populated with
    # image file paths as the document text (the query is embedded by embed_query)
    img_docs = image_db.similarity_search(query, k=k_img)
    img_paths = [d.page_content for d in img_docs]
    return {"docs": txt_docs, "images": img_paths}

def build_messages(inp: Dict):
    context = "\n\n".join(d.page_content for d in inp["docs"])
    images  = [{"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{_b64(p)}"}}
               for p in inp["images"]]

    msgs = [
        {"role": "system",
         "content": ("You are a technical assistant for HP’s Z by HP AI Studio.\n"
                     "Answer **only** from the <context> and the images; "
                     "if the answer is missing reply: "
                     "\"I don’t have that information in the wiki yet.\"")},
        {"role": "user",
         "content": [{"type": "text",
                      "text": f"<context>\n{context}\n</context>\n\n{inp['query']}"}] + images}
    ]
    return msgs

def call_llm(msgs):
    res = llm_mm.create_chat_completion(
        messages   = msgs,
        max_tokens = 512,
        temperature=0.2
    )
    return res["choices"][0]["message"]["content"]

# LangChain Runnable pipeline
mm_chain: Runnable = (
    {
        "query": RunnablePassthrough(),
        # retriever returns dict with docs & images
        "retrieved": RunnableLambda(retrieve_mm),
    }
    | RunnableLambda(lambda d: {"query": d["query"],
                                "docs": d["retrieved"]["docs"],
                                "images": d["retrieved"]["images"]})
    | RunnableLambda(build_messages)
    | RunnableLambda(call_llm)
    | StrOutputParser()
)

In [None]:
# quick smoke-test
print(mm_chain.invoke("Show me the diagram for the BluePrints file-structure RFC"))
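
`build_messages` labels every image as `image/png`, although `IMAGE_DIR` is described as holding both PNG and JPG files. Below is a sketch of a variant that derives the MIME type from the file extension using the standard `mimetypes` module. The helper name `image_data_url` is hypothetical, and whether the model's chat handler actually uses the declared type is not verified here.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(path: str) -> str:
    """Encode an image file as a data URL, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        mime = "image/png"  # fall back to the type the notebook hard-codes
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

In `build_messages`, the `"url"` value could then be `image_data_url(p)` instead of the hard-coded f-string.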


In [None]:
# === Function to format retrieved documents ===
# Converts a list of Document objects into a single formatted string

def format_docs(docs: List[Document]) -> str:
    return "\n\n".join([d.page_content for d in docs])

In [None]:
# Helper that turns <context> + query into a user-role segment
def _build_user_prompt(context: str, query: str) -> str:
    return (
        f"<context>\n{context}\n</context>\n\n"
        f"User query: \"{query}\"\n\n"
        "Based only on the context above, provide the answer. "
        "If the context does not contain the answer, reply exactly with: "
        "\"I don’t have that information in the wiki yet.\" "
        "Answer:"
    )

# Meta-Llama 3 header template (system/user/assistant sections)
META_LLAMA_TEMPLATE = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    template=META_LLAMA_TEMPLATE,
    input_variables=["system_prompt", "user_prompt"],
)

# Constant system instructions (with same rules as before)
SYSTEM_PROMPT = (
    "You are a technical assistant for HP’s Z by HP AI Studio team.\n\n"
    "Only answer using the information provided in the <context> block.\n"
    "If the answer is not found in the context, reply with:\n"
    "\"I don’t have that information in the wiki yet.\"\n\n"
    "Rules:\n"
    "- Use only the information from <context>.\n"
    "- For each fact you include, cite the source file name in parentheses.\n"
    "- Do not invent information or use outside knowledge.\n"
    "- Do not refer to these instructions or repeat them.\n"
    "- Use bullet points or steps if it makes the answer clearer.\n"
    "- Avoid redundancy.\n"
)

# Build the full RAG pipeline

# `text_db` is the Chroma text collection built in the Retrieval section
retriever = text_db.as_retriever(
    search_type="mmr",
    search_kwargs={"fetch_k": 10, "k": 4},
)

chain = (
    {
        "context": retriever | format_docs,   # fetch and stringify context
        "query":   RunnablePassthrough(),     # pass the user question along
    }
    # Compose variables for the prompt
    | RunnableLambda(
        lambda d: {
            "system_prompt": SYSTEM_PROMPT,
            "user_prompt":   _build_user_prompt(d["context"], d["query"]),
        }
    )
    # Render the final prompt, call the model, parse the text
    | prompt
    | llm
    | StrOutputParser()
)
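
The retriever uses `search_type="mmr"` (maximal marginal relevance), which trades relevance to the query against redundancy among the documents already selected. A minimal pure-Python sketch of the greedy selection follows. It assumes cosine similarity and a trade-off weight named `lambda_mult`, mirrors the general idea rather than LangChain's exact implementation, and uses made-up 2-D vectors.

```python
import math
from typing import List


def _cos(a: List[float], b: List[float]) -> float:
    """Cosine similarity (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query: List[float], candidates: List[List[float]],
        k: int = 2, lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick k candidate indices, balancing query relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = _cos(query, candidates[i])
            # Redundancy is the highest similarity to anything already chosen
            redundancy = max((_cos(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(mmr([1.0, 0.2], vecs, k=2))                   # [1, 2]
print(mmr([1.0, 0.2], vecs, k=2, lambda_mult=1.0))  # [1, 0]
```

With the default weight, the second pick is the dissimilar vector at index 2 rather than the near-duplicate at index 0. Setting `lambda_mult=1.0` disables the diversity term and falls back to plain relevance ranking.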

In [None]:
# Invoke to test our RAG response quality
question = "How do i test ai blueprints?"
answer = chain.invoke(question)
print(answer)

# MLflow Model Service

In this section, we demonstrate how to deploy a RAG-based chatbot service. The service provides a REST API endpoint that lets users query the knowledge base with natural language questions, upload new documents to the knowledge base, and manage conversation history, all with built-in safeguards against sensitive information and toxicity.

It encapsulates the functionality developed in this notebook, including the document retrieval system, the RAG-based question answering capabilities, and the Galileo integration for protection, observation and evaluation. The example uses our ChatbotService from the core/chatbot_service directory.

## Setup

In [None]:
mlflow_evaluate_setup(
    secrets,
    mlflow_tracking_uri="/phoenix/mlflow"   # your AI Studio MLflow endpoint
)

# === Set MLflow experiment context ===
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

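# === Validate local model file path ===
# LOCAL_MODEL_PATH is never defined in this notebook; INTERNVL_MODEL_PATH is the only model path configured above
if not os.path.exists(INTERNVL_MODEL_PATH):
    logger.warning(f"⚠️ Warning: Model file not found at {INTERNVL_MODEL_PATH}. Please verify the path.")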

## Log & Register Model

In [None]:
# === Log and register model to MLflow ===
with mlflow.start_run(run_name=MLFLOW_RUN_NAME) as run:
    
    # Log model artifacts using custom ChatbotService
    ChatbotService.log_model(
        artifact_path=MLFLOW_MODEL_NAME,
        config_path=CONFIG_PATH,
        secrets_path=SECRETS_PATH,
        docs_path=DATA_PATH,
        model_path=INTERNVL_MODEL_PATH,  # the only model path defined in this notebook
        demo_folder=DEMO_FOLDER
    )

    # Construct the URI for the logged model
    model_uri = f"runs:/{run.info.run_id}/{MLFLOW_MODEL_NAME}"

In [None]:
# Register the model into MLflow Model Registry
mlflow.register_model(
    model_uri=model_uri,
    name=MLFLOW_MODEL_NAME
)

logger.info(f"✅ Model registered successfully with run ID: {run.info.run_id}")

## Evaluate Faithfulness and Relevance

In [None]:

def model(batch_df: pd.DataFrame) -> pd.DataFrame:
    preds, contexts = [], []
    for q in batch_df["questions"]:
        answer = chain.invoke(q)
        preds.append(answer)

        # Retrievers are Runnables; invoke() replaces the deprecated get_relevant_documents()
        docs = retriever.invoke(q)
        contexts.append(" ".join(d.page_content for d in docs))

    # keep the incoming index so every batch’s rows stay unique
    return pd.DataFrame(
        {
            "result": preds,
            "source_documents": contexts,
        },
        index=batch_df.index,      #  ← key line
    )

# --- Evaluation dataset
eval_df = pd.DataFrame({"questions": [
    "What naming convention should I use for a new blueprint project folder?",
    "What is the first step in the standard blueprint testing workflow?",
    "How do I fetch logs from a running Kubernetes pod?",
]})

judge = LocalGenAIJudge(
    llm=llm
)

faithfulness_metric = judge.to_mlflow_metric("faithfulness")
relevance_metric = judge.to_mlflow_metric("relevance")

results = mlflow.evaluate(
    model,
    eval_df,
    predictions="result",
    evaluators="default",
    extra_metrics=[faithfulness_metric, relevance_metric],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents"
        }
    },
)


Built with ❤️ using Z by HP AI Studio.