<h1 style="text-align: center; font-size: 50px;">Vanilla RAG Chatbot with LangChain and Opik Evaluation</h1>

Retrieval-Augmented Generation (RAG) is an architectural approach that improves large language model (LLM) applications by grounding their answers in custom data. In this example, we use LangChain, an orchestration framework for language pipelines, to build an assistant that loads information from a knowledge base (here, a local wiki export) and uses it to answer user questions. We also use the Opik platform to observe and evaluate the LLM responses.

# Notebook Overview
- Imports
- Configurations
- Verify Assets
- Data Loading
- Creation of Chunks
- Retrieval
- Model Setup
- Chain Creation
- Opik Evaluate
- Model Service

# Imports

Many of the libraries needed for RAG come pre-installed in our Local GenAI workspace image. Here we only need to install the remaining dependencies listed in requirements.txt, such as the connector for PDF documents.

In [1]:
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# === Standard Library Imports ===
from typing import List
from datetime import datetime
import warnings
from pathlib import Path
import os
import sys
import logging

# === MLflow integration ===
import mlflow

# Add the project root (one level up from the current working directory) to sys.path so that 'core' and 'src' can be imported
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
# === Import ChatbotService from project core ===
from core.chatbot_service.chatbot_service import ChatbotService

# === Third-Party Imports ===
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough, RunnableLambda
from langchain.schema.document import Document
from langchain.document_loaders import WebBaseLoader, JSONLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
import promptquality as pq
import torch

import opik
from opik import Opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import (
    Hallucination,
    AnswerRelevance,
    ContextPrecision,
    ContextRecall,
    GEval,
)

# The project root was already added to sys.path above, so 'src' is importable

# === Project-Specific Imports (from src) ===
from src.local_judge import LangChainJudge
from src.utils import (
    load_config_and_secrets,
    configure_proxy,
    initialize_llm,
    configure_hf_cache,
    setup_opik_environment,
)



# Configurations

In [3]:
warnings.filterwarnings("ignore")

In [4]:
# Create logger
logger = logging.getLogger("vanilla_rag_logger")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", 
                              datefmt="%Y-%m-%d %H:%M:%S") 

stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
logger.propagate = False

In [5]:
CONFIG_PATH = "../configs/config.yaml"
SECRETS_PATH = "../configs/secrets.yaml"
DATA_PATH = "../data"
MLFLOW_EXPERIMENT_NAME = "Wiki-Chatbot-Experiment"
MLFLOW_RUN_NAME = "Wiki-Chatbot-Run"
LOCAL_MODEL_PATH = "/home/jovyan/datafabric/llama3.1-8b-instruct/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"
LOCAL_MODEL_PATHV2 = "/home/jovyan/datafabric/Qwen3-8B-Q4_K_M/Qwen3-8B-Q4_K_M.gguf"
LOCAL_MODEL_PATHV3 = "/home/jovyan/datafabric/meta-llama-3.1-8b-instruct-q4_k_m/meta-llama-3.1-8b-instruct-q4_k_m.gguf"
DEMO_FOLDER = "../demo"
MLFLOW_MODEL_NAME = "Wiki-Chatbot-Model"

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [7]:
logger.info('Notebook execution started.')

2025-07-02 05:26:34 - INFO - Notebook execution started.


## Configuration of HuggingFace caches

In the next cell, we configure the HuggingFace cache so that downloaded models persist locally, even after the workspace is closed. This is a desired future feature for AI Studio and the GenAI addon.

In [8]:
# Configure HuggingFace cache
configure_hf_cache()
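
The implementation of `configure_hf_cache` lives in `src/utils` and is not shown here. As a rough sketch of what such a helper could do, the hypothetical function below creates a cache folder and points the `HF_HOME` environment variable at it. The function name and the example path are assumptions for illustration, not the project's actual code.

```python
import os
from pathlib import Path


def configure_hf_cache_sketch(cache_dir: str) -> str:
    """Hypothetical helper: point Hugging Face libraries at a persistent cache."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    # HF_HOME is the root cache location read by the Hugging Face libraries.
    os.environ["HF_HOME"] = str(path)
    return str(path)


if __name__ == "__main__":
    # Example path only; use a location that survives workspace restarts.
    print(configure_hf_cache_sketch("~/hf_cache_demo"))
```

Note that the Hugging Face libraries usually read this variable when they are first imported, so a helper like this is most reliable when it runs before those imports. Passing `cache_folder` explicitly, as the embeddings cell does, avoids that ordering concern.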

In [9]:
# Initialize HuggingFace Embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    cache_folder="/tmp/hf_cache"
)

2025-07-02 05:26:34,664 - INFO - PyTorch version 2.7.1 available.
2025-07-02 05:26:34,752 - INFO - Use pytorch device_name: cuda
2025-07-02 05:26:34,753 - INFO - Load pretrained SentenceTransformer: intfloat/e5-large-v2



## Configuration and Secrets Loading

In this section, we load configuration parameters and API keys from separate YAML files. This separation helps maintain security by keeping sensitive information (API keys) separate from configuration settings.

- **config.yaml**: Contains non-sensitive configuration parameters like model sources and URLs
- **secrets.yaml**: Contains sensitive API keys for services like Galileo and HuggingFace

In [10]:
config, secrets = load_config_and_secrets(CONFIG_PATH, SECRETS_PATH)

## Opik Configurations

In [11]:
setup_opik_environment(config, secrets)

opik.configure()

2025-07-02 05:27:57,764 - INFO - HTTP Request: GET https://www.comet.com/api/rest/v2/account-details "HTTP/1.1 200 OK"


Do you want to use "c-jimmy1" workspace? (Y/n) Y


OPIK: Configuration saved to file: /home/jovyan/.opik.config


# Verify Assets

In [13]:
def log_asset_status(asset_path: str, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")

log_asset_status(
    asset_path=CONFIG_PATH,
    asset_name="Config",
    success_message="",
    failure_message="Please check that config.yaml was properly configured in your project on AI Studio."
)

log_asset_status(
    asset_path=SECRETS_PATH,
    asset_name="Secrets",
    success_message="",
    failure_message="Please check that secrets.yaml was properly configured in your project on AI Studio."
)
log_asset_status(
    asset_path=LOCAL_MODEL_PATH,
    asset_name="Local Llama model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio if you want to use local model."
)

2025-07-02 05:28:04 - INFO - Config is properly configured. 
2025-07-02 05:28:04 - INFO - Secrets is properly configured. 
2025-07-02 05:28:04 - INFO - Local Llama model is properly configured. 


# Data Loading

In this step, we use LangChain's JSONLoader to extract the content of a local JSON export of the wiki. The cell also includes commented-out examples showing how to load pages from the web with WebBaseLoader.

In [14]:
# === Verify existence of the data directory ===
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"'data' folder not found at path: {os.path.abspath(DATA_PATH)}")

@opik.track
def load_wiki_data(DATA_PATH):
    # === Wiki JSON with JSONLoader ===
    wiki_loader = JSONLoader(
        file_path=os.path.join(DATA_PATH, "wiki_flat_structure.json"),
        jq_schema="to_entries[] | {source: .key, text: .value.content}",  # adapt to your schema
        text_content=False  # the jq result is an object, not a plain string, so do not require string content
    )

    docs = wiki_loader.load()

    return docs

docs = load_wiki_data(DATA_PATH)
# === Optional: Load additional web-based documents ===
# To use a different knowledge base, just change the URLs below

# loader1 = WebBaseLoader("https://www.hp.com/us-en/workstations/ai-studio.html")
# data1 = loader1.load()

# loader2 = WebBaseLoader("https://zdocs.datascience.hp.com/docs/aistudio")
# data2 = loader2.load()

OPIK: Started logging traces to the "RAG_DS_DEMO" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=0197c99b-4157-73b2-98b4-e13129ed9285&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


# Creation of Chunks
Here, we split the loaded documents into chunks, so we have smaller and more specific texts to add to our vector database.

In [15]:
# === Initialize text splitter ===
# - chunk_size: Maximum number of characters per text chunk.
# - chunk_overlap: Number of overlapping characters between chunks.

@opik.track
def chunk_documents(docs, chunk_size=600, overlap=100):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n## ", "\n# ", "\n\n", ".", "!", "?"]
    )
    splits = splitter.split_documents(docs)
    return splits

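splits = chunk_documents(docs)


def log_stage(name: str, docs: List[Document]) -> None:
    # Note: "avg_tokens" reports the mean chunk length in characters (len(page_content)), not model tokens.
    logger.info(f"{name}: {len(docs)} docs, avg_tokens={sum(len(d.page_content) for d in docs)/len(docs):.0f}")


# Log chunk statistics after splitting
log_stage("Chunks created", splits)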

2025-07-02 05:28:04 - INFO - Chunks created: 3633 docs, avg_tokens=499
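
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a deliberate simplification: `RecursiveCharacterTextSplitter` additionally tries to break on the listed separators, which this toy function (a hypothetical `sliding_window_chunks`) does not attempt.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 600, overlap: int = 100) -> List[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 3  # 30 characters
    for chunk in sliding_window_chunks(sample, chunk_size=12, overlap=4):
        print(len(chunk), chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.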


# Retrieval

We transform the chunks into embeddings and store them in a vector database. This lets us run a similarity search and retrieve the chunks most relevant to a question.

In [16]:
%%time

# === Create a vector database from document chunks ===
vectordb = Chroma.from_documents(documents=splits, embedding=embeddings)

# === Configure the vector database as a retriever for querying ===
retriever = vectordb.as_retriever(
    search_type="mmr",
    search_kwargs={"fetch_k": 20, "k": 8}
)

2025-07-02 05:28:04,992 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2025-07-02 05:28:08,270 - INFO - HTTP Request: POST https://www.comet.com/opik/api/v1/private/spans/batch "HTTP/1.1 204 No Content"
2025-07-02 05:28:08,632 - INFO - HTTP Request: POST https://www.comet.com/opik/api/v1/private/traces/batch "HTTP/1.1 204 No Content"


CPU times: user 50.5 s, sys: 1.61 s, total: 52.2 s
Wall time: 47.7 s
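
The retriever above uses `search_type="mmr"` (maximal marginal relevance) with `fetch_k=20` and `k=8`. The sketch below illustrates the general idea in plain Python: first keep the `fetch_k` candidates closest to the query, then greedily pick `k` of them while penalising similarity to what was already picked. The function names and the `lambda_mult` weighting are assumptions for illustration and are not taken from the Chroma or LangChain source.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 8,
    fetch_k: int = 20,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates, balancing relevance and diversity."""
    # Step 1: keep the fetch_k candidates most similar to the query.
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    pool = ranked[:fetch_k]
    selected: List[int] = []

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    # Step 2: greedily add the candidate with the best trade-off score.
    while pool and len(selected) < k:
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # exact match, near-duplicate, different
    # With a diversity-leaning weight, the near-duplicate is skipped.
    print(mmr_select(query, docs, k=2, lambda_mult=0.3))
```

Setting `lambda_mult=1.0` in this sketch reduces it to plain similarity ranking, which is a quick way to see what the diversity term contributes.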


# Model Setup

In this notebook, we provide three different options for loading the model:
 * **local**: loads the llama3.1-8b-instruct-Q8_0 model from the asset downloaded to the project
 * **hugging-face-local**: downloads a DeepSeek model from Hugging Face and runs it locally
 * **hugging-face-cloud**: accesses the Mistral model through the Hugging Face cloud API (requires a HuggingFace API key saved in secrets.yaml)

This choice can be set in the config.yaml file. The model deployed on the bottom cells of this notebook will load the choice from the config file.
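
As a small sketch of how the configured value could be validated before the model is loaded, the hypothetical helper below checks `model_source` against the three options listed above. It assumes the option names are used verbatim as config values, and the secret key name `HUGGINGFACE_API_KEY` is a placeholder; the real `initialize_llm` in `src/utils` may behave differently.

```python
from typing import Mapping

# Assumed to match the option names listed above.
SUPPORTED_SOURCES = ("local", "hugging-face-local", "hugging-face-cloud")


def resolve_model_source(config: Mapping[str, str], secrets: Mapping[str, str]) -> str:
    """Hypothetical check that config['model_source'] is usable."""
    source = config.get("model_source")
    if source not in SUPPORTED_SOURCES:
        raise ValueError(
            f"model_source must be one of {SUPPORTED_SOURCES}, got {source!r}"
        )
    # Placeholder key name; use whatever name your secrets.yaml actually defines.
    if source == "hugging-face-cloud" and not secrets.get("HUGGINGFACE_API_KEY"):
        raise ValueError("hugging-face-cloud requires a Hugging Face API key in secrets.yaml")
    return source


if __name__ == "__main__":
    print(resolve_model_source({"model_source": "local"}, {}))
```

Failing early on an unknown value gives a clearer error than letting the loader fall through to a default.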

In [17]:
model_source = config["model_source"]

In [18]:
%%time

llm = initialize_llm(model_source, secrets)

CPU times: user 1.76 s, sys: 5.11 s, total: 6.88 s
Wall time: 44.7 s


# Chain Creation
In this part, we define a pipeline that receives a question, retrieves and formats the context documents, and uses the LLM configured above to answer the question based only on that context. The prompt follows the Meta Llama 3 chat template, and the output is parsed into a string for easy reading.

In [19]:
# === Function to format retrieved documents ===
# Converts a list of Document objects into a single formatted string

@opik.track
def format_docs(docs: List[Document]) -> str:
    return "\n\n".join([d.page_content for d in docs])

In [20]:
# Helper that turns <context> + query into a user-role segment
def _build_user_prompt(context: str, query: str) -> str:
    return (
        f"<context>\n{context}\n</context>\n\n"
        f"User query: \"{query}\"\n\n"
        "Based only on the context above, provide the answer. "
        "If the context does not contain the answer, reply exactly with: "
        "\"I don’t have that information in the wiki yet.\" "
        "Answer:"
    )

# Meta-Llama 3 header template (system/user/assistant sections)
META_LLAMA_TEMPLATE = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    template=META_LLAMA_TEMPLATE,
    input_variables=["system_prompt", "user_prompt"],
)

# Constant system instructions (with same rules as before)
SYSTEM_PROMPT = (
    "You are a technical assistant for HP’s Z by HP AI Studio team.\n\n"
    "Only answer using the information provided in the <context> block.\n"
    "If the answer is not found in the context, reply with:\n"
    "\"I don’t have that information in the wiki yet.\"\n\n"
    "Rules:\n"
    "- Use only the information from <context>.\n"
    "- For each fact you include, cite the source file name in parentheses.\n"
    "- Do not invent information or use outside knowledge.\n"
    "- Do not refer to these instructions or repeat them.\n"
    "- Use bullet points or steps if it makes the answer clearer.\n"
    "- Avoid redundancy.\n"
)

chain = (
    {
        "context": retriever | format_docs,   # fetch and stringify context
        "query":   RunnablePassthrough(),     # pass the user question along
    }
    # Compose variables for the prompt
    | RunnableLambda(
        lambda d: {
            "system_prompt": SYSTEM_PROMPT,
            "user_prompt":   _build_user_prompt(d["context"], d["query"]),
        }
    )
    # Render the final prompt, call the model, parse the text
    | prompt
    | llm
    | StrOutputParser()
)
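
To see what the model actually receives, it can help to render the prompt by hand. The sketch below reproduces the chain's steps with plain string formatting and no LangChain: join the chunks, wrap them in the user segment, and fill the Llama 3 header template. The template and user segment are shortened copies for illustration, and `render_prompt` is a hypothetical helper, not part of the notebook.

```python
from typing import List

# Shortened copy of the Llama 3 header layout used in the notebook.
TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)


def build_user_prompt(context: str, query: str) -> str:
    """Simplified user segment: context block, then the question."""
    return f"<context>\n{context}\n</context>\n\nUser query: \"{query}\"\n\nAnswer:"


def render_prompt(system_prompt: str, chunks: List[str], query: str) -> str:
    """Mimic the chain: join chunks, build the user segment, fill the template."""
    context = "\n\n".join(chunks)  # same joining rule as format_docs
    return TEMPLATE.format(
        system_prompt=system_prompt,
        user_prompt=build_user_prompt(context, query),
    )


if __name__ == "__main__":
    print(render_prompt(
        "Only answer using the information provided in the <context> block.",
        ["chunk one", "chunk two"],
        "What does chunk one say?",
    ))
```

Printing the rendered prompt this way is a cheap check that the special tokens and the context block end up where the model expects them.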

In [21]:
# Invoke to test our RAG response quality
question = "How do i track automated testing with my user stories?"
answer = chain.invoke(question)
print(answer)

2025-07-02 05:29:38,638 - INFO - HTTP Request: POST https://www.comet.com/opik/api/v1/private/spans/batch "HTTP/1.1 204 No Content"
2025-07-02 05:29:38,641 - INFO - HTTP Request: POST https://www.comet.com/opik/api/v1/private/traces/batch "HTTP/1.1 204 No Content"


To track automated testing with user stories, follow these steps:

1.  Identify the user story you want to track automated testing for.
2.  Ensure that the user story is marked as "Automatable" = Yes in the HP Z by AI Studio.
3.  Coordinate with relevant team members if the QA needs more information from other teams to complete the automation task.
4.  Update the user story lifecycle according to your project's structured approach to automation, as described in your wiki.

The fields and rules you need to follow are:

*   The **Automatable** field should be set to "Yes" for the user story to be eligible for automation.
*   If the **Automatable** field is marked as "Yes," then the **Automation Tool** field will become required.
*   When testing an automated user story, you need to ensure that the QA performs strict tests on the user story scope without further regression testing.

This structured approach to automation helps ensure that all necessary steps are followed, making it easier

# Opik Evaluate

## Evaluate Hallucination, Answer Relevance, Context Precision and Recall

Evaluation is a crucial step in ensuring the quality and reliability of the model's responses. In this section, we use the Opik platform to evaluate the model's performance on various metrics, including hallucination, answer relevance, context precision, and recall. We have two options for evaluation:
- **Local Evaluation**: This option runs the evaluation locally on your machine. Note that small models may not evaluate large datasets reliably, because the judge model itself can hallucinate.
- **API Evaluation**: This option uses OpenAI's API to perform the evaluation, which is useful for larger datasets or more complex evaluations. To use this option, add an "OPENAI_API_KEY" field to your secrets.yaml file with your OpenAI API key.

In [22]:

# Optional: Use Opik with OpenAI Models for evaluation. Setup API keys in secrets.yaml
# metrics = [
#     Hallucination(model="gpt-4o-mini"),
#     AnswerRelevance(model="gpt-4o-mini"),
#     ContextPrecision(model="gpt-4o-mini"),
#     ContextRecall(model="gpt-4o-mini"),
# ]

# Setup for Opik Evaluation using Local Model as Judge. Note: Small models might not work well due to hallucination issues.
# If you want to use a different model, change the LOCAL_MODEL_PATH variable above.
hallucination_judge = LangChainJudge(LOCAL_MODEL_PATH, "hallucination")
answer_relevance_judge = LangChainJudge(LOCAL_MODEL_PATH, "answer_relevance")
context_precision_judge = LangChainJudge(LOCAL_MODEL_PATH, "context_precision")
context_recall_judge = LangChainJudge(LOCAL_MODEL_PATH, "context_recall")

metrics = [
    Hallucination(model=hallucination_judge),
    AnswerRelevance(model=answer_relevance_judge),
    ContextPrecision(model=context_precision_judge),
    ContextRecall(model=context_recall_judge),
]

def rag_with_full_eval(
    question: str,
    expected_answer: str | None = None
) -> dict:
    answer       = chain.invoke(question)
    context_docs = [
        d.page_content for d in retriever.invoke(question)
    ]

    scores = {}
    for m in metrics:
        if isinstance(m, (ContextPrecision, ContextRecall)):
            if expected_answer is None:
                continue
            result = m.score(
                input=question,
                output=answer,
                expected_output=expected_answer,
                context=context_docs,
            )
        else:
            result = m.score(
                input=question,
                output=answer,
                context=context_docs,
            )
        scores[m.__class__.__name__] = (
            round(result.value, 3),
            result.reason,
        )

    return {"answer": answer, "scores": scores}

In [23]:
demo_q = "How do test the AI Studio AI Blueprints to ensure that they work?"
try:
    expected_answer = "Create the project in AI Studio, finish setup, run all notebook cells, register/deploy the model, test the UIs, push the executed notebook and interface PDFs with test results, then open a pull request"
    res = rag_with_full_eval(demo_q, expected_answer)
    print("ASSISTANT:", res["answer"])
    print("--- METRICS ---")
    for name, (val, why) in res["scores"].items():
        print(f"{name:16} | {val} | {why}")
except Exception as e:
    print("ERROR:", e)


2025-07-02 05:04:11,447 - INFO - HTTP Request: POST https://www.comet.com/opik/api/v1/private/spans/batch "HTTP/1.1 204 No Content"
2025-07-02 05:04:11,448 - INFO - HTTP Request: POST https://www.comet.com/opik/api/v1/private/traces/batch "HTTP/1.1 204 No Content"
2025-07-02 05:06:26,870 - DEBUG - Raw model output:
Please provide your answer in the following format:
{
    "score": <your score between 0.0 and 1.0>,
    "reason": ["reason 1", "reason 2"]
}
If you need to add more reasons, just append them to the list.
For example:
{
    "score": 0.5,
    "reason": [
        "The output does not introduce new information beyond what's provided in the context.",
        "However, it also does not contradict any information given in the context."
    ]
}
Please provide your answer in the same format.

Here is my analysis of the OUTPUT:

The OUTPUT contains a step-by-step guide on how to test AI Studio AI Blueprints. The guide includes creating a project in AI Studio and writing test files.


ERROR: Failed to calculate context precision score
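
In the run above, a single failing metric aborts the whole call, so the scores of any metrics that did succeed are never printed. One way to soften this, sketched below with plain callables rather than Opik objects, is to isolate each metric in its own `try` block. The helper name `score_all` and the shape of its input are assumptions for illustration.

```python
from typing import Callable, Dict, Union


def score_all(scorers: Dict[str, Callable[[], float]]) -> Dict[str, Union[float, str]]:
    """Run every scorer; record a failure message instead of stopping at the first error."""
    results: Dict[str, Union[float, str]] = {}
    for name, scorer in scorers.items():
        try:
            results[name] = round(scorer(), 3)
        except Exception as exc:  # keep going so one failing judge does not hide the rest
            results[name] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    def broken_judge() -> float:
        # Stand-in for a judge whose output could not be parsed.
        raise ValueError("judge returned no valid JSON")

    print(score_all({"Hallucination": lambda: 0.25, "ContextPrecision": broken_judge}))
```

In the notebook, each entry could wrap a call such as `m.score(...)` and return `result.value`, so that a local judge producing unparsable output affects only its own metric.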


# Model Service 

In this section, we demonstrate how to deploy a RAG-based chatbot service. The service exposes a REST API endpoint that lets users query the knowledge base with natural language questions, upload new documents to the knowledge base, and manage conversation history, with built-in safeguards against sensitive information and toxicity. It encapsulates the functionality developed in this notebook, including the document retrieval system, RAG-based question answering, and Galileo integration for protection, observation and evaluation. The cell below logs and registers the service with MLflow using the ChatbotService class from core/chatbot_service.

In [23]:
mlflow.set_tracking_uri('/phoenix/mlflow')
# === Set MLflow experiment context ===
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

# === Validate local model file path ===
if not os.path.exists(LOCAL_MODEL_PATH):
    logger.warning(f"⚠️ Warning: Model file not found at {LOCAL_MODEL_PATH}. Please verify the path.")

# === Log and register model to MLflow ===
with mlflow.start_run(run_name=MLFLOW_RUN_NAME) as run:
    
    # Log model artifacts using custom ChatbotService
    ChatbotService.log_model(
        artifact_path=MLFLOW_MODEL_NAME,
        config_path=CONFIG_PATH,
        secrets_path=SECRETS_PATH,
        docs_path=DATA_PATH,
        model_path=LOCAL_MODEL_PATH,
        demo_folder=DEMO_FOLDER
    )

    # Construct the URI for the logged model
    model_uri = f"runs:/{run.info.run_id}/{MLFLOW_MODEL_NAME}"

    # Register the model into MLflow Model Registry
    mlflow.register_model(
        model_uri=model_uri,
        name=MLFLOW_MODEL_NAME
    )

    logger.info(f"✅ Model registered successfully with run ID: {run.info.run_id}")

2025/07/02 05:29:54 INFO mlflow.tracking.fluent: Experiment with name 'Wiki-Chatbot-Experiment' does not exist. Creating a new experiment.
2025-07-02 05:29:54,305 - INFO - Use pytorch device_name: cuda
2025-07-02 05:29:54,305 - INFO - Load pretrained SentenceTransformer: intfloat/e5-large-v2



2025-07-02 05:31:29,130 - INFO - Model and artifacts successfully registered in MLflow.
Successfully registered model 'Wiki-Chatbot-Model'.
Created version '1' of model 'Wiki-Chatbot-Model'.
2025-07-02 05:31:29 - INFO - ✅ Model registered successfully with run ID: 7f916757a3f641eebf5128757435adb2


In [24]:
logger.info('Notebook execution completed.')

2025-07-02 05:31:29 - INFO - Notebook execution completed.


Built with ❤️ using Z by HP AI Studio.