<h1 style="text-align: center; font-size: 50px;">Vanilla RAG Chatbot with LangChain</h1>

Retrieval-Augmented Generation (RAG) is an architectural approach that improves the effectiveness of large language model (LLM) applications by grounding them in customized data. In this example, we use LangChain, an orchestrator for language pipelines, to build an assistant that loads information from a document source (here, a local wiki export) and uses it to answer user questions. We'll also use Opik to trace and evaluate the LLM responses.

# Notebook Overview
- Imports
- Configurations
- Verify Assets
- Data Loading
- Creation of Chunks
- Retrieval
- Model Setup
- Chain Creation
- Opik Evaluate
- Model Service

# Imports

Our Local GenAI workspace image comes with many of the libraries needed for RAG pre-installed. In our case, we only need to add the connector for working with PDF documents.

In [1]:
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# === Standard Library Imports ===
from typing import List
from datetime import datetime
import warnings
from pathlib import Path
import os
import sys
import logging

# === MLflow integration ===
import mlflow

# Add the project root (one level up from the current working directory) to sys.path
# so that the `core` and `src` packages can be imported
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
# === Import ChatbotService from project core ===
from core.chatbot_service.chatbot_service import ChatbotService

# === Third-Party Imports ===
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.document import Document
from langchain.document_loaders import WebBaseLoader, JSONLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
import promptquality as pq
import torch


# === Project-Specific Imports (from src.utils) ===
from src.utils import (
    load_config_and_secrets,
    configure_proxy,
    initialize_llm,
    configure_hf_cache,
    setup_opik_environment,
)



# Configurations

In [3]:
warnings.filterwarnings("ignore")

In [4]:
# Create logger
logger = logging.getLogger("vanilla_rag_logger")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", 
                              datefmt="%Y-%m-%d %H:%M:%S") 

stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
logger.propagate = False

In [5]:
CONFIG_PATH = "../configs/config.yaml"
SECRETS_PATH = "../configs/secrets.yaml"
DATA_PATH = "../data"
MLFLOW_EXPERIMENT_NAME = "AIStudio-Chatbot-Experiment"
MLFLOW_RUN_NAME = "AIStudio-Chatbot-Run"
LOCAL_MODEL_PATH = "/home/jovyan/datafabric/llama3.1-8b-instruct/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"
DEMO_FOLDER = "../demo"
MLFLOW_MODEL_NAME = "AIStudio-Chatbot-Model"

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [7]:
logger.info('Notebook execution started.')

2025-06-25 22:15:25 - INFO - Notebook execution started.


## Configuration of HuggingFace caches

In the next cell, we configure the HuggingFace cache so that all downloaded models are persisted locally, even after the workspace is closed. Built-in support for this is a desired future feature for AI Studio and the GenAI addon.

In [8]:
# Configure HuggingFace cache
configure_hf_cache()
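
The `configure_hf_cache` helper lives in `src.utils` and is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it relies on the `HF_HOME` environment variable (the function name `set_hf_cache_dir` is illustrative and not the project's actual implementation):

```python
import os
from pathlib import Path


def set_hf_cache_dir(cache_dir: str) -> str:
    """Point Hugging Face libraries at a persistent cache directory.

    Hypothetical helper: the real configure_hf_cache() may work differently.
    """
    path = Path(cache_dir).expanduser()
    # Create the directory up front so downloads do not fail on a missing path
    path.mkdir(parents=True, exist_ok=True)
    # HF_HOME is the root directory that huggingface_hub uses for its caches
    os.environ["HF_HOME"] = str(path)
    return str(path)
```

It is safest to call a helper like this before importing any Hugging Face library, since the cache location is typically read at import time.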

In [9]:
# Initialize HuggingFace Embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    cache_folder="/tmp/hf_cache"
)

2025-06-25 22:15:25,518 - INFO - PyTorch version 2.7.0 available.
2025-06-25 22:15:25,584 - INFO - Use pytorch device_name: cuda
2025-06-25 22:15:25,585 - INFO - Load pretrained SentenceTransformer: intfloat/e5-large-v2


## Configuration and Secrets Loading

In this section, we load configuration parameters and API keys from separate YAML files. This separation helps maintain security by keeping sensitive information (API keys) separate from configuration settings.

- **config.yaml**: Contains non-sensitive configuration parameters like model sources and URLs
- **secrets.yaml**: Contains sensitive API keys for services like Galileo and HuggingFace

In [10]:
config, secrets = load_config_and_secrets(CONFIG_PATH, SECRETS_PATH)

## Proxy Configuration

To connect to the Galileo service, an SSH connection needs to be established. On some enterprise networks, this requires explicit proxy configuration. If that applies to you, set the "proxy" field in your config.yaml, and the following cell will configure the necessary environment variable.

In [11]:
configure_proxy(config)
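
The body of `configure_proxy` is defined in `src.utils`. A minimal sketch of the idea, assuming the helper reads the "proxy" field and exports it as `HTTPS_PROXY` (both the function name `apply_proxy` and the choice of variable are assumptions):

```python
import os
from typing import Any, Dict, Optional


def apply_proxy(config: Dict[str, Any]) -> Optional[str]:
    """Export the configured proxy, if any, as an environment variable.

    Hypothetical stand-in for configure_proxy(); the variable name is an assumption.
    """
    proxy = config.get("proxy")
    if not proxy:
        # No proxy configured: leave the environment untouched
        return None
    # Most Python HTTP clients honor HTTPS_PROXY for outbound HTTPS requests
    os.environ["HTTPS_PROXY"] = str(proxy)
    return str(proxy)
```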

# Verify Assets

In [12]:
def log_asset_status(asset_path: str, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")

log_asset_status(
    asset_path=CONFIG_PATH,
    asset_name="Config",
    success_message="",
    failure_message="Please check that config.yaml is properly configured in your project on AI Studio."
)

log_asset_status(
    asset_path=SECRETS_PATH,
    asset_name="Secrets",
    success_message="",
    failure_message="Please check that secrets.yaml is properly configured in your project on AI Studio."
)

log_asset_status(
    asset_path=LOCAL_MODEL_PATH,
    asset_name="Local Llama model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio if you want to use a local model."
)

2025-06-25 22:15:28 - INFO - Config is properly configured. 
2025-06-25 22:15:28 - INFO - Secrets is properly configured. 
2025-06-25 22:15:28 - INFO - Local Llama model is properly configured. 


# Data Loading

In this step, we use the LangChain framework to load the content of a local JSON file containing the wiki documentation. We also include commented-out examples showing how to use web loaders to load data from web pages.

In [13]:
# === Verify existence of the data directory ===
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"'data' folder not found at path: {os.path.abspath(DATA_PATH)}")

# === Wiki JSON with JSONLoader ===
wiki_loader = JSONLoader(
    file_path=os.path.join(DATA_PATH, "wiki_flat_structure.json"),
    jq_schema="to_entries[] | {source: .key, text: .value.content}",  # adapt to your schema
    text_content=False  # each record is a JSON object rather than a plain string, so it is serialized to text
)

docs = wiki_loader.load()
# === Optional: Load additional web-based documents ===
# To use a different knowledge base, just change the URLs below

# loader1 = WebBaseLoader("https://www.hp.com/us-en/workstations/ai-studio.html")
# data1 = loader1.load()

# loader2 = WebBaseLoader("https://zdocs.datascience.hp.com/docs/aistudio")
# data2 = loader2.load()

# Creation of Chunks
Here, we split the loaded documents into chunks, so we have smaller and more specific texts to add to our vector database.

In [14]:
# === Initialize text splitter ===
# - chunk_size: Maximum number of characters per text chunk.
# - chunk_overlap: Number of overlapping characters between chunks.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600, chunk_overlap=100,
    separators=["\n## ", "\n# ", "\n\n", ".", "!", "?"]
)

# text_splitter = SemanticChunker(
#     embeddings=embeddings,
# )

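# ─── Split the loaded documents ─────────────────────────────────────────────────
splits: List[Document] = text_splitter.split_documents(docs)

def log_stage(name: str, docs: List[Document]) -> None:
    """Log the number of documents and their average length for a pipeline stage."""
    # Note: "avg_tokens" is measured in characters (len of page_content), not model tokens
    avg_len = sum(len(d.page_content) for d in docs) / max(len(docs), 1)
    logger.info(f"{name}: {len(docs)} docs, avg_tokens={avg_len:.0f}")

# Log statistics for the chunks produced above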
log_stage("Chunks created", splits)

2025-06-25 22:15:28 - INFO - Chunks created: 3633 docs, avg_tokens=499
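
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at the configured separators); the function name `chunk_with_overlap` is illustrative only.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea as the 100-character overlap used above: context at a chunk boundary is not lost to either neighbor.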


# Retrieval

We transform the chunks into embeddings and store them in a vector database. This lets us run similarity searches and retrieve the documents most relevant to a query.

In [15]:
%%time

# === Create a vector database from document chunks ===
vectordb = Chroma.from_documents(documents=splits, embedding=embeddings)

# === Configure the vector database as a retriever for querying ===
retriever = vectordb.as_retriever(
    search_type="mmr",
    search_kwargs={"fetch_k": 20, "k": 8}
)

2025-06-25 22:15:28,776 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


CPU times: user 49.6 s, sys: 1.87 s, total: 51.5 s
Wall time: 48.1 s
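
The retriever above uses `search_type="mmr"` (maximal marginal relevance): it first fetches `fetch_k` candidates by similarity and then keeps `k` of them, trading relevance against redundancy. The following pure-Python sketch illustrates that selection step on toy vectors. It is an illustration of the general technique, not Chroma's or LangChain's actual code, and the names `cosine` and `mmr_select` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query: Sequence[float], candidates: List[Sequence[float]],
               k: int, lambda_mult: float = 0.5) -> List[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different
toy_query = [1.0, 0.0]
toy_candidates = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]
print(mmr_select(toy_query, toy_candidates, k=2, lambda_mult=0.5))
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking.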


# Model Setup

In this notebook, we provide three different options for loading the model:
 * **local**: loads the llama3.1-8b-instruct-Q8_0 model from the asset downloaded to the project
 * **hugging-face-local**: downloads a DeepSeek model from Hugging Face and runs it locally
 * **hugging-face-cloud**: accesses the Mistral model through the Hugging Face cloud API (requires a HuggingFace API key saved in secrets.yaml)

This choice is set in the config.yaml file. The model deployed in the final cells of this notebook also reads its choice from the config file.
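
Because the model source is a free-form string in config.yaml, it can help to validate it before loading a model. A minimal sketch, assuming the config values match the three option names listed above (the helper `read_model_source` is hypothetical and not part of `src.utils`):

```python
from typing import Any, Dict

# The three sources described above
VALID_MODEL_SOURCES = ("local", "hugging-face-local", "hugging-face-cloud")


def read_model_source(config: Dict[str, Any]) -> str:
    """Return the configured model source, failing early on unknown values."""
    source = config.get("model_source")
    if source not in VALID_MODEL_SOURCES:
        raise ValueError(
            f"Unsupported model_source {source!r}; expected one of {VALID_MODEL_SOURCES}"
        )
    return source
```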

In [16]:
model_source = config["model_source"]

In [17]:
%%time

llm = initialize_llm(model_source, secrets)

CPU times: user 1.38 s, sys: 5.24 s, total: 6.63 s
Wall time: 38.3 s


# Chain Creation
In this part, we define a pipeline that receives a question, retrieves and formats the context documents, and uses the LLM selected above to answer the question based on that context. The output is then parsed into a string for easy reading.

In [18]:
# === Function to format retrieved documents ===
# Converts a list of Document objects into a single formatted string
def format_docs(docs: List[Document]) -> str:
    return "\n\n".join([d.page_content for d in docs])

In [19]:
# # ─── New imports ────────────────────────────────────────────────────────────────
# from langchain import hub
# from langchain.chains import create_retrieval_chain
# from langchain.chains.combine_documents import create_stuff_documents_chain

# retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

# # Build a combine‐docs chain (chunks → single context block)
# combine_docs_chain = create_stuff_documents_chain(
#     llm,
#     retrieval_qa_chat_prompt,
# )

# # Build the full RAG chain
# rag_chain = create_retrieval_chain(
#     retriever,           # your Chroma retriever
#     combine_docs_chain,  # the chain that stuffs docs into the prompt
# )

# # Invoke it
# question = "How do i test my ai blueprints?"
# result = rag_chain.invoke({"input": question})

# print("Answer:", result["answer"])



In [20]:
# === Define chatbot prompt template ===
# Ensures that responses are strictly related to "Z by HP AI Studio"
template = """
You are a technical assistant for HP’s Z by HP AI Studio team.

Only answer using the information provided in the <context> block.  
If the answer is not found in the context, reply with:  
**"I don’t have that information in the wiki yet."**

Rules:
- Use only the information from <context>.
- For each fact you include, cite the source file name in parentheses.
- Do not invent information or use outside knowledge.
- Do not repeat or refer to these instructions.
- Use bullet points or steps if it makes the answer clearer.
- Avoid repeating the same sentence or phrase. Be concise and avoid redundancy.

<context>
{context}
</context>

**Question:** {query}  
**Answer:**
"""

prompt = ChatPromptTemplate.from_template(template)

# === Create an LLM-powered retrieval chain ===
# - The retriever fetches relevant documents.
# - The documents are formatted using format_docs().
# - The query is passed directly using RunnablePassthrough().
# - The formatted context and query are injected into the prompt.
# - The LLM processes the prompt and the response is parsed into a string.
chain = {
    "context": retriever | format_docs,
    "query": RunnablePassthrough()
} | prompt | llm | StrOutputParser()
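
For readers new to the `|` composition syntax, the data flow of the chain can be sketched in plain Python. The retriever and LLM below are stand-in functions (hypothetical, for illustration only), and `answer_question` is not part of LangChain:

```python
from typing import Callable, List


def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    template: str,
) -> str:
    """Mimic the chain: retrieve -> join chunks -> fill the prompt -> generate."""
    chunks = retrieve(question)
    context = "\n\n".join(chunks)  # same join as format_docs()
    prompt_text = template.format(context=context, query=question)
    return str(generate(prompt_text))


# Stand-ins for the real retriever and LLM (hypothetical, for illustration only)
def fake_retrieve(question: str) -> List[str]:
    return ["Chunk A", "Chunk B"]


def fake_generate(prompt_text: str) -> str:
    return f"prompt had {len(prompt_text)} characters"


demo_template = "<context>\n{context}\n</context>\nQuestion: {query}\nAnswer:"
print(answer_question("What is AI Studio?", fake_retrieve, fake_generate, demo_template))
```

The real chain does the same thing, with the added benefit that each step is a runnable that supports tracing callbacks, streaming, and batching.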

In [21]:
# One-off query (synchronous)
question = "How do i test my ai blueprints?"
answer = chain.invoke(question)  # LCEL chains are run with .invoke(); they are not directly callable
print(answer)

To test your AI blueprints, you will need to follow these steps:

### 1. Create a Project in AI Studio

*   If the blueprint is published, create a new project in AI Studio using the blueprint directly

### 2. Add Your Repo and Data to Your AI Studio Project

*   Add the repo that contains your model
*   Upload the data required for training or testing the model
*   Note: You will need to follow the instructions provided in the **[Usage](https://hp-my.sharepoint.com/:u:/p/davi_aragao/EffcPInhp9hJhf_tYfsUZxsB9LwIqbJIZrnR0yAeMSuajQ?e=uQAOXg)**] section for more details on how to perform the above steps.

### 3. Run the Notebook

*   Open the Jupyter notebook associated with the blueprint.
*   Execute all cells by clicking **"Run All"**, as described in **Step 1 of the Usage section**.

### 4. Register the Model and Deploy Locally (if applicable)

*   If the blueprint includes MLflow integration:
    *   Follow the next usage step to **register the model in MLflow** and deploy it successf

# Opik Evaluate

## Configure

Configure Opik to use either a local deployment or the cloud API. For a local deployment, call `opik.configure(use_local=True)`.

In [22]:
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination, ContextPrecision
from opik import Opik

setup_opik_environment(secrets)

opik.configure()

OPIK: Existing Opik clients will not use updated values for "url", "api_key", "workspace".
OPIK: Opik is already configured. You can check the settings by viewing the config file at /home/jovyan/.opik.config


In [23]:

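from opik.integrations.langchain import OpikTracer
# Trace a sample query with Opik; LangChain callbacks are passed through the `config` argument
traced_answer = chain.invoke("Hello, how are you?", config={"callbacks": [OpikTracer(tags=["llama_cpp"])]})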


In [24]:
from src.local_judge import LangChainJudge
from opik.evaluation.metrics import Hallucination

judge = LangChainJudge(LOCAL_MODEL_PATH)

local_metric = Hallucination(model=judge)

def rag_with_eval(question: str) -> str:
    """Answer a question with the RAG chain and print a local hallucination score."""
    answer = chain.invoke(question)
    # Retrieve the chunks again so the judge can compare the answer against them
    refs = [d.page_content for d in retriever.invoke(question)]

    score = local_metric.score(input=question,
                               output=answer,
                               context=refs)
    # Guard against a missing result: reading .value on None would raise an AttributeError
    if score is None:
        logger.warning("Hallucination metric returned no result.")
    else:
        print("Hallucination:", score.value, "|", score.reason)
    return answer

In [25]:
from opik.integrations.langchain import OpikTracer

tracer = OpikTracer(tags=["llama_cpp"])


In [26]:
question = "How do I test my AI Blueprints?"
response = rag_with_eval(question)
print("Assistant:", response)


OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=0197a92b-35ea-7319-b071-d2cca45c12af&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.

# Model Service 

In this section, we deploy the RAG-based chatbot as a service with integrated Galileo Protect and Observe capabilities. The service exposes a REST API endpoint that lets users query the knowledge base with natural language questions, upload new documents to the knowledge base, and manage conversation history, all with built-in safeguards against sensitive information and toxicity. It encapsulates the functionality developed in this notebook, including document retrieval and RAG-based question answering, as well as the Galileo integration for protection, observation, and evaluation. The service is implemented by the `ChatbotService` class imported from `core/chatbot_service`.

In [None]:
mlflow.set_tracking_uri('/phoenix/mlflow')
# === Set MLflow experiment context ===
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

# === Validate local model file path ===
if not os.path.exists(LOCAL_MODEL_PATH):
    logger.warning(f"⚠️ Model file not found at {LOCAL_MODEL_PATH}. Please verify the path.")

# === Log and register model to MLflow ===
with mlflow.start_run(run_name=MLFLOW_RUN_NAME) as run:
    
    # Log model artifacts using custom ChatbotService
    ChatbotService.log_model(
        artifact_path=MLFLOW_MODEL_NAME,
        config_path=CONFIG_PATH,
        secrets_path=SECRETS_PATH,
        docs_path=DATA_PATH,
        model_path=LOCAL_MODEL_PATH,
        demo_folder=DEMO_FOLDER
    )

    # Construct the URI for the logged model
    model_uri = f"runs:/{run.info.run_id}/{MLFLOW_MODEL_NAME}"

    # Register the model into MLflow Model Registry
    mlflow.register_model(
        model_uri=model_uri,
        name=MLFLOW_MODEL_NAME
    )

    logger.info(f"✅ Model registered successfully with run ID: {run.info.run_id}")
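
Once the registered model is deployed, it can be queried over HTTP. The sketch below only builds the request, using the `/invocations` route and `inputs` key of MLflow's scoring server. The base URL and the `query` field name are assumptions, so check the deployed model's signature for the actual input schema.

```python
import json
import urllib.request


def build_invocation_request(
    question: str, base_url: str = "http://localhost:5000"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for an MLflow scoring endpoint.

    The URL and the "query" field are assumptions; check the model signature first.
    """
    payload = {"inputs": {"query": [question]}}
    return urllib.request.Request(
        url=f"{base_url}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invocation_request("How do I test my AI Blueprints?")
print(request.full_url)
# Sending it would look like: urllib.request.urlopen(request).read()
```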

In [None]:
logger.info('Notebook execution completed.')

Built with ❤️ using Z by HP AI Studio.