# RAG Orchestration and Testing for Internal Chatbot

### ***Reference code file: Kaggle - RAGLab: An automated evaluation tool for configuring RAG pipelines***
Retrieval-Augmented Generation (RAG) has emerged as a powerful way for AI chatbots to leverage external data, giving them the ability to produce more accurate, context-rich responses. Instead of relying solely on a large language model’s internal knowledge, RAG applications retrieve domain-specific documents—often stored in a vector database—and feed them to the model as additional context. This approach is particularly helpful when building domain-focused chatbots in areas like law, medicine, or finance, where ensuring that responses reflect up-to-date, specialized information is essential.


The modular RAG process has six major elements, listed below. Of these, Pre-retrieval, Retrieval, and Generation constitute the RAG pipeline itself (assuming a simple linear pipeline):

- Indexing: Splits, structures, and transforms raw documents to enable efficient retrieval.
    
- Orchestration: Directs how each module interacts, controlling branching, scheduling, and flow.
    
- Pre-retrieval: Refines or expands the user’s query before any retrieval is performed.
    
- Retrieval: Searches the indexed data to find relevant chunks or documents.
        
- Generation: Produces the final response by leveraging a language model, using the query plus the refined context.

- Judging: Grades the output of the RAG pipeline against a supplied gold answer.
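
To make the linear flow concrete, here is a minimal sketch of how the pre-retrieval, retrieval, and generation stages might be chained by an orchestration function. All function names and the word-overlap ranking are illustrative assumptions; the notebook's real stages call an LLM and a vector database instead.

```python
from typing import List

# Hypothetical stand-ins for the real stages; each is a plain function.
def pre_retrieve(query: str) -> str:
    """Pre-retrieval: tidy the query (a real stage might call an LLM)."""
    return query.strip().rstrip("?") + "?"

def retrieve(query: str, index: List[str], k: int = 2) -> List[str]:
    """Retrieval: rank indexed chunks by naive word overlap with the query."""
    words = set(query.lower().rstrip("?").split())
    ranked = sorted(index, key=lambda chunk: -len(words & set(chunk.lower().split())))
    return ranked[:k]

def generate(query: str, context: List[str]) -> str:
    """Generation: a real stage would prompt an LLM with query plus context."""
    return f"Q: {query} | context chunks used: {len(context)}"

def run_pipeline(query: str, index: List[str]) -> str:
    """Orchestration: a simple linear flow through the three stages."""
    refined = pre_retrieve(query)
    context = retrieve(refined, index)
    return generate(refined, context)

index = ["An NDA protects confidential information.", "Leases cover property rental."]
answer = run_pipeline("  what is an NDA  ", index)
print(answer)
```

Keeping each stage behind a plain function boundary is what lets a test harness swap one stage at a time while holding the others fixed.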

### Function Definitions and Orchestration Setup

In [1]:
# Chromadb, for creating knowledge base and generating embeddings
!pip install chromadb

# Google genai, for LLM access
!pip install google-api-core
!pip install -U -q "google-genai==1.7.0"

# IPython
!pip install ipython


In [None]:
# type: ignore

import json  # Used in the judging process

# Chromadb, for creating knowledge base and generating embeddings
import chromadb

# Google genai, for LLM access
from google import genai
from google.genai import types

from IPython.display import HTML, Markdown, display

from google.api_core import retry

# Retry on 429 or 503 errors
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

# Patch the method with retry logic
genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)

from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

# For orchestration function
import sys  # Standard library for system-specific parameters and functions
import uuid # Standard library for generating unique identifiers
import time # Standard library for working with time and dates
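
The retry patch above wraps `generate_content` so that rate-limit (429) and service-unavailable (503) errors are retried rather than raised. The general idea can be sketched with the standard library alone; `FakeAPIError`, `call_with_retry`, and the backoff parameters below are hypothetical illustrations, not the behavior of `google.api_core.retry`.

```python
import time
from typing import Callable

class FakeAPIError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""
    def __init__(self, code: int):
        super().__init__(f"API error {code}")
        self.code = code

def is_retriable(exc: Exception) -> bool:
    # Same shape as the predicate above: retry only on 429 or 503.
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}

def call_with_retry(fn: Callable[[], str], max_attempts: int = 4,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fn, retrying retriable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retriable errors or when attempts are exhausted.
            if not is_retriable(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # delay doubles on each retry
    raise RuntimeError("max_attempts must be at least 1")

# Usage: a flaky call that fails twice with 429, then succeeds.
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FakeAPIError(429)
    return "ok"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting for real delays.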

In [3]:
# System prompt, supplies basic context for the chatbot
def get_system_prompt():
    return "You are a helpful information bot embedded in the internal site for a legal firm. Your role is to provide accurate answers to visitors' questions about the chapter, its members, and its events. As a Retrieval-Augmented Generation (RAG) chatbot, you will receive additional context with each user query to ensure relevant and detailed responses."

In [18]:
# Functions applied before the knowledge retrieval process

# Original verbose prompt template
QUERY_REWRITE_PROMPT_1 = """Here is a user's prompt, could you please improve it, if required,
    by a) correcting spelling or grammar if needed, and b) matching it better to the subject domain. For example,
    if they ask 'What is a non-disclosure agreement? ', return 'A non-disclosure agreement (NDA) is a legal document that is used to protect confidential information shared between two or more parties. It is a legally binding agreement that outlines the terms and conditions under which the information may be shared, used, and disclosed. The purpose of an NDA is to ensure that the confidential information is not disclosed to unauthorized individuals or used for unauthorized purposes.'  **Important: ** your answer
    will be given directly to an automated program, so **only** return the text of the improved prompt,
    nothing else.\n\nOK, now here is their prompt:\n\n{user_query}"""

# New concise prompt template
QUERY_REWRITE_PROMPT_2 = """Improve this user query for our knowledge base by:
1. Correcting spelling and grammar
2. Expanding domain-specific context
3. Maintaining the original intent

Query: {user_query}

Return only the improved query text, no additional explanation."""

# Template for generating hypothetical documents
HYDE_PROMPT_1 = """Generate a detailed, factual passage that would perfectly answer this question.
Write in a clear, professional style similar to documentation or knowledge base articles.
Focus on specific, relevant details while maintaining accuracy.

Question: {user_query}

Write a passage that directly answers this question. Include only factual information, no disclaimers or meta-commentary."""


def query_rewrite_1(chat_model_id: str, user_prompt_string: str) -> str:
    """
    Original prompt adjustment function.
    Uses the more verbose original prompt template.

    Args:
        chat_model_id (str): ID of the chat model to use
        user_prompt_string (str): Original user prompt

    Returns:
        str: Improved prompt string
    """

    client = genai.Client(api_key=GOOGLE_API_KEY)

    system_prompt = get_system_prompt()

    user_prompt_string = system_prompt + "\n\n" + QUERY_REWRITE_PROMPT_1.format(user_query=user_prompt_string)

    adjusted_prompt = client.models.generate_content(
        model=chat_model_id,
        contents = user_prompt_string)

    return adjusted_prompt.text


def query_rewrite_2(chat_model_id: str, user_prompt_string: str) -> str:
    """
    New prompt adjustment function using the more concise prompt template.

    Args:
        chat_model_id (str): ID of the chat model to use
        user_prompt_string (str): Original user prompt

    Returns:
        str: Improved prompt string
    """

    client = genai.Client(api_key=GOOGLE_API_KEY)

    system_prompt = get_system_prompt()

    user_prompt_string = system_prompt + "\n\n" + QUERY_REWRITE_PROMPT_2.format(user_query=user_prompt_string)

    adjusted_prompt = client.models.generate_content(
        model=chat_model_id,
        contents = user_prompt_string)

    return adjusted_prompt.text


def HyDE_1(chat_model_id: str, user_prompt_string: str) -> str:
    """
    Hypothetical Document Embedding (HyDE) function that generates a synthetic document
    based on the user's query. This document can then be embedded and used for retrieval.

    Args:
        chat_model_id (str): ID of the chat model to use
        user_prompt_string (str): Original user query

    Returns:
        str: Generated hypothetical document
    """
    client = genai.Client(api_key=GOOGLE_API_KEY)

    system_prompt = get_system_prompt()

    prompt_string = system_prompt + "\n\n" + HYDE_PROMPT_1.format(user_query=user_prompt_string)

    adjusted_prompt = client.models.generate_content(
        model=chat_model_id,
        contents = prompt_string)

    return adjusted_prompt.text
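
The three pre-retrieval functions above differ only in the template they fill. As a hedged sketch of that shared pattern, the helper below selects a template by name and delegates to any callable that maps a prompt to text. The template strings, `run_pre_retrieval`, and the echoing stub LLM are assumptions for illustration, not part of the notebook.

```python
from typing import Callable

# Hypothetical templates standing in for QUERY_REWRITE_PROMPT_2 and HYDE_PROMPT_1.
TEMPLATES = {
    "query_rewrite": "Improve this query: {user_query}",
    "hyde": "Write a passage that answers: {user_query}",
}

def build_pre_retrieval_prompt(system_prompt: str, template: str, user_query: str) -> str:
    """Prefix the system prompt and fill the template with the user's query."""
    return system_prompt + "\n\n" + template.format(user_query=user_query)

def run_pre_retrieval(strategy: str, user_query: str,
                      llm: Callable[[str], str],
                      system_prompt: str = "You are a helpful bot.") -> str:
    """Look up the template by name and send the assembled prompt to the LLM."""
    if strategy not in TEMPLATES:
        raise ValueError(f"Unknown pre-retrieval strategy: {strategy}")
    prompt = build_pre_retrieval_prompt(system_prompt, TEMPLATES[strategy], user_query)
    return llm(prompt)

# A stub LLM that echoes its prompt, so the sketch runs without an API key.
def echo_llm(prompt: str) -> str:
    return prompt

out = run_pre_retrieval("hyde", "What is an NDA?", echo_llm)
print(out)
```

With HyDE, the returned text is a hypothetical answer passage rather than a rewritten question; it is that passage which gets embedded for retrieval.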

In [19]:
# Define a function to retrieve one or more knowledge base items that match the prompt string
# Returns Chroma's query result: a dict whose "documents" entry holds one list of texts per query

def getknowledge(prompt_string: str, num_results: int, model_id: str) -> dict:
    # model_id is accepted for interface consistency but is not used here:
    # Chroma embeds the query with the collection's own embedding function
    results = collection.query(
        query_texts=[prompt_string],  # Chroma will embed this for you
        n_results=num_results  # how many results to return
    )

    return results
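
For readers unfamiliar with the shape of a Chroma query result, the sketch below uses a hand-built dictionary as a stand-in. It assumes that each field holds one inner list per query text, which is why the orchestration code iterates over `results["documents"]` twice. The sample values and helper names are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hand-built stand-in for a query result with one query text and k=2.
# Assumption: each field holds one inner list per query text.
sample_results: Dict[str, List[list]] = {
    "ids": [["id7", "id42"]],
    "documents": [["First retrieved chunk.", "Second retrieved chunk."]],
    "distances": [[0.31, 0.58]],
}

def flatten_documents(results: Dict[str, List[list]]) -> str:
    """Join every retrieved document into one context string."""
    return "".join(doc + "\n\n" for group in results["documents"] for doc in group)

def ranked_pairs(results: Dict[str, List[list]], query_index: int = 0) -> List[Tuple[str, float]]:
    """Pair each document with its distance for one query."""
    return list(zip(results["documents"][query_index],
                    results["distances"][query_index]))

context = flatten_documents(sample_results)
pairs = ranked_pairs(sample_results)
```

Pairing documents with distances would make it possible to drop weak matches instead of always passing a fixed k items to the generator.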

In [20]:
# Template for generation phase

GENERATION_PROMPT_TEMPLATE = """{system_prompt}

Please answer this question:

'{query_prompt}'

Using these blocks of context text, presented in order of decreasing relevance.
This RAG program is using a fixed value of k={num_results}, so be aware that
later context may not be truly relevant.

**Important:** Your answer will display in a text cell in a Markdown table, so format
the output as follows:
 1) Do not use Markdown or HTML formatting, output plain text only.
 2) Space is limited—be succinct and keep answers to 75 words or less.

{prompt_context}"""

In [21]:
# Query the llm for an answer to the user's question
def query_llm(chat_model_id, prompt_context, num_results, query_prompt):
    client = genai.Client(api_key=GOOGLE_API_KEY)

    system_prompt = get_system_prompt()

    prompt_string = GENERATION_PROMPT_TEMPLATE.format(
        system_prompt=system_prompt,       # e.g. "You are an AI assistant…"
        query_prompt=query_prompt,         # the user’s question
        num_results=num_results,           # integer k used in retrieval
        prompt_context=prompt_context      # concatenated context blocks
    )
    # print(prompt_string)

    answer = client.models.generate_content(
        model=chat_model_id,
        contents = prompt_string)

    generated_answer = answer.text

    return generated_answer

In [22]:
# Template for the LLM‑as‑judge prompt
EVAL_PROMPT_TEMPLATE = """You are an impartial expert grader for retrieval‑augmented chatbots.
Your job is to evaluate the chatbot’s answer against the gold answer
and the judging instructions, then return a JSON object with four fields.

### Inputs
• Query: {query}

• Gold answer (reference): {gold_answer}

• Chatbot answer to grade: {chatbot_answer}

• Judging instructions (override anything else if conflicts arise):
  {judging_instructions}

### Rubric — score each dimension 1‑5
1 = very poor, 3 = acceptable, 5 = excellent.

1. Relevance – Does the chatbot answer address the user’s query?
2. Faithfulness – Is every factual statement supported by the gold answer
   (or clearly marked as outside scope)? Penalize hallucinations.
3. Completeness – Does the answer include all key elements required by the
   gold answer **and** satisfy the judging instructions (if provided)?

### Confidence
After scoring, give a **confidence** value 0.0‑1.0
(1.0 = absolutely certain your scores are correct; 0.0 = pure guess).

### Output — return **only** valid JSON in this exact schema, with NO markdown fences or extra text
{{
  "relevance": <integer 1‑5>,
  "faithfulness": <integer 1‑5>,
  "completeness": <integer 1‑5>,
  "confidence": <float between 0.0 and 1.0>
}}
No additional keys, text, or explanations.
Think silently before answering, but output only the JSON, with NO markdown fences or extra text."""

In [23]:
# Query the llm for its judgment of the answer to the user's question
def judge_answer(chat_model_id, question, gold_answer, answer, instructions):
    client = genai.Client(api_key=GOOGLE_API_KEY)

    prompt_string = EVAL_PROMPT_TEMPLATE.format(
        query=question,
        gold_answer=gold_answer,
        chatbot_answer=answer,
        judging_instructions=instructions
    )
    # print(prompt_string)

    judgment = client.models.generate_content(
        model=chat_model_id,
        contents = prompt_string)

    judgment_answer = judgment.text

    return judgment_answer
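
Because the judge returns free text, its reply may still arrive wrapped in Markdown fences or with stray commentary despite the prompt's instructions. The following is a minimal sketch of a tolerant parser, assuming the four-field schema from the template above; `parse_judgment` is a hypothetical helper, not a function defined in the notebook.

```python
import json
from typing import Dict, Optional

EXPECTED_KEYS = ("relevance", "faithfulness", "completeness", "confidence")

def parse_judgment(raw: str) -> Optional[Dict[str, float]]:
    """Extract the JSON object from a judge reply; return None if it is unusable."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object in the reply
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None  # braces found, but the content is not valid JSON
    if not isinstance(data, dict) or any(key not in data for key in EXPECTED_KEYS):
        return None  # valid JSON, but not the expected schema
    return {key: data[key] for key in EXPECTED_KEYS}

# Simulate a reply that ignores the "no markdown fences" instruction.
fence = "`" * 3
reply = (fence + 'json\n{"relevance": 5, "faithfulness": 4, '
         '"completeness": 3, "confidence": 0.9}\n' + fence)
scores = parse_judgment(reply)
```

Returning `None` rather than raising lets the caller decide whether to retry the judge or hand the answer to a human grader.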

In [24]:
# Handle cases where the human must do the judging

# Suppose we are in a cell where we've determined
# there's no gold answer or the LLM’s self-reported
# confidence is below 0.2. We'll ask the human to judge.

def human_judgment_flow(
    gold_answer: str,
    chatbot_answer: str
):
    # Print context for the human
    if gold_answer:
        print("=== Gold (Reference) Answer ===")
        print(gold_answer)
    else:
        print("No gold answer was provided.")

    print("\n=== Chatbot Answer ===")
    print(chatbot_answer)

    print("\n=== Judging Instructions ===")
    print(
        "Please provide scores for:\n"
        "1) Relevance      (1-5): Does the answer address the user’s query?\n"
        "2) Faithfulness   (1-5): Is every factual statement supported by context?\n"
        "3) Completeness   (1-5): Does the answer cover all key elements requested?\n"
        "Finally, provide a confidence in your scoring (0.0 to 1.0)."
    )
    print("\n(Use whole numbers for the first three, e.g. 3, 4, 5, and a decimal for confidence.)\n")

    # Collect scores one-by-one
    relevance_str = input("Relevance (1-5): ").strip()
    faithfulness_str = input("Faithfulness (1-5): ").strip()
    completeness_str = input("Completeness (1-5): ").strip()
    conf_str = input("Confidence in your scoring (0.0-1.0): ").strip()

    # Convert to proper types
    try:
        relevance = int(relevance_str)
    except ValueError:
        relevance = None
    try:
        faithfulness = int(faithfulness_str)
    except ValueError:
        faithfulness = None
    try:
        completeness = int(completeness_str)
    except ValueError:
        completeness = None
    try:
        confidence = float(conf_str)
    except ValueError:
        confidence = None

    # Optional: Add quick checks on valid ranges
    if relevance not in [1,2,3,4,5]:
        print("Warning: Relevance out of range. Setting to None.")
        relevance = None
    if faithfulness not in [1,2,3,4,5]:
        print("Warning: Faithfulness out of range. Setting to None.")
        faithfulness = None
    if completeness not in [1,2,3,4,5]:
        print("Warning: Completeness out of range. Setting to None.")
        completeness = None
    if confidence is not None and not (0.0 <= confidence <= 1.0):
        print("Warning: Confidence out of range. Setting to None.")
        confidence = None

    # Return in the same structure we used
    result = {
        "relevance": relevance,
        "faithfulness": faithfulness,
        "completeness": completeness,
        "confidence": confidence
        }
    return json.dumps(result)
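
The conversion and range checks in `human_judgment_flow` repeat the same pattern for each score. As a sketch of how that pattern could be factored out, the two helpers below return `None` for anything that is not a valid score; the helper names are assumptions, and the ranges mirror the 1-5 and 0.0-1.0 limits used above.

```python
from typing import Optional

def parse_int_score(text: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the score as an int, or None if it is not a whole number in range."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if low <= value <= high else None

def parse_confidence(text: str) -> Optional[float]:
    """Return the confidence as a float in [0.0, 1.0], or None otherwise."""
    try:
        value = float(text.strip())
    except ValueError:
        return None
    return value if 0.0 <= value <= 1.0 else None

# Example inputs a human grader might type.
print(parse_int_score(" 4 "), parse_int_score("7"), parse_confidence("0.85"))
```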

In [25]:
# Basic orchestration process to carry out the RAG chatbot response to a user query and log the results

# This function executes the chatbot code. It performs these functions:
#   • Receives the user’s questions from the web services
#   • Processes them as needed
#   • Looks up relevant context in the knowledge base
#   • Queries the configured Gemini chat model to get an answer to the user’s question
#   • Hands that answer back to the web services for display to the user

def orchestration(config, test_results, debug=False):
    # Generate a unique Run ID and timestamp
    unique_run_id = str(uuid.uuid4())
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp

    # ----- Setup section -----

    # Read the original prompt from the configuration
    original_prompt_string = config["question"]["query"]


    # ----- Pre-retrieval section -----

    #  Set the preretrieval function and chat model id
    pre_retrieval_function = config ["pre_retrieval"]["function"]
    pre_retrieval_chat_model_id = config ["pre_retrieval"]["model"]

    # Adjust the prompt string as needed to make it more compatible with the RAG database
    match pre_retrieval_function:
        case 'query_rewrite_1':
            final_prompt_string = query_rewrite_1(pre_retrieval_chat_model_id, original_prompt_string)
        case 'query_rewrite_2':
            final_prompt_string = query_rewrite_2(pre_retrieval_chat_model_id, original_prompt_string)
        case 'HyDE_1':
            final_prompt_string = HyDE_1(pre_retrieval_chat_model_id, original_prompt_string)
        case _:
            # Fail fast: continuing would hit an undefined final_prompt_string below
            raise ValueError(f"Pre-retrieval function not found: {pre_retrieval_function}")

    if debug:
        print(final_prompt_string)


    # ----- Retrieval section -----

    # Fetch context items that match the prompt string
    #  Set the embedding model id
    retrieval_function = config ["retrieval"]["function"]
    retrieval_embedding_model_id = config ["retrieval"]["model"]

    # Set k, the number of results, to return a constant k items of context from the knowledge base
    num_results = config ["retrieval"]["num_results"]
    if num_results == 0:
        num_results = 5

    # Returns Chroma's query result "results"; results["documents"] holds one list of texts per query
    match retrieval_function:
        case 'getknowledge':
            results = getknowledge(final_prompt_string, num_results, retrieval_embedding_model_id)
        case 'getknowledge_via_chunks':
            results = getknowledge_via_chunks(final_prompt_string, num_results, retrieval_embedding_model_id)
        case _:
            # Fail fast: continuing would hit an undefined results variable below
            raise ValueError(f"Retrieval function not found: {retrieval_function}")

    knowledge = ""
    for doc_group in results["documents"]:
        for doc in doc_group:
            knowledge += doc + "\n\n"

    if debug:
        print(knowledge)


    # ----- Generation section -----

    #  Set the generation function and chat model id
    generation_function = config ["generation"]["function"]
    generation_chat_model_id = config ["generation"]["model"]

    # Get an answer from the LLM
    match generation_function:
        case 'query_llm':
            answer = query_llm(generation_chat_model_id, knowledge, num_results, original_prompt_string)
            # answer = query_llm(generation_chat_model_id, knowledge, num_results, final_prompt_string)
        case _:
            # Fail fast: continuing would hit an undefined answer variable below
            raise ValueError(f"Generation function not found: {generation_function}")


    # ----- Judging section -----

    #  Set the judging function and chat model id
    judgment_function = config ["judging"]["function"]
    judgment_chat_model_id = config ["judging"]["model"]

    # Judge the answer
    gold_answer = config ["question"]["gold_answer"]
    # If a gold answer is provided
    if gold_answer != "":
        # Call an LLM to judge the chatbot answer by comparing it to the gold answer, taking into account the judging instructions
        instructions = config ["question"]["instructions"]

        match judgment_function:
            case 'judge_answer':
                judgment = judge_answer(judgment_chat_model_id, final_prompt_string, gold_answer,answer,instructions)
            case _:
                # Fail fast: continuing would hit an undefined judgment variable below
                raise ValueError(f"Judging function not found: {judgment_function}")

        # Clean up JSON result: take substring between first { and last }
        start = judgment.find("{")
        end   = judgment.rfind("}")

        try:
            # Raise inside the try block so a missing JSON object is handled, not fatal
            if start == -1 or end == -1:
                raise json.JSONDecodeError("No JSON object found", judgment, 0)
            data = json.loads(judgment[start : end+1])
            relevance = data["relevance"]
            faithfulness = data["faithfulness"]
            completeness = data["completeness"]
            confidence = data["confidence"]
        except (json.JSONDecodeError, KeyError) as e:
            print("Error parsing judgment:", e)
            relevance = None
            faithfulness = None
            completeness = None
            confidence = None

        if debug:
            print("Relevance:", relevance)
            print("Faithfulness:", faithfulness)
            print("Completeness:", completeness)
            print("Confidence:", confidence)

    # If a gold answer is NOT provided, or if the confidence is missing or <0.2
    if gold_answer == "" or confidence is None or confidence < 0.2:
        # Ask the human to judge the answer
        human_judgment = human_judgment_flow(
            gold_answer=gold_answer,
            chatbot_answer=answer
        )
        data = json.loads(human_judgment)
        relevance = data["relevance"]
        faithfulness = data["faithfulness"]
        completeness = data["completeness"]
        confidence = data["confidence"]

    # Composite score = 0.4*R + 0.4*F + 0.2*C (None if any score is missing)
    if None in (relevance, faithfulness, completeness):
        scores_composite = None
    else:
        scores_composite = 0.4*relevance + 0.4*faithfulness + 0.2*completeness

    # Log the results to the test_results list

    # Create the row as a list of values in the specified order
    results_row = [
        unique_run_id,
        timestamp,
        pre_retrieval_function,
        pre_retrieval_chat_model_id,
        retrieval_function,
        retrieval_embedding_model_id,
        generation_function,
        generation_chat_model_id,
        judgment_function,
        judgment_chat_model_id,
        original_prompt_string,
        final_prompt_string,
        answer,
        gold_answer,
        relevance,
        faithfulness,
        completeness,
        scores_composite,
        confidence
        # For future work:
        # pipeline_tokens_in,
        # pipeline_tokens_out,
        # pipeline_latency,
        # grading_tokens_in,
        # grading_tokens_out,
        # grading_latency
        # quality_per_dollar  # = final_score / (pipeline_tokens / 1000 * 0.005)
    ]

    # Append the new row to the test_results list
    test_results.append(results_row)

    # ----- Return response section -----
    return answer
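
The composite score weights relevance and faithfulness at 0.4 each and completeness at 0.2. For scores of 5, 4, and 3, that works out to 0.4·5 + 0.4·4 + 0.2·3 = 2.0 + 1.6 + 0.6 = 4.2. The sketch below implements the same formula; returning `None` when a score is missing is an assumption about how unparseable judgments should be handled, not something the notebook specifies.

```python
from typing import Optional

# Weights taken from the formula in the orchestration function.
WEIGHTS = {"relevance": 0.4, "faithfulness": 0.4, "completeness": 0.2}

def composite_score(relevance: Optional[int],
                    faithfulness: Optional[int],
                    completeness: Optional[int]) -> Optional[float]:
    """Weighted sum of the three rubric scores; None if any score is missing."""
    scores = {
        "relevance": relevance,
        "faithfulness": faithfulness,
        "completeness": completeness,
    }
    if any(value is None for value in scores.values()):
        return None  # assumption: a missing score makes the composite undefined
    return sum(WEIGHTS[name] * value for name, value in scores.items())

example = composite_score(5, 4, 3)       # 0.4*5 + 0.4*4 + 0.2*3
missing = composite_score(5, None, 3)    # one score could not be parsed
```

Since the weights sum to 1.0, the composite stays on the same 1-5 scale as the individual rubric scores.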

### Main Program and Function Calling

In [12]:
# If true, causes intermediate results to be printed to the console
debug_mode = True

In [13]:
# Instantiate a ChromaDB client. The default Chroma client is ephemeral, meaning it will not save to disk.
chroma_client = chromadb.Client()

# Create a Collection
collection = chroma_client.create_collection(name="rag_collection")

In [None]:
# type: ignore

!pip install langchain_community

import pandas as pd # Import pandas for CSV handling
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document # Import Document class
# Removed PyMuPDFLoader as we are handling CSV

from google.colab import files
# Upload the open-source legal text dataset from Kaggle (https://www.kaggle.com/datasets/amohankumar/legal-text-classification-dataset)
uploaded_files = files.upload()

import io  # Standard library, for wrapping the uploaded bytes in a file-like object

docs = []
for filename, content in uploaded_files.items():
    # Each uploaded file arrives as bytes; wrap it so pandas can read it as a CSV
    df = pd.read_csv(io.BytesIO(content), encoding='utf-8')
    # Convert every column to string and combine them into a single text field per row
    df['combined_text'] = df.astype(str).agg(' '.join, axis=1)
    # Store each row as the string form of a dict holding the text and its source metadata
    docs.extend([str({'page_content': row['combined_text'], 'metadata': {'source': filename, 'row': i}}) for i, row in df.iterrows()])



Saving legal_text_classification.csv to legal_text_classification (1).csv


In [45]:
print(len(docs))

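# Generate the ID list that ChromaDB requires, along with the matching subset of documents
# Note: range(1, 500) selects docs[1] through docs[499] (499 documents); docs[0] is skipped
id_list = []
subset = []
for i in range(1, 500):
    id_list.append(f"id{i}")
    subset.append(docs[i])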

print(id_list)
print(subset[0])

24985
['id1', 'id2', 'id3', 'id4', 'id5', 'id6', 'id7', 'id8', 'id9', 'id10', 'id11', 'id12', 'id13', 'id14', 'id15', 'id16', 'id17', 'id18', 'id19', 'id20', 'id21', 'id22', 'id23', 'id24', 'id25', 'id26', 'id27', 'id28', 'id29', 'id30', 'id31', 'id32', 'id33', 'id34', 'id35', 'id36', 'id37', 'id38', 'id39', 'id40', 'id41', 'id42', 'id43', 'id44', 'id45', 'id46', 'id47', 'id48', 'id49', 'id50', 'id51', 'id52', 'id53', 'id54', 'id55', 'id56', 'id57', 'id58', 'id59', 'id60', 'id61', 'id62', 'id63', 'id64', 'id65', 'id66', 'id67', 'id68', 'id69', 'id70', 'id71', 'id72', 'id73', 'id74', 'id75', 'id76', 'id77', 'id78', 'id79', 'id80', 'id81', 'id82', 'id83', 'id84', 'id85', 'id86', 'id87', 'id88', 'id89', 'id90', 'id91', 'id92', 'id93', 'id94', 'id95', 'id96', 'id97', 'id98', 'id99', 'id100', 'id101', 'id102', 'id103', 'id104', 'id105', 'id106', 'id107', 'id108', 'id109', 'id110', 'id111', 'id112', 'id113', 'id114', 'id115', 'id116', 'id117', 'id118', 'id119', 'id120', 'id121', 'id122', 'id

In [46]:
# Add documents to the Collection
collection.add(documents= subset , ids=id_list)

# Query list, contains a list of rows each with a question, gold answer, and (optional) judging instructions -- just one sample here, can have more
query_list = [["What is the purpose of a non-disclosure agreement?","'The purpose of a non-disclosure agreement is to protect important interests in encouraging negotiated settlements of disputes, ensuring that parties in such negotiations are frank and open with each other, and ensuring that parties communicate without apprehension that confidential and potentially damaging information could be used against them.'","None."]]

# Set up pre-retrieval and retrieval options; combined with the generation and
# judging options below, these yield 3 x 2 = 6 test configurations
pre_retrieval_function_options = ["query_rewrite_1","query_rewrite_2","HyDE_1"]
pre_retrieval_model_options = ["gemini-2.0-flash"]
retrieval_function_options = ["getknowledge"]
retrieval_model_options = ["chromadb_default"]
retrieval_num_results_options = [5]

#Model options for generation
generation_function_options = ["query_llm"]
# generation_model_options = ["gemini-1.5-flash-8b","gemini-2.0-flash","gemini-2.0-flash-lite"]
generation_model_options = ["gemini-2.0-flash","gemini-2.0-flash-lite"]

#Model options for LLM-as-a-judge
judge_function_options = ["judge_answer"]
judge_model_options = ["gemini-2.0-flash"]

In [47]:
test_configurations = []

#  Loop through pre-retrieval functions
for pre_func in pre_retrieval_function_options:
    #  Loop through pre-retrieval models
    for pre_model in pre_retrieval_model_options:
        #  Loop through retrieval functions
        for ret_func in retrieval_function_options:
            #  Loop through retrieval models
            for ret_model in retrieval_model_options:
                #  Loop through number of results to retrieve
                for ret_results in retrieval_num_results_options:
                    #  Loop through generation functions
                    for gen_func in generation_function_options:
                        #  Loop through generation models
                        for gen_model in generation_model_options:
                            #  Loop through judging functions
                            for jud_func in judge_function_options:
                                #  Loop through judging models
                                for jud_model in judge_model_options:
                                    #  Add a line to the test_configurations list
                                    # Each configuration is a tuple containing one option from each list
                                    test_configurations.append(
                                        (pre_func, pre_model, ret_func, ret_model, ret_results, gen_func, gen_model,jud_func,jud_model)
                                    )

# Optional: Print out the test configurations to verify
if debug_mode:
    for config in test_configurations:
       print(config)

('query_rewrite_1', 'gemini-2.0-flash', 'getknowledge', 'chromadb_default', 5, 'query_llm', 'gemini-2.0-flash', 'judge_answer', 'gemini-2.0-flash')
('query_rewrite_1', 'gemini-2.0-flash', 'getknowledge', 'chromadb_default', 5, 'query_llm', 'gemini-2.0-flash-lite', 'judge_answer', 'gemini-2.0-flash')
('query_rewrite_2', 'gemini-2.0-flash', 'getknowledge', 'chromadb_default', 5, 'query_llm', 'gemini-2.0-flash', 'judge_answer', 'gemini-2.0-flash')
('query_rewrite_2', 'gemini-2.0-flash', 'getknowledge', 'chromadb_default', 5, 'query_llm', 'gemini-2.0-flash-lite', 'judge_answer', 'gemini-2.0-flash')
('HyDE_1', 'gemini-2.0-flash', 'getknowledge', 'chromadb_default', 5, 'query_llm', 'gemini-2.0-flash', 'judge_answer', 'gemini-2.0-flash')
('HyDE_1', 'gemini-2.0-flash', 'getknowledge', 'chromadb_default', 5, 'query_llm', 'gemini-2.0-flash-lite', 'judge_answer', 'gemini-2.0-flash')
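
The nested loops above enumerate the Cartesian product of the option lists. As an alternative sketch, `itertools.product` from the standard library produces the same tuples in the same order, with the rightmost list varying fastest. The option values are copied from the cells above; the shortened variable names are illustrative.

```python
from itertools import product

# Option lists copied from the configuration cell above.
pre_funcs = ["query_rewrite_1", "query_rewrite_2", "HyDE_1"]
pre_models = ["gemini-2.0-flash"]
ret_funcs = ["getknowledge"]
ret_models = ["chromadb_default"]
num_results_options = [5]
gen_funcs = ["query_llm"]
gen_models = ["gemini-2.0-flash", "gemini-2.0-flash-lite"]
jud_funcs = ["judge_answer"]
jud_models = ["gemini-2.0-flash"]

# One tuple per combination: 3 pre-retrieval functions x 2 generation models = 6.
grid = list(product(pre_funcs, pre_models, ret_funcs, ret_models,
                    num_results_options, gen_funcs, gen_models,
                    jud_funcs, jud_models))

for config in grid:
    print(config)
```

Adding a new dimension to the test grid then means adding one list and one argument rather than another level of nesting.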


### Testing the RAG Implementation on the Golden Dataset/Question

In [48]:
# Set up a results list to log test results
RESULTS_HEADER = ["Unique Run ID", # Add a first row with column headings
    "Timestamp",
    "Pre-Retrieval Function",
    "Pre-Retrieval Chat Model ID",
    "Retrieval Function",
    "Retrieval Embedding Model ID",
    "Generation Function",
    "Generation Chat Model ID",
    "Judging Function",
    "Judging Chat Model ID",
    "Prompt String",
    "Final Prompt String",
    "Answer",
    "Sample Answer",
    "Relevance (0-5)",
    "Faithfulness (0-5)",
    "Completeness (0-5)",
    "Composite Score (0-5)",
    "Confidence In Score (0-1)"
    # For future work:
    # "Pipeline Tokens In",
    # "Pipeline Tokens Out",
    # "Pipeline Latency",
    # "Grading Tokens In",
    # "Grading Tokens Out",
    # "Grading Latency",
    # "Quality per Dollar"  # = final_score / (pipeline_tokens / 1000 * 0.005)
    ]

# Start the results table with *one* row – the header
test_results = [RESULTS_HEADER]

if debug_mode:
    print(test_results)

[['Unique Run ID', 'Timestamp', 'Pre-Retrieval Function', 'Pre-Retrieval Chat Model ID', 'Retrieval Function', 'Retrieval Embedding Model ID', 'Generation Function', 'Generation Chat Model ID', 'Judging Function', 'Judging Chat Model ID', 'Prompt String', 'Final Prompt String', 'Answer', 'Sample Answer', 'Relevance (0-5)', 'Faithfulness (0-5)', 'Completeness (0-5)', 'Composite Score (0-5)', 'Confidence In Score (0-1)']]


In [49]:
# Loop through the list of test configurations, calling the orchestration function once for each configuration

for pre_function, pre_model, ret_function, ret_model, ret_results, gen_function, gen_model, jud_function, jud_model in test_configurations:

    # Loop through the list of questions and answers
    for query_val, gold_answer, instructions in query_list:

        # Set up the pipeline configuration
        this_config = {
            "question": {
                "query": query_val,
                "gold_answer": gold_answer,
                "instructions": instructions
            },

            "pre_retrieval": {
                "function": pre_function,
                "model": pre_model
            },

            "retrieval": {
                "function": ret_function,
                "model": ret_model,
                "num_results" : ret_results
            },

            "generation": {
                "function": gen_function,
                "model": gen_model
            },

            "judging": {
                "function": jud_function,
                "model": jud_model
            }
        }
        # "orchestration" will run the test for this configuration, and then add a line with results to test_results
        answer = orchestration(this_config, test_results, debug_mode)

What is the purpose of a non-disclosure agreement (NDA) in protecting confidential information, and what are its key provisions?

{'page_content': 'Case445 followed Australian Competition and Consumer Commission v ABB Power Transmission Pty Ltd [2003] FCA 626 The "public interest" claimed by the ACCC to require protection of Category III documents from disclosure was said to be to encourage, by ensuring the confidentiality of information they provide, cartel whistleblowers to come forward. In ACCC v ABB Power Transmission Pty Ltd [2003] FCA 626 at [43] , the Court suggested in obiter dicta that such a public interest might exist. In the present case, Counsel for the ACCC put it this way: The public interest ... is to induce the Amcors of this world - those who come forward early, under the immunity policy - to give the fullest possible assistance to the [ACCC] to ensure that the cartel is put at an end as quickly as possible, prosecuted, and brought to finality quickly. The more assist

In [50]:
from IPython.display import Markdown, display

# Selected columns
selected_columns = [
    'Pre-Retrieval Function',
    'Generation Chat Model ID',
    'Prompt String',
    'Final Prompt String',
    'Answer',
    'Sample Answer',
    'Relevance (0-5)',
    'Faithfulness (0-5)',
    'Completeness (0-5)',
    'Composite Score (0-5)',
    'Confidence In Score (0-1)'
]

# Header and data rows
headers = test_results[0]
rows = test_results[1:]
selected_indexes = [headers.index(col) for col in selected_columns]

# Determine which columns to center
centered_columns = set(selected_columns[-5:])  # last 5 columns
separator_cells = [
    ":---:" if col in centered_columns else "---" for col in selected_columns
]

# Build the markdown string
markdown = "| " + " | ".join(selected_columns) + " |\n"
markdown += "| " + " | ".join(separator_cells) + " |\n"

for row in rows:
    selected_row = [str(row[i]).replace("\n", " ") for i in selected_indexes]
    markdown += "| " + " | ".join(selected_row) + " |\n"

# Print the raw Markdown table
safe_md = markdown.replace("$", r"\$")  # Escaping '$' keeps it from being parsed as a math delimiter when the table is rendered
print(safe_md)

| Pre-Retrieval Function | Generation Chat Model ID | Prompt String | Final Prompt String | Answer | Sample Answer | Relevance (0-5) | Faithfulness (0-5) | Completeness (0-5) | Composite Score (0-5) | Confidence In Score (0-1) |
| --- | --- | --- | --- | --- | --- | :---: | :---: | :---: | :---: | :---: |
| query_rewrite_1 | gemini-2.0-flash | What is the purpose of a non-disclosure agreement? | What is the purpose of a non-disclosure agreement (NDA) in protecting confidential information, and what are its key provisions?  | A non-disclosure agreement (NDA) ensures that confidential information shared between parties remains protected. It legally binds the involved parties to not disclose the information to others, safeguarding trade secrets, proprietary data, and other sensitive details. NDAs foster trust and encourage open communication during negotiations, collaborations, or potential business ventures.  | 'The purpose of a non-disclosure agreement is to protect important interests 
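
The table-building cell above replaces newlines and escapes `$`, but a literal `|` inside an answer would still be read as a column boundary. The sketch below shows one way to escape cells before joining them. `escape_cell` and `to_markdown_table` are hypothetical helper names, the sample row is made up for illustration, and the backslash escape for pipes assumes a GitHub-Flavored-Markdown-style table renderer.

```python
def escape_cell(value):
    """Make a value safe to place inside a Markdown table cell."""
    text = str(value)
    text = text.replace("\r\n", " ").replace("\n", " ")  # keep each row on one line
    text = text.replace("|", r"\|")  # a bare '|' would start a new column
    text = text.replace("$", r"\$")  # '$' can trigger math rendering
    return text


def to_markdown_table(headers, rows, centered=()):
    """Build a Markdown table; headers listed in `centered` get ':---:'."""
    separator = [":---:" if header in centered else "---" for header in headers]
    lines = [
        "| " + " | ".join(escape_cell(header) for header in headers) + " |",
        "| " + " | ".join(separator) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(escape_cell(cell) for cell in row) + " |")
    return "\n".join(lines) + "\n"


# Made-up sample row containing '$', '|', and a newline.
headers = ["Answer", "Composite Score (0-5)"]
rows = [["Costs $5 | see clause 2\nfor details", 4.8]]
table = to_markdown_table(headers, rows, centered={"Composite Score (0-5)"})
print(table)
```

Escaping each cell individually, rather than the finished string, leaves the structural `|` separators untouched while still neutralizing any that appear inside the data.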

**Option 1:**

| query_rewrite_1 | gemini-2.0-flash | What is the purpose of a non-disclosure agreement? | What is the purpose of a non-disclosure agreement (NDA) in protecting confidential information, and what are its key provisions?  |

***A non-disclosure agreement (NDA) ensures that confidential information shared between parties remains protected. It legally binds the involved parties to not disclose the information to others, safeguarding trade secrets, proprietary data, and other sensitive details. NDAs foster trust and encourage open communication during negotiations, collaborations, or potential business ventures.***

| 'The purpose of a non-disclosure agreement is to protect important interests in encouraging negotiated settlements of disputes, ensuring that parties in such negotiations are frank and open with each other, and ensuring that parties communicate without apprehension that confidential and potentially damaging information could be used against them.'

| 5 | 5 | 4 | 4.8 | 1.0 |

**Option 2:**

| query_rewrite_1 | gemini-2.0-flash-lite | What is the purpose of a non-disclosure agreement? | What is the legal purpose and scope of a non-disclosure agreement (NDA)?  |

***A non-disclosure agreement (NDA) is a legal contract. Its purpose is to protect confidential information. NDAs prevent the unauthorized disclosure of sensitive data, trade secrets, or proprietary knowledge shared between parties, typically during business dealings or collaborations.***

| 'The purpose of a non-disclosure agreement is to protect important interests in encouraging negotiated settlements of disputes, ensuring that parties in such negotiations are frank and open with each other, and ensuring that parties communicate without apprehension that confidential and potentially damaging information could be used against them.'

| 5 | 5 | 3 | 4.6 | 0.9 |


**Option 3:**

| query_rewrite_2 | gemini-2.0-flash | What is the purpose of a non-disclosure agreement? | What is the purpose of a non-disclosure agreement (NDA) in the context of legal proceedings, client confidentiality, or business transactions handled by our firm?  |

***A non-disclosure agreement (NDA) is a legal contract designed to protect confidential information. It establishes a confidential relationship between parties, where one or both agree not to disclose specific information to third parties. NDAs are often used to protect trade secrets, business strategies, and other sensitive data.***

| 'The purpose of a non-disclosure agreement is to protect important interests in encouraging negotiated settlements of disputes, ensuring that parties in such negotiations are frank and open with each other, and ensuring that parties communicate without apprehension that confidential and potentially damaging information could be used against them.'

| 5 | 5 | 3 | 4.6 | 0.9 |

**Option 4:**

| query_rewrite_2 | gemini-2.0-flash-lite | What is the purpose of a non-disclosure agreement? | What is the purpose of a non-disclosure agreement (NDA), and in what contexts would our firm typically advise clients to use one, particularly concerning intellectual property protection, confidential information exchange during negotiations, or potential litigation?  |

***A non-disclosure agreement (NDA) is a contract between parties that restricts the sharing of confidential information. It ensures that sensitive data, trade secrets, or proprietary information remain protected from unauthorized disclosure. NDAs are often used in business deals or employment.***  

| 'The purpose of a non-disclosure agreement is to protect important interests in encouraging negotiated settlements of disputes, ensuring that parties in such negotiations are frank and open with each other, and ensuring that parties communicate without apprehension that confidential and potentially damaging information could be used against them.'

| 4 | 4 | 2 | 3.6 | 0.9 |

**Option 5:**

| HyDE_1 | gemini-2.0-flash | What is the purpose of a non-disclosure agreement? |

***A Non-Disclosure Agreement (NDA), also known as a confidentiality agreement, is a legally binding contract that establishes a confidential relationship. The primary purpose of an NDA is to protect sensitive information by restricting access to it and preventing its unauthorized disclosure to third parties. Specifically, an NDA accomplishes the following:***  
*   **Defines Confidential Information:** The agreement clearly identifies what constitutes "confidential information," which may include trade secrets, proprietary data, financial information, customer lists, business plans, marketing strategies, technical specifications, inventions, and other non-public details. This definition is crucial for establishing the scope of protection.
*   **Establishes Obligations of the Receiving Party:** The NDA outlines the obligations of the party receiving the confidential information (the "Recipient"). These obligations typically include: maintaining the confidentiality of the information; using the information only for a specific, agreed-upon purpose; restricting access to the information to only those employees or agents who need to know it and who are also bound by confidentiality obligations; and taking reasonable measures to protect the information from unauthorized disclosure or use.
*   **Provides Legal Recourse:** An NDA provides the disclosing party with legal recourse in the event of a breach of the agreement. This means that if the Recipient discloses or uses the confidential information in violation of the NDA, the disclosing party can seek legal remedies, such as injunctive relief (a court order to stop the unauthorized use or disclosure) and monetary damages to compensate for the harm caused by the breach.  
*   **Specifies Exclusions from Confidentiality:** While broad in scope, NDAs often specify certain exclusions from the definition of confidential information. Common exclusions include information that is already publicly known, information that was already in the Recipient's possession prior to disclosure, information that is independently developed by the Recipient without reference to the disclosed information, or information that is lawfully received by the Recipient from a third party without restriction.  
*   **Defines the Term of Confidentiality:** The agreement will specify the duration for which the confidentiality obligations remain in effect. This term may be indefinite or for a fixed period of time, depending on the nature of the information and the parties' agreement.

***The purpose of a non-disclosure agreement is to ensure that confidential information remains protected. It is a legally binding contract that establishes a confidential relationship, typically between two or more parties. The agreement outlines what information is considered confidential and restricts the parties from disclosing that information to others.***

| 'The purpose of a non-disclosure agreement is to protect important interests in encouraging negotiated settlements of disputes, ensuring that parties in such negotiations are frank and open with each other, and ensuring that parties communicate without apprehension that confidential and potentially damaging information could be used against them.'

| 5 | 5 | 3 | 4.6 | 1.0 |

**Option 6:**

| HyDE_1 | gemini-2.0-flash-lite | What is the purpose of a non-disclosure agreement? |

***A non-disclosure agreement (NDA), also known as a confidentiality agreement (CA), is a legally binding contract that establishes a confidential relationship. The primary purpose of an NDA is to protect sensitive information, including but not limited to trade secrets, proprietary knowledge, business strategies, customer lists, financial data, and product designs, from unauthorized disclosure. NDAs define what information is considered confidential, specify the permitted uses of the information, and restrict the recipient from disclosing it to third parties. By entering into an NDA, parties can freely share confidential information for specific purposes, such as exploring a business relationship, conducting due diligence, or engaging in collaborative research, with the assurance that the information will be protected from misuse or dissemination. This protection encourages open communication and collaboration while safeguarding valuable assets. The agreement typically outlines the duration of the confidentiality obligation, any exceptions to confidentiality (e.g., information already in the public domain), and the remedies available to the disclosing party in the event of a breach.***

***A non-disclosure agreement (NDA) is a legal contract that protects confidential information. It restricts one or more parties from disclosing proprietary or sensitive information. The purpose is to safeguard trade secrets, business strategies, and other confidential data, preventing its unauthorized use or dissemination.***

| 'The purpose of a non-disclosure agreement is to protect important interests in encouraging negotiated settlements of disputes, ensuring that parties in such negotiations are frank and open with each other, and ensuring that parties communicate without apprehension that confidential and potentially damaging information could be used against them.'

| 5 | 5 | 3 | 4.6 | 1.0 |
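
The composite scores in the six options are consistent with a linear weighting of the three criteria, although this section does not state how the judge computes them. If the composite is a weighted sum, Options 1 and 2 (which differ only in completeness, 4 versus 3, and in composite, 4.8 versus 4.6) imply a completeness weight of 0.2, leaving 0.8 for relevance and faithfulness together. The sketch below checks that reading against all six rows. The even 0.4 / 0.4 split is an assumption, since relevance and faithfulness are equal in every row shown and so cannot be separated.

```python
import math

# Assumed weights: the six rows only constrain completeness to 0.2 and
# relevance + faithfulness to 0.8 in total; the even 0.4 / 0.4 split is a guess.
WEIGHTS = {"relevance": 0.4, "faithfulness": 0.4, "completeness": 0.2}


def composite_score(relevance, faithfulness, completeness):
    """Weighted sum of the three 0-5 scores, rounded to one decimal place."""
    total = (
        WEIGHTS["relevance"] * relevance
        + WEIGHTS["faithfulness"] * faithfulness
        + WEIGHTS["completeness"] * completeness
    )
    return round(total, 1)


# (relevance, faithfulness, completeness, reported composite) for Options 1-6
reported = [
    (5, 5, 4, 4.8),
    (5, 5, 3, 4.6),
    (5, 5, 3, 4.6),
    (4, 4, 2, 3.6),
    (5, 5, 3, 4.6),
    (5, 5, 3, 4.6),
]

for number, (rel, faith, comp, expected) in enumerate(reported, start=1):
    score = composite_score(rel, faith, comp)
    status = "matches" if math.isclose(score, expected) else "differs from"
    print(f"Option {number}: computed {score} {status} reported {expected}")
```

Under this reading, the gap between Option 1 and the other full-relevance options comes entirely from the completeness score, and Option 4 falls behind on all three criteria.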
