<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Community</a>
    </p>
</center>

<center><h1>Using Arize with Couchbase</h1></center>

This guide shows you how to create a retrieval-augmented generation (RAG) chatbot and evaluate its performance with Arize and Couchbase. RAG is typically used to answer queries from a specified set of documents rather than from the LLM's training data alone, which reduces hallucinations and incorrect generations.

We'll go through the following steps:

* Create a RAG chatbot with Langchain and Couchbase

* Trace the retrieval and LLM calls using Arize

* Create a dataset to benchmark performance

* Evaluate performance using LLM as a judge

Much of the code in this tutorial is adapted from the [Langchain Couchbase Tutorial](https://python.langchain.com/docs/integrations/vectorstores/couchbase/).

# Create a RAG chatbot using Langchain and Couchbase

Let's start with all of our boilerplate setup:

1. Install packages for tracing and retrieval
2. Set up our API keys
3. Set up Arize for tracing
4. Set up Couchbase
5. Create our Langchain RAG query engine
6. See your results in Arize

### Install packages for tracing and retrieval

In [None]:
!pip install -qq openai langchain langchain_community langchain-openai langchain-couchbase

!pip install -q arize-phoenix-evals arize-otel openinference-instrumentation-langchain

### Set up our API keys

In [None]:
import os
from getpass import getpass

SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
DEVELOPER_KEY = globals().get("DEVELOPER_KEY") or getpass(
    "🔑 Enter your Arize Developer Key: "
)
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

### Set up Arize for Tracing

To follow along with this tutorial, you'll need to sign up for Arize and get your Space ID and API key. You can see the [guide here](https://docs.arize.com/arize/llm-tracing/quickstart-llm).

In [None]:
# Import open-telemetry dependencies
from arize.otel import register

# Setup OTEL via our convenience function
tracer_provider = register(
    space_id=SPACE_ID,
    api_key=API_KEY,
    project_name="couchbase-rag",
    log_to_console=False,
    batch=False,
)

# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.langchain import LangChainInstrumentor

LangChainInstrumentor().uninstrument()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

### Set up Couchbase

You'll need to set up your Couchbase cluster by doing the following:
1. Create an account at [Couchbase Cloud](https://cloud.couchbase.com/)
2. Create a free cluster
3. Create cluster access credentials
4. Allow access to the cluster from your local machine
5. Create a bucket to store your documents

Screenshots below:

<img src="https://storage.googleapis.com/arize-assets/tutorials/images/couchbase-free-cluster.png" width="800"/>


<img src="https://storage.googleapis.com/arize-assets/tutorials/images/couchbase-cluster-access.png" width="800"/>


<img src="https://storage.googleapis.com/arize-assets/tutorials/images/couchbase-allowed-ips.png" width="800"/>

<img src="https://storage.googleapis.com/arize-assets/tutorials/images/couchbase-create-bucket.png" width="800"/>

### Create our Langchain RAG query engine

Once you've set up your cluster, you can connect to it using the Couchbase Python SDK and Langchain's Couchbase package.

In [None]:
COUCHBASE_CONNECTION_STRING = getpass(
    "Enter the connection string for the Couchbase cluster: "
)
DB_USERNAME = getpass("Enter the username for the Couchbase cluster: ")
DB_PASSWORD = getpass("Enter the password for the Couchbase cluster: ")

BUCKET_NAME = "langchain_bucket"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "langchain-index"

In [None]:
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
options.apply_profile("wan_development")
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))

Before creating the vector store in the next cell, you must also create a search index. You can do this by going to the Couchbase UI and clicking on the "Search" tab. Make sure the bucket, scope, collection, and index names match the ones we've defined above.

Link below:
https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_couchbase.vectorstores import CouchbaseVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vector_store = CouchbaseVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
)

In [None]:
!mkdir -p data
!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O data/paul_graham_essay.txt

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter


def reset_vector_store(vector_store, chunk_size=1024, chunk_overlap=20):
    results = vector_store.similarity_search(
        k=1000,
        query="",  # Use an empty query or a specific one if needed
        search_options={
            "query": {"field": "metadata.source", "match": "paul_graham_essay"}
        },
    )
    # Delete previously loaded essay chunks so each reset starts clean
    if results:
        vector_store.delete(ids=[result.id for result in results])
    loader = TextLoader("./data/paul_graham_essay.txt")
    documents = loader.load()
    text_splitter = CharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    docs = text_splitter.split_documents(documents)

    # Tag every chunk with its source so it can be filtered later
    for doc in docs:
        doc.metadata["source"] = "paul_graham_essay"

    vector_store.add_documents(docs)
    return vector_store


reset_vector_store(vector_store)

We can test the vector search directly with the following code:

In [None]:
query = "What did Paul Graham say about the future of AI?"
vector_store.similarity_search(query, k=2)

We can also load other documents into the vector store for testing, as shown below. Each document carries a `metadata.source` field, which lets us filter documents by source independently of the vector query.

In [None]:
from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

You may need to tag the embedding field as a vector field in the search index settings. See image below:

<img src="https://storage.googleapis.com/arize-assets/tutorials/images/couchbase-search-index-settings.png" width="800"/>


Let's try the vector search using the Langchain retriever interface across our new documents.

In [None]:
retriever = vector_store.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("Is the stock market down?", filter={"source": "news"})
print(docs)
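
To see why the `source` filter matters, here is a toy, dependency-free sketch of combining a metadata filter with similarity ranking. The two-dimensional vectors, the dict layout, and the `filtered_search` helper are all made up for illustration; this shows the general idea only, not how Couchbase executes the query.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(query_vector, docs, source, k=1):
    """Keep only docs from `source`, then rank them by similarity (hypothetical helper)."""
    candidates = [d for d in docs if d["metadata"]["source"] == source]
    ranked = sorted(
        candidates,
        key=lambda d: cosine_similarity(query_vector, d["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy documents with hand-written 2-d "embeddings" (illustrative values only)
toy_docs = [
    {"text": "stock market report", "metadata": {"source": "news"}, "vector": [0.9, 0.1]},
    {"text": "weather forecast", "metadata": {"source": "news"}, "vector": [0.1, 0.9]},
    {"text": "tweet about stocks", "metadata": {"source": "tweet"}, "vector": [1.0, 0.0]},
]

top = filtered_search([1.0, 0.0], toy_docs, source="news", k=1)
print(top[0]["text"])
```

Without the filter, the tweet would rank first because its vector matches the query exactly; with the filter, only `news` documents are considered.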

Let's run an entire RAG query with the Langchain RAG query engine.

In [None]:
from langchain import hub
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

prompt = hub.pull("rlm/rag-prompt")
question = "What did Paul Graham say about AI?"
context = ""

retriever = vector_store.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke(question, filter={"source": "paul_graham_essay"})
for doc in docs:
    context += doc.page_content

messages = prompt.invoke(
    {"context": context, "question": question}
).to_messages()

response = llm.invoke(messages)
print("Context: ", context)
print("Response: ", response.content)
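
Conceptually, the cell above does two things before calling the LLM: it concatenates the retrieved chunks into a context string, and it fills a prompt template with that context and the question. The sketch below shows that assembly step in plain Python. The template text and the `build_rag_prompt` helper are hypothetical stand-ins; the actual wording of `rlm/rag-prompt` may differ.

```python
# Hypothetical template for illustration; not the real "rlm/rag-prompt" text.
RAG_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_rag_prompt(question, chunks):
    """Join retrieved chunks and insert them into the template."""
    # A blank line between chunks keeps them from running together
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(context=context, question=question)


example_prompt = build_rag_prompt(
    "What did the author work on?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(example_prompt)
```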

### See your results in the Arize UI
Once you've run a single query, you can see the trace in the Arize UI, with each step taken by the retriever, the embedding model, and the LLM.

Click through the queries to better understand how the query engine is performing. Arize can be used to understand and troubleshoot your RAG app by surfacing:
 - Application latency
 - Token usage
 - Runtime exceptions
 - Retrieved documents
 - Embeddings
 - LLM parameters
 - Prompt templates
 - Tool descriptions
 - LLM function calls
 - And more!

# Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 10 questions we can use to test our RAG chatbot.

In [None]:
GEN_TEMPLATE = """
You are an assistant that generates Q&A questions about Paul Graham's essay below.

The questions should involve the essay contents, specific facts and figures,
names, and elements of the story. Do not ask any questions where the answer is
not in the essay contents.

Respond with one question per line. Do not include any numbering at the beginning of each line. Do not include any category headings.
Generate 10 questions. Be sure there are no duplicate questions.

[START ESSAY]
{essay}
[END ESSAY]
"""

with open("data/paul_graham_essay.txt", "r") as file:
    file_content = file.read()

GEN_TEMPLATE = GEN_TEMPLATE.format(essay=file_content)

In [None]:
import nest_asyncio
import pandas as pd

nest_asyncio.apply()
from phoenix.evals import OpenAIModel

pd.set_option("display.max_colwidth", 500)

model = OpenAIModel(model="gpt-4o", max_tokens=1300)

In [None]:
resp = model(GEN_TEMPLATE)

In [None]:
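# The template asks for one question per line, so split on newlines and drop blanks
split_response = [
    line.strip() for line in resp.strip().split("\n") if line.strip()
]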

questions_df = pd.DataFrame(split_response, columns=["input"])
print(questions_df.head(3))
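
LLM output does not always follow formatting instructions exactly, so it can help to parse the response defensively. The sketch below is one possible approach, assuming the model may add numbering, bullets, blank lines, or duplicates despite the prompt. `parse_questions` is a hypothetical helper, and the sample text is made up.

```python
import re


def parse_questions(raw_response):
    """Split a response into unique, non-empty questions, one per line."""
    questions = []
    seen = set()
    for line in raw_response.splitlines():
        # Drop leading numbering or bullets such as "1. ", "2) ", "- ", "* "
        question = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        key = question.lower()
        if question and key not in seen:
            seen.add(key)
            questions.append(question)
    return questions


# Made-up response that ignores some of the formatting instructions
sample_response = (
    "1. What did the author study?\n"
    "\n"
    "- Which language did he use?\n"
    "What did the author study?\n"
)
parsed_questions = parse_questions(sample_response)
print(parsed_questions)
```

The resulting list could be passed to `pd.DataFrame(parsed_questions, columns=["input"])` in place of the simple split.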

Now let's run our RAG pipeline over these questions and manually inspect the traces!

In [None]:
def run_rag(vector_store, questions_df, k_value=1):
    retriever = vector_store.as_retriever(search_kwargs={"k": k_value})
    response_df = questions_df.copy(deep=True)
    for index, row in response_df.iterrows():
        docs = retriever.invoke(row["input"])
        # Separate retrieved chunks with blank lines so they don't run together
        context = "\n\n".join(doc.page_content for doc in docs)
        messages = prompt.invoke(
            {"context": context, "question": row["input"]}
        ).to_messages()
        response = llm.invoke(messages)
        response_df.loc[index, "output"] = response.content
        response_df.loc[index, "reference"] = context
    text_columns = ["input", "output", "reference"]
    response_df[text_columns] = response_df[text_columns].apply(
        lambda x: x.astype(str)
    )
    return response_df


response_df = run_rag(vector_store, questions_df, k_value=1)
response_df.head(3)

In [None]:
response_df

# Evaluating your RAG app

Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.

In [None]:
RELEVANCE_EVAL_TEMPLATE = """You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    [END DATA]

Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.
"""

CORRECTNESS_EVAL_TEMPLATE = """You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the answer.
"""

We will create an LLM-as-a-judge evaluator from the prompt templates above using the `llm_classify` function. It takes each row of our response dataframe, fills in the template from the `input`, `output`, and `reference` columns, and asks an LLM to label the row. The `rails` argument lists the allowed labels, and `provide_explanation=True` asks the judge to explain each label. You can read more detail [here](https://docs.arize.com/phoenix/api/evals#phoenix.evals.llm_classify).
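
To make the mechanics concrete, here is a miniature, dependency-free sketch of the same idea: fill a template per row, ask a judge, and accept only answers that match an allowed label. Everything in it is hypothetical (`snap_to_rails`, `classify_rows`, `fake_judge`, the `NOT_PARSABLE` sentinel, and the toy rows); it is a simplified illustration, not Phoenix's implementation.

```python
NOT_PARSABLE = "NOT_PARSABLE"  # hypothetical sentinel for unusable judge output

TOY_TEMPLATE = "[Question]: {input}\n[Reference text]: {reference}"


def snap_to_rails(raw_output, rails):
    """Return the matching rail, or NOT_PARSABLE if the output is not one of them."""
    cleaned = raw_output.strip().strip(".\"'").lower()
    return cleaned if cleaned in rails else NOT_PARSABLE


def classify_rows(rows, template, rails, judge):
    """Fill the template for each row and snap the judge's answer to a rail."""
    labels = []
    for row in rows:
        judge_prompt = template.format(**row)  # column names must match placeholders
        labels.append(snap_to_rails(judge(judge_prompt), rails))
    return labels


def fake_judge(judge_prompt):
    """Stand-in for an LLM call so the sketch runs offline."""
    if "Lisp" in judge_prompt:
        return "Relevant."
    return "I think it might be unrelated?"


toy_rows = [
    {"input": "Which language did he use?", "reference": "He wrote it in Lisp."},
    {"input": "Where did he study?", "reference": "He liked painting."},
]
toy_labels = classify_rows(toy_rows, TOY_TEMPLATE, ["relevant", "unrelated"], fake_judge)
print(toy_labels)
```

The second answer is rejected because it is a sentence rather than a single allowed label, which is why the templates above insist on a one-word response.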

In [None]:
from phoenix.evals import OpenAIModel, llm_classify

RELEVANCE_RAILS = ["relevant", "unrelated"]
CORRECTNESS_RAILS = ["incorrect", "correct"]

relevance_eval_df = llm_classify(
    dataframe=response_df,
    template=RELEVANCE_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=RELEVANCE_RAILS,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

correctness_eval_df = llm_classify(
    dataframe=response_df,
    template=CORRECTNESS_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=CORRECTNESS_RAILS,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

Let's inspect the results of our evaluation!

In [None]:
relevance_eval_df

In [None]:
correctness_eval_df
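
Scanning rows by eye works for a handful of questions, but a quick summary is often easier to compare. The sketch below computes the share of each label from a plain list; `label_rates` is a hypothetical helper and the example labels are made up.

```python
from collections import Counter


def label_rates(labels):
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration
example_labels = ["relevant", "relevant", "unrelated", "relevant"]
example_rates = label_rates(example_labels)
print(example_rates)
```

With the dataframes above, you could pass `relevance_eval_df["label"].tolist()` or `correctness_eval_df["label"].tolist()` instead of the example list.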

## Experiment with different k-values, chunk sizes, and chunk overlaps

Let's change the number of documents retrieved from the vector store, the size of the chunks loaded into the vector store, and the chunk overlaps.
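
As a rough intuition for what chunk size and chunk overlap control, here is a simplified fixed-width sliding window over characters. This is an illustrative stand-in only: `CharacterTextSplitter` splits on a separator and merges the pieces, so its real chunks will not match this exactly. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into chunks of up to chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


alphabet_chunks = sliding_window_chunks(
    "abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2
)
print(alphabet_chunks)
```

Smaller chunks give the retriever more precise but less complete passages, and a larger overlap reduces the chance that a sentence is cut off at a chunk boundary.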

In [None]:
reset_vector_store(vector_store, chunk_size=100, chunk_overlap=10)

In [None]:
rag_df = run_rag(vector_store, questions_df, k_value=2)
print(rag_df)

In [None]:
print(rag_df.head(3))

Let's set up our evaluators to see how the performance changes.

In [None]:
def run_evaluators(rag_df):
    relevance_eval_df = llm_classify(
        dataframe=rag_df,
        template=RELEVANCE_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=RELEVANCE_RAILS,
        provide_explanation=True,
        concurrency=4,
    )
    rag_df["relevance"] = relevance_eval_df["label"]
    rag_df["relevance_explanation"] = relevance_eval_df["explanation"]

    correctness_eval_df = llm_classify(
        dataframe=rag_df,
        template=CORRECTNESS_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=CORRECTNESS_RAILS,
        provide_explanation=True,
        concurrency=4,
    )
    rag_df["correctness"] = correctness_eval_df["label"]
    rag_df["correctness_explanation"] = correctness_eval_df["explanation"]
    return rag_df

Let's log these results to Arize and see how they compare.

First we'll create a dataset to store our questions.

In [None]:
from arize.experimental.datasets import ArizeDatasetsClient
from uuid import uuid1
from arize.experimental.datasets.experiments.types import (
    ExperimentTaskResultColumnNames,
    EvaluationResultColumnNames,
)
from arize.experimental.datasets.utils.constants import GENERATIVE
import pandas as pd

# Set up the arize client
arize_client = ArizeDatasetsClient(developer_key=DEVELOPER_KEY, api_key=API_KEY)
dataset = None
dataset_name = "rag-experiments-" + str(uuid1())[:3]

dataset_id = arize_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name=dataset_name,
    dataset_type=GENERATIVE,
    data=questions_df,
)
dataset = arize_client.get_dataset(space_id=SPACE_ID, dataset_id=dataset_id)
print(dataset)

Next we'll define which columns of our dataframe will be mapped to outputs and which will be mapped to evaluation labels and explanations.

In [None]:
# Define column mappings for task
task_cols = ExperimentTaskResultColumnNames(
    example_id="example_id", result="output"
)
# Define column mappings for evaluator
relevance_evaluator_cols = EvaluationResultColumnNames(
    label="relevance",
    explanation="relevance_explanation",
)
correctness_evaluator_cols = EvaluationResultColumnNames(
    label="correctness",
    explanation="correctness_explanation",
)


def log_experiment_to_arize(experiment_df, experiment_name):
    experiment_df["example_id"] = dataset["id"]
    return arize_client.log_experiment(
        space_id=SPACE_ID,
        experiment_name=experiment_name + "-" + str(uuid1())[:2],
        experiment_df=experiment_df,
        task_columns=task_cols,
        evaluator_columns={
            "correctness": correctness_evaluator_cols,
            "relevance": relevance_evaluator_cols,
        },
        dataset_name=dataset_name,
    )

Now let's run it for each of our experiments.

In [None]:
# Run Experiments for k-size
reset_vector_store(vector_store, chunk_size=1000, chunk_overlap=20)
k_2_chunk_1000_overlap_20 = run_rag(vector_store, questions_df, k_value=2)
k_4_chunk_1000_overlap_20 = run_rag(vector_store, questions_df, k_value=4)
k_10_chunk_1000_overlap_20 = run_rag(vector_store, questions_df, k_value=10)
k_2_chunk_1000_overlap_20 = run_evaluators(k_2_chunk_1000_overlap_20)
k_4_chunk_1000_overlap_20 = run_evaluators(k_4_chunk_1000_overlap_20)
k_10_chunk_1000_overlap_20 = run_evaluators(k_10_chunk_1000_overlap_20)

log_experiment_to_arize(k_2_chunk_1000_overlap_20, "k_2_chunk_1000_overlap_20")
log_experiment_to_arize(k_4_chunk_1000_overlap_20, "k_4_chunk_1000_overlap_20")
log_experiment_to_arize(
    k_10_chunk_1000_overlap_20, "k_10_chunk_1000_overlap_20"
)

In [None]:
# Run experiments for chunk size
reset_vector_store(vector_store, chunk_size=200, chunk_overlap=10)
k_2_chunk_200_overlap_10 = run_rag(vector_store, questions_df, k_value=2)
reset_vector_store(vector_store, chunk_size=500, chunk_overlap=20)
k_2_chunk_500_overlap_20 = run_rag(vector_store, questions_df, k_value=2)
reset_vector_store(vector_store, chunk_size=1000, chunk_overlap=50)
k_2_chunk_1000_overlap_50 = run_rag(vector_store, questions_df, k_value=2)

k_2_chunk_200_overlap_10 = run_evaluators(k_2_chunk_200_overlap_10)
k_2_chunk_500_overlap_20 = run_evaluators(k_2_chunk_500_overlap_20)
k_2_chunk_1000_overlap_50 = run_evaluators(k_2_chunk_1000_overlap_50)

log_experiment_to_arize(k_2_chunk_200_overlap_10, "k_2_chunk_200_overlap_10")
log_experiment_to_arize(k_2_chunk_500_overlap_20, "k_2_chunk_500_overlap_20")
log_experiment_to_arize(k_2_chunk_1000_overlap_50, "k_2_chunk_1000_overlap_50")
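
As the number of configurations grows, it can be easier to describe them as data than to write one variable per run. The sketch below builds the k-value experiments from the first cell as a list of configs, using the same naming scheme. `experiment_name` and the `configs` layout are hypothetical, and the sketch only builds names; it does not call Couchbase or Arize.

```python
from itertools import product


def experiment_name(k, chunk_size, chunk_overlap):
    """Build a name following the k_<k>_chunk_<size>_overlap_<overlap> scheme."""
    return f"k_{k}_chunk_{chunk_size}_overlap_{chunk_overlap}"


k_values = [2, 4, 10]
chunk_settings = [(1000, 20)]  # (chunk_size, chunk_overlap) pairs

configs = [
    {
        "k": k,
        "chunk_size": size,
        "chunk_overlap": overlap,
        "name": experiment_name(k, size, overlap),
    }
    for (size, overlap), k in product(chunk_settings, k_values)
]

for config in configs:
    print(config["name"])
```

In the notebook, a loop over such configs could call `reset_vector_store` once per chunk setting, then `run_rag`, `run_evaluators`, and `log_experiment_to_arize` for each k-value.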

You can view the experiment results in the Arize UI and compare how each RAG configuration performs.

<img src="https://storage.googleapis.com/arize-assets/tutorials/images/couchbase-rag-experiment.png" width="800"/>