# Evaluation and observability for LLM applications


## Creating an account on Comet.com

[Comet](https://www.comet.com/site?from=llm&utm_source=opik&utm_medium=colab&utm_content=llamaindex&utm_campaign=opik) provides a hosted version of the Opik platform. [Create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=llamaindex&utm_campaign=opik) and grab your API key.

> You can also run the Opik platform locally; see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik&utm_medium=colab&utm_content=llamaindex&utm_campaign=opik) for more information.

In [1]:
# !pip install opik llama-index llama-index-agent-openai llama-index-llms-openai --upgrade --quiet

In [1]:
import opik

opik.configure(use_local=False)

OPIK: Opik is already configured. You can check the settings by viewing the config file at /Users/bharathkarthick/.opik.config


## Preparing our environment

#### Create a .env file with the following content:
> OPENAI_API_KEY=your-api-key-here

> COMET_API_KEY=your-comet-api-key-here


In [2]:
from dotenv import load_dotenv

load_dotenv()

True
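
Before making any API calls, it can help to confirm that both keys were actually loaded. The following is a minimal sketch, assuming the two variable names from the `.env` file above; `missing_env_vars` is a hypothetical helper written for this example and is not part of `python-dotenv` or Opik.

```python
import os

# Variable names taken from the .env file described above
REQUIRED_VARS = ("OPENAI_API_KEY", "COMET_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.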

## Download some sample documents

In [3]:
import os
import requests

# Create directory if it doesn't exist
os.makedirs("./data/paul_graham/", exist_ok=True)

# Download the file using requests
url = "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly instead of saving an error page
with open("./data/paul_graham/paul_graham_essay.txt", "wb") as f:
    f.write(response.content)

## Simple demo of logging with Opik

In [4]:
from opik import track

@track
def my_function(x: int) -> int:
    return x + 1

my_function(1)

OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=0197aeb4-7723-78b7-b4cb-7b8911ef3a08&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


2
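
To build intuition for what a tracing decorator records, here is a simplified stand-in. It is a conceptual illustration only and not Opik's implementation: `toy_track` and the in-memory `TRACES` list are hypothetical names, and the real `@track` sends richer data to the Opik backend.

```python
import functools
import time

TRACES = []  # in-memory stand-in for a tracing backend (assumption for this sketch)


def toy_track(func):
    """Record the name, inputs, output, and duration of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        TRACES.append(
            {
                "name": func.__name__,
                "input": {"args": args, "kwargs": kwargs},
                "output": result,
                "duration_s": time.perf_counter() - start,
            }
        )
        return result

    return wrapper


@toy_track
def add_one(x: int) -> int:
    return x + 1


add_one(1)
print(TRACES[0]["name"], TRACES[0]["output"])
```

The decorated function behaves exactly as before; the wrapper only observes each call and stores a record of it.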

## Tracking LLM calls with Opik

In [5]:
from opik.integrations.openai import track_openai
from openai import OpenAI

openai_client = OpenAI()
openai_client = track_openai(openai_client)

prompt="Hello, world!"

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
      {"role":"user", "content":prompt}
    ]
)

print(response.choices[0].message.content)

Hello! How can I assist you today?


In [35]:
models = openai_client.models.list()
for model in models.data:
    print(model.id)

gpt-4-turbo-preview
gpt-3.5-turbo-0125
gpt-4-turbo
gpt-4o
gpt-4.1
gpt-4.1-nano
text-embedding-ada-002


## Using LlamaIndex

### Configuring the Opik <> LlamaIndex integration

You can use the Opik callback directly by calling:

In [36]:
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from opik.integrations.llama_index import LlamaIndexCallbackHandler


# Set up a callback handler that will automatically log all LlamaIndex operations to Opik
opik_callback_handler = LlamaIndexCallbackHandler()

# Integrating this handler into LlamaIndex's settings
Settings.callback_manager = CallbackManager([opik_callback_handler])

Now that the callback handler is configured, all traces will automatically be logged to Opik.

## Set up a simple LlamaIndex RAG pipeline

In [42]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Create LLM with the model you have access to
llm = OpenAI(model="gpt-4o")

# Set the default LLM in Settings to ensure it's used throughout
Settings.llm = llm

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=llm)

We can now query the index using the `query_engine` object:

In [41]:
response = query_engine.query("What did the author do growing up?") 
print(response)

Growing up, the author worked on writing and programming outside of school. They wrote short stories, which they described as awful, and attempted programming on an IBM 1401 using an early version of Fortran. Later, with the advent of microcomputers, the author began programming more seriously, creating simple games, a program for predicting model rocket flight heights, and a word processor.


You can now go to the Opik app to see the trace:

![LlamaIndex trace in Opik](https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/cookbook/llamaIndex_cookbook.png)

In [43]:
str(response)

'Growing up, the author worked on writing and programming outside of school. They wrote short stories, which they described as awful, and attempted programming on an IBM 1401 using an early version of Fortran. Later, with the advent of microcomputers, the author began programming more seriously, creating simple games, a program for predicting model rocket flight heights, and a word processor.'

## Prepare data for evaluation

#### Load dataset and insert into Opik

In [47]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

df = pd.read_csv("data/test.csv")
df.head()

Unnamed: 0,Question,Answer,Context
0,What was the very first programming language Paul Graham used when he began learning to program on the IBM 1401?,He used an early version of Fortran on the IBM 1401.,"The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it."
1,Which microcomputer did Paul Graham's father finally agree to buy for him around 1980?,A TRS-80.,"Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough."
2,What was the name of the startup Paul Graham co-founded that built software to create online stores?,Viaweb.,"We started a new company we called Viaweb, after the fact that our software worked via the web, and we got $10,000 in seed funding from Idelle's husband Julian."
3,Which friend of Paul Graham was the person responsible for the 1988 Internet Worm?,"Robert Tappan Morris (often referred to as ""Robert Morris"" or ""Rtm"" in the text).","I remember when my friend Robert Morris got kicked out of Cornell for writing the internet worm of 1988, I was envious that he'd found such a spectacular way to get out of grad school."
4,What was the title of the second Lisp book that Paul Graham wrote after finishing *On Lisp*?,*ANSI Common Lisp.*,"So with my unerring nose for financial opportunity, I decided to write another book on Lisp. This would be a popular book, the sort of book that could be used as a textbook. I imagined myself living frugally off the royalties and spending all my time painting. (The painting on the cover of this book, ANSI Common Lisp, is one that I painted around this time.)"


Create a dataset client


In [48]:
from opik import Opik

client = Opik()
dataset = client.get_or_create_dataset(name="Test dataset")

OPIK: Created a "Test dataset" dataset at https://www.comet.com/opik/api/v1/session/redirect/datasets/?dataset_id=0197ae7a-be36-7c9a-b522-5626f1a54b4f&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


Build the dataset items from the DataFrame, then insert them into the dataset:

In [49]:
qa_pairs = [
    {"input": row["Question"], "expected_output": row["Answer"], "context": row["Context"]} 
    for _, row in df.iterrows()
]
qa_pairs[0]

{'input': 'What was the very first programming language Paul Graham used when he began learning to program on the IBM 1401?',
 'expected_output': 'He used an early version of Fortran on the IBM 1401.',
 'context': 'The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it.'}

In [50]:

dataset.insert(qa_pairs)
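
The column-to-field mapping matters because the items are later addressed by the keys `input`, `expected_output`, and `context`. As a sketch of the same mapping with basic validation, the example below uses the standard-library `csv` module instead of pandas. `rows_to_items` and `FIELD_MAP` are hypothetical names, and the inline CSV holds one row taken from the data shown above plus one deliberately incomplete row made up for illustration.

```python
import csv
import io

# CSV column name -> dataset item field name
FIELD_MAP = {"Question": "input", "Answer": "expected_output", "Context": "context"}


def rows_to_items(rows):
    """Map CSV rows to dataset items, skipping rows with any empty field."""
    items = []
    for row in rows:
        item = {
            field: (row.get(column) or "").strip()
            for column, field in FIELD_MAP.items()
        }
        if all(item.values()):
            items.append(item)
    return items


SAMPLE_CSV = """Question,Answer,Context
Which microcomputer did Paul Graham's father finally agree to buy for him around 1980?,A TRS-80.,"The gold standard then was the Apple II, but a TRS-80 was good enough."
,An answer without a question,Some context
"""

items = rows_to_items(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(len(items), items[0]["expected_output"])
```

Dropping incomplete rows before insertion is one possible design choice; depending on the use case, raising an error instead may be preferable.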

## Evaluation

Define the LLM application

In [51]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Ensure we're using the correct model
llm = OpenAI(model="gpt-4o")
Settings.llm = llm

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(llm=llm)

Track it with Opik

In [52]:
from opik import track

@track
def my_llm_application(input: str) -> str:
    response = query_engine.query(input)
    return str(response)

Track the LLM calls

In [53]:
import openai
from opik.integrations.openai import track_openai

# Wrap the OpenAI client so its calls are logged to Opik
openai_client = track_openai(openai.OpenAI())

# Model name recorded in the experiment configuration
MODEL = "gpt-4o"

Define the evaluation task

In [54]:
def evaluation_task(x):
    return {
        "output": my_llm_application(x['input'])
    }
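
The task returns a dictionary rather than a bare string because the metrics are fed named fields: the dataset item's keys (`input`, `expected_output`, `context`) together with the task's `output`. The following is a conceptual sketch of that flow under this assumption, not Opik's implementation. `toy_evaluate`, `exact_match`, `fake_app`, and `toy_task` are hypothetical stand-ins so the example runs without any API calls.

```python
import inspect


def exact_match(output: str, expected_output: str) -> float:
    """Toy metric: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if output.strip().lower() == expected_output.strip().lower() else 0.0


def toy_evaluate(items, task, metrics):
    """Run `task` on each item and score the merged fields with each metric."""
    results = []
    for item in items:
        # Merge the dataset item with the task output, e.g. {"output": ...}
        fields = {**item, **task(item)}
        scores = {}
        for metric in metrics:
            # Pass each metric only the fields named in its signature
            wanted = inspect.signature(metric).parameters
            scores[metric.__name__] = metric(**{name: fields[name] for name in wanted})
        results.append(scores)
    return results


def fake_app(question: str) -> str:
    # Stand-in for the real query engine call
    return "A TRS-80."


def toy_task(x):
    return {"output": fake_app(x["input"])}


items = [
    {
        "input": "Which microcomputer did Paul Graham's father finally agree to buy for him around 1980?",
        "expected_output": "A TRS-80.",
        "context": "The gold standard then was the Apple II, but a TRS-80 was good enough.",
    }
]
print(toy_evaluate(items, toy_task, [exact_match]))
```

If a key were misnamed, for example `answer` instead of `output`, the lookup by parameter name in this sketch would fail, which is why the key names in the task and the dataset need to line up with what the metrics expect.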

Get the dataset created earlier


In [55]:
from opik import Opik

client = Opik()
dataset = client.get_or_create_dataset(name="Test dataset")

Define evaluation metrics

In [56]:
from opik.evaluation.metrics import (
    Hallucination,
    AnswerRelevance,
    ContextPrecision,
    ContextRecall
)


# Define the metrics
hallucination_metric = Hallucination()
answer_relevance_metric = AnswerRelevance()
context_precision_metric = ContextPrecision()
context_recall_metric = ContextRecall() 

Run evaluation

In [57]:
from opik.evaluation import evaluate

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric, answer_relevance_metric, context_precision_metric, context_recall_metric],
    experiment_config={
        "model": MODEL
    }
)

Evaluation:   0%|          | 0/5 [00:00<?, ?it/s]