# LLM Evaluation with Arize AI, OpenAI, and Phoenix

This notebook sends a batch of prompts to OpenAI's `gpt-3.5-turbo` model, grades each response against a reference answer using GPT-4 through Phoenix's `QAEvaluator`, and traces the OpenAI calls to Arize AI over OpenTelemetry.

📦 Installation

In [None]:
%pip install openai arize openinference-instrumentation-openai arize-phoenix-evals pandas arize-otel

🧩 Imports and Configuration


In [1]:
# Standard library
import logging
import os
import time

# Third-party
import openai
import pandas as pd
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.evals import OpenAIModel, QAEvaluator, run_evals

# Local settings module (config.py)
from config import openai_api_key, arizeai_api_key, arizeai_proj_name, arizeai_spid
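
The imports cell pulls four credentials from a local `config` module that the notebook does not show. Below is a minimal sketch of what such a module might contain, assuming the values come from environment variables. The variable names (`OPENAI_API_KEY`, `ARIZE_API_KEY`, `ARIZE_PROJECT_NAME`, `ARIZE_SPACE_ID`) and the helpers `load_setting` and `load_config` are hypothetical, not part of the original notebook.

```python
# config.py (hypothetical sketch): read credentials from environment variables
import os


def load_setting(name, default=None):
    """Return environment variable `name`, falling back to `default`.

    Raises RuntimeError when the variable is unset or empty and no default is given.
    """
    value = os.environ.get(name) or default
    if value is None:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


def load_config():
    """Collect the four values the notebook imports from `config`."""
    return {
        "openai_api_key": load_setting("OPENAI_API_KEY"),
        "arizeai_api_key": load_setting("ARIZE_API_KEY"),
        # Assumed default; the environment variable name is illustrative only
        "arizeai_proj_name": load_setting("ARIZE_PROJECT_NAME", "My LLM Project"),
        "arizeai_spid": load_setting("ARIZE_SPACE_ID"),
    }
```

For the notebook's `from config import ...` line to work, a real `config.py` would also need module-level assignments such as `openai_api_key = load_setting("OPENAI_API_KEY")`. Keeping keys out of the notebook itself avoids committing them by accident.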

🔧 Logging and OpenAI Setup

In [7]:
# Setup logging
logging.basicConfig(level=logging.INFO)

# Set the module-level OpenAI API key (the client created in the next cell also receives the key explicitly)
openai.api_key = openai_api_key


🔁 Arize Tracer Registration

In [8]:
# Register the tracer provider with Arize
tracer_provider = register(
    space_id=arizeai_spid,
    api_key=arizeai_api_key,
    project_name=arizeai_proj_name
)

# Instrument the OpenAI SDK so its calls are exported as spans to Arize
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Create the OpenAI client used for the batch requests
client = openai.OpenAI(api_key=openai_api_key)



🔭 OpenTelemetry Tracing Details 🔭
|  Arize Project: My LLM Project
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: otlp.arize.com
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'api_key': '****', 'arize-space-id': '****', 'space_id': '****', 'arize-interface': '****', 'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



📝 Prompt Data Setup

In [9]:
prompt_data = [
    {"input": "What is the capital of France?", "reference": "Paris"},
    {"input": "Who wrote the novel '1984'?", "reference": "George Orwell"},
    {"input": "What is NYC's most famous landmark?", "reference": "Statue of Liberty"},
    {"input": "What language is primarily spoken in Brazil?", "reference": "Portuguese"},
    {"input": "What is the tallest mountain in the world?", "reference": "Mount Everest"}
]

🤖 Batch Prompt Execution

In [10]:
results = []

# Run all prompts and collect responses
for item in prompt_data:
    prompt = item["input"]
    reference = item["reference"]
    try:
        logging.info(f"Sending prompt: {prompt}")
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=50
        )
        response_text = response.choices[0].message.content.strip()
        logging.info(f"Response: {response_text}")
        results.append({
            "input": prompt,
            "output": response_text,
            "reference": reference,
            "timestamp": time.time()
        })
    except Exception as e:
        logging.error(f"Error during OpenAI request for prompt '{prompt}': {e}")
        results.append({
            "input": prompt,
            "output": "Error",
            "reference": reference,
            "timestamp": time.time()
        })


INFO:root:Sending prompt: What is the capital of France?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Response: The capital of France is Paris.
INFO:root:Sending prompt: Who wrote the novel '1984'?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Response: George Orwell wrote the novel '1984'.
INFO:root:Sending prompt: What is NYC's most famous landmark?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Response: The most famous landmark in New York City is the Statue of Liberty.
INFO:root:Sending prompt: What language is primarily spoken in Brazil?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Response: The primary language spoken in Brazil is Portuguese.
INFO:root:Sending prompt: What is the tallest mountain in the world?
INFO:httpx:HTTP Request: POST https://api.openai.com
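
The loop above records the literal string `"Error"` as the output whenever a request fails, so a failed row would later be graded as if it were a model answer. One way to avoid that is to separate failures before building the DataFrame. This is a minimal sketch; `split_results` and `ERROR_SENTINEL` are hypothetical names, and the sentinel is assumed to match the placeholder written by the `except` branch.

```python
ERROR_SENTINEL = "Error"  # placeholder output written when a request fails


def split_results(results):
    """Partition result dicts into (succeeded, failed) by their 'output' value."""
    succeeded = [r for r in results if r["output"] != ERROR_SENTINEL]
    failed = [r for r in results if r["output"] == ERROR_SENTINEL]
    return succeeded, failed


if __name__ == "__main__":
    sample = [
        {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
        {"input": "Who wrote the novel '1984'?", "output": "Error"},
    ]
    ok, bad = split_results(sample)
    print(f"{len(ok)} succeeded, {len(bad)} failed")
```

In the notebook this could be used as `succeeded, failed = split_results(results)` followed by `pd.DataFrame(succeeded)`. A genuine answer that happens to be exactly `"Error"` would be misclassified, so a dedicated status field would be more robust than a sentinel string.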

📊 Create DataFrame

In [11]:
# Convert results to DataFrame
df = pd.DataFrame(results)
df

                                          input                                             output          reference     timestamp
0                What is the capital of France?                    The capital of France is Paris.              Paris  1748353000.0
1                   Who wrote the novel '1984'?              George Orwell wrote the novel '1984'.      George Orwell  1748353000.0
2           What is NYC's most famous landmark?  The most famous landmark in New York City is t...  Statue of Liberty  1748353000.0
3  What language is primarily spoken in Brazil?  The primary language spoken in Brazil is Portu...         Portuguese  1748353000.0
4    What is the tallest mountain in the world?  Mount Everest, standing at 29,032 feet (8,848 ...      Mount Everest  1748353000.0
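
The `timestamp` column holds raw `time.time()` values, which are seconds since the Unix epoch and hard to read at a glance. A small sketch of converting them to ISO 8601 strings in UTC follows; `to_utc_iso` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc_iso(epoch_seconds):
    """Convert seconds since the Unix epoch to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(to_utc_iso(0))  # the epoch itself
```

On the DataFrame this could be applied with `df["timestamp"].map(to_utc_iso)`, or pandas can do the conversion directly with `pd.to_datetime(df["timestamp"], unit="s", utc=True)`.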


📈 Run QA Evaluation with GPT-4 

In [14]:
# Run QA Evaluation with GPT-4
eval_model = OpenAIModel(model="gpt-4", api_key=openai_api_key)
qa_evaluator = QAEvaluator(eval_model)

qa_eval_dfs = run_evals(
    dataframe=df,
    evaluators=[qa_evaluator],
    provide_explanation=True
)

# Extract and show results
qa_eval_df = qa_eval_dfs[0]
print(qa_eval_df.columns)
print(qa_eval_df.head())




run_evals |          | 0/5 (0.0%) | ⏳ 00:00<? | ?it/s

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Index(['label', 'score', 'explanation'], dtype='object')
     label  score                                        explanation
0  correct      1  The question asks for the capital of France. T...
1  correct      1  The question asks who wrote the novel '1984'. ...
2  correct      1  The answer correctly identifies the Statue of ...
3  correct      1  The question asks about the primary language s...
4  correct      1  The question asks for the tallest mountain in ...
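
With a `label` and `score` per row, the run can be condensed into label counts and a mean score. The sketch below works on plain dictionaries, for example the result of `qa_eval_df.to_dict("records")`. `summarize_eval` is a hypothetical helper, and it assumes each score is either a number or `None`.

```python
from collections import Counter


def summarize_eval(records):
    """Return label counts and the mean score for a list of {'label', 'score'} dicts."""
    label_counts = Counter(r["label"] for r in records)
    # Skip rows without a score so they do not distort the mean
    scores = [r["score"] for r in records if r["score"] is not None]
    mean_score = sum(scores) / len(scores) if scores else None
    return {"label_counts": dict(label_counts), "mean_score": mean_score}


if __name__ == "__main__":
    sample = [{"label": "correct", "score": 1}, {"label": "incorrect", "score": 0}]
    print(summarize_eval(sample))
```

Five easy factual prompts all graded `correct` is a smoke test rather than a benchmark; a summary like this becomes more informative once the prompt set includes cases the model can get wrong.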


🧾 Combine and Display Results

In [17]:
# Combine original columns with evaluation results
combined_df = pd.concat(
    [df[["input", "output", "reference"]].reset_index(drop=True), qa_eval_df.reset_index(drop=True)],
    axis=1
)

for index, row in combined_df.iterrows():
    print(f"\n--- Prompt {index + 1} ---")
    print(f"Input      : {row['input']}")
    print(f"Output     : {row['output']}")
    print(f"Reference  : {row['reference']}")
    print(f"Label      : {row['label']}")
    print(f"Score      : {row['score']}")
    print(f"Explanation: {row['explanation']}")



--- Prompt 1 ---
Input      : What is the capital of France?
Output     : The capital of France is Paris.
Reference  : Paris
Label      : correct
Score      : 1
Explanation: The question asks for the capital of France. The reference text provides the answer as 'Paris'. The given answer states 'The capital of France is Paris.' which matches the information provided in the reference text. Therefore, the answer is correct.

--- Prompt 2 ---
Input      : Who wrote the novel '1984'?
Output     : George Orwell wrote the novel '1984'.
Reference  : George Orwell
Label      : correct
Score      : 1
Explanation: The question asks who wrote the novel '1984'. The reference text provides the name 'George Orwell'. The answer states that 'George Orwell wrote the novel '1984', which is in line with the information provided in the reference text. Therefore, the answer is correct.

--- Prompt 3 ---
Input      : What is NYC's most famous landmark?
Output     : The most famous landmark in New York City i