<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# Automatically find the bad LLM responses in your LLM Evals with Cleanlab

This guide walks you through evaluating LLM responses captured in Phoenix with Cleanlab's Trustworthy Language Model (TLM).

TLM boosts the reliability of any LLM application by indicating when the model’s response is untrustworthy.

This guide requires a Cleanlab TLM API key. If you don't have one, you can sign up for a free trial [here](https://tlm.cleanlab.ai/).

## Install dependencies & Set environment variables

In [1]:
%%bash
pip install -q "arize-phoenix>=4.29.0"
pip install -q 'httpx<0.28'
pip install -q openai cleanlab_tlm openinference-instrumentation-openai

In [2]:
import json
import os
from getpass import getpass

import dotenv

dotenv.load_dotenv()

False

In [3]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

In [4]:
# Sign up for a free trial of Cleanlab TLM to get an API key: https://tlm.cleanlab.ai/
if not (cleanlab_tlm_api_key := os.getenv("CLEANLAB_TLM_API_KEY")):
    cleanlab_tlm_api_key = getpass("🔑 Enter your Cleanlab TLM API key: ")

os.environ["CLEANLAB_TLM_API_KEY"] = cleanlab_tlm_api_key

## Connect to Phoenix

In this example, we'll send traces to Phoenix. The same approach works with any other exporter you'd like to add.

If you need to set up an API key for Phoenix, you can do so [here](https://app.phoenix.arize.com/).

The code below will connect you to a Phoenix Cloud instance. You can also connect to [a self-hosted Phoenix instance](https://docs.arize.com/phoenix/deployment) if you'd prefer.

In [5]:
# Add Phoenix API Key for tracing
# Read the raw key from PHOENIX_API_KEY so it is not prefixed with "api_key=" twice
if not (PHOENIX_API_KEY := os.getenv("PHOENIX_API_KEY")):
    PHOENIX_API_KEY = getpass("🔑 Enter your Phoenix API Key: ")
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
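If you run Phoenix yourself rather than using Phoenix Cloud, the same environment variable can point at your own collector instead. The following is a minimal sketch to use in place of the cell above. It assumes a local instance listening on Phoenix's default port (6006) with authentication disabled; adjust the URL and headers for your deployment.

```python
import os

# Assumption: a self-hosted Phoenix instance is running locally on the default port.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"

# Assumption: authentication is disabled on this instance, so no API key header is sent.
os.environ.pop("PHOENIX_CLIENT_HEADERS", None)
```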


Now that we have Phoenix configured, we can register that instance with OpenTelemetry, which will allow us to collect traces from our application here.

In [6]:
from phoenix.otel import register

tracer_provider = register(project_name="evaluating_traces_TLM")



🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: evaluating_traces_TLM
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP
|  Transport Headers: {'api_key': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## Prepare trace dataset

To make this guide fully runnable, we'll generate a few traces and track them in Phoenix. Typically, you would already have traces captured in Phoenix and could skip to "Download trace dataset from Phoenix".

In [7]:
from openinference.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

In [8]:
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI()


# Function to generate an answer
def generate_answers(trivia_question):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a trivia master."},
            {"role": "user", "content": trivia_question},
        ],
    )
    answer = response.choices[0].message.content
    return answer


trivia_questions = [
    "What is the 3rd month of the year in alphabetical order?",
    "What is the capital of France?",
    "How many seconds are in 100 years?",
    "Alice, Bob, and Charlie went to a café. Alice paid twice as much as Bob, and Bob paid three times as much as Charlie. If the total bill was $72, how much did each person pay?",
    "When was the Declaration of Independence signed?",
]

# Generate an answer for each question and print it
answers = []
for i, question in enumerate(trivia_questions, start=1):
    answer = generate_answers(question)
    answers.append(answer)
    print(f"Question {i}: {question}")
    print(f"Answer {i}:\n{answer}\n")

print(f"Generated {len(answers)} answers and tracked them in Phoenix.")

Question 1: What is the 3rd month of the year in alphabetical order?
Answer 1:
The 3rd month of the year in alphabetical order is March.

Question 2: What is the capital of France?
Answer 2:
The capital of France is Paris.

Question 3: How many seconds are in 100 years?
Answer 3:
There are 3,153,600,000 seconds in 100 years.

Question 4: Alice, Bob, and Charlie went to a café. Alice paid twice as much as Bob, and Bob paid three times as much as Charlie. If the total bill was $72, how much did each person pay?
Answer 4:
Let's represent the amounts paid by Alice, Bob, and Charlie as A, B, and C, respectively.

From the given information:
1. A = 2B
2. B = 3C
3. A + B + C = 72

Substitute the values of A and B from equations 1 and 2 into equation 3:
2B + B + B/3 = 72
6B + 3B + B = 216
10B = 216
B = 21.6

Now, find the values of A and C:
A = 2B
A = 2 * 21.6
A = 43.2

C = B/3
C = 21.6/3
C = 7.2

Therefore, Alice paid $43.20, Bob paid $21.60, and Charlie paid $7.20.

Question 5: When was the 

## Download trace dataset from Phoenix

In [13]:
import phoenix as px

spans_df = px.Client().get_spans_dataframe(project_name="evaluating_traces_TLM")
spans_df.head()



| context.span_id | name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | ... | attributes.llm.provider | attributes.openinference.span.kind | attributes.llm.system | attributes.llm.input_messages | attributes.llm.invocation_parameters | attributes.output.value | attributes.llm.model_name | attributes.output.mime_type | attributes.llm.output_messages | attributes.input.mime_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f8327c0feaa9104b | ChatCompletion | LLM |  | 2025-03-19 20:12:42.856189+00:00 | 2025-03-19 20:12:43.540837+00:00 | OK |  | [] | f8327c0feaa9104b | 0a8e43385a7a347dfc1ab0fe452d81f1 | ... | openai | LLM | openai | [{'message.content': 'You are a trivia master.... | {"model": "gpt-3.5-turbo"} | {"id":"chatcmpl-BCu4J12RYILvdhbjKbN92vWwLtY1B"... | gpt-3.5-turbo-0125 | application/json | [{'message.content': 'March', 'message.role': ... | application/json |
| 1e500f55a0965355 | ChatCompletion | LLM |  | 2025-03-19 20:12:43.815089+00:00 | 2025-03-19 20:12:44.412861+00:00 | OK |  | [] | 1e500f55a0965355 | 0238e2633f4109b1663eb5da545e8d96 | ... | openai | LLM | openai | [{'message.content': 'You are a trivia master.... | {"model": "gpt-3.5-turbo"} | {"id":"chatcmpl-BCu4KTZj9cEsfa2B9m5xJwXoI2Zh3"... | gpt-3.5-turbo-0125 | application/json | [{'message.content': 'The capital of France is... | application/json |
| f7fd714b0743e678 | ChatCompletion | LLM |  | 2025-03-19 20:12:44.487955+00:00 | 2025-03-19 20:12:45.375315+00:00 | OK |  | [] | f7fd714b0743e678 | 6352391d98cfc603aed2366207b7c8d6 | ... | openai | LLM | openai | [{'message.content': 'You are a trivia master.... | {"model": "gpt-3.5-turbo"} | {"id":"chatcmpl-BCu4KY4rV3fYoqGmKBzyxZMx1NlVy"... | gpt-3.5-turbo-0125 | application/json | [{'message.content': 'There are 31,536,000 sec... | application/json |
| 11abc1a7eb12cb32 | ChatCompletion | LLM |  | 2025-03-19 20:12:45.511824+00:00 | 2025-03-19 20:12:48.043642+00:00 | OK |  | [] | 11abc1a7eb12cb32 | ba5065ff79d70a980b4f53cda31d3f63 | ... | openai | LLM | openai | [{'message.content': 'You are a trivia master.... | {"model": "gpt-3.5-turbo"} | {"id":"chatcmpl-BCu4LhMA2nOcsWzjHsWdkg3CPX2F1"... | gpt-3.5-turbo-0125 | application/json | [{'message.content': 'Let's denote the amount ... | application/json |
| bec3a1555fa392ff | ChatCompletion | LLM |  | 2025-03-19 20:12:48.157944+00:00 | 2025-03-19 20:12:48.956120+00:00 | OK |  | [] | bec3a1555fa392ff | dd9ccd53426bf28fa5c13bb918918388 | ... | openai | LLM | openai | [{'message.content': 'You are a trivia master.... | {"model": "gpt-3.5-turbo"} | {"id":"chatcmpl-BCu4OnVYN1uY5ZAH1Te2AG0T38I1K"... | gpt-3.5-turbo-0125 | application/json | [{'message.content': 'The Declaration of Indep... | application/json |


## Generate evaluations with TLM

Now that we have our trace dataset, we can generate evaluations for each trace using TLM. Ultimately, we want a trustworthiness score and explanation for each (prompt, response) pair in the traces.

In [14]:
from cleanlab_tlm import TLM

tlm = TLM(options={"log": ["explanation"]})

We first need to extract the prompts and responses from the individual traces. `TLM.get_trustworthiness_score()` takes a list of prompts and a matching list of responses, and returns a trustworthiness score and explanation for each pair.

**IMPORTANT:** It is essential to always include any system prompts, context, or other information that was originally provided to the LLM to generate the response. You should construct the prompt input to `get_trustworthiness_score()` in a way that is as similar as possible to the original prompt.

In [15]:
# Create a new DataFrame with input and output columns
eval_df = spans_df[["context.span_id", "attributes.input.value", "attributes.output.value"]].copy()
eval_df.set_index("context.span_id", inplace=True)


# Combine system and user prompts from the traces
def get_prompt(input_value):
    if isinstance(input_value, str):
        input_value = json.loads(input_value)
    system_prompt = input_value["messages"][0]["content"]
    user_prompt = input_value["messages"][1]["content"]
    return system_prompt + "\n" + user_prompt


# Get the responses from the traces
def get_response(output_value):
    if isinstance(output_value, str):
        output_value = json.loads(output_value)
    return output_value["choices"][0]["message"]["content"]


# Create a list of prompts and associated responses
prompts = [get_prompt(input_value) for input_value in eval_df["attributes.input.value"]]
responses = [get_response(output_value) for output_value in eval_df["attributes.output.value"]]

eval_df["prompt"] = prompts
eval_df["response"] = responses
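The `get_prompt` helper above assumes every trace contains exactly one system message followed by one user message. For traces with a different shape (no system prompt, or several turns), a more general variant can join every message in order. This is a minimal sketch; `build_prompt` is a hypothetical helper name, and it assumes the input has the same `{"messages": [...]}` structure as the traces in this guide.

```python
import json


def build_prompt(input_value):
    """Join the text content of every message in a chat request, in order."""
    if isinstance(input_value, str):
        input_value = json.loads(input_value)
    parts = []
    for message in input_value.get("messages", []):
        content = message.get("content")
        # Skip messages without plain-text content (for example, tool calls).
        if isinstance(content, str) and content:
            parts.append(content)
    return "\n".join(parts)


# Example input shaped like `attributes.input.value` in the traces above
example = json.dumps(
    {
        "messages": [
            {"role": "system", "content": "You are a trivia master."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
    }
)
print(build_prompt(example))
```

For the two-message traces in this guide, this produces the same string as `get_prompt`.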

Now that we have all of the prompts and responses, we can evaluate each pair using TLM.

In [16]:
# Evaluate each of the prompt, response pairs using TLM
evaluations = tlm.get_trustworthiness_score(prompts, responses)

# Extract the trustworthiness scores and explanations from the evaluations
trust_scores = [entry["trustworthiness_score"] for entry in evaluations]
explanations = [entry["log"]["explanation"] for entry in evaluations]

# Add the trust scores and explanations to the DataFrame
eval_df["score"] = trust_scores
eval_df["explanation"] = explanations

# Display the new DataFrame
eval_df.head()

Querying TLM... 100%|██████████|


| context.span_id | attributes.input.value | attributes.output.value | prompt | response | score | explanation |
| --- | --- | --- | --- | --- | --- | --- |
| f8327c0feaa9104b | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4J12RYILvdhbjKbN92vWwLtY1B"... | You are a trivia master.\nWhat is the 3rd mont... | March | 0.037318 | The user is asking for the third month of the ... |
| 1e500f55a0965355 | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4KTZj9cEsfa2B9m5xJwXoI2Zh3"... | You are a trivia master.\nWhat is the capital ... | The capital of France is Paris. | 0.98743 | Did not find a reason to doubt trustworthiness. |
| f7fd714b0743e678 | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4KY4rV3fYoqGmKBzyxZMx1NlVy"... | You are a trivia master.\nHow many seconds are... | There are 31,536,000 seconds in a year (60 sec... | 0.260966 | The proposed response calculates the number of... |
| 11abc1a7eb12cb32 | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4LhMA2nOcsWzjHsWdkg3CPX2F1"... | You are a trivia master.\nAlice, Bob, and Char... | Let's denote the amount Charlie paid as C. \n\... | 0.380158 | This response is untrustworthy due to lack of ... |
| bec3a1555fa392ff | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4OnVYN1uY5ZAH1Te2AG0T38I1K"... | You are a trivia master.\nWhen was the Declara... | The Declaration of Independence was approved b... | 0.945734 | Did not find a reason to doubt trustworthiness. |
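The extraction step above indexes `entry["log"]["explanation"]` directly, which raises an error if any entry lacks a score or log. A defensive variant is sketched below. `extract_results` is a hypothetical helper, and the shape of an incomplete entry is an assumption made for illustration, not documented TLM behavior.

```python
def extract_results(evaluations):
    """Return parallel lists of scores and explanations, using None for gaps."""
    scores, explanations = [], []
    for entry in evaluations:
        entry = entry or {}
        scores.append(entry.get("trustworthiness_score"))
        explanations.append((entry.get("log") or {}).get("explanation"))
    return scores, explanations


# Hypothetical entries shaped like the TLM results used above
sample = [
    {
        "trustworthiness_score": 0.98743,
        "log": {"explanation": "Did not find a reason to doubt trustworthiness."},
    },
    {"trustworthiness_score": None},  # assumed shape of an incomplete evaluation
]
scores, explanations = extract_results(sample)
print(scores)
print(explanations)
```

Rows with a `None` score can then be dropped or retried before the results are added to the DataFrame.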


We now have a DataFrame with four added columns:
- `prompt`: the combined system and user prompt from the trace
- `response`: the LLM response from the trace
- `score`: the trustworthiness score from TLM
- `explanation`: the explanation from TLM

Let's sort our traces by the `score` column to quickly find untrustworthy LLM responses.

In [17]:
sorted_df = eval_df.sort_values(by="score", ascending=True).head()
sorted_df

| context.span_id | attributes.input.value | attributes.output.value | prompt | response | score | explanation |
| --- | --- | --- | --- | --- | --- | --- |
| a133162d3623131d | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu68UEJaw63CvUtbwRWLbrkxDQ2V"... | You are a trivia master.\nWhat is the 3rd mont... | The 3rd month of the year in alphabetical orde... | 0.034957 | The proposed response states that the 3rd mont... |
| f8327c0feaa9104b | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4J12RYILvdhbjKbN92vWwLtY1B"... | You are a trivia master.\nWhat is the 3rd mont... | March | 0.037318 | The user is asking for the third month of the ... |
| f7fd714b0743e678 | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4KY4rV3fYoqGmKBzyxZMx1NlVy"... | You are a trivia master.\nHow many seconds are... | There are 31,536,000 seconds in a year (60 sec... | 0.260966 | The proposed response calculates the number of... |
| be3d0fe1e5d52a2b | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu699koonzGxFxqmSPKhQRQi31hf"... | You are a trivia master.\nHow many seconds are... | There are 3,153,600,000 seconds in 100 years. | 0.359335 | To calculate the number of seconds in 100 year... |
| 11abc1a7eb12cb32 | {"messages": [{"role": "system", "content": "Y... | {"id":"chatcmpl-BCu4LhMA2nOcsWzjHsWdkg3CPX2F1"... | You are a trivia master.\nAlice, Bob, and Char... | Let's denote the amount Charlie paid as C. \n\... | 0.380158 | This response is untrustworthy due to lack of ... |
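Beyond sorting, you may want to flag every response whose score falls below a cutoff. The sketch below uses the scores shown in the output above in plain Python. The threshold of 0.5 is a hypothetical value chosen for illustration, not a recommendation from Cleanlab or Phoenix, and `flag_untrustworthy` is a hypothetical helper.

```python
# Scores copied from the sorted output above, keyed by span ID
scores = {
    "a133162d3623131d": 0.034957,
    "f8327c0feaa9104b": 0.037318,
    "f7fd714b0743e678": 0.260966,
    "be3d0fe1e5d52a2b": 0.359335,
    "11abc1a7eb12cb32": 0.380158,
}

THRESHOLD = 0.5  # hypothetical cutoff; tune it for your application


def flag_untrustworthy(scores, threshold):
    """Return span IDs whose score is below the threshold, lowest score first."""
    flagged = [span_id for span_id, score in scores.items() if score < threshold]
    return sorted(flagged, key=scores.get)


print(flag_untrustworthy(scores, THRESHOLD))
```

With the DataFrame itself, the equivalent filter is `eval_df[eval_df["score"] < THRESHOLD]`.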


In [18]:
# Let's look at the least trustworthy trace.
print("Prompt: ", sorted_df.iloc[0]["prompt"], "\n")
print("OpenAI Response: ", sorted_df.iloc[0]["response"], "\n")
print("TLM Trust Score: ", sorted_df.iloc[0]["score"], "\n")
print("TLM Explanation: ", sorted_df.iloc[0]["explanation"])

Prompt:  You are a trivia master.
What is the 3rd month of the year in alphabetical order? 

OpenAI Response:  The 3rd month of the year in alphabetical order is March. 

TLM Trust Score:  0.03495703165124102 

TLM Explanation:  The proposed response states that the 3rd month of the year in alphabetical order is March. To determine if this is correct, we first need to list the months of the year in alphabetical order: 

1. April
2. August
3. December
4. February
5. January
6. July
7. June
8. March
9. May
10. November
11. October
12. September

When we look at this list, we can see that March is actually the 8th month in alphabetical order, not the 3rd. The 3rd month in alphabetical order is December. Therefore, the proposed response is incorrect. 
This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
December.
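The claim in TLM's explanation is easy to check independently. This short sketch sorts the English month names alphabetically and looks up both the third entry and March's position.

```python
# English month names in calendar order
months = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

alphabetical = sorted(months)

third_month = alphabetical[2]                     # third month alphabetically
march_position = alphabetical.index("March") + 1  # 1-based position of March

print(third_month)
print(march_position)
```

The result agrees with the explanation: December is third in alphabetical order, and March is eighth.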


#### Awesome! TLM was able to identify multiple traces that contained incorrect answers from OpenAI.

Let's upload the `score` and `explanation` columns to Phoenix.

## Upload evaluations to Phoenix

Our `eval_df` is indexed by `context.span_id` and has columns for the evaluation results. The span ID is what allows Phoenix to connect each evaluation to the correct trace. Phoenix will also automatically look for columns named `score` and `explanation` to display in the UI.

In [19]:
eval_df["score"] = eval_df["score"].astype(float)
eval_df["explanation"] = eval_df["explanation"].astype(str)

In [20]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(SpanEvaluations(eval_name="Trustworthiness", dataframe=eval_df))



You should now see evaluations in the Phoenix UI!

From here you can continue collecting and evaluating traces!