<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Model Comparison for a Text Extraction Service</h1>

Imagine you're deploying a service that condenses emails into concise summaries. One challenge of using LLMs for summarization is that even the best models can miscategorize key details, or miss those details entirely.

In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that accurately summarizes your emails. You will:

- Upload a **dataset** of **examples** containing emails to Phoenix
- Define an **experiment task** that extracts and formats the key details from those emails
- Devise an **evaluator** measuring Jaro-Winkler Similarity
- Run **experiments** to iterate on your prompt template and to compare the summaries produced by different LLMs

⚠️ This tutorial requires an OpenAI API key.

Let's get started!


# Install Dependencies

In [1]:
!pip install arize-phoenix langchain langchain-core langchain-community langchain-benchmarks langchain-openai nest_asyncio jarowinkler

Collecting arize-phoenix
  Downloading arize_phoenix-4.6.3-py3-none-any.whl (1.4 MB)
Collecting langchain
  Downloading langchain-0.2.6-py3-none-any.whl (975 kB)
Collecting langchain-core
  Downloading langchain_core-0.2.11-py3-none-any.whl (337 kB)
Collecting langchain-community
  Downloading langchain_community-0.2.6-py3-none-any.whl (2.2 MB)
Collecting langchain-benchmarks
  Downloading langchain_benchmarks-0.0.12-py3-none-any.whl (67 kB)

# Set Up OpenAI API Key

In [2]:
import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

🔑 Enter your OpenAI API key: ··········


# Import Modules

In [3]:
import json
import tempfile
from datetime import datetime, timezone

import jarowinkler
import nest_asyncio
import pandas as pd
import phoenix as px
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain_benchmarks import download_public_dataset, registry
from langchain_openai.chat_models import ChatOpenAI
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from phoenix.experiments import evaluate_experiment, run_experiment
from phoenix.experiments.types import Example

nest_asyncio.apply()

# Launch Phoenix

First, we set up our instance of Phoenix and the instrumentors that capture traces from our agent. We'll use both the LangChain and OpenAI auto-instrumentors so that the LangChain steps and the underlying OpenAI calls are both traced.

In [4]:
px.launch_app()

🌍 To view the Phoenix app in your browser, visit https://r6599g08ap1-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x7fcb3af2fa90>

# Instrument LangChain and OpenAI

In [5]:
endpoint = "http://127.0.0.1:4317"
(tracer_provider := TracerProvider()).add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint))
)

LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Experiments in Phoenix

Experiments in Phoenix are made up of three elements: a dataset, a task, and an evaluator. The dataset is a collection of inputs and expected outputs that we'll use for evaluation. The task is an operation performed on each input. Finally, the evaluator compares each result against its expected output.

For this example, here's what each looks like:
*   Dataset - a dataframe of emails to analyze, and the expected output for our agent
*   Task - a LangChain agent that extracts key info from our input emails. The result of this task will then be compared against the expected output
*   Eval - Jaro-Winkler similarity between the task's output and the expected output
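
To make the three roles concrete, here is a minimal, library-free sketch of how a dataset, a task, and an evaluator fit together. Everything in it (`toy_dataset`, `toy_task`, `exact_match`, `run_toy_experiment`) is a hypothetical stand-in written for illustration; it is not how Phoenix implements experiments.

```python
from statistics import mean
from typing import Any

# Hypothetical toy dataset: each example pairs an input with an expected output.
toy_dataset = [
    {"input": "Hello from Alice", "expected": "alice"},
    {"input": "Regards, Bob", "expected": "bob"},
    {"input": "Thanks, Carol Smith", "expected": "carol"},
]


def toy_task(text: str) -> str:
    """Stand-in for the LLM call: return the last word, lowercased."""
    return text.split()[-1].lower()


def exact_match(output: Any, expected: Any) -> float:
    """Evaluator: 1.0 when the output equals the expected value, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_toy_experiment(dataset, task, evaluator) -> list:
    """Run the task on every input and score each output against its expected value."""
    scores = []
    for example in dataset:
        output = task(example["input"])
        scores.append(evaluator(output, example["expected"]))
    return scores


scores = run_toy_experiment(toy_dataset, toy_task, exact_match)
print(scores, round(mean(scores), 3))
```

Swapping in a different task while keeping the dataset and evaluator fixed is what makes two runs comparable, which is the pattern the rest of this tutorial follows.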



# Download JSON Data

We've prepared some example emails and expected responses that we can use to evaluate our two models. Let's download those and save them to a temporary file.

In [6]:
dataset_name = "Email Extraction"

with tempfile.NamedTemporaryFile(suffix=".json") as f:
    download_public_dataset(registry[dataset_name].dataset_id, path=f.name)
    df = pd.read_json(f.name)[["inputs", "outputs"]]
df = df.sample(10, random_state=42)
df

Fetching examples...


  0%|          | 0/42 [00:00<?, ?it/s]

Done fetching examples.


Unnamed: 0,inputs,outputs
25,{'input': '**iCloud** # Failed to atte...,"{'output': {'tone': 'negative', 'topic': 'iClo..."
13,{'input': '--- | We Passed the Stop Dang...,"{'output': {'tone': 'positive', 'topic': 'Stop..."
8,{'input': '#### Where sustainability meets st...,"{'output': {'tone': 'positive', 'topic': 'Prom..."
26,"{'input': '| | | | | | Hello Jacob, 👋  ...","{'output': {'tone': 'positive', 'topic': 'Busi..."
4,{'input': 'Some travelers plan ahead; others p...,"{'output': {'tone': 'positive', 'topic': 'Trav..."
39,{'input': '--- | Costco --- ANSWE...,"{'output': {'tone': 'positive', 'topic': 'Invi..."
19,"{'input': 'Dear Jacob, Your opinion matte...","{'output': {'tone': 'positive', 'topic': 'Invi..."
29,{'input': '_`I Am looking for a possible partn...,"{'output': {'tone': 'positive', 'topic': 'Inve..."
30,{'input': 'It's always been a hassle to get mo...,"{'output': {'tone': 'positive', 'topic': 'Busi..."
6,{'input': 'Your exclusive retreat at The Venet...,"{'output': {'tone': 'positive', 'topic': 'Excl..."


# Upload Dataset to Phoenix

Next, we'll upload our dataset to Phoenix. Once this is present in Phoenix, we can run multiple experiments with different models on this one dataset, and compare their performance.

In [7]:
dataset = px.Client().upload_dataset(
    dataset_name=f"{dataset_name}{datetime.now(timezone.utc)}",
    inputs=df.inputs,
    outputs=df.outputs.map(lambda obj: obj["output"]),
)

📤 Uploading dataset...
💾 Examples uploaded: https://r6599g08ap2-496ff2e9c6d22116-6006-colab.googleusercontent.com/datasets/RGF0YXNldDox/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246MQ==
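
The `outputs` column is unwrapped with `obj["output"]` before upload, so each example's expected value is a flat dictionary of extracted fields. A small sketch of that reshaping, using a hypothetical row with shortened placeholder values (not real dataset contents):

```python
# Hypothetical row shaped like the dataframe above; the values are
# shortened placeholders, not real dataset contents.
row = {
    "inputs": {"input": "Dear Jacob, your payment failed"},
    "outputs": {"output": {"tone": "negative", "topic": "Failed payment"}},
}


def unwrap_output(obj: dict) -> dict:
    """Same idea as the lambda passed to upload_dataset: drop the outer 'output' key."""
    return obj["output"]


example_input = row["inputs"]  # stays wrapped: {"input": "..."}
example_expected = unwrap_output(row["outputs"])  # flat dict of extracted fields

print(example_input)
print(example_expected)
```

Keeping the expected value flat means it can later be compared directly with the flat dictionary that the extraction task returns.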


# Set Up LangChain

Now we'll set up our LangChain agent. This is a straightforward agent that calls our specified model and formats the response as JSON.

In [8]:
model = "gpt-4o"

llm = ChatOpenAI(model=model).bind_functions(
    functions=[registry[dataset_name].schema],
    function_call=registry[dataset_name].schema.schema()["title"],
)
output_parser = JsonOutputFunctionsParser()
extraction_chain = registry[dataset_name].instructions | llm | output_parser

# Define Task Function

Next, we need to define a Task for our experiment to use.

In [9]:
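def task(example: Example) -> dict:
    # Phoenix binds task arguments by parameter name: `example` receives the whole
    # Example, while an unrecognized single name receives only the example's input.
    return extraction_chain.invoke(example.input)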

# Check that the task is working by running it on at least one Example

In [10]:
first_key = next(iter(dataset.examples))
first_example = dataset.examples[first_key]

task(first_example)

{'sender': 'The iCloud Team',
 'sender_address': '6101 Long Prairie Rd, Ste 744 #511, Flower Mound, TX, 75028',
 'action_items': ['Update your payment information'],
 'topic': 'Failed payment attempt for iCloud storage subscription renewal',
 'tone': 'negative'}
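
A task that works when called by hand with an `Example` can still behave differently inside `run_experiment`, because the runner chooses what to pass based on the task's parameter names. The sketch below illustrates that kind of name-based binding with `inspect.signature`. It is an assumption-labeled illustration: `ToyExample`, `call_task`, and the exact mapping are hypothetical and only mirror the parameter names Phoenix documents for tasks (`input`, `expected`, `metadata`, `example`); it is not Phoenix's code.

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToyExample:
    """Hypothetical stand-in for a dataset example."""

    input: dict
    output: dict
    metadata: dict = field(default_factory=dict)


def call_task(task: Callable[..., Any], example: ToyExample) -> Any:
    """Call `task` with arguments chosen by parameter name (illustrative only)."""
    available = {
        "input": example.input,
        "expected": example.output,
        "metadata": example.metadata,
        "example": example,
    }
    names = list(inspect.signature(task).parameters)
    if len(names) == 1 and names[0] not in available:
        # Assumed fallback: a lone parameter with an unrecognized name gets the input.
        return task(example.input)
    return task(**{name: available[name] for name in names})


toy = ToyExample(input={"input": "email text"}, output={"tone": "positive"})


def by_example(example):
    return example.input["input"]


def by_input(input):
    return input["input"]


def by_other_name(ex):
    # Reports what was actually passed in.
    return type(ex).__name__


print(call_task(by_example, toy))
print(call_task(by_input, toy))
print(call_task(by_other_name, toy))
```

Under these assumptions, a parameter named anything other than the recognized names receives a plain `dict`, so attribute access such as `.input` would fail on it.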

# Run Experiment

Now we're ready to run our experiment. We'll specify our dataset and task, and generate responses for us to evaluate in the next step.

In [11]:
experiment = run_experiment(dataset, task)

🧪 Experiment started.
📺 View dataset experiments: https://r6599g08ap3-496ff2e9c6d22116-6006-colab.googleusercontent.com/datasets/RGF0YXNldDox/experiments
🔗 View this experiment: https://r6599g08ap3-496ff2e9c6d22116-6006-colab.googleusercontent.com/datasets/RGF0YXNldDox/compare?experimentId=RXhwZXJpbWVudDox


running tasks |          | 0/10 (0.0%) | ⏳ 00:00<? | ?it/s


# Define Evaluator

Finally, we need to define our evaluation function. Here we'll use a Jaro-Winkler similarity function that scores how similar the output and expected text are. [Jaro-Winkler similarity](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) is a string metric that ranges from 0 (no similarity) to 1 (exact match) and gives extra weight to strings that share a common prefix.

In [None]:
def jarowinkler_similarity(output, expected) -> float:
    return jarowinkler.jarowinkler_similarity(
        json.dumps(output, sort_keys=True),
        json.dumps(expected, sort_keys=True),
    )
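
To show what the score measures, here is a plain-Python sketch of the textbook Jaro-Winkler formula. It is for illustration only: the `jarowinkler` package used above is an optimized implementation and may differ in edge cases. The 0.1 prefix weight, the 4-character prefix cap, and the 0.7 boost threshold are the commonly cited defaults and are assumptions here.

```python
import json


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1, boost_threshold: float = 0.7) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611

# sort_keys=True makes the serialized strings independent of key order.
a = {"tone": "negative", "topic": "billing"}
b = {"topic": "billing", "tone": "negative"}
same = jaro_winkler(json.dumps(a, sort_keys=True), json.dumps(b, sort_keys=True))
print(same)  # 1.0
```

Because the evaluator compares serialized JSON, the score reflects character-level overlap of the whole record and not field-by-field correctness, so it is best read as a rough similarity signal.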

# Evaluate Experiment

In [None]:
evaluate_experiment(experiment, jarowinkler_similarity)

Now we have scores on how well GPT-4o does at extracting email facts. This is helpful, but doesn't mean much on its own. Let's compare it against another model.

# Re-run with GPT 3.5 Turbo and Compare Results

To compare results with another model, we simply need to redefine our task. Our dataset and evaluator can stay the same.

In [None]:
model = "gpt-3.5-turbo"

llm = ChatOpenAI(model=model).bind_functions(
    functions=[registry[dataset_name].schema],
    function_call=registry[dataset_name].schema.schema()["title"],
)
extraction_chain = registry[dataset_name].instructions | llm | output_parser

In [None]:
def task(example: Example) -> dict:
    # Named `example` so that Phoenix passes the whole Example (see the first task).
    return extraction_chain.invoke(example.input)

In [None]:
experiment = run_experiment(dataset, task)

In [None]:
evaluate_experiment(experiment, jarowinkler_similarity)

# View results

Now if you check your Phoenix experiment, you can compare Jaro-Winkler scores on a per-example basis and view aggregate performance for each model. The screenshot below shows results from GPT-4o on the left and GPT-3.5 Turbo on the far right. The higher the `jarowinkler_similarity` score, the closer the model's output is to the expected value.

You should see that GPT-4o outperforms its older cousin.

![picture](https://storage.cloud.google.com/arize-assets/phoenix/assets/images/email-extraction-example.png)

From here you could try out different models or iterate on your prompt, then run the same experiment with a modified Task to compare results.
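
If you collect per-example scores yourself, the same aggregate comparison can be sketched in a few lines. The model names and scores below are placeholders for illustration only; they are not results from this tutorial, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# Placeholder scores for illustration only; these are NOT results from this tutorial.
scores_by_model = {
    "model-a": [0.91, 0.84, 0.88],
    "model-b": [0.79, 0.86, 0.74],
}


def summarize(scores: dict) -> dict:
    """Return the mean score per model, rounded to three decimal places."""
    return {name: round(mean(values), 3) for name, values in scores.items()}


summary = summarize(scores_by_model)
best = max(summary, key=summary.get)
print(summary)
print("highest mean score:", best)
```

A mean hides per-example differences, so it is worth pairing an aggregate like this with the per-example view in the Phoenix UI.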