<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>
<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://camo.githubusercontent.com/19d13d2afb3500141b9a802689d89745886f5b3f56d758e4725b70df14c8697a/68747470733a2f2f616e616c7974696373696e6469616d61672e636f6d2f77702d636f6e74656e742f75706c6f6164732f323032332f30382f4c6f676f2d6f6e2d77686974652d6261636b67726f756e642e706e67" width="300"/>
    </p>
</center>

<center><h1> Portkey Universal API + Arize Tracing/Evals </h1></center>

This notebook shows how to use Portkey's Universal API to call different LLMs from the same application through a single client interface. It also adds Arize tracing and evals so you can observe and evaluate each LLM call.

In [None]:
!pip install openinference-instrumentation-portkey portkey-ai arize-otel arize-phoenix "arize[Tracing]>=7.1.0"

In [None]:
# Environment variables
import getpass
import os
os.environ["PORTKEY_API_KEY"] = getpass.getpass("Enter your PORTKEY_API_KEY: ")
os.environ["ARIZE_API_KEY"] = getpass.getpass("Enter your ARIZE_API_KEY: ")
os.environ["ARIZE_SPACE_ID"] = getpass.getpass("Enter your ARIZE_SPACE_ID: ")

In [None]:
os.environ["OPENAI_VIRTUAL_KEY"] = getpass.getpass("Enter your OPENAI_VIRTUAL_KEY: ")
os.environ["CLAUDE_VIRTUAL_KEY"] = getpass.getpass("Enter your CLAUDE_VIRTUAL_KEY: ")
os.environ["GEMINI_VIRTUAL_KEY"] = getpass.getpass("Enter your GEMINI_VIRTUAL_KEY: ")

# Set Up Arize Tracing with PortkeyInstrumentor

In [None]:
# Import the Arize OTel registration helper
from arize.otel import register
# Import the OpenInference instrumentor to map Portkey traces to a standard format
from openinference.instrumentation.portkey import PortkeyInstrumentor

# Set up OTel via the convenience function
tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),  # found on the app's space settings page
    api_key=os.getenv("ARIZE_API_KEY"),  # found on the app's space settings page
    project_name="portkey-debate",  # name this whatever you like
)

# Turn on the instrumentor
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)

tracer = tracer_provider.get_tracer(__name__)

# Set Up the Application Using the Portkey Universal API for Different LLM Calls

This application runs a structured debate between two LLMs—one arguing “pro” and the other “con” on a given topic—while a third LLM acts as moderator to score each side and suggest prompt refinements. Over multiple iterations, the debate prompt is progressively improved to produce more balanced and persuasive arguments.
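The control flow described above can be sketched without any API keys. This is a minimal illustration, not the notebook's implementation: `run_debate` and the three `fake_*` functions are hypothetical stand-ins for the real LLM calls. It shows how each round's refined prompt becomes the next round's input.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases: each "LLM" is just a callable in this sketch.
Debater = Callable[[str, str], str]         # (topic, prompt) -> argument text
Moderator = Callable[[str, str, str], str]  # (topic, pro, con) -> refined prompt


def run_debate(
    topic: str,
    prompt: str,
    pro: Debater,
    con: Debater,
    moderator: Moderator,
    rounds: int = 3,
) -> List[Dict[str, str]]:
    """Run `rounds` iterations, feeding each refined prompt into the next round."""
    history = []
    for _ in range(rounds):
        pro_text = pro(topic, prompt)
        con_text = con(topic, prompt)
        new_prompt = moderator(topic, pro_text, con_text)
        history.append(
            {"prompt": prompt, "pro": pro_text, "con": con_text, "new_prompt": new_prompt}
        )
        prompt = new_prompt  # the refined prompt drives the next round
    return history


# Offline stubs (assumptions for illustration only; no model is called).
def fake_pro(topic: str, prompt: str) -> str:
    return f"PRO on {topic} given: {prompt}"


def fake_con(topic: str, prompt: str) -> str:
    return f"CON on {topic} given: {prompt}"


def fake_moderator(topic: str, pro_text: str, con_text: str) -> str:
    return f"refined prompt after {len(pro_text) + len(con_text)} characters of debate"


history = run_debate(
    "four-day workweek", "initial prompt", fake_pro, fake_con, fake_moderator, rounds=2
)
```

Swapping the stubs for functions that call real models keeps the loop unchanged, which is the property the Portkey clients in the next cell rely on.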

In [None]:
from portkey_ai import Portkey
import os

# ─── Initialize each LLM client ─────────────────────────────────────────────────
PORTKEY_API_KEY = os.getenv("PORTKEY_API_KEY")

# GPT-4o for “against” arguments
openai = Portkey(
    api_key = PORTKEY_API_KEY,
    virtual_key = os.getenv("OPENAI_VIRTUAL_KEY")
)

# Claude for “pro” arguments
claude = Portkey(
    api_key = PORTKEY_API_KEY,
    virtual_key = os.getenv("CLAUDE_VIRTUAL_KEY")
)

# Gemini for moderation & prompt refinement
gemini = Portkey(
    api_key = PORTKEY_API_KEY,
    virtual_key = os.getenv("GEMINI_VIRTUAL_KEY")
)


# ─── Single Debate Round Function ────────────────────────────────────────────────

def debate_round(topic: str, debate_prompt: str) -> dict:
    """
    1️⃣ Claude makes the PRO argument.
    2️⃣ GPT-4o makes the CON argument.
    3️⃣ Gemini scores both and suggests a refined prompt.
    Returns dict with keys: pro, con, new_prompt.
    """

    # PRO side (Claude)
    pro_resp = claude.chat.completions.create(
        messages = [
            {
                "role": "user",
                "content": (
                    f"Argue in favor of the following topic:\n\n"
                    f"**{topic}**\n\n"
                    f"Use this debate prompt as context:\n{debate_prompt}"
                )
            }
        ],
        model    = "claude-3-opus-20240229",
        max_tokens = 250
    )
    pro_text = pro_resp["choices"][0]["message"]["content"]

    # CON side (GPT-4o)
    con_resp = openai.chat.completions.create(
        model    = "gpt-4o",
        messages = [
            {
                "role": "user",
                "content": (
                    f"Argue against the following topic:\n\n"
                    f"**{topic}**\n\n"
                    f"Use this debate prompt as context:\n{debate_prompt}"
                )
            }
        ]
    )
    con_text = con_resp["choices"][0]["message"]["content"]

    # Moderator (Gemini) — score & refine
    mod_resp = gemini.chat.completions.create(
        model    = "gemini-1.5-pro",
        messages = [
            {
                "role": "user",
                "content": (
                    f"You are a debate moderator. Here are the two sides on “{topic}”:\n\n"
                    f"🟢 PRO ARGUMENT:\n{pro_text}\n\n"
                    f"🔴 CON ARGUMENT:\n{con_text}\n\n"
                    "1. Assign each argument a persuasiveness score (0–10).\n"
                    "2. Give 1–2 sentences on their main strengths/weaknesses.\n"
                    "3. Suggest an improved debate prompt that will yield more balanced arguments next round.\n"
                    "Return **only** the improved debate prompt."
                )
            }
        ]
    )
    new_prompt = mod_resp["choices"][0]["message"]["content"].strip()

    return {"pro": pro_text, "con": con_text, "new_prompt": new_prompt}


# ─── Run Multiple Rounds ─────────────────────────────────────────────────────────

if __name__ == "__main__":
    topic          = "Implementing a nationwide four-day workweek"
    initial_prompt = "Debate the pros and cons of a four-day workweek."
    rounds         = 3

    prompt = initial_prompt
    for i in range(1, rounds + 1):
        result = debate_round(topic, prompt)
        print(f"\n── Round {i} ──")
        print("🔵 PRO:\n", result["pro"])
        print("\n🔴 CON:\n", result["con"])
        print("\n🛠️  Suggested New Prompt:\n", result["new_prompt"])
        prompt = result["new_prompt"]


# Evals

Let's add some Arize evals. Specifically, we will add a toxicity eval to check that the debaters' outputs aren't racist, sexist, chauvinistic, overly biased, or otherwise toxic.

## Export Traces from Arize into Dataset

In [None]:
from datetime import datetime

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient()

print('#### Exporting your primary dataset into a dataframe.')

## TO RETRIEVE THE FIELDS FOR THIS AUTOMATICALLY, GO TO ARIZE -> PROJECTS -> YOUR PROJECT -> DOWNLOAD

primary_df = client.export_model_to_df(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    model_id='portkey-debate',
    environment=Environments.TRACING,
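    # Placeholders: supply ISO 8601 timestamps for your time range before running;
    # datetime.fromisoformat('') raises ValueError as written
    start_time=datetime.fromisoformat(''),
    end_time=datetime.fromisoformat(''),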
    # Optionally specify columns to improve query performance
    # columns=['context.span_id', 'attributes.llm.input']
)
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]
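The `start_time` and `end_time` arguments above are left blank in this notebook and need real timestamps. One way to build them is a rolling window ending now. This is a sketch under assumptions: `export_window` is a hypothetical helper, the seven-day span is arbitrary, and whether the export client expects timezone-aware or naive datetimes is not confirmed here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def export_window(
    days_back: int = 7, now: Optional[datetime] = None
) -> Tuple[datetime, datetime]:
    """Return (start_time, end_time) covering the last `days_back` days, in UTC."""
    end_time = now or datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days_back)
    return start_time, end_time


start_time, end_time = export_window(days_back=7)

# ISO 8601 strings like these are what datetime.fromisoformat() parses back.
start_iso, end_iso = start_time.isoformat(), end_time.isoformat()
```

The resulting `start_time` and `end_time` values could be passed to the export call in place of the empty placeholders, provided the traces you want fall inside that window.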

## Create Evals

In [None]:
from phoenix.evals import (
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OPENAI_API_KEY: ")

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Rails constrain the eval output to the label values defined by the template.
# They strip stray text such as ",,," or "..." from the model's response
# so that only the expected binary label is returned.
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())
toxic_classifications = llm_classify(
    dataframe=primary_df,
    template=TOXICITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # optional: have the eval LLM explain each label
)
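
To make the idea of rails concrete, here is a simplified sketch of snapping a raw model response onto an allowed label. It is not Phoenix's implementation: `snap_to_rails` and the `NOT_PARSABLE` sentinel are hypothetical names used for illustration, and the two label strings are assumed values.

```python
from typing import List

NOT_PARSABLE = "NOT_PARSABLE"  # hypothetical sentinel for responses that match no rail


def snap_to_rails(raw_output: str, rails: List[str]) -> str:
    """Map a raw LLM response onto exactly one allowed label, or the sentinel."""
    # Drop surrounding whitespace and stray punctuation such as ",,," or "...".
    cleaned = raw_output.strip().strip(".,;:!?\"'` ").lower()
    # Case-insensitive exact match against the allowed labels.
    by_lower = {rail.lower(): rail for rail in rails}
    return by_lower.get(cleaned, NOT_PARSABLE)


example_rails = ["toxic", "non-toxic"]  # assumed label values for this sketch
labels = [
    snap_to_rails("toxic...", example_rails),
    snap_to_rails(",,,Non-Toxic", example_rails),
    snap_to_rails("I think it is fine", example_rails),
]
```

Exact matching after cleanup is deliberately strict here: a free-form sentence is rejected rather than guessed at, which keeps the label column limited to known values.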

In [None]:
toxic_classifications["eval.tone_eval.label"] = toxic_classifications["label"]
toxic_classifications["eval.tone_eval.explanation"] = toxic_classifications["explanation"]
toxic_classifications = toxic_classifications.set_index(primary_df["context.span_id"])
toxic_classifications["context.span_id"] = toxic_classifications.index
toxic_classifications.head()

## Export Evals dataset to Arize

In [None]:
from arize.pandas.logger import Client

arize_client = Client(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY")
)

arize_client.log_evaluations_sync(toxic_classifications, 'portkey-debate')