<center>
    <p style="text-align:center">
        <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>
        <br>
        <a href="https://arize.com/docs/ax">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Arize Python SDK Quickstart Guide</h1>

This notebook contains the runnable code examples for the three-part Arize AX Python SDK tutorial. It walks through the complete workflow:

1. Instrument a multi-agent application so that it sends traces to Arize AX.
2. Evaluate those traces to measure performance.
3. Create datasets and run experiments to systematically improve the application.

Use this notebook alongside the [documentation](https://arize.com/docs/api-clients/python/version-8/tutorial/get-started-tracing-with-arize-sdk). Run each cell as you follow the guides to see the SDK in action and build a working financial research agent from scratch.

Run the notebook sequentially. Each section builds on the previous one: the traces you send in the first section are evaluated in the second, and those evaluations guide the experiments in the third.

# Send Traces to Arize AX

In [None]:
!pip install --pre "arize[otel]" crewai crewai-tools openinference-instrumentation-crewai openai openinference-instrumentation-openai arize-phoenix-evals


In [None]:
import os

os.environ["ARIZE_API_KEY"] = "arize-api-key"
os.environ["ARIZE_SPACE_ID"] = "arize-space-id"
os.environ["SERPER_API_KEY"] = "serper-api-key"
os.environ["OPENAI_API_KEY"] = "openai-api-key"
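
The strings in the cell above are placeholders and need to be replaced with real credentials before the later cells can reach Arize AX, Serper, or OpenAI. A minimal sketch of a pre-flight check follows. `missing_env_vars` is a hypothetical helper (not part of the Arize SDK), and the placeholder values it looks for are simply the ones used in this notebook.

```python
import os

REQUIRED_VARS = ["ARIZE_API_KEY", "ARIZE_SPACE_ID", "SERPER_API_KEY", "OPENAI_API_KEY"]

# Placeholder strings copied from the cell above; treated as "not configured".
PLACEHOLDERS = {"arize-api-key", "arize-space-id", "serper-api-key", "openai-api-key"}


def missing_env_vars(names, env=None):
    """Return the names that are unset, empty, or still set to a placeholder."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name) or env.get(name) in PLACEHOLDERS]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Set real values for: {', '.join(missing)}")
```

Running this right after setting the variables surfaces a forgotten key early, rather than as an authentication error several cells later.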


In [None]:
from arize.otel import register

tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name="arize-sdk-quickstart",
)

In [None]:
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.crewai import CrewAIInstrumentor

# Finish automatic instrumentation
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
CrewAIInstrumentor().instrument(tracer_provider=tracer_provider)


In [None]:
from crewai import Agent, Crew, Process, Task
from crewai_tools import SerperDevTool

search_tool = SerperDevTool()

researcher = Agent(
    role="Financial Research Analyst",
    goal="Gather up-to-date financial data, trends, and news for the target companies or markets",
    backstory="""
        You are a Senior Financial Research Analyst.
    """,
    verbose=True,
    allow_delegation=False,
    max_iter=2,
    tools=[search_tool],
)

writer = Agent(
    role="Financial Report Writer",
    goal="Compile and summarize financial research into clear, actionable insights",
    backstory="""
        You are an experienced financial content writer.
    """,
    verbose=True,
    allow_delegation=True,
    max_iter=1,
)

task1 = Task(
    description="""
        Research: {tickers}
        Focus on: {focus}
    """,
    expected_output="Detailed financial research summary with web search findings",
    agent=researcher,
)

task2 = Task(
    description="Write a report based on the research above.",
    expected_output="A polished financial analysis report",
    agent=writer,
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[task1, task2],
    verbose=True,
    process=Process.sequential,
)


In [None]:
user_inputs = {"tickers": "TSLA", "focus": "financial analysis and market outlook"}

result = crew.kickoff(inputs=user_inputs)


In [None]:
test_queries = [
    {"tickers": "AAPL", "focus": "financial analysis and market outlook"},
    {"tickers": "NVDA", "focus": "valuation metrics and growth prospects"},
    {"tickers": "AMZN", "focus": "profitability and market share"},
    {"tickers": "AAPL, MSFT", "focus": "comparative financial analysis"},
    {"tickers": "META, SNAP, PINS", "focus": "social media sector trends"},
    {"tickers": "RIVN", "focus": "financial health and viability"},
    {"tickers": "SNOW", "focus": "revenue growth trajectory"},
    {"tickers": "KO", "focus": "dividend yield and stability"},
    {"tickers": "META", "focus": "latest developments and stock performance"},
    {"tickers": "AAPL, MSFT, GOOGL, AMZN, META", "focus": "big tech comparison and market outlook"},
    {"tickers": "AMC", "focus": "financial analysis and market sentiment"},
]


In [None]:
for query in test_queries:
    crew.kickoff(inputs=query)


# Run and Log Evals to Measure Performance

In [None]:
financial_completeness_template = """
You are evaluating whether a financial research report correctly completes ALL parts of the user's task with COMPREHENSIVE coverage.

User input: {attributes.input.value}

Generated report:
{attributes.output.value}

To be marked as "complete", the report MUST meet ALL of these strict requirements:

1. TICKER COVERAGE (MANDATORY):
   - Cover ALL companies/tickers mentioned in the input
   - If multiple tickers are listed, EACH must have dedicated analysis (not just mentioned in passing)
   - For multiple tickers, the report must provide COMPARATIVE analysis when relevant

2. FOCUS AREA COVERAGE (MANDATORY):
   - Address ALL focus areas mentioned in the input
   - If the focus mentions multiple topics (e.g., "earnings and outlook"), BOTH must be thoroughly addressed
   - Each focus area must have substantial content, not just a brief mention

3. FINANCIAL DATA REQUIREMENTS (MANDATORY):
   - For EACH ticker, the report must include:
     * Current/recent stock price or performance data
     * At least 2 key financial ratios (P/E, P/B, debt-to-equity, ROE, etc.)
     * Revenue or earnings information
     * Recent news or developments (within last 6 months)
   - If focus mentions specific metrics (e.g., "P/E ratio"), those MUST be explicitly provided

4. DEPTH REQUIREMENT (MANDATORY):
   - Each ticker must have at least 3-4 sentences of dedicated analysis
   - Generic statements without specific data do NOT count
   - The report must demonstrate thorough research, not superficial coverage

5. COMPARISON REQUIREMENT (if multiple tickers):
   - If 2+ tickers are requested, the report MUST include direct comparisons
   - Comparisons should cover multiple key metrics side-by-side
   - Generic statements like "both companies are good" do NOT satisfy this requirement
   - Must explicitly state which company performs better/worse on specific metrics

The report is "incomplete" if it fails ANY of the above requirements, including:
- Missing any ticker or only mentioning it briefly
- Failing to address any focus area or only addressing it superficially
- Missing required financial data for any ticker
- Providing generic analysis without specific metrics or data
- Failing to provide comparisons when multiple tickers are requested
- Not meeting the depth requirement for any ticker

Respond with exactly one label: "complete" or "incomplete".
Then provide a detailed explanation of which specific requirements were met or failed.
"""


In [None]:
from arize import ArizeClient
from datetime import datetime, timedelta

client = ArizeClient(api_key=os.getenv("ARIZE_API_KEY"))

# Export spans from the last hour (adjust time range as needed)
end_time = datetime.now()
start_time = end_time - timedelta(hours=1)

df = client.spans.export_to_df(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    project_name="arize-sdk-quickstart",
    start_time=start_time,
    end_time=end_time,
)

parent_spans = df[df["attributes.openinference.span.kind"] == "CHAIN"]
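
The export cell above builds its time range from `datetime.now()`, which returns a naive local time. If the export comes back empty, the time window is one thing worth checking. The sketch below builds the same one-hour window with timezone-aware UTC datetimes. `export_window` is a hypothetical helper, and whether the export call treats naive and aware datetimes differently is not stated in this notebook, so treat this as an assumption to verify.

```python
from datetime import datetime, timedelta, timezone


def export_window(hours, now=None):
    """Return (start_time, end_time) as timezone-aware UTC datetimes."""
    end_time = now if now is not None else datetime.now(timezone.utc)
    return end_time - timedelta(hours=hours), end_time


start_time, end_time = export_window(hours=1)
print(start_time.isoformat(), "->", end_time.isoformat())
```

Widening `hours` is also the simplest fix when the traces were sent earlier than the last hour.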


In [None]:
from phoenix.evals import LLM

llm = LLM(model="gpt-4o", provider="openai")


In [None]:
from phoenix.evals import create_classifier

completeness_evaluator = create_classifier(
    name="completeness",
    prompt_template=financial_completeness_template,
    llm=llm,
    choices={"complete": 1.0, "incomplete": 0.0},
)


In [None]:
from phoenix.evals import evaluate_dataframe
from openinference.instrumentation import suppress_tracing

with suppress_tracing():
    results_df = evaluate_dataframe(dataframe=parent_spans, evaluators=[completeness_evaluator])


In [None]:
import pandas as pd
from phoenix.evals.utils import to_annotation_dataframe

# Convert Phoenix eval results to annotation format
annotation_df = to_annotation_dataframe(dataframe=results_df)

# Get span_ids from parent_spans
span_ids = parent_spans["context.span_id"].values

# Build evaluation DataFrame
eval_results = []
eval_name = "completeness"

for idx, span_id in enumerate(span_ids):
    if idx < len(annotation_df):
        eval_result = {
            "context.span_id": span_id,
            f"eval.{eval_name}.label": annotation_df["label"].iloc[idx],
            f"eval.{eval_name}.score": float(annotation_df["score"].iloc[idx]),
            f"eval.{eval_name}.explanation": annotation_df["explanation"].iloc[idx],
        }
        eval_results.append(eval_result)

evals_df = pd.DataFrame(eval_results)
print(evals_df)

# Upload evaluations to Arize
response = client.spans.update_evaluations(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    project_name="arize-sdk-quickstart",
    dataframe=evals_df,
)
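
The cell above relies on a column naming pattern: each row is keyed by `context.span_id`, and the label, score, and explanation live in columns named `eval.<name>.label`, `eval.<name>.score`, and `eval.<name>.explanation`. A minimal sketch of that row shape, isolated from the loop, follows. `eval_row` is a hypothetical helper and the span ID and explanation are made-up illustration values.

```python
def eval_row(span_id, eval_name, label, score, explanation):
    """Build one evaluation row using the column naming pattern from the cell above."""
    return {
        "context.span_id": span_id,
        f"eval.{eval_name}.label": label,
        f"eval.{eval_name}.score": float(score),  # scores are stored as floats
        f"eval.{eval_name}.explanation": explanation,
    }


# Made-up values, for illustration only.
row = eval_row("abc123", "completeness", "complete", 1, "All requirements met.")
print(sorted(row))
```

A list of such rows can be passed to `pd.DataFrame(...)` exactly as `eval_results` is in the cell above.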

# Iterate with Experiments


In [None]:
import pandas as pd
from datetime import datetime

# Create a dataset from the test_queries for the experiment
examples_df = pd.DataFrame([
    {
        "input": f"Research: {query['tickers']}\nFocus on: {query['focus']}",
        "tickers": str(query["tickers"]),  # Ensure string type
        "focus": str(query["focus"]),  # Ensure string type
    }
    for query in test_queries
])

# Create the dataset
def append_timestamp(text: str) -> str:
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return f"{text}_{timestamp}"

dataset = client.datasets.create(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    name=append_timestamp("arize-sdk-quickstart-dataset"),
    examples=examples_df,
)

print(f"Created dataset: {dataset.id} - {dataset.name}")

In [None]:
# Define evaluators for experiments
from phoenix.evals import ClassificationEvaluator
from arize.experiments.evaluators.base import EvaluationResult

completeness_evaluator = ClassificationEvaluator(
    name="completeness",
    prompt_template=financial_completeness_template,
    llm=llm,
    choices={"complete": 1.0, "incomplete": 0.0},
)


def completeness(output, dataset_row):
    # Extract input from dataset_row - construct it from tickers and focus
    if isinstance(dataset_row, dict):
        tickers = dataset_row.get("tickers", "")
        focus = dataset_row.get("focus", "")
    else:
        tickers = getattr(dataset_row, "tickers", "")
        focus = getattr(dataset_row, "focus", "")

    # Format input as expected by the evaluator template
    input_value = f"Research: {tickers}\nFocus on: {focus}"

    results = completeness_evaluator.evaluate(
        {"attributes.input.value": input_value, "attributes.output.value": output}
    )
    # Convert Phoenix evaluation result to EvaluationResult
    result = results[0]
    return EvaluationResult(
        score=result.score,
        label=result.label,
        explanation=result.explanation if hasattr(result, "explanation") else None,
    )


evaluators = [completeness]
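
The evaluator above, like the task functions later in this notebook, handles a `dataset_row` that may be either a dict or an object with attributes. A minimal sketch of how that repeated branch could be factored out follows. `get_field` and `build_input` are hypothetical helpers, not part of the Arize SDK, and `SimpleNamespace` is used only to imitate an attribute-style row.

```python
from types import SimpleNamespace


def get_field(row, key, default=""):
    """Read a field from a dict-style or attribute-style dataset row."""
    if isinstance(row, dict):
        return row.get(key, default)
    return getattr(row, key, default)


def build_input(row):
    """Rebuild the input string in the same format used for the dataset's input column."""
    return f"Research: {get_field(row, 'tickers')}\nFocus on: {get_field(row, 'focus')}"


as_dict = {"tickers": "AAPL", "focus": "outlook"}
as_object = SimpleNamespace(tickers="AAPL", focus="outlook")
print(build_input(as_dict) == build_input(as_object))
```

Keeping the extraction in one place means the evaluator and both task functions cannot drift apart on how they read a row.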


In [None]:
# Run baseline experiment with original agent
def my_task(dataset_row):
    # Extract tickers and focus from dataset_row
    if isinstance(dataset_row, dict):
        tickers = dataset_row.get("tickers", "")
        focus = dataset_row.get("focus", "")
    else:
        tickers = getattr(dataset_row, "tickers", "")
        focus = getattr(dataset_row, "focus", "")

    # Run the original crew with the extracted inputs
    inputs = {"tickers": tickers, "focus": focus}
    result = crew.kickoff(inputs=inputs)

    # Return the output string
    if hasattr(result, "raw"):
        return result.raw
    elif isinstance(result, str):
        return result
    else:
        return str(result)

# Run baseline experiment
baseline_experiment, baseline_df = client.experiments.run(
    name="baseline-experiment",
    dataset_id=dataset.id,
    task=my_task,
    evaluators=evaluators,
)

print(f"Baseline experiment: {baseline_experiment}")

In [None]:
baseline_df

In [None]:
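# Create an improved version of the crew with stronger prompts.
# Based on the evaluation explanations from the baseline, we strengthen the agent goals.
researcher = Agent(
    role="Financial Research Analyst",
    goal="""Gather up-to-date financial data, trends, and news for the target companies/markets.
        Make sure to include at least 2 financial ratios (such as P/E or P/B), news from the last 6 months, and current stock price or performance data.""",
    backstory="""
        You are a Senior Financial Research Analyst.
    """,
    verbose=True,
    allow_delegation=False,
    max_iter=2,
    tools=[search_tool],
)
writer = Agent(
    role="Financial Report Writer",
    goal="Compile and summarize financial research into clear, actionable insights. If there are multiple tickers, make sure to include a dedicated comparison section.",
    backstory="""
        You are an experienced financial content writer.
    """,
    verbose=True,
    allow_delegation=True,
    max_iter=1,
)

# Each Task keeps a reference to the agent it was created with, so task1 and task2
# would still run with the original agents. Rebuild the tasks for the improved agents.
updated_task1 = Task(
    description="""
        Research: {tickers}
        Focus on: {focus}
    """,
    expected_output="Detailed financial research summary with web search findings",
    agent=researcher,
)
updated_task2 = Task(
    description="Write a report based on the research above.",
    expected_output="A polished financial analysis report",
    agent=writer,
)

updated_crew = Crew(
    agents=[researcher, writer],
    tasks=[updated_task1, updated_task2],
    verbose=True,
    process=Process.sequential,
)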


In [None]:
def improved_task(dataset_row):
    # Extract tickers and focus from dataset_row
    if isinstance(dataset_row, dict):
        tickers = dataset_row.get("tickers", "")
        focus = dataset_row.get("focus", "")
    else:
        tickers = getattr(dataset_row, "tickers", "")
        focus = getattr(dataset_row, "focus", "")

    # Run the updated crew with the extracted inputs
    inputs = {"tickers": tickers, "focus": focus}
    result = updated_crew.kickoff(inputs=inputs)

    # Return the output string
    if hasattr(result, "raw"):
        return result.raw
    elif isinstance(result, str):
        return result
    else:
        return str(result)

In [None]:
# Run the experiment with the improved agents, using improved_task and the same evaluators

experiment, experiment_df = client.experiments.run(
    name="improved-agent-instructions",
    dataset_id=dataset.id,
    task=improved_task,
    evaluators=evaluators,
)

print(f"Experiment: {experiment}")
experiment_df
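
To judge whether the improved prompts helped, compare the share of "complete" results between the two experiments. The sketch below assumes you have already pulled the completeness scores out of each experiment's results into plain lists; the column that holds them is not shown in this notebook, so that step is left out. `pass_rate` is a hypothetical helper and the score lists are made-up illustration values, not results from this notebook.

```python
def pass_rate(scores):
    """Fraction of runs scored 1.0 ("complete"); None entries (failed evals) are skipped."""
    valid = [score for score in scores if score is not None]
    if not valid:
        return None
    return sum(1 for score in valid if score == 1.0) / len(valid)


# Made-up scores purely for illustration.
baseline_scores = [0.0, 1.0, 0.0, None]
improved_scores = [1.0, 1.0, 0.0, 1.0]
print(pass_rate(baseline_scores), pass_rate(improved_scores))
```

With only eleven examples in the dataset, a small difference in pass rate may be noise, so reading the evaluator explanations alongside the numbers is worthwhile.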