<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# LangGraph - Parallel Evaluation

In this tutorial, we’ll build a parallel execution workflow using LangGraph. The pattern suits scenarios where multiple evaluations or subtasks can run independently before being aggregated into a final decision.

Our application generates a compelling product description and then runs three checks in parallel:

- Safety Check: Is the content safe and non-violent?

- Policy Compliance: Does it follow company policy?

- Clarity Check: Is it understandable to a general audience?

This pattern demonstrates how to fan out execution after a shared generation step, and aggregate results before producing a final output.

We use Phoenix tracing to gain full visibility into each node execution, making it easy to debug or audit how decisions were made across the parallel branches.
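
Before wiring anything into LangGraph, the fan-out and aggregate shape can be sketched in plain Python. The snippet below is only an illustration under stated assumptions: the three `stub_*` functions are hypothetical keyword rules standing in for the LLM-backed checks, and a thread pool stands in for whatever scheduling LangGraph performs internally.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the three LLM-backed checks (simple keyword rules).
def stub_safety(description: str) -> str:
    return "no" if "weapon" in description.lower() else "yes"


def stub_policy(description: str) -> str:
    return "no" if "guarantee" in description.lower() else "yes"


def stub_clarity(description: str) -> str:
    return "yes" if len(description.split()) < 50 else "no"


def run_checks(description: str) -> dict:
    checks = {"safety": stub_safety, "policy": stub_policy, "clarity": stub_clarity}
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        # Fan out: every check receives the same description.
        futures = {name: pool.submit(fn, description) for name, fn in checks.items()}
        # Fan in: block until every check has answered.
        verdicts = {name: future.result() for name, future in futures.items()}
    approved = all(answer == "yes" for answer in verdicts.values())
    return {"verdicts": verdicts, "approved": approved}


result = run_checks("A lightweight pair of smart glasses.")
print(result["approved"])
```

The aggregate step only runs once all three results are available, which is the same contract the graph built later in this tutorial relies on.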

In [1]:
!pip install langgraph langchain langchain_community "arize-phoenix" arize-phoenix-otel openinference-instrumentation-langchain



In [2]:
from langgraph.graph import StateGraph, START, END
import os, getpass

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


# Configure Phoenix Tracing

Go to https://app.phoenix.arize.com/ and generate an API key. The key lets you send traces from your LangGraph application to Phoenix.

In [4]:
PHOENIX_API_KEY = getpass.getpass("Phoenix API Key:")
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

Phoenix API Key:··········


In [6]:
from phoenix.otel import register

tracer_provider = register(
  project_name="Parallel",
  auto_instrument=True
)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: Parallel
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



In [7]:
from typing_extensions import TypedDict, Literal
from IPython.display import Image, display

# ChatOpenAI is provided by the langchain_community package installed above
from langchain_community.chat_models import ChatOpenAI

# LLM of choice

In [8]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)



# Graph State Definition
We define a State object to keep track of all data flowing through our LangGraph. This includes the input product name, the generated description, results of three independent evaluation checks, and the final aggregated output.

In [9]:
class State(TypedDict):
    product: str
    description: str
    safety_check: str
    policy_check: str
    clarity_check: str
    final_output: str
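
Each node returns only the keys it writes, and those partial updates are merged into the shared state. The sketch below imitates that merge with plain dictionaries; `merge_updates` is a hypothetical helper, not a LangGraph function. It assumes, as we understand LangGraph's default behaviour for keys without a reducer, that two nodes writing the same key in the same step is an error. That is why each parallel check gets its own field in `State`.

```python
from typing import Any, Dict, List


def merge_updates(state: Dict[str, Any], updates: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply partial updates produced by nodes that ran in the same step."""
    merged = dict(state)
    written = set()
    for update in updates:
        # Two writers for one key would be ambiguous, so refuse the merge.
        clash = written.intersection(update)
        if clash:
            raise ValueError(f"multiple writers for keys: {sorted(clash)}")
        written.update(update)
        merged.update(update)
    return merged


state = {"product": "Smart glasses", "description": "..."}
parallel_updates = [
    {"safety_check": "Yes"},
    {"policy_check": "Yes."},
    {"clarity_check": "No."},
]
print(merge_updates(state, parallel_updates))
```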

# Node 1: Generate Product Description
This node uses the LLM to write a compelling marketing-style description of the product. The output is stored in the description field of the graph state.

In [10]:
def generate_description(state: State):
    msg = llm.invoke(f"Write a compelling product description for: {state['product']}")
    return {"description": msg.content}



# Nodes 2–4: Parallel Evaluation Checks
After the product description is created, we fan out to three evaluators, each performing an independent check in parallel:

**Safety Check**: Is the language safe and non-violent?

**Policy Compliance**: Does it align with company guidelines?

**Clarity Check**: Is it understandable by a general audience?

Each function receives the same description as input and prompts the model to answer "yes" or "no". The raw model reply is stored in the state as-is.

In [11]:
company_policy = """Company Product Description Policy
1. Tone and Language
Product descriptions must:

Use clear, concise, and professional language.

Maintain a friendly, helpful, and inclusive tone.

Avoid slang, profanity, sarcasm, or overly casual phrasing.

Be free from any offensive, discriminatory, or culturally insensitive terms.

2. Truthfulness and Accuracy
All product features, specifications, and benefits must be factually accurate.

Claims (e.g. “fastest,” “best in class”) must be verifiable or supported by evidence (e.g., awards, benchmarks).

Avoid:

Misleading exaggerations.

Unsubstantiated health or performance claims.

Use of “guarantee” or “risk-free” unless legally backed.

3. Compliance with Legal and Regulatory Guidelines
Descriptions must not:

Promise results (e.g., “will cure,” “guarantees success”) unless FDA/FTC-compliant.

Use restricted terms in regulated industries (e.g., “organic,” “non-GMO,” “medical-grade”) without certification.

Include comparative language against other brands unless objective and non-disparaging.

4. Brand Voice and Alignment
Product language must align with brand values:

Empowerment

Sustainability

Innovation

Trust

Always mention unique selling points (USPs) when applicable, including differentiators that reflect the company’s mission.

5. Inclusivity and Accessibility
Avoid gendered, ageist, or culturally exclusive phrases.

Use plain language that is accessible to readers at or below a 10th-grade level.

Avoid niche references, idioms, or regionally specific expressions unless necessary.

6. Formatting and Style
Use proper grammar, punctuation, and spelling.

Headings and bullet points should:

Be consistently formatted.

Begin with verbs when listing features (e.g., “Enhances clarity,” “Syncs automatically”).

7. Restricted Content
Product descriptions must not include:

Violence, weapons, or militaristic comparisons.

Sexually explicit or suggestive material.

Mentions of illegal activities or substances.

Political or religious endorsements.

8. SEO and Discoverability (Optional)
If optimizing for search:

Use approved keywords (maintain keyword density <2.5%).

Do not “stuff” keywords unnaturally or compromise readability.

Include relevant tags in a way that fits the context.

9. Call to Action (CTA) Guidelines
CTAs should:

Be clear and action-oriented (e.g., “Shop Now,” “Experience the difference”).

Avoid pushy language (e.g., “Act before it’s gone!”).

Align with the tone: confident but not aggressive.

"""

In [12]:
def check_safety(state: State):
    msg = llm.invoke(f"Is this product description safe and non-violent? Answer yes or no.\n\n{state['description']}")
    return {"safety_check": msg.content}


def check_policy(state: State):
    msg = llm.invoke(f"Does this product description comply with our company policy? Here is the company policy: {company_policy}. Answer yes or no.\n\n{state['description']}")
    return {"policy_check": msg.content}


def check_clarity(state: State):
    msg = llm.invoke(f"Is this description clear and understandable to a 10th-grade reader? Answer yes or no.\n\n{state['description']}")
    return {"clarity_check": msg.content}
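
Because the checkers store the raw model reply, answers can arrive as "Yes", "Yes.", or a longer sentence. A small normaliser makes downstream logic less fragile. The helper below is a sketch, not part of the notebook: `parse_yes_no` is a hypothetical name, and treating anything that does not start with "yes" or "no" as undecided (`None`) is our assumption.

```python
import re
from typing import Optional


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map a free-text reply to True/False, or None if it is not a clear yes/no."""
    # Skip leading punctuation or markdown, then require a whole word "yes" or "no".
    match = re.match(r"\W*(yes|no)\b", answer.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


for reply in ["Yes.", "no", "**No**, it may strain the eyes", "Not sure"]:
    print(repr(reply), parse_yes_no(reply))
```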


# Node 5: Aggregate the Results
Once the checks complete, this node gathers their responses. If all three checks answer "yes", the product description is approved and returned as the final output. Otherwise it is flagged as rejected, and the message includes each checker's answer.

In [13]:
def aggregate_results(state: State):
    checks = ("safety_check", "policy_check", "clarity_check")
    # Compare the leading word: a substring test would also match "yes" inside "eyes".
    if all(state[key].strip().lower().startswith("yes") for key in checks):
        return {"final_output": state["description"]}
    # Otherwise report each checker's raw answer.
    return {
        "final_output": "REJECTED: One or more checks failed.\n"
        f"Safety: {state['safety_check']}, Policy: {state['policy_check']}, Clarity: {state['clarity_check']}"
    }
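
When a description is rejected, it can help to name the checks that failed rather than scanning all three answers. The sketch below shows one way to do that on hand-written state dictionaries. `failed_checks` and `rejection_message` are hypothetical helpers, and the message format is our own choice, not the notebook's.

```python
CHECK_KEYS = ("safety_check", "policy_check", "clarity_check")


def failed_checks(state: dict) -> list:
    """Return the names of checks whose answer does not begin with 'yes'."""
    return [
        key
        for key in CHECK_KEYS
        if not state.get(key, "").strip().lower().startswith("yes")
    ]


def rejection_message(state: dict) -> str:
    failed = failed_checks(state)
    if not failed:
        # Every check passed, so the description itself is the output.
        return state["description"]
    return "REJECTED: failed " + ", ".join(failed)


example = {
    "description": "A bright, compact desk lamp.",
    "safety_check": "Yes",
    "policy_check": "Yes.",
    "clarity_check": "No.",
}
print(rejection_message(example))
```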


# Building the Parallel Evaluation Graph
With all our nodes defined, we now assemble them into a LangGraph using StateGraph.

**Start → Description**: We begin by generating the product description.

**Fan Out Checks**: The output fans out into three parallel paths — safety, policy, and clarity checks — enabling efficient, simultaneous validation.

**Converge → Aggregate**: Once all checks complete, the results converge into a final aggregation node that determines whether to approve or reject the description.

**End**: The final result is produced.

This setup showcases LangGraph’s ability to manage parallelism and convergence, streamlining complex workflows while remaining transparent and modular.

In [14]:
builder = StateGraph(State)

builder.add_node("generate_description", generate_description)
builder.add_node("check_safety", check_safety)
builder.add_node("check_policy", check_policy)
builder.add_node("check_clarity", check_clarity)
builder.add_node("aggregate_results", aggregate_results)

# Description generation first
builder.add_edge(START, "generate_description")

# Then fan out for parallel checks
builder.add_edge("generate_description", "check_safety")
builder.add_edge("generate_description", "check_policy")
builder.add_edge("generate_description", "check_clarity")

# All checks go to the aggregator
builder.add_edge("check_safety", "aggregate_results")
builder.add_edge("check_policy", "aggregate_results")
builder.add_edge("check_clarity", "aggregate_results")

# Final result
builder.add_edge("aggregate_results", END)

workflow = builder.compile()
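
To see why the three checks can run side by side, the edges can be read as a dependency graph and grouped into layers: a node is ready once all of its predecessors have finished. The sketch below does this in plain Python. It is a simplification for this acyclic graph and not LangGraph's scheduler; `execution_layers` is a hypothetical helper, and `"START"` and `"END"` are plain strings standing in for the imported constants.

```python
from collections import defaultdict

EDGES = [
    ("START", "generate_description"),
    ("generate_description", "check_safety"),
    ("generate_description", "check_policy"),
    ("generate_description", "check_clarity"),
    ("check_safety", "aggregate_results"),
    ("check_policy", "aggregate_results"),
    ("check_clarity", "aggregate_results"),
    ("aggregate_results", "END"),
]


def execution_layers(edges):
    """Group nodes into layers; nodes in one layer have no dependency on each other."""
    predecessors = defaultdict(set)
    remaining = set()
    for src, dst in edges:
        predecessors[dst].add(src)
        remaining.update((src, dst))
    done, layers = set(), []
    while remaining:
        # A node is ready when every predecessor has already run.
        ready = sorted(n for n in remaining if predecessors[n] <= done)
        if not ready:
            raise ValueError("cycle detected")
        layers.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return layers


for depth, layer in enumerate(execution_layers(EDGES)):
    print(depth, layer)
```

The three checkers land in the same layer, and `aggregate_results` sits alone in the next one because it depends on all of them.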


# Example Usage

In [None]:
state = workflow.invoke({"product": "Smart glasses that project your calendar"})
print(state["final_output"])

**Introducing VisionSync Smart Glasses: Your Calendar, Right Before Your Eyes!**

Step into the future of productivity with VisionSync Smart Glasses, the revolutionary eyewear that seamlessly integrates your digital life with the real world. Imagine a pair of stylish glasses that not only enhance your vision but also keep you organized and on track—right in your line of sight!

**Key Features:**

- **Calendar Projection:** Effortlessly view your daily schedule, appointments, and reminders projected directly onto the lenses. No more fumbling with your phone or missing important meetings; everything you need is just a glance away.

- **Sleek Design:** Crafted with a modern aesthetic, VisionSync Smart Glasses are lightweight and comfortable, making them perfect for all-day wear. Choose from a variety of colors and styles to match your personal taste.

- **Voice Activation:** Stay hands-free and focused! With intuitive voice commands, you can easily navigate your calendar, set new appointm

In [15]:
state = workflow.invoke({"product": "Headphones with noise cancellation, transparency, and other advanced features."})
print(state["final_output"])
state = workflow.invoke({"product": "Smart fridge with advanced features."})
print(state["final_output"])

REJECTED: One or more checks failed.
Safety: Yes, Policy: Yes., Clarity: No.
**Introducing the Future of Food Storage: The Smart Fridge with Advanced Features**

Elevate your kitchen experience with our state-of-the-art Smart Fridge, designed for the modern home. This innovative appliance combines cutting-edge technology with sleek aesthetics, ensuring that your food stays fresher for longer while seamlessly integrating into your lifestyle.

**Key Features:**

- **Intelligent Inventory Management:** Never lose track of your groceries again! Our Smart Fridge uses advanced sensors and AI technology to monitor your food inventory in real-time. Receive notifications when items are running low or nearing their expiration date, so you can plan your meals and shopping trips with ease.

- **Touchscreen Interface:** The large, intuitive touchscreen display allows you to access recipes, create shopping lists, and even stream your favorite shows while you cook. With just a swipe, you can customiz

# Make sure to view your traces in Phoenix!

# Let's add some Evaluations (Evals)

In this section we will evaluate the accuracy of our safety, policy, and clarity checkers with another LLM call.

In [17]:
import phoenix as px
df = px.Client().get_spans_dataframe("name == 'LangGraph'", project_name='Parallel')



In [21]:
df.to_csv('parallel_evals.csv')

# Custom Eval Template

Here we define a custom eval template, designed to evaluate the policy, clarity, and safety checkers' decisions.
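
The template in this section uses `{placeholder}` fields, and each field must later exist as a column in the DataFrame passed to the evaluator. A quick way to list the fields is Python's own format-string parser. The sketch below is an illustration only: `SAMPLE_TEMPLATE` is a shortened, hypothetical template, `placeholders` and `missing_columns` are hypothetical helpers, and Phoenix performs its own substitution separately.

```python
from string import Formatter

# Shortened, hypothetical template using the same three fields.
SAMPLE_TEMPLATE = (
    "policy: {company_policy}\n"
    "generated description: {description}\n"
    "decisions: {decisions}\n"
)


def placeholders(template: str) -> set:
    """Return the names of all {field} placeholders in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template: str, columns) -> set:
    """Placeholders that have no matching column name."""
    return placeholders(template) - set(columns)


print(sorted(placeholders(SAMPLE_TEMPLATE)))
print(missing_columns(SAMPLE_TEMPLATE, ["description", "decisions"]))
```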

In [35]:
TEMPLATE = """ You must decide whether the clarity, policy, and safety checkers made the right decision based on the generated descriptions.
Check if the policy checker's decision correctly reflects whether the generated description complies with the policy: {company_policy}.
Check if the clarity checker's decision correctly reflects whether the generated description is understandable to a 10th-grade reader.
Check if the safety checker's decision correctly reflects whether the generated description is safe and non-violent.

generated description: {description}
decisions: {decisions}

Output 1/3 if one decision is correct, 2/3 if two decisions are correct, 3/3 if all decisions are correct, and 0/3 if all decisions are incorrect.
Explain your reasoning.
"""

In [22]:
df

| context.span_id | name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | attributes.input.value | attributes.output.value | attributes.openinference.span.kind | attributes.output.mime_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6945c7763318435a | LangGraph | CHAIN | | 2025-05-04 19:20:00.308233+00:00 | 2025-05-04 19:20:07.675905+00:00 | OK | | [] | 6945c7763318435a | 9049ac39db8812785dd2615fc57fbfb2 | Smart glasses that project your calendar | {"product": "Smart glasses that project your c... | CHAIN | application/json |
| 93a4b944d115a6bf | LangGraph | CHAIN | | 2025-05-15 01:22:38.270916+00:00 | 2025-05-15 01:22:45.692736+00:00 | OK | | [] | 93a4b944d115a6bf | 9b88493ebfeb8e3d640cae5ccd77aada | Headphones with noise cancellation, transparen... | {"product": "Headphones with noise cancellatio... | CHAIN | application/json |
| f99d9fbf041b248e | LangGraph | CHAIN | | 2025-05-15 01:22:45.854969+00:00 | 2025-05-15 01:22:59.120195+00:00 | OK | | [] | f99d9fbf041b248e | c8c3ec6da927a81a1d16a4d56b36075b | Smart fridge with advanced features. | {"product": "Smart fridge with advanced featur... | CHAIN | application/json |


# Generate Evals

In [39]:
import json
import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel

def unpack(row):
    blob = json.loads(row["attributes.output.value"])
    # pull the free-text description
    description = blob.get("description", "")
    # collapse the three yes/no flags into one readable string
    decisions_dict = {
        "policy_checker":  blob.get("policy_check",  "No."),
        "clarity_checker": blob.get("clarity_check", "No."),
        "safety_checker":  blob.get("safety_check",  "No.")
    }
    # join into the form   "policy_checker: Yes., clarity_checker: No.,  ..."
    decisions = ", ".join(f"{k}: {v}" for k, v in decisions_dict.items())
    return pd.Series({"description": description, "decisions": decisions})

df[["description", "decisions"]] = df.apply(unpack, axis=1)

# Make sure every {placeholder} in TEMPLATE exists as a DataFrame column
df["company_policy"] = company_policy  # lets {company_policy} resolve

# Run the eval

# We treat 0/3 … 3/3 as four categorical classes
rails = ["0/3", "1/3", "2/3", "3/3"]

eval_results = llm_classify(
    dataframe=df,
    template=TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),  # or any supported model
    rails=rails,
    include_prompt=True,
    include_response=True,
    verbose=True,
    provide_explanation=True,
)



Using prompt:

[PromptPartTemplate(content_type=<PromptPartContentType.TEXT: 'text'>, template=" You must decide whether the clarity, policy, and safety checkers made the right decision based on the generated descriptions. \nCheck if the policy checker's decision correctly reflects whether the generatead description complies with the policy: {company_policy}.\nCheck if the clarity checker's decision correctly reflects whether the generated description is understandable to a 10th-grade reader.\nCheck if the safety checker's decision correctly reflects whether the generated description is safe and non-violent.\n\ngenerated description: {description}\ndecisions: {decisions}\n\nReturn **only one line** in the exact format  \n<grade> — <brief explanation>  \n\nwhere <grade> is one of 0/3 | 1/3 | 2/3 | 3/3 and the hyphen never appears again.\n")]
OpenAI invocation parameters: {'model': 'gpt-4o', 'frequency_penalty': 0, 'presence_penalty': 0, 'top_p': 1, 'n': 1, 'timeout': None, 'max_completi


- Snapped '3/3' to rail: 3/3
- Snapped '2/3' to rail: 2/3
- Snapped '3/3' to rail: 3/3
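
The `unpack` step above can be tried without Phoenix or pandas by feeding it a hand-written JSON string. The sketch below mirrors its logic on a single value. `unpack_output` is a hypothetical helper, the sample payload is invented for illustration, and falling back to an empty record when the value is missing or not valid JSON is our assumption rather than something the notebook does.

```python
import json


def unpack_output(raw):
    """Return the description and a one-line summary of the three decisions."""
    try:
        blob = json.loads(raw)
    except (TypeError, ValueError):
        blob = None  # missing value or malformed JSON
    if not isinstance(blob, dict):
        blob = {}
    decisions = {
        "policy_checker": blob.get("policy_check", "No."),
        "clarity_checker": blob.get("clarity_check", "No."),
        "safety_checker": blob.get("safety_check", "No."),
    }
    return {
        "description": blob.get("description", ""),
        "decisions": ", ".join(f"{k}: {v}" for k, v in decisions.items()),
    }


# Invented payload shaped like the graph's final state.
sample = json.dumps({
    "product": "Desk lamp",
    "description": "A bright, compact desk lamp.",
    "safety_check": "Yes",
    "policy_check": "Yes.",
    "clarity_check": "No.",
})
print(unpack_output(sample))
```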


In [40]:
eval_results.drop(columns=["prompt", "exceptions", "execution_status", "execution_seconds", "response"], inplace=True)
eval_results

| context.span_id | label | explanation | prompt | response | exceptions | execution_status | execution_seconds |
|---|---|---|---|---|---|---|---|
| 6945c7763318435a | 3/3 | The policy checker's decision is correct as th... | You must decide whether the clarity, policy, ... | {"explanation":"The policy checker's decision ... | [] | COMPLETED | 2.418828 |
| 93a4b944d115a6bf | 2/3 | The policy checker correctly identified that t... | You must decide whether the clarity, policy, ... | {"explanation":"The policy checker correctly i... | [] | COMPLETED | 1.858363 |
| f99d9fbf041b248e | 3/3 | The description is clear, professional, and fr... | You must decide whether the clarity, policy, ... | {"explanation":"The description is clear, prof... | [] | COMPLETED | 3.12848 |
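
The labels are categorical strings such as "2/3". If a numeric value is more convenient for sorting or averaging, each label can be converted to a fraction. The sketch below is one way to do that. `label_to_score` is a hypothetical helper, and returning `None` for labels that are not fractions (for example when the evaluator's reply could not be matched to a rail) is our assumption.

```python
from fractions import Fraction
from typing import Optional


def label_to_score(label: str) -> Optional[float]:
    """Convert a label like '2/3' to a float in [0, 1]; None if it is not a fraction."""
    try:
        return float(Fraction(label.strip()))
    except (ValueError, ZeroDivisionError):
        return None


for label in ["0/3", "1/3", "2/3", "3/3", "NOT_PARSABLE"]:
    print(label, label_to_score(label))
```

With pandas, the same function could be applied to the `label` column, for example `eval_results["score"] = eval_results["label"].map(label_to_score)`, assuming the evaluation DataFrame keeps its `label` column.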


# Export Evals to Phoenix!

In [42]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Checker Accuracy", dataframe=eval_results)
)

