<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>

# LangGraph - Parallel Evaluation

In this tutorial, we’ll build a parallel execution workflow using LangGraph. The pattern suits scenarios where multiple evaluations or subtasks can run independently before being aggregated into a final decision.

Our application generates a compelling product description and then runs three checks in parallel:

- Safety Check: Is the content safe and non-violent?

- Policy Compliance: Does it follow company policy?

- Clarity Check: Is it understandable to a general audience?

This pattern demonstrates how to fan out execution after a shared generation step, and aggregate results before producing a final output.
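
To make the fan-out and aggregate shape concrete before introducing LangGraph, here is a minimal standard-library sketch. It is not how LangGraph schedules nodes; the `generate`, `safety`, `policy`, `clarity`, and `run_pipeline` names are hypothetical, and the checkers are simple keyword rules standing in for LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor


def generate(product: str) -> str:
    # Stand-in for the LLM generation step (no model call here).
    return f"A helpful description of {product}."


# Hypothetical rule-based checkers; each returns "yes" or "no".
def safety(text: str) -> str:
    return "no" if "weapon" in text.lower() else "yes"


def policy(text: str) -> str:
    return "no" if "guarantee" in text.lower() else "yes"


def clarity(text: str) -> str:
    return "yes" if len(text.split()) <= 40 else "no"


def run_pipeline(product: str) -> dict:
    description = generate(product)  # shared generation step
    checks = {"safety": safety, "policy": policy, "clarity": clarity}
    # Fan out: submit every check on the same description.
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn, description) for name, fn in checks.items()}
        # Fan in: wait for all results before deciding.
        results = {name: future.result() for name, future in futures.items()}
    approved = all(answer == "yes" for answer in results.values())
    return {"description": description, "results": results, "approved": approved}
```

The rest of the tutorial expresses the same three stages (shared step, independent checks, aggregation) as graph nodes and edges instead of explicit thread handling.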

We use Phoenix tracing to gain full visibility into each node execution, making it easy to debug or audit how decisions were made across the parallel branches.

In [None]:
!pip install langgraph langchain langchain_community "arize-phoenix" arize-phoenix-otel openinference-instrumentation-langchain

In [None]:
import getpass
import os

from langgraph.graph import END, START, StateGraph

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Configure Phoenix Tracing

Go to https://app.phoenix.arize.com/ and generate an API key. The key lets you send traces from your LangGraph application to Phoenix.

In [None]:
PHOENIX_API_KEY = getpass.getpass("Phoenix API Key:")
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

In [None]:
from phoenix.otel import register

tracer_provider = register(project_name="Parallel", auto_instrument=True)

In [None]:
from langchain_community.chat_models import ChatOpenAI
from typing_extensions import TypedDict

# LLM of choice

In [None]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)

# Graph State Definition
We define a State object to keep track of all data flowing through our LangGraph. This includes the input product name, the generated description, results of three independent evaluation checks, and the final aggregated output.

In [None]:
class State(TypedDict):
    product: str
    description: str
    safety_check: str
    policy_check: str
    clarity_check: str
    final_output: str
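
Each node returns only the keys it changes, and those partial updates are merged into the shared state. The three checks can run side by side because each one writes a different key. The sketch below mimics that merge with plain dictionaries; `merge_updates` is a hypothetical helper, and the conflict rule (two writes to one key in the same step are rejected) is an assumption about the behavior when no reducer is configured, not LangGraph's actual implementation.

```python
from typing import List, TypedDict


class PartialState(TypedDict, total=False):
    # Same fields as State, but every key is optional so partial updates type-check.
    product: str
    description: str
    safety_check: str
    policy_check: str
    clarity_check: str
    final_output: str


def merge_updates(state: PartialState, updates: List[PartialState]) -> PartialState:
    """Apply the updates produced in one step; reject two writes to the same key."""
    merged = dict(state)
    written = set()
    for update in updates:
        for key, value in update.items():
            if key in written:
                raise ValueError(f"conflicting writes to {key!r} in one step")
            written.add(key)
            merged[key] = value
    return merged
```

For example, merging `{"safety_check": "Yes."}` and `{"policy_check": "No."}` in one step succeeds, while two updates that both set `description` would raise.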

# Node 1: Generate Product Description
This node uses the LLM to write a compelling marketing-style description of the product. The output is stored in the description field of the graph state.

In [None]:
def generate_description(state: State):
    msg = llm.invoke(f"Write a compelling product description for: {state['product']}")
    return {"description": msg.content}

# Nodes 2–4: Parallel Evaluation Checks
After the product description is created, we fan out to three evaluators, each performing an independent check in parallel:

**Safety Check**: Is the language safe and non-violent?

**Policy Compliance**: Does it align with company guidelines?

**Clarity Check**: Is it understandable by a general audience?

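Each function receives the same description as input and asks the LLM for a binary decision ("yes" or "no"). The reply is stored as free text, so it may carry punctuation or extra words (for example "Yes.").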

In [None]:
company_policy = """Company Product Description Policy
1. Tone and Language
Product descriptions must:

Use clear, concise, and professional language.

Maintain a friendly, helpful, and inclusive tone.

Avoid slang, profanity, sarcasm, or overly casual phrasing.

Be free from any offensive, discriminatory, or culturally insensitive terms.

2. Truthfulness and Accuracy
All product features, specifications, and benefits must be factually accurate.

Claims (e.g. “fastest,” “best in class”) must be verifiable or supported by evidence (e.g., awards, benchmarks).

Avoid:

Misleading exaggerations.

Unsubstantiated health or performance claims.

Use of “guarantee” or “risk-free” unless legally backed.

3. Compliance with Legal and Regulatory Guidelines
Descriptions must not:

Promise results (e.g., “will cure,” “guarantees success”) unless FDA/FTC-compliant.

Use restricted terms in regulated industries (e.g., “organic,” “non-GMO,” “medical-grade”) without certification.

Include comparative language against other brands unless objective and non-disparaging.

4. Brand Voice and Alignment
Product language must align with brand values:

Empowerment

Sustainability

Innovation

Trust

Always mention unique selling points (USPs) when applicable, including differentiators that reflect the company’s mission.

5. Inclusivity and Accessibility
Avoid gendered, ageist, or culturally exclusive phrases.

Use plain language that is accessible to readers at or below a 10th-grade level.

Avoid niche references, idioms, or regionally specific expressions unless necessary.

6. Formatting and Style
Use proper grammar, punctuation, and spelling.

Headings and bullet points should:

Be consistently formatted.

Begin with verbs when listing features (e.g., “Enhances clarity,” “Syncs automatically”).

7. Restricted Content
Product descriptions must not include:

Violence, weapons, or militaristic comparisons.

Sexually explicit or suggestive material.

Mentions of illegal activities or substances.

Political or religious endorsements.

8. SEO and Discoverability (Optional)
If optimizing for search:

Use approved keywords (maintain keyword density <2.5%).

Do not “stuff” keywords unnaturally or compromise readability.

Include relevant tags in a way that fits the context.

9. Call to Action (CTA) Guidelines
CTAs should:

Be clear and action-oriented (e.g., “Shop Now,” “Experience the difference”).

Avoid pushy language (e.g., “Act before it’s gone!”).

Align with the tone: confident but not aggressive.

"""

In [None]:
def check_safety(state: State):
    msg = llm.invoke(
        f"Is this product description safe and non-violent? Answer yes or no.\n\n{state['description']}"
    )
    return {"safety_check": msg.content}


def check_policy(state: State):
    msg = llm.invoke(
        f"Does this product description comply with our company policy? Here is the company policy: {company_policy}. Answer yes or no.\n\n{state['description']}"
    )
    return {"policy_check": msg.content}


def check_clarity(state: State):
    msg = llm.invoke(
        f"Is this description clear and understandable to a 10th-grade reader? Answer yes or no.\n\n{state['description']}"
    )
    return {"clarity_check": msg.content}
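
The three check functions share one shape: build a prompt, call the model, store the reply under one key. Below is a minimal sketch of a factory that captures that shape, useful mainly for exercising a node without network calls. `make_checker`, `FakeLLM`, and `FakeMessage` are hypothetical names introduced here; the only assumption about the real model object is that it exposes `.invoke(prompt)` and returns something with a `.content` attribute, as in the cells above.

```python
from typing import Callable, List


class FakeMessage:
    def __init__(self, content: str) -> None:
        self.content = content


class FakeLLM:
    """Hypothetical stand-in with the same .invoke(prompt) -> message shape."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: List[str] = []

    def invoke(self, prompt: str) -> FakeMessage:
        self.prompts.append(prompt)  # record the prompt so a test can inspect it
        return FakeMessage(self.reply)


def make_checker(llm, question: str, key: str) -> Callable[[dict], dict]:
    """Build a node function that asks one yes/no question about the description."""

    def check(state: dict) -> dict:
        msg = llm.invoke(f"{question} Answer yes or no.\n\n{state['description']}")
        return {key: msg.content}

    return check


fake = FakeLLM("Yes.")
check_safety_stub = make_checker(fake, "Is this product description safe?", "safety_check")
print(check_safety_stub({"description": "A desk lamp."}))
```

Swapping `FakeLLM` for the real `llm` object would give a node equivalent in shape to `check_safety`, assuming the prompt wording is adapted.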

# Node 5: Aggregate the Results
Once the checks complete, this node gathers their responses. If every check's reply contains "yes", the description is approved and returned as the final output. Otherwise, it is flagged as rejected, and the three raw replies are included as the reasons.

In [None]:
def aggregate_results(state: State):
    if (
        "yes" in state["safety_check"].strip().lower()
        and "yes" in state["policy_check"].strip().lower()
        and "yes" in state["clarity_check"].strip().lower()
    ):
        return {"final_output": state["description"]}
    return {
        "final_output": "REJECTED: One or more checks failed.\n"
        f"Safety: {state['safety_check']}, Policy: {state['policy_check']}, Clarity: {state['clarity_check']}"
    }
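
A substring test such as `"yes" in reply` also matches replies like "No, yesterday's draft was clearer". A stricter alternative, sketched here under the assumption that the model puts its verdict at the start of the reply, is to parse only the leading word. `parse_verdict` is a hypothetical helper, not part of the original notebook.

```python
import re
from typing import Optional


def parse_verdict(reply: str) -> Optional[bool]:
    """Return True for a leading "yes", False for a leading "no", else None."""
    # Skip leading whitespace and punctuation (e.g. Markdown asterisks),
    # then require "yes" or "no" as a whole word.
    match = re.match(r"\W*(yes|no)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"
```

An aggregator built on this helper could treat `None` as a failed check, so that an ambiguous reply never approves a description by accident.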

# Building the Parallel Evaluation Graph
With all our nodes defined, we now assemble them into a LangGraph using StateGraph.

**Start → Description**: We begin by generating the product description.

**Fan Out Checks**: The output fans out into three parallel paths (safety, policy, and clarity checks), so the three validations run concurrently instead of one after another.

**Converge → Aggregate**: Once all checks complete, the results converge into a final aggregation node that determines whether to approve or reject the description.

**End**: The final result is produced.

This setup shows how LangGraph handles fan-out and convergence with plain edges: no explicit threading code is needed, and each check stays a separate node that can be traced on its own.

In [None]:
builder = StateGraph(State)

builder.add_node("generate_description", generate_description)
builder.add_node("check_safety", check_safety)
builder.add_node("check_policy", check_policy)
builder.add_node("check_clarity", check_clarity)
builder.add_node("aggregate_results", aggregate_results)

# Description generation first
builder.add_edge(START, "generate_description")

# Then fan out for parallel checks
builder.add_edge("generate_description", "check_safety")
builder.add_edge("generate_description", "check_policy")
builder.add_edge("generate_description", "check_clarity")

# All checks go to the aggregator
builder.add_edge("check_safety", "aggregate_results")
builder.add_edge("check_policy", "aggregate_results")
builder.add_edge("check_clarity", "aggregate_results")

# Final result
builder.add_edge("aggregate_results", END)

workflow = builder.compile()
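
One way to see which nodes can run together is to group the graph into layers, where a node becomes ready once all of its predecessors have finished. The sketch below does that for the edge list used above with standard-library code only. It is a simplified model for illustration, not LangGraph's scheduler, and `execution_layers` is a hypothetical helper.

```python
from collections import defaultdict
from typing import List, Tuple

EDGES: List[Tuple[str, str]] = [
    ("START", "generate_description"),
    ("generate_description", "check_safety"),
    ("generate_description", "check_policy"),
    ("generate_description", "check_clarity"),
    ("check_safety", "aggregate_results"),
    ("check_policy", "aggregate_results"),
    ("check_clarity", "aggregate_results"),
    ("aggregate_results", "END"),
]


def execution_layers(edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Group nodes into layers; a node is ready when all predecessors are done."""
    preds = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        preds[dst].add(src)
        nodes.update((src, dst))
    done: set = set()
    layers: List[List[str]] = []
    while len(done) < len(nodes):
        ready = sorted(n for n in nodes - done if preds[n] <= done)
        if not ready:
            raise ValueError("cycle detected")
        layers.append(ready)
        done.update(ready)
    return layers


for step, layer in enumerate(execution_layers(EDGES)):
    print(step, layer)
```

The three checks land in the same layer, which is why they can execute concurrently, and `aggregate_results` sits alone in the layer after them.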

# Example Usage

In [None]:
state = workflow.invoke({"product": "Smart glasses that project your calendar"})
print(state["final_output"])

In [None]:
state = workflow.invoke(
    {"product": "Headphones with noise cancellation, transparency, and other advanced features."}
)
print(state["final_output"])
state = workflow.invoke({"product": "Smart fridge with advanced features."})
print(state["final_output"])

# Make sure to view your traces in Phoenix!

# Let's add some Evaluations (Evals)

In this section we will evaluate the accuracy of our safety, policy, and clarity checkers with another LLM call.

In [None]:
import phoenix as px

df = px.Client().get_spans_dataframe("name == 'LangGraph'", project_name="Parallel")

In [None]:
df.to_csv("parallel_evals.csv")

# Custom Eval Template

Here we define a custom eval template, designed to evaluate the policy, clarity, and safety checkers' decisions.

In [None]:
TEMPLATE = """You must decide whether the clarity, policy, and safety checkers made the right decision based on the generated descriptions.
Check if the policy checker's decision correctly reflects whether the generated description complies with the policy: {company_policy}.
Check if the clarity checker's decision correctly reflects whether the generated description is understandable to a 10th-grade reader.
Check if the safety checker's decision correctly reflects whether the generated description is safe and non-violent.

generated description: {description}
decisions: {decisions}

Output 1/3 if one decision is correct, 2/3 if two decisions are correct, 3/3 if all decisions are correct, and 0/3 if all decisions are incorrect.
Explain your reasoning.
"""
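
The template uses `{name}` placeholders, and each one has to be supplied with a value when the eval runs. Assuming the placeholders follow Python's `str.format` syntax, a quick way to list them and spot any that are missing is sketched below. `placeholders` and `missing_fields` are hypothetical helpers, and `EXAMPLE_TEMPLATE` is a shortened stand-in for the real template.

```python
from string import Formatter
from typing import Iterable, Set

# Shortened, hypothetical stand-in for the eval template above.
EXAMPLE_TEMPLATE = (
    "policy: {company_policy}\n"
    "generated description: {description}\n"
    "decisions: {decisions}\n"
)


def placeholders(template: str) -> Set[str]:
    """Return the set of {field} names that appear in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_fields(template: str, available: Iterable[str]) -> Set[str]:
    """Return placeholder names that have no matching entry in `available`."""
    return placeholders(template) - set(available)


print(sorted(placeholders(EXAMPLE_TEMPLATE)))
print(missing_fields(EXAMPLE_TEMPLATE, ["description", "decisions"]))
```

Passing the dataframe's column names as `available` would flag a placeholder such as `company_policy` before any model call is made.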

In [None]:
df

# Generate Evals

In [None]:
import json

import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify


def unpack(row):
    blob = json.loads(row["attributes.output.value"])
    # pull the free-text description
    description = blob.get("description", "")
    # collapse the three yes/no flags into one readable string
    decisions_dict = {
        "policy_checker": blob.get("policy_check", "No."),
        "clarity_checker": blob.get("clarity_check", "No."),
        "safety_checker": blob.get("safety_check", "No."),
    }
    # join into the form   "policy_checker: Yes., clarity_checker: No.,  ..."
    decisions = ", ".join(f"{k}: {v}" for k, v in decisions_dict.items())
    return pd.Series({"description": description, "decisions": decisions})


df[["description", "decisions"]] = df.apply(unpack, axis=1)

# Make sure every {placeholder} in TEMPLATE exists as a column
df["company_policy"] = company_policy  # now {company_policy} will resolve

# Run the eval

# We treat 0/3 … 3/3 as four categorical classes
rails = ["0/3", "1/3", "2/3", "3/3"]

eval_results = llm_classify(
    dataframe=df,
    template=TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),  # or any supported model
    rails=rails,
    include_prompt=True,
    include_response=True,
    verbose=True,
    provide_explanation=True,
)
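
The `unpack` step assumes every span's output is a JSON object with all three check fields. Here is a more defensive sketch of the same idea, assuming some spans may carry a missing or malformed output value. `unpack_output` is a hypothetical helper; unlike `unpack`, it marks an absent decision as "missing" instead of defaulting to "No.", so the judge is not shown a decision the checker never made.

```python
import json
from typing import Any, Dict

# (label shown to the judge, key in the span output)
CHECK_FIELDS = [
    ("policy_checker", "policy_check"),
    ("clarity_checker", "clarity_check"),
    ("safety_checker", "safety_check"),
]


def unpack_output(raw: Any) -> Dict[str, str]:
    """Extract the description and a readable decisions string from a JSON blob."""
    try:
        blob = json.loads(raw) if isinstance(raw, str) else None
    except json.JSONDecodeError:
        blob = None
    if not isinstance(blob, dict):
        # Missing or malformed output: return empty fields instead of raising.
        return {"description": "", "decisions": ""}
    decisions = ", ".join(
        f"{label}: {blob.get(key, 'missing')}" for label, key in CHECK_FIELDS
    )
    return {"description": blob.get("description", ""), "decisions": decisions}
```

Rows that come back with an empty `description` can then be filtered out before running the eval.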

In [None]:
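# errors="ignore" keeps this cell safe to re-run after the columns are already gone
eval_results.drop(
    columns=["prompt", "exceptions", "execution_status", "execution_seconds", "response"],
    inplace=True,
    errors="ignore",
)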
eval_results
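
The labels "0/3" through "3/3" are categorical strings. If a numeric value is also wanted, for example to average checker accuracy across runs, each label can be converted to a fraction of correct decisions. This is a minimal sketch; `label_to_score` is a hypothetical helper, and any label that is not a valid fraction between 0 and 1 maps to `None`.

```python
from fractions import Fraction
from typing import Optional


def label_to_score(label: Optional[str]) -> Optional[float]:
    """Convert a label such as "2/3" to a float in [0, 1]; return None otherwise."""
    try:
        value = Fraction(label.strip())
    except (ValueError, ZeroDivisionError, AttributeError):
        # Not a fraction, a zero denominator, or not a string at all.
        return None
    return float(value) if 0 <= value <= 1 else None


print([label_to_score(label) for label in ["0/3", "1/3", "2/3", "3/3"]])
```

Assuming `eval_results` keeps its `label` column, something like `eval_results["score"] = eval_results["label"].map(label_to_score)` would add the numeric column; whether Phoenix displays it as a score after logging is worth verifying against the Phoenix documentation.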

# Export Evals to Phoenix!

In [None]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(SpanEvaluations(eval_name="Checker Accuracy", dataframe=eval_results))