Unified trace bridge: link game-side TraceStore with framework observability (LangSmith, etc.) #83

@justinmadison

Problem

Agent Arena has two observability layers that don't talk to each other:

  1. Game-side (TraceStore): Observations, tool results, scores — keyed by (agent_id, tick)
  2. LLM-side (LangSmith/Anthropic console): Prompts, model responses, token usage, latency

When debugging a bad decision at tick 42, a user has to manually correlate between these systems. There's no way to click on a tick and see the full chain: what the agent saw → what prompt was built → what the LLM returned → what tool was called → what happened in the game.

Proposed Solution

Add a trace bridge that links game ticks to framework trace IDs, creating a unified view.

How it works

  1. SDK passes tick context to the decide callback:
    The decide(observation) function already receives the tick via observation.tick. No change needed.

  2. Framework starters attach Arena metadata to LLM calls:

    # In starters/langchain/agent.py
    result = graph.invoke(
        {"observation": obs},
        config={"metadata": {"arena_tick": obs.tick, "arena_agent": obs.agent_id}}
    )

    LangSmith records this metadata on the run, so runs can be filtered by arena_tick and arena_agent.

  3. The starter writes the framework trace URL back into TraceStore:

    # After the LLM call, store the link (run_id is the framework's ID for that call)
    trace.add_step("framework_trace", {
        "langsmith_run_id": run_id,
        "langsmith_url": f"https://smith.langchain.com/runs/{run_id}"
    })

  4. Result: unified per-tick trace

    Tick 42:
      observation: {pos: [1,2,3], resources: [{name: "berry", dist: 3.2}]}
      framework_trace: https://smith.langchain.com/runs/abc123  ← click to see LLM details
      decision: {tool: "collect", params: {target: "berry"}}
      tool_result: {success: true, items_collected: 1}
      score: {resources_collected: 5}
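
A minimal sketch of steps 2 and 3 together, under stated assumptions. `Observation` and `InMemoryTraceStore` are hypothetical stand-ins for the SDK types, and `add_step` here takes the `(agent_id, tick)` key explicitly, which may not match the real TraceStore signature. The sketch generates the run ID up front and passes it through the `run_id` key of the config, which recent langchain-core versions appear to accept; check this against the installed version. The URL shape is copied from this issue and may not match real LangSmith run URLs.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical stand-in for the SDK's observation type."""
    agent_id: str
    tick: int


class InMemoryTraceStore:
    """Hypothetical stand-in for TraceStore, keyed by (agent_id, tick)."""

    def __init__(self):
        self._steps = {}

    def add_step(self, agent_id, tick, name, data):
        self._steps.setdefault((agent_id, tick), []).append((name, data))

    def steps(self, agent_id, tick):
        return list(self._steps.get((agent_id, tick), []))


def arena_run_config(obs, run_id):
    """Build the dict passed as graph.invoke(..., config=...)."""
    return {
        # Assumption: the installed langchain-core honours a caller-supplied run_id.
        "run_id": run_id,
        "metadata": {"arena_tick": obs.tick, "arena_agent": obs.agent_id},
    }


def record_framework_trace(store, obs, run_id):
    """Write the framework link back into the game-side trace for this tick."""
    store.add_step(obs.agent_id, obs.tick, "framework_trace", {
        "langsmith_run_id": str(run_id),
        # URL shape taken from this issue; the real LangSmith format may differ.
        "langsmith_url": f"https://smith.langchain.com/runs/{run_id}",
    })


obs = Observation(agent_id="agent_1", tick=42)
store = InMemoryTraceStore()
run_id = uuid.uuid4()
config = arena_run_config(obs, run_id)
# result = graph.invoke({"observation": obs}, config=config)  # real LLM call goes here
record_framework_trace(store, obs, run_id)
```

Choosing the run ID before the call means the starter does not need a callback to learn it afterwards. If the framework cannot accept a caller-supplied ID, the ID would have to be read back from the framework after the call instead.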
    

Framework-agnostic design

The bridge should work with any framework:

  • LangGraph: LangSmith run metadata + callbacks
  • Claude SDK: Anthropic console trace IDs
  • OpenAI SDK: OpenAI dashboard request IDs
  • Custom: Any string URL/ID the user wants to attach
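
One way to keep the stored link framework-agnostic is to treat every entry as a framework name plus an opaque reference, and mark it clickable only when the reference is an http(s) URL. This is a sketch only; `ExternalTraceLink` and `make_link` are hypothetical names, not part of the SDK.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class ExternalTraceLink:
    """Hypothetical record for one external trace reference."""
    framework: str             # e.g. "langsmith", "anthropic", "openai", "custom"
    ref: str                   # whatever string the user attached (URL or ID)
    url: Optional[str] = None  # set only when ref is an http(s) URL


def make_link(framework, value):
    """Wrap a user-supplied string, detecting whether it is a clickable URL."""
    parsed = urlparse(value)
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    return ExternalTraceLink(framework=framework, ref=value,
                             url=value if is_url else None)


url_link = make_link("langsmith", "https://smith.langchain.com/runs/abc123")
id_link = make_link("openai", "req_123")
```

With this shape, a viewer can render `url_link` as a hyperlink and `id_link` as plain text to paste into the relevant dashboard.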

The SDK would expose a simple hook:

    def decide(observation: Observation) -> Decision:
        # User's framework code here...
        observation.trace_metadata["framework_url"] = langsmith_url
        return decision
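
On the SDK side, the hook could be honoured by a thin wrapper around the user's decide function that copies whatever was written to `trace_metadata` into per-tick storage. The sketch below assumes `trace_metadata` is a plain dict on the observation. `run_decide`, the `external_links` dict, and the example URL are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical stand-in; assumes trace_metadata is a plain dict."""
    agent_id: str
    tick: int
    trace_metadata: dict = field(default_factory=dict)


def run_decide(decide, observation, external_links):
    """Call the user's decide() and keep any trace metadata it attached."""
    decision = decide(observation)
    if observation.trace_metadata:
        key = (observation.agent_id, observation.tick)
        # Copy so later mutation of the observation does not change the record.
        external_links[key] = dict(observation.trace_metadata)
    return decision


def decide(observation):
    # User's framework code would run here and produce the trace URL.
    observation.trace_metadata["framework_url"] = "https://example.com/trace/abc123"
    return {"tool": "collect", "params": {"target": "berry"}}


links = {}
decision = run_decide(decide, Observation("agent_1", 42), links)
```

Agents that never touch `trace_metadata` leave no entry behind, so the hook stays optional for users without an external tracing backend.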

Acceptance Criteria

  • TraceStore supports storing external trace links per (agent_id, tick)
  • LangGraph starter attaches arena_tick + arena_agent as LangSmith run metadata
  • TraceStore captures LangSmith run URL back into the game-side trace
  • A user can go from tick → full prompt/response in LangSmith with one click
  • Design is framework-agnostic (works for Claude SDK, OpenAI, etc.)
  • Documentation shows the debugging workflow end-to-end

Dependencies

Estimated Effort

1 day (after #74 is complete)
