feat(llm-gateway): emit cache & reasoning token breakdown on $ai_generation events #60403

Merged
VojtechBartos merged 2 commits into master from vojtab/gateway-cache-tokens
May 28, 2026

Conversation

@VojtechBartos
Member

@VojtechBartos VojtechBartos commented May 28, 2026

Problem

LiteLLM hands the gateway per-call usage details in standard_logging_object.metadata.usage_object, including cache_read_input_tokens, cache_creation_input_tokens (Anthropic prompt caching) and completion_tokens_details.reasoning_tokens (OpenAI o-series and similar). The Prometheus callback already consumes all three (callbacks/prometheus.py:42-46), and posthoganalytics' langchain CallbackHandler emits them on every $ai_generation it captures (ai/langchain/callbacks.py:610-614) — but the PostHog callback in the gateway was dropping them.

Result: gateway-routed traces in LLM Analytics showed $ai_input_tokens / $ai_output_tokens but no cache or reasoning breakdown, and posthog/tasks/llm_analytics_usage_report.py:375-377 (which sums these fields) was silently aggregating zero for any gateway traffic. The dollar amount ($ai_total_cost_usd) is unaffected — LiteLLM bakes pricing in — but the token accounting underneath it was not reconstructable from the events.
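
To illustrate why the report aggregated zero rather than failing, here is a minimal sketch. It does not reproduce the query in posthog/tasks/llm_analytics_usage_report.py; `sum_property` and the sample events are hypothetical, and the only assumption is that a missing property contributes nothing to a sum.

```python
# Hypothetical event properties as the gateway emitted them before this change:
# input/output tokens are present, cache and reasoning fields are missing.
gateway_events_before = [
    {"$ai_input_tokens": 9000, "$ai_output_tokens": 300},
    {"$ai_input_tokens": 9000, "$ai_output_tokens": 300},
]


def sum_property(events: list[dict], key: str) -> int:
    """Sum one numeric property, treating a missing or null value as 0."""
    return sum(event.get(key) or 0 for event in events)


print(sum_property(gateway_events_before, "$ai_input_tokens"))
print(sum_property(gateway_events_before, "$ai_cache_read_input_tokens"))
```

The second total is 0 without any error, which is why the gap went unnoticed in the aggregate.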

Changes

In services/llm-gateway/src/llm_gateway/callbacks/posthog.py, pull usage_object from the same path Prometheus uses and conditionally emit on $ai_generation:

  • $ai_cache_read_input_tokens
  • $ai_cache_creation_input_tokens
  • $ai_reasoning_tokens

Conditional emission matches the Nullable(Int64) schema in posthog/models/ai_events/sql.py:262-263 and skips providers that don't surface these keys at all. _on_failure is unchanged (no usage data on failed calls).
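
The conditional emission described above can be sketched roughly as follows. This is not the code from callbacks/posthog.py: `extract_token_breakdown` is a hypothetical helper, usage_object is assumed to arrive as plain dicts (as in the test fixtures), and treating "present" as "not None" is an assumption, chosen so that an explicit 0 reported by the provider is still emitted.

```python
from typing import Any


def extract_token_breakdown(standard_logging_object: dict[str, Any]) -> dict[str, int]:
    """Return only the cache/reasoning token properties that were reported."""
    metadata = standard_logging_object.get("metadata") or {}
    usage_object = metadata.get("usage_object") or {}
    completion_details = usage_object.get("completion_tokens_details") or {}

    candidates = {
        "$ai_cache_read_input_tokens": usage_object.get("cache_read_input_tokens"),
        "$ai_cache_creation_input_tokens": usage_object.get("cache_creation_input_tokens"),
        "$ai_reasoning_tokens": completion_details.get("reasoning_tokens"),
    }
    # Assumption: keep explicit zeros, drop only values that are missing entirely.
    return {key: value for key, value in candidates.items() if value is not None}


anthropic_call = {
    "metadata": {
        "usage_object": {"cache_read_input_tokens": 8741, "cache_creation_input_tokens": 0}
    }
}
print(extract_token_breakdown(anthropic_call))
print(extract_token_breakdown({"model": "some-model"}))  # no metadata at all
```

Omitting absent keys instead of writing zeros leaves the nullable columns null for providers that never report these fields, so "not reported" stays distinguishable from "reported as 0".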

Parity after this PR:

| Field | PostHog AI | Gateway (before) | Gateway (after) |
| --- | --- | --- | --- |
| $ai_input_tokens / $ai_output_tokens | yes | yes | yes |
| $ai_total_cost_usd | yes | yes | yes |
| $ai_cache_read_input_tokens / $ai_cache_creation_input_tokens | yes (cross-provider normalized) | no | yes (Anthropic; OpenAI's prompt_tokens_details.cached_tokens still not handled) |
| $ai_reasoning_tokens | yes | no | yes |

How did you test this code?

I'm an agent. Verification:

  • Unit: uv run pytest tests/callbacks/test_posthog.py — 56 passed, four new tests covering present/absent behavior for each field. Full gateway suite: 930 passed.
  • End-to-end on local: ran the gateway from this worktree, fired two claude-sonnet-4-5 calls 12k chars apart with cache_control: ephemeral on the system block. Captured events showed $ai_cache_creation_input_tokens: 8741 on call 1 and $ai_cache_read_input_tokens: 8741 on call 2, with cost dropping ~12× between the two (matching Anthropic's cache discount). $ai_reasoning_tokens shipped on every event (0 for Claude, as expected). No reasoning-model end-to-end run yet — the code path is identical to cache, just a different usage_object key.
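
As a rough sanity check on the ~12× figure, the arithmetic can be sketched as below. The multipliers are assumptions not stated in this PR (cache writes at 1.25× and cache reads at 0.1× the base input price) and should be checked against Anthropic's current pricing; only the 8741 token count comes from the captured events.

```python
from fractions import Fraction

# Assumed multipliers relative to the base input price (not taken from this PR).
CACHE_WRITE_MULTIPLIER = Fraction(5, 4)   # 1.25x, cache creation on call 1
CACHE_READ_MULTIPLIER = Fraction(1, 10)   # 0.10x, cache read on call 2

cached_tokens = 8741  # reported on both captured events

# Cost of the cached prefix in "base input token" units for each call.
write_cost_units = cached_tokens * CACHE_WRITE_MULTIPLIER
read_cost_units = cached_tokens * CACHE_READ_MULTIPLIER

ratio = write_cost_units / read_cost_units
print(float(ratio))  # 12.5
```

Under these assumptions the cached prefix alone is 12.5× cheaper on the second call. Uncached input and output tokens are billed the same on both calls, which would pull the overall ratio slightly below that.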

Publish to changelog?

no

🤖 Agent context

Authored by Claude Code (Opus 4.7, 1M context) at Vojta's direction. Notes for the reviewer:

  • Cost-breakdown components ($ai_input_cost_usd, $ai_output_cost_usd, $ai_saved_cache_cost_usd) deliberately left out — posthoganalytics doesn't emit them either, so adding them to the gateway would be a one-sided source, not a parity fix.
  • OpenAI prompt_tokens_details.cached_tokens path also left out — that's a separate normalization the gateway's Prometheus callback doesn't handle either. Worth a follow-up if OpenAI traffic from Slack becomes large enough to care about its cache visibility.
  • $ai_span_name / $ai_parent_id are langchain-specific span concepts that don't map onto the gateway's flat call model. Callers can pass either via x-posthog-property-* headers if they need them, since the existing posthog_properties merge handles arbitrary fields.

LiteLLM hands the callback per-call usage details in
standard_logging_object.metadata.usage_object, including
cache_read_input_tokens and cache_creation_input_tokens for providers
that support prompt caching (Anthropic). The Prometheus callback
already consumes both for its counters; the PostHog callback was
dropping them on the floor, so LLM Analytics and the AI usage report
saw input/output tokens but no cache breakdown.

Pull the same usage_object and emit the two fields as
$ai_cache_read_input_tokens / $ai_cache_creation_input_tokens when
LiteLLM provides them, matching the Nullable(Int64) schema in
posthog/models/ai_events/sql.py. Conditional emission keeps
non-caching providers from polluting events with zeros.

Extend the cache-token parity fix to also emit
$ai_reasoning_tokens when LiteLLM reports it in
standard_logging_object.metadata.usage_object.completion_tokens_details.reasoning_tokens.
This matches the field posthoganalytics' langchain CallbackHandler
already emits (ai/langchain/callbacks.py:614), so PostHog AI traces
and gateway-routed traces are now indistinguishable on the
token-accounting fields LLM Analytics queries.
@VojtechBartos VojtechBartos changed the title feat(llm-gateway): emit cache token breakdown on $ai_generation events feat(llm-gateway): emit cache & reasoning token breakdown on $ai_generation events May 28, 2026
@VojtechBartos VojtechBartos self-assigned this May 28, 2026
@VojtechBartos VojtechBartos requested a review from a team May 28, 2026 10:35
@VojtechBartos VojtechBartos marked this pull request as ready for review May 28, 2026 10:36
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 28, 2026 10:36
@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
services/llm-gateway/tests/callbacks/test_posthog.py:305-360
**Non-parameterized duplicate "absent" tests**

The two omission tests (`test_on_success_omits_cache_tokens_when_absent` and `test_on_success_omits_reasoning_tokens_when_absent`) share identical fixtures, patches, and structure — only the asserted absent keys differ. The project prefers parameterized tests: these could be a single `@pytest.mark.parametrize("absent_key", ["$ai_cache_read_input_tokens", "$ai_cache_creation_input_tokens", "$ai_reasoning_tokens"])` test, or even one non-parameterized test that asserts all three absent keys at once since they all share the same precondition (no `metadata` on the logging object).

Reviews (1): Last reviewed commit: "feat(llm-gateway): emit $ai_reasoning_to..."

Comment on lines +305 to 360
        props = mock_client.capture.call_args.kwargs["properties"]
        assert "$ai_cache_read_input_tokens" not in props
        assert "$ai_cache_creation_input_tokens" not in props

    @pytest.mark.asyncio
    async def test_on_success_emits_reasoning_tokens_when_present(
        self, callback: PostHogCallback, auth_user: AuthenticatedUser, mock_posthog_client: tuple
    ) -> None:
        _, mock_client = mock_posthog_client
        kwargs = {
            "standard_logging_object": {
                "model": "gpt-5.2",
                "custom_llm_provider": "openai",
                "prompt_tokens": 50,
                "completion_tokens": 200,
                "metadata": {
                    "usage_object": {
                        "completion_tokens_details": {"reasoning_tokens": 120},
                    },
                },
            },
            "litellm_params": {},
        }

        with (
            patch("llm_gateway.callbacks.posthog.get_auth_user", return_value=auth_user),
            patch("llm_gateway.callbacks.posthog.get_product", return_value="slack_app_routing"),
        ):
            await callback._on_success(kwargs, None, 0.0, 1.0, end_user_id=None)

        props = mock_client.capture.call_args.kwargs["properties"]
        assert props["$ai_reasoning_tokens"] == 120

    @pytest.mark.asyncio
    async def test_on_success_omits_reasoning_tokens_when_absent(
        self,
        callback: PostHogCallback,
        auth_user: AuthenticatedUser,
        standard_logging_object: dict,
        mock_posthog_client: tuple,
    ) -> None:
        _, mock_client = mock_posthog_client
        kwargs = {"standard_logging_object": standard_logging_object, "litellm_params": {}}

        with (
            patch("llm_gateway.callbacks.posthog.get_auth_user", return_value=auth_user),
            patch("llm_gateway.callbacks.posthog.get_product", return_value="slack_app"),
        ):
            await callback._on_success(kwargs, None, 0.0, 1.0, end_user_id=None)

        props = mock_client.capture.call_args.kwargs["properties"]
        assert "$ai_reasoning_tokens" not in props

    @pytest.mark.asyncio
    @pytest.mark.parametrize("product", ["wizard", "posthog_code", "llm_gateway"])
    async def test_on_success_includes_ai_product(
Contributor

P2 Non-parameterized duplicate "absent" tests

The two omission tests (test_on_success_omits_cache_tokens_when_absent and test_on_success_omits_reasoning_tokens_when_absent) share identical fixtures, patches, and structure — only the asserted absent keys differ. The project prefers parameterized tests: these could be a single @pytest.mark.parametrize("absent_key", ["$ai_cache_read_input_tokens", "$ai_cache_creation_input_tokens", "$ai_reasoning_tokens"]) test, or even one non-parameterized test that asserts all three absent keys at once since they all share the same precondition (no metadata on the logging object).
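
One way the second option in this suggestion could look is sketched below. It is self-contained for illustration only: `build_properties` is a hypothetical synchronous stand-in for the properties the real async callback passes to capture(), and the test does not use the project's fixtures or patches.

```python
from typing import Any

TOKEN_BREAKDOWN_KEYS = (
    "$ai_cache_read_input_tokens",
    "$ai_cache_creation_input_tokens",
    "$ai_reasoning_tokens",
)


def build_properties(standard_logging_object: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the event properties built by the callback."""
    usage = (standard_logging_object.get("metadata") or {}).get("usage_object") or {}
    details = usage.get("completion_tokens_details") or {}
    optional = {
        "$ai_cache_read_input_tokens": usage.get("cache_read_input_tokens"),
        "$ai_cache_creation_input_tokens": usage.get("cache_creation_input_tokens"),
        "$ai_reasoning_tokens": details.get("reasoning_tokens"),
    }
    props: dict[str, Any] = {
        "$ai_input_tokens": standard_logging_object.get("prompt_tokens"),
        "$ai_output_tokens": standard_logging_object.get("completion_tokens"),
    }
    # Only add the optional fields that were actually reported.
    props.update({key: value for key, value in optional.items() if value is not None})
    return props


def test_omits_token_breakdown_when_absent() -> None:
    # No "metadata" key at all: the shared precondition for all three fields.
    props = build_properties({"prompt_tokens": 50, "completion_tokens": 200})
    for absent_key in TOKEN_BREAKDOWN_KEYS:
        assert absent_key not in props


test_omits_token_breakdown_when_absent()
```

The same `TOKEN_BREAKDOWN_KEYS` tuple could instead feed `@pytest.mark.parametrize("absent_key", TOKEN_BREAKDOWN_KEYS)` if one reported failure per key is preferred over a single combined test.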

    @pytest.mark.asyncio
    async def test_on_success_emits_cache_tokens_when_present(
        self, callback: PostHogCallback, auth_user: AuthenticatedUser, mock_posthog_client: tuple
    ) -> None:
Contributor

We really need some kind of integration test here, with a real LLM call that returns the usage object. This is useful as a smoke test that the function works, but we're mostly testing a mock.

Member Author

As discussed on the call, it's a post-release item.

# present so providers that don't report them don't pollute events with
# zeros, matching the schema in posthog/models/ai_events/sql.py and the
# parity established by posthoganalytics' langchain CallbackHandler.
cache_read_input_tokens = usage_object.get("cache_read_input_tokens")
Contributor

We're very sensitive to this being wrong or incorrectly calculated. How do we protect against that? Should we alert when this field is missing, or provide some other indication that it isn't working?

Member Author

As discussed on the call, it's a post-release item.

@VojtechBartos VojtechBartos merged commit 5a2f0dd into master May 28, 2026
173 checks passed
@VojtechBartos VojtechBartos deleted the vojtab/gateway-cache-tokens branch May 28, 2026 15:57
@deployment-status-posthog

deployment-status-posthog Bot commented May 28, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-05-28 16:34 UTC | Run |
| prod-us | ✅ Deployed | 2026-05-28 17:00 UTC | Run |
| prod-eu | ✅ Deployed | 2026-05-28 17:02 UTC | Run |

3 participants