Skip to content

Detect AI coding agent in client and expose it as client_agent#106619

Open
alexey-milovidov wants to merge 6 commits into
masterfrom
client-agent-detection
Open

Detect AI coding agent in client and expose it as client_agent#106619
alexey-milovidov wants to merge 6 commits into
masterfrom
client-agent-detection

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

clickhouse-client and clickhouse-local now detect whether they were invoked under a known AI coding agent (Claude Code, Cursor, OpenAI Codex, Gemini CLI, Goose, Cline, OpenCode, GitHub Copilot, etc.) by inspecting environment variables only — no filesystem probing. The detected canonical agent id (e.g. claude-code, cursor, gemini-cli) is stored in a new ClientInfo::client_agent field, sent over the TCP protocol (gated on a new protocol revision for backward compatibility), and exposed as a client_agent LowCardinality(String) column in system.query_log, system.query_thread_log and system.processes.

The motivation is observability: server-side, operators can attribute query traffic to the AI coding agent that produced it, similar to how clickhousectl already tags its outbound User-Agent. A generic AGENT environment variable is also honored verbatim, and the field is empty when no agent is detected.

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added detection of the AI coding agent (Claude Code, Cursor, Codex, Gemini CLI, Goose, etc.) that invoked clickhouse-client or clickhouse-local, based on environment variables. The detected agent is reported in the new client_agent column of system.query_log, system.query_thread_log and system.processes.

Documentation entry for user-facing changes

  • Documentation update is included (system table docs for query_log, query_thread_log and processes).

alexey-milovidov and others added 2 commits June 6, 2026 01:03
clickhouse-client and clickhouse-local now detect whether they were
invoked under a known AI coding agent (Claude Code, Cursor, OpenAI Codex,
Gemini CLI, Goose, etc.) by inspecting environment variables only (no
filesystem checks), and report the detected canonical agent id in a new
`ClientInfo::client_agent` field.

The field is serialized over the TCP protocol (gated on the new
`DBMS_MIN_REVISION_WITH_CLIENT_AGENT_IN_CLIENT_INFO` revision) and exposed
as a `client_agent` `LowCardinality(String)` column in `system.query_log`,
`system.query_thread_log` and `system.processes`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test runner itself may be invoked under an AI coding agent, whose
environment variables would leak into the spawned clickhouse-client and
make the test non-deterministic. Clear all known agent markers before each
case so only the variable under test is set. Also cover the `cursor` case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Jun 6, 2026

Workflow [PR], commit [08862e7]

Summary:


AI Review

Summary

This PR adds environment-based AI agent detection to clickhouse-client / clickhouse-local, carries the detected client_agent through ClientInfo, and exposes it in query/process observability tables. The main TCP/query-log path is coherent, and the earlier propagation and async batching issues are fixed in the current code, but the new ClientInfo field is also written into the persisted async Distributed insert header in a way that breaks old-reader compatibility for queued files.

Findings

❌ Blockers

  • [src/Interpreters/ClientInfo.cpp:216] ClientInfo::write is used inside the async Distributed insert file header before the existing rows / bytes / block-header fields. Older binaries that do not know DBMS_MIN_REVISION_WITH_CLIENT_AGENT_IN_CLIENT_INFO stop reading ClientInfo before client_agent, then DistributedAsyncInsertHeader::read interprets the leftover client_agent string length as rows. This breaks the append-compatible header contract and can corrupt pending async Distributed queue files after downgrade or when an older binary has to drain files written by a newer one.
    Suggested fix: keep the embedded ClientInfo serialization old-reader-compatible for distributed queue files, and store client_agent as a trailing async-header field after the existing shard/table fields, or add an equivalent context-specific serialization path that old readers can safely ignore.
Final Verdict

Status: ❌ Block

Minimum required action: fix the async Distributed header compatibility issue before merge.

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Jun 6, 2026
Comment thread src/Interpreters/ClientInfo.cpp Outdated
Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good in general, but see the above about the usage of remote connections from clickhouse-local.

Also I'm curious - do we invoke the ClientInfo constructor once in clickhouse-client/clickhouse-local?

Serialize `client_agent` for all interfaces (at the end of the stream,
independent of the initial interface) instead of only inside the `TCP`
branch. Without this, a `clickhouse-local` query (`LOCAL` interface) that
is forwarded to a remote shard via `RemoteQueryExecutor` would drop the
field, so remote `system.query_log` / `system.query_thread_log` could not
attribute traffic originating from `clickhouse-local`.

Also use `SYSTEM FLUSH LOGS query_log` instead of the global form (style
check) and extend the test to cover the `clickhouse-local` -> remote
secondary-query propagation path.
@alexey-milovidov
Copy link
Copy Markdown
Member Author

Thanks for the review. Addressed in aa625ad:

Remote connections from clickhouse-local. Good catch — the field was serialized only inside the TCP branch, so a clickhouse-local query (LOCAL interface) forwarded to a remote shard dropped client_agent. I moved the serialization out of the interface-specific block to the end of the stream, so it is written/read for all interfaces and survives forwarding. Extended 04305_client_agent_detection with a clickhouse-localremote(...) case that asserts the secondary query on the remote logs client_agent = claude-code (matched via initial_query_id).

Do we invoke the ClientInfo constructor / detection once? The env-var detection runs once per process, not per query. It happens in fillOSUserHostNameAndVersionInfo via this chain, called once at startup:

Client::processOptions / LocalServer::main
  -> ClientBase::initClientContext            (once)
    -> Context::setQueryKindInitial
      -> ClientInfo::setInitialQuery
        -> fillOSUserHostNameAndVersionInfo
          -> detectClientAgent                (reads env vars)

client_context is a single long-lived object reused for every query in the session, so the detected value is computed once and then just carried along (the ClientInfo object itself is copied around, e.g. via Context::createCopy, but detection is not re-run). The ClientInfo default constructor only initializes the socket-address members; it does not touch the environment, which is important because ClientInfo is also constructed server-side in many places where agent detection would be meaningless.

Also fixed the style check (SYSTEM FLUSH LOGS query_log instead of the global form).

…rg::alter

PR #106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to
declare last_version and compression_method outside an if/else that
assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables
flags the bare declarations and Build (arm_tidy) is built with
-warnings-as-errors, so the build fails for every PR whose CI ran on
master after this commit.

CIDB shows 30+ unrelated PRs hitting this in the last 2 days
(e.g. #89360, #103540, #106120, #106386, #106404, #106522, #105102,
#106590).

Initialize both variables to safe defaults at declaration. They are
unconditionally overwritten in both if and else branches before use,
so behavior is unchanged. The new defaults are only relevant if a
future code path skips both assignments, in which case last_version=0
and compression_method=CompressionMethod::None are sane no-op values
(the same defaults the old structured-binding form would produce
through default-construction of the destructured aggregate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d72cac5)
Comment thread src/Interpreters/ClientInfo.cpp
@alexey-milovidov
Copy link
Copy Markdown
Member Author

@groeneai, fix the failed test for Kafka and send a PR with the fix.

@groeneai
Copy link
Copy Markdown
Contributor

groeneai commented Jun 6, 2026

@alexey-milovidov The failed Kafka test test_storage_kafka/test_avro_schema_registry.py::test_kafka_produce_consume_avro_confluent is an infrastructure flake unrelated to this PR's client_agent change. Details:

The full traceback ends in Poco::TimeoutException while the server tries to POST to the in-cluster Confluent schema-registry HTTP endpoint at AvroConfluentSchemaRegistry.cpp:209 (HTTPSession::receive -> receiveResponse):

Code: 1000. DB::Exception: Timeout: while registering schema under subject 'avro_confluent_roundtrip_NSPJJU-value':
While executing WaitForAsyncInsert.
...
12. AvroConfluentSchemaRegistry.cpp:209: ConfluentSchemaRegistry::registerSchema()
13. AvroConfluentSchemaRegistry.cpp:271: ConfluentSchemaRegistry::registerSchema()
14. AvroRowOutputFormat.cpp:633: AvroConfluentRowOutputFormat::write()

CIDB cross-check (30d):

The PR diff (ClientInfo, client_agent env-var detection, protocol bump, doc updates) does not touch AvroConfluentSchemaRegistry, HTTPSession, the schema-registry test fixtures, or Kafka. It is a one-off network-timeout flake against the in-cluster schema-registry HTTP endpoint, not a PR-caused regression. I do not see any actionable fix to file in a separate PR — please rerun the integration job, it should pass.

If the same timeout starts hitting many unrelated PRs we will create a chronic-flaky tracking task for the registry test fixture; right now it is not flaky enough to warrant one (5 of 6 hits all on the same azat PR).

Session: cron:clickhouse-ci-task-worker:20260606-201300

@alexey-milovidov
Copy link
Copy Markdown
Member Author

alexey-milovidov commented Jun 7, 2026

It is a one-off network-timeout flake against the in-cluster schema-registry HTTP endpoint, not a PR-caused regression. I do not see any actionable fix to file in a separate PR — please rerun the integration job, it should pass.

@groeneai, We always have to fix CI failures, even when they happen once in a blue moon. Find a way to work around network failures - e.g., add retries around that in the test or in the code that works with the schema registry.

@groeneai
Copy link
Copy Markdown
Contributor

groeneai commented Jun 7, 2026

@alexey-milovidov fix PR opened: #106661

It adds a bounded retry loop with exponential backoff around ConfluentSchemaRegistry::fetchSchema and registerSchema, gated by two new format settings (format_avro_schema_registry_max_retries, format_avro_schema_registry_retry_initial_backoff_ms). The retry only fires on transport-class errors (timeouts, connection refused, DNS, HTTP 5xx/408/429); schema-validation failures stay fatal. The single Poco timeout that hit test_storage_kafka/test_avro_schema_registry.py::test_kafka_produce_consume_avro_confluent will be absorbed by the default 5 retries instead of aborting the INSERT.

alexey-milovidov and others added 2 commits June 7, 2026 13:58
`ClientInfo::write` now persists `client_agent` into async `Distributed`
insert file headers. The batching path in
`DistributedAsyncInsertDirectoryQueue` keyed `BatchHeader` only by
`settings`, `query`, and `client_info.query_kind`, so two pending inserts
with the same query/header but different `client_agent` values were merged
and sent by a single `RemoteInserter` initialized from the first file's
`client_info`. The remote shard then recorded all of the batched traffic
under that first agent, breaking the attribution contract for async
`Distributed` inserts.

Add `client_info.client_agent` to both the `BatchHeader` equality
comparison and its hash so files with different agents are not batched
together.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov
Copy link
Copy Markdown
Member Author

Picked this up via /continue-pr:

  • Merged master (branch was 271 commits behind and red). Clean merge, no conflicts; protocol revision 54485 remains unique after the merge.
  • Addressed the remaining unresolved review thread (async Distributed insert misattribution): client_info.client_agent is now part of BatchHeader::operator== and BatchHeader::Hash in DistributedAsyncInsertDirectoryQueue, so queued insert files with different detected agents are no longer batched under one ClientInfo (commit 08862e7a584). Thread replied + resolved.
  • Built the full clickhouse binary clean and verified 04305_client_agent_detection end-to-end against a local server (all 5 cases match the reference, including the clickhouse-localremote(...) forwarding case): claude-code / cursor / my-custom-agent / empty / local->remote: claude-code.
  • Sole CI failure is test_storage_kafka/test_avro_schema_registry.py::test_kafka_produce_consume_avro_confluent — a one-off Poco::TimeoutException against the in-cluster Confluent schema-registry HTTP endpoint, unrelated to client_agent. This is already being fixed by @groeneai in Retry transient Confluent Schema Registry failures #106661 (retry loop around ConfluentSchemaRegistry).

/// Sent for all interfaces (not only TCP): the detected client agent must also be preserved
/// when a clickhouse-local query (LOCAL interface) is forwarded to remote shards.
if (server_protocol_revision >= DBMS_MIN_REVISION_WITH_CLIENT_AGENT_IN_CLIENT_INFO)
writeBinary(client_agent, out);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClientInfo::write is also used inside the persisted async Distributed insert header, before the existing rows / bytes / block-header fields are written. That header format is designed to stay append-compatible: old readers call client_info.read(header_buf, distributed_header.revision) and then immediately parse rows if header_buf.hasPendingData. A binary that does not know DBMS_MIN_REVISION_WITH_CLIENT_AGENT_IN_CLIENT_INFO will therefore leave this new client_agent string unread and interpret its length byte as rows, corrupting pending async Distributed files written by a newer node after downgrade or by any older binary that has to drain those files.

Please keep the embedded ClientInfo layout old-reader-compatible for distributed queue files, and persist client_agent as a trailing async-header field after the existing shard/table fields (or add an equivalent context-specific serialization path) so older readers can safely ignore it.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Jun 7, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.50% 84.50% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.20% 77.10% -0.10%

Changed lines: Changed C/C++ lines covered by tests: 83/85 (97.65%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

pull Bot pushed a commit to Lobinson1/ClickHouse that referenced this pull request Jun 8, 2026
A single network glitch when talking to the Confluent Schema Registry
currently aborts the surrounding INSERT. The HTTP client in
`AvroConfluentSchemaRegistry` makes one POST/GET attempt and propagates
the first `Poco::TimeoutException` or `Poco::Net::NetException` to the
caller, so a transient registry outage breaks `Kafka` ingestion that
relies on `AvroConfluent`.

Add a bounded retry loop around `fetchSchema` and `registerSchema` with
exponential backoff (capped at 10 seconds). Schema-validation errors
(HTTP 409 `INCOMPATIBLE_SCHEMA`, `INCORRECT_DATA`, `BAD_ARGUMENTS`) stay
fatal so misconfigured users do not wait through every retry.

Two new format settings control the policy:

- `format_avro_schema_registry_max_retries` (default 5, capped at 20,
  0 disables retries)
- `format_avro_schema_registry_retry_initial_backoff_ms` (default 100,
  capped at 60000)

Reported by @alexey-milovidov on
ClickHouse#106619.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants