ClickHouse connection pool adjustments/telemetry additions #3357

Merged
Ziinc merged 2 commits into main from
adammokan/o11y-1647-deeper-dig-into-clickhouse-connection-pool-handling
Apr 10, 2026

@amokan commented Apr 9, 2026

We see ClickHouse connection timeouts in production 2-3 times per hour. Deeper investigation showed that they occur at almost exactly 5s. When I discussed this with @ruslandoga, he pointed me to this code in Finch, which sets a default timeout of 5s.

These changes increase that timeout to 10s for the ClickHouse pool. I also lower conn_max_idle_time from 9.5s to 5s to provide a safer margin below ClickHouse's 10s keep-alive timeout, as I suspect that reuse of server-closed connections could be part of the problem. While 5s may seem a little aggressive, I don't foresee many connections going unused for that long given our volume. I'm open to suggestions on this, though.
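
For readers less familiar with Finch's pool options, the change described above might look roughly like the child spec below. This is a sketch, not the actual diff: the instance name `MyApp.ClickHouseFinch`, the URL, and the pool size are hypothetical, and it assumes the 5s default in question is Finch's connect timeout, which can be overridden through `transport_opts` in `conn_opts`.

```elixir
# Hypothetical supervisor child spec; the name, URL, and size are illustrative only.
children = [
  {Finch,
   name: MyApp.ClickHouseFinch,
   pools: %{
     "http://localhost:8123" => [
       size: 50,
       # Discard connections that have been idle for more than 5s at checkout,
       # leaving a margin below ClickHouse's 10s keep-alive timeout.
       conn_max_idle_time: 5_000,
       # Assumption: the 5s default being overridden is the connect timeout.
       conn_opts: [transport_opts: [timeout: 10_000]]
     ]
   }}
]
```

If the timeout in question is instead the pool checkout timeout, it would be set per request through the `:pool_timeout` option of `Finch.request/3` rather than in the pool configuration.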

Lastly, I have added 4 new metrics to the telemetry we emit around Finch related to this effort.

  • finch.request.stop.duration — end-to-end request duration, tagged by pool name
  • finch.connect.stop.duration — TCP connection establishment time, tagged by pool name (will show us if >5s connects occur)
  • finch.queue.stop.duration — pool checkout wait time, tagged by pool name
  • finch.conn_max_idle_time_exceeded — counter of connections discarded for exceeding idle time, tagged by host/port (will show the impact of lowering idle time from 9.5s to 5s)

The tag options for finch.conn_max_idle_time_exceeded do not include the pool name, so host/port seemed the most logical choice: aside from the host name, we can assume that ports 8123 or 8443 indicate ClickHouse.
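
As a rough illustration of how the four metrics above could be declared, here is a minimal sketch using `Telemetry.Metrics`. It assumes the `telemetry_metrics` package is available and that a reporter consumes the returned list; the module name `MyApp.FinchMetrics`, the choice of `summary` for durations, and the `.count` suffix on the counter are assumptions, not taken from the actual change.

```elixir
defmodule MyApp.FinchMetrics do
  # Hypothetical module; only the event names and tags mirror the list above.
  import Telemetry.Metrics

  def metrics do
    [
      # Finch reports durations in native time units, converted here to ms.
      summary("finch.request.stop.duration", tags: [:name], unit: {:native, :millisecond}),
      summary("finch.connect.stop.duration", tags: [:name], unit: {:native, :millisecond}),
      summary("finch.queue.stop.duration", tags: [:name], unit: {:native, :millisecond}),
      # This event carries no pool name in its metadata, so tag by host and port.
      # A counter ignores the measurement value and counts event occurrences.
      counter("finch.conn_max_idle_time_exceeded.count", tags: [:host, :port])
    ]
  end
end
```

The list returned by `MyApp.FinchMetrics.metrics/0` would then be handed to whichever reporter the application already uses.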

@amokan amokan requested review from Ziinc and chasers April 9, 2026 19:33
@Ziinc Ziinc merged commit 14d761e into main Apr 10, 2026
13 checks passed
@Ziinc Ziinc deleted the adammokan/o11y-1647-deeper-dig-into-clickhouse-connection-pool-handling branch April 10, 2026 07:20
Ziinc pushed a commit that referenced this pull request Apr 10, 2026
* change clickhouse finch pool to a 5s max idle time and override finch default timeout duration

* add finch metrics to telemetry