… default timeout duration
Ziinc
approved these changes
Apr 10, 2026
Ziinc
pushed a commit
that referenced
this pull request
Apr 10, 2026
* change clickhouse finch pool to a 5s max idle time and override finch default timeout duration
* add finch metrics to telemetry
We see ClickHouse connection timeouts in production 2-3 times per hour. Deeper investigation showed that each one lasts almost exactly 5s. When I discussed this with @ruslandoga, he pointed me to this code in Finch, which sets a default timeout of 5s.

These changes increase that timeout to 10s for the ClickHouse pool. I also lower conn_max_idle_time from 9.5s to 5s to provide a safer margin below ClickHouse's 10s keep-alive timeout, as I suspect that reuse of server-closed connections could be part of the problem. While 5s may seem a little aggressive, I don't foresee many connections going unused for that long given our volume. I am open to suggestions on this, though.

Lastly, I have added 4 new metrics to the telemetry we emit around Finch as part of this effort:

* finch.request.stop.duration: end-to-end request duration, tagged by pool name
* finch.connect.stop.duration: TCP connection establishment time, tagged by pool name (will show us if connects longer than 5s occur)
* finch.queue.stop.duration: pool checkout wait time, tagged by pool name
* finch.conn_max_idle_time_exceeded: counter of connections discarded for exceeding the idle time, tagged by host/port (will show the impact of lowering the idle time from 9.5s to 5s)

The tag options for finch.conn_max_idle_time_exceeded do not include the pool name, so host/port seemed the most logical choice: aside from the host name, we can assume that ports 8123 or 8443 would be ClickHouse.
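For readers without the diff at hand, the changes described above might look roughly like the following. This is a minimal sketch, not the code from this PR. The module name, pool name, and function names are hypothetical. The description does not say which Finch option carries the 5s default, so the sketch assumes it is the per-request `:pool_timeout` (documented default 5_000 ms). It also assumes the metrics are declared with `Telemetry.Metrics`, that `:name` is the metadata key holding the pool name, and that `:idle_time` is the measurement on the idle-time event.

```elixir
defmodule ClickHouseFinchSketch do
  # Hypothetical module: illustrates the pool settings and metrics described
  # in the PR text. Requires the :finch and :telemetry_metrics dependencies.
  import Telemetry.Metrics, only: [summary: 2, counter: 2]

  # Hypothetical Finch instance name for the ClickHouse pool.
  @finch_name ClickHouseFinchSketch.Finch

  # Child spec tuple for a supervisor. `base_url` is the ClickHouse HTTP
  # endpoint, e.g. "http://localhost:8123" (example value only).
  def finch_child(base_url) do
    {Finch,
     name: @finch_name,
     pools: %{
       base_url => [
         # Discard idle connections after 5s, well below ClickHouse's
         # 10s keep-alive timeout mentioned in the PR.
         conn_max_idle_time: 5_000
       ]
     }}
  end

  # Assumption: the 5s default being overridden is :pool_timeout.
  def request(%Finch.Request{} = req) do
    Finch.request(req, @finch_name, pool_timeout: 10_000)
  end

  # Metric definitions matching the four names listed in the PR.
  def metrics do
    [
      summary("finch.request.stop.duration",
        unit: {:native, :millisecond},
        tags: [:name]
      ),
      summary("finch.connect.stop.duration",
        unit: {:native, :millisecond},
        tags: [:name]
      ),
      summary("finch.queue.stop.duration",
        unit: {:native, :millisecond},
        tags: [:name]
      ),
      # This event carries no pool name, so it is tagged by host and port.
      counter("finch.conn_max_idle_time_exceeded",
        event_name: [:finch, :conn_max_idle_time_exceeded],
        measurement: :idle_time,
        tags: [:host, :port]
      )
    ]
  end
end
```

In a setup like this, `finch_child/1` would be added to the application's supervision tree and `metrics/0` passed to whichever telemetry reporter is in use. The metric types (summary versus distribution, for instance) are a guess and may differ from what the PR actually emits.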