Improve GLOBAL IN/JOIN performance and respect max_execution_time#95966
Improve GLOBAL IN/JOIN performance and respect max_execution_time#95966c-end wants to merge 2 commits intoClickHouse:masterfrom
Conversation
|
@alexey-milovidov could you or someone else who's familiar with this part of the codebase take a look? I ran some tests in one of our production clusters and the changes from this PR indeed seem to fix the "stuck" (extremely slow) GLOBAL IN queries. |
|
Workflow [PR], commit [13bb0b1] Summary: ❌
|
|
@nickitat could you take a look and trigger another CI run? Locally, the test I added passes even if I run it multiple times. But the timeouts might be too strict for busy CI runners. |
| while (receivePacketsExpectData(state)) | ||
| { | ||
| if (max_execution_time.totalSeconds() > 0 && watch.elapsedSeconds() > max_execution_time.totalSeconds()) | ||
| throw Exception(ErrorCodes::TIMEOUT_EXCEEDED, "Timeout exceeded while reading external table data. Spent {} seconds, timeout is {} seconds.", watch.elapsedSeconds(), max_execution_time.totalSeconds()); |
There was a problem hiding this comment.
Integer truncation of fractional max_execution_time causes wrong timeout
Low Severity
max_execution_time.totalSeconds() truncates to integer, causing incorrect behavior for fractional-second values. For max_execution_time = 1.5s, totalSeconds() returns 1, so the timeout fires at 1 second instead of 1.5. For max_execution_time = 0.5s, totalSeconds() returns 0, so the guard totalSeconds() > 0 fails and the timeout check is completely skipped. The same truncation issue affects the saturate_timeout guard and poll_interval capping in extractConnectionSettingsFromContext. Using totalMicroseconds() (or comparing Poco::Timespan values directly) would avoid the precision loss.
Additional Locations (1)
LLVM Coverage Report
PR changed lines: PR changed-lines coverage: 97.12% (101/104) |
|
@nickitat could you take a look at the code changes? I will look into making the integration test less flaky. Locally it works reliably even if I run it a hundred times, but maybe the query is just too big for CI. |


This PR tries to address some issues with queries using GLOBAL IN/JOIN.
The first issue is about performance. Processing the data from the GLOBAL subquery can take a significant amount of time on the remote replicas receiving this data. The data is sent by
RemoteQueryExecutorin blocks. The default block size is 65409.TCPHandler::receiveDatacreates a newQueryPipelineandPushingPipelineExecutorfor each block.MemorySink::onFinishcopies all existing blocks of the memory table before appending the new block. I think it does that for snapshot isolation. But this makes the external table initialization O(n^2) (with n being the number of blocks). We've seen GLOBAL IN subqueries in production that produced billions of rows, leading to thousands of blocks. For those cases, the quadratic complexity becomes an issue (queries were running for hours).I'm addressing this performance problem by using a single pipeline and executor per external table instead of per block. This allows us to flush the data from
MemorySinktoStorageMemoryonly once. This should be fine because nothing will query the memory table until all data is available.It's worth noting that with the analyzer enabled, this is less of an issue because externabl table blocks are squashed (see
min_external_table_block_size_rowsandmin_external_table_block_size_bytessettings). So there are less blocks, but the fundamental issue still exists.The second issue is that no timeouts are checked during external table initialization. If table initialization takes longer than
max_execution_time(because of the performance issue described above or any other reason), the query should be aborted. This is fixed by checking the elapsed time in the read loop ofTCPHandler::readData.In addition to that, I think that timeouts like
receive_timeoutandsend_timeoutshould be capped bymax_execution_time, if present. This avoids waiting for a dead/frozen initiator replica longer than necessary. This is done inTCPHandler::extractConnectionSettingsFromContext.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Improve performance for GLOBAL IN/JOIN subqueries that return a large number of rows and respect
max_execution_timeon remote servers.Documentation entry for user-facing changes
Note
Medium Risk
Touches core TCP protocol handling for external table reception and connection timeouts; bugs here could cause stuck queries, premature timeouts, or unfinished pipeline executors under error paths.
Overview
Improves performance of
GLOBAL IN/JOINexternal table initialization by reusing a singleQueryPipeline/PushingPipelineExecutorper external table instead of recreating one per received block, and ensures these executors are finished/cancelled viaQueryStatelifecycle management.Makes remote replicas respect
max_execution_timeduring external table reception (throwsTIMEOUT_EXCEEDED) and capssend_timeout/receive_timeout/poll_intervalbymax_execution_timewhen set. Adds profile events (InitializeExternalTablesMicroseconds,SendExternalTablesMicroseconds) plus a new failpoint (sleep_on_receive_external_table_data) and an integration test covering both execution-time and socket-timeout scenarios.Written by Cursor Bugbot for commit 13bb0b1. This will update automatically on new commits. Configure here.