Gracefully fail when querying a delayed remote source by george-larionov · Pull Request #84820 · ClickHouse/ClickHouse

george-larionov · 2025-07-31T16:22:47Z

Resolves #83282. When the remote source is delayed, sendQuery() can receive an empty replica_states vector, which was not checked for. The fix adds a check for an empty vector and fails gracefully in that case.

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed an issue where querying a delayed remote source could result in a vector out-of-bounds access.

clickhouse-gh · 2025-07-31T16:23:12Z

Workflow [PR], commit [b7d50e4]

Summary: ❌

job_name	test_name	status
Stateless tests (amd_msan, parallel, 2/2)		failure
	02901_parallel_replicas_rollup	FAIL
	03525_sql_udf_names_in_system_query_log	FAIL
	02193_async_insert_tcp_client_1	FAIL
	02761_ddl_initial_query_id	FAIL
	03231_hive_partitioning_filtering	FAIL
	02136_scalar_subquery_metrics	FAIL
	02765_queries_with_subqueries_profile_events	FAIL

tests/queries/0_stateless/03581_query_secure_delayed_remote_source.sql

azat · 2025-08-04T15:31:10Z

src/Client/MultiplexedConnections.cpp

+    }
+    else
+    {
+        throw Exception(ErrorCodes::NO_AVAILABLE_REPLICA, "No available replica");


We should never get here with 0 replicas. The problem is in ReadFromRemote::addLazyPipe: when use_delayed_remote_source = true, we ignore the exceptions thrown while obtaining the connections here:

ClickHouse/src/Processors/QueryPlan/ReadFromRemote.cpp

Lines 521 to 537 in c5f370c

try
{
    if (my_table_func_ptr)
        try_results = my_shard.shard_info.pool->getManyForTableFunction(timeouts, current_settings, PoolMode::GET_ONE);
    else
        try_results = my_shard.shard_info.pool->getManyChecked(
            timeouts, current_settings, PoolMode::GET_ONE,
            my_shard.main_table ? my_shard.main_table.getQualifiedName() : my_main_table.getQualifiedName());
}
catch (const Exception & ex)
{
    if (ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
        LOG_WARNING(getLogger("ClusterProxy::SelectStreamFactory"),
            "Connections to remote replicas of local shard {} failed, will use stale local replica", my_shard.shard_info.shard_num);
    else
        throw;
}
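
The effect of a catch block like this can be reproduced in isolation. The sketch below is a simplified model, not ClickHouse code: `ConnectionError`, `getConnections`, `getConnectionsOrEmpty`, and `firstConnection` are hypothetical names chosen for illustration. It shows how swallowing the connection error leaves the result vector empty, so any later consumer that assumes at least one connection has to check before indexing.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an ALL_CONNECTION_TRIES_FAILED error.
struct ConnectionError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Hypothetical pool lookup: throws when no replica is reachable.
std::vector<std::string> getConnections(bool reachable)
{
    if (!reachable)
        throw ConnectionError("all connection tries failed");
    return {"replica-1"};
}

// Models the catch block: the error is swallowed, so the caller
// receives an empty vector instead of an exception.
std::vector<std::string> getConnectionsOrEmpty(bool reachable)
{
    std::vector<std::string> results;
    try
    {
        results = getConnections(reachable);
    }
    catch (const ConnectionError &)
    {
        // Intended for the "fall back to the stale local replica" case.
    }
    return results;
}

// A consumer that must not index into an empty vector.
const std::string & firstConnection(const std::vector<std::string> & connections)
{
    if (connections.empty())
        throw std::logic_error("no connections available");
    return connections.front();
}
```

Without the `empty()` check in `firstConnection`, calling `front()` on the empty vector would be undefined behaviour, which is the class of failure reported in the linked issue.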

We need to rethrow the original exception if falling back to the local replica (without TCP communication) is not possible, i.e.

patch

$ git di
diff --git a/src/Client/MultiplexedConnections.cpp b/src/Client/MultiplexedConnections.cpp
index 48ac03dc595..68058a48e41 100644
--- a/src/Client/MultiplexedConnections.cpp
+++ b/src/Client/MultiplexedConnections.cpp
@@ -186,6 +186,7 @@ void MultiplexedConnections::sendQuery(
     const bool enable_offset_parallel_processing = context->canUseOffsetParallelReplicas();
 
     size_t num_replicas = replica_states.size();
+    chassert(num_replicas > 0);
     if (num_replicas > 1)
     {
         if (enable_offset_parallel_processing)
diff --git a/src/Processors/QueryPlan/ReadFromRemote.cpp b/src/Processors/QueryPlan/ReadFromRemote.cpp
index 40e91f4e907..a18341cf277 100644
--- a/src/Processors/QueryPlan/ReadFromRemote.cpp
+++ b/src/Processors/QueryPlan/ReadFromRemote.cpp
@@ -515,6 +515,12 @@ void ReadFromRemote::addLazyPipe(
         auto timeouts = ConnectionTimeouts::getTCPTimeoutsWithFailover(current_settings)
                             .getSaturated(current_settings[Setting::max_execution_time]);
 
+        bool use_delayed_remote_source = false;
+        fiu_do_on(FailPoints::use_delayed_remote_source,
+        {
+            use_delayed_remote_source = true;
+        });
+
         // In case reading from parallel replicas is allowed, lazy case is not triggered,
         // so in this case it's required to get only one connection from the pool
         std::vector<ConnectionPoolWithFailover::TryResult> try_results;
@@ -529,7 +535,7 @@ void ReadFromRemote::addLazyPipe(
         }
         catch (const Exception & ex)
         {
-            if (ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
+            if (use_delayed_remote_source && ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
                 LOG_WARNING(getLogger("ClusterProxy::SelectStreamFactory"),
                     "Connections to remote replicas of local shard {} failed, will use stale local replica", my_shard.shard_info.shard_num);
             else
@@ -543,12 +549,6 @@ void ReadFromRemote::addLazyPipe(
             max_remote_delay = std::max(try_result.delay, max_remote_delay);
         }
 
-        bool use_delayed_remote_source = false;
-        fiu_do_on(FailPoints::use_delayed_remote_source,
-        {
-            use_delayed_remote_source = true;
-        });
-
         if (!use_delayed_remote_source)
         {
             const auto replicated_storage = std::dynamic_pointer_cast<StorageReplicatedMergeTree>(my_storage);

But the question is: why does it fail with remoteSecure? The reason is that SSL is not configured in fiddle, so it fails there; on CI SSL is configured, so it should not fail there.

One more question: if the server is not available on that port, why does it not fail while trying to obtain the table structure (in getStructureOfRemoteTable())? This is due to the isLocal check, which returns true, so ClickHouse does not go via TCP; it simply executes the query (DESC table) internally:

ClickHouse/src/Storages/getStructureOfRemoteTable.cpp

Lines 50 to 54 in c5f370c

if (shard_info.isLocal())
{
    TableFunctionPtr table_function_ptr = TableFunctionFactory::instance().get(table_func_ptr, context);
    return table_function_ptr->getActualTableStructure(context, /*is_insert_query*/ true);
}
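
A local short-circuit of this kind can hide an unreachable endpoint until a later stage. The following is a minimal sketch under stated assumptions, not the actual ClickHouse implementation: `ShardInfo`, its `is_local` and `reachable` fields, `describeOverNetwork`, and `getStructure` are all hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical shard descriptor; both fields are illustrative only.
struct ShardInfo
{
    bool is_local;
    bool reachable;
};

// Hypothetical remote lookup that needs a working connection.
std::string describeOverNetwork(const ShardInfo & shard)
{
    if (!shard.reachable)
        throw std::runtime_error("cannot connect");
    return "structure (remote)";
}

std::string getStructure(const ShardInfo & shard)
{
    // Local shards are answered in-process, so a closed or
    // misconfigured port goes unnoticed at this stage.
    if (shard.is_local)
        return "structure (local)";
    return describeOverNetwork(shard);
}
```

In this model a shard that is local but unreachable over the network still yields a structure, and the connection error only surfaces later, when data is actually read.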

So I would say that the problem is this failpoint: we need to apply it only for ReplicatedMergeTree and ensure that all tests that use it remain correct after this change.

Thanks for pointing me in this direction; I was looking in the wrong place. Two questions. First, did you mean to do !use_delayed_remote_source in the if statement? Second, is this failpoint the only case in which fallback to a local replica is not possible? Is there any way to test directly for the presence of at least one replica before calling the sendQuery function?

First, did you mean to do !use_delayed_remote_source in the if statement?

Actually after thinking about it more, we should do something like this

Details

$ git di
diff --git a/src/Processors/QueryPlan/ReadFromRemote.cpp b/src/Processors/QueryPlan/ReadFromRemote.cpp
index 40e91f4e907..b5fdbe5493d 100644
--- a/src/Processors/QueryPlan/ReadFromRemote.cpp
+++ b/src/Processors/QueryPlan/ReadFromRemote.cpp
@@ -1,3 +1,4 @@
+#include <exception>
 #include <Processors/QueryPlan/ReadFromRemote.h>
 
 #include <Analyzer/QueryNode.h>
@@ -518,6 +519,7 @@ void ReadFromRemote::addLazyPipe(
         // In case reading from parallel replicas is allowed, lazy case is not triggered,
         // so in this case it's required to get only one connection from the pool
         std::vector<ConnectionPoolWithFailover::TryResult> try_results;
+        std::exception_ptr exception_ptr;
         try
         {
             if (my_table_func_ptr)
@@ -529,6 +531,7 @@ void ReadFromRemote::addLazyPipe(
         }
         catch (const Exception & ex)
         {
+            exception_ptr = std::current_exception();
             if (ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
                 LOG_WARNING(getLogger("ClusterProxy::SelectStreamFactory"),
                     "Connections to remote replicas of local shard {} failed, will use stale local replica", my_shard.shard_info.shard_num);
@@ -577,6 +580,10 @@ void ReadFromRemote::addLazyPipe(
                 }
             }
 
+
+            if (exception_ptr)
+                std::rethrow_exception(exception_ptr);
+
             std::vector<IConnectionPool::Entry> connections;
             connections.reserve(try_results.size());
             for (auto & try_result : try_results)

Is there any way to test directly for the presence of at least one replica before calling the sendQuery function?

A connection should never be created with zero replicas (so MultiplexedConnections::sendQuery() should never be reached in that case), and we should check it here. The problem is the fallback-to-local-server case: because of it we ignore errors from pool::getMany*() and later do not check that we have any connections.
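
The idea of remembering the failure and rethrowing it only when no fallback applies can be sketched on its own with `std::exception_ptr`. This is an illustrative model, not the code from the pull request: `readWithFallback`, `read_remote`, and `local_fallback` are hypothetical names.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical model: try the remote source first; if it fails, use the
// local fallback when one exists, otherwise rethrow the original error.
std::string readWithFallback(
    const std::function<std::string()> & read_remote,
    const std::optional<std::string> & local_fallback)
{
    std::exception_ptr saved_exception;
    try
    {
        return read_remote();
    }
    catch (const std::runtime_error &)
    {
        // Remember the failure instead of discarding it.
        saved_exception = std::current_exception();
    }

    if (local_fallback)
        return *local_fallback;

    // No fallback is possible: surface the original error unchanged.
    std::rethrow_exception(saved_exception);
}
```

Because `std::rethrow_exception` preserves the original exception object, the caller sees the same type and message that the connection attempt produced, rather than a generic "no replicas" error raised further down the pipeline.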

Understood, this looks better to me as well

src/Processors/QueryPlan/ReadFromRemote.cpp

azat

Apart from minor comment adjustment, LGTM

P.S. I wouldn't even say that it is a bug-fix, AFAICS it is not possible to trigger this problem w/o failpoints, am I right?

Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com>

george-larionov · 2025-08-07T16:45:02Z

Apart from minor comment adjustment, LGTM

P.S. I wouldn't even say that it is a bug-fix, AFAICS it is not possible to trigger this problem w/o failpoints, am I right?

I guess, but isn't the failpoint supposed to model a situation that could happen in real life?

Edit: I see what you mean, actually, since a similar real error would probably be caught in the code block that the failpoint skips. I wonder how useful this failpoint is if it avoids the codepath that would actually run in reality.

azat · 2025-08-07T21:01:28Z

I wonder how useful is this failpoint if it avoids the actual codepath that would run in reality?

This one is questionable to me, but I think it is OK

george-larionov added 2 commits July 30, 2025 17:41

fixing issue by adding extra check and error

1f55ce8

adding test

0d5d035

george-larionov linked an issue Jul 31, 2025 that may be closed by this pull request

use_delayed_remote_source failpoint vector out of bounds #83282

Closed

clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 31, 2025

george-larionov added 4 commits July 31, 2025 16:35

appeasing style checker

f626074

Merge remote-tracking branch 'upstream/master' into fix/83282-use_del…

298b50d

…ayed_remote_source-failpoint-vector-out-of-bounds

reverting changes to 02863 test and moving to new test

9a4c9e8

updating to use currentDatabase() func

5520d24

azat self-assigned this Aug 4, 2025

george-larionov added 2 commits August 4, 2025 15:50

adding tag to not run parallel replicas

b6ddba8

adding another tag

38a8eed

azat requested changes Aug 4, 2025

george-larionov added 5 commits August 5, 2025 17:27

adding logic to rethrow exception when no local replica exists

cc3ad62

updating test

77eedd0

removing stateless test and adding integration test

83df7a2

Merge remote-tracking branch 'upstream/master' into fix/83282-use_del…

143cb44

…ayed_remote_source-failpoint-vector-out-of-bounds

adding line to disable the failpoint after test

c3152a2

george-larionov marked this pull request as ready for review August 7, 2025 15:22

azat reviewed Aug 7, 2025

azat approved these changes Aug 7, 2025

Update src/Processors/QueryPlan/ReadFromRemote.cpp

b7d50e4

Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com>

azat enabled auto-merge August 7, 2025 21:02

azat added this pull request to the merge queue Aug 7, 2025

Merged via the queue into master with commit 60c47ef Aug 7, 2025
122 of 124 checks passed

azat deleted the fix/83282-use_delayed_remote_source-failpoint-vector-out-of-bounds branch August 7, 2025 21:17

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 7, 2025

azat merged 14 commits into master from fix/83282-use_delayed_remote_source-failpoint-vector-out-of-bounds


3 participants
