Gracefully fail when querying a delayed remote source#84820
Conversation
|
Workflow [PR], commit [b7d50e4] Summary: ❌
|
…ayed_remote_source-failpoint-vector-out-of-bounds
tests/queries/0_stateless/03581_query_secure_delayed_remote_source.sql
Outdated
Show resolved
Hide resolved
| } | ||
| else | ||
| { | ||
| throw Exception(ErrorCodes::NO_AVAILABLE_REPLICA, "No available replica"); |
There was a problem hiding this comment.
We should never get here with 0 replicas, and the problem is in ReadFromRemote::addLazyPipe, in case of use_delayed_remote_source = true we ignore the exceptions during obtaining the connections here -
ClickHouse/src/Processors/QueryPlan/ReadFromRemote.cpp
Lines 521 to 537 in c5f370c
We need to rethrow the original exception if fallback to local replica (w/o TCP communication) is not possible, i.e.
patch
$ git di
diff --git a/src/Client/MultiplexedConnections.cpp b/src/Client/MultiplexedConnections.cpp
index 48ac03dc595..68058a48e41 100644
--- a/src/Client/MultiplexedConnections.cpp
+++ b/src/Client/MultiplexedConnections.cpp
@@ -186,6 +186,7 @@ void MultiplexedConnections::sendQuery(
const bool enable_offset_parallel_processing = context->canUseOffsetParallelReplicas();
size_t num_replicas = replica_states.size();
+ chassert(num_replicas > 0);
if (num_replicas > 1)
{
if (enable_offset_parallel_processing)
diff --git a/src/Processors/QueryPlan/ReadFromRemote.cpp b/src/Processors/QueryPlan/ReadFromRemote.cpp
index 40e91f4e907..a18341cf277 100644
--- a/src/Processors/QueryPlan/ReadFromRemote.cpp
+++ b/src/Processors/QueryPlan/ReadFromRemote.cpp
@@ -515,6 +515,12 @@ void ReadFromRemote::addLazyPipe(
auto timeouts = ConnectionTimeouts::getTCPTimeoutsWithFailover(current_settings)
.getSaturated(current_settings[Setting::max_execution_time]);
+ bool use_delayed_remote_source = false;
+ fiu_do_on(FailPoints::use_delayed_remote_source,
+ {
+ use_delayed_remote_source = true;
+ });
+
// In case reading from parallel replicas is allowed, lazy case is not triggered,
// so in this case it's required to get only one connection from the pool
std::vector<ConnectionPoolWithFailover::TryResult> try_results;
@@ -529,7 +535,7 @@ void ReadFromRemote::addLazyPipe(
}
catch (const Exception & ex)
{
- if (ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
+ if (use_delayed_remote_source && ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
LOG_WARNING(getLogger("ClusterProxy::SelectStreamFactory"),
"Connections to remote replicas of local shard {} failed, will use stale local replica", my_shard.shard_info.shard_num);
else
@@ -543,12 +549,6 @@ void ReadFromRemote::addLazyPipe(
max_remote_delay = std::max(try_result.delay, max_remote_delay);
}
- bool use_delayed_remote_source = false;
- fiu_do_on(FailPoints::use_delayed_remote_source,
- {
- use_delayed_remote_source = true;
- });
-
if (!use_delayed_remote_source)
{
const auto replicated_storage = std::dynamic_pointer_cast<StorageReplicatedMergeTree>(my_storage);But, the question is, why it fails with remoteSecure? The reason is that in fiddle we don't have SSL configured so it fails, but on CI we do, it should not fail there.
One more question, if the server is not available on that port, why it does not fail during trying to obtain the table structure (in getStructureOfRemoteTable()), this is due to isLocal check, which returns true and clickhouse does not goes via TCP, it simply execute query internally (DESC table) -
ClickHouse/src/Storages/getStructureOfRemoteTable.cpp
Lines 50 to 54 in c5f370c
So, I would say that the problem is this failpoint, we need to do it only for ReplicatedMergeTree and ensure that all tests with it will be correct after this change.
There was a problem hiding this comment.
Thanks for pointing me in this direction, I was looking in the wrong place. Two questions. First, did you mean to do !use_delayed_remote_source in the if statement? Second, is this failpoint the only time that fallback to a local replica is not possible? Is there any way to test directly for the presence of at least one replica before calling the sendQuery function?
There was a problem hiding this comment.
First, did you mean to do !use_delayed_remote_source in the if statement?
Actually after thinking about it more, we should do something like this
Details
$ git di
diff --git a/src/Processors/QueryPlan/ReadFromRemote.cpp b/src/Processors/QueryPlan/ReadFromRemote.cpp
index 40e91f4e907..b5fdbe5493d 100644
--- a/src/Processors/QueryPlan/ReadFromRemote.cpp
+++ b/src/Processors/QueryPlan/ReadFromRemote.cpp
@@ -1,3 +1,4 @@
+#include <exception>
#include <Processors/QueryPlan/ReadFromRemote.h>
#include <Analyzer/QueryNode.h>
@@ -518,6 +519,7 @@ void ReadFromRemote::addLazyPipe(
// In case reading from parallel replicas is allowed, lazy case is not triggered,
// so in this case it's required to get only one connection from the pool
std::vector<ConnectionPoolWithFailover::TryResult> try_results;
+ std::exception_ptr exception_ptr;
try
{
if (my_table_func_ptr)
@@ -529,6 +531,7 @@ void ReadFromRemote::addLazyPipe(
}
catch (const Exception & ex)
{
+ exception_ptr = std::current_exception();
if (ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
LOG_WARNING(getLogger("ClusterProxy::SelectStreamFactory"),
"Connections to remote replicas of local shard {} failed, will use stale local replica", my_shard.shard_info.shard_num);
@@ -577,6 +580,10 @@ void ReadFromRemote::addLazyPipe(
}
}
+
+ if (exception_ptr)
+ std::rethrow_exception(exception_ptr);
+
std::vector<IConnectionPool::Entry> connections;
connections.reserve(try_results.size());
for (auto & try_result : try_results)There was a problem hiding this comment.
Is there any way to test directly for the presence of at least one replica before calling the sendQuery function?
The connection should not be created with zero replicas (so MultiplexedConnection::sendQuery() should never be reached in this case), and we should check it here, the problem is this fallback to local server case, due to which we ignore errors from pool::getMany*(), and later do not check that we have any connections.
There was a problem hiding this comment.
Understood, this looks better to me as well
tests/queries/0_stateless/03581_query_secure_delayed_remote_source.sql
Outdated
Show resolved
Hide resolved
tests/queries/0_stateless/03581_query_secure_delayed_remote_source.sql
Outdated
Show resolved
Hide resolved
…ayed_remote_source-failpoint-vector-out-of-bounds
Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com>
I guess, but isn't the failpoint supposed to model a situation that could happen in real life? Edit: I see what you mean actually, since an actual similar error would probably be caught in the code block that the failpoint avoids. I wonder how useful is this failpoint if it avoids the actual codepath that would run in reality? |
This one is quiestionable to me, but I think it is OK |
Resolves #83282. Sometimes when the remote source is delayed sendQuery() receives an empty replica_states vector, which was not checked for. The fix was to add a check for an empty vector and fail gracefully in this case.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed issue where querying a delayed remote source could result in vector out of bounds.