Company or project name
We use ClickHouse to store tracing data.
Describe what's wrong
ClickHouse server crashes with a segmentation fault (signal 11) when a distributed query using argMin() on Map-typed columns hits the per-query memory limit during aggregate state deserialization from a remote shard. The crash occurs not at the point of the memory limit exception, but ~100ms later during cleanup of the RemoteQueryExecutorReadContext fiber, when ColumnAggregateFunction::~ColumnAggregateFunction() attempts to destroy aggregate states containing a corrupted Field.
The server itself produced a crash report directing us to file this issue (Report this error to https://github.com/ClickHouse/ClickHouse/issues).
Does it reproduce on the most recent release?
Observed on 25.3.3.42-lts (official build). We have not yet tested on the latest release.
- Build ID: 9973936C3E9C99EB047A24A5D0962B5E00E1A7E3
- Git hash: c4bfe68b052e4a15f731077d86d83b9bc2e5b71f
- Architecture: x86_64 Linux
How to reproduce
Conditions
The crash requires a distributed query that:
- Uses argMin(<map_column>, <datetime_column>) where the first argument is a Map type
- Groups by a high-cardinality key (e.g., millions of unique values)
- Processes enough data that aggregate state deserialization from remote shards approaches the per-query max_memory_usage limit (~20 GiB in our case)
- Receives aggregate state data from remote shards via the native protocol
Query shape
SELECT
    group_key,
    argMin(map_column, datetime_column) AS first_map,  -- Map in argMin triggers the bug
    min(datetime_column) AS first_event
FROM distributed_table
WHERE datetime_column >= now() - INTERVAL 30 MINUTE
    AND some_filter = 'value'
GROUP BY group_key
ORDER BY first_event DESC
LIMIT 500
Non-default settings
max_memory_usage = 21474836480 (20 GiB)
max_memory_usage_for_user = 102005473280 (95 GiB)
max_bytes_before_external_group_by = 10737418240 (10 GiB)
max_bytes_before_external_sort = 10737418240 (10 GiB)
max_concurrent_queries_for_user = 500
Factors that increase likelihood
- Multiple argMin(<map_col>, ...) expressions in the same query
- Large Map values (many keys per row)
- High-cardinality GROUP BY keys producing many aggregate groups
- Memory pressure from concurrent queries
Local reproduction
We were unable to reproduce the crash in a local Docker-based 3-node cluster (x86_64 emulated via qemu on arm64 macOS). This is consistent with the crash depending on specific memory layout, jemalloc behavior, fiber TLS state timing, and scale (~20 GiB working set across many shards) that are difficult to replicate outside production. The crash has occurred multiple times in our production environment with the exact same stack trace.
Unit test (gtest)
We have written a gtest that directly exercises the deserialization crash path without requiring a distributed cluster. It calls createAndDeserializeBatch with a MemoryTracker limit configured to trigger MEMORY_LIMIT_EXCEEDED during SerializationMap::deserializeBinary → map.reserve(), then checks that ColumnAggregateFunction destructor cleanup doesn't segfault.
Place in src/AggregateFunctions/tests/gtest_argmin_map_oom.cpp (auto-discovered by GLOB_RECURSE("gtest*.cpp") in src/CMakeLists.txt). Build with ninja unit_tests_dbms, run with ./unit_tests_dbms --gtest_filter='ArgMinMapOOM.*'.
All APIs verified against the source at commit c4bfe68b.
If the test segfaults, it directly reproduces the production crash. If it passes, the bug is specific to the fiber cleanup path (the test bypasses the fiber layer), and the fix should target RemoteQueryExecutorReadContext fiber TLS management.
gtest_argmin_map_oom.cpp
#include <gtest/gtest.h>
#include <AggregateFunctions/AggregateFunctionFactory.h>
#include <AggregateFunctions/IAggregateFunction.h>
#include <Columns/ColumnAggregateFunction.h>
#include <Common/Arena.h>
#include <Common/MemoryTracker.h>
#include <Common/CurrentThread.h>
#include <Common/ThreadStatus.h>
#include <Core/Field.h>
#include <DataTypes/DataTypeDateTime64.h>
#include <DataTypes/DataTypeMap.h>
#include <DataTypes/DataTypeString.h>
#include <DataTypes/Serializations/SerializationAggregateFunction.h>
#include <IO/ReadBufferFromString.h>
#include <IO/WriteBufferFromString.h>
#include <IO/WriteHelpers.h>

using namespace DB;

namespace DB { namespace ErrorCodes { extern const int MEMORY_LIMIT_EXCEEDED; } }

namespace
{

AggregateFunctionPtr createArgMinMapFunction()
{
    auto & factory = AggregateFunctionFactory::instance();
    DataTypes arg_types = {
        std::make_shared<DataTypeMap>(
            std::make_shared<DataTypeString>(),
            std::make_shared<DataTypeString>()),
        std::make_shared<DataTypeDateTime64>(3),
    };
    Array params;
    AggregateFunctionProperties properties;
    return factory.get("argMin", NullsAction::EMPTY, arg_types, params, properties);
}

Field makeMapField(size_t n, size_t vsize, size_t seed)
{
    Map map;
    map.reserve(n);
    for (size_t i = 0; i < n; ++i)
    {
        Tuple kv(2);
        kv[0] = "key_" + std::to_string(seed) + "_" + std::to_string(i);
        kv[1] = std::string(vsize, 'A' + static_cast<char>((seed + i) % 26));
        map.push_back(std::move(kv));
    }
    return Field(std::move(map));
}

/// Serialize num_states argMin states.
/// Format per state (matching SingleValueDataGeneric::write/read):
///   result: UInt8 has (1) + SerializationMap::serializeBinary(Field)
///   value:  UInt8 has (1) + SerializationDateTime64::serializeBinary(Field)
std::string serializeStates(
    const AggregateFunctionPtr & func,
    size_t num_states, size_t entries_per_map, size_t value_size)
{
    auto ser_res = func->getArgumentTypes()[0]->getDefaultSerialization();
    auto ser_val = func->getArgumentTypes()[1]->getDefaultSerialization();
    WriteBufferFromOwnString wb;
    for (size_t i = 0; i < num_states; ++i)
    {
        UInt8 has = 1;
        writeBinary(has, wb);
        Field map_field = makeMapField(entries_per_map, value_size, i);
        ser_res->serializeBinary(map_field, wb, {});
        writeBinary(has, wb);
        Field dt_field = DecimalField<DateTime64>(DateTime64(1000000 + static_cast<Int64>(i)), 3);
        ser_val->serializeBinary(dt_field, wb, {});
    }
    return wb.str();
}

SerializationPtr makeAggSerialization(const AggregateFunctionPtr & func)
{
    return std::make_shared<SerializationAggregateFunction>(func, func->getName(), 0);
}

} // anonymous namespace

/// Baseline: deserialization works without memory limits.
TEST(ArgMinMapOOM, BaselineDeserializeSucceeds)
{
    DB::ThreadStatus thread_status;
    auto func = createArgMinMapFunction();
    std::string serialized = serializeStates(func, 50, 30, 50);

    auto col = ColumnAggregateFunction::create(func);
    ReadBufferFromString rb(serialized);
    auto ser = makeAggSerialization(func);
    ser->deserializeBinaryBulk(*col, rb, 50, 0);
    EXPECT_EQ(col->size(), 50u);
}

/// MAIN REPRODUCTION: OOM during createAndDeserializeBatch, verify cleanup.
/// SEGFAULT = bug reproduced. PASS = exception safety is correct.
TEST(ArgMinMapOOM, DeserializeOOMDoesNotCrash)
{
    DB::ThreadStatus thread_status;
    auto func = createArgMinMapFunction();
    const size_t num_states = 200, entries = 100, vsize = 200;
    std::string serialized = serializeStates(func, num_states, entries, vsize);

    auto * tracker = CurrentThread::getMemoryTracker();
    ASSERT_NE(tracker, nullptr);
    tracker->setHardLimit(static_cast<Int64>(50 * entries * vsize));

    auto col = ColumnAggregateFunction::create(func);
    ReadBufferFromString rb(serialized);
    auto ser = makeAggSerialization(func);

    bool oom = false;
    try
    {
        ser->deserializeBinaryBulk(*col, rb, num_states, 0);
    }
    catch (const Exception & e)
    {
        if (e.code() == ErrorCodes::MEMORY_LIMIT_EXCEEDED)
            oom = true;
        else
            throw;
    }
    tracker->setHardLimit(0); // remove limit for safe cleanup

    EXPECT_TRUE(oom);
    EXPECT_GT(col->size(), 0u);
    EXPECT_LT(col->size(), num_states);
    // col destructor runs here — segfault = bug reproduced.
}

/// Stress: 34 iterations with varying memory limits.
TEST(ArgMinMapOOM, RepeatedOOMStressTest)
{
    DB::ThreadStatus thread_status;
    auto func = createArgMinMapFunction();
    std::string serialized = serializeStates(func, 300, 80, 150);

    auto * tracker = CurrentThread::getMemoryTracker();
    ASSERT_NE(tracker, nullptr);
    for (size_t f = 1; f <= 100; f += 3)
    {
        tracker->setHardLimit(static_cast<Int64>(f * 80 * 150));
        auto col = ColumnAggregateFunction::create(func);
        ReadBufferFromString rb(serialized);
        auto ser = makeAggSerialization(func);
        try
        {
            ser->deserializeBinaryBulk(*col, rb, 300, 0);
        }
        catch (const Exception &)
        {
        }
        tracker->setHardLimit(0);
        // col destructor — must not crash at any limit.
    }
}
Expected behavior
When MEMORY_LIMIT_EXCEEDED is thrown during distributed aggregate state deserialization, the query should fail cleanly with error code 241 without crashing the server. All partially-constructed aggregate states should be safely destroyed.
Error message and/or stacktrace
Error (MEMORY_LIMIT_EXCEEDED)
2026.02.18 01:23:00.912444 [ 981618 ] {64fc6438-f4e6-42ad-92b4-4852c83d98ab} <Error> executeQuery:
Code: 241. DB::Exception: Query memory limit exceeded: would use 20.00 GiB
(attempt to allocate chunk of 4.00 MiB bytes), maximum: 20.00 GiB.:
while receiving packet from <remote_shard>:30901: While executing Remote. (MEMORY_LIMIT_EXCEEDED)
(version 25.3.3.42 (official build))
Exception stack trace:
MemoryTracker::allocImpl()
MemoryTracker::allocImpl()
CurrentMemoryTracker::allocImpl()
AllocatorWithMemoryTracking<DB::Field>::allocate()
std::vector<DB::Field, AllocatorWithMemoryTracking<DB::Field>>::reserve()
DB::SerializationMap::deserializeBinary(DB::Field&, DB::ReadBuffer&, DB::FormatSettings const&) const
DB::SingleValueDataGeneric::read(DB::ReadBuffer&, DB::ISerialization const&, DB::Arena*)
DB::AggregateFunctionArgMinMax<DB::AggregateFunctionArgMinMaxDataGeneric<DB::SingleValueDataFixed<DB::DateTime64>>, true>::deserialize()
DB::IAggregateFunctionHelper<...>::createAndDeserializeBatch()
DB::SerializationAggregateFunction::deserializeBinaryBulk()
DB::ISerialization::deserializeBinaryBulkWithMultipleStreams()
DB::NativeReader::read()
DB::Connection::receiveDataImpl()
DB::Connection::receivePacket()
DB::PacketReceiver::Task::run()
boost::context::detail::fiber_entry<...>()
Crash (Segfault — signal 11)
2026.02.18 01:23:01.535153 [ 992999 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2026.02.18 01:23:01.535216 [ 992999 ] {} <Fatal> BaseDaemon: (version 25.3.3.42 (official build)) Received signal 11
2026.02.18 01:23:01.535234 [ 992999 ] {} <Fatal> BaseDaemon: Signal description: Segmentation fault
2026.02.18 01:23:01.535253 [ 992999 ] {} <Fatal> BaseDaemon: Address: 0x50. Access: read. Address not mapped to object.
Crash stack trace:
0. signalHandler(int, siginfo_t*, void*)
1. ? @ 0x000000000003dc90
2. free_default -- SIGSEGV: called with ptr = 0x50
3. DB::Field::~Field()
4. DB::Field::~Field()
5. DB::ColumnAggregateFunction::~ColumnAggregateFunction()
6. DB::ColumnAggregateFunction::~ColumnAggregateFunction()
7. std::vector<DB::ColumnWithTypeAndName>::__destroy_vector::operator()()
8. DB::Packet::~Packet()
9. DB::RemoteQueryExecutorReadContext::~RemoteQueryExecutorReadContext()
10. DB::RemoteQueryExecutorReadContext::~RemoteQueryExecutorReadContext()
11. DB::RemoteQueryExecutor::~RemoteQueryExecutor()
12. DB::RemoteSource::~RemoteSource()
13. std::__shared_ptr_emplace<...>::__on_zero_shared()
14. DB::QueryPipeline::~QueryPipeline()
15. DB::QueryPipeline::reset()
16. DB::TCPHandler::runImpl()
17. DB::TCPHandler::run()
18. Poco::Net::TCPServerConnection::start()
19. Poco::Net::TCPServerDispatcher::run()
20. Poco::PooledThread::run()
21. Poco::ThreadImpl::runnableEntry(void*)
Integrity check passed: checksum: 4A692E92B3118E45A8AAFDD0CC48C1C5
Additional context
No response