
Segmentation fault while establishing a client connection #51780

Closed
antonio2368 opened this issue Jul 4, 2023 · 10 comments

Labels
fuzz Problem found by one of the fuzzers

antonio2368 (Member) commented Jul 4, 2023

Related report https://s3.amazonaws.com/clickhouse-test-reports/0/4d76b538c7838c55105746bfdd99db1e01e1fd91/stateless_tests__release__s3_storage__[2_2].html

Stacktrace:

+ rg -Fa '<Fatal>' /var/log/clickhouse-server/clickhouse-server.log
2023.07.04 14:27:59.504843 [ 78113 ] {} <Fatal> BaseDaemon: ########################################
2023.07.04 14:27:59.504970 [ 78113 ] {} <Fatal> BaseDaemon: (version 23.7.1.159 (official build), build id: 8C1181B1F0C61E356B94FE9CD958359D5AD01EF8, git hash: 4d76b538c7838c55105746bfdd99db1e01e1fd91) (from thread 9796) (query_id: 613575a4-4584-409d-8d26-ee96a1a42ecb) (query: -- these are just in case
2023.07.04 14:27:59.505000 [ 78113 ] {} <Fatal> BaseDaemon: Address: 0x7fd61e7c7450. Access: read. Address not mapped to object.
2023.07.04 14:27:59.505025 [ 78113 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000014b266b6 0x0000000014b28ede 0x0000000014b41160 0x0000000014b42c3d 0x0000000012c1c883
2023.07.04 14:27:59.547828 [ 78113 ] {} <Fatal> BaseDaemon: 2.1. inlined from ./build_docker/./src/Common/CurrentThread.cpp:95: DB::CurrentThread::getGroup()
2023.07.04 14:27:59.547868 [ 78113 ] {} <Fatal> BaseDaemon: 2. ./build_docker/./src/Client/Connection.cpp:192: DB::Connection::connect(DB::ConnectionTimeouts const&) @ 0x0000000014b266b6 in /usr/lib/debug/usr/bin/clickhouse.debug
2023.07.04 14:27:59.582162 [ 78113 ] {} <Fatal> BaseDaemon: 3. ./build_docker/./src/Client/Connection.cpp:432: DB::Connection::getServerRevision(DB::ConnectionTimeouts const&) @ 0x0000000014b28ede in /usr/lib/debug/usr/bin/clickhouse.debug
2023.07.04 14:27:59.590283 [ 78113 ] {} <Fatal> BaseDaemon: 4. ./build_docker/./src/Client/ConnectionEstablisher.cpp:0: DB::ConnectionEstablisher::run(PoolWithFailoverBase<DB::IConnectionPool>::TryResult&, String&) @ 0x0000000014b41160 in /usr/lib/debug/usr/bin/clickhouse.debug
2023.07.04 14:27:59.596763 [ 78113 ] {} <Fatal> BaseDaemon: 5. ./build_docker/./src/Client/ConnectionEstablisher.cpp:129: DB::ConnectionEstablisherAsync::Task::run(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>, std::function<void ()>) @ 0x0000000014b42c3d in /usr/lib/debug/usr/bin/clickhouse.debug
2023.07.04 14:27:59.600837 [ 78113 ] {} <Fatal> BaseDaemon: 6.1. inlined from ./build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:818: ~__policy_func
2023.07.04 14:27:59.600864 [ 78113 ] {} <Fatal> BaseDaemon: 6.2. inlined from ./build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:1174: ~function
2023.07.04 14:27:59.600884 [ 78113 ] {} <Fatal> BaseDaemon: 6.3. inlined from ./build_docker/./src/Common/AsyncTaskExecutor.cpp:77: DB::AsyncTaskExecutor::Routine::operator()(std::function<void ()>)
2023.07.04 14:27:59.600903 [ 78113 ] {} <Fatal> BaseDaemon: 6.4. inlined from ./build_docker/./src/Common/Fiber.h:70: Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>::operator()(boost::context::fiber&&)
2023.07.04 14:27:59.600928 [ 78113 ] {} <Fatal> BaseDaemon: 6.5. inlined from ./build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:394: decltype(std::declval<Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>&>()(std::declval<boost::context::fiber>())) std::__invoke[abi:v15000]<Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>&, boost::context::fiber>(Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>&, boost::context::fiber&&)
2023.07.04 14:27:59.600959 [ 78113 ] {} <Fatal> BaseDaemon: 6.6. inlined from ./build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:531: std::invoke_result<Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>&, boost::context::fiber>::type std::invoke<Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>&, boost::context::fiber>(Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>&, boost::context::fiber&&)
2023.07.04 14:27:59.600978 [ 78113 ] {} <Fatal> BaseDaemon: 6.7. inlined from ./build_docker/./contrib/boost/boost/context/fiber_fcontext.hpp:140: boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>::run(void*)
2023.07.04 14:27:59.600987 [ 78113 ] {} <Fatal> BaseDaemon: 6. ./build_docker/./contrib/boost/boost/context/fiber_fcontext.hpp:80: void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t) @ 0x0000000012c1c883 in /usr/lib/debug/usr/bin/clickhouse.debug
2023.07.04 14:27:59.761859 [ 78113 ] {} <Fatal> BaseDaemon: Integrity check of the executable successfully passed (checksum: B5FE19DBDA917FF0CC3F73D8A26D445C)
2023.07.04 14:27:59.859655 [ 78113 ] {} <Fatal> BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues
2023.07.04 14:27:59.859798 [ 78113 ] {} <Fatal> BaseDaemon: Changed settings: min_compress_block_size = 2397122, max_compress_block_size = 2963649, max_block_size = 20523, max_insert_threads = 10, max_threads = 52, max_read_buffer_size = 943450, connect_timeout_with_failover_ms = 2000, connect_timeout_with_failover_secure_ms = 3000, idle_connection_timeout = 36000, s3_check_objects_after_upload = true, use_uncompressed_cache = true, stream_like_engine_allow_direct_select = true, enable_multiple_prewhere_read_steps = true, replication_wait_for_inactive_replica_timeout = 30, group_by_two_level_threshold = 603281, group_by_two_level_threshold_bytes = 1, distributed_aggregation_memory_efficient = false, enable_memory_bound_merging_of_aggregation_results = false, allow_nonconst_timezone_arguments = true, min_chunk_bytes_for_parallel_parsing = 13557159, merge_tree_coarse_index_granularity = 7, min_bytes_to_use_direct_io = 2943395733, min_bytes_to_use_mmap_io = 1328036391, log_queries = true, insert_quorum_timeout = 60000, http_response_buffer_size = 5862141, fsync_metadata = false, http_send_timeout = 60., http_receive_timeout = 60., opentelemetry_start_trace_probability = 0.10000000149011612, max_untracked_memory = 1048576, memory_profiler_step = 1048576, log_comment = '01752_distributed_query_sigsegv.sql', send_logs_level = 'fatal', optimize_aggregation_in_order = true, aggregation_in_order_max_block_bytes = 29541, read_in_order_two_level_merge_threshold = 20, database_atomic_wait_for_drop_and_detach_synchronously = true, local_filesystem_read_method = 'pread', remote_filesystem_read_method = 'read', local_filesystem_read_prefetch = true, remote_filesystem_read_prefetch = false, async_insert_busy_timeout_ms = 5000, enable_filesystem_cache = true, enable_filesystem_cache_on_write_operations = true, load_marks_asynchronously = true, allow_prefetched_read_pool_for_remote_filesystem = false, allow_prefetched_read_pool_for_local_filesystem = false, 
filesystem_prefetch_step_bytes = 104857600, filesystem_prefetch_min_bytes_for_single_read_task = 16777216, filesystem_prefetch_max_memory_usage = 134217728, filesystem_prefetches_limit = 10, insert_keeper_max_retries = 20, insert_keeper_fault_injection_probability = 0.009999999776482582, optimize_distinct_in_order = false
2023.07.04 14:28:01.008986 [ 728 ] {} <Fatal> Application: Child process was terminated by signal 11.

The problematic query:

2772493:2023.07.04 14:12:26.184657 [ 9796 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Debug> executeQuery: Query span trace_id for opentelemetry log: cab14e2e-7990-ee6f-e342-fa91d0256bea
2772494:2023.07.04 14:12:26.184707 [ 9796 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Debug> executeQuery: (from [::1]:34258) (comment: 01752_distributed_query_sigsegv.sql) -- these are just in case
2772495- SELECT throwIf(dummy = 0) FROM remote('127.{1,2}', system.one); (stage: Complete)
2772496:2023.07.04 14:12:26.184881 [ 9796 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Trace> ContextAccess (default): Access granted: CREATE TEMPORARY TABLE, REMOTE ON *.*
2772499:2023.07.04 14:12:26.185173 [ 9796 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Trace> ContextAccess (default): Access granted: SELECT(dummy) ON system.one
2772500:2023.07.04 14:12:26.185195 [ 9796 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2772502:2023.07.04 14:12:26.185226 [ 9796 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Trace> InterpreterSelectQuery: Complete -> Complete
2772505:2023.07.04 14:12:26.185654 [ 77740 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Trace> Connection (127.2:9000): Connecting. Database: (not specified). User: default
2772892:2023.07.04 14:12:26.238591 [ 9796 ] {613575a4-4584-409d-8d26-ee96a1a42ecb} <Error> executeQuery: Code: 395. DB::Exception: Value passed to 'throwIf' function is non-zero: while executing 'FUNCTION throwIf(equals(dummy, 0) :: 2) -> throwIf(equals(dummy, 0)) UInt8 : 1'. (FUNCTION_THROW_IF_VALUE_IS_NON_ZERO) (version 23.7.1.159 (official build)) (from [::1]:34258) (comment: 01752_distributed_query_sigsegv.sql) (in query: -- these are just in case
2776200:2023.07.04 14:27:59.504970 [ 78113 ] {} <Fatal> BaseDaemon: (version 23.7.1.159 (official build), build id: 8C1181B1F0C61E356B94FE9CD958359D5AD01EF8, git hash: 4d76b538c7838c55105746bfdd99db1e01e1fd91) (from thread 9796) (query_id: 613575a4-4584-409d-8d26-ee96a1a42ecb) (query: -- these are just in case

The segfault happens inside DB::CurrentThread::getGroup(), which is called from LOG_TRACE.

I see fibers mentioned in the stack trace; @Avogar, maybe you know what it could be?

cc @azat, as it's your test.

antonio2368 added the fuzz (Problem found by one of the fuzzers) label on Jul 4, 2023
Avogar (Member) commented Jul 4, 2023

Similar issue: #51313
I'm still not sure how it's possible and haven't been able to reproduce it, but I'm still trying.

Avogar self-assigned this on Jul 4, 2023
Avogar (Member) commented Jul 4, 2023

@azat, I'd be happy if you could also take a look when you have time, since you know this code well and have found a lot of bugs :)

Avogar (Member) commented Jul 4, 2023

The problem is that somehow, during fiber execution, we get an invalid thread-local pointer current_thread.

azat (Collaborator) commented Jul 8, 2023

Yeah, it is hard to debug such issues, but it is much easier with core dumps. How about also adding core dumps from stateless tests to the artifacts (as was already done for the stress and AST fuzzer tests)?

Avogar (Member) commented Jul 25, 2023

So, what is happening:

  • We handle a TCP query from a client
  • Run the query and create a RemoteQueryExecutor
  • Start creating connections asynchronously
  • Get an exception while executing the query locally because of throwIf
  • Stop query execution
  • Destroy the RemoteQueryExecutor in the TCPHandler thread
  • In the RemoteQueryExecutor destructor we run the fiber to finish creating the connections, so as not to leave them in an invalid state
  • Inside the fiber we access current_thread and get a segfault on it.

So, for an unknown reason, current_thread in the TCPHandler thread became invalid, or the fiber has a different current_thread for some reason.
We also have another, side problem: on exception we cancel execution in the RemoteQueryExecutor destructor, but we should do it explicitly in PipelineExecutor by calling RemoteQueryExecutor::cancel, not in the TCPHandler thread. It seems I found a small bug in PipelineExecutor: #52533
So maybe it will solve the issue, because we will then finish establishing connections in the proper thread.

Avogar (Member) commented Aug 9, 2023

I haven't seen this failure since #52533, so let's close.

azat (Collaborator) commented Apr 8, 2024

I've seen this today:

2024.04.08 13:38:48.060743 [ 2181342 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2024.04.08 13:38:48.060827 [ 2181342 ] {} <Fatal> BaseDaemon: (version 24.2.2.71 (official build), build id: 57F857DCFE8BA6838F6463E4665CD700852BFF0E, git hash: 9293d361e72be9f6ccfd444d504e2137b2e837cf) (from thread 2120710) Received signal 11
2024.04.08 13:38:48.060987 [ 2181342 ] {} <Fatal> BaseDaemon: Signal description: Segmentation fault
2024.04.08 13:38:48.061013 [ 2181342 ] {} <Fatal> BaseDaemon: Address: 0x7fd0c3b0ee48. Access: read. Address not mapped to object.
2024.04.08 13:38:48.061032 [ 2181342 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000012ebdef0 0x0000000012ec065e 0x0000000012ecd6d3 0x0000000012ecf668 0x0000000010ce14e3
2024.04.08 13:38:48.061047 [ 2181342 ] {} <Fatal> BaseDaemon: ########################################
2024.04.08 13:38:48.061138 [ 2181342 ] {} <Fatal> BaseDaemon: (version 24.2.2.71 (official build), build id: 57F857DCFE8BA6838F6463E4665CD700852BFF0E, git hash: 9293d361e72be9f6ccfd444d504e2137b2e837cf) (from thread 2120710) (query_id: 6939178d-e550-47f0-850a-89d43e72b68c) (query:
2024.04.08 13:38:48.061232 [ 2181342 ] {} <Fatal> BaseDaemon: Address: 0x7fd0c3b0ee48. Access: read. Address not mapped to object.
2024.04.08 13:38:48.061273 [ 2181342 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000012ebdef0 0x0000000012ec065e 0x0000000012ecd6d3 0x0000000012ecf668 0x0000000010ce14e3
2024.04.08 13:38:48.136737 [ 2181342 ] {} <Fatal> BaseDaemon: 2.0. inlined from ./src/Common/CurrentThread.cpp:94: DB::CurrentThread::getGroup()
2024.04.08 13:38:48.136801 [ 2181342 ] {} <Fatal> BaseDaemon: 2. ./build_docker/./src/Client/Connection.cpp:199: DB::Connection::connect(DB::ConnectionTimeouts const&) @ 0x0000000012ebdef0
2024.04.08 13:38:48.177997 [ 2181342 ] {} <Fatal> BaseDaemon: 3. ./build_docker/./src/Client/Connection.cpp:495: DB::Connection::getServerRevision(DB::ConnectionTimeouts const&) @ 0x0000000012ec065e
2024.04.08 13:38:48.186829 [ 2181342 ] {} <Fatal> BaseDaemon: 4. ./build_docker/./src/Client/ConnectionEstablisher.cpp:0: DB::ConnectionEstablisher::run(PoolWithFailoverBase<DB::IConnectionPool>::TryResult&, String&) @ 0x0000000012ecd6d3
2024.04.08 13:38:48.195735 [ 2181342 ] {} <Fatal> BaseDaemon: 5. ./build_docker/./src/Client/ConnectionEstablisher.cpp:127: DB::ConnectionEstablisherAsync::Task::run(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>, std::function<void ()>) @ 0x0000000012ecf668
2024.04.08 13:38:48.202229 [ 2181342 ] {} <Fatal> BaseDaemon: 6.0. inlined from ./contrib/llvm-project/libcxx/include/__functional/function.h:818: ?
2024.04.08 13:38:48.202284 [ 2181342 ] {} <Fatal> BaseDaemon: 6.1. inlined from ./contrib/llvm-project/libcxx/include/__functional/function.h:1174: ?
2024.04.08 13:38:48.202316 [ 2181342 ] {} <Fatal> BaseDaemon: 6.2. inlined from ./build_docker/./src/Common/AsyncTaskExecutor.cpp:88: DB::AsyncTaskExecutor::Routine::operator()(std::function<void ()>)
2024.04.08 13:38:48.202350 [ 2181342 ] {} <Fatal> BaseDaemon: 6.3. inlined from ./src/Common/Fiber.h:76: Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>::operator()(boost::context::fiber&&)
2024.04.08 13:38:48.202416 [ 2181342 ] {} <Fatal> BaseDaemon: 6.4. inlined from ./contrib/llvm-project/libcxx/include/__functional/invoke.h:394: ?
2024.04.08 13:38:48.202586 [ 2181342 ] {} <Fatal> BaseDaemon: 6.5. inlined from ./contrib/llvm-project/libcxx/include/__functional/invoke.h:531: ?
2024.04.08 13:38:48.202617 [ 2181342 ] {} <Fatal> BaseDaemon: 6.6. inlined from ./contrib/boost/boost/context/fiber_fcontext.hpp:157: boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>::run(void*)
2024.04.08 13:38:48.202651 [ 2181342 ] {} <Fatal> BaseDaemon: 6. ./contrib/boost/boost/context/fiber_fcontext.hpp:97: void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t) @ 0x0000000010ce14e3
2024.04.08 13:38:48.410544 [ 2181342 ] {} <Fatal> BaseDaemon: Integrity check of the executable successfully passed (checksum: 3485110FABDB0C94202BD684999A9814)
2024.04.08 13:38:48.410618 [ 2181342 ] {} <Fatal> BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues
2024.04.08 13:38:48.410757 [ 2181342 ] {} <Fatal> BaseDaemon: Changed settings: max_threads = 1, max_query_size = 2097152, use_uncompressed_cache = false, distributed_background_insert_sleep_time_ms = 60000, distributed_background_insert_max_sleep_time_ms = 180000, distributed_background_insert_batch = true, load_balancing = 'random', compile_expressions = false, compile_aggregate_expressions = false, compile_sort_description = false, max_execution_time = 90., max_memory_usage = 107374182400, external_table_strict_query = true, max_partitions_per_insert_block = 1000, allow_experimental_database_materialized_mysql = true, allow_experimental_map_type = true, background_message_broker_schedule_pool_size = 32, backup_threads = 1

Sadly, I no longer have the query logs from that time (sigh).
