
ClickHouse does not shut down gracefully on systemctl stop clickhouse-server, but there are no errors in the logs #31478

Open
deepdivenow opened this issue Nov 17, 2021 · 7 comments
Labels: minor (Priority: minor), operations

Comments

@deepdivenow

ClickHouse version: 21.10.2.15
Command: systemctl restart clickhouse-server

How to reproduce

systemctl restart clickhouse-server

Expected a graceful shutdown. Instead, main.log shows a forceful shutdown followed by an unclean restart:
2021.11.17 16:38:36.000118 [ 16805 ] {} AsynchronousMetrics: MemoryTracking: was 3.66 GiB, peak 3.71 GiB, will set to 3.65 GiB (RSS), difference: -3.99 MiB
2021.11.17 16:38:36.714041 [ 16651 ] {} Application: Closed connections. But 1 remain. Tip: To increase wait time add to config: <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
2021.11.17 16:38:36.714103 [ 16651 ] {} Application: Will shutdown forcefully.
2021.11.17 16:38:36.867171 [ 16640 ] {} Application: Child process exited normally with code 0.
2021.11.17 16:39:25.043803 [ 18271 ] {} Application: Will watch for the process with pid 18283
2021.11.17 16:39:25.044084 [ 18283 ] {} Application: Forked a child process to watch
2021.11.17 16:39:25.044656 [ 18283 ] {} SentryWriter: Sending crash reports is disabled
2021.11.17 16:39:25.044818 [ 18283 ] {} Pipe: Pipe capacity is 1.00 MiB
2021.11.17 16:39:25.095628 [ 18283 ] {} : Starting ClickHouse 21.10.2.15 with revision 54455, build id: 6699B86599A2121E78E0D42DD67791ABD9AE5265, PID 18283
2021.11.17 16:39:25.095741 [ 18283 ] {} Application: starting up
2021.11.17 16:39:25.095766 [ 18283 ] {} Application: OS name: Linux, version: 4.15.0-162-generic, architecture: x86_64
2021.11.17 16:39:25.215198 [ 18283 ] {} Application: Calculated checksum of the binary: 902F2F159F07FEEC02834AEF44CBFB70, integrity check passed.
2021.11.17 16:39:25.215301 [ 18283 ] {} StatusFile: Status file /srv/clickhouse/server/status already exists - unclean restart. Contents:
PID: 16651
Started at: 2021-11-17 16:31:13
Revision: 54455

Additional context:
Where can I find a description of the clickhouse-server shutdown process? Why does it report an unclean restart when there are no error messages about it?

@deepdivenow added the potential bug (To be reviewed by developers and confirmed/rejected) label on Nov 17, 2021
@deepdivenow
Author

A SYSTEM SHUTDOWN query produces the same behavior:

2021.11.17 19:34:25.995307 [ 9338 ] {45e11e5e-487f-46b1-8505-cc9be54c8bcd} executeQuery: (from 127.0.0.1:55602) SYSTEM SHUTDOWN
2021.11.17 19:34:25.995446 [ 9338 ] {45e11e5e-487f-46b1-8505-cc9be54c8bcd} ContextAccess (default): Access granted: SYSTEM SHUTDOWN ON .
2021.11.17 19:34:25.995652 [ 9338 ] {45e11e5e-487f-46b1-8505-cc9be54c8bcd} MemoryTracker: Peak memory usage (for query): 0.00 B.
2021.11.17 19:34:25.995664 [ 9331 ] {} BaseDaemon: Received signal 15
2021.11.17 19:34:25.995721 [ 9331 ] {} Application: Received termination signal (Terminated)
2021.11.17 19:34:25.995750 [ 9338 ] {} TCPHandler: Processed in 0.000740994 sec.
2021.11.17 19:34:25.995753 [ 9331 ] {} BaseDaemon: Received signal 15
2021.11.17 19:34:25.995758 [ 9327 ] {} Application: Received termination signal.
2021.11.17 19:34:25.995806 [ 9331 ] {} Application: Received termination signal (Terminated)
2021.11.17 19:34:25.995808 [ 9338 ] {} TCPHandler: Done processing connection.
2021.11.17 19:34:25.995834 [ 9327 ] {} Application: Waiting for current connections to close.
2021.11.17 19:34:26.146686 [ 9342 ] {} SystemLog (system.query_log): Flushing system log, 6 entries to flush up to offset 366
2021.11.17 19:34:26.151071 [ 9342 ] {} DiskLocal: Reserving 1.00 MiB on disk default, having unreserved 8.40 TiB.
2021.11.17 19:34:26.153919 [ 9342 ] {} system.query_log: Renaming temporary part tmp_insert_202111_46_46_0 to 202111_14280_14280_0.
2021.11.17 19:34:26.154164 [ 9361 ] {} system.query_log (MergerMutator): Selected 2 parts from 202111_12936_14275_675 to 202111_14276_14276_0
2021.11.17 19:34:26.154227 [ 9361 ] {} DiskLocal: Reserving 1.07 MiB on disk default, having unreserved 8.40 TiB.
2021.11.17 19:34:26.154299 [ 9498 ] {} system.query_log (MergerMutator): Merging 2 parts: from 202111_12936_14275_675 to 202111_14276_14276_0 into Compact
2021.11.17 19:34:26.154325 [ 9342 ] {} SystemLog (system.query_log): Flushed system log up to offset 366
2021.11.17 19:34:26.154553 [ 9498 ] {} system.query_log (MergerMutator): Selected MergeAlgorithm: Horizontal
2021.11.17 19:34:26.154754 [ 9498 ] {} MergeTreeSequentialSource: Reading 3 marks from part 202111_12936_14275_675, total 10156 rows starting from the beginning of the part
2021.11.17 19:34:26.155472 [ 9498 ] {} MergeTreeSequentialSource: Reading 2 marks from part 202111_14276_14276_0, total 8 rows starting from the beginning of the part
2021.11.17 19:34:26.189510 [ 9498 ] {} system.query_log (MergerMutator): Merge sorted 10164 rows, containing 65 columns (65 merged, 0 gathered) in 0.035218391 sec., 288599.2151089469 rows/sec., 317.17 MiB/sec.
2021.11.17 19:34:26.192586 [ 9498 ] {} system.query_log: Renaming temporary part tmp_merge_202111_12936_14276_676 to 202111_12936_14276_676.
2021.11.17 19:34:26.192693 [ 9498 ] {} system.query_log (MergerMutator): Merged 2 parts: from 202111_12936_14275_675 to 202111_14276_14276_0
2021.11.17 19:34:26.192713 [ 9498 ] {} MemoryTracker: Peak memory usage: 29.84 MiB.
2021.11.17 19:34:26.804178 [ 9327 ] {} Application: Closed all listening sockets. Waiting for 1 outstanding connections.
2021.11.17 19:34:28.005903 [ 9423 ] {} DNSResolver: Updating DNS cache
2021.11.17 19:34:28.006000 [ 9423 ] {} DNSResolver: Updated DNS cache
2021.11.17 19:34:31.709409 [ 9327 ] {} Application: Closed connections. But 1 remain. Tip: To increase wait time add to config: <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
2021.11.17 19:34:31.709480 [ 9327 ] {} Application: Will shutdown forcefully.
2021.11.17 19:34:31.862414 [ 9319 ] {} Application: Child process exited normally with code 0.
2021.11.17 19:34:47.479451 [ 18019 ] {} Application: Will watch for the process with pid 18033
2021.11.17 19:34:47.479704 [ 18033 ] {} Application: Forked a child process to watch
2021.11.17 19:34:47.480229 [ 18033 ] {} SentryWriter: Sending crash reports is disabled
2021.11.17 19:34:47.480374 [ 18033 ] {} Pipe: Pipe capacity is 1.00 MiB
2021.11.17 19:34:47.544145 [ 18033 ] {} : Starting ClickHouse 21.10.2.15 with revision 54455, build id: 6699B86599A2121E78E0D42DD67791ABD9AE5265, PID 18033
2021.11.17 19:34:47.544263 [ 18033 ] {} Application: starting up
2021.11.17 19:34:47.544289 [ 18033 ] {} Application: OS name: Linux, version: 4.15.0-162-generic, architecture: x86_64
2021.11.17 19:34:47.664035 [ 18033 ] {} Application: Calculated checksum of the binary: 902F2F159F07FEEC02834AEF44CBFB70, integrity check passed.
2021.11.17 19:34:47.664136 [ 18033 ] {} StatusFile: Status file /srv/clickhouse/server/status already exists - unclean restart. Contents:
PID: 9327
Started at: 2021-11-17 18:49:10
Revision: 54455

@deepdivenow
Author

The ZooKeeper is_active flag on ReplicatedMergeTree tables is not released either.

@nvartolomei
Contributor

nvartolomei commented Nov 17, 2021

This behaviour is known and by design.

2021.11.17 16:38:36.714041 [ 16651 ] {} Application: Closed connections. But 1 remain. Tip: To increase wait time add to config: <shutdown_wait_unfinished>60</shutdown_wait_unfinished>

This is the key line.

https://github.com/ClickHouse/ClickHouse/blob/master/programs/server/Server.cpp#L1586-L1597
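
For reference, a minimal sketch of raising that wait time through a config.d override; it assumes the stock /etc/clickhouse-server/config.d layout (the /srv/clickhouse paths in the logs above suggest a custom layout, so adjust the path accordingly):

# Hedged sketch: let the server wait up to 60 s for open connections before
# shutting down forcefully. Older releases use <yandex> as the root element
# instead of <clickhouse>.
cat > /etc/clickhouse-server/config.d/shutdown_wait.xml <<'EOF'
<clickhouse>
    <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
</clickhouse>
EOF
systemctl restart clickhouse-server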

@alexey-milovidov
Member

@nvartolomei It still has to gracefully close ZooKeeper sessions.

@alexey-milovidov added the operations and minor (Priority: minor) labels and removed the potential bug (To be reviewed by developers and confirmed/rejected) label on Nov 18, 2021
@deepdivenow
Author

@nvartolomei Is it by design that the status file is not deleted on shutdown,
and that is_active is not removed in ZooKeeper for ReplicatedMergeTree tables?

Force-closing client connections is not the concern here.

@MalKeshar

Same story. ClickHouse always reports the error "Replica appears to be already active" on startup, and the table switches back to read-write mode only after a few attempts and about 30 seconds:

13:25:33.054371 [ 19778 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:25:33.054563 [ 19868 ] {} <Debug> StorageDistributed (graphite_index): Removing /data/graphite/clickhouse/data/carbon_metrics/graphite_index/shard1_replica2 (used for async INSERT into Distributed)

13:25:43.060912 [ 19727 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:25:43.066739 [ 19727 ] {} <Error> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Couldn't start replication (table will be in readonly mode): Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually. Code: 224, e.displayText() = DB::Exception: Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually, Stack trace (when copying this message, always include the lines below):

13:25:53.067021 [ 19789 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:25:53.073927 [ 19789 ] {} <Error> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Couldn't start replication (table will be in readonly mode): Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually. Code: 224, e.displayText() = DB::Exception: Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually, Stack trace (when copying this message, always include the lines below):

13:26:03.074326 [ 19802 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:26:03.080302 [ 19802 ] {} <Error> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Couldn't start replication (table will be in readonly mode): Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually. Code: 224, e.displayText() = DB::Exception: Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually, Stack trace (when copying this message, always include the lines below):

13:26:13.080689 [ 19810 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:26:13.085287 [ 19810 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeQueue): Loading queue from /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/queue

debian 9, clickhouse 21.8.14.5, zookeeper 3.4.13.
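
If the replica really is down, a minimal sketch of the manual cleanup the error message suggests, assuming ZooKeeper's bundled zkCli.sh; the host below is a placeholder and the znode path is the one from the log:

# Hedged sketch: inspect and, only if the owning session is truly gone,
# delete the stale is_active ephemeral znode named in the error above.
ZK=zookeeper1:2181   # placeholder host:port, replace with your ensemble
ZNODE=/clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active
zkCli.sh -server "$ZK" stat "$ZNODE"     # does the ephemeral node still exist?
zkCli.sh -server "$ZK" delete "$ZNODE"   # remove it so the replica can re-activate

Note that in the log above the node expired on its own once the old session timed out (the 13:26:13 attempt succeeded), so simply waiting, as the error message also suggests, works as well.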

@PMazarovich

In AWS ECS, when mounting /var/lib/clickhouse to EFS, I see the following after restarting the container:
<Error> Application: DB::Exception: Cannot lock file /var/lib/clickhouse/status. Another server instance in same directory is already running.

and

<Information> StatusFile: Status file /var/lib/clickhouse/status already exists - unclean restart.
PID: 1
Cannot lock file /var/lib/clickhouse/status. Another server instance in same directory is already running.

Does anyone know how to deal with this?
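
A minimal diagnostic sketch before touching anything, assuming you can exec into the container and that no second server instance is actually using the same EFS directory:

# Hedged sketch: see what the previous instance left behind.
cat /var/lib/clickhouse/status          # PID recorded by the previous instance ("PID: 1" in the log above)
pgrep -a clickhouse-server              # is another server process alive in this container?
# Only if nothing else is using this directory, removing the stale status
# file lets the new instance take the lock:
# rm /var/lib/clickhouse/status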
