
ClickHouse does not shut down gracefully on systemctl stop clickhouse-server, but there are no errors in the logs #31478

Open
deepdivenow opened this issue Nov 17, 2021 · 7 comments
Labels: minor (Priority: minor), operations

Comments

@deepdivenow

ClickHouse version: 21.10.2.15
Command: systemctl restart clickhouse-server

How to reproduce

systemctl restart clickhouse-server

Expected a graceful shutdown. Instead, main.log shows a forceful shutdown followed by an unclean restart:
2021.11.17 16:38:36.000118 [ 16805 ] {} AsynchronousMetrics: MemoryTracking: was 3.66 GiB, peak 3.71 GiB, will set to 3.65 GiB (RSS), difference: -3.99 MiB
2021.11.17 16:38:36.714041 [ 16651 ] {} Application: Closed connections. But 1 remain. Tip: To increase wait time add to config: <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
2021.11.17 16:38:36.714103 [ 16651 ] {} Application: Will shutdown forcefully.
2021.11.17 16:38:36.867171 [ 16640 ] {} Application: Child process exited normally with code 0.
2021.11.17 16:39:25.043803 [ 18271 ] {} Application: Will watch for the process with pid 18283
2021.11.17 16:39:25.044084 [ 18283 ] {} Application: Forked a child process to watch
2021.11.17 16:39:25.044656 [ 18283 ] {} SentryWriter: Sending crash reports is disabled
2021.11.17 16:39:25.044818 [ 18283 ] {} Pipe: Pipe capacity is 1.00 MiB
2021.11.17 16:39:25.095628 [ 18283 ] {} : Starting ClickHouse 21.10.2.15 with revision 54455, build id: 6699B86599A2121E78E0D42DD67791ABD9AE5265, PID 18283
2021.11.17 16:39:25.095741 [ 18283 ] {} Application: starting up
2021.11.17 16:39:25.095766 [ 18283 ] {} Application: OS name: Linux, version: 4.15.0-162-generic, architecture: x86_64
2021.11.17 16:39:25.215198 [ 18283 ] {} Application: Calculated checksum of the binary: 902F2F159F07FEEC02834AEF44CBFB70, integrity check passed.
2021.11.17 16:39:25.215301 [ 18283 ] {} StatusFile: Status file /srv/clickhouse/server/status already exists - unclean restart. Contents:
PID: 16651
Started at: 2021-11-17 16:31:13
Revision: 54455

Additional context:
Where can I find a description of the clickhouse-server shutdown process? Why does it report an unclean restart when there are no error messages about it?

@deepdivenow added the potential bug (To be reviewed by developers and confirmed/rejected) label on Nov 17, 2021
@deepdivenow
Author

A SYSTEM SHUTDOWN query produces the same behavior:

2021.11.17 19:34:25.995307 [ 9338 ] {45e11e5e-487f-46b1-8505-cc9be54c8bcd} executeQuery: (from 127.0.0.1:55602) SYSTEM SHUTDOWN
2021.11.17 19:34:25.995446 [ 9338 ] {45e11e5e-487f-46b1-8505-cc9be54c8bcd} ContextAccess (default): Access granted: SYSTEM SHUTDOWN ON .
2021.11.17 19:34:25.995652 [ 9338 ] {45e11e5e-487f-46b1-8505-cc9be54c8bcd} MemoryTracker: Peak memory usage (for query): 0.00 B.
2021.11.17 19:34:25.995664 [ 9331 ] {} BaseDaemon: Received signal 15
2021.11.17 19:34:25.995721 [ 9331 ] {} Application: Received termination signal (Terminated)
2021.11.17 19:34:25.995750 [ 9338 ] {} TCPHandler: Processed in 0.000740994 sec.
2021.11.17 19:34:25.995753 [ 9331 ] {} BaseDaemon: Received signal 15
2021.11.17 19:34:25.995758 [ 9327 ] {} Application: Received termination signal.
2021.11.17 19:34:25.995806 [ 9331 ] {} Application: Received termination signal (Terminated)
2021.11.17 19:34:25.995808 [ 9338 ] {} TCPHandler: Done processing connection.
2021.11.17 19:34:25.995834 [ 9327 ] {} Application: Waiting for current connections to close.
2021.11.17 19:34:26.146686 [ 9342 ] {} SystemLog (system.query_log): Flushing system log, 6 entries to flush up to offset 366
2021.11.17 19:34:26.151071 [ 9342 ] {} DiskLocal: Reserving 1.00 MiB on disk default, having unreserved 8.40 TiB.
2021.11.17 19:34:26.153919 [ 9342 ] {} system.query_log: Renaming temporary part tmp_insert_202111_46_46_0 to 202111_14280_14280_0.
2021.11.17 19:34:26.154164 [ 9361 ] {} system.query_log (MergerMutator): Selected 2 parts from 202111_12936_14275_675 to 202111_14276_14276_0
2021.11.17 19:34:26.154227 [ 9361 ] {} DiskLocal: Reserving 1.07 MiB on disk default, having unreserved 8.40 TiB.
2021.11.17 19:34:26.154299 [ 9498 ] {} system.query_log (MergerMutator): Merging 2 parts: from 202111_12936_14275_675 to 202111_14276_14276_0 into Compact
2021.11.17 19:34:26.154325 [ 9342 ] {} SystemLog (system.query_log): Flushed system log up to offset 366
2021.11.17 19:34:26.154553 [ 9498 ] {} system.query_log (MergerMutator): Selected MergeAlgorithm: Horizontal
2021.11.17 19:34:26.154754 [ 9498 ] {} MergeTreeSequentialSource: Reading 3 marks from part 202111_12936_14275_675, total 10156 rows starting from the beginning of the part
2021.11.17 19:34:26.155472 [ 9498 ] {} MergeTreeSequentialSource: Reading 2 marks from part 202111_14276_14276_0, total 8 rows starting from the beginning of the part
2021.11.17 19:34:26.189510 [ 9498 ] {} system.query_log (MergerMutator): Merge sorted 10164 rows, containing 65 columns (65 merged, 0 gathered) in 0.035218391 sec., 288599.2151089469 rows/sec., 317.17 MiB/sec.
2021.11.17 19:34:26.192586 [ 9498 ] {} system.query_log: Renaming temporary part tmp_merge_202111_12936_14276_676 to 202111_12936_14276_676.
2021.11.17 19:34:26.192693 [ 9498 ] {} system.query_log (MergerMutator): Merged 2 parts: from 202111_12936_14275_675 to 202111_14276_14276_0
2021.11.17 19:34:26.192713 [ 9498 ] {} MemoryTracker: Peak memory usage: 29.84 MiB.
2021.11.17 19:34:26.804178 [ 9327 ] {} Application: Closed all listening sockets. Waiting for 1 outstanding connections.
2021.11.17 19:34:28.005903 [ 9423 ] {} DNSResolver: Updating DNS cache
2021.11.17 19:34:28.006000 [ 9423 ] {} DNSResolver: Updated DNS cache
2021.11.17 19:34:31.709409 [ 9327 ] {} Application: Closed connections. But 1 remain. Tip: To increase wait time add to config: <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
2021.11.17 19:34:31.709480 [ 9327 ] {} Application: Will shutdown forcefully.
2021.11.17 19:34:31.862414 [ 9319 ] {} Application: Child process exited normally with code 0.
2021.11.17 19:34:47.479451 [ 18019 ] {} Application: Will watch for the process with pid 18033
2021.11.17 19:34:47.479704 [ 18033 ] {} Application: Forked a child process to watch
2021.11.17 19:34:47.480229 [ 18033 ] {} SentryWriter: Sending crash reports is disabled
2021.11.17 19:34:47.480374 [ 18033 ] {} Pipe: Pipe capacity is 1.00 MiB
2021.11.17 19:34:47.544145 [ 18033 ] {} : Starting ClickHouse 21.10.2.15 with revision 54455, build id: 6699B86599A2121E78E0D42DD67791ABD9AE5265, PID 18033
2021.11.17 19:34:47.544263 [ 18033 ] {} Application: starting up
2021.11.17 19:34:47.544289 [ 18033 ] {} Application: OS name: Linux, version: 4.15.0-162-generic, architecture: x86_64
2021.11.17 19:34:47.664035 [ 18033 ] {} Application: Calculated checksum of the binary: 902F2F159F07FEEC02834AEF44CBFB70, integrity check passed.
2021.11.17 19:34:47.664136 [ 18033 ] {} StatusFile: Status file /srv/clickhouse/server/status already exists - unclean restart. Contents:
PID: 9327
Started at: 2021-11-17 18:49:10
Revision: 54455

@deepdivenow
Author

The ZooKeeper is_active flag on ReplicatedMergeTree tables is not released either.

@nvartolomei
Contributor

nvartolomei commented Nov 17, 2021

This behaviour is known and by design.

2021.11.17 16:38:36.714041 [ 16651 ] {} Application: Closed connections. But 1 remain. Tip: To increase wait time add to config: <shutdown_wait_unfinished>60</shutdown_wait_unfinished>

This is the key line.

https://github.com/ClickHouse/ClickHouse/blob/master/programs/server/Server.cpp#L1586-L1597
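
For reference, a minimal sketch of raising that wait time through a config.d override; it assumes the stock /etc/clickhouse-server/config.d layout (the /srv/clickhouse paths in the logs above suggest a custom layout, so adjust the path accordingly):

# Hedged sketch: let the server wait up to 60 s for open connections before
# shutting down forcefully. Older releases use <yandex> as the root element
# instead of <clickhouse>.
cat > /etc/clickhouse-server/config.d/shutdown_wait.xml <<'EOF'
<clickhouse>
    <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
</clickhouse>
EOF
systemctl restart clickhouse-server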

@alexey-milovidov
Member

@nvartolomei It still has to gracefully close ZooKeeper sessions.

@alexey-milovidov added the operations and minor (Priority: minor) labels and removed the potential bug (To be reviewed by developers and confirmed/rejected) label on Nov 18, 2021
@deepdivenow
Author

@nvartolomei Is it by design that the status file is not deleted on shutdown,
and that is_active is not removed in ZooKeeper for ReplicatedMergeTree tables?

Force-closing client connections is not the concern here.

@MalKeshar

Same story. ClickHouse always reports the error "Replica appears to be already active" on startup, and the table switches back to read-write mode only after a few attempts and about 30 seconds:

13:25:33.054371 [ 19778 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:25:33.054563 [ 19868 ] {} <Debug> StorageDistributed (graphite_index): Removing /data/graphite/clickhouse/data/carbon_metrics/graphite_index/shard1_replica2 (used for async INSERT into Distributed)

13:25:43.060912 [ 19727 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:25:43.066739 [ 19727 ] {} <Error> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Couldn't start replication (table will be in readonly mode): Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually. Code: 224, e.displayText() = DB::Exception: Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually, Stack trace (when copying this message, always include the lines below):

13:25:53.067021 [ 19789 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:25:53.073927 [ 19789 ] {} <Error> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Couldn't start replication (table will be in readonly mode): Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually. Code: 224, e.displayText() = DB::Exception: Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually, Stack trace (when copying this message, always include the lines below):

13:26:03.074326 [ 19802 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:26:03.080302 [ 19802 ] {} <Error> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Couldn't start replication (table will be in readonly mode): Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually. Code: 224, e.displayText() = DB::Exception: Replica /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1 appears to be already active (host: clickhouse1.xxx.local, port: 9009, tcp_port: 9000, database: carbon_metrics, table: graphite_sharded, scheme: http, ). If you're sure it's not, try again in a minute or remove znode /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active manually, Stack trace (when copying this message, always include the lines below):

13:26:13.080689 [ 19810 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeRestartingThread): Activating replica.
13:26:13.085287 [ 19810 ] {} <Debug> carbon_metrics.graphite_sharded (ReplicatedMergeTreeQueue): Loading queue from /clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/queue

debian 9, clickhouse 21.8.14.5, zookeeper 3.4.13.
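
If the replica really is down, a minimal sketch of the manual cleanup the error message suggests, assuming ZooKeeper's bundled zkCli.sh; the host below is a placeholder and the znode path is the one from the log:

# Hedged sketch: inspect and, only if the owning session is truly gone,
# delete the stale is_active ephemeral znode named in the error above.
ZK=zookeeper1:2181   # placeholder host:port, replace with your ensemble
ZNODE=/clickhouse/tables/1/carbon_metrics.graphite/replicas/shard-replica-clickhouse1/is_active
zkCli.sh -server "$ZK" stat "$ZNODE"     # does the ephemeral node still exist?
zkCli.sh -server "$ZK" delete "$ZNODE"   # remove it so the replica can re-activate

Note that in the log above the node expired on its own once the old session timed out (the 13:26:13 attempt succeeded), so simply waiting, as the error message also suggests, works as well.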

@PMazarovich

In AWS ECS, when mounting /var/lib/clickhouse to EFS, I see the following after restarting the container:
<Error> Application: DB::Exception: Cannot lock file /var/lib/clickhouse/status. Another server instance in same directory is already running.

and

<Information> StatusFile: Status file /var/lib/clickhouse/status already exists - unclean restart.
PID: 1
Cannot lock file /var/lib/clickhouse/status. Another server instance in same directory is already running.

Does anyone know how to deal with this?
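
A minimal diagnostic sketch before touching anything, assuming you can exec into the container and that no second server instance is actually using the same EFS directory:

# Hedged sketch: see what the previous instance left behind.
cat /var/lib/clickhouse/status          # PID recorded by the previous instance ("PID: 1" in the log above)
pgrep -a clickhouse-server              # is another server process alive in this container?
# Only if nothing else is using this directory, removing the stale status
# file lets the new instance take the lock:
# rm /var/lib/clickhouse/status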
