Set force_restore_data in stress tests #28296

Closed
azat wants to merge 1 commit from stress-force_restore_data


@azat azat commented Aug 28, 2021

CI reports:

1

    $ pigz -cd clickhouse-server.log.gz | fgrep 'Application: DB::Exception:' -m1
2021.08.25 11:34:10.208309 [ 9275 ] {} <Error> Application: DB::Exception: The local set of parts of table test_96v1zj.dst_10 doesn't look like the set of parts in ZooKeeper: 4.00 rows of 4.00 total rows in filesystem are suspicious. There are 1 unexpected parts with 4 rows (0 of them is not just-written with 0 rows), 0 missing parts (with 0 blocks).: Cannot attach table `test_96v1zj`.`dst_10` from metadata file /var/lib/clickhouse/metadata/test_96v1zj/dst_10.sql from query ATTACH TABLE test_96v1zj.dst_10 (`p` UInt64, `k` UInt64, `v` UInt64) ENGINE = ReplicatedMergeTree('/test/01154_move_partition_long_test_96v1zj/dst', '10') PARTITION BY p % 10 ORDER BY k SETTINGS index_granularity = 8192: while loading database `test_96v1zj` from path /var/lib/clickhouse/metadata/test_96v1zj

2

2021.08.12 03:26:24.345469 [ 11280 ] {} <Error> Application: DB::Exception: The local set of parts of table test_1.ttl_table2 doesn't look like the set of parts in ZooKeeper: 15.00 rows of 15.00 total rows in filesystem are suspicious. There are 4 unexpected parts with 15 rows (1 of them is not just-written with 0 rows), 0 missing parts (with 0 blocks).: Cannot attach table `test_1`.`ttl_table2` from metadata file /var/lib/clickhouse/metadata/test_1/ttl_table2.sql from query ATTACH TABLE test_1.ttl_table2 (`key` DateTime) ENGINE = ReplicatedMergeTree('/test/01921_concurrent_ttl_and_normal_merges/01921_concurrent_ttl_and_normal_merges_zookeeper_long_test_1/ttl_table', '2') ORDER BY tuple() TTL key + toIntervalSecond(1) SETTINGS merge_with_ttl_timeout = 1, max_replicated_merges_with_ttl_in_queue = 100, max_number_of_merges_with_ttl_in_pool = 100, cleanup_delay_period = 1, cleanup_delay_period_random_add = 0, index_granularity = 8192: while loading database `test_1` from path /var/lib/clickhouse/metadata/test_1
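
For anyone reproducing this lookup without `pigz` and `fgrep`, the same first-match search can be sketched in Python. The function name `first_matching_line` is hypothetical; the marker string and the log file name come from the command above.

```python
import gzip
from typing import Optional


def first_matching_line(path: str, marker: str) -> Optional[str]:
    """Return the first line of a gzip-compressed log that contains `marker`, or None."""
    # errors="replace" keeps the scan going if the log contains invalid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            if marker in line:  # fixed-string match, like fgrep; stop at first hit, like -m1
                return line.rstrip("\n")
    return None
```

Calling `first_matching_line("clickhouse-server.log.gz", "Application: DB::Exception:")` would correspond to the shell pipeline in report 1.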

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

CI report [1]:


  [1]: https://clickhouse-test-reports.s3.yandex.net/27881/e8d87053c04e8e30bb35fa46298abb521818731f/stress_test_(undefined).html#fail1
@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 28, 2021

@alexey-milovidov alexey-milovidov left a comment


This situation should not happen in stress test.
We should fix the root cause.

@alexey-milovidov alexey-milovidov self-assigned this Aug 28, 2021
@azat azat marked this pull request as draft August 29, 2021 11:13

azat commented Sep 9, 2021

> This situation should not happen in stress test.

Indeed, but the problem here is that DROP fails in the middle: the parts/* znodes had already been removed, but the replica path had not been completely removed, so later this replica will not be considered a new replica.

Maybe the replica should set some flag in dropReplica (not under replica_path), so that a later attach can check it and attach the table in read-only mode if it is set? @alesapin what do you think?

Logs

2021.09.02 23:10:16.203598 [ 18964 ] {86789434-b69c-4367-b021-26ab33d4ef4c} <Debug> executeQuery: (from [::1]:55820) (comment: '/usr/share/clickhouse-test/queries/0_stateless/00993_system_parts_race_condition_drop_zookeeper.sh') DROP TABLE IF EXISTS alter_table_4
2021.09.02 23:10:16.308794 [ 18964 ] {86789434-b69c-4367-b021-26ab33d4ef4c} <Information> test_1.alter_table_4: Removing replica /clickhouse/tables/00993_system_parts_race_condition_drop_zookeeper_test_1/alter_table/replicas/r_4, marking it as lost
2021.09.02 23:10:31.612931 [ 18964 ] {86789434-b69c-4367-b021-26ab33d4ef4c} <Error> executeQuery: Code: 999. Coordination::Exception: Operation timeout. (KEEPER_EXCEPTION) (version 21.10.1.7982) (from [::1]:55820) (comment: '/usr/share/clickhouse-test/queries/0_stateless/00993_system_parts_race_condition_drop_zookeeper.sh') (in query: DROP TABLE IF EXISTS alter_table_4), Stack trace (when copying this message, always include the lines below):
... # so it failed during recursive removal of the replica path: it removed the parts, but did not remove the root
2021.09.02 23:25:36.963361 [ 13472 ] {b81137a0-396c-4238-85eb-e02aa16b45d5} <Trace> InterpreterSystemQuery: Restarting replica on test_1.alter_table_4
2021.09.02 23:25:39.437203 [ 1828 ] {} <Debug> test_1.alter_table_4: Loading data parts
2021.09.02 23:26:08.953889 [ 13472 ] {b81137a0-396c-4238-85eb-e02aa16b45d5} <Error> executeQuery: Code: 231. DB::Exception: The local set of parts of table test_1.alter_table_4 doesn't look like the set of parts in ZooKeeper: 295.09 thousand rows of 295.09 thousand total rows in filesystem are suspicious. There are 57 unexpected parts with 295090 rows (9 of them is not just-written with 42556 rows), 0 missing parts (with 0 blocks). (TOO_MANY_UNEXPECTED_DATA_PARTS) (version 21.10.1.7982) (from [::1]:58192) (comment: 01646_system_restart_replicas_smoke.sql) (in query: SYSTEM RESTART REPLICAS;), Stack trace (when copying this message, always include the lines below):
... and here it fails, because there are no parts in keeper

Keeper

(CONNECTED [127.1:2182]) /> json_cat /clickhouse/tables/00993_system_parts_race_condition_drop_zookeeper_test_1/alter_table/replicas/r_4/is_lost
1
(CONNECTED [127.1:2182]) /> stat /clickhouse/tables/00993_system_parts_race_condition_drop_zookeeper_test_1/alter_table/replicas/r_4/is_lost
Stat(
  czxid=0xc3d5
  mzxid=0xed1b
  ctime=1630613343856
  mtime=1630613416310
  version=2
  cversion=0
  aversion=0
  ephemeralOwner=0x0
  dataLength=1
  numChildren=0
  pzxid=0xc3d5
)
(CONNECTED [127.1:2182]) /> ls /clickhouse/tables/00993_system_parts_race_condition_drop_zookeeper_test_1/alter_table/replicas/r_4
columns
flags
host
is_lost
log_pointer
max_processed_insert_time
metadata
metadata_version
min_unprocessed_insert_time
mutation_pointer
parts
queue
(CONNECTED [127.1:2182]) /> ls /clickhouse/tables/00993_system_parts_race_condition_drop_zookeeper_test_1/alter_table/replicas/r_4/parts

# no parts
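
The Keeper state above is consistent with the error in the logs: the replica node still exists, so the table is not treated as a new replica, while its `parts` node is empty, so every local part looks unexpected. A simplified Python model of that comparison is sketched below. It is not ClickHouse's actual check (the real one also distinguishes just-written and missing parts); the function name, the `max_suspicious_ratio` threshold and the toy part names are assumptions for illustration only.

```python
from typing import Dict, Set


def check_parts(local_rows: Dict[str, int], keeper_parts: Set[str],
                max_suspicious_ratio: float = 0.5) -> None:
    """Raise if too large a share of local rows sits in parts unknown to the keeper.

    local_rows maps a local part name to its row count; keeper_parts is the
    set of part names registered under the replica's `parts` node.
    The default threshold is an assumed value, not taken from the source.
    """
    unexpected = {name: rows for name, rows in local_rows.items()
                  if name not in keeper_parts}
    total_rows = sum(local_rows.values())
    suspicious_rows = sum(unexpected.values())
    if total_rows and suspicious_rows / total_rows > max_suspicious_ratio:
        raise RuntimeError(
            f"{suspicious_rows} rows of {total_rows} total rows in filesystem "
            f"are suspicious; {len(unexpected)} unexpected parts")


# Toy data: two local parts, none registered in the keeper (empty `parts` node).
local = {"all_0_0_0": 3, "all_1_1_0": 1}
try:
    check_parts(local, keeper_parts=set())
except RuntimeError as error:
    print(error)
```

With an empty `parts` node, all local rows count as suspicious regardless of the threshold, which matches the "295.09 thousand rows of 295.09 thousand total rows" wording in the log.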


CLAassistant commented Sep 28, 2021

CLA assistant check
All committers have signed the CLA.


azat commented Oct 2, 2021


azat commented Oct 28, 2021

2021.09.02 23:26:08.953889 [ 13472 ] {b81137a0-396c-4238-85eb-e02aa16b45d5} <Error> executeQuery: Code: 231. DB::Exception: The local set of parts of table test_1.alter_table_4 doesn't look like the set of parts in ZooKeeper: 295.09 thousand rows of 295.09 thousand total rows in filesystem are suspicious. There are 57 unexpected parts with 295090 rows (9 of them is not just-written with 42556 rows), 0 missing parts (with 0 blocks). (TOO_MANY_UNEXPECTED_DATA_PARTS) (version 21.10.1.7982) (from [::1]:58192) (comment: 01646_system_restart_replicas_smoke.sql) (in query: SYSTEM RESTART REPLICAS;), Stack trace (when copying this message, always include the lines below):

Fixed such issues for ReplicatedMergeTree engine - #30826

azat added a commit to azat/ClickHouse that referenced this pull request Nov 1, 2021
…f parts in ZooKeeper" error

If an error occurs while removing replica_path from ZooKeeper (for
example, ZooKeeper goes away), not everything may be removed from ZooKeeper.

Then on DETACH/ATTACH (or on server restart, as the stress tests do in
the analysis from this comment [1]), it will trigger an error:

    The local set of parts of table test_1.alter_table_4 doesn't look like the set of parts in ZooKeeper:

  [1]: ClickHouse#28296 (comment)

Fix this by removing "metadata" first and only then everything else.
This avoids the error, since on ATTACH such a table
will be marked as read-only.

v2: forgot to remove remote_replica_path itself
v3: fix test_drop_replica by adding a check for remote_replica_path existence
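
A minimal Python sketch of why the removal order matters, under stated assumptions: the keeper is modeled as a plain set of paths, `FlakyKeeper`, `drop_replica` and `attach` are hypothetical names, and the failure is injected after a fixed number of removals. It illustrates the idea in the commit message, not the actual C++ change.

```python
class FlakyKeeper:
    """Toy stand-in for a keeper: a set of paths that fails after N removals."""

    def __init__(self, paths, fail_after):
        self.paths = set(paths)
        self.removals_left = fail_after

    def remove(self, path):
        if self.removals_left == 0:
            raise ConnectionError("Operation timeout")
        self.removals_left -= 1
        self.paths.discard(path)


def drop_replica(keeper, replica_path, children):
    """Remove `metadata` first, then the other children, then the replica path itself."""
    ordered = ["metadata"] + [c for c in children if c != "metadata"]
    for child in ordered:
        keeper.remove(f"{replica_path}/{child}")
    keeper.remove(replica_path)


def attach(keeper, replica_path):
    """Decide how a table would come up, based on what is left in the keeper."""
    if replica_path not in keeper.paths:
        return "new replica"
    if f"{replica_path}/metadata" not in keeper.paths:
        # Half-dropped replica: marked read-only (assumed here to bypass the parts check).
        return "read-only"
    return "check parts"


replica = "/replicas/r_4"
children = ["parts", "queue", "metadata", "is_lost"]
keeper = FlakyKeeper([replica] + [f"{replica}/{c}" for c in children], fail_after=2)
try:
    drop_replica(keeper, replica, children)
except ConnectionError:
    pass  # DROP failed in the middle: `metadata` and `parts` are gone, the rest remains
state = attach(keeper, replica)
print(state)
```

Because `metadata` is the first node to go, any interruption after that point leaves a replica that can be recognized as half-dropped, instead of one whose empty `parts` node makes every local part look unexpected.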
@azat azat deleted the stress-force_restore_data branch November 2, 2021 07:34