
Flaky test test_backup_restore_on_cluster/test_concurrency.py #42719

Closed · qoega opened this issue Oct 26, 2022 · 10 comments

Labels: testing (Special issue with list of bugs found by CI)

Comments

qoega (Member) commented Oct 26, 2022

In the runner I run the test several times like this and get an error:

pytest --count 100 -x test_backup_restore_on_cluster/test_concurrency.py -k test_concurrent_backups_on_same_node


test_backup_restore_on_cluster/test_concurrency.py::test_concurrent_backups_on_same_node[35-100] PASSED                                                             [ 16%]
test_backup_restore_on_cluster/test_concurrency.py::test_concurrent_backups_on_same_node[36-100] FAILED                                                             [ 17%]

================================================================================ FAILURES =================================================================================
______________________________________________________________ test_concurrent_backups_on_same_node[36-100] _______________________________________________________________

    def test_concurrent_backups_on_same_node():
        create_and_fill_table()

        backup_names = [new_backup_name() for _ in range(num_concurrent_backups)]

        ids = []
        for backup_name in backup_names:
            id = node0.query(
                f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {backup_name} ASYNC"
            ).split("\t")[0]
            ids.append(id)

        ids_list = "[" + ", ".join([f"'{id}'" for id in ids]) + "]"

>       assert_eq_with_retry(
            node0,
            f"SELECT status FROM system.backups WHERE status == 'CREATING_BACKUP' AND id IN {ids_list}",
            "",
        )

test_backup_restore_on_cluster/test_concurrency.py:122:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

instance = <helpers.cluster.ClickHouseInstance object at 0x7f17c07d03a0>
query = "SELECT status FROM system.backups WHERE status == 'CREATING_BACKUP' AND id IN ['7a95bd38-4241-4134-acbe-714ef4517c4f', 'c963cfe2-a95c-4885-a1a3-eafd79539e36', '53213f9a-58eb-4bf5-bde1-50919b5e7ad5', '601ec446-cfe9-43ba-adec-b3f99a14c8cf']"
expectation = '', retry_count = 20, sleep_time = 0.5, stdin = None, timeout = None, settings = None, user = None, ignore_error = False
get_result = <function <lambda> at 0x7f17c12e09d0>

    def assert_eq_with_retry(
        instance,
        query,
        expectation,
        retry_count=20,
        sleep_time=0.5,
        stdin=None,
        timeout=None,
        settings=None,
        user=None,
        ignore_error=False,
        get_result=lambda x: x,
    ):
        expectation_tsv = TSV(expectation)
        for i in range(retry_count):
            try:
                if (
                    TSV(
                        get_result(
                            instance.query(
                                query,
                                user=user,
                                stdin=stdin,
                                timeout=timeout,
                                settings=settings,
                                ignore_error=ignore_error,
                            )
                        )
                    )
                    == expectation_tsv
                ):
                    break
                time.sleep(sleep_time)
            except Exception as ex:
                logging.exception(f"assert_eq_with_retry retry {i+1} exception {ex}")
                time.sleep(sleep_time)
        else:
            val = TSV(
                get_result(
                    instance.query(
                        query,
                        user=user,
                        stdin=stdin,
                        timeout=timeout,
                        settings=settings,
                        ignore_error=ignore_error,
                    )
                )
            )
            if expectation_tsv != val:
>               raise AssertionError(
                    "'{}' != '{}'\n{}".format(
                        expectation_tsv,
                        val,
                        "\n".join(expectation_tsv.diff(val, n1="expectation", n2="query")),
                    )
                )
E               AssertionError: '' != 'CREATING_BACKUP
E               CREATING_BACKUP
E               CREATING_BACKUP
E               CREATING_BACKUP'
E               @@ -0,0 +1,4 @@
E               +CREATING_BACKUP
E               +CREATING_BACKUP
E               +CREATING_BACKUP
E               +CREATING_BACKUP

helpers/test_tools.py:114: AssertionError
qoega added the "flaky test" (flaky test found by CI) label on Oct 26, 2022
alexey-milovidov added the "testing" (Special issue with list of bugs found by CI) label and removed the "flaky test" label on Nov 16, 2022
qoega (Member, Author) commented Jan 5, 2023

azat (Collaborator) commented Jan 23, 2023

Looks like this is a duplicate of #45437

qoega (Member, Author) commented Jul 11, 2023

https://s3.amazonaws.com/clickhouse-test-reports/51696/5f73bec3257d8dd5c0994818e6b8d828cbd8eb8c/integration_tests__asan__[5_6].html

        if status == "BACKUP_CREATED":
            node1.query("DROP TABLE tbl ON CLUSTER 'cluster' SYNC")
            node1.query(f"RESTORE TABLE tbl ON CLUSTER 'cluster' FROM {backup_name}")
>           assert node1.query("SELECT * FROM tbl ORDER BY x") == TSV([3, 5])
E           AssertionError: assert '5\n' == 3\n5
E            +  where '5\n' = <bound method ClickHouseInstance.query of <helpers.cluster.ClickHouseInstance object at 0x7fd5bdbf59c0>>('SELECT * FROM tbl ORDER BY x')
E            +    where <bound method ClickHouseInstance.query of <helpers.cluster.ClickHouseInstance object at 0x7fd5bdbf59c0>> = <helpers.cluster.ClickHouseInstance object at 0x7fd5bdbf59c0>.query
E            +  and   3\n5 = TSV([3, 5])

test_backup_restore_on_cluster/test.py:1064: AssertionError

tavplubix (Member) commented

vitlibar (Member) commented Dec 1, 2023

https://s3.amazonaws.com/clickhouse-test-reports/57337/67f0a0e3e44649c8ce46ec31e1fda329b9c132e6/integration_tests__release__[2_4].html

There is a strange error: it looks like RWLocks are not working correctly, because after all read locks are released a write lock for the table mydb.tbl7 in that test still can't be acquired. I'll try to investigate further. It doesn't seem related to backups at all.

Algunenano (Member) commented

There is a strange error: it looks like RWLocks are not working correctly, because after all read locks are released a write lock for the table mydb.tbl7 in that test still can't be acquired. I'll try to investigate further. It doesn't seem related to backups at all.

Sounds similar to #38864

Looking at the logs, it seems the previous call to drop the table (rather than drop the database) failed due to a timeout because there was a concurrent backup running:

2023.11.29 05:34:30.449234 [ 676 ] {d570a475-25a6-4824-84fb-f4ecc4a6241d} <Error> executeQuery: Code: 473. DB::Exception: WRITE locking attempt on "mydb.tbl7" has timed out! (10000ms) Possible deadlock avoided. Client should retry. (DEADLOCK_AVOIDED) (version 23.11.1.2068) (from 172.16.7.1:55172) (in query: DROP TABLE IF EXISTS mydb.tbl7 SYNC), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c532dd7 in /usr/bin/clickhouse
1. DB::Exception::Exception<String const&, DB::StorageID, long long>(int, FormatStringHelperImpl<std::type_identity<String const&>::type, std::type_identity<DB::StorageID>::type, std::type_identity<long long>::type>, String const&, DB::StorageID&&, long long&&) @ 0x00000000116b768d in /usr/bin/clickhouse
2. DB::IStorage::tryLockTimed(std::shared_ptr<DB::RWLockImpl> const&, DB::RWLockImpl::Type, String const&, std::chrono::duration<long long, std::ratio<1l, 1000l>> const&) const @ 0x00000000116b7435 in /usr/bin/clickhouse
3. DB::IStorage::lockExclusively(String const&, std::chrono::duration<long long, std::ratio<1l, 1000l>> const&) @ 0x00000000116b7f6c in /usr/bin/clickhouse
4. DB::InterpreterDropQuery::executeToTableImpl(std::shared_ptr<DB::Context const>, DB::ASTDropQuery&, std::shared_ptr<DB::IDatabase>&, StrongTypedef<wide::integer<128ul, unsigned int>, DB::UUIDTag>&) @ 0x0000000010e0d6a4 in /usr/bin/clickhouse
5. DB::InterpreterDropQuery::execute() @ 0x0000000010e09030 in /usr/bin/clickhouse
6. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x000000001137b862 in /usr/bin/clickhouse
7. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum) @ 0x000000001137557a in /usr/bin/clickhouse
8. DB::TCPHandler::runImpl() @ 0x0000000012262be9 in /usr/bin/clickhouse
9. DB::TCPHandler::run() @ 0x0000000012277659 in /usr/bin/clickhouse
10. Poco::Net::TCPServerConnection::start() @ 0x0000000014c9a454 in /usr/bin/clickhouse
11. Poco::Net::TCPServerDispatcher::run() @ 0x0000000014c9b5d1 in /usr/bin/clickhouse
12. Poco::PooledThread::run() @ 0x0000000014d92567 in /usr/bin/clickhouse
13. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000014d90b5c in /usr/bin/clickhouse
14. ? @ 0x00007facc33f9ac3 in ?
15. ? @ 0x00007facc348ba40 in ?

I'm not confident that the call to dropOwnerGroupAndPassOwnership(it_group, /* next= */ true); there is completely correct. It seems it resets the owner to the next element in readers_queue. When this second read owner finishes, it's not the last one, so it does nothing; and when the first reader finishes, it's not the main owner (rdlock_owner), so it doesn't notify anybody, and the next write lock waits "forever".

^ This is pure speculation. Reasoning about multiple threads and several layers of abstraction is complex, so I'm most likely wrong, but at least the code looks suspicious to me.

We do have another open ticket related to RWLock in #41371

So when the previous owner (which never went away) goes to unlock, it doesn't notify anybody, since it's not the real owner.
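
To make the suspected interleaving concrete, here is a rough gtest-style sketch (not an actual ClickHouse test). It assumes the RWLockImpl::create() / getLock(type, query_id, timeout) interface used by src/Common/tests/gtest_rw_lock.cpp, and that getLock() returns an empty holder on timeout; exact signatures may differ.

#include <chrono>
#include <gtest/gtest.h>
#include <Common/RWLock.h>

TEST(Common, RWLockWriteTimeoutDuringTwoReadsSketch)
{
    using namespace std::chrono_literals;
    auto lock = DB::RWLockImpl::create();

    /// Two readers acquire the lock; the first becomes the group owner (rdlock_owner).
    auto reader_1 = lock->getLock(DB::RWLockImpl::Read, "r1", 1000ms);
    auto reader_2 = lock->getLock(DB::RWLockImpl::Read, "r2", 1000ms);
    ASSERT_TRUE(reader_1 && reader_2);

    /// A writer times out while both readers are active. The suspicion is that the
    /// timeout path (dropOwnerGroupAndPassOwnership) hands ownership to reader_2.
    auto writer_1 = lock->getLock(DB::RWLockImpl::Write, "w1", 50ms);
    ASSERT_FALSE(writer_1);

    /// reader_2 unlocks first: it is now the nominal owner but not the last reader in
    /// its group, so it does nothing. reader_1 unlocks next: it is no longer
    /// rdlock_owner, so it notifies nobody.
    reader_2.reset();
    reader_1.reset();

    /// If the speculation is right, the next writer is never woken up and times out
    /// even though no read locks are held any more.
    auto writer_2 = lock->getLock(DB::RWLockImpl::Write, "w2", 1000ms);
    ASSERT_TRUE(writer_2);
}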

vitlibar (Member) commented Dec 3, 2023

It turned out #38864 was not completely correct, so I've made another fix: #57454

I'm not confident that the call to dropOwnerGroupAndPassOwnership(it_group, /* next= */ true); there is completely correct.

You're right, it wasn't correct. There were two problems with dropOwnerGroupAndPassOwnership():

  • RWLock didn't expect rdlock_owner to be anything other than readers_queue.begin() or readers_queue.end(), and there was even a consistency check for that in RWLock::unlock(). Because this invariant was broken, RWLock::unlock() didn't actually release read locks after a write-lock failure; it just returned without doing anything.
  • It could suddenly switch the current phase to read or write while there were still active locks from the opposite phase; for example, a write-lock failure can be caused by another write being active, and it's definitely wrong to switch the current phase to read in that case.

I added two tests, Common.RWLockWriteLockTimeoutDuringTwoReads and Common.RWLockWriteLockTimeoutDuringWriteWithWaitingRead, in my PR #57454 to cover both cases.
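
For illustration only, a rough gtest-style sketch of the second case: a write lock times out while another write is still active and a read is waiting. The RWLockImpl interface is assumed as in the sketch above; these are not the tests added in #57454, which are the authoritative ones. The assertions describe the expected (fixed) behaviour.

#include <atomic>
#include <chrono>
#include <thread>
#include <gtest/gtest.h>
#include <Common/RWLock.h>

TEST(Common, RWLockWriteTimeoutDuringWriteWithWaitingReadSketch)
{
    using namespace std::chrono_literals;
    auto lock = DB::RWLockImpl::create();
    std::atomic<bool> reader_acquired{false};

    /// The first writer holds the lock.
    auto writer_1 = lock->getLock(DB::RWLockImpl::Write, "w1", 1000ms);
    ASSERT_TRUE(writer_1);

    /// A reader starts waiting in another thread (give it a moment to get queued).
    std::thread reader([&]
    {
        auto read_holder = lock->getLock(DB::RWLockImpl::Read, "r1", 10000ms);
        EXPECT_TRUE(read_holder);
        reader_acquired = true;
    });
    std::this_thread::sleep_for(100ms);

    /// A second writer times out while writer_1 is still active. Handling this timeout
    /// must not switch the current phase to "read", because a write is still active.
    auto writer_2 = lock->getLock(DB::RWLockImpl::Write, "w2", 50ms);
    ASSERT_FALSE(writer_2);

    /// The waiting reader must not have been let in while writer_1 holds the lock.
    std::this_thread::sleep_for(100ms);
    ASSERT_FALSE(reader_acquired.load());

    /// Once the writer releases, the reader proceeds normally.
    writer_1.reset();
    reader.join();
    ASSERT_TRUE(reader_acquired.load());
}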
