
Flaky test test_backup_restore_on_cluster/test_concurrency.py #42719

Closed · qoega opened this issue Oct 26, 2022 · 10 comments

Labels: testing (Special issue with list of bugs found by CI)

Comments

qoega (Member) commented Oct 26, 2022

In the runner I run the test several times like this and get an error:

pytest --count 100 -x test_backup_restore_on_cluster/test_concurrency.py -k test_concurrent_backups_on_same_node


test_backup_restore_on_cluster/test_concurrency.py::test_concurrent_backups_on_same_node[35-100] PASSED                                                             [ 16%]
test_backup_restore_on_cluster/test_concurrency.py::test_concurrent_backups_on_same_node[36-100] FAILED                                                             [ 17%]

================================================================================ FAILURES =================================================================================
______________________________________________________________ test_concurrent_backups_on_same_node[36-100] _______________________________________________________________

    def test_concurrent_backups_on_same_node():
        create_and_fill_table()

        backup_names = [new_backup_name() for _ in range(num_concurrent_backups)]

        ids = []
        for backup_name in backup_names:
            id = node0.query(
                f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {backup_name} ASYNC"
            ).split("\t")[0]
            ids.append(id)

        ids_list = "[" + ", ".join([f"'{id}'" for id in ids]) + "]"

>       assert_eq_with_retry(
            node0,
            f"SELECT status FROM system.backups WHERE status == 'CREATING_BACKUP' AND id IN {ids_list}",
            "",
        )

test_backup_restore_on_cluster/test_concurrency.py:122:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

instance = <helpers.cluster.ClickHouseInstance object at 0x7f17c07d03a0>
query = "SELECT status FROM system.backups WHERE status == 'CREATING_BACKUP' AND id IN ['7a95bd38-4241-4134-acbe-714ef4517c4f', 'c963cfe2-a95c-4885-a1a3-eafd79539e36', '53213f9a-58eb-4bf5-bde1-50919b5e7ad5', '601ec446-cfe9-43ba-adec-b3f99a14c8cf']"
expectation = '', retry_count = 20, sleep_time = 0.5, stdin = None, timeout = None, settings = None, user = None, ignore_error = False
get_result = <function <lambda> at 0x7f17c12e09d0>

    def assert_eq_with_retry(
        instance,
        query,
        expectation,
        retry_count=20,
        sleep_time=0.5,
        stdin=None,
        timeout=None,
        settings=None,
        user=None,
        ignore_error=False,
        get_result=lambda x: x,
    ):
        expectation_tsv = TSV(expectation)
        for i in range(retry_count):
            try:
                if (
                    TSV(
                        get_result(
                            instance.query(
                                query,
                                user=user,
                                stdin=stdin,
                                timeout=timeout,
                                settings=settings,
                                ignore_error=ignore_error,
                            )
                        )
                    )
                    == expectation_tsv
                ):
                    break
                time.sleep(sleep_time)
            except Exception as ex:
                logging.exception(f"assert_eq_with_retry retry {i+1} exception {ex}")
                time.sleep(sleep_time)
        else:
            val = TSV(
                get_result(
                    instance.query(
                        query,
                        user=user,
                        stdin=stdin,
                        timeout=timeout,
                        settings=settings,
                        ignore_error=ignore_error,
                    )
                )
            )
            if expectation_tsv != val:
>               raise AssertionError(
                    "'{}' != '{}'\n{}".format(
                        expectation_tsv,
                        val,
                        "\n".join(expectation_tsv.diff(val, n1="expectation", n2="query")),
                    )
                )
E               AssertionError: '' != 'CREATING_BACKUP
E               CREATING_BACKUP
E               CREATING_BACKUP
E               CREATING_BACKUP'
E               @@ -0,0 +1,4 @@
E               +CREATING_BACKUP
E               +CREATING_BACKUP
E               +CREATING_BACKUP
E               +CREATING_BACKUP

helpers/test_tools.py:114: AssertionError
qoega added the "flaky test" (flaky test found by CI) label on Oct 26, 2022
alexey-milovidov added the "testing" (Special issue with list of bugs found by CI) label and removed the "flaky test" label on Nov 16, 2022
qoega (Member, Author) commented Jan 5, 2023

azat (Collaborator) commented Jan 23, 2023

Looks like this is a duplicate of #45437

qoega (Member, Author) commented Jul 11, 2023

https://s3.amazonaws.com/clickhouse-test-reports/51696/5f73bec3257d8dd5c0994818e6b8d828cbd8eb8c/integration_tests__asan__[5_6].html

        if status == "BACKUP_CREATED":
            node1.query("DROP TABLE tbl ON CLUSTER 'cluster' SYNC")
            node1.query(f"RESTORE TABLE tbl ON CLUSTER 'cluster' FROM {backup_name}")
>           assert node1.query("SELECT * FROM tbl ORDER BY x") == TSV([3, 5])
E           AssertionError: assert '5\n' == 3\n5
E            +  where '5\n' = <bound method ClickHouseInstance.query of <helpers.cluster.ClickHouseInstance object at 0x7fd5bdbf59c0>>('SELECT * FROM tbl ORDER BY x')
E            +    where <bound method ClickHouseInstance.query of <helpers.cluster.ClickHouseInstance object at 0x7fd5bdbf59c0>> = <helpers.cluster.ClickHouseInstance object at 0x7fd5bdbf59c0>.query
E            +  and   3\n5 = TSV([3, 5])

test_backup_restore_on_cluster/test.py:1064: AssertionError

tavplubix (Member) commented

vitlibar (Member) commented Dec 1, 2023

https://s3.amazonaws.com/clickhouse-test-reports/57337/67f0a0e3e44649c8ce46ec31e1fda329b9c132e6/integration_tests__release__[2_4].html

There is a strange error: it looks like RWLocks are not working correctly, because after all read locks are released a write lock for the table mydb.tbl7 in that test still can't be acquired. I'll try to investigate further. It doesn't seem related to backups at all.

Algunenano (Member) commented

There is a strange error: it looks like RWLocks are not working correctly, because after all read locks are released a write lock for the table mydb.tbl7 in that test still can't be acquired. I'll try to investigate further. It doesn't seem related to backups at all.

Sounds similar to #38864

Looking at the logs, it seems the previous call to drop the table (rather than drop the database) failed due to a timeout because there was a concurrent backup running:

2023.11.29 05:34:30.449234 [ 676 ] {d570a475-25a6-4824-84fb-f4ecc4a6241d} <Error> executeQuery: Code: 473. DB::Exception: WRITE locking attempt on "mydb.tbl7" has timed out! (10000ms) Possible deadlock avoided. Client should retry. (DEADLOCK_AVOIDED) (version 23.11.1.2068) (from 172.16.7.1:55172) (in query: DROP TABLE IF EXISTS mydb.tbl7 SYNC), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c532dd7 in /usr/bin/clickhouse
1. DB::Exception::Exception<String const&, DB::StorageID, long long>(int, FormatStringHelperImpl<std::type_identity<String const&>::type, std::type_identity<DB::StorageID>::type, std::type_identity<long long>::type>, String const&, DB::StorageID&&, long long&&) @ 0x00000000116b768d in /usr/bin/clickhouse
2. DB::IStorage::tryLockTimed(std::shared_ptr<DB::RWLockImpl> const&, DB::RWLockImpl::Type, String const&, std::chrono::duration<long long, std::ratio<1l, 1000l>> const&) const @ 0x00000000116b7435 in /usr/bin/clickhouse
3. DB::IStorage::lockExclusively(String const&, std::chrono::duration<long long, std::ratio<1l, 1000l>> const&) @ 0x00000000116b7f6c in /usr/bin/clickhouse
4. DB::InterpreterDropQuery::executeToTableImpl(std::shared_ptr<DB::Context const>, DB::ASTDropQuery&, std::shared_ptr<DB::IDatabase>&, StrongTypedef<wide::integer<128ul, unsigned int>, DB::UUIDTag>&) @ 0x0000000010e0d6a4 in /usr/bin/clickhouse
5. DB::InterpreterDropQuery::execute() @ 0x0000000010e09030 in /usr/bin/clickhouse
6. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x000000001137b862 in /usr/bin/clickhouse
7. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum) @ 0x000000001137557a in /usr/bin/clickhouse
8. DB::TCPHandler::runImpl() @ 0x0000000012262be9 in /usr/bin/clickhouse
9. DB::TCPHandler::run() @ 0x0000000012277659 in /usr/bin/clickhouse
10. Poco::Net::TCPServerConnection::start() @ 0x0000000014c9a454 in /usr/bin/clickhouse
11. Poco::Net::TCPServerDispatcher::run() @ 0x0000000014c9b5d1 in /usr/bin/clickhouse
12. Poco::PooledThread::run() @ 0x0000000014d92567 in /usr/bin/clickhouse
13. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000014d90b5c in /usr/bin/clickhouse
14. ? @ 0x00007facc33f9ac3 in ?
15. ? @ 0x00007facc348ba40 in ?

I'm not confident that the call to dropOwnerGroupAndPassOwnership(it_group, /* next= */ true); there is completely correct. It seems it resets the owner to the next element in readers_queue. When this second read owner finishes, it's not the last one, so it does nothing; and when the first reader finishes, it's not the main owner (rdlock_owner), so it doesn't notify anybody, and the next write lock waits "forever".

^ This is pure speculation. Reasoning about multiple threads and several layers of abstraction is complex, so I'm most likely wrong, but at least the code looks suspicious to me.

We do have another open ticket related to RWLock in #41371

So when the previous owner (which never went away) goes to unlock, it doesn't notify anybody, since it's not the real owner.
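
To make the suspected interleaving concrete, here is a rough gtest-style sketch (not an actual ClickHouse test). It assumes the RWLockImpl::create() / getLock(type, query_id, timeout) interface used by src/Common/tests/gtest_rw_lock.cpp, and that getLock() returns an empty holder on timeout; exact signatures may differ.

#include <chrono>
#include <gtest/gtest.h>
#include <Common/RWLock.h>

TEST(Common, RWLockWriteTimeoutDuringTwoReadsSketch)
{
    using namespace std::chrono_literals;
    auto lock = DB::RWLockImpl::create();

    /// Two readers acquire the lock; the first becomes the group owner (rdlock_owner).
    auto reader_1 = lock->getLock(DB::RWLockImpl::Read, "r1", 1000ms);
    auto reader_2 = lock->getLock(DB::RWLockImpl::Read, "r2", 1000ms);
    ASSERT_TRUE(reader_1 && reader_2);

    /// A writer times out while both readers are active. The suspicion is that the
    /// timeout path (dropOwnerGroupAndPassOwnership) hands ownership to reader_2.
    auto writer_1 = lock->getLock(DB::RWLockImpl::Write, "w1", 50ms);
    ASSERT_FALSE(writer_1);

    /// reader_2 unlocks first: it is now the nominal owner but not the last reader in
    /// its group, so it does nothing. reader_1 unlocks next: it is no longer
    /// rdlock_owner, so it notifies nobody.
    reader_2.reset();
    reader_1.reset();

    /// If the speculation is right, the next writer is never woken up and times out
    /// even though no read locks are held any more.
    auto writer_2 = lock->getLock(DB::RWLockImpl::Write, "w2", 1000ms);
    ASSERT_TRUE(writer_2);
}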

vitlibar (Member) commented Dec 3, 2023

It turned out #38864 was not completely correct, so I've made another fix: #57454

I'm not confident that the call to dropOwnerGroupAndPassOwnership(it_group, /* next= */ true); there is completely correct.

You're right, it wasn't correct. There were two problems with dropOwnerGroupAndPassOwnership():

  • RWLock didn't expect rdlock_owner to be anything other than readers_queue.begin() or readers_queue.end(), and there was even a consistency check for that in RWLock::unlock(). Because this invariant was broken, RWLock::unlock() didn't actually release read locks after a write-lock failure; it just returned without doing anything.
  • It could suddenly switch the current phase to read or write while there were still active locks from the opposite phase; for example, a write-lock failure can be caused by another write being active, and it's definitely wrong to switch the current phase to read in that case.

I added two tests, Common.RWLockWriteLockTimeoutDuringTwoReads and Common.RWLockWriteLockTimeoutDuringWriteWithWaitingRead, in my PR #57454 to cover both cases.
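
For illustration only, a rough gtest-style sketch of the second case: a write lock times out while another write is still active and a read is waiting. The RWLockImpl interface is assumed as in the sketch above; these are not the tests added in #57454, which are the authoritative ones. The assertions describe the expected (fixed) behaviour.

#include <atomic>
#include <chrono>
#include <thread>
#include <gtest/gtest.h>
#include <Common/RWLock.h>

TEST(Common, RWLockWriteTimeoutDuringWriteWithWaitingReadSketch)
{
    using namespace std::chrono_literals;
    auto lock = DB::RWLockImpl::create();
    std::atomic<bool> reader_acquired{false};

    /// The first writer holds the lock.
    auto writer_1 = lock->getLock(DB::RWLockImpl::Write, "w1", 1000ms);
    ASSERT_TRUE(writer_1);

    /// A reader starts waiting in another thread (give it a moment to get queued).
    std::thread reader([&]
    {
        auto read_holder = lock->getLock(DB::RWLockImpl::Read, "r1", 10000ms);
        EXPECT_TRUE(read_holder);
        reader_acquired = true;
    });
    std::this_thread::sleep_for(100ms);

    /// A second writer times out while writer_1 is still active. Handling this timeout
    /// must not switch the current phase to "read", because a write is still active.
    auto writer_2 = lock->getLock(DB::RWLockImpl::Write, "w2", 50ms);
    ASSERT_FALSE(writer_2);

    /// The waiting reader must not have been let in while writer_1 holds the lock.
    std::this_thread::sleep_for(100ms);
    ASSERT_FALSE(reader_acquired.load());

    /// Once the writer releases, the reader proceeds normally.
    writer_1.reset();
    reader.join();
    ASSERT_TRUE(reader_acquired.load());
}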
