Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ThreadSanitizer data race in librdkafka (test_storage_kafka) #56043

Closed
kitaisreal opened this issue Oct 26, 2023 · 14 comments
Closed

ThreadSanitizer data race in librdkafka (test_storage_kafka) #56043

kitaisreal opened this issue Oct 26, 2023 · 14 comments
Assignees
Labels
testing Special issue with list of bugs found by CI

Comments

@kitaisreal
Copy link
Collaborator

https://s3.amazonaws.com/clickhouse-test-reports/55775/78977f7d5cba4084235d4beddebba8917c7bf30b/integration_tests__tsan__[6_6]/integration_run_parallel4_0.log

WARNING: ThreadSanitizer: data race (pid=1)
  Atomic write of size 4 at 0x7b78000f4bb0 by thread T377 (mutexes: read M0, write M1):
    #0 rd_atomic32_get build_docker/./contrib/librdkafka/src/rdatomic.h:108:9 (clickhouse+0x1f8b1a93) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #1 rd_kafka_stats_emit_all build_docker/./contrib/librdkafka/src/rdkafka.c:1654:3 (clickhouse+0x1f8b1a93)
    #2 rd_kafka_stats_emit_tmr_cb build_docker/./contrib/librdkafka/src/rdkafka.c:1898:2 (clickhouse+0x1f8b1a93)
    #3 rd_kafka_timers_run build_docker/./contrib/librdkafka/src/rdkafka_timer.c:288:4 (clickhouse+0x1f9a29aa) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #4 rd_kafka_thread_main build_docker/./contrib/librdkafka/src/rdkafka.c:2021:3 (clickhouse+0x1f8a8b29) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #5 _thrd_wrapper_function build_docker/./contrib/librdkafka/src/tinycthread.c:576:9 (clickhouse+0x1f9c611b) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)

  Previous write of size 4 at 0x7b78000f4bb0 by thread T376:
    #0 rd_atomic32_init build_docker/./contrib/librdkafka/src/rdatomic.h:49:10 (clickhouse+0x1f8a3173) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #1 rd_kafka_bufq_init build_docker/./contrib/librdkafka/src/rdkafka_buf.c:225:2 (clickhouse+0x1f8a3173)
    #2 rd_kafka_bufq_concat build_docker/./contrib/librdkafka/src/rdkafka_buf.c:236:2 (clickhouse+0x1f8a3173)
    #3 rd_kafka_broker_fail build_docker/./contrib/librdkafka/src/rdkafka_broker.c:572:2 (clickhouse+0x1f87a82f) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #4 rd_kafka_broker_op_serve build_docker/./contrib/librdkafka/src/rdkafka_broker.c:3317:33 (clickhouse+0x1f88bc18) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #5 rd_kafka_broker_ops_serve build_docker/./contrib/librdkafka/src/rdkafka_broker.c:3351:24 (clickhouse+0x1f88bc18)
    #6 rd_kafka_broker_ops_io_serve build_docker/./contrib/librdkafka/src/rdkafka_broker.c:3401:9 (clickhouse+0x1f88e323) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #7 rd_kafka_broker_consumer_serve build_docker/./contrib/librdkafka/src/rdkafka_broker.c:4975:17 (clickhouse+0x1f88a0be) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #8 rd_kafka_broker_serve build_docker/./contrib/librdkafka/src/rdkafka_broker.c:5080:17 (clickhouse+0x1f88a0be)
    #9 rd_kafka_broker_thread_main build_docker/./contrib/librdkafka/src/rdkafka_broker.c:5237:25 (clickhouse+0x1f8834b9) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)
    #10 _thrd_wrapper_function build_docker/./contrib/librdkafka/src/tinycthread.c:576:9 (clickhouse+0x1f9c611b) (BuildId: 662ff0bfd164437162d27a08d006d6ba54fefcf7)

SUMMARY: ThreadSanitizer: data race build_docker/./contrib/librdkafka/src/rdatomic.h:108:9 in rd_atomic32_get
@kitaisreal kitaisreal added the testing Special issue with list of bugs found by CI label Oct 26, 2023
@tavplubix tavplubix changed the title ThreadSanitizer data race in librdkafka ThreadSanitizer data race in librdkafka (test_storage_kafka) Nov 2, 2023
@KochetovNicolai
Copy link
Member

@ilejn
Copy link
Contributor

ilejn commented Nov 29, 2023

The issue introduced in #50999 where we started to use librdkafka internal statistics.
Race condition inside librdkafka, probably cannot be fixed at ClickHouse level.
One thread gets counter in statistic callback, another thread sets this counter to zero because of rebalance.

@antaljanosbenjamin
Copy link
Member

Thanks for figuring this out @ilejn! Unfortunately I couldn't manage to really deep dive into this issue because of my other tasks. As I see Alexey has already opened an issue for this in librdkafka: confluentinc/librdkafka#4522. I will post your findings there also, it might help the librdkafka developers to fix the issue.

@ilejn
Copy link
Contributor

ilejn commented Nov 30, 2023

Thanks for figuring this out @ilejn! Unfortunately I couldn't manage to really deep dive into this issue because of my other tasks. As I see Alexey has already opened an issue for this in librdkafka: confluentinc/librdkafka#4522. I will post your findings there also, it might help the librdkafka developers to fix the issue.

Actually I feel that I am responsible for this because it was introduced by my PR.
The problem is I cannot yet reproduce the sanitizer' complain locally.

@alexey-milovidov
Copy link
Member

Let's revert the system.kafka_consumers table.

@antaljanosbenjamin
Copy link
Member

Based on this comment I think this is easy to patch.

@ilejn
Copy link
Contributor

ilejn commented Dec 7, 2023

I've performed an experiment https://pastila.nl/?0011c723/8b2d1af82b2cf6841f1760545970cddd#P5994GOYJnexrHbWQ8oaZw==
to prove that switching from rd_atomic32_init to rd_atomic32_set solves the issue.
Plan to make PR to our local fork of librdkafka by December 13.
Hope it is not too late.

@ilejn
Copy link
Contributor

ilejn commented Dec 12, 2023

Created PR in our librdkafka fork ClickHouse/librdkafka#10
that supposedly fixes the issue.

@antaljanosbenjamin
Copy link
Member

Thanks for the PR! I merged it into master. Are you planning to create a PR to ClickHouse to update the submodule or should I do it?

@ilejn
Copy link
Contributor

ilejn commented Dec 12, 2023

Thanks for the PR! I merged it into master. Are you planning to create a PR to ClickHouse to update the submodule or should I do it?

Thanks.
Will do ["a PR to ClickHouse to update the submodule"] until EOD and let you know.

@antaljanosbenjamin
Copy link
Member

This issue is fixed by #57791

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Special issue with list of bugs found by CI
Projects
None yet
Development

No branches or pull requests

7 participants