Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clickhouse-test: improve left queries after the test hardening #36649

Merged

Conversation

azat
Copy link
Collaborator

@azat azat commented Apr 26, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add is_all_data_sent column into system.processes, and improve internal testing hardening check based on it.

TL;DR;

Right now it is possible to have "false-positive" for this hardening,
because there is a tiny delay (that can be quite significant on CI, when
it is under pressure) between when the server sends EndOfStream and the
server removes an entry from system.processes.

But now system.processes has is_all_data_sent column, that means that
the EndOfStream was sent, and we can use it to avoid false positive.

Here is an example of such report 1:

2022-04-25 03:47:18 00806_alter_update:                                                     [ FAIL ] 0.95 sec. - Queries left in background after the test finished:
2022-04-25 03:47:18         "elapsed": 0.100084746,
2022-04-25 03:47:18         "is_cancelled": 0,
2022-04-25 03:47:18         "query": "DROP TABLE alter_update_00806;",
2022-04-25 03:47:18         "thread_ids": [
2022-04-25 03:47:18             "8950"
2022-04-25 03:47:18         ],

2022.04.25 03:47:17.887095 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> executeQuery: (from [::1]:52012) (comment: 00806_alter_update.sql) DROP TABLE alter_update_00806;
2022.04.25 03:47:17.887493 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Trace> ContextAccess (default): Access granted: DROP TABLE ON test_7ntsjn.alter_update_00806
2022.04.25 03:47:17.887765 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Trace> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Found 2 old parts to remove.
2022.04.25 03:47:17.887947 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Removing part from filesystem 20180101_20180101_1_1_0
2022.04.25 03:47:17.888960 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Removing part from filesystem 20180102_20180102_2_2_0
2022.04.25 03:47:17.890620 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> DatabaseCatalog: Waiting for table 1bc92bca-10a7-444e-be5e-7f61f4650169 to be finally dropped
2022.04.25 03:47:17.895046 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
...
2022.04.25 03:47:17.938328 [ 4422 ] {aa01985a-78f5-4c0e-b646-8d04a4a1dc77} <Debug> executeQuery: (from [::1]:59416) (comment: 00806_alter_update.sql) DROP DATABASE test_7ntsjn
2022.04.25 03:47:17.938667 [ 4422 ] {aa01985a-78f5-4c0e-b646-8d04a4a1dc77} <Trace> ContextAccess (default): Access granted: DROP DATABASE ON test_7ntsjn.*
...
2022.04.25 03:47:18.154847 [ 8950 ] {} <Debug> TCPHandler: Processed in 0.269358257 sec.
2022.04.25 03:47:18.154991 [ 8950 ] {} <Debug> TCPHandler: Done processing connection.
2022.04.25 03:47:18.155181 [ 8950 ] {} <Debug> TCP-Session: e1d8176a-ee62-4e0a-9855-fe9eb52e06dc Destroying unnamed session of user 94309d50-4f52-5250-31bd-74fecac179db

So as you can see here DROP TABLE was captured when elapsed was 0.1,
while TCPHandler processes it for 0.26 seconds.
Also from the same report you are seeing that DROP DATABASE was executed
before TCPHandler stopoped processing DROP TABLE

@azat azat mentioned this pull request Apr 26, 2022
@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Apr 26, 2022
@azat azat force-pushed the system.processes-is_all_data_sent branch from 82e8f2b to 8742072 Compare April 26, 2022 07:56
v2: fix SHOW PROCESSLIST (does not have process list entry)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Right now it is possible to have "false-positive" for this hardening,
because there is a tiny delay (that can be quite significant on CI, when
it is under pressure) between when the server sends EndOfStream and the
server removes an entry from system.processes.

But now system.processes has is_all_data_sent column, that means that
the EndOfStream was sent, and we can use it to avoid false positive.

Here is an example of such report [1]:

    2022-04-25 03:47:18 00806_alter_update:                                                     [ FAIL ] 0.95 sec. - Queries left in background after the test finished:
    2022-04-25 03:47:18         "elapsed": 0.100084746,
    2022-04-25 03:47:18         "is_cancelled": 0,
    2022-04-25 03:47:18         "query": "DROP TABLE alter_update_00806;",
    2022-04-25 03:47:18         "thread_ids": [
    2022-04-25 03:47:18             "8950"
    2022-04-25 03:47:18         ],

    2022.04.25 03:47:17.887095 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> executeQuery: (from [::1]:52012) (comment: 00806_alter_update.sql) DROP TABLE alter_update_00806;
    2022.04.25 03:47:17.887493 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Trace> ContextAccess (default): Access granted: DROP TABLE ON test_7ntsjn.alter_update_00806
    2022.04.25 03:47:17.887765 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Trace> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Found 2 old parts to remove.
    2022.04.25 03:47:17.887947 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Removing part from filesystem 20180101_20180101_1_1_0
    2022.04.25 03:47:17.888960 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Removing part from filesystem 20180102_20180102_2_2_0
    2022.04.25 03:47:17.890620 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> DatabaseCatalog: Waiting for table 1bc92bca-10a7-444e-be5e-7f61f4650169 to be finally dropped
    2022.04.25 03:47:17.895046 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
    ...
    2022.04.25 03:47:17.938328 [ 4422 ] {aa01985a-78f5-4c0e-b646-8d04a4a1dc77} <Debug> executeQuery: (from [::1]:59416) (comment: 00806_alter_update.sql) DROP DATABASE test_7ntsjn
    2022.04.25 03:47:17.938667 [ 4422 ] {aa01985a-78f5-4c0e-b646-8d04a4a1dc77} <Trace> ContextAccess (default): Access granted: DROP DATABASE ON test_7ntsjn.*
    ...
    2022.04.25 03:47:18.154847 [ 8950 ] {} <Debug> TCPHandler: Processed in 0.269358257 sec.
    2022.04.25 03:47:18.154991 [ 8950 ] {} <Debug> TCPHandler: Done processing connection.
    2022.04.25 03:47:18.155181 [ 8950 ] {} <Debug> TCP-Session: e1d8176a-ee62-4e0a-9855-fe9eb52e06dc Destroying unnamed session of user 94309d50-4f52-5250-31bd-74fecac179db

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/36319/a646cf76b6d4699f06aea1e8d777edb1ad6fd2c5/stateless_tests__debug__actions__[1/3]/runlog.log

So as you can see here DROP TABLE was captured when elapsed was 0.1,
while TCPHandler processes it for 0.26 seconds.
Also from the same report you are seeing that DROP DATABASE was executed
before TCPHandler stopoped processing DROP TABLE.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@azat azat force-pushed the system.processes-is_all_data_sent branch from 8742072 to 87421d8 Compare April 26, 2022 09:15
@azat
Copy link
Collaborator Author

azat commented Apr 26, 2022

Style Check (actions) — Check black failed.

@alexey-milovidov alexey-milovidov self-assigned this Apr 27, 2022
@alexey-milovidov
Copy link
Member

00157_cache_dictionary

This test requires attention.

@alexey-milovidov alexey-milovidov merged commit b8a9f14 into ClickHouse:master Apr 27, 2022
@alexey-milovidov alexey-milovidov added the testing Special issue with list of bugs found by CI label Apr 27, 2022
@azat azat deleted the system.processes-is_all_data_sent branch April 27, 2022 08:17
@tavplubix
Copy link
Member

Looks like it makes clickhouse-test incompatible with older versions of clickhouse-server:

./clickhouse-test 00001
Using queries from 'queries' directory
Using clickhouse-client as client program (expecting split build)
Connecting to ClickHouse server... OK

Running 1 stateless tests (MainProcess).

00001_select_1:                                                         [ FAIL ] - Internal query (CREATE/DROP DATABASE) failed:
HTTPError
Code: 404. Code: 47. DB::Exception: Missing columns: 'is_all_data_sent' while processing query: 'SELECT is_initial_query, user, query_id, address, port, initial_user, initial_query_id, initial_address, initial_port, interface, os_user, client_hostname, client_name, client_revision, client_version_major, client_version_minor, client_version_patch, http_method, http_user_agent, http_referer, forwarded_for, quota_key, distributed_depth, elapsed, is_cancelled, read_rows, read_bytes, total_rows_approx, written_rows, written_bytes, memory_usage, peak_memory_usage, query, thread_ids, ProfileEvents, Settings, current_database FROM system.processes WHERE (query NOT LIKE '%system.processes%') AND (NOT is_all_data_sent) AND ((Settings['log_comment']) = '00001_select_1.sql') AND (current_database = 'test_12hiy0')', required columns: 'http_user_agent' 'initial_user' 'ProfileEvents' 'address' 'query_id' 'is_cancelled' 'os_user' 'port' 'client_name' 'initial_address' 'interface' 'initial_query_id' 'is_initial_query' 'client_version_major' 'client_revision' 'read_rows' 'initial_port' 'client_version_patch' 'user' 'forwarded_for' 'client_version_minor' 'query' 'http_method' 'written_rows' 'http_referer' 'memory_usage' 'current_database' 'distributed_depth' 'read_bytes' 'quota_key' 'elapsed' 'total_rows_approx' 'peak_memory_usage' 'written_bytes' 'Settings' 'client_hostname' 'thread_ids' 'is_all_data_sent', maybe you meant: ['http_user_agent','initial_user','ProfileEvents','address','query_id','is_cancelled','os_user','port','client_name','initial_address','interface','initial_query_id','is_initial_query','client_version_major','client_revision','read_rows','initial_port','client_version_patch','user','forwarded_for','client_version_minor','query','http_method','written_rows','http_referer','memory_usage','current_database','distributed_depth','read_bytes','quota_key','elapsed','total_rows_approx','peak_memory_usage','written_bytes','Settings','client_hostname','thread_ids']. (UNKNOWN_IDENTIFIER) (version 22.5.1.1)

  File "/home/tavplubix/ch/ClickHouse/tests/./clickhouse-test", line 813, in run
    result = self.process_result_impl(proc, stdout, stderr, total_time)

  File "/home/tavplubix/ch/ClickHouse/tests/./clickhouse-test", line 681, in process_result_impl
    processlist = get_processlist_after_test(self.testcase_args)

  File "/home/tavplubix/ch/ClickHouse/tests/./clickhouse-test", line 233, in get_processlist_after_test
    return clickhouse_execute_json(args, f"""

  File "/home/tavplubix/ch/ClickHouse/tests/./clickhouse-test", line 121, in clickhouse_execute_json
    data = clickhouse_execute_http(base_args, query, timeout, settings, 'JSONEachRow')

  File "/home/tavplubix/ch/ClickHouse/tests/./clickhouse-test", line 113, in clickhouse_execute_http
    raise HTTPError(data.decode(), res.status)

Settings used in the test: --max_insert_threads=0 --group_by_two_level_threshold=100000 --group_by_two_level_threshold_bytes=50000000 --distributed_aggregation_memory_efficient=1 --fsync_metadata=0 --output_format_parallel_formatting=0 --input_format_parallel_parsing=0 --min_chunk_bytes_for_parallel_parsing=9420086 --max_read_buffer_size=842152 --prefer_localhost_replica=0 --max_block_size=67511 --max_threads=28

Database: test_12hiy0

Having 1 errors! 0 tests passed. 0 tests skipped. 1.43 s elapsed (MainProcess).
Won't run stateful tests because test data wasn't loaded.
All tests have finished.

@azat
Copy link
Collaborator Author

azat commented Apr 28, 2022

Looks like it makes clickhouse-test incompatible with older versions of clickhouse-server:

Indeed, thanks for the report, should be fixed by #36767

azat added a commit to azat/ClickHouse that referenced this pull request Apr 30, 2022
azat added a commit to azat/ClickHouse that referenced this pull request May 4, 2022
Before tests can fail if there was implicit reconnect, with queries
left, and without referenced PR, it requires manual debugging to know
that the reason was reconnect.

But the problem is, that the server does send EndOfStream but hanged
after, but before removing this query from the system.processes.
But after adding is_all_data_sent (ClickHouse#36816, ClickHouse#36767, ClickHouse#36649),
clickhouse-test can check queries left only for which server did not
sent EndOfStream/Exception.

In other words after adding is_all_data_sent, it should not be possible
to have queries left in such cases.

Reverts: 53be9c5
Reverts: ClickHouse#36587
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements testing Special issue with list of bugs found by CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants