Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix running of Q28 by not removing important backslashes from regex #252

Merged
merged 1 commit into from
Nov 15, 2024

Conversation

JelteF
Copy link
Contributor

@JelteF JelteF commented Nov 15, 2024

Q28 contains backslashes, specifically the backslash in \1 is very important for the meaning of the query. However, both backslashes in this query are accidentally removed in almost all benchmark scripts because they used read instead of read -r.

This adds the -r flag to all the scripts that didn't have it yet, to make sure that read interprets it as a raw \ character instead of an escape sequence.

With this change the REGEXP_REPLACE call doesn't always return the literal string "1", but actually (as intended by the query) the referrer domain. This change generally makes the query significantly slower to execute, because now there are many more groups. I'm not sure how this should be listed in the results UI. It might make sense to ignore the query or something until all/most benchmarks have been rerun.

This problem was found independently in #47 and #72, but fixed in a different way. I'm not sure if the fix that was made for starrocks in #72 is actually correct though, because it seems to fix the issue twice: It replaces \1 with \\1, but it also added the -r flag to read. The same is the case for the doris benchmark.

Apart from those PRs there were already a few benchmarks that were using the -r flag:

❯ rg 'while read -r'
elasticsearch/run.sh
5:cat 'queries.sql' | while read -r QUERY; do

starrocks/run.sh
8:cat queries.sql | while read -r query; do

doris/run.sh
8:while read -r query; do

crunchy-bridge-for-analytics/run.sh
22:    while read -r query; do

@CLAassistant
Copy link

CLAassistant commented Nov 15, 2024

CLA assistant check
All committers have signed the CLA.

Q28 contains backslashes, specifically the backslash in `\1` is very
important for the meaning of the query. However, both backslashes in
this query are accidentally removed in almost all benchmark scripts
because they used `read` instead of `read -r`.

This adds the `-r` flag to all the scripts that didn't have it yet, to
make sure that `read` interprets it as a raw `\` character instead of an
escape sequence.

With this change the REGEXP_REPLACE call doesn't always return the
literal string "1", but actually (as intended by the query) the referrer
domain. This change generally makes the query significantly slower to
execute, because now there are many more groups. I'm not sure how this
should be listed in the results UI. It might make sense to ignore the
query or something.

This problem was found independently in ClickHouse#47 and ClickHouse#72, but fixed in a
different way. I'm not sure if the fix that was made for starrocks in ClickHouse#72
is actually correct though, because it seems to fix the issue twice: It
replaces `\1` with `\\1`, but it also added the `-r` flag to `read`.

Apart from those issues there were already a few benchmarks that were
using the `-r` flag:
```
❯ rg 'while read -r'
elasticsearch/run.sh
5:cat 'queries.sql' | while read -r QUERY; do

starrocks/run.sh
8:cat queries.sql | while read -r query; do

doris/run.sh
8:while read -r query; do

crunchy-bridge-for-analytics/run.sh
22:    while read -r query; do
```
@rschu1ze
Copy link
Member

@JelteF That is a great find.

Regarding Doris and Starrocks: Let me ping the contributors who last touched Q28 for these two databases. It is indeed weird that they seemingly fixed the issue redundantly.

Now, all reads run with -r. There is still a mixed usage of \1 vs. \\1: All but Pandas, Doris, Starrocks, Polars use the former syntax. This may or may not be a mistake - in some cases, databases require escaping \. Hard to verify without re-running measurements for these systems.

Regarding re-running benchmarks: that is too much work (at least for me) and as measurements will be updated over time, the effect of the bug on Q28 should diminish. Also, just one out of 43 queries is affected so aggregated results should not be too much off. Of course, if someone likes to re-run benchmarks, please go ahead.

@rschu1ze rschu1ze merged commit 39b20b5 into ClickHouse:main Nov 15, 2024
@alexey-milovidov
Copy link
Member

alexey-milovidov commented Nov 15, 2024

MySQL supports C-style escaping, and the databases implementing the MySQL query dialect, such as MariaDB, Infobright, SingleStore, Doris/Starrocks, are expected to parse this escaping. However, it can be modified by sql_mode: https://dev.mysql.com/doc/refman/8.4/en/sql-mode.html#sqlmode_no_backslash_escapes

ClickHouse supports C-style escaping, but does it in an intelligent way, so when there is \c where c does not belong to the list of known escaping characters, it parses it as \c instead of just c. This makes single-backslash escapes work for LIKE (\_, \%) and regular expressions (\d, \w, \s, ...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants