Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

when using Engine=MySQL for table or database - can i hint ClickHouse to pass WHERE condition to underlying MySQL? #63068

Open
pakud opened this issue Apr 27, 2024 · 4 comments
Labels
question Question?

Comments

@pakud
Copy link
Contributor

pakud commented Apr 27, 2024

let's say i have a MySQL database 'mounted' to ClickHouse 24.3.2 via:

CREATE DATABASE mydb ENGINE = MySQL('host', 'db', 'login', 'pass')

i've noticed that most of more complex queries lead to ClickHouse reading the whole table from MySQL.

example query:

SELECT * FROM content WHERE transaction_id IN ( SELECT id FROM transaction WHERE object_id=1234 )

while i don't expect ClickHouse to pass the whole query to MySQL - i'd love to hint ClickHouse that it should compile list of ids from the subquery and then pass it to MySQL for the outer query instead of loading whole content table from MySQL and only then doing the filtering.

thank you!

@pakud pakud added the question Question? label Apr 27, 2024
@pmusa
Copy link
Contributor

pmusa commented May 1, 2024

From the docs:

Simple WHERE clauses such as =, !=, >, >=, <, <= are executed on the MySQL server.
The rest of the conditions and the LIMIT sampling constraint are executed in ClickHouse only after the query to MySQL finishes.

So, even though it compiles the list before the main query, the IN operation is executed on ClickHouse, which means it needs to read the content table. Below is the execution difference between using IN and >= (which could be any of the above).

EXPLAIN SELECT * FROM mydb.test WHERE id >= ( SELECT id FROM mydb.test2 WHERE obj_id=1234 LIMIT 1);

┌─explain────────────────────────────────────────────────────────┐
1. │ Expression ((Project names + Projection))                      │
2. │   Filter ((WHERE + Change column names to column identifiers)) │
3. │     ReadFromStorage (MySQL)                                    │
   └────────────────────────────────────────────────────────────────┘

EXPLAIN SELECT * FROM mydb.test WHERE id IN ( SELECT id FROM mydb.test2 WHERE obj_id=1234 LIMIT 1)

┌─explain──────────────────────────────────────────────────────────────┐
1. │ CreatingSets (Create sets before main query execution)               │
2. │   Expression ((Project names + Projection))                          │
3. │     Filter ((WHERE + Change column names to column identifiers))     │
4. │       ReadFromStorage (MySQL)                                        │
5. │   CreatingSet (Create set for subquery)                              │
6. │     Expression ((Project names + Projection))                        │
7. │       Limit (preliminary LIMIT (without OFFSET))                     │
8. │         Filter ((WHERE + Change column names to column identifiers)) │
9. │           ReadFromStorage (MySQL)                                    │
   └──────────────────────────────────────────────────────────────────────┘

@pakud
Copy link
Contributor Author

pakud commented May 1, 2024

thanks @pmusa ; so, basically, for the type of query i'm trying to execute - there's no way to push down filtering to MySQL level?

thx!

@pmusa
Copy link
Contributor

pmusa commented May 1, 2024

I don't think so. Why would you want to do it anyway? Then, the computation would be all done by MySQL, so just run the query on MySQL.

@pakud
Copy link
Contributor Author

pakud commented May 2, 2024

@pmusa : here's our use case: over time ClickHouse became central hub for calculating any type of statistics for us. some of the data is replicated to it, while other data 'mounted' via MySQL database engine from remote servers. this gives us ability to use joins across otherwise disconnected systems.

in this particular case data from a ticketing system is not replicated, but 'mounted' instead. the particular query i'm trying to execute, without ability to push down IN ( list of values ) to MySQL would be very inefficient.

oh well - now i have a better idea about the inner workings, maybe we'll have to consider MaterializedMySQL or gather this particular stats at the source rather than via ClickHouse.

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Question?
Projects
None yet
Development

No branches or pull requests

2 participants