Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indentical subqueries (or CTE) execute only once. (condition pushdown) #21992

Open
UnamedRus opened this issue Mar 22, 2021 · 4 comments
Open
Labels
comp-cte common table expression (WITH ... SELECT) feature

Comments

@UnamedRus
Copy link
Contributor

After #2301 fix, clickhouse execute sub queries from single level of query only once. But it doesn't work in case we are using that sub query in WHERE condition and that conditions is being pushed to the inner query.

Use case

WITH x AS
    (
        SELECT *
        FROM numbers(10000)
        WHERE NOT sleep(1)
    )
SELECT count()
FROM
(
    SELECT *
    FROM numbers(100)
)
WHERE number IN (x)

Query id: 6b8ed07a-1513-49f1-9f5e-ef3c32375fda

┌─count()─┐
│     100 │
└─────────┘

1 rows in set. Elapsed: 2.009 sec. Processed 20.10 thousand rows, 160.80 KB (10.00 thousand rows/s., 80.04 KB/s.)


SELECT count()
FROM
(
    SELECT *
    FROM numbers(100)
)
WHERE number IN
(
    SELECT *
    FROM numbers(10000)
    WHERE NOT sleep(1)
)

Query id: 6d1f8c64-118f-4a44-a0c3-30c9bf31bca3

┌─count()─┐
│     100 │
└─────────┘

1 rows in set. Elapsed: 2.005 sec. Processed 20.10 thousand rows, 160.80 KB (10.02 thousand rows/s., 80.19 KB/s.)


WITH
    (
        SELECT groupArray(number)
        FROM numbers(10000)
        WHERE NOT sleep(1)
    ) AS x
SELECT count()
FROM
(
    SELECT *
    FROM numbers(100)
)
WHERE number IN
(
    SELECT arrayJoin(x)
)

Query id: 4bc49312-9e82-4dca-883d-316734c76ba2

┌─count()─┐
│     100 │
└─────────┘

1 rows in set. Elapsed: 1.004 sec.

Describe the solution you'd like
Clickhouse wouldn't push that kind of conditions or would execute them only once.

Describe alternatives you've considered
Disable predicate optimization by hand:

set  set enable_optimize_predicate_expression =0;
@filimonov filimonov added the comp-cte common table expression (WITH ... SELECT) label Mar 22, 2021
@amosbird amosbird self-assigned this Mar 23, 2021
@amosbird
Copy link
Collaborator

#23539 (comment) can help

@seanlaff
Copy link
Contributor

I have a similar problem where I have a performant two-statement query (using a temp table) that I can't convert to a one-statement query due to how clickhouse plans the CTE query.

CREATE TEMPORARY TABLE temp AS 
SELECT d FROM dist_table WHERE e IN (
    SELECT f from local_table
);
SELECT *
FROM dist_table 
WHERE a IN (
    SELECT b FROM local_table WHERE c IN (temp)
);

In this two-statement query, temp is ran once on the cluster. Which is what I intend.

If I try to convert this to a one-statement query, I use WITH

WITH temp AS (
SELECT d FROM dist_table WHERE e IN (
    SELECT f from local_table
)
SELECT *
FROM dist_table 
WHERE a IN (
    SELECT b FROM local_table WHERE c IN (temp)
);

Clickhouse's query planner "inlines" the query here c IN (temp). This nesting causes the "Double-distributed IN/JOIN subqueries is denied" error to be thrown. I could override this error with distributed_product_mode="allow"but then the temp query ends up being ran N times on the cluster (where N is number of nodes). I cannot recreate the performance of the two-statement query.

As a workaround I tried @UnamedRus trick of using WITH expr AS alias with groupArray()/arrayJoin()

WITH (
SELECT groupArray(d) FROM dist_table WHERE e IN (
    SELECT f from local_table
) AS temp
SELECT *
FROM dist_table 
WHERE a IN (
    SELECT b FROM local_table WHERE c IN (select arrayJoin(temp))
)
SETTINGS distributed_product_mode='allow';

The query plan begins by running the CTE query once on the cluster (which is what we want), but then inside the main query the CTE gets rebuilt again on every node... which I think might be expected given distributed_product_mode='allow'.

In summary, I can't seem to find any workaround to make a performant one-statement query that matches my two-statement query. @amosbird's suggestion of adding a AS MATERIALIZED #23539 (comment) to the CTE syntax would theoretically allow me to tell the query planner to fully resolve my CTE, and then pass it to the query without having to override distributed_product_mode:

WITH temp AS MATERIALIZED (
SELECT d FROM dist_table WHERE e IN (
    SELECT f from local_table
)
SELECT *
FROM dist_table 
WHERE a IN (
    SELECT b FROM local_table WHERE c IN (temp)
);

@xevix
Copy link

xevix commented Aug 15, 2023

Is there any plan to implement materialized CTEs? If not, is there an issue/discussion where it’s explained why not?

I’m wondering for the simple case where there is just one Clickhouse instance, so the trick of using GLOBAL JOIN to force temporary caching doesn’t work, but still want to avoid the round trip of first writing a temporary table.

@UnamedRus
Copy link
Contributor Author

Is there any plan to implement materialized CTEs?

Plans yes, but not ETA

If not, is there an issue/discussion

I've opened separate feature request for that
#53449

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp-cte common table expression (WITH ... SELECT) feature
Projects
None yet
Development

No branches or pull requests

5 participants