CROSS JOIN using unexpectedly large amount of memory #12571
Comments
I've tried to reproduce the issue on my laptop. First of all, it returns some data immediately, but I think there's no guarantee of that.
The query should generate 10^7 * 10^7 rows. If my laptop (with a debug build) generates ~10^7 rows per second, we'd need 10^7 seconds (~116 days) to finish. So I started the query and left it running for several minutes. My laptop has 16 GB of memory, and the query used about 2 GB without growth.
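That arithmetic can be double-checked with a throwaway query (the numbers are just the ones from the estimate above):

```sql
-- 10^7 × 10^7 result rows, consumed at ~10^7 rows per second
SELECT
    1e7 * 1e7 AS total_rows,           -- 1e14
    total_rows / 1e7 AS seconds,       -- 1e7
    round(seconds / 86400, 1) AS days  -- ~116 days
```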
I've tested the issue on a newer debug build, but that shouldn't matter in the context of memory usage: there have been no significant changes in CROSS JOIN logic since 20.5.
@4ertus2 it does not. The right table is huge (it does not fit into memory); join it with a single row (left table). Version 20.7.1:

```sql
set max_memory_usage = 30000000000;

select q.id, p.id
from (select 1 id, '' name) q
cross join xs p
where startsWith(q.name, p.name);
```
There's a message that the right table has been loaded into memory, and we ran out of memory in the join phase, not during right-table insertion.
I cannot reproduce this issue. If we just have a huge right table, we should not get
10 mil rows * 2 KB = 20 GB of raw data; it would take ~100 GB of RAM in a hash table.

- 10 000 rows: Peak memory usage (for query): 125.42 MiB.
- 100 000 rows:
- 1 000 000 rows:
- 10 000 000 rows:
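Peaks like the 125.42 MiB figure above can be read back from the query log (a sketch; `system.query_log` and its `memory_usage` column exist in these versions, but query logging has to be enabled):

```sql
-- Peak memory for recently finished queries, newest first
SELECT
    event_time,
    formatReadableSize(memory_usage) AS peak_memory,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10
```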
My bad. I did something wrong with the population scripts and got empty strings.
CROSS JOIN does not need a hash table, only columns. So 32 GB in the query log is probably OK: 20 GB plus some overhead from memory allocations in columns. In any case, it's expected that it fails at the right-table creation phase; it's not expected that memory keeps growing during the CROSS JOIN phase.
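The 20 GB raw-data figure can be confirmed from `system.parts` (a sketch, assuming the table is named `xs` as in the query quoted above):

```sql
-- Raw vs. on-disk size of the table's active parts
SELECT
    formatReadableSize(sum(data_uncompressed_bytes)) AS raw,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed
FROM system.parts
WHERE table = 'xs' AND active
```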
OK, it looks like I've reproduced the issue. @gkristic, could you run the query with
Sure! I've just run that query with the same setup as before. It produced a first batch of results rather quickly. (It returned the first 10K rows.) Then it kept processing without printing anything other than the stats, with understandably decreasing rates for rows/s and MB/s. The output is not comfortable to look at with all those useless X's, though 😃 But I let it run for minutes and I didn't see resident memory grow past ~80 GiB. Is this enough for what you wanted to know? Thanks for taking a look at this @4ertus2 and @den-crane! Let me know if there's anything else I can do.
Let me quantify the "I let it run for minutes"... It was ~20 minutes.
I ran it again with |
Thanks.

Summary:

Summary 2:
The query
works alright; the memory usage is minimal (4.70 GiB). Something is related to the condition on
This query requires a lot of RAM:
And this query is working fine:
The reason is that the full table (with both
Postponed until #21047. |
I've been having issues trying to find prefix relationships between strings in a table, using `startsWith` on top of a cross join. Although the dataset is about 20 GiB uncompressed, ClickHouse keeps trying to allocate memory well past the mark where the entire table would fit in it. I tried on a box with 512 GiB of RAM (running ClickHouse exclusively) and the query still aborts due to memory limits. My original query was more complicated and involved aggregation functions, which I blamed at first. But as I simplified the case, I noticed that the problem persists even without aggregation of any kind. In this scenario I'd expect ClickHouse to loop over the data and stream the results right away, but queries still abort instead.

Setup: I'm using ClickHouse in Docker. It's almost exactly the standard yandex/clickhouse-server image in Docker Hub (20.5.2.7), except that I changed the max memory for queries to take advantage of the extra memory in the instance I'm using. Specifically, here's my Dockerfile:
The maximum amount of total memory is set through the 0.9 ratio that comes preset in the standard image.
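For reference, the per-query limit is ClickHouse's `max_memory_usage` profile setting; a minimal override that a Dockerfile like this might COPY in could look like the following (the file name and exact byte value here are assumptions, not taken from the report):

```xml
<!-- users.d/memory.xml (hypothetical): raise the per-query memory cap;
     the server-wide limit stays at the stock image's 0.9-of-RAM ratio -->
<yandex>
    <profiles>
        <default>
            <max_memory_usage>450000000000</max_memory_usage>
        </default>
    </profiles>
</yandex>
```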
Dataset: The table I'm using has only two columns: an ID (`UInt64`) and a name (`String`). In my original case the strings are all strictly different and vary in length (average ~2 KiB), but I reduced the problem further to a dataset that can be easily generated on demand. In this case all strings are equal and 2 KiB long. Here's the exact table structure:

You can populate the data with:
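A sketch of the kind of table and population script described, assuming the table name `xs` used in a later comment (the identical 2 KiB strings of x's match the output described later in the thread):

```sql
CREATE TABLE xs
(
    id UInt64,
    name String
)
ENGINE = MergeTree
ORDER BY id;

-- 10^7 rows, all sharing the same 2 KiB string of x's
INSERT INTO xs
SELECT number, repeat('x', 2048)
FROM numbers(10000000);
```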
Query: Here's the query I run:
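A sketch of what such a query might look like, based on the single-row variant quoted in a later comment (an assumption, not necessarily the author's exact query):

```sql
SELECT q.id, p.id
FROM xs AS q
CROSS JOIN xs AS p
WHERE startsWith(q.name, p.name);
```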
The problem seems to be related to the presence of `startsWith`. If I remove it and go for the full cross join, or change it to avoid the reflexive cases (`q.id != p.id`), ClickHouse starts returning results immediately. However, the experience with this query is that it blocks until it aborts due to memory. It doesn't print even a single result. I set `send_logs_level` to `trace` to have a better perspective, and here's what I got (summarized below for brevity; full output here).

I tried with the standard Yandex.Metrica dataset, using the URL field in kind of the same way:
In this case it sent ~10K results almost immediately (for URL `http://public_search`) and then kept processing silently. I didn't let it finish because I had to shut down the instance. But I let it run for at least 10 minutes and memory use seemed pretty stable; most of the time between 30 and 40 GiB, peaking at ~50 GiB. That still seems extremely high, though, considering that the uncompressed dataset is 6 GiB. Also, in my case above the string is 2 KiB for all rows, whereas in this dataset the `URL` column is empty for most rows (more than 7 out of 8 million), and most of those that are non-empty are shorter than 200 bytes.

For contrast:
Let me know if there's anything I can help with.