Filters pushdown from view to source #3196
Hi! Thanks for the nice words. We already push down filters, so there is possibly something else going on. If you run an `EXPLAIN` on the view's query, you should get a visual presentation of the plan, which is certainly supposed to have a filter in place before the materialization. Depending on when you most recently grabbed the binary, some recent work did unblock the flow of data from sources (which should have been noticeable only in cases of large batch sizes). Can you tell us which version you are using? Thanks!
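The command being suggested was elided from the comment above; presumably it is Materialize's `EXPLAIN`. A hedged sketch, reusing the query from later in this thread (the exact `EXPLAIN` syntax varies between Materialize versions):

```sql
-- Show the plan Materialize builds for the view's query. A Filter
-- operator should appear before the materialization step.
-- Sketch only: older releases may use a different EXPLAIN variant.
EXPLAIN PLAN FOR
  SELECT * FROM source WHERE id = 'abcd1234';
```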
Also, what OS are you using? On my Mac OS X laptop, it comfortably churns through hundreds of gigs of input data while swapping out to disk; you should not have hit an OOM error at 10 GB of usage.
I'm using the Docker image in Kubernetes, version v0.2.3-rc1.
Looks like the filter is after materializing?
@frankmcsherry do you think the plan looks as expected?
Hi, sorry for the delay. The plan looks totally normal, and the filtering will happen before the materialization. It's a bit hard to know without additional information about your set-up. Can you describe the source, and the data it contains? You could also create a view that just counts the filtered records,
which should maintain a small footprint and let you check out how many records are passing the filter. It's definitely not expected that the filtering doesn't happen, and it hasn't been an issue in other tests. I'd recommend grabbing the most recent release you can and double-checking that; there has been some churn around Kafka reading, and I can't say with high confidence that we haven't had an issue that could have caused this and which might have since been fixed.
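The small-footprint check suggested above can be sketched as a counting view, using the names from this thread:

```sql
-- Maintain only an aggregate, not the matching rows themselves, so the
-- memory footprint stays small regardless of how many records match.
CREATE MATERIALIZED VIEW xyz_count AS
  SELECT count(*) FROM source WHERE id = 'abcd1234';
```

Querying `SELECT * FROM xyz_count;` periodically then shows how many records have passed the filter so far.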
@bruwozniak What is the source's envelope? Filters are pushed down only for certain envelopes.
@wangandi I don't specify the envelope when creating the source, so presumably it should be the default. @frankmcsherry the data is a Kafka (2.3.0) topic with 60 well-balanced partitions containing Protobuf data.
This works fine when the source topic (same Protobuf format) has just 1 partition and 1 GB of data, presumably because it can all fit into memory before the filter.
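The actual source definition isn't shown in the thread; a hypothetical sketch of a Protobuf Kafka source from this era, with broker, topic, message name, and schema path all as placeholders (the exact syntax varies by Materialize version). Omitting the `ENVELOPE` clause leaves the source on the default envelope:

```sql
-- Hypothetical source definition; all names below are placeholders.
CREATE SOURCE source
FROM KAFKA BROKER 'kafka:9092' TOPIC 'events'
FORMAT PROTOBUF MESSAGE '.Event' USING SCHEMA FILE '/schemas/event.pb';
```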
Thanks for the additional info! I'm going to switch this from a feature request to a bug, as this is supposed to work (and... seems to in our tests, but clearly we aren't testing something).
Also, in the interest of keeping folks sane on our end: when you create an issue and select "bug", it asks for various bits of information; would you be willing to check out that list and try to grab those things (e.g. version, how you installed it, but also bits of your logs)?
What version of Materialize are you using?
How did you install Materialize?
Main part of k8s deployment manifest:
What was the issue? See the thread above.
Is the issue reproducible? If so, please provide reproduction instructions. Steps to reproduce:
Please attach any applicable log files.
Thanks very much!
I did a quick test of memory usage on the same ~1 GB dataset, with a materialized view that filtered it down to 56 records.
In both cases, I checked the results. Based on this, I suspect that the filter is there and doing its job, but increasing the number of partitions seems to result in a lot of extra memory allocations at the outset.
@bruwozniak In the 1 partition, 1 GB of data case, what does the memory usage graph look like? We have instructions on obtaining performance metrics here: https://materialize.io/docs/monitoring/, though it would be sufficient if you knew the peak memory usage and the steady-state memory usage after making the view.
@wangandi I think the usage is similar to what you observed (peak below 4 GB). And as I can see, you can reproduce this yourself too. But this is not the problematic scenario. Maybe you can try with 10 GB of data spread across 60 partitions and see what the memory profile looks like?
@bruwozniak
Thus, I was wondering: in the 1 partition, 1 GB case, did you see a significant memory spike that made you believe the entire dataset was being loaded into memory before the filter was applied? In my quick test on 1 partition and 1 GB, I only observed a gradual, monotonic increase in memory, and the total increase was only 20% of the size of the dataset. If you did see a significant memory spike in the 1 partition, 1 GB case, that would indicate there is an additional factor (such as average record size or Protobuf decoding), besides the number of partitions or the size of the topic, that we should investigate as a cause of the memory spiking.
@wangandi I see that you opened a feature issue related to this, but I still believe this is a performance bug. What I am trying to use Materialize for, on the production topic, likely won't fit into e.g. 64 GB of server RAM either. So it's not just about being able to run heavy workloads on a laptop.
Hi, first of all I'm quite impressed with your project, keep up the great work.
I have a use case where I would like to analyze some data in a fairly large (~400GB) Kafka topic looking for specific criteria.
For example, I ran:

```sql
CREATE SOURCE [...];
```

and then:

```sql
CREATE MATERIALIZED VIEW xyz AS SELECT * FROM source WHERE id='abcd1234';
```

At this point Materialize starts reading the topic and OOMs after a few moments, using 10 GB of heap, presumably because there isn't enough RAM to fit all the source data in memory. But of course, technically you'd only need to keep the one row relevant to the view.
Can this behavior be optimized in the future?