Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuse Map/Filter/Project operators on top of a Get #4463

Merged
merged 1 commit into from Oct 14, 2020

Conversation

justinj
Copy link
Contributor

@justinj justinj commented Oct 12, 2020

This is part 1 of the work outlined in #4346, and builds off of #3258,
where we detect map/filter/project chains already present in plans but
do not attempt to synthesize them.

I did some quick and dirty unscientific benchmarking and noticed some
slight slowdown rather than any performance improvement, not sure if
there are any thoughts on what might need to be done to see some
improvement. I'm not sure if you had in mind some kind of tighter fusing
on top of a Get, or not?

I triggered the behaviour with this kind of query:

statement ok
CREATE TABLE abc (a int, b int, c int)

statement ok
INSERT INTO abc VALUES (1, 2, 3), (4, 5, 6);

query T multiline
EXPLAIN SELECT a, b, a+b FROM abc WHERE a = 1 UNION ALL (VALUES (7, 8, 9));
----
%0 =
| Get materialize.public.abc (u1)
| Filter (#0 = 1)
| Map (1 + #1)
| Project (#0, #1, #3)

%1 =
| Constant (7, 8, 9)

%2 =
| Union %0 %1

EOF

This change is Reviewable

) -> bool {
if let Some((mfp, input)) = self.extract_map_filter_project(relation_expr) {
self.ensure_rendered(&input, scope, worker_index);
let (ok_collection, mut err_collection) = self.collection(&input).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the moment where we could integrate this more tightly, using the flat_map_ref method instead of collection. It has the potential to avoid pulling the contents of the arrangement first, and then flat_map_fallible'ing them afterwards, and instead doing all the work at once.

@frankmcsherry
Copy link
Contributor

The set-up looks great. I think the thing to check out next are some of the ways to integrate a bit more tightly, if we are worried about performance before landing. For example, a very restrictive filter should demonstrate a performance difference when push in through the flat_map_ref method, as it can avoid allocating (though: to avoid all allocations we'll need to avoid unpack() I think, which may require some shenanigans).

This is part 1 of the work outlined in MaterializeInc#4346, and builds off of MaterializeInc#3258,
where we detect map/filter/project chains already present in plans but
do not attempt to synthesize them.

I did some quick and dirty unscientific benchmarking and noticed some
slight slowdown rather than any performance improvement, not sure if
there are any thoughts on what might need to be done to see some
improvement.
@justinj justinj marked this pull request as ready for review October 14, 2020 15:09
@justinj
Copy link
Contributor Author

justinj commented Oct 14, 2020

That change showed perf improvements! My (very unscientific) benchmark with the following query with 1000000 rows:

select sum(d) from (select a+b as d from v where c = 3 union all (select * from (values (1))))

took 1.5s with the new impl and 3.5s with the old one.

let mut row_packer = repr::RowPacker::new();
let temp_storage = RowArena::new();
move |row| {
mfp.evaluate(&mut row.unpack(), &temp_storage, &mut row_packer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, and not actionable, but this is the other moment at which a potentially optional allocation happens. row.unpack() will produce a Vec<Datum> that we could try and wrangle in to re-using an existing allocation. It's a bit awkward with lifetimes, but inside flat_map_ref() we could mess around a bit there and re-use one vector for each operator invocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just read the code, and it would at least involve copy/paste of differential's flat_map_ref, as that method does not expose the 'storage lifetime, which is what we would want to use for the lifetime of the Vec<Datum>.

Copy link
Contributor

@frankmcsherry frankmcsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good foundation to me!

@justinj
Copy link
Contributor Author

justinj commented Oct 14, 2020

Cool, TFTR! Thanks for the notes.

@justinj justinj merged commit 052add0 into MaterializeInc:main Oct 14, 2020
@justinj justinj deleted the fuse branch October 14, 2020 16:33
justinj added a commit to justinj/materialize that referenced this pull request Oct 21, 2020
In MaterializeInc#4463 we introduced MapFilterProjects on top of Gets to reduce the
amount of data that we had to pack and unpack across operator
boundaries. When doing this, I made one RowArena that was used across
each invocation of MFP, which resulted in memory buildup over time. This
change changes this to occur per-row so that we can throw away
intermediate data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants