Fuse Map/Filter/Project operators on top of a Get #4463

justinj · 2020-10-12T15:36:59Z

This is part 1 of the work outlined in #4346, and builds off of #3258,
where we detect map/filter/project chains already present in plans but
do not attempt to synthesize them.

I did some quick and dirty unscientific benchmarking and noticed some
slight slowdown rather than any performance improvement, not sure if
there are any thoughts on what might need to be done to see some
improvement. I'm not sure if you had in mind some kind of tighter fusing
on top of a Get, or not?

I triggered the behaviour with this kind of query:

statement ok
CREATE TABLE abc (a int, b int, c int)

statement ok
INSERT INTO abc VALUES (1, 2, 3), (4, 5, 6);

query T multiline
EXPLAIN SELECT a, b, a+b FROM abc WHERE a = 1 UNION ALL (VALUES (7, 8, 9));
----
%0 =
| Get materialize.public.abc (u1)
| Filter (#0 = 1)
| Map (1 + #1)
| Project (#0, #1, #3)

%1 =
| Constant (7, 8, 9)

%2 =
| Union %0 %1

EOF

This change is

frankmcsherry · 2020-10-12T16:39:50Z

src/dataflow/src/render/mod.rs

+    ) -> bool {
+        if let Some((mfp, input)) = self.extract_map_filter_project(relation_expr) {
+            self.ensure_rendered(&input, scope, worker_index);
+            let (ok_collection, mut err_collection) = self.collection(&input).unwrap();


I think this is the moment where we could integrate this more tightly, using the flat_map_ref method instead of collection. It has the potential to avoid pulling the contents of the arrangement first, and then flat_map_fallible'ing them afterwards, and instead doing all the work at once.

frankmcsherry · 2020-10-12T16:42:08Z

The set-up looks great. I think the thing to check out next are some of the ways to integrate a bit more tightly, if we are worried about performance before landing. For example, a very restrictive filter should demonstrate a performance difference when push in through the flat_map_ref method, as it can avoid allocating (though: to avoid all allocations we'll need to avoid unpack() I think, which may require some shenanigans).

This is part 1 of the work outlined in MaterializeInc#4346, and builds off of MaterializeInc#3258, where we detect map/filter/project chains already present in plans but do not attempt to synthesize them. I did some quick and dirty unscientific benchmarking and noticed some slight slowdown rather than any performance improvement, not sure if there are any thoughts on what might need to be done to see some improvement.

justinj · 2020-10-14T15:10:10Z

That change showed perf improvements! My (very unscientific) benchmark with the following query with 1000000 rows:

select sum(d) from (select a+b as d from v where c = 3 union all (select * from (values (1))))

took 1.5s with the new impl and 3.5s with the old one.

frankmcsherry · 2020-10-14T15:50:47Z

src/dataflow/src/render/mod.rs

+                    let mut row_packer = repr::RowPacker::new();
+                    let temp_storage = RowArena::new();
+                    move |row| {
+                        mfp.evaluate(&mut row.unpack(), &temp_storage, &mut row_packer)


Very minor, and not actionable, but this is the other moment at which a potentially optional allocation happens. row.unpack() will produce a Vec<Datum> that we could try and wrangle in to re-using an existing allocation. It's a bit awkward with lifetimes, but inside flat_map_ref() we could mess around a bit there and re-use one vector for each operator invocation.

Just read the code, and it would at least involve copy/paste of differential's flat_map_ref, as that method does not expose the 'storage lifetime, which is what we would want to use for the lifetime of the Vec<Datum>.

frankmcsherry

This looks like a good foundation to me!

justinj · 2020-10-14T16:33:02Z

Cool, TFTR! Thanks for the notes.

In MaterializeInc#4463 we introduced MapFilterProjects on top of Gets to reduce the amount of data that we had to pack and unpack across operator boundaries. When doing this, I made one RowArena that was used across each invocation of MFP, which resulted in memory buildup over time. This change changes this to occur per-row so that we can throw away intermediate data.

justinj requested a review from frankmcsherry October 12, 2020 15:36

frankmcsherry reviewed Oct 12, 2020

View reviewed changes

justinj force-pushed the fuse branch from 0cb8836 to c69a7c6 Compare October 14, 2020 15:06

justinj marked this pull request as ready for review October 14, 2020 15:09

frankmcsherry reviewed Oct 14, 2020

View reviewed changes

frankmcsherry approved these changes Oct 14, 2020

View reviewed changes

frankmcsherry mentioned this pull request Oct 14, 2020

Fuse Map/Filter/Project operations to Get #4346

Closed

justinj merged commit 052add0 into MaterializeInc:main Oct 14, 2020

justinj deleted the fuse branch October 14, 2020 16:33

justinj mentioned this pull request Oct 21, 2020

Don't reuse a RowArena for MapFilterProject #4578

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fuse Map/Filter/Project operators on top of a Get #4463

Fuse Map/Filter/Project operators on top of a Get #4463

justinj commented Oct 12, 2020 •

edited by benesch

frankmcsherry Oct 12, 2020

frankmcsherry commented Oct 12, 2020

justinj commented Oct 14, 2020

frankmcsherry Oct 14, 2020

frankmcsherry Oct 14, 2020

frankmcsherry left a comment

justinj commented Oct 14, 2020

Fuse Map/Filter/Project operators on top of a Get #4463

Fuse Map/Filter/Project operators on top of a Get #4463

Conversation

justinj commented Oct 12, 2020 • edited by benesch

frankmcsherry Oct 12, 2020

Choose a reason for hiding this comment

frankmcsherry commented Oct 12, 2020

justinj commented Oct 14, 2020

frankmcsherry Oct 14, 2020

Choose a reason for hiding this comment

frankmcsherry Oct 14, 2020

Choose a reason for hiding this comment

frankmcsherry left a comment

Choose a reason for hiding this comment

justinj commented Oct 14, 2020

justinj commented Oct 12, 2020 •

edited by benesch