Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demand analysis for arrangements #2194

Open
frankmcsherry opened this issue Mar 3, 2020 · 1 comment
Open

Demand analysis for arrangements #2194

frankmcsherry opened this issue Mar 3, 2020 · 1 comment

Comments

@frankmcsherry
Copy link
Member

@frankmcsherry frankmcsherry commented Mar 3, 2020

We are seeing high arrangement record counts for chbench queries, for example:

materialize=> select * from mz_records_per_dataflow_global;
  id  |                                name                                | records  
------+--------------------------------------------------------------------+----------
  779 | Dataflow: materialize.public.q01                                   |     3687
 1569 | Dataflow: materialize.public.q09                                   | 41038774
 1320 | Dataflow: materialize.public.q06                                   |        1
 1364 | Dataflow: materialize.public.q08                                   |   371294
 1885 | Dataflow: materialize.public.q17                                   |  1879369
  914 | Dataflow: materialize.public.q02                                   |  1298394
 1690 | Dataflow: materialize.public.q12                                   |  3484277
 1181 | Dataflow: materialize.public.q05                                   | 84255727
 2015 | Dataflow: materialize.public.q19                                   |  1113873
 1784 | Dataflow: materialize.public.q14                                   |   100001

The largest ones, q05 and q09, both involve orderline in a join. Naively, it makes sense that we need to keep it around as we may need to respond to arbitrary changes to other relations.

However, our ability to compact things is compromised by the arrangement containing primary keys, even though the query does not use them. For example, in both queries we only use five fields from orderline, foreign keys into other relations and ol_amount. The various values in other fields prevent us from collapsing the arrangements down, as, for example the primary key ensures that no records are ever the same.

We could project down to the demanded columns, and arrange just those, in particular for private arrangements. This has a potential negative impact on sharing within a query (e.g. delta queries might want to be careful doing this), but a potential massive upside for collapsing large collections with relatively fewer distinct values of interest.


For bonus points, if we noticed that ol_amount is only summed at the end, we could move it in to the diff field (would require some advanced work) and this would further reduce the memory footprint as we wouldn't need to track distinct rows for each distinct value of ol_amount.

cc @wangandi as relevant to your interests.

@frankmcsherry

This comment has been minimized.

Copy link
Member Author

@frankmcsherry frankmcsherry commented Mar 3, 2020

A more probable way to win partial bonus points: if we push down the reduction (summing ol_amount) through the join, and create a reduce using the join keys that will eventually be required, we would simplify the amount of data required in the same way: removing the dependence on the distinct values of ol_amount.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.