A `top_k` operator can be implemented more efficiently

We see a lot of `top_k` operators in computations. Things like `cc` and `bfs` use them for graph computation, and several of the TPCH queries have similar "top blah" queries (e.g. Query 15).

It is not hard to implement `top_k` with `group`, but there are much more efficient implementations you could imagine. Or, there are optimizations to `group` that make sense for `top_k` and perhaps not as much sense for more general `group` computation. For example:

1. For each key, `group` reads out all of the value histories from the input and output traces. We could do this lazily, as each evaluation of the "user logic" only needs to pull record until they observe k-ish values.

2. By watching which values are observed, we can restrict the future times that need to be warned about. If we only observe the histories of the first three values before finding enough values, we needn't warn about times not found in those first three values. Times in all the other values (e.g. often later iterations, in computations like cc,bfs), can potentially be ignored.

3. When attending to new changes, we have the potential to swap in the *output* trace for values before the smallest value in the new changes. This needs to be double checked for correctness, but it seems that there is the ability to perform this substitutions safely in several cases.

4. In the same setting as above, we can advance the substituted values from the output trace up to the smallest value in the new changes, as the specific values of these values are not as important as the fact that they are less or equal to the new changes and not changing. This has the potential to reduce the number of moments of variation, as any changes among the smallest values collapse down to just the moment where the size of the set changes.

5. Having swapped in the output trace, advanced values, and advanced times, we can evaluate an interval by starting from the times in the input trace and evaluating the `top_k` logic at each of these, warning about the times that might be interesting in subsequent values, but only those values we need to explore to fill out the output. This restricts our attention to the moments where the count of elements in output changes, and only to the changes for values that would emerge. All of the updates to values greater than these, those that never appear in the output, are ignored.

All of this makes for a pretty sweet implementation, that should make `top_k` pretty effective. Right now TPCH's Query 15 is a real pain point for e.g. DBToaster, and this sort of implementation seems like it would clear up all sorts of issues, plus make bfs lots faster.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

A `top_k` operator can be implemented more efficiently #29

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

A top_k operator can be implemented more efficiently #29

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

A `top_k` operator can be implemented more efficiently #29