
Monoid for worst case optimal joins #147

Merged: 9 commits, Mar 4, 2019

3 participants
eoxxs (Contributor) commented Feb 25, 2019

Most of it is straightforward type changing, but there's the part about correctly counting the occurrences of tuples while proposing the extensions.

If we have a Max monoid, "counting" would result in prefixes potentially proposing more results than they previously "counted".

Assuming that the indexed collections are distinct (key, value) pairs, we can count +1 if the diff is positive and -1 if negative. That's why we need an is_inverse() method in the diff trait. The count trace should then contain isize diffs. Or maybe I am mistaken here?

It feels like there is a more elegant way than using is_inverse, though...
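A minimal sketch of the is_inverse idea described above, assuming a hypothetical Diff-like trait (the trait and method names here are illustrative, not differential dataflow's actual API): each positive update contributes +1 and each retraction -1, so the count trace can stay in isize regardless of the collection's own diff type.

```rust
/// Hypothetical update trait carrying the proposed `is_inverse` method.
/// (Illustrative only; not differential dataflow's real diff trait.)
trait Diff {
    /// True if this update retracts, i.e. its sign is negative.
    fn is_inverse(&self) -> bool;
}

impl Diff for isize {
    fn is_inverse(&self) -> bool {
        *self < 0
    }
}

/// Fold a stream of updates into an isize count: +1 per insertion,
/// -1 per retraction. Only correct if the indexed collection holds
/// distinct (key, value) pairs, as assumed in the comment above.
fn count_updates<D: Diff>(updates: &[D]) -> isize {
    updates
        .iter()
        .map(|d| if d.is_inverse() { -1 } else { 1 })
        .sum()
}

fn main() {
    // Three insertions and one retraction leave a count of 2.
    let updates: Vec<isize> = vec![1, 1, -1, 1];
    println!("{}", count_updates(&updates));
}
```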

eoxxs (Contributor, Author) commented Feb 25, 2019

Type alias for Monoid+Mul as Semiring would make sense

frankmcsherry (Member) commented Feb 25, 2019

Hrm, yes I also hope there is a more interesting way than is_inverse... ;)

Let me ponder it for a bit. I see the problem you've found, that WCOJ counting is annoying. Perhaps if nothing else we can break it apart so that the Delta Query approach (which doesn't require counts) works in either case, and WCOJ are perhaps available under integer counting instead.

My intuition is that perhaps we can finesse this with a different monoid. Specifically, for most monoids what we want (I think) is a usize that goes up as we add vals but cannot go down if we cannot remove them. If we can remove them then yeah too bad I guess. Perhaps an isize when used as a Monoid, so without negation, could stand in for usize. Let me ponder this for a bit, sorry!

frankmcsherry (Member) commented Feb 25, 2019

I think this might be a slightly sticky point, like the one @comnik ran into as well, where we need to be a bit more precise about what we are determining here. The prior code,

let counts = collection.map(|(k,_v)| k).arrange_by_self().trace;

just drops the value and re-accumulates, which is only correct if the (k,v) collection is distinct. I think we perhaps want to do

let counts =
    collection
        .distinct()
        .map(|(k, _v)| k)
        .arrange_by_self()
        .trace;

which discards the monoid after determining which (k,v) pairs are non-zero. It is a fair bit more compute (and state maintained), but I think it is closer to correct.

What do you think, @eoxxs? Is that likely to typecheck?

frankmcsherry (Member) commented Feb 25, 2019

Also, if we do the above, the

let counts =
    collection
        .distinct()

fragment is also likely to be the same state as for validate(), which needs to semijoin against the (k,v) arrangement-by-self.

frankmcsherry (Member) commented Feb 25, 2019

Type alias for Monoid+Mul as Semiring would make sense

I tried this out before, and .. I'm not sure where the constraint would be used. The join method for example currently allows something more general, in that the monoid types of the two argument collections can be different (and often are, when there is one relation with many aggregates and another with just an isize).

What we actually end up with are three types, R1, R2, R3, where R1 and R2 are monoids with addition and have a * action with range R3 across which their addition distributes. I'm not sure what the group/category thing is for this type of mutual relationship. =/
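For what it's worth, the three-type relationship described above can be sketched as follows; the trait names (Monoid, Action) and the isize impls are purely illustrative, not differential dataflow's actual traits:

```rust
/// Illustrative monoid: a zero element plus an associative addition.
trait Monoid {
    fn zero() -> Self;
    fn plus(&self, other: &Self) -> Self;
}

/// Illustrative multiplication action of Self on R2, landing in R3,
/// over which the monoids' addition is expected to distribute.
trait Action<R2, R3> {
    fn times(&self, other: &R2) -> R3;
}

impl Monoid for isize {
    fn zero() -> Self { 0 }
    fn plus(&self, other: &Self) -> Self { self + other }
}

impl Action<isize, isize> for isize {
    fn times(&self, other: &isize) -> isize { self * other }
}

fn main() {
    // Distributivity check: (a + b) * c == a*c + b*c.
    let (a, b, c) = (2isize, 3isize, 4isize);
    println!("{}", a.plus(&b).times(&c) == a.times(&c).plus(&b.times(&c)));
}
```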

comnik (Contributor) commented Feb 26, 2019

So in Declarative we currently apply distinct to most attributes before indexing them (so all three index traces are on a distinct collection). For some others, we know that they will be distinct anyway and don't distinct explicitly.

@frankmcsherry I don't understand your comment, that the delta query approach doesn't require counts? Do you mean that a specific source relation itself doesn't need counts, as long as the extenders have them?

frankmcsherry (Member) commented Feb 26, 2019

@frankmcsherry I don't understand your comment, that the delta query approach doesn't require counts? Do you mean that a specific source relation itself doesn't need counts, as long as the extenders have them?

I think what I meant was that between i. delta queries, and ii. WCO joins, only ii. requires counts. Just doing delta queries works with propose and validate. If the math didn't work out (I think it will be ok, with the proposal above) we could have restricted ourselves to delta queries for single-semiring queries.

eoxxs (Contributor, Author) commented Mar 1, 2019

Using distinct() typechecks happily.

It seems this is the way to go. Counting is done correctly whether we have an inverse or not; the tradeoff is the added computation to distinct the count trace first.

frankmcsherry (Member) commented Mar 1, 2019

Now, as an added bonus: I think the input to distinct() is the same input as to validate(). One should be able to arrange_by_self() the input and use it for both computations, saving a bit there. @comnik should be able to supply details, or I can too (if you want the surprise ruined ;)).

@eoxxs eoxxs force-pushed the eoxxs:master branch from 49acd21 to faeb9a8 Mar 1, 2019

eoxxs (Contributor, Author) commented Mar 1, 2019

Oops, I think I did something wrong pushing to my fork. Let me fix that :)

Wanted to be smart and discard the is_inverse commit altogether...

Weird. I'll probably create a new PR later...

eoxxs (Contributor, Author) commented Mar 1, 2019

Like so?

let arranged = collection.arrange_by_self();
let counts = arranged
    .distinct()
    .map(|(k, _v)| k)
    .arrange_by_self()
    .trace;
let propose = collection.arrange_by_key().trace;
let validate = arranged.trace;

validate and counts share the same base arrangement.

frankmcsherry (Member) commented Mar 1, 2019

I think that looks right (tbh: I screwed up the "where is the monoid now" reasoning a few times).

frankmcsherry (Member) commented Mar 1, 2019

Now, just to check, it looks like this is based on some weird fork that has already landed. Is it possible to point it at differential master, to just pick up the relevant changes (vs various Monoid and Lattice changes from the past)? If not, we can try to merge and assume that it will all land just fine (no clue how git de-duplicates nodes in the history).

eoxxs added some commits Feb 25, 2019

Count correctly
To ensure correct worst-case optimal joins we need to correctly count the
number of tuples that each extension might propose. If we have different
diff types, "counting" might mean max or min, so we need to fall back to
isize for the count_trace.

@eoxxs eoxxs force-pushed the eoxxs:master branch from faeb9a8 to c0de3e7 Mar 1, 2019

eoxxs (Contributor, Author) commented Mar 1, 2019

I removed these weird commits that came from rebasing my fork onto the current upstream and then force-pushing.
Looks like it's all right now...

cursor.map_times(&storage, |t, d| if t.less_equal(time) { count += d; });
// assert!(count >= 0);
- if count > 0 {
+ if count > R::zero() {

frankmcsherry (Member) commented Mar 1, 2019

Can you talk me through this logic? I would have thought that Monoid did not imply Ord, and that our goal here is probably to multiply count and diff. Is that wrong?

frankmcsherry (Member) commented Mar 1, 2019

Same thing over in validate(); I think our goal (correct me if wrong) is to end up multiplying all of the constraints together. Back in relational-land this would implement "intersect", and in general monoid land it implements something else, I suppose...

eoxxs (Author, Contributor) commented Mar 2, 2019

Ah, I must have missed pushing the multiplication part.
As far as I understand it, the

cursor.map_times(&storage, |t, d| if t.less_equal(time) { count += d; });

in both validate and propose starts with the zero element of the monoid and then applies its addition for every time in the cursor, sort of consolidating one specific (k, v)-pair in the propose_trace/validate_trace.
If it's non-zero we create an output like so:

if count > R::zero() {
    session.give(((prefix.clone(), value.clone()), time.clone(), diff.clone() * count));
}

using the Mul that the semi-ring supplies.
Same for validate.

frankmcsherry (Member) commented Mar 2, 2019

I suspect rather than > R::zero() you'll probably want !count.is_zero(), and you may even want to perform the multiplication with diff first (to see if they annihilate).

I think the > you have access to is not through Monoid, but rather just the Ord implementation that differential requires to sort things. I don't think it is semantically meaningful here (e.g., the numbers could accumulate negatively, and you would want to multiply anyhow).

frankmcsherry (Member) commented Mar 2, 2019

And, if you have the free cycles, there are several other spots in the code where you have != R::zero() that we should probably make into !blah.is_zero(), in potential anticipation of going from monoids to semigroups (where a zero element may not inhabit the type). Not as urgent, but it was something that came to mind as I read the edits.
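A toy illustration of the suggested change, using an illustrative Semigroup trait (not differential dataflow's real one): testing via an is_zero method avoids mentioning a zero() constructor at all, so the same call sites survive a move from monoids to semigroups, where the additive identity may not inhabit the type.

```rust
/// Illustrative semigroup: an associative addition, plus a way to ask
/// whether an element acts as the additive identity. Note there is no
/// `zero() -> Self` constructor, so the trait works even for types not
/// containing a zero element.
trait Semigroup: Sized {
    fn plus(&self, other: &Self) -> Self;
    /// True if this element acts as the additive identity.
    fn is_zero(&self) -> bool;
}

impl Semigroup for isize {
    fn plus(&self, other: &Self) -> Self { self + other }
    fn is_zero(&self) -> bool { *self == 0 }
}

fn main() {
    let count = 3isize.plus(&-3);
    // Preferred style: `!count.is_zero()` rather than `count != R::zero()`.
    println!("{}", !count.is_zero());
}
```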

eoxxs (Author, Contributor) commented Mar 2, 2019

Mh, yes, good point. I didn't think much about the >.
Let me think a bit about semi-rings, but multiplying first wouldn't cost us anything, so it seems like a sensible choice.
I am not sure whether two non-zero elements of our semiring could, under Mul, turn into the identity (aka zero) element of Add.
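On the question above: semirings with zero divisors do exist, so two non-zero elements can indeed multiply to the additive identity. A minimal example is arithmetic mod 4, sketched here with an illustrative Mod4 type:

```rust
/// Integers mod 4: an illustrative semiring with zero divisors.
#[derive(Debug, PartialEq)]
struct Mod4(u8);

impl Mod4 {
    fn times(&self, other: &Mod4) -> Mod4 {
        Mod4((self.0 * other.0) % 4)
    }
    fn is_zero(&self) -> bool {
        self.0 == 0
    }
}

fn main() {
    // 2 * 2 = 4 = 0 (mod 4): two non-zero factors annihilate,
    // which is why checking `is_zero` after multiplying matters.
    let product = Mod4(2).times(&Mod4(2));
    println!("{}", product.is_zero());
}
```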

frankmcsherry (Member) commented Mar 2, 2019

Sorry, replying to the github email apparently put the response one level up. I'll avoid using that mechanism in the future.

eoxxs (Author, Contributor) commented Mar 2, 2019

Oh yeah, I didn't think about more complex diffs than a single one ^^
True that, true that.

frankmcsherry (Member) commented Mar 2, 2019

Looking good!

frankmcsherry (Member) commented Mar 2, 2019

Makes me think we may eventually want MulAssign<&R2>. :D

frankmcsherry (Member) commented Mar 4, 2019

Looks good to me, and happy to merge!

@frankmcsherry frankmcsherry merged commit 971915c into TimelyDataflow:master Mar 4, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)