Faster union/intersection implementation for sets? #83

belyaev-mikhail · 2020-08-18T17:13:15Z

Current union (addAll) implementation for sets is O(logN * M) as far as I understand even when both sides have the same implementation.
It is certainly possible to improve that to log or even sub-log implementation for HAMTs by merging one-bit arrays only.
It is also possible to improve intersection, although atm you don't even have a method for intersection in the API.

Not sure if it helps maps in any way, but I guess it does.
While intersection is questionable, union is a very important and widely used operation (especially so when + is overloaded as union) for persistent sets and implementing this doesn't even need changing the api.

The text was updated successfully, but these errors were encountered:

belyaev-mikhail · 2020-08-23T16:28:50Z

I'll work on a PR if you find this idea useful

qurbonzoda · 2020-08-23T17:19:00Z

It would be great.

belyaev-mikhail · 2020-08-27T13:41:13Z

So, #86 is the first version of the pull request.
I've done some benchmarking (not the stuff I do often, so critique is welcome) and these are preliminary results:

https://docs.google.com/spreadsheets/d/1NaMgasBhqyThazU-_TVWAK2fCcnztAQ9WCRxr9lNfUw/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1lTjcPPfOZTKd4eHHrzrLOoq9SbaNf9-eEx4MCsyzfMU/edit?usp=sharing

These look promising, but there currently are a few pitfalls:

Current implementation doesn't try its extreme to preserve persistense (namely, it does not return the same node if all the children are reference-equal to the previous node), because it's much trickier to do for addAll/putAll efficiently;
I still struggle to design a solution for ordered maps/sets, because addAll cannot be implemented using HashMap putAll due to different conflict behaviour (putAll replaces older values, while addAll should retain older values, but edit the new ones);
The whole DeltaCounter business may seem sloppy and less Kotlin'y than your ModificationResult approach, but I found massive boxing for bulk operations too heavy and it's not trivial to implement bulk operations this way (sometimes you modify against the other node, not this);
No mutableAddAll/mutablePutAll for now because I'm not sure if they provide any increased benefit (and, tbh, I still don't fully get how to get around your MutabilityOwnership correctly);
No intersection yet, because that would involve changing the current interfaces.

Feel free to check out and destroy the PR =)

belyaev-mikhail · 2020-08-27T13:45:32Z

Some comments on the benchmarks: I tested two scenarios, typical and worst-case where typical is uniting two non-intersecting sets/maps and worst case is two completely identical, but different, sets.
The choice is due to this approach should not perform well in the second variant (theoretically, it should be the same O(N * logM) as the old mutable approach), which is somewhat true in practice.
No enormous speedup, but it seems decent to me.

I didn't try to construct a benchmark for best-case scenario, (say, two sets with all hashes being different in their first bits) because it has constant complexity, but seems very unrealistic to me.

ilya-g · 2020-08-31T20:38:17Z

Current implementation doesn't try its extreme to preserve persistense (namely, it does not return the same node if all the children are reference-equal to the previous node), because it's much trickier to do for addAll/putAll efficiently;

This would be unfortunate, because there are use cases when a reference equality check is used after an operation to find out whether it has actually changed content (instead of boolean flag that is returned from mutable collection operations).

belyaev-mikhail · 2020-08-31T21:41:40Z

Well, it can be done either by checking array element reference-equality manually after the actual logic, or by doing it on-the-fly using two additional masks. Either way we throw away the array we allocated, but there's no way avoiding it, I guess.
I'll look into it.

belyaev-mikhail · 2020-09-03T13:17:38Z

Ok, so I implemented the array element checking, did some benchmarks and it does not seem to introduce much overhead (or any at all, for that matter).

qurbonzoda · 2020-09-17T21:10:21Z

The asked optimisation was implemented by @belyaev-mikhail and merged to master (690343a).

belyaev-mikhail · 2020-09-17T21:12:53Z

Well, not really) I am still investigating the intersection and other bulk ops and will make it a separate PR, so this is not the end)

belyaev-mikhail · 2020-09-17T21:14:00Z

But maybe opening a separate issue for that will make more sense, too

belyaev-mikhail mentioned this issue Aug 27, 2020

Faster union operations #86

Merged

qurbonzoda closed this as completed Sep 17, 2020

belyaev-mikhail mentioned this issue Oct 19, 2020

Faster bulk operations for non-ordered sets #91

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster union/intersection implementation for sets? #83

Faster union/intersection implementation for sets? #83

belyaev-mikhail commented Aug 18, 2020 •

edited

belyaev-mikhail commented Aug 23, 2020

qurbonzoda commented Aug 23, 2020

belyaev-mikhail commented Aug 27, 2020

belyaev-mikhail commented Aug 27, 2020

ilya-g commented Aug 31, 2020

belyaev-mikhail commented Aug 31, 2020

belyaev-mikhail commented Sep 3, 2020

qurbonzoda commented Sep 17, 2020

belyaev-mikhail commented Sep 17, 2020

belyaev-mikhail commented Sep 17, 2020

Faster union/intersection implementation for sets? #83

Faster union/intersection implementation for sets? #83

Comments

belyaev-mikhail commented Aug 18, 2020 • edited

belyaev-mikhail commented Aug 23, 2020

qurbonzoda commented Aug 23, 2020

belyaev-mikhail commented Aug 27, 2020

belyaev-mikhail commented Aug 27, 2020

ilya-g commented Aug 31, 2020

belyaev-mikhail commented Aug 31, 2020

belyaev-mikhail commented Sep 3, 2020

qurbonzoda commented Sep 17, 2020

belyaev-mikhail commented Sep 17, 2020

belyaev-mikhail commented Sep 17, 2020

belyaev-mikhail commented Aug 18, 2020 •

edited