Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster union/intersection implementation for sets? #83

Closed
belyaev-mikhail opened this issue Aug 18, 2020 · 10 comments
Closed

Faster union/intersection implementation for sets? #83

belyaev-mikhail opened this issue Aug 18, 2020 · 10 comments

Comments

@belyaev-mikhail
Copy link
Contributor

belyaev-mikhail commented Aug 18, 2020

Current union (addAll) implementation for sets is O(logN * M) as far as I understand even when both sides have the same implementation.
It is certainly possible to improve that to log or even sub-log implementation for HAMTs by merging one-bit arrays only.
It is also possible to improve intersection, although atm you don't even have a method for intersection in the API.

Not sure if it helps maps in any way, but I guess it does.
While intersection is questionable, union is a very important and widely used operation (especially so when + is overloaded as union) for persistent sets and implementing this doesn't even need changing the api.

@belyaev-mikhail
Copy link
Contributor Author

I'll work on a PR if you find this idea useful

@qurbonzoda
Copy link
Contributor

It would be great.

@belyaev-mikhail
Copy link
Contributor Author

So, #86 is the first version of the pull request.
I've done some benchmarking (not the stuff I do often, so critique is welcome) and these are preliminary results:

https://docs.google.com/spreadsheets/d/1NaMgasBhqyThazU-_TVWAK2fCcnztAQ9WCRxr9lNfUw/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1lTjcPPfOZTKd4eHHrzrLOoq9SbaNf9-eEx4MCsyzfMU/edit?usp=sharing

These look promising, but there currently are a few pitfalls:

  • Current implementation doesn't try its extreme to preserve persistense (namely, it does not return the same node if all the children are reference-equal to the previous node), because it's much trickier to do for addAll/putAll efficiently;
  • I still struggle to design a solution for ordered maps/sets, because addAll cannot be implemented using HashMap putAll due to different conflict behaviour (putAll replaces older values, while addAll should retain older values, but edit the new ones);
  • The whole DeltaCounter business may seem sloppy and less Kotlin'y than your ModificationResult approach, but I found massive boxing for bulk operations too heavy and it's not trivial to implement bulk operations this way (sometimes you modify against the other node, not this);
  • No mutableAddAll/mutablePutAll for now because I'm not sure if they provide any increased benefit (and, tbh, I still don't fully get how to get around your MutabilityOwnership correctly);
  • No intersection yet, because that would involve changing the current interfaces.

Feel free to check out and destroy the PR =)

@belyaev-mikhail
Copy link
Contributor Author

Some comments on the benchmarks: I tested two scenarios, typical and worst-case where typical is uniting two non-intersecting sets/maps and worst case is two completely identical, but different, sets.
The choice is due to this approach should not perform well in the second variant (theoretically, it should be the same O(N * logM) as the old mutable approach), which is somewhat true in practice.
No enormous speedup, but it seems decent to me.

I didn't try to construct a benchmark for best-case scenario, (say, two sets with all hashes being different in their first bits) because it has constant complexity, but seems very unrealistic to me.

@ilya-g
Copy link
Member

ilya-g commented Aug 31, 2020

Current implementation doesn't try its extreme to preserve persistense (namely, it does not return the same node if all the children are reference-equal to the previous node), because it's much trickier to do for addAll/putAll efficiently;

This would be unfortunate, because there are use cases when a reference equality check is used after an operation to find out whether it has actually changed content (instead of boolean flag that is returned from mutable collection operations).

@belyaev-mikhail
Copy link
Contributor Author

Well, it can be done either by checking array element reference-equality manually after the actual logic, or by doing it on-the-fly using two additional masks. Either way we throw away the array we allocated, but there's no way avoiding it, I guess.
I'll look into it.

@belyaev-mikhail
Copy link
Contributor Author

Ok, so I implemented the array element checking, did some benchmarks and it does not seem to introduce much overhead (or any at all, for that matter).

@qurbonzoda
Copy link
Contributor

The asked optimisation was implemented by @belyaev-mikhail and merged to master (690343a).

@belyaev-mikhail
Copy link
Contributor Author

Well, not really) I am still investigating the intersection and other bulk ops and will make it a separate PR, so this is not the end)

@belyaev-mikhail
Copy link
Contributor Author

But maybe opening a separate issue for that will make more sense, too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants