Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaceordereddict #386

Conversation

StephenVavasis
Copy link
Contributor

Addresses problems with making OrderedDict compatible with 0.7 via a complete rewrite.

 On branch newiterationsortedcontainers
 Changes to be committed:
	modified:   ../docs/src/sorted_containers.md
	modified:   DataStructures.jl
	modified:   balanced_tree.jl
	modified:   container_loops.jl
	modified:   ../test/test_sorted_containers.jl

This commit updates container_loops.jl to use the new iteration protocol
(introduced in 0.7.0-DEV). It should be backwards compatible with 0.6.2.

In addition, it fixes a bug in container_loops.jl in which the length()
function when applied to subranges (i.e.,  inclusive(a,b,c) or
exclusive(a,b,c)) returned the length of the whole container instead
of the length of the subrange.  (There should be no value returned
for the length of the subrange since the data structure does not support
an O(1) algorithm or even O(log n) algorithm to compute the length.)

Some other smaller changes in this commit are as follows.
 - IntSet was changed to BitSet (name change in 0.7.0-DEV)
 - Small updates to documentation
 - Some assert statements in balanced_tree.jl that were present
   during development are deleted.
This PR completely rewrites OrderedDict so that it is implemented in
terms of Base containers Dict and Vector instead of the previous
implementation, which was a modified version of dict.jl.  This is to
improve long-term maintainability of the code.  However, there are
performance regressions with this rewrite for which I will open issues
on GitHub.

This PR also reimplements OrderedSet for the new version of OrderedDict.

This PR removes the getindex and setindex! methods for OrderedSet.
These methods were not documented, and it is not clear how they are
supposed to work in the presence of deletions.  It also removes the
findfirst (etc) methods for OrderedSet, since those rely on getindex.

The getindex, setindex! and find tests have been removed from test_ordered_set.

Typos; add using statements to runtests

more typos

Typos

added push!

typo

generalize push

bug

remove push tuple

add "in" for keys; wrong assertion

bug needs odx instead of od

change a line

fix tuple -> pair

another get method

typo in sizehint

Fix typos in 0.6.2 iteration

fixed dict_sorting

fixed dict_sorting again

Remove getindex and setindex from ordered set

fixed next

fixed pop!(c) (single arg)

remove leftover  line

bug in pop!

removed findfirst tests

delete! no longer errors on missing key

no change
@kmsquire
Copy link
Member

First, some history: I wrote the first version of OrderedDict, which was reasonably performant. @JeffBezanson rewrote this version to be more performant, especially for iteration. He originally targeted Julia itself (JuliaLang/julia#10116). I brought it here when it became clear that it wasn't going to be merged there. (I thought I rewrote the commit message for him to be the owner, but it seems my git-foo wasn't up to snuff yet.)

Regarding this version:

  • Ideally, the two commits would be in separate merge requests, especially since they have nothing to do with one another.
  • I appreciate the goal of making the code easier to use and more maintainable. That said, I find quite doubtful that this type of rewrite can ever be as performant as the previous version.
  • That said, I have no time to work on this myself right now, so in order to get things working, I'm okay with having something like this merged. If has the time and inkling to improve it and maintain it in the future, great.
  • If anyone else has the time to look to see if the current implementation might be fixed with a more surgical approach, that would be nice...

Going to review this now.

@StephenVavasis
Copy link
Contributor Author

Thanks for the history! I suspected something like that must have occurred. I will try to rebase these changes on the latest master and split into two PR's. In the next comment I'll post my thoughts on partially recovering performance.

@StephenVavasis
Copy link
Contributor Author

I see three ways to get a more performant OrderedDict beyond what this PR offers.

  1. Go back to the old approach in which the maintainers of Dict pinky-swear that they will also maintain OrderedDict. I see this as unlikely; it appears that the core developers are pushing as much out of Base as possible, presumably in order to be able to more easily delegate responsibility for code maintenance.

  2. Ask for a complete refactoring of dict.jl to split off a high-level interface from a low-level substrate. In principle, this is the best solution, but everyone seems very busy right now.

  3. Try to improve the performance of the code in this PR. There are two separate sources of performance losses, which I'll expand on now.

First, as observed by Martin Holters, if one repeatedly inserts and deletes keys and then carries out many loops over the whole OrderedDict, then the performance of this PR can be arbitrarily worse (i.e., a ratio greater than any constant factor) compared to the old implementation. This can be fixed as follows: Every time a key is added, one checks whether a1 has become more than twice as long as d1, and if so, one runs a compression on a1. This will return the performance ratio to a constant factor in all cases, I believe.

Second, the new code often makes multiple probes (as many as 3) into the underlying Dict for a single operation. To fix this, a token interface to Dict is needed. There is already an open issue for this here: JuliaLang/julia#24454, so maybe we can press for this issue to be addressed.

If these two fixes are applied, then the new code for OrderedDict will still be slower than the old; the ratio will depend on how long hashing takes, but I expect the ratio to range from 1.25 to 2.

@StephenVavasis
Copy link
Contributor Author

This PR is no longer needed because of #448.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants