
WIP: try ordered Dict representation #10116

Closed · wants to merge 1 commit

Conversation

JeffBezanson
Sponsor Member

This implements the representation mentioned in #10092, storing keys and values in dense ordered arrays, with a sparse(r) index vector. There are a couple TODOs in the code. This is expected to be a bit slower for deletion-heavy workloads, but I haven't benchmarked that yet.

My benchmark code is at https://gist.github.com/JeffBezanson/ea224aa45e2242ca3ee9 .
I took some benchmarks from past Dict performance issues, and added a couple more. The performance characteristics are interesting. Highlights:

  • Simply iterating over a Dict is 10x faster, since I decided to make the start function squeeze out deleted items first.
  • rehash! doesn't need to reallocate the keys and values arrays at all in theory, but we might need to in order to fix the problem of finalizers that remove keys during rehashing.
  • Seems to be faster for String keys, but unfortunately much slower for Int keys.

Profiling indicates that a huge amount of insertion time is spent on the line

        elseif si > 0 && isequal(key, keys[si])

My best guess is that this is because keys[si] is random access, where before we iterated through the slots and keys arrays in order. But if pointer indirection is involved anyway (e.g. String keys) we get a significant net speedup.
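The layout described above can be sketched like this (a minimal illustration with assumed struct and field names; this is not the code in the PR, and resizing, deletion, and rehashing are omitted for brevity):

```julia
# Hypothetical sketch of the ordered representation: keys and values live in
# dense, insertion-ordered arrays; `slots` is a sparser index mapping hash
# buckets to 1-based positions in those arrays (0 = empty bucket).
mutable struct SketchOrderedDict{K,V}
    slots::Vector{Int}
    keys::Vector{K}
    vals::Vector{V}
end

SketchOrderedDict{K,V}() where {K,V} =
    SketchOrderedDict{K,V}(zeros(Int, 64), K[], V[])

# Map a key's hash to a starting bucket in `slots`.
bucket(h::SketchOrderedDict, key) = Int(hash(key) % length(h.slots)) + 1

function Base.setindex!(h::SketchOrderedDict, v, key)
    index = bucket(h, key)
    while true
        si = h.slots[index]
        if si == 0                          # empty bucket: append a new entry
            push!(h.keys, key); push!(h.vals, v)
            h.slots[index] = length(h.keys)
            return h
        elseif isequal(key, h.keys[si])     # the random access profiled above
            h.vals[si] = v
            return h
        end
        index = index == length(h.slots) ? 1 : index + 1   # linear probing
    end
end

function Base.getindex(h::SketchOrderedDict, key)
    index = bucket(h, key)
    while true
        si = h.slots[index]
        si == 0 && throw(KeyError(key))
        isequal(key, h.keys[si]) && return h.vals[si]
        index = index == length(h.slots) ? 1 : index + 1
    end
end

Base.length(h::SketchOrderedDict) = length(h.keys)

# Iteration just walks the two dense arrays in insertion order, which is why
# iterating is so much faster than probing a slots table.
Base.iterate(h::SketchOrderedDict, i::Int = 1) =
    i > length(h.keys) ? nothing : (h.keys[i] => h.vals[i], i + 1)
```

The `isequal(key, h.keys[si])` check is exactly the random-access pattern blamed above for the Int-key slowdown: the probe walks `slots` sequentially, but each comparison jumps to an arbitrary position in `keys`.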

@StefanKarpinski
Sponsor Member

Sorry, didn't see the PR before writing that comment. This is pretty fascinating. If it's a win whenever there's pointer indirection, it's clearly a good tradeoff for Python, where that's almost always the case.

@JeffBezanson
Sponsor Member Author

Really big gains are possible if you need the ordering feature. For example with this change, setdiff can be rewritten as follows:

function setdiff(a, b)
    args_type = promote_type(eltype(a), eltype(b))
    bset = Set(b)
    seen = Set{args_type}()
    for a_elem in a
        !in(a_elem, bset) && push!(seen, a_elem)
    end
    collect(seen)
end

And I get:

setdiff:
elapsed time: 0.074992577 seconds (42 MB allocated, 2.35% gc time in 2 pauses with 0 full sweep)

which is 6x faster than master, and almost 10x faster than 0.3.

@StefanKarpinski
Sponsor Member

This seems like a fairly big win overall – if we can figure out a clever way to make the Int case (immediate keys in general) faster, then I would say that switching to this would be well worth it. Another benefit is that it's one less data structure to have – no need for a separate ordered dict implementation.

@StefanKarpinski
Sponsor Member

So, am I correct that the thing that's slow here for dicts of Ints is calling copy on them? That makes it seem like we really should be copying Dict objects more efficiently – why not just copy their internal arrays directly? Is the idea behind the current implementation of dict copying that rebuilding them is an equivalent amount of work to copying the internals and then rehashing?

That aside, I guess the real point of the benchmark is that inserting into an Int dict is 30% slower and iterating over an Int dict is also slower, making copy, which does both, a total of 70% slower.
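If keys and values are stored densely as discussed in this PR, copying could indeed reduce to duplicating the internal arrays instead of re-inserting every pair. A minimal sketch (the struct and field names here are illustrative assumptions, not Base's actual Dict):

```julia
# Hypothetical: with dense `keys`/`vals` arrays and a sparse `slots` index,
# copying a dict is just three array copies -- no per-entry rehashing.
struct MiniDict{K,V}
    slots::Vector{Int}
    keys::Vector{K}
    vals::Vector{V}
end

# The slot indices stay valid after copying because they point at positions
# in the dense arrays, and `copy` preserves those positions verbatim.
Base.copy(d::MiniDict) = MiniDict(copy(d.slots), copy(d.keys), copy(d.vals))
```

The same trick applies to any hash table whose index refers to positions rather than to the key objects themselves: the hash-dependent state is carried along unchanged.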

@JeffBezanson
Sponsor Member Author

Iterating over any dict, including Int dicts, is 10x faster since it becomes equivalent to iterating over two arrays. Mostly assigning is slower.

Yes, we could definitely copy more efficiently.

@timholy
Sponsor Member

timholy commented Feb 8, 2015

I rarely iterate over Dicts, so I'm curious to know how other workloads fare. For example, a way I often use Dicts is as "flexible sparse matrices," i.e.,

cachekey(i::Int, j::Int) = i <= j ? Pair(i, j) : Pair(j, i)
cache[cachekey(i, j)] = dot(vectors[i], vectors[j])

and sometime later if vectors[j] gets eaten, delete!(cache, key) for all keys associated with j.

I guess I should just check this branch out and benchmark it, but I won't have time for that for the next several days.

@JeffBezanson
Sponsor Member Author

If you send me a test case I will happily benchmark it!

@tkelman
Contributor

tkelman commented Feb 8, 2015

Some of the CI jobs appear to be running out of memory here

@quinnj
Member

quinnj commented Feb 8, 2015

julia> r = Dict(1=>"hey",2=>"ho")
Dict{Int64,ASCIIString} with 2 entries:
  1 => "hey"
  2 => "ho"

julia> r[1]
"hey"

julia> r[2]
"ho"

julia> delete!(r,1)
Dict{Int64,ASCIIString} with 1 entry:Error showing value of type Dict{Int64,ASCIIString}:
ERROR: UndefRefError: access to undefined reference
 in showdict at dict.jl:81
 in writemime at replutil.jl:34
 in display at REPL.jl:105
 in display at REPL.jl:108
 in display at multimedia.jl:149
 in print_response at REPL.jl:127
 in print_response at REPL.jl:112
 in anonymous at REPL.jl:588

julia> r
Dict{Int64,ASCIIString} with 1 entry:Error showing value of type Dict{Int64,ASCIIString}:
ERROR: UndefRefError: access to undefined reference
 in showdict at dict.jl:81
 in writemime at replutil.jl:34
 in display at REPL.jl:105
 in display at REPL.jl:108
 in display at multimedia.jl:149
 in print_response at REPL.jl:127
 in print_response at REPL.jl:112
 in anonymous at REPL.jl:588

@quinnj
Member

quinnj commented Feb 8, 2015

Excited about this!

@bfredl
Contributor

bfredl commented Feb 8, 2015

Interesting work! However, in one of my use cases, dict elements are deleted when going "down" a stack and then added back in reverse order when going back up (a simple back-tracking algorithm). That causes a (worst-case) 3x slow-down here, though I suppose this is an unavoidable consequence of the strict ordering semantics (if an element is removed and then immediately re-added, it must be pushed to the back of the order, right?).
code+data here (sorry for the messy code; it was an assignment where only speed mattered, not readability. Once it was faster than the C++/STL implementation I stopped working on it :) )
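The access pattern being described can be reconstructed roughly as follows (a hypothetical reconstruction, since the linked code is not included here):

```julia
# Hypothetical backtracking workload: entries are deleted while descending
# the stack, then re-inserted in reverse order while unwinding. In an
# insertion-ordered representation each re-inserted key is appended at the
# back of the dense arrays, so this pattern keeps churning deleted slots.
function backtrack_churn!(d::Dict{Int,Int}, n::Int)
    for i in n:-1:1          # descending: delete
        delete!(d, i)
    end
    for i in 1:n             # unwinding: re-insert in reverse order
        d[i] = i
    end
    return d
end

d = Dict(i => i for i in 1:10_000)
backtrack_churn!(d, 10_000)
```

Each delete/re-add round trip in an ordered representation moves the entry to the end of the dense arrays rather than reusing its old slot, which is consistent with the worst-case slowdown reported above.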

@JeffBezanson
Sponsor Member Author

Thanks for the test case, that's very helpful!

@JeffBezanson
Sponsor Member Author

Ok, I was able to get back most of the performance by removing the sizehint, which was causing excess memory allocation and initialization overhead. Seems not to be worth it.

@bfredl
Contributor

bfredl commented Feb 9, 2015

Nice! Still somewhat slower, but I guess this is an extreme case of (structural) mutation relative to access....

@timholy
Sponsor Member

timholy commented Feb 11, 2015

Sorry for being slow, but perhaps this will be useful:

function initialize(N, nnbrs)
    nbrs_list = [Set{Int}() for i = 1:N]
    dp = Dict{Pair{Int,Int},Float64}()
    for i = 1:N
        nnbr = rand(nnbrs)
        nbrs = nbrs_list[i]
        for j = 1:nnbr
            nbr = rand(1:N)
            push!(nbrs, nbr)
            dp[cachekey(i,nbr)] = rand()  # key on the neighbor id so eliminate! can delete it
        end
    end
    nbrs_list, dp
end

function eliminate!(nbrs_list, dp)
    N = length(nbrs_list)
    order = randperm(N)
    for i in order
        nbrs = nbrs_list[i]
        for j in nbrs
            delete!(dp, cachekey(i,j))
        end
        empty!(nbrs)
    end
end

cachekey(i, j) = i <= j ? Pair(i, j) : Pair(j, i)

nnbrs = 1:20
nbrs_list, dp = initialize(5, nnbrs)
eliminate!(nbrs_list, dp)
@time 1

N = 10^5
@time nbrs_list, dp = initialize(N, nnbrs)
@time eliminate!(nbrs_list, dp)

@hayd
Member

hayd commented May 22, 2015

Is it possible this could make 0.4? As this is awesome.

Does the Int case have to be special-cased for performance (e.g. using the old code)? Note: IntSet is special-cased (see #10065).

@kmsquire
Member

If this doesn't make 0.4, it might be worth replacing OrderedDict in
DataStructures.jl with this version.


@JeffBezanson
Sponsor Member Author

I'm starting to think we should go with this. Int keys are arguably not the most important case, and they're easy to special-case if necessary. Being faster for strings, faster iteration, and not needing a separate OrderedDict seem to be worth it.

@timholy
Sponsor Member

timholy commented May 26, 2015

If you try that benchmark I posted, I'd be curious to know how it fares.

@hayd hayd mentioned this pull request May 27, 2015
8 tasks
@JeffBezanson
Sponsor Member Author

The timings are a bit variable, but I got this on master:

   2.582 microseconds (155 allocations: 10845 bytes)
 522.997 milliseconds (1099 k allocations: 184 MB, 19.66% gc time)
 182.668 milliseconds (6 allocations: 781 KB)

and on this branch:

   2.451 microseconds (155 allocations: 10845 bytes)
 502.958 milliseconds (1170 k allocations: 152 MB, 22.28% gc time)
 188.579 milliseconds (6 allocations: 781 KB)

@JeffBezanson
Sponsor Member Author

Best time I've seen on master:

   2.916 microseconds (155 allocations: 10845 bytes)
 457.631 milliseconds (1099 k allocations: 184 MB, 22.73% gc time)
 172.746 milliseconds (6 allocations: 781 KB)

best time here:

   2.634 microseconds (155 allocations: 10845 bytes)
 466.899 milliseconds (1170 k allocations: 152 MB, 23.64% gc time)
 182.558 milliseconds (6 allocations: 781 KB)

I think it's at least safe to conclude there's no major regression.

@timholy
Sponsor Member

timholy commented May 30, 2015

LGTM!

@ScottPJones
Contributor

LGTM too... (and I do plan on using ordered Dicts heavily, so getting this in will be very nice)

@StefanKarpinski
Sponsor Member

I'm all for making this change. Ordered Dicts are very useful. One issue this raises is that Dicts would now be distinguished not just by their contents but also by their ordering. Should two Dicts be considered equal if they have different orders? While making Dicts ordered is non-breaking, changing their equality semantics is not.
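To make the equality question concrete: today `==` on Dicts compares contents only, so insertion order is invisible (a minimal illustration of the current behavior):

```julia
# Two dicts with the same pairs inserted in different orders. Under the
# existing AbstractDict definition, == compares contents and ignores any
# internal ordering -- the question is whether ordered Dicts should change that.
a = Dict{String,Int}(); a["x"] = 1; a["y"] = 2
b = Dict{String,Int}(); b["y"] = 2; b["x"] = 1

a == b   # true under content-based equality
```

Keeping content-based equality would make the ordering purely an iteration property; making equality order-sensitive would be a breaking change, as noted above.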

@ggggggggg
Contributor

Maybe worth merging soon? Seems like the discussion has died down and most people are pro.

@kmsquire
Member

FWIW, this implementation is the one used for OrderedDicts in DataStructures.jl.

@samoconnor
Contributor

Bump.
Can we have this, please?

@StefanKarpinski
Sponsor Member

I agree – we should probably do this and let various Dict APIs take advantage of it.

@samoconnor
Contributor

FWIW I agree with Stefan's agreement to the last "bump".

oxinabox pushed a commit to JuliaCollections/Heaps.jl that referenced this pull request Jan 5, 2018
* JuliaLang/julia#10116
* This is largely copy and paste + rename, fixups
* Offers good speed improvements in OrderedDict iteration,
  small improvements elsewhere
rofinn pushed a commit to JuliaCollections/Tries.jl that referenced this pull request Feb 27, 2018
* JuliaLang/julia#10116
* This is largely copy and paste + rename, fixups
* Offers good speed improvements in OrderedDict iteration,
  small improvements elsewhere
@Moelf
Sponsor Contributor

Moelf commented Oct 20, 2021

Looks like this has bit-rotted. What are people's thoughts after 4 years, given that we have it in DataStructures?

@oscardssmith
Member

IMO, this is a good change but we probably need to remake the branch.

@kmsquire
Member

Note that this implementation was ported to DataStructures.jl (and currently resides in OrderedCollections.jl, which DataStructures.jl re-exports).

@StefanKarpinski
Sponsor Member

I think this is probably a good idea.

@ViralBShah ViralBShah closed this Jan 18, 2022
@ViralBShah ViralBShah deleted the jb/ordereddict branch January 18, 2022 10:03
@Moelf
Sponsor Contributor

Moelf commented Jan 18, 2022

Is this closed because it's implemented somewhere? AFAICT this is still not done, and could/should be done.

@oscardssmith
Member

This is closed due to bit-rot. It still should be done, but it needs a new PR.

@ViralBShah
Member

Can't we just use it from DataStructures?

@oscardssmith
Member

We can copy that implementation, but we still need a new PR to copy it to Base so that all users of Dict get the speedup.

@timholy
Sponsor Member

timholy commented Jan 19, 2022

Tread cautiously about assuming the one in DataStructures is up to snuff, as it was split from Base years ago and I doubt it has seen a lot of development since. JuliaCollections/DataStructures.jl#234 in particular requires investigation.

Labels
performance Must go faster