julep/rfc: proposal for unified Dict get/insert/has interface #12157

vtjnash · 2015-07-15T04:21:17Z

problem statement:

a common idiom when working with Associative objects is to do a set of has / insert / lookup operations:

d = Dict()
if !(key in keys(d))
  d[key] = initial_value
end
value = d[key]

this syntax has the advantage of only using basic operations, for clarity, but it's also 3x slower than necessary since the dict lookup gets repeated three times. that concern has lead to the introduction of a get! method that accepts either an object or function, to insert a value if the lookup fails, so that the above code can be rewritten as:

d = Dict()
value = get!(d, key, initial_value)

or

d = Dict()
value = get!(d, key, () -> initial_value)

however, this implementation is still not necessarily faster, since it involves precomputing initial_value (fine it it's just a constant, but bad if it expensive to compute or creates unnecessary garbage) or invoking a lambda.

the code to support this one code pattern isn't exactly short, requires a fair amount of code duplication, and even a macro definition is provided (https://github.com/JuliaLang/julia/blob/master/base/dict.jl#L632-L690)

proposal

define the Ref operation on a Dict to return an intelligent indexer object. the above code pattern could then be written as:

d = Dict()
bp = Ref(d, key)
if isempty(bp)
  bp[] = initial_value
end
value = bp[]

implementation sketch

type RefDict{K, V, D <: Dict} <: Ref{V}
  d::D
  k::K
  last_htindex::Int
  last_state::Int
end
function RefDict{K,V}(d::Dict{K,V}, k)
  key = convert(K, k)
  isequal(key, k) || throw(ArgumentError("k is not a usable key"))
  return RefDict{K,V,typeof(d)}(d, k, 0, 0)
end
function isempty(r::RefDict)
  if (d.last_state != d.changecounter)
    r.last_htindex = ht_keyindex2(d, key)
    r.last_state = d.changecounter
  end
  return r.last_htindex > 0
end

ref #12035 (comment)

The text was updated successfully, but these errors were encountered:

yuyichao · 2015-07-15T04:25:00Z

Is this the one you briefly mentioned on saturday night during the JuliaCon?

carnaval · 2015-07-15T04:26:22Z

Yep. And yes please (again) !

(btw s/in/haskey/ in your first example)

yuyichao · 2015-07-15T04:27:49Z

Also would this still has benefit over the get! version if we have fast anonymous function?

vtjnash · 2015-07-15T04:33:54Z

Is this the one you briefly mentioned on saturday night during the JuliaCon?

yes. i figured i should finally try to see what it would look like

carnaval · 2015-07-15T04:39:39Z

Also would this still has benefit over the get! version if we have fast anonymous function?

It is more general since you can use the hash table before the insertion. Like

if isempty(bp)
    println("Inserting into ", d)
    bp[] = x
end

We could also still have the lambda version for compactness I guess.

carnaval · 2015-07-15T04:41:39Z

(well in this example I guess you could capture d in the lambda, but my point is you can carry the ref around for some time before committing to the insertion)

JeffBezanson · 2015-07-15T04:43:14Z

Good idea!

I think my main concern would be allocating an extra object for each dictionary operation. It could get quite expensive. Also unfortunately the code for the common idiom is not shorter.

yuyichao · 2015-07-15T04:43:28Z

(well in this example I guess you could capture d in the lambda,

Was just going to say that.

but my point is you can carry the ref around for some time before committing to the insertion)

I see. (not sure if it is good to bring up but I guess it should be as useful as c++ reference.....)

JeffBezanson · 2015-07-15T04:45:09Z

Related: #2108

kmsquire · 2015-07-15T05:04:02Z

I'll point out that @StephenVavasis implemented a Token interface for SortedDicts in DataStructures.jl, which plays a similar roll. This one sounds better, however.

(He even started to use Ref in a similar way, before I discouraged him because it wasn't used yet in Base (sorry, Stephen).)

kmsquire · 2015-07-15T05:06:39Z

my point is you can carry the ref around for some time before committing to the insertion

We might need to be careful about asynchronous updates to the hash table causing it to be resized, especially if we ever get threading.

yuyichao · 2015-07-15T21:03:59Z

We might need to be careful about asynchronous updates to the hash table causing it to be resized, especially if we ever get threading.

I think in the example above ref[] = value basically fall back to ref.dict[ref.key] = value if it detects that the dictionary have changed so we should just be as careful as doing regular dict assignment?

There might be one additional race possibility but IMHO it shouldn't need more care than the dict itself.

carnaval · 2015-07-15T21:08:16Z

Yes, Jameson's proposal handles the single thread rehash case by graceful performance degradation. Multi threaded modification of the same data structure will likely be defined as a race condition => undefined behavior. We'll have to implement a family of concurrent data structures. (the java std lib has state of the art implementation of those things, probably good inspiration).

StephenVavasis · 2015-07-18T21:22:38Z

Since Kevin pinged me on this topic, I guess I can add my two cents about tokens and Ref.

(1) The primary purpose of tokens in my code is to provide a mechanism for stepping through a SortedDict (or SortedSet or SortedMultiSet) in the sort-order of the keys. Obviously, this is not an issue for Dict.

(2) A second purpose of a token is to provide a handle to an entry that can be dereferenced in O(1) time. Again, this is not so much an issue for Dict because the hash-lookup is already supposed to be O(1).

(3) Using a Ref object to point to a spot in the structure where an item would go if it would be inserted later seems like a reasonable idea; as mentioned earlier, there is a performance hit because the Ref object is heap-allocated. However, I am thinking of changing my balanced-tree implementation so that the most recent look-up is cached because often the next lookup is nearby. Couldn't Dict do the same with caching?

(4) Going off on a tangent, I have an issue with the equivalence between haskey(d,k) and in(k, keys(d)) ... see my recent posting here:

JuliaCollections/DataStructures.jl#103

(5) Finally, with regard to the problem of the awkward if-block versus the need to evaluate 'initial_value' in a get! function, wouldn't it be OK to use a macro and to ask anyone who writes an Associative type to support a macro version of get! that solves this problem?

StefanKarpinski · 2015-07-23T14:47:27Z

I don't really like the syntax much. I've proposed adding a syntax for get! like d[k] ?= v. However, that begs the question of what we should lower that to. If we lower it to the form that uses a closure, then we haven't solved the performance issue (although I'm not entirely convinced that this matters since it will be fast eventually, right?), so maybe it could lower to something like what you're proposing. There certainly are cases within the implementation of Dict where having a more usable "intelligent index" type would be helpful in implementing higher level operations.

dbeach24 · 2016-08-29T22:08:38Z

I have an alternate proposed solution to this which I (unwittingly) opened in a separate ticket #18282.

TLDR: What about building a function that handles the "update this value in a dictonary" case, as follows:

update!(f::Function, a::Associative, key, default) = (a[key] = f(get(a, key, default)))

(except that it would be specialized to avoid recomputation of the hash/slot.)

This would allow code like:

update!(x -> x+1, my_counters, key, 0)  # where 0 is the default value for a counter, or
update!(x -> x+value, my_totals, key, 0.0) # where value is accumulated

KristofferC · 2016-08-29T22:23:11Z

Sounds like #15630

KristofferC · 2016-08-29T22:23:41Z

See also https://github.com/KristofferC/UpdateIndex.jl

dbeach24 · 2016-08-29T22:36:05Z

@KristofferC Pardon my ignorance. Do LinearSlow AbstractArrays and Associatives share a common interface / behavior pattern?

EDIT: I think I understand a bit better now from seeing UpdateIndex.jl

KristofferC · 2016-08-30T06:15:56Z

It is similar since you want to "hold on" to the result of the lookup thas has to be made when indexing.

dbeach24 · 2016-08-30T13:23:38Z

@KristofferC Thanks for sharing these links. FWIW, I agree that Julia would benefit from a way to directly overload the various ?= operators. (Python has this capability, and numpy leverages it heavily. Why would Julia want to skip over this functionality?) However, I suppose that discussion belongs on #15630.

I think the dictionary case differs slightly because, unlike arrays, dictionaries are not guaranteed to have values over any particular range of keys, and there is no default value that can be assumed for missing keys, at least not in the general case.

In your proposal:

updateindex!(A, op, s, i...)

op is a binary operator, and
s is the RHS operand.

Could these be combined into a single function, f(x) = op(x, s)? By placing this first, we can permit "do" syntax, and this makes our proposals look very similar, indeed.

updateindex!(f, A, i...)
update!(f, dict, key, default)

Not sure how to deal with the fact that arrays need multiple indices and dicts have just one. Can the array indices be grouped into a tuple, or will this create overhead?

eschnett · 2016-08-30T13:46:04Z

In general, creating the new entry (i.e. what is called default here) might be expensive, or might be a transaction that should only occur when the new value is actually needed. get has a form that accepts a function instead of a default value, and that function is only called when needed. This function is also the first argument. So we can do this:

update!(op, mkdefault, coll, ids...)

where both op and mkdefault are functions, coll is a collection, and ids are indices or keys.

dbeach24 · 2016-08-30T14:55:14Z

@eschnett That's a good point.

Is it confusing to have update! accept two functions?
Which one of them should be available for "do" syntax.
When the default value is not supplied by a function, it typically comes last. Should the two functions really be the first two arguments?

Current proposal would look like this:

update!(init_counter, counters, key) x do
    x + 1
end

What about moving the default function to the end?

update!(counters, key, init_counter) x do
   x + 1
end

Edit: I like the way update!(counters...) reads in this last example. It clearly states which collection is being updated. (But of course that only holds if you use the "do" syntax.)

Edit 2: Could we support both cases where the default is an object or a function? i.e.

update!(f::Function, coll::Associative, key, default)
update!(f::Function, coll::Associative, key, default::Function)

...or is that an abuse of overloading?

martinholters · 2016-08-30T15:13:17Z

Could we support both cases where the default is an object or a function

For a dict holding functions as values, that might give surprising/undesired behavior.

dbeach24 · 2016-08-30T15:24:12Z

@martinholters Thank you for speaking up, I was looking for someone who might disagree.

Does that use case compel us to move the default function to another argument position? (arguably less clear and less consistent?)

Of course, if someone was keeping a dict of functions, they could still write:

update!(mydict, key, () -> my_default_value_which_is_a_function) do x
    decorate_function(x)
end

algorithmx · 2021-02-24T01:49:36Z

See also https://github.com/KristofferC/UpdateIndex.jl

Hi Kristofer, are there any follow-up improvements on this ? I am hitting a performance barrier now. The bottleneck is to count the number of occurrence of various patterns in a "string". I used a Dict{pattern_type,Int} to do the job, and I wish to make it faster. What is the state-of-art technique for that ? Otherwise I'll just patch my code with this one.

yuyichao mentioned this issue Sep 10, 2015

Get dict value as a nullable #13055

Open

carnaval mentioned this issue Sep 11, 2015

?= "get or set" operator #2108

Open

yuyichao mentioned this issue Dec 19, 2015

UndefRefError with Dicts and finalizers #14445

Closed

KristofferC mentioned this issue Feb 1, 2016

Current method of caching arrays seems to be inefficient in some scenarios. JuliaDiff/ForwardDiff.jl#92

Closed

mbauman mentioned this issue Mar 26, 2016

RFC: Facilitate not paying double lookup cost when mutating a value in a LinearSlow AbstractArray #15630

Open

yuyichao mentioned this issue Aug 29, 2016

Update dictionary value without recomputing hash #18282

Closed

ViralBShah added the kind:julep Julia Enhancement Proposal label Aug 30, 2016

kmsquire mentioned this issue Nov 2, 2017

Define a token API for Associative types #24454

Open

vtjnash mentioned this issue Mar 1, 2018

keys, eachindex and linearindices consistency #25901

Open

andyferris mentioned this issue Oct 6, 2020

Add modify! function for lookup/update/insert/delete in one go #33758

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

julep/rfc: proposal for unified Dict get/insert/has interface #12157

julep/rfc: proposal for unified Dict get/insert/has interface #12157

vtjnash commented Jul 15, 2015

yuyichao commented Jul 15, 2015

carnaval commented Jul 15, 2015

yuyichao commented Jul 15, 2015

vtjnash commented Jul 15, 2015

carnaval commented Jul 15, 2015

carnaval commented Jul 15, 2015

JeffBezanson commented Jul 15, 2015

yuyichao commented Jul 15, 2015

JeffBezanson commented Jul 15, 2015

kmsquire commented Jul 15, 2015

kmsquire commented Jul 15, 2015

yuyichao commented Jul 15, 2015

carnaval commented Jul 15, 2015

StephenVavasis commented Jul 18, 2015

StefanKarpinski commented Jul 23, 2015

dbeach24 commented Aug 29, 2016 •

edited

Loading

KristofferC commented Aug 29, 2016

KristofferC commented Aug 29, 2016

dbeach24 commented Aug 29, 2016 •

edited

Loading

KristofferC commented Aug 30, 2016

dbeach24 commented Aug 30, 2016 •

edited

Loading

eschnett commented Aug 30, 2016

dbeach24 commented Aug 30, 2016 •

edited

Loading

martinholters commented Aug 30, 2016

dbeach24 commented Aug 30, 2016

algorithmx commented Feb 24, 2021

julep/rfc: proposal for unified Dict get/insert/has interface #12157

julep/rfc: proposal for unified Dict get/insert/has interface #12157

Comments

vtjnash commented Jul 15, 2015

problem statement:

proposal

implementation sketch

yuyichao commented Jul 15, 2015

carnaval commented Jul 15, 2015

yuyichao commented Jul 15, 2015

vtjnash commented Jul 15, 2015

carnaval commented Jul 15, 2015

carnaval commented Jul 15, 2015

JeffBezanson commented Jul 15, 2015

yuyichao commented Jul 15, 2015

JeffBezanson commented Jul 15, 2015

kmsquire commented Jul 15, 2015

kmsquire commented Jul 15, 2015

yuyichao commented Jul 15, 2015

carnaval commented Jul 15, 2015

StephenVavasis commented Jul 18, 2015

StefanKarpinski commented Jul 23, 2015

dbeach24 commented Aug 29, 2016 • edited Loading

KristofferC commented Aug 29, 2016

KristofferC commented Aug 29, 2016

dbeach24 commented Aug 29, 2016 • edited Loading

KristofferC commented Aug 30, 2016

dbeach24 commented Aug 30, 2016 • edited Loading

eschnett commented Aug 30, 2016

dbeach24 commented Aug 30, 2016 • edited Loading

martinholters commented Aug 30, 2016

dbeach24 commented Aug 30, 2016

algorithmx commented Feb 24, 2021

dbeach24 commented Aug 29, 2016 •

edited

Loading

dbeach24 commented Aug 29, 2016 •

edited

Loading

dbeach24 commented Aug 30, 2016 •

edited

Loading

dbeach24 commented Aug 30, 2016 •

edited

Loading