Not sure how best to handle `missing` #22

ssfrr · 2020-02-06T21:18:39Z

I'm not actually sure where the right place on the stack is to fix this, because it seems to cut across several layers.

Here's an example. Say I want the mean of each group, ignoring missing values. Notice that for the foo=2 group, both elements are missing.

using SplitApplyCombine
using Statistics  # provides mean

table = [(foo=1, bar=rand()),
         (foo=2, bar=missing),
         (foo=3, bar=rand()),
         (foo=1, bar=missing),
         (foo=2, bar=missing),
         (foo=3, bar=rand())]
map(mean ∘ skipmissing, group(r->r.foo, r->r.bar, table))

This throws the error MethodError: no method matching zero(::Type{Any})

This is the result of a cascade of things, most of which seem pretty reasonable in isolation, which is why it's not clear (to me, anyway) what the right fix is:

- mean doesn't know how to handle an empty array Any[]. I don't think there's anything more reasonable for mean to do here.
- table doesn't have useful type information (see type promotion of missing inside tuples, JuliaLang/julia#31077).
- group seems to set the type of the dictionary elements based on the eltype of table.

I'm not sure if there's a good resolution to this. Even if group built up the groups iteratively rather than pre-allocating, for a group with only missings it would end up with an Array{Missing}, which still doesn't help mean figure out what a reasonable answer is.
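To make the role of the element type concrete, here is a minimal sketch using only Base and Statistics. The variable names (`typed`, `untyped`, `m_typed`, `threw`) are made up for illustration. It compares an all-missing group whose element type is known with one whose element type is `Any`:

```julia
using Statistics

# An all-missing group whose element type still records Float64.
typed = Union{Missing, Float64}[missing, missing]

# The same values with the element type erased, as in the issue.
untyped = Any[missing, missing]

# With a known element type, the empty mean falls back to zero(Float64) / 0.
m_typed = mean(skipmissing(typed))
println(isnan(m_typed))

# With eltype Any there is no zero to start from, so the call throws.
threw = try
    mean(skipmissing(untyped))
    false
catch
    true
end
println(threw)
```

The exact exception type may differ between Julia versions, so the sketch only checks that an error is raised.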

My current workaround is to re-inject the type information, but it took some digging to figure out what the actual problem was, and is not pretty:

map(c->mean(Vector{Float64}(collect(skipmissing(c)))), group(r->r.foo, r->r.bar, table))

Another workaround is setting the type of table explicitly:

table = NamedTuple{(:foo, :bar), Tuple{Int64, Union{Missing,Float64}}}[
    (foo=1, bar=rand()),
    ...

But that gets pretty verbose.

Any thoughts as to the best way to handle this?


andyferris · 2020-02-06T23:02:11Z

Yes, you are completely right. This is difficult. The difficulty arises from the interaction of lots of little aspects of the system: the types, inference, the standard library and Base's missing, SplitApplyCombine not paying special attention to missing, the lack of composability of Statistics.mean when you want to skip missing values, and so on.

If you are looking for a quick solution, you can try:

julia> (((count, sum),) -> sum / count).(groupreduce(r->r.foo, r->coalesce(r.bar, 0.0), ((count, sum), bar) -> (count + 1, sum + bar), table, init = (0, 0.0)))
3-element Dictionaries.HashDictionary{Any,Any}
 2 │ 0.0
 3 │ 0.567196074236143
 1 │ 0.04591996082297756
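Note that coalescing `missing` to 0.0 while still incrementing the count treats a missing value as a zero rather than skipping it. As a rough sketch of the same count-and-sum idea that skips missing values instead, here is a version using only Base and a plain `Dict`. The function name `groupmean_skipmissing` and the sample values are hypothetical, and this is not SplitApplyCombine's implementation:

```julia
# Hypothetical helper: per-group mean of f(x), ignoring missing values.
function groupmean_skipmissing(by, f, itr)
    acc = Dict{Any, Tuple{Int, Float64}}()  # key => (count, sum)
    for x in itr
        k = by(x)
        n, s = get(acc, k, (0, 0.0))
        v = f(x)
        # Only non-missing values contribute to the count and the sum.
        acc[k] = ismissing(v) ? (n, s) : (n + 1, s + v)
    end
    # A group with no non-missing values yields 0.0 / 0 == NaN.
    return Dict(k => s / n for (k, (n, s)) in acc)
end

sample = [(foo=1, bar=0.5),
          (foo=2, bar=missing),
          (foo=3, bar=1.0),
          (foo=1, bar=missing),
          (foo=2, bar=missing),
          (foo=3, bar=3.0)]

result = groupmean_skipmissing(r -> r.foo, r -> r.bar, sample)
```

Seeding the accumulator with a `Float64` sum is what re-injects the type information, so the all-missing group yields NaN instead of an error.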

It's not particularly elegant though! And it's still slow due to type instability of table. You might be able to fix that with:

julia> table = NamedTuple{(:foo, :bar), Tuple{Int, Union{Missing, Float64}}}[(foo=1, bar=rand()),
                (foo=2, bar=missing),
                (foo=3, bar=rand()),
                (foo=1, bar=missing),
                (foo=2, bar=missing),
                (foo=3, bar=rand())]
6-element Array{NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}},1}:
 NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((1, 0.07490415420165197))
 NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((2, missing))            
 NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((3, 0.1651018454906743)) 
 NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((1, missing))            
 NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((2, missing))            
 NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((3, 0.23729296064354855))

but that's not reasonable, and it's very much not recommended for large named tuples with more than one or two Union{T, Missing} fields.

> Any thoughts as to the best way to handle this?

I'm not certain of the best way to improve this on the library side.

Well, I suppose that SplitApplyCombine functions could implement type widening when growing collections. group and so on could have a built-in skipmissing flag. I dunno. There are still multiple challenges outside of SplitApplyCombine that could be resolved too, so that this can remain a simple, composable library.

tkf · 2020-05-22T21:15:32Z

> Well, I suppose that SplitApplyCombine functions could implement type widening when growing collections.

I think it should be easy once you have a mutate-or-widen interface for Dictionaries.jl. Here is an example with a plain Dict:

using BangBang
using BangBang.Experimental: modify!!
using BangBang.NoBang: SingletonVector
using InitialValues
using InitialValues: InitialValue

function groupreduce_bb(by, f, op, itr; init = Init(op))
    acc = foldl(itr; init = Dict{Union{},Union{}}()) do acc, x
        acc, _ = modify!!(acc, by(x)) do iacc
            Some(op(something(iacc, init), f(x)))
        end
        return acc
    end
    if init isa InitialValue && InitialValue <: valtype(acc)
        return Dict(k => v for (k, v) in acc if v !== init)
    else
        return acc
    end
end

# Group by wrapping each value in a one-element vector and concatenating with append!!.
group_bb(by, f, itr) =
    groupreduce_bb(by, x -> SingletonVector(f(x)), append!!, itr; init = Init(append!!))

julia> group_bb(r->r.foo, r->r.bar, table)
Dict{Int64,AbstractArray{T,1} where T} with 3 entries:
  2 => [missing, missing]
  3 => [0.0756208, 0.745847]
  1 => Union{Missing, Float64}[0.37734, missing]

(OK, "easy" in the sense that it's possible now that I fixed a bug: JuliaFolds/BangBang.jl#145)
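For readers unfamiliar with the mutate-or-widen idea, here is a minimal sketch using only Base. The function name `push_widen` is hypothetical and this is not how BangBang.jl implements `push!!`. It mutates the vector in place when the new element fits the element type, and otherwise copies into a vector with a wider element type:

```julia
# Hypothetical helper illustrating mutate-or-widen for vectors.
function push_widen(xs::Vector, x)
    T = eltype(xs)
    if x isa T
        return push!(xs, x)  # fits: mutate in place
    end
    # Does not fit: copy into a vector whose eltype also admits typeof(x).
    ys = Vector{Union{T, typeof(x)}}(xs)
    return push!(ys, x)
end

xs = push_widen(Float64[], 0.4)  # still a Vector{Float64}
ys = push_widen(xs, missing)     # widened copy; xs is left unchanged
```

The caller must always use the returned value, since the result may or may not be the same object as the input. That is the same contract the `!!` functions in the example above follow.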

Transducers.jl uses a similar strategy but in a more thread-friendly way.

> group and so-on could have an in-built skipmissing flag.

I don't think adding a flag is the best strategy in terms of composability. I think this is where you really need transducers. SplitApplyCombine.groupreduce kind of already works with Transducers.jl but it's a bit tricky to do this for now:

julia> using SplitApplyCombine
       using Transducers
       using OnlineStats: Mean

julia> rf0 = reducingfunction(Mean())
       rf = reducingfunction(NotA(Missing), rf0)
       groupreduce(r->r.foo, r->r.bar, rf, table; init = Init(rf0))
3-element Dictionaries.HashDictionary{Any,Union{InitialValues.InitialValueOf{Transducers.OnlineStatReducingFunction{Mean{Float64,OnlineStatsBase.EqualWeight}}}, Mean{Float64,OnlineStatsBase.EqualWeight}}}
 2 │ Init(::Transducers.OnlineStatReducingFunction{Mean{Float64,OnlineStatsBase.EqualWeight}})
 3 │ Mean: n=2 | value=0.410734
 1 │ Mean: n=1 | value=0.37734

I think it should be possible to make it work more easily, without adding Transducers.jl-aware code in SplitApplyCombine.jl, if groupreduce is implemented with the widening strategy. The integration can be much smoother if you use InitialValues.jl, though.

ssfrr mentioned this issue Feb 7, 2020

Customize the Dict type for group #18

Open

nalimilan mentioned this issue Apr 17, 2020

dataframerow related docstrings JuliaData/DataFrames.jl#2196

Merged

tkf mentioned this issue May 23, 2020

Support GenericInitialValue (InitialValues.INIT) JuliaFolds/Transducers.jl#288

Merged
