Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving inference for NamedTuple with Union{T, Missing} fields #25925

Open
nalimilan opened this issue Feb 7, 2018 · 3 comments
Open

Improving inference for NamedTuple with Union{T, Missing} fields #25925

nalimilan opened this issue Feb 7, 2018 · 3 comments
Labels
compiler:inference Type inference domain:missing data Base.missing and related functionality

Comments

@nalimilan
Copy link
Member

nalimilan commented Feb 7, 2018

(This issue as been discussed at length at JuliaData/Missings.jl#6, but since then changes have been applied to Base regarding the handling of Union{T, Missing}. I think it's the last essential change needed to handle missing values efficiently, and to be able to port Query.jl to the Missing type. See also this Discourse thread. Cc: @davidanthoff)

Since #25828, map and broadcast are relatively fast with Union{T, Missing} element types. However they still widen the element type progressively as new types are encountered, which implies copying the already processed values to an array with a wider element type. For collections of Union{T, Missing} elements, this means at most one copy will be made. But for collections of NamedTuple with Union{T, Missing} fields, a copy will have to be made every time a missing value is encountered for a given field, implying up to 2^N-1 (EDIT: actually, N-1) copies (with N the number of fields). Clearly that's not going to fly.

A solution to this problem would be to use inference to determine the element type in advance, and fall back to the current approach when the inferred type isn't useful (Any, or other abstract type which is considered as not precise enough). While this would work already for collections with Union{T, Missing} elements, it needs more changes to work with more complex cases, notably NamedTuple with Union{T, Missing} fields (but also any parametric type).

Let's take a concrete case which is relatively simple: a generator over a vector of NamedTuple elements which returns NamedTuple elements (possibly with some transformations in the middle, but better start without). We would like return_type for first to give a narrower upper bound for the element type than just Any: e.g. NamedTuple{(:a, :b), Tuple{<:Union{Int,Missing}, <:Union{Float64,Missing}}}. The code below shows that we're not very far from this, since the types of the fields are correctly inferred. But when combining the tuple type for the fields with the names into a NamedTuple type, the information is discarded and we get Core.SSAValue(48)::Type{NamedTuple{(:a, :b),_1}} where _1.

I'd appreciate any pointers regarding changes that are necessary to turn the where _1 part into where _1<:Tuple{Union{Missing, Int64},Union{Missing, Float64}}.

julia> x = NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}[(a=1, b=2.0), (a=missing, b=missing), (a=missing, b=3.0), (a=2, b=missing)];

julia> Core.Compiler.return_type(first, Tuple{typeof(x)})
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}

julia> itr1 = ((i.a, i.b) for i in x);

# OK: precise, thanks to tuples covariance
julia> Core.Compiler.return_type(first, Tuple{typeof(itr1)})
Tuple{Union{Missing, Int64},Union{Missing, Float64}}

julia> itr2 = ((a=i.a, b=i.b) for i in x);

# No OK
julia> Core.Compiler.return_type(first, Tuple{typeof(itr2)})
Any

julia> @code_warntype first(itr2)
[...]
      Core.SSAValue(45) = (Base.getfield)(Core.SSAValue(29), :a)::Union{Missing, Int64}
      # meta: pop location
      # meta: location sysimg.jl getproperty 8
      Core.SSAValue(46) = (Base.getfield)(Core.SSAValue(29), :b)::Union{Missing, Float64}
      # meta: pop location
      Core.SSAValue(43) = (Core.tuple)(Core.SSAValue(45), Core.SSAValue(46))::Tuple{Union{Missing, Int64},Union{Missing, Float64}}
      # meta: location boot.jl Type 498
      Core.SSAValue(47) = (Core.typeof)(Core.SSAValue(43))::Type{#s54} where #s54<:Tuple{Union{Missing, Int64},Union{Missing, Float64}}
      Core.SSAValue(48) = (Core.apply_type)(Core.NamedTuple, (:a, :b), Core.SSAValue(47))::Type{NamedTuple{(:a, :b),_1}} where _1
      Core.SSAValue(49) = (Core.SSAValue(48))(Core.SSAValue(43))::Any
      # meta: pop locations (3)
      return Core.SSAValue(49)
  end::Any
@vtjnash
Copy link
Sponsor Member

vtjnash commented Feb 20, 2018

every time a missing value is encountered for a given field, implying up to 2^N-1 copies (with N the number of fields).

The copy is only made once for each column, so it should be approximately O(k) copies, where k is the number of columns. The value O(2^k) (only the highest order term is significant, so we drop the -1) represents instead the amount of work that needs to be done to compile this ahead-of-time and address the JIT pause.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Feb 20, 2018

changes that are necessary to turn

In base/compiler/tfuncs.jl, for the apply_type_tfunc, it's marked with:

       # These blocks improve type info but make compilation a bit slower.

@nalimilan
Copy link
Member Author

Thanks for the pointers, but what is this supposed to do in practice? I've tried uncommenting these lines, but this doesn't appear to make any difference to the NamedTuple example I posted above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:inference Type inference domain:missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests

2 participants