
Revisiting representation of missing values #133

simonster opened this issue Dec 21, 2014 · 19 comments


@simonster
Member

Moving discussion about how to represent missing values in DataArrays from JuliaLang/julia#9363 and JuliaData/DataFrames.jl#745 here. I've done some benchmarks here that show that, with the best code I can generate at the moment, NaN sentinels > Vector{Bool} > BitVector > Vector{Float64}.
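
For concreteness, here is a minimal sketch of the kind of vector-sum kernels being compared (illustrative only, not the linked benchmark code; `mask[i]` is assumed true where an entry is missing):

# NaN sentinel: missingness is encoded in the data itself.
function sum_nan(data::Vector{Float64})
    s = 0.0
    @inbounds for i in 1:length(data)
        x = data[i]
        if x == x          # false only for NaN, i.e. a missing entry
            s += x
        end
    end
    s
end

# Separate mask: covers both Vector{Bool} and BitVector.
function sum_masked(data::Vector{Float64}, mask::AbstractVector{Bool})
    s = 0.0
    @inbounds for i in 1:length(data)
        if !mask[i]
            s += data[i]
        end
    end
    s
end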

@johnmyleswhite
Member

This is the takeaway that I had from https://groups.google.com/d/msg/julia-dev/yIv6ZNbwHxk/WkUl8B1GQWoJ

In my ideal world, we'd define a set of appropriate abstractions such that we're not locked into any particular implementation.

@simonster
Member Author

Unfortunately, as my benchmarks show, it's possible to extract reasonable performance out of the BitArray in this benchmark, but abstracting that would basically require using a macro for looping so that we can unroll it 64x. Maybe we could get LLVM to optimize this on its own by specifying loop unrolling metadata, but we don't currently have a facility for that in Julia. The pain involved in getting good performance out of BitArrays is definitely a point against them, though, and maybe we should switch to NaNs for FloatingPoint and Array{Bool} for other types if we want it to be easy to get good performance in user code.
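
To illustrate what that 64x unrolling is working around, here is a rough sketch that reads the BitVector's internal chunks field directly (an implementation detail, not a public API, which is exactly the abstraction problem described above):

function sum_bitvector(data::Vector{Float64}, mask::BitVector)
    s = 0.0
    chunks = mask.chunks                # one 64-bit word per 64 mask bits
    @inbounds for c in 1:length(chunks)
        chunk = chunks[c]
        lo = (c - 1) * 64 + 1
        hi = min(lo + 63, length(data))
        if chunk == 0
            # fast path: no missing values anywhere in this block of 64
            for i in lo:hi
                s += data[i]
            end
        else
            for i in lo:hi
                if (chunk >> (i - lo)) & 1 == 0   # bit clear => present
                    s += data[i]
                end
            end
        end
    end
    s
end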

@johnmyleswhite
Member

I'm not really fond of switching to NaN for FloatingPoint, since it's totally reasonable for a database to store both NaN and NULL. How much performance do we really gain from using NaN? If NaNs aren't more than twice as fast, it's hard for me to see why we should special-case floating point numbers.

@simonster
Member Author

We can use a different bit pattern for a NULL NaN, as @tshort commented in DataArrays.jl#5.
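
As a sketch of what a dedicated NULL NaN could look like (the payload 1954 echoes R's NA_real_; the exact value is an arbitrary illustrative choice, and isnullvalue is a hypothetical helper):

const NULL_BITS = 0x7ff80000000007a2        # quiet NaN with payload 1954
const NULL_F64  = reinterpret(Float64, NULL_BITS)

# Hardware-generated NaNs carry a zero payload, so they stay distinguishable:
isnullvalue(x::Float64) = reinterpret(Uint64, x) == NULL_BITS

isnan(NULL_F64)          # true: still an ordinary NaN to the FPU
isnullvalue(0.0 / 0.0)   # false: a computational NaN, not a missing value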

I think 25% slower is about the cutoff where there'd be reason to have DataArrays that encode missingness as NaNs. It's no big deal if your code runs in 200 ms instead of 100 ms, but if it takes 12 hours instead of 6 then you might begin to care :). OTOH we don't necessarily need to use them by default, if we can make our code abstract enough, which doesn't seem too hard if NaNs and Array{Bool} are the only things we're optimizing for.

In any case, I think your idea of benchmarking a representative list of operations is a good one. I propose:

  • Vector sum (done)
  • Element-wise addition
  • Mean of matrix rows

@johnmyleswhite
Member

I'm pretty uncomfortable assuming that data sources will never contain the NaN bit pattern that we choose.

It's also not clear to me that people should ever be doing "complex" calculations on DataArray{Float64}. It seems much safer to make it possible to get access to a modified copy of the memory of DataArray with NaN inserted where appropriate. The way GLM produces design matrices is my typical example.

@simonster
Member Author

There are 22 bits of payload for Float32 NaNs and 51 bits of payload for Float64 NaNs, and I don't think the CPU ever sets them to anything but zero (except for maybe the signalling bit?). I think the probability that someone uses a specific NaN pattern the hardware doesn't generate, doesn't want that NaN to represent a missing value, and happens to pick the same NaN that we do is pretty low.

I do some computations with multiple-gigabyte arrays with some missing values, although generally the first thing I do is pick a subset and drop chunks that have missing values. I also occasionally compute the mean or variance across a dimension of ~100 MB arrays with missing values. It would be nice to use DataArrays for these cases without a substantial performance cost over NaNs.

@johnmyleswhite
Member

I don't want to be the only one arguing against NaN, so I'm willing to budge, but I think the 2x gain is pretty small compared to all of the other places we could focus performance work on. But I'm also not in your boat: essentially no data I interact with in daily work is a floating point number, so the gains are especially unimportant for me and I may therefore undervalue them.

@vchuravy

To chime in on the topic of NaN: when I first encountered DataArrays I particularly liked that the design didn't include NaN as a sentinel value for missingness. If I want to perform numerical operations on my data I have to decide how to deal with missing values, and I like that they are not conflated with numerical error signals.

If I understand the current design correctly, there is a BitArray that signals missingness in a data array, and the current discussion is about improving the performance of this approach? I can see the need to special-case the operations over floats, but would it maybe be possible to just store the sentinel value when setting a value to be missing and otherwise use the more general approach? Then numerically expensive calculations could work on the underlying data, and everybody else (for me, especially DB interfaces) would work via the general interface and would not have to know about the special-casing of floats and yet another sentinel value.

How would the Nullable approach factor into this debate?

@nalimilan
Member

I was going to suggest something like what @vchuravy proposed. That would simplify the design, since fallback functions would be guaranteed to work; specialized functions for floats would then only be optimizations to be added when needed. Also, zero-copy interoperability with R would be ensured that way.

That would come at the cost of setting missing values to a sentinel on initialization and when setting a value to null, probably not a big deal.

@johnmyleswhite
Member

Honestly, the fact that any other person in the world but me dislikes using NaN makes me want to not start special-casing Float32/Float64. I think it'll take away intellectual resources from making all versions of NullableArray fast because the lessons we learn about the FloatingPoint case won't generalize.

@vchuravy, here's an overview of how I see things moving:

  • Base now provides a scalar conception of missing values, which is Nullable{T}. From this, Base can derive Array{Nullable{T}}, but using Array{Nullable{T}} makes it hard to pass the raw data of type T around to BLAS, LAPACK, etc.
  • To get around that, we introduce NullableArray{T}, which is a special case of the array-of-structs to struct-of-arrays transform that we don't know how to add in full generality to Julia just yet. Instead of storing an array of composites, each of which contains a Bool field plus a field of type T, we switch to storing two arrays: a BitArray or Array{Bool} and an Array{T}. In cases where there are no missing values, this gives zero-copy access to raw data. It's largely just a hack, but it may prove worth it. (Whether it's worth it depends largely on how many operations you want to perform on NullableArray objects.)
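
A minimal sketch of that struct-of-arrays layout (field names are illustrative, not the actual NullableArray implementation):

type NullableArray{T, N} <: AbstractArray{Nullable{T}, N}
    values::Array{T, N}       # raw data, passable straight to BLAS/LAPACK
    isnull::Array{Bool, N}    # parallel mask: true where the entry is missing
end

Base.size(a::NullableArray) = size(a.values)

# Scalar indexing reassembles a Nullable{T} on the fly:
Base.getindex{T}(a::NullableArray{T}, i::Int) =
    a.isnull[i] ? Nullable{T}() : Nullable(a.values[i])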

Does that clear things up? Basically, you just replace Array{Nullable{T}}, which is problematic, with something that is abstractly equivalent but laid out in memory in a better way.

@vchuravy

@johnmyleswhite Thanks for the explanation. I imagine that the Array{T} part in a NullableArray{T} has to be allocated fully? Otherwise zero-copy access to the raw data would hardly be possible.

Instead of leaving the memory region corresponding to a missing value as zero or a random value, we could store a sentinel value there. All operations can then decide how to deal with missingness in any way they want.

This could be made general to user-defined types by defining methods

sentinel{T}(::Type{T}) = empty(T)
sentinel(::Type{Float64}) = SpecialNaN

where empty(T) would correspond to an uninitialised value of T.

and setindex! could be defined as:

function setindex!{T}(a::NullableArray{T}, x, idx...)
    if isnull(x)
        a.data[idx...] = sentinel(T)
        a.bitmap[idx...] = false
    else
        a.data[idx...] = x
        a.bitmap[idx...] = true
    end
end

Setting values to null in a NullableArray would be slightly more expensive, and creation of a NullableArray would certainly be more expensive than before.

I am quite sure that this simple approach has problems of its own, but it might be worthwhile to explore.

@johnmyleswhite
Member

That could work.

Once upon a time, we did something very much like that by defining zero(T) for every type that got stored in a DataArray. I found it really unpleasant to work with, since you'd try to do something reasonable like put a Date in a DataArray and then you'd suddenly get stuck writing boilerplate to implement zero(T).

That said, I'm not totally opposed to the idea, but it's not clear to me that it's worth doing for any types other than Float32 and Float64.

@johnmyleswhite
Member

And, yes, we do allocate the Array{T} part every time we create a DataArray:

type DataArray{T, N} <: AbstractDataArray{T, N}
    data::Array{T, N}    # always fully allocated
    na::BitArray{N}      # true where the corresponding entry is NA
end

@tshort

tshort commented Dec 23, 2014

Slight detour... What about [1, NA, 3, 4]? We can get that to promote to an Array{Nullable}, but I'm not sure if it can be promoted to a NullableArray.

@vchuravy

@johnmyleswhite Yes, providing a sentinel for each possible type would get annoying fast. It would probably be best to use a trait interface, turning my example into the following.

*I also noticed that x should be a Nullable for this to work properly.*

hassentinel(::Type) = false
hassentinel(::Type{Float64}) = true

# ...

function setindex!{T}(a::NullableArray{T}, x::Nullable{T}, idx...)
    if isnull(x)
        # only types with a sentinel get one written into the raw data
        hassentinel(T) && (a.data[idx...] = sentinel(T))
        a.bitmap[idx...] = false
    else
        a.data[idx...] = get(x)
        a.bitmap[idx...] = true
    end
end

@tshort Any Array{Nullable{T}} should be convertible into a NullableArray{T} by using a setindex! function similar to the one given above.

@simonster
Member Author

@tshort Without any tweaks I think that would be a Vector{Any} holding 3 Ints and a Nullable. If we want it to be a Vector{Nullable{Int}}, then we'd need to define promote_rule{T,S}(::Type{Nullable{T}}, ::Type{S}) = Nullable{promote_type(T,S)}, which may not be a good idea. If we do go that route, getting concatenation to produce a NullableVector when one of the elements is a Nullable seems like a similar issue to #130; I don't think we could do it without overriding vcat in general.
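
For concreteness, a sketch of that rule and what it would (and wouldn't) buy:

# Defining this for fully generic S is the part that may not be a good idea:
Base.promote_rule{T, S}(::Type{Nullable{T}}, ::Type{S}) =
    Nullable{promote_type(T, S)}

# With the rule defined, an array literal mixing values and a null could
# infer a common element type:
v = [1, Nullable{Int}(), 3, 4]     # Vector{Nullable{Int}}
# ...but that is still an Array{Nullable{Int}}, not a NullableVector.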

@nalimilan
Member

> That said, I'm not totally opposed to the idea, but it's not clear to me that it's worth doing for any types other than Float32 and Float64.

For compatibility with R, a sentinel could be used for integers, PDAs and strings. Though Julia wouldn't rely on the sentinel to check for missingness, since the sentinels would be valid integer values (not sure how NA strings are handled). That said, this could also be done when passing data to R by the package responsible for that, to avoid imposing an overhead on all Julia users.
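
A sketch of doing that conversion at the boundary (R encodes NA_integer_ as the smallest Int32; to_r_vector is a hypothetical helper, not an existing API):

const R_NA_INTEGER = typemin(Int32)

function to_r_vector(values::Vector{Int32}, isnull::Vector{Bool})
    out = copy(values)
    for i in 1:length(out)
        if isnull[i]
            out[i] = R_NA_INTEGER    # R reads this bit pattern as NA
        end
    end
    out
end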

@johnmyleswhite
Member

Eh. I'm not excited about making everything on all primitive numeric types a bunch slower for a form of hypothetical compatibility with R.

@tshort

tshort commented Dec 24, 2014

I don't think sentinels should be slower. I timed an Int sentinel using @simonster's code, and it was about the same speed as using Vector{Bool}. getindex can still return a Nullable. Strings are likely to be much slower. PooledDataArrays already have a sentinel.
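
A sketch of that getindex point, with an illustrative sentinel container (names hypothetical):

const INT_SENTINEL = typemin(Int)

type SentinelIntVector
    data::Vector{Int}
end

Base.getindex(v::SentinelIntVector, i::Int) =
    v.data[i] == INT_SENTINEL ? Nullable{Int}() : Nullable(v.data[i])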

I tend to prefer Vector{Bool} as the default and NaN sentinels for floats. As John said, the API is the most important part.
