RFC: Bitstypes for handling missing values #45

Closed
tshort opened this Issue Aug 5, 2012 · 19 comments

5 participants

@tshort

I've taken a first shot at implementing bitstypes with missing value checking for integers, booleans, and floating point numbers. Implementing missing values as bitstypes has a number of advantages:

  • Missing values can be included in any Julia composite type, including scalars, vectors, matrices, dicts, sparse arrays, distributed arrays, and mmapped arrays.

  • Using a bit pattern to represent NA's generally offers speed and memory advantages. For floating point operations, NA's are represented as NaN's, so there is often no slowdown for NA checking.

  • Because common Vectors can now include NA's, support for DataFrame columns is easier. Any functions that work with Arrays should work with DataFrame columns since they are arrays.

This is an expansion on issue #22 (NA support for Vector{Float}).

Bitstype NA arrays and DataVecs and PooledDataVecs seem to inter-operate well. Having both a masking approach (DataVecs) and a bit-pattern approach will give users more options to best handle missing data based on their problem. Here is a specification document showing how I think everything fits together:

https://github.com/tshort/JuliaData/blob/bitstypeNA/spec/MissingValues.md

Here is the code:

https://github.com/tshort/JuliaData/blob/bitstypeNA/src/bitstypeNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/boolNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/intNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/floatNA.jl

Some tests are also available:

https://github.com/tshort/JuliaData/blob/bitstypeNA/test/bitstypeNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/test/bitstypeNA2.jl

Using bitstypes is a good showcase for Julia's capability. It was surprisingly easy to create these new types. I started with integers and Jeff's suggestion [1], and from there, it was mostly a matter of adapting int.jl. I think I had something working in 1.5 hours one evening earlier this week. Bools and floats each took about the same amount of time.

[1] https://groups.google.com/d/msg/julia-dev/n3ntT4M0gwo/xpPuTgwSpb0J

I didn't submit this as a pull request. It's in a branch under my fork of JuliaData. We should probably have discussions on what and how this should be incorporated. Some options are:

  • Include it with the JuliaData package.

  • Include it as a module with the JuliaData package. I'm not sure how different modules and packages will be.

  • Make it a separate package that can work with JuliaData structures.

  • Do one of the above on a temporary basis until modules/packages are better sorted out.

  • Toss it out, and do something better.

Currently, things seem to work pretty well. Vectors work well in DataFrames. Indexing with IntNA's and BoolNA's with and without NA's seems to work. Conversion to DataVecs and inter-operability with DataVecs seems good. It's still not well tested, so I'm sure there are many bugs and missing features, especially in conversion. I have not run into any big gotchas that would require language changes or other action by Julia core developers.

@HarlanH

Hi Tom,

Wow, this is a nice body of work. As before, I completely agree with your characterization of the issues, but remain unconvinced that a bitstype NA will be easier to use in practice. A few more specific thoughts on your excellent spec:

  • Not a huge fan of type names like "BoolNA". It's not that this type is a Boolean NA, it's that it's a Boolean that can support NA. What about "BoolData", "IntData64", etc? Where "Data" means "might be missing"...
  • For the problem with b = x .> 3, it seems like the issue is that x is a Float64 instead of a FloatNA64. You didn't change the promotion rules so that x = [pi, NA, 1:5] returns a FloatNA64? I think that would solve the issue...
  • You ask what to do if indexing an Array{T} with a BoolNA, if T doesn't support NAs. What about converting the Array{T} to DataVec{T}, which will always work?
  • The issue you demonstrate with BoolNA failing with ?: is part of the reason I'm not a huge fan of this approach. We're never going to have the level of integration that R has, so I'd prefer keeping Data types as mostly separate as possible, so that people aren't tempted to do operations like this. Another thought on the specific issue -- what happens if that BoolNA vector has the Replace flag set to False?
  • With standard floating-point types, any operation involving a missing value returns false. Don't you mean NaN? Or are you referring to comparison (vs. arithmatic) operations?

Definitely a great demonstration of the possibilities (and limitations) of this approach! Why don't we push towards finishing up the second demo, then get the community's input on this RFC?

@johnmyleswhite
Julia Statistics member

Really impressive body of work, Tom!

I agree with Harlan that we should focus on getting the second demo working. My plan is to work on the demos/docs/specs for half of today. I'm going to start by focusing on documentation and goals, then review the codebase.

I think I'm less skeptical than Harlan about the use of a BitsType, but remain skeptical about treating NA's as NaN's. My instinct is that NA's need to be a separate type to make it easier to reason about the outcomes of operations involving them. Also, my instinct is that every mathematical operation needs to handle these things differently: sums should replace NA's with 0.0 while products should replace NA's with 1.0. And missing data imputation methods seem to be harder to reason about if the storage information doesn't tell what's missing vs. what's corrupt.

@tshort

Thanks for reviewing. As to specifics raised by Harlan:

  • I didn't really like the NA in the type name either, but I couldn't think of anything better. I'll try out your "IntData64" idea. Any other options? Intdata64, Intwna64, IntOrNA64 (probably not), ...

  • On the b = x .> 3 issue, you're right. I've since changed [NA, 1.0] to return a FloatNA64. That should solve that.

  • You're right, converting Array{T} to DataVec{T} should be the default for comparisons that don't have a sibling type that supports NA's.

  • The Bool issue can occur with either a bitstype or a masking approach. If I have a DataVec of Bool's, I can't do dv[i] ? "a" : "b". dv[i] could be NA, and then this will fail. You also asked "what happens if that BoolNA vector has the Replace flag set to False?" A normal array doesn't have a replace flag, so that's the normal default. If you do naReplace(somearray, true), you now have a different data type with a pointer to the original array. That may or may not support indexing (it doesn't now but could).

  • On your last point, I was referring to comparison.

As to John's skepticism about treating NA's as NaN's, with this bitstype approach, NA's and NaN's are pretty separate, at least as separate as R [1]. For cases where that's not good enough, that's why it's nice to also have the masking approach. In the numpy/python world, it sounds like they've decided on doing a bitstype approach along with a masking approach. Lately, most of the arguments have centered around the masking approach and whether it would cover the right use cases: should NA cover undetermined values, values that should be kept but ignored for now, or something else.

[1] In R, is.na really means is.na.or.nan. I think it's defined that way because sum(c(1.0,NaN,NA)) could return NA or NaN depending on the ordering of elements and implementation of sum.

Lastly, I agree on getting things spruced up and getting the word out. I'll look at how I might be able to help later tonight or tomorrow night.

@StefanKarpinski

That's impressive work, Tom! A lot of code and very nice coverage of various issues in your discussion.

My biggest objection to this approach is that you have to define a new FooNA type every time you want to work with Foo objects in data frames. With the masking approach, on the other hand, you get a nice synergy where one person can write a Foo type, without having to worry about or even be aware of the missing data issue, yet you can immediately use Foo objects in a data frame and have NA support. This is not an academic issue. If at some point I go and add Float80 and Float128, now you have to write FloatNA80 and FloatNA128. Once we have a DateTime type, you have to create DateTimeNA and so on.

I'm also unconvinced about the performance and memory advantages. If NA masks are represented using BitArrays, then the overhead is only one bit per value, which for Int64 or Float64 data is a just 1.6% overhead. If NA masks are represented using a sparse matrix, then the overhead is proportional to the number of NAs, which is often very small. Of course, using NA types, you can make the overhead zero (although technically, you're giving up a little bit of range).

For performance, lets do a simplistic benchmark to see what happens with IntNAs. First, let's mock up some plain old integer vectors and NA masks and time adding them:

julia> v = randi(typemax(Int64),10000000);

julia> w = randi(typemax(Int64),10000000);

julia> v_mask = falses(size(v));

julia> w_mask = falses(size(w));

julia> @elapsed begin
         u = v+w
         u_mask = v_mask | w_mask
       end
0.17679715156555176

I did the timing a couple of times and took a fast one. Now lets try it with IntNAs:

julia> v = intNA(v);

julia> w = intNA(w);

julia> @elapsed u = v+w
1.3901519775390625

Likewise, I took the best of a bunch of runs. Upshot: using IntNA arithmetic instead of regular Int arithmetic and vectorized masking operations is about 8x slower. Note that this is using a one-byte-per-Bool representation for the NA mask. We could do this with BitArrays and get space savings and even faster computation:

julia> v = int(v);

julia> w = int(w);

julia> v_mask = bitfalses(size(v));

julia> w_mask = bitfalses(size(w));

julia> @elapsed begin
         u = v+w
         u_mask = v_mask | w_mask
       end
0.14422297477722168

Using BitArrays shaves another 20% off and is almost 10x faster than using IntNAs. If you used a sparse NA mask, the time for doing the NA mask computation would be proportional to the number of NAs, which again, could be very small. The NA mask approach gives the programmer the choice of what kind of overhead they want to pay for. If they're in the extremely lucky position of not needing NAs at all, we could just use a dummy mask type that doesn't take up any storage and adds no overhead to operations.

@StefanKarpinski

There is one place where there's a clear performance advantage to the NA type approach: floating-point operations where the IEEE behavior of NaNs matches the semantics of NAs. For example:

julia> v = rand(10000000);

julia> w = rand(10000000);

julia> @elapsed begin
         u = v+w
         u_mask = v_mask | w_mask
       end
0.1533949375152588

julia> v = floatNA(v);

julia> w = floatNA(w);

julia> @elapsed u = v+w
0.07993006706237793

FloatNA vector addition is 2x faster than adding float vectors and NA masks. It might be possible to do a hybrid approach where only floating-point data columns use special NA types and everything else uses NA masks. I'm not sure that's worth the additional complication though. Worse still, any operation where IEEE behavior doesn't match NA semantics is a performance trap:

julia> v = rand(10000000);

julia> w = rand(10000000);

julia> @elapsed begin
         u = v .< w
         u_mask = v_mask | w_mask
       end
0.08398103713989258

julia> v = floatNA(v);

julia> w = floatNA(w);

julia> @elapsed u = v .< w
1.8569231033325195

Oops. 22x slowdown. Probably not worth a 2x gain on some other operations.

@StefanKarpinski

One more performance consideration: the vectorized operations that the NA mask approach uses are going to get faster as Julia and LLVM get better at taking advantage of vectorized CPU instructions like the SSE and AVX families. Since IntNA and many FloatNA instructions aren't just machine ops, they won't benefit from this. So the performance gap above is going to get bigger, not smaller as the compiler infrastructure improves.

@tshort

Thanks for taking a look, Stefan. My biggest issue with DataVecs is that currently they're rather broken relative to Arrays. Even fundamental functions like mean don't work. You basically have to wrap everything in nafilter even if there are no NA's like:

groupby(df, ["a", "b"]) | quote
    x_mean = mean(nafilter(x))
    y_median = median(nafilter(y))
end

It was a lot easier to add NA support from the bottom up than it would be to make DataVecs as well supported as Arrays. Making AbstractDataVecs inherit from AbstractVectors would help this situation a lot (issue #23), but even with that, there's a good chance that I won't easily be able to take an FFT of a DataVec without wrapping in nafilter.

By restricting NA's just to DataFrames, I think this creates a divide between users coming from an R background and those coming from a Matlab background. Increased awareness of NA's (and thus NaN's) will also help core Julia functions. For example, median and quantile currently don't work right for arrays with NaN's.

Lastly, what can it hurt to allow different NA representations? Users can select what works best for them. Developers will end up contributing to the parts they use (and as with R, users become developers).

This doesn't have to be a part of JuliaData, but I do want to keep DataFrames as flexible as possible as to what can be a column (especially standard Vectors).

@HarlanH
@StefanKarpinski

I think what's called for here is to make DataVecs more first-class as vectors. Closing #23 will help a lot and various other things will fall into place with a bit of time. One option for mean and such would be to error out when applied bare to a DataVec only if the DataVec has any NA data since otherwise it doesn't matter whether you use nafilter or nareplace. So it will work until you have NAs, at which point you'll have to fix it, but it can give a clear error message.

Increased awareness of NA's (and thus NaN's) will also help core Julia functions. For example, median and quantile currently don't work right for arrays with NaN's.

That's just a bug. Can you open an issue for that (with an example)? In general I think core Julia is pretty good about NaNs. I certainly spent a lot of time making sure all the various basic arithmetic operations work correctly (and quickly) with NaNs. Conflating NA and NaN seems like a dangerous thing to me. For example, if NaN is used to represent NA, then how do you represent a non-NA NaN? After all, NaN is a perfectly valid floating-point value.

@johnmyleswhite
Julia Statistics member

Like Harlan, I think we should have a DataMatrix to handle the case of homogeneous columns with missingness. For my primary current project, I would love to have DataMatrix available, since I have a purely numeric data set with missing entries and just need Julia to support missingness before I can translate my R code to Julia.

If we can keep NaN separate from NA with small performance costs, I think that's worth doing. But it's clearly a complex issue as one can tell from reviewing the debates with NumPy that Tom has linked to.

@StefanKarpinski

Anyone have those links to NumPy debates available? I'd be interested in seeing the arguments, although for Julia especially, because of parametric polymorphism, I'm really strongly pro-mask.

@HarlanH
@tshort

First for John and the DataMatrix type, if you use bitstypes that support NA, the job's done. It's floatNA(m) where m is your array. If you want column headers, then it'll be a bit more work to add those, but one nice feature of the Index class is that you can re-use that for something like this. For a zoo-like object (one of my interests), it shouldn't be much work to put together a column index, a data array, and an indexed time column (from experimental_indexing.jl). That said, it's also not that much more effort to put together a DataMatrix using the same approach as Harlan did for DataVec's. I would make it inherit from AbstractMatrix, though.

As to Stefan's suggestion to close #23, I'm fully in favor of that, but I thought Harlan may have had some reluctance as to whether it was a good idea or not. I tried it once, and it seemed to work. It increase the warnings a lot, and those can be a pain to squash. On your point of NaN's used to represent NA's, in my implementation, they are defined as separate bit patterns. You can tell them apart as well as R tells them apart, and I don't remember complaints from R users that it has caused problems.

@tshort
@tshort

Stefan's timings are interesting. For both the IntNA examples and the
FloatNA comparison examples, the bitstype methods just were not
optimal, both for much the same reason. The compiler didn't do a very
good job with the default code. Let's start with the Float case. By
changing from this:

< (x::FloatNA32, y::FloatNA32) = isna(x) || isna(y) ? NA_Bool : lt_float(unbox(FloatNA32,x),unbox(FloatNA32,y))

function (.<){T<:FloatNA}(A::AbstractArray{T}, B::AbstractArray{T})
    F = Array(BoolNA, promote_shape(size(A),size(B)))
    for i = 1:numel(B)
        F[i] = (<)(A[i], B[i])
    end
    return F
end

to the following, I get very similar timings compared to the mask approach.

function (.<){T<:FloatNA}(A::AbstractArray{T}, B::AbstractArray{T})
    F = Array(BoolNA, promote_shape(size(A),size(B)))
    for i = 1:numel(B)
        F[i] = (<)(float64(A[i]), float64(B[i]))
        if isna(A[i]) || isna(B[i])
            F[i] = NA_Bool
        end
    end
    return F
end

The compiler is not good enough to make the first case fast, but when
the code is broken up a bit, the compiler can make it fast. I don't know
why. Maybe the compiler will inline the version with the smaller function
called.

Here are timings with the faster versions of FloatNA comparisons (best
of three):

N = 10000000
w = rand(N);
v = rand(N);
w_mask = falses(N);
v_mask = falses(N);
vn = floatNA(v);
wn = floatNA(w);
vd = DataVec(v);
wd = DataVec(w);

julia> @time res = v .< w;
elapsed time: 0.1324479579925537  seconds

julia> @time begin res = v .< w; res2 = w_mask | v_mask; end
elapsed time: 0.23880600929260254 seconds

julia> @time res = vn .< wn;
elapsed time: 0.24005508422851562 seconds

I also tried comparing to a DataVec, but that method isn't
implemented. Comparison to a single value is implemented. Here are the
timings (best of three):

julia> @time res = v .< .1;
elapsed time: 0.16703200340270996 seconds

julia> @time res = vn .< .1;    # FloatNA
elapsed time: 0.13710999488830566 seconds

julia> @time res = vd .< .1;    # DataVec
elapsed time: 12.214823961257935 seconds

My reimplementation of comparison ended up being a little faster than
that for regular Float64's. DataVec's were 90 times slower.

IntNA's were faster once Array operations were implemented as above.
Here are some timings (best of three):

v = randi(typemax(Int64),10000000);
w = randi(typemax(Int64),10000000);
v_mask = falses(size(v));
w_mask = falses(size(w));
vn = intNA(v);
wn = intNA(w);
vd = DataVec(v);
wd = DataVec(w);

julia> @time u = v+w
elapsed time: 0.22473406791687012 seconds

julia> @time begin u = v+w; u_mask = v_mask | w_mask end
elapsed time: 0.3735179901123047 seconds

julia> @time u = vn+wn    # IntNA
elapsed time: 0.34845685958862305 seconds

julia> @time u = vd+wd    # DataVec
elapsed time: 5.17979097366333 seconds

For performance, I think bitstypes have a slight advantage (especially
for Floats), but both approaches can give decent performance. DataVecs
are more flexible in being able to hold different types. Bitstypes are
more flexible in that they can be used in different containers.

My promptings are a friendly challenge to anti-bitstype folks: if you
think DataVecs are superior, show me. Make DataVecs more complete, and
make them faster. No one's taken the bait so far. If someone does,
that will make me happy since I'm pro bitstype and pro DataVec:) I
certainly think it's possible, and maybe it'll be less work than I
think.

This was referenced Aug 26, 2012
@tshort

Immutable types now offer another interesting approach to handling missing data. Something like the following (untested code) has some interesting features.

immutable Data{T}
    data::T
    isna::Bool
end

The advantages are:

  • Like the masking approach that DataArray uses, this can be used with any fundamental type.

  • Like the bitstypes discussed in this issue, immutable data points can be used in Julia arrays and other data structures.

Speedwise, this approach would probably perform well. For indexing operations, it should be faster than the BitArray masking approach. It is the least space efficient approach. It would also get around some of the clunkiness of DataArrays (like [1, 2, NA, 4]).

@johnmyleswhite
Julia Statistics member

It would be nice to see performance tests of basic operations to confirm our intuitions about the relative merits of different approaches.

I'm not sure how this will solve the [1, 2, NA, 4] problem.

@tshort

Agreed on performance tests, although things in Base are not settled on that. I expect array indexing and vectorized performance will improve. Also, while it is easy to come up with tests that use DataArrays, tests that use something unimplemented like this Data{T} type would be more work.

As far as [1, 2, NA, 4], that could be automatically promoted to an Array{Data{Int}, 1} (or the equivalent for a masking bitstype).

@garborg
Julia Statistics member

Closing in favor of recent discussions around NullableArray design.

@garborg garborg closed this Jan 12, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment