RFC: Bitstypes for handling missing values #45
Comments
Hi Tom, Wow, this is a nice body of work. As before, I completely agree with your characterization of the issues, but remain unconvinced that a bitstype NA will be easier to use in practice. A few more specific thoughts on your excellent spec:
Definitely a great demonstration of the possibilities (and limitations) of this approach! Why don't we push towards finishing up the second demo, then get the community's input on this RFC?
Really impressive body of work, Tom! I agree with Harlan that we should focus on getting the second demo working. My plan is to work on the demos/docs/specs for half of today. I'm going to start by focusing on documentation and goals, then review the codebase. I think I'm less skeptical than Harlan about the use of a BitsType, but remain skeptical about treating NA's as NaN's. My instinct is that NA's need to be a separate type to make it easier to reason about the outcomes of operations involving them. Also, my instinct is that every mathematical operation needs to handle these things differently: sums should replace NA's with 0.0 while products should replace NA's with 1.0. And missing data imputation methods seem to be harder to reason about if the storage information doesn't tell what's missing vs. what's corrupt.
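John's point about reductions has a tidy formulation: skipping an NA in a reduction is the same as substituting that operation's identity element (0 for sums, 1 for products). A small Python sketch of `na.rm`-style helpers (the names `na_sum`/`na_prod` are mine, for illustration only):

```python
def na_sum(xs):
    """Sum that skips NAs (represented here as None).
    Equivalent to replacing each NA with 0, the additive identity."""
    return sum(x for x in xs if x is not None)

def na_prod(xs):
    """Product that skips NAs.
    Equivalent to replacing each NA with 1, the multiplicative identity."""
    out = 1.0
    for x in xs:
        if x is not None:
            out *= x
    return out

xs = [2.0, None, 3.0]
print(na_sum(xs))   # 5.0
print(na_prod(xs))  # 6.0
```

This is why a single "replace NA with value v" rule can't work globally: the right v depends on the reduction being performed.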
Thanks for reviewing. As to specifics raised by Harlan:
As to John's skepticism about treating NA's as NaN's: with this bitstype approach, NA's and NaN's are pretty separate, at least as separate as in R [1]. For cases where that's not good enough, that's why it's nice to also have the masking approach. In the numpy/python world, it sounds like they've decided on doing a bitstype approach along with a masking approach. Lately, most of the arguments have centered around the masking approach and whether it would cover the right use cases: should NA cover undetermined values, values that should be kept but ignored for now, or something else.

[1] In R,

Lastly, I agree on getting things spruced up and getting the word out. I'll look at how I might be able to help later tonight or tomorrow night.
That's impressive work, Tom! A lot of code and very nice coverage of various issues in your discussion. My biggest objection to this approach is that you have to define a new I'm also unconvinced about the performance and memory advantages. If NA masks are represented using BitArrays, then the overhead is only one bit per value, which for

For performance, let's do a simplistic benchmark to see what happens with IntNAs. First, let's mock up some plain old integer vectors and NA masks and time adding them:

```julia
julia> v = randi(typemax(Int64),10000000);

julia> w = randi(typemax(Int64),10000000);

julia> v_mask = falses(size(v));

julia> w_mask = falses(size(w));

julia> @elapsed begin
           u = v+w
           u_mask = v_mask | w_mask
       end
0.17679715156555176
```

I did the timing a couple of times and took a fast one. Now let's try it with IntNAs:

```julia
julia> v = intNA(v);

julia> w = intNA(w);

julia> @elapsed u = v+w
1.3901519775390625
```

Likewise, I took the best of a bunch of runs. Upshot: using IntNA arithmetic instead of regular Int arithmetic and vectorized masking operations is about 8x slower. Note that this is using a one-byte-per-Bool representation for the NA mask. We could do this with BitArrays and get space savings and even faster computation:

```julia
julia> v = int(v);

julia> w = int(w);

julia> v_mask = bitfalses(size(v));

julia> w_mask = bitfalses(size(w));

julia> @elapsed begin
           u = v+w
           u_mask = v_mask | w_mask
       end
0.14422297477722168
```

Using BitArrays shaves another 20% off and is almost 10x faster than using IntNAs. If you used a sparse NA mask, the time for doing the NA mask computation would be proportional to the number of NAs, which again, could be very small. The NA mask approach gives the programmer the choice of what kind of overhead they want to pay for. If they're in the extremely lucky position of not needing NAs at all, we could just use a dummy mask type that doesn't take up any storage and adds no overhead to operations.
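The mask approach Stefan benchmarks above is language-independent: the payloads are combined unconditionally with ordinary arithmetic, and the result's mask is the elementwise OR of the input masks. A minimal Python sketch of that contract (the helper name `na_add` is mine, not from the thread):

```python
def na_add(v, v_mask, w, w_mask):
    """Masked-NA addition: payloads add unconditionally with plain machine
    arithmetic; an element of the result is NA exactly when either input
    element is NA (mask union). Payloads under the mask are ignored garbage."""
    u = [a + b for a, b in zip(v, w)]
    u_mask = [ma or mb for ma, mb in zip(v_mask, w_mask)]
    return u, u_mask

v = [1, 2, 3]
w = [10, 20, 30]
v_mask = [False, True, False]   # element 1 of v is NA
w_mask = [False, False, True]   # element 2 of w is NA

u, u_mask = na_add(v, v_mask, w, w_mask)
print(u)       # [11, 22, 33]
print(u_mask)  # [False, True, True]
```

Because the payload loop contains no NA branch at all, it stays as fast as ordinary vector addition; all NA bookkeeping is the cheap boolean OR.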
There is one place where there's a clear performance advantage to the NA type approach: floating-point operations where the IEEE behavior of NaNs matches the semantics of NAs. For example:

```julia
julia> v = rand(10000000);

julia> w = rand(10000000);

julia> @elapsed begin
           u = v+w
           u_mask = v_mask | w_mask
       end
0.1533949375152588

julia> v = floatNA(v);

julia> w = floatNA(w);

julia> @elapsed u = v+w
0.07993006706237793
```

FloatNA vector addition is 2x faster than adding float vectors and NA masks. It might be possible to do a hybrid approach where only floating-point data columns use special NA types and everything else uses NA masks. I'm not sure that's worth the additional complication though. Worse still, any operation where IEEE behavior doesn't match NA semantics is a performance trap:

```julia
julia> v = rand(10000000);

julia> w = rand(10000000);

julia> @elapsed begin
           u = v .< w
           u_mask = v_mask | w_mask
       end
0.08398103713989258

julia> v = floatNA(v);

julia> w = floatNA(w);

julia> @elapsed u = v .< w
1.8569231033325195
```

Oops. 22x slowdown. Probably not worth a 2x gain on some other operations.
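The comparison trap has a root that is independent of Julia: IEEE 754 ordered comparisons involving NaN always return false, which is the wrong answer under NA semantics (the result should be "unknown"), so a NaN-as-NA implementation must bolt an explicit per-element check onto the hardware comparison. A small Python sketch of the mismatch (`na_less` and the use of `None` as the NA-of-Bool are mine, for illustration):

```python
import math

nan = float("nan")

# IEEE semantics: every ordered comparison with NaN yields False.
print(nan < 1.0)   # False
print(nan > 1.0)   # False
print(nan == nan)  # False

# NA semantics want "unknown", so NaN-as-NA needs an explicit test per
# element; this extra branch is what destroys the vectorized speed.
def na_less(x, y):
    if math.isnan(x) or math.isnan(y):
        return None   # stand-in for an NA of Bool type
    return x < y

print(na_less(nan, 1.0))  # None (NA), not False
print(na_less(0.5, 1.0))  # True
```

For addition the hardware already propagates NaN, so there is no extra branch; for comparison the branch is unavoidable, which is exactly the asymmetry Stefan's two benchmarks show.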
One more performance consideration: the vectorized operations that the NA mask approach uses are going to get faster as Julia and LLVM get better at taking advantage of vectorized CPU instructions like the SSE and AVX families. Since IntNA and many FloatNA operations aren't just machine ops, they won't benefit from this. So the performance gap above is going to get bigger, not smaller, as the compiler infrastructure improves.
Thanks for taking a look, Stefan. My biggest issue with DataVecs is that currently they're rather broken relative to Arrays. Even fundamental functions like
It was a lot easier to add NA support from the bottom up than it would be to make DataVecs as well supported as Arrays. Making AbstractDataVecs inherit from AbstractVectors would help this situation a lot (issue #23), but even with that, there's a good chance that I won't easily be able to take an FFT of a DataVec without wrapping in nafilter. By restricting NA's just to DataFrames, I think this creates a divide between users coming from an R background and those coming from a Matlab background. Increased awareness of NA's (and thus NaN's) will also help core Julia functions. For example,

Lastly, what can it hurt to allow different NA representations? Users can select what works best for them. Developers will end up contributing to the parts they use (and as with R, users become developers). This doesn't have to be a part of JuliaData, but I do want to keep DataFrames as flexible as possible as to what can be a column (especially standard Vectors).
Thanks for this benchmarking, Stefan! Very interesting! Tom, to me, coming from R and doing mostly statistical operations, a I've suggested before that perhaps there should be another type in the
I think what's called for here is to make DataVecs more first-class as vectors. Closing #23 will help a lot and various other things will fall into place with a bit of time. One option for
That's just a bug. Can you open an issue for that (with an example)? In general I think core Julia is pretty good about NaNs. I certainly spent a lot of time making sure all the various basic arithmetic operations work correctly (and quickly) with NaNs. Conflating NA and NaN seems like a dangerous thing to me. For example, if NaN is used to represent NA, then how do you represent a non-NA NaN? After all, NaN is a perfectly valid floating-point value.
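One answer to "how do you represent a non-NA NaN" is the bit-pattern trick Tom's implementation (and R's NA_real_) uses: NA is a single, specific NaN payload, and every other NaN payload remains an ordinary NaN. A Python sketch of the idea; the constant 0x7FF00000000007A2 is the pattern R reportedly uses for NA_real_ (low bits spell 1954), and the helper names are mine:

```python
import math
import struct

# One specific NaN bit pattern is reserved as NA; all other NaNs stay NaN.
NA_BITS = 0x7FF00000000007A2

def bits_to_float(bits):
    """Reinterpret a 64-bit pattern as an IEEE double (bitwise, no rounding)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def float_to_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

NA = bits_to_float(NA_BITS)

def is_na(x):
    """True only for the reserved NA pattern, not for arbitrary NaNs."""
    return float_to_bits(x) == NA_BITS

plain_nan = float("nan")
print(math.isnan(NA), math.isnan(plain_nan))  # True True  -- both are NaNs
print(is_na(NA), is_na(plain_nan))            # True False -- yet distinguishable
```

So NaN-producing arithmetic (0/0, log of a negative) still yields an ordinary NaN, distinct from NA, which is how R keeps the two apart.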
Like Harlan, I think we should have a DataMatrix to handle the case of homogeneous columns with missingness. For my primary current project, I would love to have DataMatrix available, since I have a purely numeric data set with missing entries and just need Julia to support missingness before I can translate my R code to Julia. If we can keep NaN separate from NA with small performance costs, I think that's worth doing. But it's clearly a complex issue, as one can tell from reviewing the debates with NumPy that Tom has linked to.
Anyone have those links to NumPy debates available? I'd be interested in seeing the arguments, although for Julia especially, because of parametric polymorphism, I'm really strongly pro-mask. |
Sure: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

-Harlan
First, for John and the DataMatrix type: if you use bitstypes that support NA, the job's done. It's

As to Stefan's suggestion to close #23, I'm fully in favor of that, but I thought Harlan may have had some reluctance as to whether it was a good idea or not. I tried it once, and it seemed to work. It increased the warnings a lot, and those can be a pain to squash.

On your point of NaN's used to represent NA's: in my implementation, they are defined as separate bit patterns. You can tell them apart as well as R tells them apart, and I don't remember complaints from R users that it has caused problems.
There have also been long email discussions:

http://article.gmane.org/gmane.comp.python.numeric.general/44820
http://article.gmane.org/gmane.comp.python.numeric.general/46704
http://article.gmane.org/gmane.comp.python.numeric.general/48734
http://article.gmane.org/gmane.comp.python.numeric.general/49810

Here's an email summarizing some of the email discussions in 2011:

http://article.gmane.org/gmane.comp.python.numeric.general/46528

It's a good plug for Julia: we've got working versions of a masking approach and a bit pattern approach in a lot less time than numpy has spent just talking about it.
Stefan's timings are interesting. For both the IntNA examples and the

```julia
< (x::FloatNA32, y::FloatNA32) = isna(x) || isna(y) ? NA_Bool : lt_float(unbox(FloatNA32,x),unbox(FloatNA32,y))

function (.<){T<:FloatNA}(A::AbstractArray{T}, B::AbstractArray{T})
    F = Array(BoolNA, promote_shape(size(A),size(B)))
    for i = 1:numel(B)
        F[i] = (<)(A[i], B[i])
    end
    return F
end
```

to the following, I get very similar timings compared to the mask approach.

```julia
function (.<){T<:FloatNA}(A::AbstractArray{T}, B::AbstractArray{T})
    F = Array(BoolNA, promote_shape(size(A),size(B)))
    for i = 1:numel(B)
        F[i] = (<)(float64(A[i]), float64(B[i]))
        if isna(A[i]) || isna(B[i])
            F[i] = NA_Bool
        end
    end
    return F
end
```

The compiler is not good enough to make the first case fast, but when

Here are timings with the faster versions of FloatNA comparisons (best of several runs):

```julia
N = 10000000
w = rand(N);
v = rand(N);
w_mask = falses(N);
v_mask = falses(N);
vn = floatNA(v);
wn = floatNA(w);
vd = DataVec(v);
wd = DataVec(w);

julia> @time res = v .< w;
elapsed time: 0.1324479579925537 seconds

julia> @time begin res = v .< w; res2 = w_mask | v_mask; end
elapsed time: 0.23880600929260254 seconds

julia> @time res = vn .< wn;
elapsed time: 0.24005508422851562 seconds
```

I also tried comparing to a DataVec, but that method isn't

```julia
julia> @time res = v .< .1;
elapsed time: 0.16703200340270996 seconds

julia> @time res = vn .< .1; # FloatNA
elapsed time: 0.13710999488830566 seconds

julia> @time res = vd .< .1; # DataVec
elapsed time: 12.214823961257935 seconds
```

My reimplementation of comparison ended up being a little faster than

IntNA's were faster once Array operations were implemented as above.

```julia
v = randi(typemax(Int64),10000000);
w = randi(typemax(Int64),10000000);
v_mask = falses(size(v));
w_mask = falses(size(w));
vn = intNA(v);
wn = intNA(w);
vd = DataVec(v);
wd = DataVec(w);

julia> @time u = v+w
elapsed time: 0.22473406791687012 seconds

julia> @time begin u = v+w; u_mask = v_mask | w_mask end
elapsed time: 0.3735179901123047 seconds

julia> @time u = vn+wn # IntNA
elapsed time: 0.34845685958862305 seconds

julia> @time u = vd+wd # DataVec
elapsed time: 5.17979097366333 seconds
```

For performance, I think bitstypes have a slight advantage (especially

My promptings are a friendly challenge to anti-bitstype folks: if you
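Tom's faster `.<` method exploits a general trick: do the cheap hardware operation for every element unconditionally, then patch the (usually few) entries where an operand is NA, instead of branching on NA inside the hot comparison. A Python sketch of the same patch-after-compute pattern (the function name, and `None` standing in for the NA of Bool, are mine):

```python
NA = None  # stand-in for an NA-of-Bool result

def lt_patch_after(a, b, a_isna, b_isna):
    """Elementwise a < b with NA propagation: first compute the raw
    comparison unconditionally (branch-free bulk work), then overwrite
    the entries where either operand is NA."""
    out = [x < y for x, y in zip(a, b)]
    for i, (ma, mb) in enumerate(zip(a_isna, b_isna)):
        if ma or mb:
            out[i] = NA  # patch the rare NA slots after the fact
    return out

a = [0.1, 0.9, 0.5]
b = [0.2, 0.3, 0.7]
a_isna = [False, True, False]
b_isna = [False, False, False]
print(lt_patch_after(a, b, a_isna, b_isna))  # [True, None, True]
```

Keeping the per-element branch out of the main loop is what lets the compiler generate tight code for the bulk comparison, which is why Tom's rewrite closes most of the 22x gap Stefan measured.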
Immutable types now offer another interesting approach to handling missing data. Something like the following (untested code) has some interesting features.

```julia
immutable Data{T}
    data::T
    isna::Bool
end
```

The advantages are:

Speedwise, this approach would probably perform well. For indexing operations, it should be faster than the BitArray masking approach. It is the least space efficient approach. It would also get around some of the clunkiness of DataArrays (like
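The Data{T} proposal is just a value stored inline next to a missingness flag, with operations propagating the flag. A rough Python equivalent of that layout and its arithmetic (a sketch of the idea, not the proposed Julia API; the `add` helper is mine):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Data:
    """A value paired inline with a missingness flag, analogous to the
    proposed immutable Data{T}: each element carries its own isna bit."""
    value: float
    isna: bool = False

NA = Data(0.0, True)  # the payload of an NA is ignored

def add(a: Data, b: Data) -> Data:
    # NA propagates: the sum is missing if either operand is missing.
    return Data(a.value + b.value, a.isna or b.isna)

print(add(Data(1.5), Data(2.0)))   # Data(value=3.5, isna=False)
print(add(Data(1.5), NA).isna)     # True
```

The space cost John notes is visible here: every element pays for a full flag (and padding) alongside its payload, versus one bit per element for a BitArray mask or zero extra bytes for a sentinel bitstype.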
It would be nice to see performance tests of basic operations to confirm our intuitions about the relative merits of different approaches. I'm not sure how this will solve the
Agreed on performance tests, although things in Base are not settled on that. I expect array indexing and vectorized performance will improve. Also, while it is easy to come up with tests that use DataArrays, tests that use something unimplemented like this Data{T} type would be more work. As far as
Closing in favor of recent discussions around NullableArray design. |
I've taken a first shot at implementing bitstypes with missing value checking for integers, booleans, and floating point numbers. Implementing missing values as bitstypes has a number of advantages:
This is an expansion on issue #22 (NA support for Vector{Float}).
Bitstype NA arrays and DataVecs and PooledDataVecs seem to inter-operate well. Having both a masking approach (DataVecs) and a bit-pattern approach will give users more options to best handle missing data based on their problem. Here is a specification document showing how I think everything fits together:
https://github.com/tshort/JuliaData/blob/bitstypeNA/spec/MissingValues.md
Here is the code:
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/bitstypeNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/boolNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/intNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/floatNA.jl
Some tests are also available:
https://github.com/tshort/JuliaData/blob/bitstypeNA/test/bitstypeNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/test/bitstypeNA2.jl
Using bitstypes is a good showcase for Julia's capability. It was surprisingly easy to create these new types. I started with integers and Jeff's suggestion [1], and from there, it was mostly a matter of adapting int.jl. I think I had something working in 1.5 hours one evening earlier this week. Bools and floats each took about the same amount of time.
[1] https://groups.google.com/d/msg/julia-dev/n3ntT4M0gwo/xpPuTgwSpb0J
I didn't submit this as a pull request. It's in a branch under my fork of JuliaData. We should probably have discussions on what and how this should be incorporated. Some options are:
Currently, things seem to work pretty well. Vectors work well in DataFrames. Indexing with IntNA's and BoolNA's with and without NA's seems to work. Conversion to DataVecs and inter-operability with DataVecs seems good. It's still not well tested, so I'm sure there are many bugs and missing features, especially in conversion. I have not run into any big gotchas that would require language changes or other action by Julia core developers.
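For a feel of what the bitstype approach means for integers, where, unlike floats, there is no NaN to piggyback on, the usual move is to reserve one bit pattern of the integer itself as the NA sentinel. A Python sketch of that encoding (choosing typemin(Int64) as the sentinel is my assumption for illustration, not necessarily what bitstypeNA.jl/intNA.jl does):

```python
INT64_MIN = -2**63
NA_INT = INT64_MIN  # reserve one bit pattern of Int64 as the NA sentinel

def isna(x):
    return x == NA_INT

def na_add(x, y):
    """Addition that checks for and propagates the sentinel, rather than
    doing arithmetic on it; this per-element branch is the cost Stefan's
    IntNA benchmark measures."""
    if isna(x) or isna(y):
        return NA_INT
    return x + y

print(na_add(2, 3))             # 5
print(isna(na_add(2, NA_INT)))  # True
```

The trade-off is exactly the one debated above: no extra storage at all, but every integer operation grows a sentinel check, since integer hardware (unlike IEEE floating point) won't propagate the sentinel for free.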