NA/missing values #470

Closed
HarlanH opened this issue Feb 26, 2012 · 11 comments
Labels
kind:speculative Whether the change will be implemented is speculative

Comments

@HarlanH
Contributor

HarlanH commented Feb 26, 2012

As discussed in this thread, Julia needs to support data with missing values. Current thinking seems to be to create a parallel system of union types (e.g., IntData), promotion, and methods, rather than implementing anything at the bit level (which could be done, at least for floating-point numbers). Note that Matlab suggests overloading NaN for missing data, which is not a good idea, and R uses a NaN payload for floating-point NAs.

References:
http://www.pauldickman.com/teaching/sas/missing.php
http://cran.r-project.org/doc/manuals/R-lang.pdf (section 3.3.4)

@StefanKarpinski
Sponsor Member

For the record, my thinking is more along these lines:

  • Use NaNs with special payload to indicate NA for floats.
  • Define a special IntNA type that uses the minimum Int value to indicate NA.
  • Use a separate metadata column to indicate different reasons for an NA when that's needed.
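
The first two bullets can be sketched as follows in current Julia syntax, which postdates this thread. This is only an illustration under stated assumptions: the names `FLOAT_NA`, `INT_NA` and `isna` and the payload value are hypothetical, not an existing API.

```julia
# Hypothetical NA encodings; all names and the payload are assumptions of this sketch.

# Quiet-NaN bit pattern (exponent all ones, top mantissa bit set) plus an arbitrary payload.
const FLOAT_NA_BITS = 0x7ff8000000000000 | 0x00000000000007a2
const FLOAT_NA = reinterpret(Float64, FLOAT_NA_BITS)

# A float is NA only if its bits match the chosen pattern exactly.
isna(x::Float64) = reinterpret(UInt64, x) == FLOAT_NA_BITS

# Integer NA: sacrifice the minimum value of the type.
const INT_NA = typemin(Int)
isna(x::Int) = x == INT_NA

println(isnan(FLOAT_NA))   # the NA is still a NaN as far as the hardware is concerned
println(isna(NaN))         # an ordinary NaN is not an NA
println(isna(INT_NA))
```

Whether the payload survives arithmetic is up to the hardware, so the sketch only inspects the stored bit pattern and makes no claim about results of operations on `FLOAT_NA`.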

@HarlanH
Contributor Author

HarlanH commented Feb 27, 2012

Ah, ok. I'll play around more with these options soon. We'll also need boolean and string NAs. (And factor NAs, when we have factors or similar...)

@StefanKarpinski
Sponsor Member

Yikes. With all of these different types of NAs, maybe a parametric type is better. Something like this:

type NA{T}
  value::T
  na::Bool
end

That's going to be far less efficient than what I was proposing above, but it would allow us to express the behavior of NA types once instead of five separate times. You could write generic operations like this:

+(x::NA, y::NA) = NA(x.value + y.value, x.na | y.na)
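
For readers on a current Julia (1.x), the same idea might be written roughly as below. This is a sketch that assumes the later `struct` syntax; the one-argument constructor and `isna` are hypothetical helpers added for illustration.

```julia
# Hypothetical wrapper type mirroring the proposal above; not a Base or package API.
struct NA{T}
    value::T
    na::Bool
end

# Convenience constructor for a present (non-NA) value; assumed for this sketch.
NA(x::T) where {T} = NA{T}(x, false)

isna(x::NA) = x.na

# NA-ness propagates: the result is NA if either operand is NA.
Base.:+(x::NA, y::NA) = NA(x.value + y.value, x.na | y.na)

a = NA(1)
b = NA(2)
c = NA(0, true)        # an NA; the stored value is ignored by convention

println(isna(a + b))   # false
println(isna(a + c))   # true
```

Because the struct is immutable and its fields are plain bits, `isbitstype(NA{Int})` holds in current Julia, which bears on the storage-cost question raised in this thread.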

@jacobhinkle

It sounds like what you really want is a parametrized Maybe type, like in Haskell. Would issue #414 make something like that less painful? Annotation would allow the compiler to easily infer whether a function has the capability to return a None or if it is always going to give Justs. Then you can convert to the different payload at the last minute in some cases, instead of dragging an extra byte around all over the place.

@HarlanH
Contributor Author

HarlanH commented Feb 27, 2012

Jacob, I don't think function annotation for purity is related to this question. The + operator needs to be able to deal with NAs, but it's impossible to know whether external data has NAs or not at compile time. But I might be misunderstanding...

Scala also has an Option type: http://www.codecommit.com/blog/scala/the-option-pattern But you have to wrap everything in Some() all of the time, which feels clunky to me.

Stefan, hm, I dunno, I think performance is important here. If we had immutable arrays, you could do a one-time check at initialization time for any NAs in the object, then do a simple check at access time to determine what method to use. But with mutable arrays, that might require some bookkeeping...

@StefanKarpinski
Sponsor Member

I suspect that the biggest performance hit here would actually be from the indirect storage that would currently be forced by having arrays of NA objects. If we implement inline storage for arrays of immutable objects, that would go away. The extra boolean operations to track NA values seem kind of unavoidable to me and probably wouldn't be any worse than any of the other approaches. The main issue is that machines don't do things like integer arithmetic with NA semantics for you. NaN behaves basically the way you'd want it to, so for floats only, you could potentially get normal arithmetic speed while supporting NA by making NA a special NaN value.

@HarlanH
Contributor Author

HarlanH commented Mar 1, 2012

More info on how other languages/packages deal with NA: http://pandas.pydata.org/pandas-docs/stable/missing_data.html Pandas for Python doesn't really support NA, as NumPy doesn't yet implement it.

@WarrenWeckesser
Contributor

There have been some very long discussions on handling NA in numpy on the numpy-discussion mailing list, and there is a "NEP" here: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

@joehl

joehl commented Apr 29, 2012

Congrats on Julia, looks like Julia gets many core things right that other systems such as {matlab, SAS, R} got wrong, such as {pricing, language, performance}.

I share Ross Ihaka's view, that R gets so many things wrong - with respect to performance - that it is justified to start from scratch with a new language, and it would be exciting if with Julia we would quickly get a good start from scratch, ideally joining efforts with Ross and other people who desire a restart. Having said this, I hope it is not too late to fix those things that Julia doesn't get right so far.

The most obvious thing our little girl Julia needs to learn about is "missing value handling". I dare predict that without proper missing value handling Julia will not be able to replace R, because R gets this quite right (with minor exceptions, see below).

Here is a short story about what happens if one gets missing values (NA) wrong: in SAS NA<0 -> TRUE. As a consequence in their PROC SQL, NULL<0 -> TRUE, which breaks the SQL standard. So SAS has different semantics in SQL than DBs following the SQL standard (like Oracle). Worse than that, in SAS's Access to Oracle Interface, SAS feels free to decide whether to pull the data and evaluate a SQL statement in SAS or to push the SQL statement for evaluation into Oracle (because in-database processing can be much faster). SAS as of today pushes SQL conditions without modifications, i.e. it does not enforce its deviating semantics when pushing to Oracle. As a consequence SAS SQL semantics not only deviate from the standard, they are also unpredictable in certain contexts. That's quite a mess, so let's save little Julia from ending up with such a destiny.

Today Julia doesn't have NA in string, integer and boolean types; only the floating-point types have NaN:

x = 0/0 # create a NaN
y = x # create a comparison partner that happens to be also NaN

In logical and comparison operations we can either propagate NAs (and return NA) or short-circuit over them in certain contexts and return TRUE or FALSE:

x==y # FALSE: propagating NA and not confirming equality would be ok in certain contexts (defining a bi-boolean filter on equal values)
x!=x # TRUE: but no longer propagating NA and stating that we know about inequality is a problem

x<y # FALSE: propagating NA by not stating that "x<y" might seem OK
!(x>=y) # TRUE: but no longer propagating NA and stating that "x>=y is FALSE" (thus stating that "x<y") is inconsistent to the above

If we accept that we should propagate NaN or NA in numerical computations (following the IEEE 754 standard), and if we follow E.T. Jaynes in understanding that logical reasoning is a special case of (numerical) probability calculation (http://bayes.wustl.edu/), then we also need to propagate NA in logical reasoning. While it is OK for functions such as isless() and isequal() to not propagate NAs and return bi-booleans, general comparison operators need to return a tri-boolean. R gets this right.
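
As a point of comparison for readers on a current Julia (1.x), which postdates this thread: the built-in `missing` value gives tri-boolean results for general comparisons, while `isequal` and `isless` stay bi-boolean. A short sketch, with variable names chosen only for illustration:

```julia
x = missing
y = missing

# General comparison operators propagate the missing value.
r1 = x == y         # missing
r2 = x < 1          # missing
r3 = !(x >= 1)      # missing, consistent with r2

# Logical operators give a definite answer only when the other operand decides it.
r4 = true | x       # true
r5 = false & x      # false
r6 = true & x       # missing

# isequal and isless always return a plain Bool.
r7 = isequal(x, y)  # true
r8 = isless(1, x)   # true: missing is ordered after every number

println((r1, r2, r3, r4, r5, r6, r7, r8))
```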

As I see it, it is not a question whether Julia needs consistent NA handling, the question is how to get there without sacrificing simplicity and performance. Let's start with the question how to represent NAs and defer to later the question how to handle NAs.

Some people - e.g. in NumPy - suggest representing NAs in a masking vector residing in a separate memory location; this is neither simple nor does it help performance. I think R's solution of sacrificing just one value of a type's domain is the way to go. Using a parametric type (vector of unions) instead would require at least one dedicated bit, which is RAM-wise more expensive and opens all kinds of problems with alignment or wasting even more RAM. I truly like Stefan Karpinski's suggestion to mark NAs in the data vector and only store NA reasons in a separate metadata vector (if those reasons are ever needed).

Julia should have NAs in all data types, with very few exceptions: there are good reasons to have a true (bi)"boolean" datatype without NAs (requiring only a single bit), and a tri-boolean "logical" datatype with NAs like in R, but occupying only 2 instead of 32 bits. Unsigned integers could also get away without NAs. In 'ff' (a package enhancing R with on-disk data types) we chose to have NAs for signed, but not for unsigned integers. That gives us for example an unsigned 2-bit integer that can represent 4-valued factors (covering ATGC for bioinformatics) and signed 2-bit integers that can represent {NA,-1,0,1}. Not having NAs in unsigned integers does not introduce inconsistencies, if we never promote unsigned to signed integers:

julia> -2 + convert(Uint8, 1)
0xffffffffffffffff

julia> typeof(ans)
Uint64

Ouch!

Using the smallest negative integer as the representation of NA has the mathematical beauty of creating a symmetric value range and the practical advantage of being compatible with R (and C code written for R). Note that representing NA by the smallest integer and defining NA to be ordered above the largest integer (in isless()) is inconsistent and has negative performance implications. This is a point R has not solved optimally: R's 'order()' has default 'na.last=TRUE', so by default it sorts {-1,0,1,NA} instead of {NA,-1,0,1}. Sorting in C gives us 'na.first' for free; if we want to implement 'na.last', the comparison function in our sort (called O(n*log(n)) times) changes from a single "x<y" to a much more expensive: "x<y ? (ISNA(x) ? FALSE : TRUE) : ( (ISNA(y) && !ISNA(x)) ? TRUE : FALSE )". Is there any specific advantage of defining NA to be the last value of the domain?
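
The two orderings can be sketched in current Julia syntax as below. This assumes the smallest Int as the NA sentinel; `INT_NA`, `isna`, `lt_na_first` and `lt_na_last` are hypothetical names used only for this illustration.

```julia
# Sentinel assumed for this sketch: the smallest Int stands for NA.
const INT_NA = typemin(Int)
isna(x::Int) = x == INT_NA

# na.first: the sentinel is already the smallest value, so a plain < suffices.
lt_na_first(x, y) = x < y

# na.last: NA must compare greater than every other value, which costs extra branches.
lt_na_last(x, y) = isna(x) ? false : (isna(y) ? true : x < y)

v = [1, INT_NA, -1, 0]
first_sorted = sort(v; lt = lt_na_first)   # NA, -1, 0, 1
last_sorted  = sort(v; lt = lt_na_last)    # -1, 0, 1, NA

println(first_sorted)
println(last_sorted)
```

`lt_na_last` returns the same result as the ternary expression quoted above for every pair of inputs; it is just arranged to test for NA first.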

In doubles R does distinguish between IEEE NaN and a special NA (a NaN with a special payload). I tend to believe that this overcomplicates matters, and - following Stefan - reasons for NAs should be kept separate.

So much for today. Let me know if you would like more thoughts on Julia's NA handling.

Cheers

Jens Oehlschlägel
Data scientist - Munich

@pao
Member

pao commented Apr 29, 2012

@joehl, thanks for your interest in Julia! These R-ish things aren't my area, but are certainly important to a large part of the technical computing community. There've been multiple discussions on this topic in -dev, and I think Harlan has some working prototypes. I encourage you to check those out.

@ViralBShah
Member

I am closing this issue as this discussion is on the mailing list and is being addressed in JuliaData.
