# NA: representation of missing values

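Why do we need **NA** in addition to **NaN**? In many cases, both give similar results. Note that `&&` short-circuits: `false && x` returns `false` without evaluating `x`, and `true && x` simply returns `x`.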

In [1]:
using DataFrames
(false && NaN, false && NA)

(false, false)

In [2]:
(true && NaN, true && NA)

(NaN, NA)

In [3]:
((3 + NaN), (3 + NA))

(NaN, NA)

In [4]:
(0*NaN, 0*NA)

(NaN, NA)

However, comparisons involving `NaN` simply return `false`, even though a logical statement cannot actually be evaluated when an observation is missing. `NA` propagates through the comparison instead.

In [5]:
(3 > NaN, 3 > NA)

(false, NA)

In [6]:
((3 == NaN), (3 == NA))

(false, NA)

In [7]:
(NaN == NaN, NA == NA)

(false, NA)

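Rather unexpectedly, `!==` evaluates to **true** for _NA_. The reason is that `!==` is the negation of `===`, which tests whether two values are identical, and not the negation of `==`. Since `3` and `NA` are not identical, the result is `true`. The value-based inequality operator is `!=`.

In [8]:
((3 !== NaN), (3 !== NA)) # !== negates ===, not ==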

(true, true)

There are two ways to test equality involving `NA`: `==` propagates the missing value, whereas `isequal` always returns `true` or `false`.

In [9]:
NA == NA

NA

In [10]:
isequal(NA, NA)

true
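Because `NA == NA` is `NA` rather than a `Bool`, it cannot be used directly as the condition of an `if` statement. A minimal sketch of one way around this, assuming the `NA`-era DataFrames version used in this notebook (where `using DataFrames` also provides `NA` and `isna`), is shown here. The helper name `known_equal` is hypothetical and not part of any package:

```julia
using DataFrames  # assumption: a version that still provides NA and isna

# Hypothetical helper: two values count as equal only if both are observed
# and compare equal. Any comparison involving NA is treated as `false`.
known_equal(x, y) = !isna(x) && !isna(y) && x == y

known_equal(3, 3)    # true
known_equal(3, NA)   # false
known_equal(NA, NA)  # false, in contrast to isequal(NA, NA)
```

Whether an undecidable comparison should count as `false` depends on the application. `isequal` makes the opposite choice for `NA` versus `NA`.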

##### Further references

- http://www.r-bloggers.com/r-na-vs-null/
- http://www.cookbook-r.com/Basics/Working_with_NULL_NA_and_NaN/
- http://www.cookbook-r.com/Manipulating_data/Comparing_vectors_or_factors_with_NA/

# Testing for equality with missing observations

#### DataFrames

The function `isequal` also has methods for `DataFrame`s. Again, `isequal` returns either `true` or `false`. It evaluates to `true` only if every pair of corresponding entries is exactly the same. In particular, `NA` is considered equal to `NA`.

In [11]:
df = DataFrame()
df[:a] = @data([3, NA])
df[:b] = @data([4, NA])

df2 = DataFrame()
df2[:a] = @data([3, 4])
df2[:b] = @data([4, NA])

df3 = DataFrame()
df3[:a] = @data([10, 4])
df3[:b] = @data([4, NA])

isequal(df, df)

true

In the next example, a missing observation in one `DataFrame` is compared to a number in the other `DataFrame`. Hence, `isequal` evaluates to `false`.

In [12]:
isequal(df, df2)

false

The same holds, of course, when observed values differ:

In [13]:
isequal(df, df3)

false

In contrast, `==` aggregates the elementwise comparisons slightly differently. If every elementwise comparison evaluates to `true`, or if at least one evaluates to `false`, `==` returns the same result as `isequal`. If, however, at least one elementwise comparison evaluates to `NA` and all others evaluate to `true`, `==` evaluates to `NA`. First, comparing a `DataFrame` with missing observations with itself:

In [14]:
[df[ii, jj] == df[ii, jj] for ii=1:size(df, 1), jj=1:size(df, 2)]

2×2 Array{Any,2}:
 true    true  
     NA      NA

In [15]:
df == df

NA

The result is the same when comparing `df` with `df2`, where the elementwise comparisons again yield only `true` and `NA`:

In [16]:
[df[ii, jj] == df2[ii, jj] for ii=1:size(df, 1), jj=1:size(df, 2)]

2×2 Array{Any,2}:
 true    true  
     NA      NA

In [17]:
df == df2

NA

Now, with a single entry evaluating to `false`:

In [18]:
[df[ii, jj] == df3[ii, jj] for ii=1:size(df, 1), jj=1:size(df, 2)]

2×2 Array{Any,2}:
 false    true  
      NA      NA

In [19]:
df == df3

false
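The aggregation behaviour of `==` observed in these examples can be summarised as a small function. The following is only a sketch of the rule, not the actual DataFrames implementation. The name `aggregate_equal` is hypothetical, and the code assumes the same `NA`-era DataFrames version used throughout this notebook:

```julia
using DataFrames  # assumption: a version that still provides NA, isna and @data

# Hypothetical helper that mimics the observed behaviour of `==`:
# any `false` wins, otherwise any `NA` wins, otherwise the result is `true`.
function aggregate_equal(x::DataFrame, y::DataFrame)
    size(x) == size(y) || return false
    seen_na = false
    for jj in 1:size(x, 2), ii in 1:size(x, 1)
        res = x[ii, jj] == y[ii, jj]
        if isna(res)
            seen_na = true   # undecidable so far, keep looking for a `false`
        elseif !res
            return false     # a single mismatch settles the answer
        end
    end
    return seen_na ? NA : true
end

# Small example data in the style of the cells above
x = DataFrame()
x[:a] = @data([3, NA])
x[:b] = @data([4, NA])

y = DataFrame()
y[:a] = @data([10, 4])
y[:b] = @data([4, NA])

aggregate_equal(x, x)  # NA: only `true` and `NA` occur elementwise
aggregate_equal(x, y)  # false: 3 == 10 is `false`
```

The early `return false` reflects that a single known mismatch decides the comparison regardless of how many entries are missing.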

## Session info

In [20]:
versioninfo()

Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)


In [21]:
Pkg.status()

173 required packages:
 - AbstractFFTs                  0.2.0
 - Atom                          0.6.1
 - AutoGrad                      0.0.7
 - AutoHashEquals                0.1.1
 - AxisAlgorithms                0.1.6
 - AxisArrays                    0.1.4
 - BenchmarkTools                0.0.8
 - BinDeps                       0.7.0
 - Blink                         0.5.3
 - Blosc                         0.3.0
 - BufferedStreams               0.3.3
 - BusinessDays                  0.7.1
 - CSV                           0.1.4
 - Calculus                      0.2.2
 - CatIndices                    0.0.2
 - CategoricalArrays             0.1.6
 - Clustering                    0.8.0
 - CodeTools                     0.4.6
 - Codecs                        0.3.0
 - ColorTypes                    0.5.2
 - ColorVectorSpace              0.4.4
 - Colors                        0.7.4
 - Combinatorics                 0.4.1
 - Compat                        0.28.0
 - Compose                       0.5.3
 