# NA: representation of missing values

Why do we need **NA** in addition to **NaN**? In many cases, both give **analogous** results:

In [1]:
using DataFrames
{false && NaN false && NA}

1x2 Array{Any,2}:
 false  false

In [2]:
{true && NaN true && NA}

1x2 Array{Any,2}:
 NaN  NA

In [3]:
{(3 + NaN) (3 + NA)}

1x2 Array{Any,2}:
 NaN  NA

In [4]:
{0*NaN 0*NA}

1x2 Array{Any,2}:
 NaN  NA

However, comparisons involving `NaN` simply return `false`, although with a missing observation the logical statement cannot actually be evaluated. `NA` makes this explicit by returning `NA`.

In [5]:
{3 > NaN 3 > NA}

1x2 Array{Any,2}:
 false  NA

In [6]:
{(3 == NaN) (3 == NA)}

1x2 Array{Any,2}:
 false  NA

In [7]:
{NaN == NaN NA == NA}

1x2 Array{Any,2}:
 false  NA
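
The `NaN` half of these comparisons can be reproduced with Base alone. The following is a minimal sketch; the variable name `x` is only illustrative, and no `NA` is involved:

```julia
# Comparisons with NaN follow IEEE 754: ordered and equality
# comparisons are always false, even NaN against itself.
x = NaN

println(3 > x)          # false
println(3 == x)         # false
println(x == x)         # false

# isequal treats NaN as equal to itself, and isnan is the explicit test.
println(isequal(x, x))  # true
println(isnan(x))       # true
```

So a `false` from a comparison does not reveal whether a `NaN` was involved; an explicit `isnan` check is needed for that.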

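Rather unexpectedly, however, `!==` evaluates to **true** for _NA_. Note that `!==` is the negation of `===` and tests object identity, not value inequality (which is `!=`): `3` and `NA` are different objects, so the result is `true`.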

In [8]:
{(3 !== NaN) (3 !== NA)} # not good?

1x2 Array{Any,2}:
 true  true
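
The operator `!==` negates `===` (object identity), whereas `!=` negates `==` (value equality). A minimal sketch with plain Base values, where the names `a` and `b` are only illustrative, shows how the two can disagree:

```julia
a = 3      # Int
b = 3.0    # Float64

# Value comparison: both represent the number three.
println(a == b)     # true
println(a != b)     # false

# Identity comparison: an Int and a Float64 are never identical.
println(a === b)    # false
println(a !== b)    # true

# For NaN the two operators disagree in the other direction.
println(NaN != NaN)   # true
println(NaN !== NaN)  # false
```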

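There are two ways to test equality for `NA`s: `==` propagates the missing value and returns `NA`, while `isequal` always returns `true` or `false`: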

In [9]:
NA == NA

NA

In [10]:
isequal(NA, NA)

true

##### Further references

- http://www.r-bloggers.com/r-na-vs-null/
- http://www.cookbook-r.com/Basics/Working_with_NULL_NA_and_NaN/
- http://www.cookbook-r.com/Manipulating_data/Comparing_vectors_or_factors_with_NA/

# Testing for equality with missing observations

#### DataFrames

The function `isequal` is extended to `DataFrames` in the natural way. Again, `isequal` returns either `true` or `false`. To evaluate to `true`, each pair of entries must be exactly the same; in particular, `NA` is equal to `NA`.

In [11]:
df = DataFrame()
df[:a] = @data([3, NA])
df[:b] = @data([4, NA])

df2 = DataFrame()
df2[:a] = @data([3, 4])
df2[:b] = @data([4, NA])

df3 = DataFrame()
df3[:a] = @data([10, 4])
df3[:b] = @data([4, NA])

isequal(df, df)

true

In the next example, a missing observation in one `DataFrame` is compared to a number in the other `DataFrame`. Hence, `isequal` evaluates to `false`.

In [12]:
isequal(df, df2)

false

And, of course, with deviating observations:

In [13]:
isequal(df, df3)

false

In contrast, `==` aggregates slightly differently. If all elementwise comparisons evaluate to `true`, or if at least one of them evaluates to `false`, the result is the same as for `isequal`. With missing observations, however, `==` evaluates to `NA` if at least one elementwise comparison is `NA` and all others are `true`. First, comparing a `DataFrame` with missing observations with itself:

In [14]:
[df[ii, jj] == df[ii, jj] for ii=1:size(df, 1), jj=1:size(df, 2)]

2x2 Array{Any,2}:
 true    true  
     NA      NA

In [15]:
df == df

NA

Again, with `true` and `NA` present:

In [16]:
[df[ii, jj] == df2[ii, jj] for ii=1:size(df, 1), jj=1:size(df, 2)]

2x2 Array{Any,2}:
 true    true  
     NA      NA

In [17]:
df == df2

NA

Now, with a single entry evaluating to `false`:

In [18]:
[df[ii, jj] == df3[ii, jj] for ii=1:size(df, 1), jj=1:size(df, 2)]

2x2 Array{Any,2}:
 false    true  
      NA      NA

In [19]:
df == df3

false
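
The aggregation rule behind `==` can be written down in a few lines. The sketch below is an illustration only, not the DataFrames implementation: `aggregate_eq` is a hypothetical helper, and `nothing` stands in for `NA` so that the code does not depend on any package.

```julia
# Hypothetical helper, not part of DataFrames: combine elementwise
# comparison results, where `nothing` stands in for NA.
function aggregate_eq(results)
    seen_na = false
    for r in results
        if r === nothing
            seen_na = true      # remember that one comparison is unknown
        elseif !r
            return false        # a definite mismatch decides the outcome
        end
    end
    # Without a mismatch, any unknown comparison makes the result unknown.
    return seen_na ? nothing : true
end

println(aggregate_eq(Any[true, true]))            # true
println(aggregate_eq(Any[true, true, nothing]))   # nothing
println(aggregate_eq(Any[false, true, nothing]))  # false
```

The three calls correspond to the three situations above: all comparisons `true`, some missing with the rest `true`, and at least one `false`.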

#### TimeData objects

`isequal` and `==` behave in much the same way for `TimeData` objects. In addition, both return `false` if the indices, column names or types differ (for example, if a `Timematr` object is compared to a `Timenum` object).

In [20]:
using TimeData
tn = Timenum(df)
tn2 = Timenum(df2)
tn3 = Timenum(df3)
td = Timedata(df)
tn4 = Timenum(df, [date(2010, 1, 1):date(2010, 1, 2)])

LoadError: date not defined
while loading In[20], in expression starting on line 6

The last line fails because `date` is not defined in this session, so `tn4` is never created and the later cells that use it fail as well. The `Dates` package listed in the session info provides the constructor `Date`.

In [21]:
[isequal(df, df) isequal(tn, tn);
    isequal(df, df2) isequal(tn, tn2);
    isequal(df, df3) isequal(tn, tn3)]

3x2 Array{Bool,2}:
  true   true
 false  false
 false  false

In [22]:
Any[(df == df) (tn == tn);
    (df == df2) (tn == tn2);
    (df == df3) (tn == tn3)]

3x2 Array{Any,2}:
      NA       NA
      NA       NA
 false    false  

Testing other differences:

In [23]:
[isequal(tn, td) (tn == td);
    isequal(tn, tn4) (tn == tn4)]

LoadError: tn4 not defined
while loading In[23], in expression starting on line 1

In addition, `Timematr` and `Timenum` objects provide elementwise versions of `isequal` and `==`, called `isequalElemwise` and `.==` respectively. These functions are not defined for `DataFrames`:

In [24]:
df .== df

LoadError: `equMeta` has no method matching equMeta(::DataFrame, ::DataFrame)
while loading In[24], in expression starting on line 1

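For elementwise comparisons, differences in column names, indices or type will always throw an error. (In the session recorded here, `isequalElemwise` itself is not defined, so this cell and the next two fail for that reason instead.)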

In [25]:
isequalElemwise(tn, tn4)

LoadError: isequalElemwise not defined
while loading In[25], in expression starting on line 1

As `isequalElemwise` uses `isequal`, which only returns `true` or `false` values, its output is a `Timedata` object with boolean values.

In [26]:
isequalElemwise(tn, tn)

LoadError: isequalElemwise not defined
while loading In[26], in expression starting on line 1

In [27]:
isequalElemwise(tn, tn2)

LoadError: isequalElemwise not defined
while loading In[27], in expression starting on line 1
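
Since the cells above did not run in this session, the described behaviour can be illustrated with a self-contained sketch. It is not the TimeData implementation: `elementwise_isequal` is a hypothetical helper that takes column names and value matrices directly, and `NaN` stands in for the missing entries, since `isequal` also treats `NaN` as equal to itself.

```julia
# Hypothetical helper, not the TimeData implementation: compare two tables
# given as (column names, value matrix) and insist on identical metadata.
function elementwise_isequal(names1, vals1, names2, vals2)
    if names1 != names2 || size(vals1) != size(vals2)
        error("column names or dimensions differ")
    end
    # isequal only returns true or false, so the result is purely boolean.
    return [isequal(vals1[i, j], vals2[i, j]) for i = 1:size(vals1, 1), j = 1:size(vals1, 2)]
end

a = [3.0 4.0; NaN NaN]   # NaN plays the role of the missing entries here
b = [3.0 4.0; 4.0 NaN]

res = elementwise_isequal([:a, :b], a, [:a, :b], b)
println(res)
```

Only the entry in the second row, first column differs, so that is the only `false` in `res`. Calling the helper with different column names raises an error instead of returning a result.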

In contrast, `==` can also return `NA`, so the elementwise comparison `.==` returns boolean `DataArrays` that may contain `NA`.

In [28]:
tn == tn

NA

In [29]:
tn .== tn

     idx  a     b
1    1    true  true
2    2    NA    NA


## Session info

In [30]:
versioninfo()

Julia Version 0.3.5
Commit a05f87b* (2015-01-08 22:33 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libblas.so.3
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3


In [31]:
Pkg.status()

18 required packages:
 - DataArrays                    0.2.9
 - DataFrames                    0.6.0
 - Dates                         0.3.2
 - Debug                         0.0.4
 - Distributions                 0.6.3
 - EconDatasets                  0.0.2
 - GLM                           0.4.2
 - Gadfly                        0.3.10
 - IJulia                        0.1.16
 - JuMP                          0.7.3
 - MAT                           0.2.9
 - NLopt                         0.2.0
 - Quandl                        0.4.0
 - RDatasets                     0.1.1
 - Taro                          0.1.2
 - TimeData                      0.5.1
 - TimeSeries                    0.4.6
 - Winston                       0.11.7
56 additional packages:
 - ArrayViews                    0.4.8
 - BinDeps                       0.3.7
 - Blosc                         0.1.1
 - Cairo                         0.2.22
 - Calculus                      0.1.5
 - Codecs                        0.1.3
 - Color      