# Missing Values

Missing values are common in statistical data. This notebook gives an introduction to identifying and handling them.

## Load Packages and Extra Functions

In [1]:
using Printf

include("src/printmat.jl");

# Quick Start

A quick example of how to filter out all NaN/missing values from an array and sum the remaining elements.

In [2]:
z = [1,NaN,2]
sum(filter(!isunordered,z))    #filter to only keep non-NaN/missing

3.0

# NaN and missing

The `NaN` (Not-a-Number) can be used to indicate that a floating point number is missing or otherwise strange. For other types of data (for instance, integers), use `missing` instead.

Most computations involving NaN/missing give `NaN` or `missing` as a result.

Financial data is often in floating-point form, and `NaN` is easier to work with than `missing` (since `NaN` is part of the floating-point specification, while `missing` is an add-on). This might suggest using `NaN` rather than `missing`.

In [3]:
println(2.0 + NaN," ",2 + missing)

NaN missing
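The same propagation applies to aggregates and comparisons, which is why dedicated test functions are needed later on. A small sketch using only Base functions (the variable names are just for illustration):

```julia
x_nan = [1.0, NaN, 2.0]
x_mis = [1, missing, 2]

s_nan = sum(x_nan)              #NaN: a single NaN contaminates the sum
s_mis = sum(x_mis)              #missing: the same happens with missing

eq_nan = NaN == NaN             #false: NaN is not equal to anything, not even itself
eq_mis = missing == missing     #missing: the comparison itself is unknown

println(s_nan, " ", s_mis, " ", eq_nan, " ", eq_mis)
```

Since `==` cannot be used to detect either value, the sections below rely on `isunordered()`, `isnan()` and `ismissing()` instead.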


# Loading Data

The next cell replaces `-999.99` by `NaN` or `missing` in the matrix `data`. This is a common scenario when `data` has been loaded from a data set (csv file, say). See the tutorial on loading and saving data for more information.

In [22]:
data = [1.0 -999.99;
        3.0 13.0]

data = replace(data,-999.99=>NaN)           #replace -999.99 by NaN or missing
printblue("data: ")
printmat(data)

data: 
     1.000       NaN
     3.000    13.000



## Testing for NaN/missing in an Array

You can test whether a number is `NaN` or `missing` by using `isunordered()`. (Use `isnan()` or `ismissing()` if you want to test specifically for one of them.) 

In [15]:
if any(isunordered,z)                  #check if any NaN/missing
  println("z has some NaN/missing")    #can also do any(isunordered.(z))
end

z has some NaN/missing
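To count or locate the problematic elements, or to separate `NaN` from `missing`, something like the following sketch might help (the vector `x` is made up for this example). One caveat: `isnan(missing)` returns `missing` rather than `false`, so `isequal.(x,NaN)` is used here to get a clean Boolean result on a mixed vector.

```julia
x = [1.0, NaN, missing, 2.0]

n_bad   = count(isunordered,x)       #number of NaN/missing elements
idx_bad = findall(isunordered,x)     #their positions

is_nan = isequal.(x,NaN)             #true only for NaN (isnan.(x) would give missing for the 3rd element)
is_mis = ismissing.(x)               #true only for missing

println(n_bad, " ", idx_bad)
```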


# Disregarding NaN/missing in a Vector

Disregarding NaN/missing can often be done by applying `filter()` with `!isunordered` to the vector, which removes all elements that are NaN/missing.

In [16]:
sum(filter(!isunordered,z))    #finds all elements that are not unordered, and sums them

3.0
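For comparison, Base also provides `skipmissing()`, but it only skips `missing`, not `NaN`. A short sketch of the difference (the vectors are made up for this example):

```julia
x_mis = [1, missing, 2]
x_mix = [1.0, NaN, missing, 2.0]

s1 = sum(skipmissing(x_mis))             #3: the missing is skipped
s2 = sum(skipmissing(x_mix))             #NaN: skipmissing() does not skip NaN
s3 = sum(filter(!isunordered,x_mix))     #3.0: both NaN and missing are removed

println(s1, " ", s2, " ", s3)
```

When the data may contain both `NaN` and `missing`, filtering with `!isunordered` is therefore the safer choice.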

# Prune All Rows (of a Matrix) with any NaN/missing

It is a (fairly) common procedure in statistics to throw out all cases with NaN/missing values. For instance, let `z` be a matrix and `z[t,:]` the data for period $t$, which contains one or more NaN/missing values. It is then common (for instance, in linear regressions) to throw out that entire row of the matrix.

The function `Cases2Keep(z,dims=2)` will create a bitvector with `true` for all rows in `z` that have no NaN/missing.

For statistical computations, you may also consider the [NaNStatistics.jl](https://github.com/brenhinkeller/NaNStatistics.jl) package. 

In [20]:
"""
    Cases2Keep(z,dims=2)

Indicate rows (`dims=2`) or columns (`dims=1`) without NaN/missing
"""
Cases2Keep(z,dims=2) = .!vec(any(isunordered,z;dims))

z = [1 missing;2 21]

vc = Cases2Keep(z)
printblue("z and vc:")
printmat(z,vc;colNames=["col 1","col 2","vc"])

z2 = z[vc,:]           #keep only rows without NaN/missing
printblue("z2: a new matrix where all rows with any NaN/missing have been pruned:")
printmat(z2)

z and vc:
     col 1     col 2        vc
     1       missing     0    
     2        21         1    

z2: a new matrix where all rows with any NaN/missing have been pruned:
     2        21    
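In a regression setting, the dependent variable and the regressors must be pruned jointly, so that the same rows are dropped from both. A minimal sketch of this, assuming a made-up vector `y` and matrix `x` (the function is repeated so the block runs on its own):

```julia
Cases2Keep(z,dims=2) = .!vec(any(isunordered,z;dims))   #same definition as in the cell above

y = [1.0, 2.0, NaN, 4.0]         #hypothetical dependent variable
x = [1.0 0.5;                    #hypothetical regressors
     1.0 missing;
     1.0 1.5;
     1.0 2.0]

vc = Cases2Keep([y x])           #true for rows where neither y nor x has NaN/missing
y2 = y[vc]
x2 = x[vc,:]

println(vc)
println(size(y2), " ", size(x2))
```

Here rows 2 and 3 are dropped: row 2 because of the `missing` in `x`, row 3 because of the `NaN` in `y`.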



## Converting a Pruned Array to a Standard Type (extra)

Once you have pruned all rows with `missing`s, you may want to convert the matrix to, for instance, `Float64`. This might simplify some of the later code. Notice that if the matrix contained only `NaN` (no `missing`), then no conversion is needed.

As an alternative, consider `disallowmissing()` from the [`Missings.jl`](https://github.com/JuliaData/Missings.jl) package.

In [21]:
println("The type of z2 is ", typeof(z2))

z3 = convert.(Float64,z2)
println("\nThe type of z3 is ", typeof(z3))

The type of z2 is Matrix{Union{Missing, Int64}}

The type of z3 is Matrix{Float64}
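If you want to keep the original element type (here integers) rather than switch to `Float64`, one Base-only option is `nonmissingtype()`. The sketch below uses made-up names (`z_pruned`, `z4`, `z_bad`) and also illustrates that the conversion only works once all `missing`s have been removed.

```julia
z_pruned = Union{Missing,Int}[2 21]        #1x2 matrix: no missing left, but the type still allows it

T  = nonmissingtype(eltype(z_pruned))      #Int: the element type without Missing
z4 = convert(Matrix{T},z_pruned)           #integers stay integers

z_bad = [1 missing]                        #still contains a missing
ok = try
    convert.(Float64,z_bad)                #throws an error: missing cannot be converted to Float64
    true
catch
    false
end

println(typeof(z4), " ", ok)
```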
