This repository has been archived by the owner on May 4, 2019. It is now read-only.

Port from Nulls to Missings
Add error messages to MissingException, which requires them.
nalimilan committed Nov 17, 2017
1 parent 63a6158 commit 8cafe6a
Showing 50 changed files with 1,177 additions and 1,177 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -12,14 +12,14 @@ Documentation:
[![](https://img.shields.io/badge/docs-latest-blue.svg)](https://JuliaStats.github.io/DataArrays.jl/latest)

The DataArrays package provides the `DataArray` type for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
in Julia, based on the `missing` value from the [Missings.jl](https://github.com/JuliaData/Missings.jl) package.

Most Julian arrays cannot contain `null` values: only `Array{Union{T, Null}}` and more generally `Array{>:Null}` can contain `null` values.
Most Julian arrays cannot contain `missing` values: only `Array{Union{T, Missing}}` and more generally `Array{>:Missing}` can contain `missing` values.

The generic use of heterogeneous `Array` is discouraged in Julia versions below 0.7 because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `null` values.
The generic use of heterogeneous `Array` is discouraged in Julia versions below 0.7 because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `missing` values.

For example, a `DataArray{Int}` can contain integers and `null` values. We can construct one as follows:
For example, a `DataArray{Int}` can contain integers and `missing` values. We can construct one as follows:

da = @data([1, 2, null, 4])
da = @data([1, 2, missing, 4])

This package used to provide the `PooledDataArray` type, a variant of `DataArray{T}` optimized for representing arrays that contain many repetitions of a small number of unique values. `PooledDataArray` has been deprecated in favor of [`CategoricalArray`](https://github.com/JuliaData/CategoricalArrays.jl) or [`PooledArray`](https://github.com/JuliaComputing/PooledArrays.jl).
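
The README text above describes arrays that hold values of one type plus `missing`. As a minimal sketch of that behaviour, the following uses only `Base` and assumes Julia 1.0 or later, where `missing`, `ismissing`, and `skipmissing` no longer require Missings.jl. It uses a plain `Vector{Union{Int, Missing}}` rather than a `DataArray`, so it illustrates the semantics of `missing`, not this package's API.

```julia
# Assumption: Julia >= 1.0, where `missing` is part of Base.
# A plain vector whose element type admits both Int and Missing.
v = Union{Int, Missing}[1, 2, missing, 4]

# `ismissing` tests one value; broadcasting it yields a Boolean mask.
mask = ismissing.(v)

# `skipmissing` iterates over the non-missing entries only.
total = sum(skipmissing(v))

# Without skipping, the reduction propagates `missing`.
propagated = sum(v)
```

Here `mask` marks the third position, `total` adds the three integers, and `propagated` is itself `missing`.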
2 changes: 1 addition & 1 deletion REQUIRE
@@ -1,5 +1,5 @@
julia 0.6
Nulls 0.1.2
Missings
StatsBase 0.15.0
Reexport
SpecialFunctions
8 changes: 4 additions & 4 deletions benchmark/operators.jl
@@ -6,11 +6,11 @@ srand(1776)

const TEST_NAMES = [
"Vector",
"DataVector No null",
"DataVector Half null",
"DataVector No missing",
"DataVector Half missing",
"Matrix",
"DataMatrix No null",
"DataMatrix Half null"
"DataMatrix No missing",
"DataMatrix Half missing"
]

function make_test_types(genfunc, sz)
12 changes: 6 additions & 6 deletions benchmark/reduce.jl
@@ -6,10 +6,10 @@ srand(1776)

const TEST_NAMES = [
"Vector",
"DataVector No null skipnull=false",
"DataVector No null skipnull=true",
"DataVector Half null skipnull=false",
"DataVector Half null skipnull=true"
"DataVector No missing skipmissing=false",
"DataVector No missing skipmissing=true",
"DataVector Half missing skipmissing=false",
"DataVector Half missing skipmissing=true"
]

function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, replications)
println($fn)
fns = [()->$fn(Data[1]),
()->$fn(Data[2]),
()->$fn(Data[2]; skipnull=true),
()->$fn(Data[2]; skipmissing=true),
()->$fn(Data[3]),
()->$fn(Data[3]; skipnull=true)]
()->$fn(Data[3]; skipmissing=true)]
gc_disable()
df = compare(fns, $replications)
gc_enable()
12 changes: 6 additions & 6 deletions benchmark/reducedim.jl
@@ -6,10 +6,10 @@ srand(1776)

const TEST_NAMES = [
"Matrix",
"DataMatrix No null skipnull=false",
"DataMatrix No null skipnull=true",
"DataMatrix Half null skipnull=false",
"DataMatrix Half null skipnull=true"
"DataMatrix No missing skipmissing=false",
"DataMatrix No missing skipmissing=true",
"DataMatrix Half missing skipmissing=false",
"DataMatrix Half missing skipmissing=true"
]

function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, dim, replications)
println($fn, " (region = ", $dim, ")")
fns = [()->$fn(Data[1], $dim),
()->$fn(Data[2], $dim),
()->$fn(Data[2], $dim; skipnull=true),
()->$fn(Data[2], $dim; skipmissing=true),
()->$fn(Data[3], $dim),
()->$fn(Data[3], $dim; skipnull=true)]
()->$fn(Data[3], $dim; skipmissing=true)]
gc_disable()
df = compare(fns, $replications)
gc_enable()
2 changes: 1 addition & 1 deletion docs/src/da.md
@@ -12,7 +12,7 @@ DataArray
DataVector
DataMatrix
@data
padnull
padmissing
levels
```

4 changes: 2 additions & 2 deletions docs/src/index.md
@@ -1,10 +1,10 @@
# DataArrays.jl

This package provides array types for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
in Julia, based on the `missing` value from the [Missings.jl](https://github.com/JuliaData/Missings.jl) package.
In particular, it provides the following:

* `DataArray{T}`: An array type that can house both values of type `T` and missing values (of type `Null`)
* `DataArray{T}`: An array type that can house both values of type `T` and missing values (of type `Missing`)
* `PooledDataArray{T}`: An array type akin to `DataArray` but optimized for arrays with a smaller set of unique
values, as commonly occurs with categorical data. This type is now deprecated in favor of [`CategoricalArray`](https://github.com/JuliaData/CategoricalArrays.jl) or [`PooledArray`](https://github.com/JuliaComputing/PooledArrays.jl).

16 changes: 8 additions & 8 deletions spec/literals.md
@@ -27,36 +27,36 @@ that will allow direct construction of `DataArray`s and `PooledDataArray`s.
The basic mechanism that powers the `@data` and `@pdata` macros is the
rewriting of array literals as a call to DataArray or PooledDataArray
with a rewritten array literal and a Boolean mask that specifies where
`null` occurred in the original literal.
`missing` occurred in the original literal.

For example,

@data [1, 2, null, 4]
@data [1, 2, missing, 4]

will be rewritten as,

DataArray([1, 2, 1, 4], [false, false, true, false])

Note the added `1` created during the rewriting of the array literal.
This value is called a `stub` and is always the first value found
in the literal array that is not `null`. The use of stubs explains two
in the literal array that is not `missing`. The use of stubs explains two
important properties of the `@data` and `@pdata` macros:

* If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure function may be called more times than expected.
* It is not possible to specify a literal DataArray that contains only `null` values.
* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.
* It is not possible to specify a literal DataArray that contains only `missing` values.
* None of the variables used in a literal array can be called `missing`. This is just good style anyway, so it is not much of a limitation.
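
The stub mechanism described above can be sketched on ordinary values. The helper `stub_and_mask` below is hypothetical and is not part of DataArrays: the real `@data` macro rewrites the literal's expressions at parse time, whereas this sketch operates on an already-evaluated vector and assumes Julia 1.0 or later. It does show why a literal containing only `missing` values cannot be handled: there is no non-missing entry to serve as the stub.

```julia
# Hypothetical helper, for illustration only (assumes Julia >= 1.0).
# Splits a vector into stub-filled values and a Boolean mask of missing positions.
function stub_and_mask(xs::AbstractVector)
    mask = map(ismissing, xs)
    i = findfirst(!, mask)              # index of the first non-missing entry
    i === nothing && throw(ArgumentError("only `missing` values: no stub available"))
    stub = xs[i]
    # Replace every missing entry with the stub so the values keep one type.
    vals = [ismissing(x) ? stub : x for x in xs]
    return vals, mask
end

vals, mask = stub_and_mask([1, 2, missing, 4])
```

For the literal used in the text, `vals` holds `1` in the third slot and `mask` flags that slot, matching the two arguments of the rewritten `DataArray(...)` call shown above.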

# Limitations

We restate the limitations noted above:

* If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure function may be called more times than expected.
* It is not possible to specify a literal DataArray that contains only `null` values.
* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.
* It is not possible to specify a literal DataArray that contains only `missing` values.
* None of the variables used in a literal array can be called `missing`. This is just good style anyway, so it is not much of a limitation.


Note that the latter limitation is not very important, because a DataArray
with only `null` values is already problematic because it has no well-defined
with only `missing` values is already problematic because it has no well-defined
type in Julia.

One final limitation is that the rewriting rules are not able to
4 changes: 2 additions & 2 deletions src/DataArrays.jl
@@ -4,7 +4,7 @@ module DataArrays
using Base: promote_op
using Base.Cartesian, Reexport
@reexport using StatsBase
@reexport using Nulls
@reexport using Missings
using SpecialFunctions

const DEFAULT_POOLED_REF_TYPE = UInt32
@@ -29,7 +29,7 @@ module DataArrays
FastPerm,
getpoolidx,
gl,
padnull,
padmissing,
pdata,
PooledDataArray,
PooledDataMatrix,
18 changes: 9 additions & 9 deletions src/abstractdataarray.jl
@@ -2,9 +2,9 @@
AbstractDataArray{T, N}
An `N`-dimensional `AbstractArray` whose entries can take on values of type
`T` or the value `null`.
`T` or the value `missing`.
"""
abstract type AbstractDataArray{T, N} <: AbstractArray{Union{T,Null}, N} end
abstract type AbstractDataArray{T, N} <: AbstractArray{Union{T,Missing}, N} end

"""
AbstractDataVector{T}
@@ -20,7 +20,7 @@ A 2-dimensional [`AbstractDataArray`](@ref) with element type `T`.
"""
const AbstractDataMatrix{T} = AbstractDataArray{T, 2}

Base.eltype(d::AbstractDataArray{T, N}) where {T, N} = Union{T,Null}
Base.eltype(d::AbstractDataArray{T, N}) where {T, N} = Union{T,Missing}

# Generic iteration over AbstractDataArray's

@@ -30,20 +30,20 @@ Base.done(x::AbstractDataArray, state::Integer) = state > length(x)

# FIXME: type piracy
"""
isnull(a::AbstractArray, i) -> Bool
ismissing(a::AbstractArray, i) -> Bool
Determine whether the element of `a` at index `i` is missing, i.e. `null`.
Determine whether the element of `a` at index `i` is missing, i.e. `missing`.
# Examples
```jldoctest
julia> X = @data [1, 2, null];
julia> X = @data [1, 2, missing];
julia> isnull(X, 2)
julia> ismissing(X, 2)
false
julia> isnull(X, 3)
julia> ismissing(X, 3)
true
```
"""
Base.isnull(a::AbstractArray{T}, i::Real) where {T} = Null <: T ? isa(a[i], Null) : false # -> Bool
Missings.ismissing(a::AbstractArray{T}, i::Real) where {T} = Missing <: T ? isa(a[i], Missing) : false # -> Bool
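
The indexed check above short-circuits on the element type: if `Missing` is not a subtype of `T`, no element can be missing and the array is never read. A minimal sketch of the same idea follows, assuming Julia 1.0 or later. The name `ismissing_at` is hypothetical and was chosen so that the sketch adds no method to a function it does not own, which is the type piracy the `FIXME` comment warns about.

```julia
# Hypothetical name, for illustration only (assumes Julia >= 1.0).
# Mirrors the logic of the indexed method without extending `ismissing` itself.
ismissing_at(a::AbstractArray{T}, i::Integer) where {T} =
    Missing <: T ? a[i] isa Missing : false

x = Union{Int, Missing}[1, 2, missing]
second_missing = ismissing_at(x, 2)
third_missing = ismissing_at(x, 3)

# For an element type that cannot hold `missing`, the answer is always false.
plain_missing = ismissing_at([1, 2, 3], 1)
```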
34 changes: 17 additions & 17 deletions src/broadcast.jl
@@ -5,24 +5,24 @@ _broadcast_shape(x...) = Base.to_shape(Base.Broadcast.broadcast_indices(x...))

# Get ref for value for a PooledDataArray, adding to the pool if
# necessary
_unsafe_pdaref!(Bpool, Brefdict::Dict, val::Null) = 0
_unsafe_pdaref!(Bpool, Brefdict::Dict, val::Missing) = 0
function _unsafe_pdaref!(Bpool, Brefdict::Dict, val)
@get! Brefdict val begin
push!(Bpool, val)
length(Bpool)
end
end

# Generate a branch for each possible combination of null/not null. This
# Generate a branch for each possible combination of missing/not missing. This
# gives good performance at the cost of 2^narrays branches.
function gen_na_conds(f, nd, arrtype, outtype,
daidx=find(t -> t <: DataArray || t <: PooledDataArray, arrtype), pos=1, isnull=())
daidx=find(t -> t <: DataArray || t <: PooledDataArray, arrtype), pos=1, ismissing=())

if pos > length(daidx)
args = Any[Symbol("v_$(k)") for k = 1:length(arrtype)]
for i = 1:length(daidx)
if isnull[i]
args[daidx[i]] = null
if ismissing[i]
args[daidx[i]] = missing
end
end

@@ -39,15 +39,15 @@ function gen_na_conds(f, nd, arrtype, outtype,
else
k = daidx[pos]
quote
if $(Symbol("isnull_$(k)"))
$(gen_na_conds(f, nd, arrtype, outtype, daidx, pos+1, tuple(isnull..., true)))
if $(Symbol("ismissing_$(k)"))
$(gen_na_conds(f, nd, arrtype, outtype, daidx, pos+1, tuple(ismissing..., true)))
else
$(if arrtype[k] <: DataArray
:(@inbounds $(Symbol("v_$(k)")) = $(Symbol("data_$(k)"))[$(Symbol("state_$(k)_0"))])
else
:(@inbounds $(Symbol("v_$(k)")) = $(Symbol("pool_$(k)"))[$(Symbol("r_$(k)"))])
end)
$(gen_na_conds(f, nd, arrtype, outtype, daidx, pos+1, tuple(isnull..., false)))
$(gen_na_conds(f, nd, arrtype, outtype, daidx, pos+1, tuple(ismissing..., false)))
end
end
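
The recursion in `gen_na_conds` extends a tuple of flags with `true` and then `false` at each position, which is where the 2^narrays branch count in the comment comes from. The sketch below reproduces only that enumeration, assuming Julia 1.0 or later. The function `missing_combinations` is hypothetical: it returns the flag tuples instead of generating code.

```julia
# Hypothetical helper, for illustration only (assumes Julia >= 1.0).
# Enumerates every missing / not-missing combination for `n` arrays, using the
# same pattern as the recursion above: append `true`, then append `false`.
function missing_combinations(n::Integer, flags::Tuple=())
    length(flags) == n && return [flags]
    return vcat(missing_combinations(n, (flags..., true)),
                missing_combinations(n, (flags..., false)))
end

combos = missing_combinations(3)
```

With three arrays there are eight combinations, so each additional `DataArray` or `PooledDataArray` argument doubles the number of generated branches.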
end
Expand Down Expand Up @@ -128,13 +128,13 @@ Base.map!(f::F, B::Union{DataArray, PooledDataArray}, A0, As...) where {F} =

# body
begin
# Advance iterators for DataArray and determine null status
# Advance iterators for DataArray and determine missing status
$(Expr(:block, [
As[k] <: DataArray ? quote
@inbounds $(Symbol("isnull_$(k)")) = Base.unsafe_bitgetindex($(Symbol("na_$(k)")), $(Symbol("state_$(k)_0")))
@inbounds $(Symbol("ismissing_$(k)")) = Base.unsafe_bitgetindex($(Symbol("na_$(k)")), $(Symbol("state_$(k)_0")))
end : As[k] <: PooledDataArray ? quote
@inbounds $(Symbol("r_$(k)")) = @nref $nd $(Symbol("refs_$(k)")) d->$(Symbol("j_$(k)_d"))
$(Symbol("isnull_$(k)")) = $(Symbol("r_$(k)")) == 0
$(Symbol("ismissing_$(k)")) = $(Symbol("r_$(k)")) == 0
end : nothing
for k = 1:N]...))

@@ -190,20 +190,20 @@ Base.Broadcast._containertype(::Type{T}) where T<:PooledDataArray = PooledDa
Base.Broadcast.broadcast_indices(::Type{T}, A) where T<:AbstractDataArray = indices(A)

@inline function broadcast_t(f, ::Type{T}, shape, A, Bs...) where {T}
dest = Base.Broadcast.containertype(A, Bs...)(Nulls.T(T), Base.index_lengths(shape...))
dest = Base.Broadcast.containertype(A, Bs...)(Missings.T(T), Base.index_lengths(shape...))
return broadcast!(f, dest, A, Bs...)
end

# This is mainly to handle isnull.(x) since isnull is probably the only
# function that can guarantee that nulls will never propagate
# This is mainly to handle ismissing.(x) since ismissing is probably the only
# function that can guarantee that missings will never propagate
@inline function broadcast_t(f, ::Type{Bool}, shape, A, Bs...)
dest = similar(BitArray, shape)
return broadcast!(f, dest, A, Bs...)
end

# This one is almost identical to the version in Base and can hopefully be
# removed at some point. The main issue in Base is that it tests for
# isleaftype(T) which is false for Union{T,Null}. If the test in Base
# isleaftype(T) which is false for Union{T,Missing}. If the test in Base
# can be modified to cover simple unions of leaftypes then this method
# can probably be deleted and the two _t methods adjusted to match the Base
# invocation from Base.Broadcast.broadcast_c
@@ -214,5 +214,5 @@ end
end

# This one is much faster than normal broadcasting but the method won't get called
# in fusing operations like (!).(isnull.(x))
Base.broadcast(::typeof(isnull), da::DataArray) = copy(da.na)
# in fusing operations like (!).(ismissing.(x))
Base.broadcast(::typeof(ismissing), da::DataArray) = copy(da.na)
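
The specialised method above is fast because a `DataArray` already stores its missingness as a bit mask (`da.na`), so answering "which elements are missing?" is a copy of that mask and no value is inspected. The sketch below illustrates the layout with a hypothetical `MaskedVector` type, assuming Julia 1.0 or later. It is not the package's `DataArray`, which has more fields and methods.

```julia
# Hypothetical type, for illustration only (assumes Julia >= 1.0).
struct MaskedVector{T}
    data::Vector{T}   # stored values; entries under a set mask bit are placeholders
    na::BitVector     # true where the element is missing
end

# Element access consults the mask first.
getvalue(v::MaskedVector, i::Integer) = v.na[i] ? missing : v.data[i]

# The missingness query never touches `data`: it returns a copy of the mask.
missingmask(v::MaskedVector) = copy(v.na)

mv = MaskedVector([1, 2, 1, 4], BitVector([false, false, true, false]))
m = missingmask(mv)
m[1] = true   # mutating the copy leaves the stored mask unchanged
```

Returning a copy rather than the mask itself keeps callers from changing which elements count as missing by mutating the result.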