Port from Nullable to Union{Null, T}

This requires Nulls, as well as new versions of CategoricalArrays, DataStreams and WeakRefStrings.
JuliaData · Sep 3, 2017 · 6035da8
1 parent cc3c880
Showing 38 changed files with 1,360 additions and 1,370 deletions.
diff --git a/REQUIRE b/REQUIRE
@@ -1,8 +1,8 @@
 julia 0.6
-NullableArrays 0.1.1
-CategoricalArrays 0.1.2
+Nulls 0.0.6
+CategoricalArrays 0.2.0
 StatsBase 0.11.0
 SortingAlgorithms
 Reexport
-WeakRefStrings 0.1.3
-DataStreams 0.1.0
+WeakRefStrings 0.3.0
+DataStreams 0.2.0
diff --git a/docs/src/man/categorical.md b/docs/src/man/categorical.md
@@ -7,14 +7,20 @@ v = ["Group A", "Group A", "Group A",
      "Group B", "Group B", "Group B"]
 ```
 
-The naive encoding used in an `Array` or in a `NullableArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `CategoricalArray` type does:
+The naive encoding used in an `Array` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `CategoricalArray` type does:
 
 ```julia
 cv = CategoricalArray(["Group A", "Group A", "Group A",
                        "Group B", "Group B", "Group B"])
 ```
 
-A companion type, `NullableCategoricalArray`, allows storing missing values in the array: is to `CategoricalArray` what `NullableArray` is to the standard `Array` type.
+`CategoricalArrays` support missing values via the `Nulls` package.
+
+```julia
+using Nulls
+cv = CategoricalArray(["Group A", null, "Group A",
+                       "Group B", "Group B", null])
+```
 
 In addition to representing repeated data efficiently, the `CategoricalArray` type allows us to determine efficiently the allowed levels of the variable at any time using the `levels` function (note that levels may or may not be actually used in the data):
 
@@ -30,7 +36,7 @@ By default, a `CategoricalArray` is able to represent 2<sup>32</sup>differents l
 cv = compact(cv)
 ```
 
-Often, you will have factors encoded inside a DataFrame with `Array` or `NullableArray` columns instead of `CategoricalArray` or `NullableCategoricalArray` columns. You can do conversion of a single column using the `categorical` function:
+Often, you will have factors encoded inside a DataFrame with `Array` columns instead of `CategoricalArray` columns. You can do conversion of a single column using the `categorical` function:
 
 ```julia
 cv = categorical(v)
@@ -44,6 +50,6 @@ df = DataFrame(A = [1, 1, 1, 2, 2, 2],
 categorical!(df, [:A, :B])
 ```
 
-Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` and `NullableCategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`/`NullableCategoricalArray`. This allows one to analyze categorical data efficiently.
+Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`. This allows one to analyze categorical data efficiently.
 
-See the [CategoricalArrays package](https://github.com/nalimilan/CategoricalArrays.jl) for more information regarding categorical arrays.
+See the [CategoricalArrays package](https://github.com/JuliaStats/CategoricalArrays.jl) for more information regarding categorical arrays.
diff --git a/docs/src/man/getting_started.md b/docs/src/man/getting_started.md
@@ -2,88 +2,109 @@
 
 ## Installation
 
-The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using NullableArrays, DataFrames` to bring all of the relevant variables into your current namespace. In addition, we will make use of the `RDatasets` package, which provides access to hundreds of classical data sets.
+The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using DataFrames` to bring all of the relevant variables into your current namespace.
 
-## The `Nullable` Type
+## The `Null` Type
 
-To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`:
+To get started, let's examine the `Null` type. `Null` is a type implemented by [Nulls.jl](https://github.com/JuliaData/Nulls.jl) to represent missing data. `null` is an instance of the type `Null` used to represent a missing value.
 
 ```julia
-Nullable(1)
-```
+julia> using DataFrames
 
-And this represents a missing value:
-```julia
-Nullable()
-```
+julia> null
+null
 
-`Nullable` objects support all standard operators, which return another `Nullable`. One of the essential properties of `null` values is that they poison other items. To see this, try to add something like `Nullable(1)` to `Nullable()`:
+julia> typeof(null)
+Nulls.Null
 
-```julia
-Nullable(1) + Nullable()
 ```
 
-The `get` function can be used to extract the value from the [`Nullable`](http://docs.julialang.org/en/stable/manual/types/#nullable-types-representing-missing-values) wrapper when it is not null. For example:
+The `Null` type lets users create `Vector`s and `DataFrame` columns with missing values. Here we create a vector with a null value and the element-type of the returned vector is `Union{Nulls.Null, Int64}`.
 
 ```julia
-julia> a = Nullable("14:00:00")
-Nullable{String}("14:00:00")
+julia> x = [1, 2, null]
+3-element Array{Union{Nulls.Null, Int64},1}:
+ 1
+ 2
+  null
 
-julia> b = get(a)
-"14:00:00"
+julia> eltype(x)
+Union{Nulls.Null, Int64}
 
-julia> typeof(b)
-String
-```
+julia> Union{Null, Int}
+Union{Nulls.Null, Int64}
 
-Note that operations mixing `Nullable` and scalars (e.g. `1 + Nullable(1)`) are not supported.
+julia> eltype(x) == Union{Null, Int}
+true
 
-## The `NullableArray` Type
+```
 
-`Nullable` objects can be stored in a standard `Array` just like any value:
+`null` values can be excluded when performing operations by using `Nulls.skip`, which returns a memory-efficient iterator.
 
 ```julia
-v = Nullable{Int}[1, 3, 4, 5, 4]
+julia> Nulls.skip(x)
+Base.Generator{Base.Iterators.Filter{Nulls.##4#6{Nulls.Null},Array{Union{Nulls.Null, Int64},1}},Nulls.##3#5}(Nulls.#3, Base.Iterators.Filter{Nulls.##4#6{Nulls.Null},Array{Union{Nulls.Null, Int64},1}}(Nulls.#4, Union{Nulls.Null, Int64}[1, 2, null]))
+
 ```
 
-But arrays of `Nullable` are inefficient, both in terms of computation costs and of memory use. `NullableArrays` provide a more efficient storage, and behave like `Array{Nullable}` objects.
+The output of `Nulls.skip` can be passed directly into functions as an argument. For example, we can find the `sum` of all non-null values or `collect` the non-null values into a new null-free vector.
 
 ```julia
-nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
-```
+julia> sum(Nulls.skip(x))
+3
 
-In many cases we're willing to just ignore missing values and remove them from our vector. We can do that using the `dropnull` function:
+julia> collect(Nulls.skip(x))
+2-element Array{Int64,1}:
+ 1
+ 2
 
-```julia
-dropnull(nv)
-mean(dropnull(nv))
 ```
 
-Instead of removing `null` values, you can try to convert the `NullableArray` into a normal Julia `Array` using `convert`:
+`null` elements can be replaced with other values via `Nulls.replace`.
 
 ```julia
-convert(Array, nv)
+julia> collect(Nulls.replace(x, 1))
+3-element Array{Int64,1}:
+ 1
+ 2
+ 1
+
 ```
 
-This fails in the presence of `null` values, but will succeed if there are no `null` values:
+The function `Nulls.T` returns the element-type `T` in `Union{T, Null}`.
 
 ```julia
-nv[1] = 3
-convert(Array, nv)
+julia> Nulls.T(eltype(x))
+Int64
+
 ```
 
-In addition to removing `null` values and hoping they won't occur, you can also replace any `null` values using the `convert` function, which takes a replacement value as an argument:
+Use `nulls` to generate nullable `Vector`s and `Array`s, using the optional first argument to specify the element-type.
 
 ```julia
-nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
-mean(convert(Array, nv, 0))
-```
+julia> nulls(1)
+1-element Array{Nulls.Null,1}:
+ null
+
+julia> nulls(3)
+3-element Array{Nulls.Null,1}:
+ null
+ null
+ null
+
+julia> nulls(1, 3)
+1×3 Array{Nulls.Null,2}:
+ null  null  null
 
-Which strategy for dealing with `null` values is most appropriate will typically depend on the specific details of your data analysis pathway.
+julia> nulls(Int, 1, 3)
+1×3 Array{Union{Nulls.Null, Int64},2}:
+ null  null  null
+
+```
 
 ## The `DataFrame` Type
 
-The `DataFrame` type can be used to represent data tables, each column of which is an array (by default, a `NullableArray`). You can specify the columns using keyword arguments:
+The `DataFrame` type can be used to represent data tables, each column of which is a vector. You can specify the columns using keyword arguments:
 
 ```julia
 df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
@@ -123,27 +144,27 @@ describe(df)
 To focus our search, we start looking at just the means and medians of specific columns. In the example below, we use numeric indexing to access the columns of the `DataFrame`:
 
 ```julia
-mean(dropnull(df[1]))
-median(dropnull(df[1]))
+mean(Nulls.skip(df[1]))
+median(Nulls.skip(df[1]))
 ```
 
 We could also have used column names to access individual columns:
 
 ```julia
-mean(dropnull(df[:A]))
-median(dropnull(df[:A]))
+mean(Nulls.skip(df[:A]))
+median(Nulls.skip(df[:A]))
 ```
 
 We can also apply a function to each column of a `DataFrame` with the `colwise` function. For example:
 
 ```julia
 df = DataFrame(A = 1:4, B = randn(4))
-colwise(c->cumsum(dropnull(c)), df)
+colwise(c->cumsum(Nulls.skip(c)), df)
 ```
 
 ## Importing and Exporting Data (I/O)
 
-For reading and writing tabular data from CSV and other delimited text files, use the [CSV.jl](https://github.com/JuliaStats/CSV.jl) package.
+For reading and writing tabular data from CSV and other delimited text files, use the [CSV.jl](https://github.com/JuliaData/CSV.jl) package.
 
 If you have not used the CSV.jl package before then you may need to download it first.
 ```julia
@@ -178,9 +199,7 @@ For more information, use the REPL [help-mode](http://docs.julialang.org/en/stab
 
 ## Accessing Classic Data Sets
 
-To see more of the functionality for working with `DataFrame` objects, we need a more complex data set to work with. We'll use the `RDatasets` package, which provides access to many of the classical data sets that are available in R.
-
-For example, we can access Fisher's iris data set using the following functions:
+To see more of the functionality for working with `DataFrame` objects, we need a more complex data set to work with. We can access Fisher's iris data set using the following functions:
 
 ```julia
 using CSV
@@ -194,4 +213,8 @@ In the next section, we'll discuss generic I/O strategy for reading and writing
 
 While the `DataFrames` package provides basic data manipulation capabilities, users are encouraged to use the following packages for more powerful and complete data querying functionality in the spirit of [dplyr](https://github.com/hadley/dplyr) and [LINQ](https://msdn.microsoft.com/en-us/library/bb397926.aspx):
 
+## Querying DataFrames
+
+While the `DataFrames` package provides basic data manipulation capabilities, users are encouraged to use the following packages for more powerful and complete data querying functionality in the spirit of [dplyr](https://github.com/hadley/dplyr) and [LINQ](https://msdn.microsoft.com/en-us/library/bb397926.aspx):
+
 - [Query.jl](https://github.com/davidanthoff/Query.jl) provides a LINQ like interface to a large number of data sources, including `DataFrame` instances.
diff --git a/docs/src/man/joins.md b/docs/src/man/joins.md
@@ -51,7 +51,7 @@ Cross joins are the only kind of join that does not use a key:
 join(a, b, kind = :cross)
 ```
 
-In order to join data frames on keys which have different names, you must first rename them so that they match. This can be done using rename!:
+In order to join data tables on keys which have different names, you must first rename them so that they match. This can be done using rename!:
 
 ```julia
 a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])

diff --git a/docs/src/man/reshaping_and_pivoting.md b/docs/src/man/reshaping_and_pivoting.md
@@ -80,6 +80,6 @@ None of these reshaping functions perform any aggregation. To do aggregation, us
 
 ```julia
 d = stack(iris)
-x = by(d, [:variable, :Species], df -> DataFrame(vsum = mean(dropnull(df[:value]))))
+x = by(d, [:variable, :Species], df -> DataFrame(vsum = mean(Nulls.skip(df[:value]))))
 unstack(x, :Species, :vsum)
 ```
diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
@@ -12,15 +12,15 @@ using CSV
 iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"), DataFrame)
 
 by(iris, :Species, size)
-by(iris, :Species, df -> mean(dropnull(df[:PetalLength])))
+by(iris, :Species, df -> mean(Nulls.skip(df[:PetalLength])))
 by(iris, :Species, df -> DataFrame(N = size(df, 1)))
 ```
 
 The `by` function also support the `do` block form:
 
 ```julia
 by(iris, :Species) do df
-   DataFrame(m = mean(dropnull(df[:PetalLength])), s² = var(dropnull(df[:PetalLength])))
+   DataFrame(m = mean(Nulls.skip(df[:PetalLength])), s² = var(Nulls.skip(df[:PetalLength])))
 end
 ```
 
@@ -30,7 +30,7 @@ We show several examples of the `aggregate` function applied to the `iris` datas
 
 ```julia
 aggregate(iris, :Species, sum)
-aggregate(iris, :Species, [sum, x->mean(dropnull(x))])
+aggregate(iris, :Species, [sum, x->mean(Nulls.skip(x))])
 ```
 
 If you only want to split the data set into subsets, use the `groupby` function:

diff --git a/docs/src/man/subsets.md b/docs/src/man/subsets.md
@@ -25,7 +25,7 @@ Referring to the first column by index or name:
 
 ```julia
 julia> df[1]
-10-element NullableArrays.NullableArray{Int64,1}:
+10-element Array{Int64,1}:
   1
   2
   3
@@ -38,7 +38,7 @@ julia> df[1]
  10
 
 julia> df[:A]
-10-element NullableArrays.NullableArray{Int64,1}:
+10-element Array{Int64,1}:
   1
   2
   3
@@ -55,10 +55,10 @@ Refering to the first element of the first column:
 
 ```julia
 julia> df[1, 1]
-Nullable{Int64}(1)
+1
 
 julia> df[1, :A]
-Nullable{Int64}(1)
+1
 ```
 
 Selecting a subset of rows by index and an (ordered) subset of columns by name:

diff --git a/src/DataFrames.jl b/src/DataFrames.jl
@@ -1,5 +1,4 @@
-VERSION >= v"0.4.0-dev+6521" && __precompile__(true)
-
+__precompile__(true)
 module DataFrames
 
 ##############################################################################
@@ -8,12 +7,9 @@ module DataFrames
 ##
 ##############################################################################
 
-using Reexport
-using StatsBase
-import NullableArrays: dropnull, dropnull!
-@reexport using NullableArrays
-@reexport using CategoricalArrays
-using SortingAlgorithms
+using Reexport, StatsBase, SortingAlgorithms
+@reexport using CategoricalArrays, Nulls
+
 using Base: Sort, Order
 import Base: ==, |>
 
@@ -23,13 +19,7 @@ import Base: ==, |>
 ##
 ##############################################################################
 
-export @~,
-       @csv_str,
-       @csv2_str,
-       @tsv_str,
-       @wsv_str,
-
-       AbstractDataFrame,
+export AbstractDataFrame,
        DataFrame,
        DataFrameRow,
        GroupApplied,
@@ -80,6 +70,8 @@ export @~,
 ##
 ##############################################################################
 
+const _displaysize = Base.displaysize
+
 for (dir, filename) in [
         ("other", "utils.jl"),
         ("other", "index.jl"),