Commit

Port from Nullable to Union{Null, T}
This requires Nulls, as well as new versions of CategoricalArrays,
DataStreams and WeakRefStrings.
quinnj authored and nalimilan committed Sep 3, 2017
1 parent cc3c880 commit 6035da8
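
As a rough illustration of what the port means for user code, here is a minimal sketch. It assumes Julia 0.6 with Nulls 0.0.6 installed (the versions pinned in REQUIRE) and uses only calls that appear in the documentation updated by this commit; the variable names are hypothetical.

```julia
# Sketch only: assumes Julia 0.6 and Nulls 0.0.6, as pinned in REQUIRE.
using Nulls

# Before the port, missing data was wrapped, e.g. Nullable{Int}[1, 2, Nullable()].
# After it, a plain Vector holds either an Int or the `null` singleton.
x = [1, 2, null]
@assert eltype(x) == Union{Null, Int}

# Skip nulls before reducing, or substitute a value for them.
total = sum(Nulls.skip(x))              # 3, matching the updated getting_started.md
filled = collect(Nulls.replace(x, 0))   # [1, 2, 0]
```

The old `dropnull(v)` calls in the docs map onto `Nulls.skip(v)`, which is the substitution most of the hunks below make.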
Showing 38 changed files with 1,360 additions and 1,370 deletions.
8 changes: 4 additions & 4 deletions REQUIRE
@@ -1,8 +1,8 @@
julia 0.6
NullableArrays 0.1.1
CategoricalArrays 0.1.2
Nulls 0.0.6
CategoricalArrays 0.2.0
StatsBase 0.11.0
SortingAlgorithms
Reexport
WeakRefStrings 0.1.3
DataStreams 0.1.0
WeakRefStrings 0.3.0
DataStreams 0.2.0
16 changes: 11 additions & 5 deletions docs/src/man/categorical.md
@@ -7,14 +7,20 @@ v = ["Group A", "Group A", "Group A",
"Group B", "Group B", "Group B"]
```

The naive encoding used in an `Array` or in a `NullableArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `CategoricalArray` type does:
The naive encoding used in an `Array` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `CategoricalArray` type does:

```julia
cv = CategoricalArray(["Group A", "Group A", "Group A",
"Group B", "Group B", "Group B"])
```

A companion type, `NullableCategoricalArray`, allows storing missing values in the array: is to `CategoricalArray` what `NullableArray` is to the standard `Array` type.
`CategoricalArrays` support missing values via the `Nulls` package.

```julia
using Nulls
cv = CategoricalArray(["Group A", null, "Group A",
"Group B", "Group B", null])
```

In addition to representing repeated data efficiently, the `CategoricalArray` type allows us to determine efficiently the allowed levels of the variable at any time using the `levels` function (note that levels may or may not be actually used in the data):

@@ -30,7 +36,7 @@ By default, a `CategoricalArray` is able to represent 2<sup>32</sup>differents l
cv = compact(cv)
```

Often, you will have factors encoded inside a DataFrame with `Array` or `NullableArray` columns instead of `CategoricalArray` or `NullableCategoricalArray` columns. You can do conversion of a single column using the `categorical` function:
Often, you will have factors encoded inside a DataFrame with `Array` columns instead of `CategoricalArray` columns. You can do conversion of a single column using the `categorical` function:

```julia
cv = categorical(v)
@@ -44,6 +50,6 @@ df = DataFrame(A = [1, 1, 1, 2, 2, 2],
categorical!(df, [:A, :B])
```

Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` and `NullableCategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`/`NullableCategoricalArray`. This allows one to analyze categorical data efficiently.
Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`. This allows one to analyze categorical data efficiently.

See the [CategoricalArrays package](https://github.com/nalimilan/CategoricalArrays.jl) for more information regarding categorical arrays.
See the [CategoricalArrays package](https://github.com/JuliaStats/CategoricalArrays.jl) for more information regarding categorical arrays.
125 changes: 74 additions & 51 deletions docs/src/man/getting_started.md
@@ -2,88 +2,109 @@

## Installation

The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using NullableArrays, DataFrames` to bring all of the relevant variables into your current namespace. In addition, we will make use of the `RDatasets` package, which provides access to hundreds of classical data sets.
The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using DataFrames` to bring all of the relevant variables into your current namespace.

## The `Nullable` Type
## The `Null` Type

To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`:
To get started, let's examine the `Null` type. `Null` is a type implemented by [Nulls.jl](https://github.com/JuliaData/Nulls.jl) to represent missing data. `null` is an instance of the type `Null` used to represent a missing value.

```julia
Nullable(1)
```
julia> using DataFrames

And this represents a missing value:
```julia
Nullable()
```
julia> null
null

`Nullable` objects support all standard operators, which return another `Nullable`. One of the essential properties of `null` values is that they poison other items. To see this, try to add something like `Nullable(1)` to `Nullable()`:
julia> typeof(null)
Nulls.Null

```julia
Nullable(1) + Nullable()
```

The `get` function can be used to extract the value from the [`Nullable`](http://docs.julialang.org/en/stable/manual/types/#nullable-types-representing-missing-values) wrapper when it is not null. For example:
The `Null` type lets users create `Vector`s and `DataFrame` columns with missing values. Here we create a vector with a null value and the element-type of the returned vector is `Union{Nulls.Null, Int64}`.

```julia
julia> a = Nullable("14:00:00")
Nullable{String}("14:00:00")
julia> x = [1, 2, null]
3-element Array{Union{Nulls.Null, Int64},1}:
1
2
null

julia> b = get(a)
"14:00:00"
julia> eltype(x)
Union{Nulls.Null, Int64}

julia> typeof(b)
String
```
julia> Union{Null, Int}
Union{Nulls.Null, Int64}

Note that operations mixing `Nullable` and scalars (e.g. `1 + Nullable(1)`) are not supported.
julia> eltype(x) == Union{Null, Int}
true

## The `NullableArray` Type
```

`Nullable` objects can be stored in a standard `Array` just like any value:
`null` values can be excluded when performing operations by using `Nulls.skip`, which returns a memory-efficient iterator.

```julia
v = Nullable{Int}[1, 3, 4, 5, 4]
julia> Nulls.skip(x)
Base.Generator{Base.Iterators.Filter{Nulls.##4#6{Nulls.Null},Array{Union{Nulls.Null, Int64},1}},Nulls.##3#5}(Nulls.#3, Base.Iterators.Filter{Nulls.##4#6{Nulls.Null},Array{Union{Nulls.Null, Int64},1}}(Nulls.#4, Union{Nulls.Null, Int64}[1, 2, null]))

```
But arrays of `Nullable` are inefficient, both in terms of computation costs and of memory use. `NullableArrays` provide a more efficient storage, and behave like `Array{Nullable}` objects.
The output of `Nulls.skip` can be passed directly into functions as an argument. For example, we can find the `sum` of all non-null values or `collect` the non-null values into a new null-free vector.
```julia
nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
```
julia> sum(Nulls.skip(x))
3

In many cases we're willing to just ignore missing values and remove them from our vector. We can do that using the `dropnull` function:
julia> collect(Nulls.skip(x))
2-element Array{Int64,1}:
1
2

```julia
dropnull(nv)
mean(dropnull(nv))
```
Instead of removing `null` values, you can try to convert the `NullableArray` into a normal Julia `Array` using `convert`:
`null` elements can be replaced with other values via `Nulls.replace`.
```julia
convert(Array, nv)
julia> collect(Nulls.replace(x, 1))
3-element Array{Int64,1}:
1
2
1

```
This fails in the presence of `null` values, but will succeed if there are no `null` values:
The function `Nulls.T` returns the element-type `T` in `Union{T, Null}`.
```julia
nv[1] = 3
convert(Array, nv)
julia> Nulls.T(eltype(x))
Int64

```
In addition to removing `null` values and hoping they won't occur, you can also replace any `null` values using the `convert` function, which takes a replacement value as an argument:
Use `nulls` to generate nullable `Vector`s and `Array`s, using the optional first argument to specify the element-type.
```julia
nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
mean(convert(Array, nv, 0))
```
julia> nulls(1)
1-element Array{Nulls.Null,1}:
null

julia> nulls(3)
3-element Array{Nulls.Null,1}:
null
null
null

julia> nulls(1, 3)
1×3 Array{Nulls.Null,2}:
null null null

Which strategy for dealing with `null` values is most appropriate will typically depend on the specific details of your data analysis pathway.
julia> nulls(Int, 1, 3)
1×3 Array{Union{Nulls.Null, Int64},2}:
null null null

```
## The `DataFrame` Type
The `DataFrame` type can be used to represent data tables, each column of which is an array (by default, a `NullableArray`). You can specify the columns using keyword arguments:
The `DataFrame` type can be used to represent data tables, each column of which is a vector. You can specify the columns using keyword arguments:
```julia
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
@@ -123,27 +144,27 @@ describe(df)
To focus our search, we start looking at just the means and medians of specific columns. In the example below, we use numeric indexing to access the columns of the `DataFrame`:
```julia
mean(dropnull(df[1]))
median(dropnull(df[1]))
mean(Nulls.skip(df[1]))
median(Nulls.skip(df[1]))
```
We could also have used column names to access individual columns:
```julia
mean(dropnull(df[:A]))
median(dropnull(df[:A]))
mean(Nulls.skip(df[:A]))
median(Nulls.skip(df[:A]))
```
We can also apply a function to each column of a `DataFrame` with the `colwise` function. For example:
```julia
df = DataFrame(A = 1:4, B = randn(4))
colwise(c->cumsum(dropnull(c)), df)
colwise(c->cumsum(Nulls.skip(c)), df)
```
## Importing and Exporting Data (I/O)
For reading and writing tabular data from CSV and other delimited text files, use the [CSV.jl](https://github.com/JuliaStats/CSV.jl) package.
For reading and writing tabular data from CSV and other delimited text files, use the [CSV.jl](https://github.com/JuliaData/CSV.jl) package.
If you have not used the CSV.jl package before then you may need to download it first.
```julia
@@ -178,9 +199,7 @@ For more information, use the REPL [help-mode](http://docs.julialang.org/en/stab
## Accessing Classic Data Sets
To see more of the functionality for working with `DataFrame` objects, we need a more complex data set to work with. We'll use the `RDatasets` package, which provides access to many of the classical data sets that are available in R.

For example, we can access Fisher's iris data set using the following functions:
To see more of the functionality for working with `DataFrame` objects, we need a more complex data set to work with. We can access Fisher's iris data set using the following functions:
```julia
using CSV
@@ -194,4 +213,8 @@ In the next section, we'll discuss generic I/O strategy for reading and writing
While the `DataFrames` package provides basic data manipulation capabilities, users are encouraged to use the following packages for more powerful and complete data querying functionality in the spirit of [dplyr](https://github.com/hadley/dplyr) and [LINQ](https://msdn.microsoft.com/en-us/library/bb397926.aspx):
## Querying DataFrames
While the `DataFrames` package provides basic data manipulation capabilities, users are encouraged to use the following packages for more powerful and complete data querying functionality in the spirit of [dplyr](https://github.com/hadley/dplyr) and [LINQ](https://msdn.microsoft.com/en-us/library/bb397926.aspx):
- [Query.jl](https://github.com/davidanthoff/Query.jl) provides a LINQ like interface to a large number of data sources, including `DataFrame` instances.
2 changes: 1 addition & 1 deletion docs/src/man/joins.md
@@ -51,7 +51,7 @@ Cross joins are the only kind of join that does not use a key:
join(a, b, kind = :cross)
```

In order to join data frames on keys which have different names, you must first rename them so that they match. This can be done using rename!:
In order to join data tables on keys which have different names, you must first rename them so that they match. This can be done using rename!:

```julia
a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
2 changes: 1 addition & 1 deletion docs/src/man/reshaping_and_pivoting.md
@@ -80,6 +80,6 @@ None of these reshaping functions perform any aggregation. To do aggregation, us

```julia
d = stack(iris)
x = by(d, [:variable, :Species], df -> DataFrame(vsum = mean(dropnull(df[:value]))))
x = by(d, [:variable, :Species], df -> DataFrame(vsum = mean(Nulls.skip(df[:value]))))
unstack(x, :Species, :vsum)
```
6 changes: 3 additions & 3 deletions docs/src/man/split_apply_combine.md
@@ -12,15 +12,15 @@ using CSV
iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"), DataFrame)

by(iris, :Species, size)
by(iris, :Species, df -> mean(dropnull(df[:PetalLength])))
by(iris, :Species, df -> mean(Nulls.skip(df[:PetalLength])))
by(iris, :Species, df -> DataFrame(N = size(df, 1)))
```

The `by` function also support the `do` block form:

```julia
by(iris, :Species) do df
DataFrame(m = mean(dropnull(df[:PetalLength])), s² = var(dropnull(df[:PetalLength])))
DataFrame(m = mean(Nulls.skip(df[:PetalLength])), s² = var(Nulls.skip(df[:PetalLength])))
end
```

@@ -30,7 +30,7 @@ We show several examples of the `aggregate` function applied to the `iris` datas

```julia
aggregate(iris, :Species, sum)
aggregate(iris, :Species, [sum, x->mean(dropnull(x))])
aggregate(iris, :Species, [sum, x->mean(Nulls.skip(x))])
```

If you only want to split the data set into subsets, use the `groupby` function:
8 changes: 4 additions & 4 deletions docs/src/man/subsets.md
@@ -25,7 +25,7 @@ Referring to the first column by index or name:

```julia
julia> df[1]
10-element NullableArrays.NullableArray{Int64,1}:
10-element Array{Int64,1}:
1
2
3
@@ -38,7 +38,7 @@ julia> df[1]
10

julia> df[:A]
10-element NullableArrays.NullableArray{Int64,1}:
10-element Array{Int64,1}:
1
2
3
@@ -55,10 +55,10 @@ Refering to the first element of the first column:

```julia
julia> df[1, 1]
Nullable{Int64}(1)
1

julia> df[1, :A]
Nullable{Int64}(1)
1
```

Selecting a subset of rows by index and an (ordered) subset of columns by name:
22 changes: 7 additions & 15 deletions src/DataFrames.jl
@@ -1,5 +1,4 @@
VERSION >= v"0.4.0-dev+6521" && __precompile__(true)

__precompile__(true)
module DataFrames

##############################################################################
@@ -8,12 +7,9 @@ module DataFrames
##
##############################################################################

using Reexport
using StatsBase
import NullableArrays: dropnull, dropnull!
@reexport using NullableArrays
@reexport using CategoricalArrays
using SortingAlgorithms
using Reexport, StatsBase, SortingAlgorithms
@reexport using CategoricalArrays, Nulls

using Base: Sort, Order
import Base: ==, |>

@@ -23,13 +19,7 @@ import Base: ==, |>
##
##############################################################################

export @~,
@csv_str,
@csv2_str,
@tsv_str,
@wsv_str,

AbstractDataFrame,
export AbstractDataFrame,
DataFrame,
DataFrameRow,
GroupApplied,
@@ -80,6 +70,8 @@ export @~,
##
##############################################################################

const _displaysize = Base.displaysize

for (dir, filename) in [
("other", "utils.jl"),
("other", "index.jl"),
