Switch to DataTables #20

alyst · 2016-10-01T15:54:23Z

Convert from using DataArrays to NullableArrays and NullableCategoricalArrays.
Fixes #19, closes #17.

alyst · 2016-10-01T16:01:28Z

JuliaData/DataFrames.jl#1008 is not tagged yet. What is the best way to use the current master of DataFrames.jl for Travis and AppVeyor?

ararslan · 2016-10-01T17:43:52Z

> what is the best way to use the current master of DataFrames.jl for Travis and AppVeyor?

I don't think you can... not sure though. I bet @tkelman would know.

nalimilan · 2016-10-01T18:17:42Z

src/convert.jl

+_zero_nas{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]
+
+# convert R factor into NullableCategoricalArray{String}
+# TODO option to convert into Symbol etc?


I don't think converting to symbols would be useful. The strings are stored in a pool anyway, so sharing storage doesn't gain us much, and symbols aren't really supposed to be used for user data.

nalimilan · 2016-10-01T18:23:35Z

src/convert.jl

 end

 function sexp2julia(rex::RSEXPREC)
    warn("Conversion of $(typeof(rex)) to Julia is not implemented")
    return nothing
 end

+# FIXME remove when anynull(NullableCategoricalArray) would be available
+_anynull{T,N,R}(A::NullableCategoricalArray{T,N,R}) = any(A.refs == zero(R))


Did you mean .==? Anyway, you can avoid an allocation using any(x -> x == 0, A.refs).

Adding anynull(x::NullableCategoricalArray) would be a good idea, though that will require adding NullableArrays as a dependency. Another approach would be to see whether we can get any(isnull, A) to be optimized automatically. Anyway, PR welcome in the meantime.

See JuliaData/CategoricalArrays.jl#25.
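For illustration, the two spellings can be compared on a plain vector of reference codes. This is a minimal sketch in current Julia syntax (not the 0.5 syntax of the diff), assuming, as in the discussion above, that a reference code of 0 marks a null entry. `refs` and `anynull_refs` are hypothetical names and not part of CategoricalArrays.

```julia
# Hypothetical stand-in for A.refs: a reference code of 0 marks a null entry.
refs = UInt32[1, 2, 0, 1]

# Broadcasting builds a temporary Boolean array, which is then reduced.
has_null_broadcast = any(refs .== zero(UInt32))

# Passing a predicate scans the vector directly, without the temporary array,
# and can stop at the first match.
anynull_refs(refs::AbstractVector{R}) where {R<:Integer} =
    any(x -> x == zero(R), refs)

has_null_predicate = anynull_refs(refs)
```

Both forms give the same answer; the predicate form only avoids the intermediate allocation.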

nalimilan · 2016-10-01T18:24:39Z

src/convert.jl

+
+# convert nullable array without nulls into non-nullable array
+# `A` is expected to contain no nulls
+_dropnonulls(A::NullableArray) = A.values


This already exists in NullableArrays.jl, it's called dropnull (and it should also be added to CategoricalArrays).

This is different from dropnull(): it's for simple cases when there are no NAs, so no searches or (re)allocations are required.

Ah, I see. Yet the name is confusing, as it definitely doesn't drop non-null values.

nalimilan · 2016-10-01T18:27:22Z

src/convert.jl

-    nas = namask(rv)
-    hasna = any(nas)
+    # FIXME forceNullable option to always convert to NullableArray
+    jv = _nullable(rv)


The new approach seems a bit wasteful, since you build a Nullable(Categorical)Array before potentially converting it back to a non-nullable array. You could continue checking namask first, and only build the nullable array if needed.

The idea is that _nullable() creates a nullable vector of the appropriate type (categorical or normal); if it turns out there are no nulls, _dropnonulls() converts it to a non-nullable version without reallocating. This would be handy when/if forceNullable is introduced.

You could still pass force_nullable=false to that function, and check for that before creating the nullable array.

nalimilan · 2016-10-01T18:30:48Z

test/RDA.jl

    df[:cplx] = Complex128[1.1+0.5im, 1.0im]
    @test isequal(sexp2julia(load("$testdir/data/types.rda",convert=false)["df"]), df)
    @test isequal(sexp2julia(load("$testdir/data/types_ascii.rda",convert=false)["df"]), df)

-    df[2, :] = NA
+    for col in DataFrames.columns(df)
+        col[2] = Nullable{eltype(col)}() # FIXME nullify!() is not supported by CategoricalArrays


I don't understand. Doesn't df[2, :] = Nullable() work here?

It does, thanks for the hint. Not used to the capabilities of NullableArrays yet.

nalimilan · 2016-10-01T18:31:50Z

test/RDA.jl

    append!(df, df[2, :])
    df[3, :num] = NaN
-    df[:, :cplx] = @data [NA, @compat(Complex128(1,NaN)), NaN]
+    df[:, :cplx] = NullableArray{Complex128}(Nullable{Complex128}[Nullable{Complex128}(), Complex128(1,NaN), NaN])


If you drop support for 0.4, NullableArray([Nullable(), Complex128(1,NaN), NaN]) should work.

nalimilan · 2016-10-01T18:32:15Z

test/RDA.jl

    @test isequal(sexp2julia(load("$testdir/data/NAs.rda",convert=false)["df"]), df)
    # ASCII format saves NaN as NA
-    df[3, :num] = NA
-    df[:, :cplx] = @data [NA, NA, NA]
+    df[3, :num] = Nullable{Complex128}()


Nullable() is enough here and below.

alyst · 2016-10-01T19:05:07Z

@nalimilan Thanks for the review! I've updated the PR.

nalimilan · 2016-10-01T20:12:19Z

src/convert.jl

-    hasna = any(nas)
+    # TODO dimnames?
+    # FIXME forceNullable option to always convert to NullableArray
+    jv = nullable_vector(rv)


I still think you could rename nullable_vector to something like vector, and have it return a non-nullable vector if no nulls are present.

nullable_vector() just extracts the data and the NA mask and packages them into the appropriate type; the real conversion logic is implemented by sexp2julia(), which takes into account the current context and user-specified options. I was thinking about returning a tuple of the data and the NA mask and constructing the appropriate vector/nullable vector in sexp2julia(), but that would not work for categorical arrays, because the pool of categories also has to be returned somehow.

Why not pass hasna to _nullable_vector and make return type choices there?

After a second thought... yes, that way it is definitely better. Thanks for your persistence ;)
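The design agreed on here (check for NAs first, and decide the return type inside the vector-building function) can be sketched as follows. This is only an illustration in current Julia, where `Union{T, Missing}` stands in for NullableArray; `julia_vector_sketch` and its `isna` argument are hypothetical and are not the functions from the PR.

```julia
# Hypothetical helper: `data` holds the raw values and `isna` tests one raw value.
# Union{T, Missing} is used as a stand-in for NullableArray.
function julia_vector_sketch(data::Vector{T}, isna::Function,
                             force_nullable::Bool) where {T}
    hasna = any(isna, data)          # scans the data; no mask is allocated
    if force_nullable || hasna
        # Build the null-aware vector only when it is actually needed.
        return Union{T, Missing}[isna(x) ? missing : x for x in data]
    else
        return data                  # no nulls: reuse the storage as is
    end
end
```

With this shape the caller gets a plain vector whenever no NA is present, and the flag forces the null-aware type regardless of the contents.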

nalimilan · 2016-10-01T20:13:26Z

src/convert.jl

+# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0
+na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]
+
+# convert R factor into NullableCategoricalArray{String}


Also converts to a NullableArray in some cases, apparently (but when can that happen?).

I've clarified the description, thanks.

ararslan · 2016-10-01T22:12:08Z

As for testing against DataFrames master, you should be able to edit the Travis and AppVeyor YAMLs to run Pkg.checkout("DataFrames", "master") after Pkg.build on this line (Travis) and also here (AV). Dunno if it's actually a good idea but it should work. ¯_(ツ)_/¯

alyst · 2016-10-01T22:17:15Z

@ararslan This PR should remove v0.4 support already (from REQUIRE and CI configs). I'm also not sure whether we should patch Travis and AppVeyor scripts just for the sake of seeing the green icon. The tests are passing for me locally. But thanks for the info.

ararslan · 2016-10-01T22:37:56Z

Could be worth it while you make updates to this branch, then you can just remove the commit before it's merged.

alyst · 2017-03-09T17:14:30Z

Updated the PR to use DataTables.jl.
Since DataTables.jl is already there, maybe we can just tag the new RData version?

cc @ararslan @nalimilan

tkelman · 2017-03-09T17:46:25Z

.travis.yml

-#  - julia --check-bounds=yes -e 'Pkg.clone(pwd()); Pkg.build("RData"); Pkg.test("RData"; coverage=true)'
+script:
+  - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
+  - julia --check-bounds=yes -e 'Pkg.clone(pwd()); Pkg.build("RData"); Pkg.checkout("DataTables", "master"); Pkg.test("RData"; coverage=true)'


do you still need the checkout?

Shouldn't be necessary since there's a tag for DataTables.

nalimilan · 2017-03-09T17:56:29Z

I'm not sure DataTables is ready yet. Maybe we should wait until more query frameworks work with it.

ararslan · 2017-03-09T18:00:11Z

This is another situation where I'm sad to see DataFrames support go. While it would be awesome to be able to use a table abstraction package here, so that one can create either a DataFrame or a DataTable using this package, I don't think it's viable given how closely the machinery here has to tie into the missing data machinery.

nalimilan · 2017-03-09T18:08:11Z

I guess both could be supported (without reexporting the packages' APIs), but we would still need to choose a default.

alyst · 2017-03-09T21:27:43Z

Would DataFrames be supported (developed) in the long term?
Both packages could be supported by RData. Though, implementation-wise, it's about supporting both DataArrays and NullableArrays+CategoricalArrays for representing NAs and factors.

If DataFrames development ceases, there could still be patch releases to the current (DataFrames-based) version of RData.

nalimilan · 2017-03-10T09:07:38Z

I think at some point only one package should remain (probably DataTables, though maybe not based on Nullable as it exists now). But supporting both DataArray and NullableArray shouldn't be too hard as their internal structures and constructors are very similar.

alyst · 2017-03-10T09:53:59Z

> supporting both DataArray and NullableArray shouldn't be too hard as their internal structures and constructors are very similar.

That's true. I'm just a little bit worried that the community efforts would be diffused by maintaining the interchangeability of the two package families that provide essentially the same functionality.

nalimilan · 2017-03-14T12:31:59Z

src/convert.jl

+end
+
+# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0
+na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]


ifelse could be faster than a branch here.
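A possible shape of the ifelse variant, written in current Julia syntax, is sketched below. It assumes that `R_NA_INT32` equals `typemin(Int32)` (R's integer NA), which the diff does not show. Because ifelse evaluates both of its value arguments, the NA is replaced while the value is still an Int32; converting first would throw for an unsigned `R`.

```julia
# Assumption: R's integer NA is the smallest Int32, as with R's NA_INTEGER.
const R_NA_INT32 = typemin(Int32)

# ifelse evaluates both value arguments, so replace the NA with 0 first
# and convert afterwards; R(typemin(Int32)) would fail for an unsigned R.
na2zero(::Type{R}, v::Vector{Int32}) where {R} =
    [R(ifelse(x == R_NA_INT32, zero(Int32), x)) for x in v]
```

Whether this is actually faster than the branch would need to be measured.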

nalimilan · 2017-03-14T12:32:10Z

src/convert.jl

+# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0
+na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]
+
+# convert to [Nullable]CategoricalArray{String} if `ri`is a factor,


Missing space before "is".

nalimilan · 2017-03-14T12:32:38Z

src/convert.jl

+# convert to [Nullable]CategoricalArray{String} if `ri`is a factor,
+# or to [Nullable]Array{Int32} otherwise
+function julia_vector(ri::RIntegerVector, force_nullable::Bool)
+    !isfactor(ri) && return _julia_vector(eltype(ri.data), ri, force_nullable) # not a factor


"# not a factor" is kind of redundant with the check.

nalimilan · 2017-03-14T12:33:30Z

src/convert.jl

+    # FIXME set ordered flag
+    refs = na2zero(REFTYPE, ri.data)
+    pool = CategoricalPool{String, REFTYPE}(rlevels)
+    (force_nullable || (findfirst(refs, zero(REFTYPE)) > 0)) ?


Would be more readable as an if.. else block.

nalimilan · 2017-03-14T12:41:32Z

src/convert.jl

-namask(ri::RIntegerVector) = BitArray(ri.data .== R_NA_INT32)
-namask(rn::RNumericVector) = BitArray(map(isna_float64, reinterpret(UInt64, rn.data)))
+namask(ri::RVector{Int32}) = [i == R_NA_INT32 for i in ri.data]
+namask(rn::RNumericVector) = map(isna_float64, reinterpret(UInt64, rn.data))
 # if re or im is NA, the whole complex number is NA
 # FIXME avoid temporary Vector{Bool}


This comment no longer applies since NullableArray currently uses a Vector{Bool}. Though the current approach is wasteful when there are no nulls, since it allocates the bit mask even though it's not used.

nalimilan · 2017-03-15T10:19:50Z

It appears the best solution would be to add DataStreams support to RData so that the result can be converted to any type, including DataFrame and DataTable. We should continue returning DataFrames by default to avoid breaking people's code, and add a new way of loading .RData files with DataStreams. Then we could deprecate the old interface. This means we will have to keep DataFrames support for some time.

@quinnj said he's going to work on DataStreams soon, which should improve the abstraction and make this possible/easier.

alyst force-pushed the ast/nullable_arrays branch from d79d1c7 to 61a9ea7 on October 1, 2016 15:56

nalimilan reviewed Oct 1, 2016

alyst force-pushed the ast/nullable_arrays branch from 61a9ea7 to 5f14a8b on October 1, 2016 19:04

nalimilan reviewed Oct 1, 2016

alyst force-pushed the ast/nullable_arrays branch 2 times, most recently from 0a71e9f to f03c9ce, on October 1, 2016 21:57

alyst force-pushed the ast/nullable_arrays branch 2 times, most recently from 3b81838 to 12f071b, on October 1, 2016 22:19

alyst force-pushed the ast/nullable_arrays branch from c74d185 to 6882872 on October 1, 2016 22:46

alyst force-pushed the ast/nullable_arrays branch from 6882872 to dc0bbab on March 9, 2017 16:24

alyst changed the title from "Update to DataFrames 0.8+" to "Switch to DataTables" on Mar 9, 2017

tkelman reviewed Mar 9, 2017

alyst force-pushed the ast/nullable_arrays branch from 5252fd1 to fb9bf08 on March 10, 2017 00:07

nalimilan reviewed Mar 14, 2017

remove ctor @compat for v0.4

b0048e1

alyst force-pushed the ast/nullable_arrays branch from fb9bf08 to 64da6be on July 13, 2017 14:48

alyst added 6 commits July 13, 2017 19:43

switch to DataTables

be72ad0

* drop Julia 0.4 support (since DataTables require Julia 0.5)
* convert from using DataArrays to NullableArrays/CategoricalArrays

temporarily require DataTables master for CI

fae1a59

convert logical vector to Vector{Bool} + tests

01ad191

update tests

10d9294

- use == instead of isequal()
- explicitly make the columns nullable

group tests into testsets

f53af10

implement reviewer suggestions

57aef24

alyst force-pushed the ast/nullable_arrays branch from 64da6be to 57aef24 on July 13, 2017 19:43

alyst closed this Sep 15, 2017

ararslan deleted the ast/nullable_arrays branch September 15, 2017 19:38
