Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to DataTables #20

Closed
wants to merge 7 commits into from
Closed

Switch to DataTables #20

wants to merge 7 commits into from

Conversation

alyst
Copy link
Collaborator

@alyst alyst commented Oct 1, 2016

Convert from using DataArrays to NullableArrays and NullableCategoricalArrays.
Fixes #19, closes #17.

@alyst
Copy link
Collaborator Author

alyst commented Oct 1, 2016

JuliaData/DataFrames.jl#1008 is not tagged yet, what is the best way to use the current master of DataFrames.jl for Travis and AppVeyor?

@ararslan
Copy link
Member

ararslan commented Oct 1, 2016

what is the best way to use the current master of DataFrames.jl for Travis and AppVeyor?

I don't think you can... not sure though. I bet @tkelman would know.

src/convert.jl Outdated
_zero_nas{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]

# convert R factor into NullableCategoricalArray{String}
# TODO option to convert into Symbol etc?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think converting to symbols would be useful. The strings are stored in a pool anyway, so sharing storage doesn't gain us much, and symbols aren't really supposed to be used for user data.

src/convert.jl Outdated
end

function sexp2julia(rex::RSEXPREC)
warn("Conversion of $(typeof(rex)) to Julia is not implemented")
return nothing
end

# FIXME remove when anynull(NullableCategoricalArray) would be available
_anynull{T,N,R}(A::NullableCategoricalArray{T,N,R}) = any(A.refs == zero(R))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean .==? Anyway, you can avoid an allocation using any(x -> x!=0, A.refs).

Adding anynull(x::NullableCategoricalArray) would be a good idea, though that will require adding NullableArrays as a dependency. Another approach would be to see whether we can get any(isnull, A) to be optimized automatically. Anyway, PR welcome in the meantime.

Copy link
Collaborator Author

@alyst alyst Oct 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/convert.jl Outdated

# convert nullable array without nulls into non-nullable array
# `A` is expected to contain no nulls
_dropnonulls(A::NullableArray) = A.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already exists in NullableArrays.jl, it's called dropnull (and it should also be added to CategoricalArrays).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is different from dropnull(), it's for simple cases when there are no NAs, so no searches/(re)allocations are required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Yet the name is confusing, as it definitely doesn't drop non-null values.

src/convert.jl Outdated
nas = namask(rv)
hasna = any(nas)
# FIXME forceNullable option to always convert to NullableArray
jv = _nullable(rv)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new approach seems a bit wasteful, since you build a Nullable(Categorical)Array before potentially converting it back to a non-nullable array. You could continue checking namask first, and only build the nullable array if needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that _nullable() handles the creation of a nullable vector of appropriate type (categorical/normal) and if it turns out there are no nulls, the conversion to a non-null version is done by _dropnonulls() (no reallocation needed). This would be handy when/if forceNullable is introduced.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could still pass force_nullable=false to that function, and check for that before creating the nullable array.

test/RDA.jl Outdated
df[:cplx] = Complex128[1.1+0.5im, 1.0im]
@test isequal(sexp2julia(load("$testdir/data/types.rda",convert=false)["df"]), df)
@test isequal(sexp2julia(load("$testdir/data/types_ascii.rda",convert=false)["df"]), df)

df[2, :] = NA
for col in DataFrames.columns(df)
col[2] = Nullable{eltype(col)}() # FIXME nullify!() is not supported by CategoricalArrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Doesn't df[2, :] = Nullable() work here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, thanks for the hint. Not used to the capabilities of NullableArrays yet.

test/RDA.jl Outdated
append!(df, df[2, :])
df[3, :num] = NaN
df[:, :cplx] = @data [NA, @compat(Complex128(1,NaN)), NaN]
df[:, :cplx] = NullableArray{Complex128}(Nullable{Complex128}[Nullable{Complex128}(), Complex128(1,NaN), NaN])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you drop support for 0.4, NullableArray([Nullable(), Complex128(1,NaN), NaN]) should work.

test/RDA.jl Outdated
@test isequal(sexp2julia(load("$testdir/data/NAs.rda",convert=false)["df"]), df)
# ASCII format saves NaN as NA
df[3, :num] = NA
df[:, :cplx] = @data [NA, NA, NA]
df[3, :num] = Nullable{Complex128}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nullable() is enough here and below.

@alyst
Copy link
Collaborator Author

alyst commented Oct 1, 2016

@nalimilan Thanks for the review! I've updated the PR.

src/convert.jl Outdated
hasna = any(nas)
# TODO dimnames?
# FIXME forceNullable option to always convert to NullableArray
jv = nullable_vector(rv)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think you could rename nullable_vector to something like vector, and have it return a non-nullable vector if no nulls are present.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullable_vector() just extracts the data and NA mask and packages it into appropriate type, the real conversion logic is implemented by sexp2julia() that takes into account the current context and user-specified options. I was thinking about returning the tuple of data and NA mask and constructing the appropriate vector/nullable vector in sexp2julia(), but it would not work for categorical arrays, because the pool of categories also has to be returned somehow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not pass hasna to _nullable_vector and make return type choices there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a second thought... yes, that way it is definitely better. Thanks for your persistence ;)

src/convert.jl Outdated
# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0
na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]

# convert R factor into NullableCategoricalArray{String}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also converts to a NullableArrays in some cases apparently (but in when can that happen ?).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've clarified the description, thanks.

@alyst alyst force-pushed the ast/nullable_arrays branch 2 times, most recently from 0a71e9f to f03c9ce Compare October 1, 2016 21:57
@ararslan
Copy link
Member

ararslan commented Oct 1, 2016

As for testing against DataFrames master, you should be able to edit the Travis and AppVeyor YAMLs to run Pkg.checkout("DataFrames", "master") after Pkg.build on this line (Travis) and also here (AV). Dunno if it's actually a good idea but it should work. ¯_(ツ)_/¯

@alyst
Copy link
Collaborator Author

alyst commented Oct 1, 2016

@ararslan This PR should remove v0.4 support already (from REQUIRE and CI configs). I'm also not sure whether we should patch Travis and AppVeyor scripts just for the sake of seeing the green icon. The tests are passing for me locally. But thanks for the info.

@alyst alyst force-pushed the ast/nullable_arrays branch 2 times, most recently from 3b81838 to 12f071b Compare October 1, 2016 22:19
@ararslan
Copy link
Member

ararslan commented Oct 1, 2016

Could be worth it while you make updates to this branch, then you can just remove the commit before it's merged.

@alyst alyst changed the title Update to DataFrames 0.8+ Switch to DataTables Mar 9, 2017
@alyst
Copy link
Collaborator Author

alyst commented Mar 9, 2017

Updated the PR to use DataTables.jl.
Since DataTables.jl is already there, maybe we can just tag the new RData version?

cc @ararslan @nalimilan

# - julia --check-bounds=yes -e 'Pkg.clone(pwd()); Pkg.build("RData"); Pkg.test("RData"; coverage=true)'
script:
- if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
- julia --check-bounds=yes -e 'Pkg.clone(pwd()); Pkg.build("RData"); Pkg.checkout("DataTables", "master"); Pkg.test("RData"; coverage=true)'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still need the checkout?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be necessary since there's a tag for DataTables.

@nalimilan
Copy link
Member

I'm not sure DataTables is ready yet. Maybe we should wait until more query frameworks work with it.

@ararslan
Copy link
Member

ararslan commented Mar 9, 2017

This is another situation where I'm sad to see DataFrames support go. While it would be awesome to be able to use a table abstraction package here, so that one can create either a DataFrame or a DataTable using this package, I don't think it's viable given how closely the machinery here has to tie into the missing data machinery.

@nalimilan
Copy link
Member

I guess both could be supported (without reexporting the packages' APIs), but we would still need to choose a default.

@alyst
Copy link
Collaborator Author

alyst commented Mar 9, 2017

Would DataFrames be supported(developed) in the long term?
Both packages could be supported by RData. Though, implementation-wise, it's about supporting both DataArrays and NullableArrays+CategoricalArrays for representing NAs and factors.

If DataFrames development would cease, there could be still patch releases to the current (DataFrames-based) version of RData.

@nalimilan
Copy link
Member

I think at some point only one package should remain (probably DataTables, though maybe not based on Nullable as it exists now). But supporting both DataArray and NullableArray shouldn't be too hard as their internal structures and constructors are very similar.

@alyst
Copy link
Collaborator Author

alyst commented Mar 10, 2017

supporting both DataArray and NullableArray shouldn't be too hard as their internal structures and constructors are very similar.

That's true. I'm just a little bit worried that the community efforts would be diffused by maintaining the interchangeability of the two package families that provide essentially the same functionality.

src/convert.jl Outdated
end

# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0
na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ifelse could be faster than a branch here.

src/convert.jl Outdated
# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0
na2zero{R}(::Type{R}, v::Vector{Int32}) = [x != R_NA_INT32 ? R(x) : zero(R) for x in v]

# convert to [Nullable]CategoricalArray{String} if `ri`is a factor,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space before "is".

src/convert.jl Outdated
# convert to [Nullable]CategoricalArray{String} if `ri`is a factor,
# or to [Nullable]Array{Int32} otherwise
function julia_vector(ri::RIntegerVector, force_nullable::Bool)
!isfactor(ri) && return _julia_vector(eltype(ri.data), ri, force_nullable) # not a factor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# not a factor is kind of redundant with the check.

src/convert.jl Outdated
# FIXME set ordered flag
refs = na2zero(REFTYPE, ri.data)
pool = CategoricalPool{String, REFTYPE}(rlevels)
(force_nullable || (findfirst(refs, zero(REFTYPE)) > 0)) ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be more readable as an if.. else block.

src/convert.jl Outdated
namask(ri::RIntegerVector) = BitArray(ri.data .== R_NA_INT32)
namask(rn::RNumericVector) = BitArray(map(isna_float64, reinterpret(UInt64, rn.data)))
namask(ri::RVector{Int32}) = [i == R_NA_INT32 for i in ri.data]
namask(rn::RNumericVector) = map(isna_float64, reinterpret(UInt64, rn.data))
# if re or im is NA, the whole complex number is NA
# FIXME avoid temporary Vector{Bool}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment no longer applies since NullableArray currently uses a Vector{Bool}. Though the current approach is wasteful when there are no nulls, since it allocates the bit mask even though it's not used.

@nalimilan
Copy link
Member

It appears the best solution would be to add DataStreams support to RData so that the result can be converted to any type, including DataFrame and DataTable. We should continue returning DataFrames by default to avoid breaking people's code, and add a new way of loading .RData files with DataStreams. Then we could deprecate the old interface. This means we will have to keep DataFrames support for some time.

@quinnj said he's going to work on DataStreams soon, which should improve the abstraction and make this possible/easier.

* drop Julia 0.4 support (since DataTables require Julia 0.5)
* convert from using DataArrays to NullableArrays/CategoricalArrays
- use == instead of isequal()
- explicitly make the columns nullable
@alyst alyst closed this Sep 15, 2017
@ararslan ararslan deleted the ast/nullable_arrays branch September 15, 2017 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drop DataArrays requirement?
4 participants