Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to DataFrames v0.11 (CategoricalArrays + Missings) #28

Merged
merged 4 commits into from
Nov 26, 2017

Conversation

alyst
Copy link
Collaborator

@alyst alyst commented Sep 15, 2017

I suppose now it's the time for #19:

  • update to the latest DataFrames.jl v0.10v0.11 and switch to Nulls.jlMissings.jl and CategoricalArrays.jl from DataArrays.jl (NAs are representing as Union{T, Null}).
  • drop compatibility with Julia v0.5
  • fix R logical vector conversion

@alyst alyst force-pushed the update_to_0.6 branch 3 times, most recently from def7de9 to 24fc1bf Compare September 15, 2017 17:30
@alyst
Copy link
Collaborator Author

alyst commented Sep 15, 2017

Oops, still a little bit too early. It needs DataFrames.jl master.

@alyst
Copy link
Collaborator Author

alyst commented Sep 16, 2017

The CI pass on v0.6, but not on nightly (some error in DataStructures.jl)

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

NEWS.md Outdated
##### Changes
* R logical vectors converted to `Vector{Bool}` (instead of `Vector{Int32}`)
* R factors converted into `CategoricalVector` (instead of `PooledDataArray`) ([#19])
* switched from DataVector to Vector{Union{T,Null}} for NAs ([#19])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing backticks.

NEWS.md Outdated
@@ -15,5 +25,6 @@ Initial release based on `DataFrames.read_rda()` ([JuliaStats/DataFrames.jl#1031
[#9]: https://github.com/JuliaStats/RData.jl/issues/9
[#10]: https://github.com/JuliaStats/RData.jl/issues/10
[#15]: https://github.com/JuliaStats/RData.jl/issues/15
[#19]: https://github.com/JuliaStats/RData.jl/issues/19
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong place?

REQUIRE Outdated
DataFrames 0.7
DataArrays 0.3
julia 0.6
DataFrames 0.10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be 0.11.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I should keep it 0.10 for now, otherwise the package won't install.

src/context.jl Outdated
int2ver(v::Integer) = VersionNumber(v >> 16, (v >> 8) & 0xff, v & 0xff)
function RDAContext{T<:RDAIO}(io::T, kwoptions::Vector{Any}=Any[])

function RDAContext(io::T, kwoptions::Vector{Any}=Any[]) where {T<:RDAIO}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to define T.

src/convert.jl Outdated
# if re or im is NA, the whole complex number is NA
# FIXME avoid temporary Vector{Bool}
namask(rc::RComplexVector) = BitArray([isna_float64(v.re) || isna_float64(v.im) for v in reinterpret(Complex{UInt64}, rc.data)])
namask(rc::RComplexVector) = [isna_float64(v.re) || isna_float64(v.im) for v in reinterpret(Complex{UInt64}, rc.data)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use broadcast with an anonymous function so that the result is a BitArray. Though I see you switched to Vector{Bool} in the RNullableVector struct. Why change that? Also, other namask methods return a BitArray, so that change may force an inefficient conversion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a legacy from a switch to DataTables and the fact that I forgot that vectorizing logical operation gives BitVector.
I'll fix namask() to return BitVector, but should RLogicalVector be converted into Vector{Bool} or BitVector given that it will most likely be a column in a data frame?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the difference matters a lot, given that the conversion to Array{Union{Null, T}} won't be able to reuse the mask directly. Array{Bool} should be faster, but will also use 8× more memory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the revision I've removed namask(), because with Union{T,Null} we don't actually need NA mask. That was the legacy from DataArray/NullableArray times.

src/convert.jl Outdated
# convert R logical vector (uses Int32 to store values) into [Nullable]Vector{Bool}
function jlvec(rl::RLogicalVector, force_nullable::Bool=true)
na_mask = namask(rl)
vals = rl.data .!= zero(eltype(rl.data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.!= 0 should be enough here.

src/convert.jl Outdated
return PooledDataArray(DataArrays.RefArray(refs), pool)
# FIXME set ordered flag
refs = na2zero(REFTYPE, ri.data)
anyna = findfirst(refs, zero(REFTYPE)) > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just any(iszero, refs)

src/convert.jl Outdated
end

# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0
# it's assumed that v fits into R
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How safe is that assumption?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only caller so far is jlvec() method below, and it calculates R to fit all the factor levels.

src/convert.jl Outdated
# FIXME remove Any type assertion workaround
DataFrame(Any[data(col) for col in rl.data], map(identifier, names(rl)))
# FIXME add forceNullable option to control whether always convert to Union{T, Null}
DataFrame(Any[jlvec(col, true) for col in rl.data], identifier.(names(rl)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrames no longer automatically converts columns to nullable, probably better keep them non-nullable when there are no nulls.

src/io/XDRIO.jl Outdated
@@ -25,7 +25,11 @@ function readfloatorNA(io::XDRIO, n::RVecLength)
reinterpret(Float64, map!(ntoh, v, v))
end

readuint8(io::XDRIO, n::RVecLength) = readbytes(io.sub, n)
function readuint8(io::XDRIO, n::RVecLength)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't read(io, UInt8, n) be enough?

@alyst alyst changed the title Update to Julia v0.6 & DataFrames v0.10 Update to Julia v0.6 & DataFrames v0.11 Sep 18, 2017
src/convert.jl Outdated
na_mask = namask(rv)
anyna = any(na_mask)
if force_nullable || anyna
function jlvec(::Type{T}, rv::RVEC, forceNullable::Bool=true) where T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force_nullable was a more Julian name. Same below.

src/convert.jl Outdated
res = convert(Vector{Union{T,Null}}, rv.data)
if anyna
@inbounds res[na_mask] = null
for (i,x) in enumerate(rv.data)
isna(x) && @inbounds res[i] = null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think putting @inbounds before for will generate more efficient code since it will also apply to enumerate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably you're right, and I'll fix it.
But it's natural to assume that for x in rv.data would do @inbounds access, and so should do for (i,x) in enumerate(rv.data). Maybe something could be tweaked in the Base.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's definitely how it should work, but currently AFAIK @inbounds still makes a difference.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before merging we'll have to clean the Travis config and bump the DataFrames dependency to 0.11.

@alyst
Copy link
Collaborator Author

alyst commented Sep 18, 2017

@nalimilan Thanks for the review. I've just squashed the fixups into appropriate commits, but the cumulative changeset is identical to what you have just reviewed. So I'll wait until DataFrames 0.11.

This was referenced Oct 30, 2017
@alyst alyst force-pushed the update_to_0.6 branch 2 times, most recently from c313f78 to eaf82ff Compare November 4, 2017 12:47
@alyst alyst changed the title Update to Julia v0.6 & DataFrames v0.11 Update to DataFrames v0.11 (CategoricalArrays + Nulls) Nov 4, 2017
@alyst alyst changed the title Update to DataFrames v0.11 (CategoricalArrays + Nulls) Update to DataFrames v0.11 (CategoricalArrays + Missings) Nov 25, 2017
@alyst
Copy link
Collaborator Author

alyst commented Nov 25, 2017

I'm waiting for JuliaLang/METADATA.jl#12198 to see if this PR works on nightlies.
Then it should be ready to merge if there are no objections.

- switch to Missings.jl/CategoricalArrays.jl
- add jlvec() methods handling conversion logic
- fix conversion of RLogicalVector into Vector{Bool}
- remove DataArrays.jl dependency
(::Type{XXX})(...) -> XXX(...)
reinterpret() returns ReinterpretArray in 0.7 instead of Vector,
whereas RVector{} only accepts Vector
allows to fix the result of readcomplex() from ReinterpretArray to
Vector on 0.7
@alyst alyst merged commit 7e17dcf into master Nov 26, 2017
@alyst alyst deleted the update_to_0.6 branch November 26, 2017 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants