Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove CategoricalArrays/DataFrames dependencies #766

Merged
merged 3 commits into from
Nov 2, 2020
Merged

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Oct 31, 2020

These have been deprecated for a while now, so in preparation for the
0.8 release, we remove support for the categorical keyword argument,
producing CategoricalArray columns when the column type is given as
CategoricalValue, and CSV.read without an explicit sink argument.

I'd like to test a bit more what happens if you provide
CategoricalArray to CSV.write, or CategoricalValue as a column type.

For reference, this improves the loading time of CSV.jl on my machine from 1.05s, to 0.65s, around ~45% improvement

These have been deprecated for a while now, so in preparation for the
0.8 release, we remove support for the `categorical` keyword argument,
producing `CategoricalArray` columns when the column type is given as
`CategoricalValue`, and `CSV.read` without an explicit sink argument.

I'd like to test a bit more what happens if you provide
CategoricalArray to `CSV.write`, or `CategoricalValue` as a column type.
@codecov
Copy link

codecov bot commented Oct 31, 2020

Codecov Report

Merging #766 into master will increase coverage by 6.40%.
The diff coverage is 92.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #766      +/-   ##
==========================================
+ Coverage   85.57%   91.98%   +6.40%     
==========================================
  Files          10        9       -1     
  Lines        1948     1784     -164     
==========================================
- Hits         1667     1641      -26     
+ Misses        281      143     -138     
Impacted Files Coverage Δ
src/CSV.jl 25.00% <0.00%> (+2.77%) ⬆️
src/header.jl 95.62% <ø> (+0.35%) ⬆️
src/write.jl 85.76% <ø> (-0.05%) ⬇️
src/chunks.jl 100.00% <100.00%> (ø)
src/file.jl 94.63% <100.00%> (-0.12%) ⬇️
src/rows.jl 92.59% <100.00%> (ø)
src/utils.jl 87.33% <100.00%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e8a8f3...90e0f5b. Read the comment docs.

Comment on lines -26 to -29
f = CSV.File(IOBuffer("X\nb\nc\na\nc"), types=[CategoricalValue{String, UInt32}])
v = f.X[1]
@test v == "b"
@test levels(v.pool) == ["a", "b", "c"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this test could be kept?

@quinnj quinnj merged commit 68d7153 into master Nov 2, 2020
@quinnj quinnj deleted the jq/removedeps branch November 2, 2020 15:44
@nalimilan
Copy link
Member

How about testing types=[CategoricalValue{String, UInt32}]? Is that supposed to work?

@quinnj
Copy link
Member Author

quinnj commented Nov 2, 2020

Hmmm, yeah, that's currently giving:

julia> using CSV, CategoricalArrays
[ Info: Precompiling CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]

julia> f = CSV.File(IOBuffer("X\nb\nc\na\nc"), types=[CategoricalValue{String, UInt32}])
ERROR: MethodError: no method matching parse(::Type{CategoricalValue{String, UInt32}}, ::String)
Closest candidates are:
  parse(::Type{Sockets.IPAddr}, ::AbstractString) at /Users/jacobquinn/julia/usr/share/julia/stdlib/v1.6/Sockets/src/IPAddr.jl:246
  parse(::Type{T}, ::AbstractString; base) where T<:Integer at parse.jl:240
  parse(::Type{T}, ::AbstractString; kwargs...) where T<:Real at parse.jl:379
  ...
Stacktrace:
 [1] xparse
   @ ~/.julia/dev/Parsers/src/Parsers.jl:754 [inlined]
 [2] parsevalue!(#unused#::Type{CategoricalValue{String, UInt32}}, flag::UInt8, column::SentinelArrays.SentinelVector{CategoricalValue{String, UInt32}, UndefInitializer, Missing, Vector{CategoricalValue{String, UInt32}}}, columns::Vector{AbstractVector{T} where T}, buf::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options{false, true, true, false, Missing, UInt8, Nothing}, row::Int64, rowoffset::Int64, col::Int64, types::Vector{Type}, flags::Vector{UInt8})
   @ CSV ~/.julia/dev/CSV/src/file.jl:907

We need to decide what to do here:

  • just tell people that isn't supported anymore; maybe try to special-case CategoricalValue specifically and throw helpful error
  • try to support it somehow; maybe through use of DataAPI.defaultarray, but I think we'd also need a way for CategoricalValue to tell us what kind of "wrapped" type it has

@nalimilan
Copy link
Member

OK. Probably not the end of the world.

DataAPI.defaultarray is already implemented by CategoricalArrays, but parse isn't and I'm not sure it should. Actually currently you can't construct a CategoricalValue in isolation, you can only get it via its parent pool or array: anyway it would be super slow to create a new CategoricalValue with a new pool for each entry. So I guess the only solution would be to assign the value to the array and let the conversion happen at that point. You're right that it would require knowing the wrapped type -- or fall back to just assignig the string without parsing it, which would work only for CategoricalValue{String} (but it's the most common type).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants