Skip to content

Conversation

@greimel
Copy link
Contributor

@greimel greimel commented May 6, 2020

DataFrames 0.21 doesn't have identfier() for cleaning column names any longer.

If we want to use it here, we have to port this function from DataFrames 0.20. Instead, I've removed the cleaning in this PR.

@greimel
Copy link
Contributor Author

greimel commented May 6, 2020

It seems as if porting identifier() from DataFrames is less breaking. I added two a commit accordingly. With it is quite easy to upgrade RDatasets to the new versions.

@coveralls
Copy link

coveralls commented May 6, 2020

Coverage Status

Coverage increased (+0.5%) to 86.177% when pulling 29a21d4 on greimel:compat into 5a04c97 on JuliaData:master.

@greimel
Copy link
Contributor Author

greimel commented May 6, 2020

one test keeps failing in RDatasets

using RDatasets
dataset("COUNT", "fasttrakg")

gives an

ArgumentError: Duplicate entries are not allowed in levels
handle_error(::ArgumentError, ::FileIO.File{FileIO.DataFormat{:RData}}) at error_handling.jl:82
handle_exceptions(::Array{Any,1}, ::String) at error_handling.jl:77
load(::FileIO.Formatted; options::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at loadsave.jl:189
load at loadsave.jl:169 [inlined]
#load#13 at loadsave.jl:118 [inlined]
load at loadsave.jl:118 [inlined]
dataset(::String, ::String) at dataset.jl:12
top-level scope at tmp.jl:3

Could this be due to a change in CategoricalArrays?

@nalimilan
Copy link
Member

Could this be due to a change in CategoricalArrays?

Yes, CategoricalArrays no longer allows duplicate levels. This wasn't really supposed to be working, so it probably indicates a bug. Maybe unique has to be called somewhere before creating the CategoricalArray?

@greimel
Copy link
Contributor Author

greimel commented May 8, 2020

as @bkamins said above

There is a decision to be made:

  • do what you propose: then we are non-breaking

  • do what I propose: we are breaking, but the benefit is that sending some data back and forth between R and Julia does not alter it

What about doing the non-breaking change now, to avoid that RData.jl holds back the adoption of the new DataFrames version. (RData is often just a test dependency.)

Shouldn't we open an issue for discussing the breaking change, @bkamins?

@greimel
Copy link
Contributor Author

greimel commented May 8, 2020

added a commit to remove duplicate levels and print a warning. @nalimilan, could you please check this? I know too little about the internals how categoricals are stored internally.

Anyways, RDatasets.jl test pass now with this commit.

@bkamins
Copy link
Member

bkamins commented May 8, 2020

Shouldn't we open an issue for discussing the breaking change

I am OK with making it a separate PR.

src/convert.jl Outdated

rlevels = getattr(ri, "levels")
sz0 = length(rlevels)
unique!(rlevels) # CategoricalArrays#v0.8 does not allow duplicate levels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this won't be enough, as removing duplicates will break the mapping between refs and rlevels. If duplicates are present, I think you'll have to call indexin(rlevels, unique(rlevels)) and recompute refs based on that.

Tests should normally have caught this, but levels after the duplicate are not used (BTW, note that R prints a warning too):

> load("data/COUNT/fasttrakg.rda")
> table(fasttrakg$Anterior)

 Inferior  Anterior  Inferior      LBBB   Missing    NoSTUp OtherSTUp     Paced 
        8         7         0         0         0         0         0         0 
> fasttrakg$Anterior
 [1] Inferior Inferior Inferior Inferior Inferior Inferior Inferior Inferior
 [9] Anterior Anterior Anterior Anterior Anterior Anterior Anterior
attr(,"label")
[1] 0=inferior;1=anterior
attr(,"format")
[1] %9.0g
attr(,"value.label.table")
 Inferior  Anterior  Inferior      LBBB   Missing    NoSTUp OtherSTUp     Paced 
        0         1         2         3         4         5         6         7 
Levels: Inferior Anterior Inferior LBBB Missing NoSTUp OtherSTUp Paced
Warning message:
In print.factor(x) : duplicated level [3] in factor

A simple way to test this would be to take that dataset, change one value so that one of the levels at the end is used, and put it in test/.

@greimel
Copy link
Contributor Author

greimel commented May 11, 2020

@nalimilan I added a test.

load("~/.julia/dev/RDatasets/data/COUNT/fasttrakg.rda")
fasttrakg$Anterior[1] = "Paced"
dup_levels = fasttrakg$Anterior # Paced Inferior Inferior Inferior Inferior Inferior Inferior Inferior Anterior Anterior Anterior Anterior Anterior Anterior Anterior
attr(dup_levels, "levels") #  "Inferior"  "Anterior"  "Inferior"  "LBBB"      "Missing"   "NoSTUp"    "OtherSTUp" "Paced" 
save(dup_levels, file="~/.julia/dev/RData/test/data_v3/dup_levels.rda")

@greimel
Copy link
Contributor Author

greimel commented May 11, 2020

test for RDatasets also pass locally

@greimel
Copy link
Contributor Author

greimel commented May 11, 2020

squashed some commits. This is ready for final review.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good! Just two small comments.

src/convert.jl Outdated
refs = na2zero(RT, ri.data)

if hasduplicates
RT = REFTYPE(sz)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just take the reftype for sz0. The probability of having so many duplicates that the type would change is very low, and not worth making the code more complex.


@testset "Duplicate levels in factor (version=3)" begin
dup_cat = sexp2julia(load(joinpath("data_v3", "dup_levels.rda"), convert=false)["dup_levels"])
@test dup_cat[1] == "Paced"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also check levels(dup_cat) just in case?

greimel and others added 3 commits May 11, 2020 12:33
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@greimel
Copy link
Contributor Author

greimel commented May 11, 2020

comments addressed

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan nalimilan requested review from alyst and bkamins May 11, 2020 11:05
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@nalimilan nalimilan merged commit 030b0fb into JuliaData:master May 13, 2020
@nalimilan
Copy link
Member

Thanks!

@nalimilan
Copy link
Member

@greimel greimel deleted the compat branch June 17, 2020 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants