Compat: DataFrames and CategoricalArrays #72
It seems as if porting |
|
One test keeps failing in RDatasets:
using RDatasets
dataset("COUNT", "fasttrakg")
gives an error. Could this be due to a change in CategoricalArrays? |
Yes, CategoricalArrays no longer allows duplicate levels. This wasn't really supposed to be working, so it probably indicates a bug. Maybe |
|
as @bkamins said above
What about making the non-breaking change now, so that RData.jl does not hold back adoption of the new DataFrames version? (RData is often just a test dependency.) Shouldn't we open an issue to discuss the breaking change, @bkamins? |
|
I added a commit that removes duplicate levels and prints a warning. @nalimilan, could you please check this? I know too little about how categoricals are stored internally. Anyway, the RDatasets.jl tests pass now with this commit. |
I am OK with making it a separate PR. |
src/convert.jl
rlevels = getattr(ri, "levels")
sz0 = length(rlevels)
unique!(rlevels) # CategoricalArrays#v0.8 does not allow duplicate levels
I'm afraid this won't be enough, as removing duplicates will break the mapping between refs and rlevels. If duplicates are present, I think you'll have to call indexin(rlevels, unique(rlevels)) and recompute refs based on that.
Tests should normally have caught this, but levels after the duplicate are not used (BTW, note that R prints a warning too):
> load("data/COUNT/fasttrakg.rda")
> table(fasttrakg$Anterior)
Inferior Anterior Inferior LBBB Missing NoSTUp OtherSTUp Paced
8 7 0 0 0 0 0 0
> fasttrakg$Anterior
[1] Inferior Inferior Inferior Inferior Inferior Inferior Inferior Inferior
[9] Anterior Anterior Anterior Anterior Anterior Anterior Anterior
attr(,"label")
[1] 0=inferior;1=anterior
attr(,"format")
[1] %9.0g
attr(,"value.label.table")
Inferior Anterior Inferior LBBB Missing NoSTUp OtherSTUp Paced
0 1 2 3 4 5 6 7
Levels: Inferior Anterior Inferior LBBB Missing NoSTUp OtherSTUp Paced
Warning message:
In print.factor(x) : duplicated level [3] in factor
A simple way to test this would be to take that dataset, change one value so that one of the levels at the end is used, and put it in test/.
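The remapping suggested in this comment can be sketched with plain Base Julia. This is only an illustration, not the code that ended up in src/convert.jl: `rlevels` copies the levels printed above, while `refs`, `ulevels`, `remap`, and `newrefs` are hypothetical names and values chosen for the example, with 0 assumed to stand for a missing value.

```julia
# Hypothetical stand-ins for a factor read from R: `rlevels` mirrors the levels
# printed above; `refs` are made-up 1-based codes into it (0 is assumed to mean NA).
rlevels = ["Inferior", "Anterior", "Inferior", "LBBB", "Missing", "NoSTUp", "OtherSTUp", "Paced"]
refs = [8, 1, 1, 3, 2, 2]

# Drop duplicate levels, keeping the first occurrence of each.
ulevels = unique(rlevels)

# For every old level index, find the index of the same label in the deduplicated levels.
remap = indexin(rlevels, ulevels)

# Recompute the codes so that they point into `ulevels`; NA codes stay 0.
newrefs = [r == 0 ? 0 : remap[r] for r in refs]

# Decoding through either pair of (codes, levels) should give the same labels.
decoded_old = [rlevels[r] for r in refs]
decoded_new = [ulevels[r] for r in newrefs]
```

Calling `unique!(rlevels)` alone would leave the code 8 pointing past the end of the seven remaining levels, which is the broken mapping described above.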
|
@nalimilan I added a test.
load("~/.julia/dev/RDatasets/data/COUNT/fasttrakg.rda")
fasttrakg$Anterior[1] = "Paced"
dup_levels = fasttrakg$Anterior # Paced Inferior Inferior Inferior Inferior Inferior Inferior Inferior Anterior Anterior Anterior Anterior Anterior Anterior Anterior
attr(dup_levels, "levels") # "Inferior" "Anterior" "Inferior" "LBBB" "Missing" "NoSTUp" "OtherSTUp" "Paced"
save(dup_levels, file="~/.julia/dev/RData/test/data_v3/dup_levels.rda") |
|
The RDatasets tests also pass locally. |
|
I squashed some commits. This is ready for final review. |
nalimilan
left a comment
Thanks, looks good! Just two small comments.
src/convert.jl
refs = na2zero(RT, ri.data)

if hasduplicates
    RT = REFTYPE(sz)
I think you can just take the reftype for sz0. The probability of having so many duplicates that the type would change is very low, and not worth making the code more complex.
|
@testset "Duplicate levels in factor (version=3)" begin
    dup_cat = sexp2julia(load(joinpath("data_v3", "dup_levels.rda"), convert=false)["dup_levels"])
    @test dup_cat[1] == "Paced"
Can you also check levels(dup_cat) just in case?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
comments addressed |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
|
Thanks! |
DataFrames 0.21 no longer has identifier() for cleaning column names. If we want to use it here, we have to port this function from DataFrames 0.20. Instead, I've removed the cleaning in this PR.