Unexpected result for max of character variable by group #5331

Closed
markseeto opened this issue Feb 14, 2022 · 5 comments · Fixed by #5342


@markseeto
Contributor

I was surprised by this:

library(data.table)

DT <- data.table(group = c("g1", "g1", "g2", "g2"),
                 x = c("alice", "Bob", "carol", "david"))

DT
#    group     x
# 1:    g1 alice
# 2:    g1   Bob
# 3:    g2 carol
# 4:    g2 david

DT2 <- DT[, .(m1 = max(x)), by = "group"]

DT2
#    group    m1
# 1:    g1 alice
# 2:    g2 david

DT3 <- DT[, .(m1 = max(x), m2 = max(tolower(x))), by = "group"]

DT3
#    group    m1    m2
# 1:    g1   Bob   bob
# 2:    g2 david david

DT[group == "g1", max(x)]
# [1] "Bob"

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_4.1.2

Why are DT2$m1 and DT3$m1 different?

And why is DT2[group == "g1", m1] not the same as DT[group == "g1", max(x)]?

Thanks.

@markseeto changed the title from "Unexpected result for max of character variable" to "Unexpected result for max of character variable by group" on Feb 14, 2022
@MichaelChirico
Member

thanks, this is a great instructive example.

the difference is whether data.table or base R does the sorting for you.

data.table always sorts in C locale; base sorts in the system locale by default (and I'm not sure there is an option to toggle this for max() on the fly)
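To make the two orderings concrete, here is a minimal base R sketch. It assumes that `sort(method = "radix")`, which base R documents as using C-locale ordering for character vectors, is a fair stand-in for the C-locale comparison described above; it is not data.table's own code path.

```r
x <- c("alice", "Bob")

# In the C locale, strings compare by code point, so every
# uppercase ASCII letter sorts before every lowercase one.
utf8ToInt("B")  # 66
utf8ToInt("a")  # 97

# Radix sort uses C-locale ordering regardless of the session locale.
c_sorted <- sort(x, method = "radix")  # "Bob" "alice"
c_max <- c_sorted[length(c_sorted)]    # "alice"

# Base max() follows the session's LC_COLLATE setting; in the
# English_Australia locale from the report it returned "Bob".
locale_max <- max(x)
```

So "alice" is the maximum under C-locale ordering, which matches the m1 value in DT2, while a typical English locale ranks "Bob" higher, which matches DT3.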

you can run your examples with verbose=TRUE; the output should be illustrative.

what's your desired outcome? the workaround depends on what you expected.

@markseeto
Contributor Author

Thanks for the explanation @MichaelChirico.

I don't have a specific desired outcome for whether "alice" or "Bob" is considered to be the maximum. Consistency with base R would be nice, but I can accept that there is more than one reasonable approach.

When I encountered something like this, what I found really confusing was m1 being different in DT[, .(m1 = max(x)), by = "group"] compared to DT[, .(m1 = max(x), m2 = max(tolower(x))), by = "group"], because m1 appears to be defined the same way in both cases, and I wouldn't have expected it to be affected by the inclusion of another column.

@MichaelChirico
Member

agreed there. we have some plans to apply GForce more consistently. the current issue is that once we see an ad-hoc expression, it turns off GForce for the entire query.
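One way to check that GForce is what makes the two queries differ is to switch it off for both and compare. The sketch below assumes the `datatable.optimize` option works as described in `?datatable.optimize`, where level 2 and above enables GForce, so level 1 leaves `max()` to base R in both queries. It only checks that the two m1 columns agree; it makes no claim about which value they hold on a given data.table version.

```r
library(data.table)

DT <- data.table(group = c("g1", "g1", "g2", "g2"),
                 x = c("alice", "Bob", "carol", "david"))

# Assumption: optimization level 1 keeps GForce off for grouped queries.
old_opts <- options(datatable.optimize = 1L)

a <- DT[, .(m1 = max(x)), by = "group"]
b <- DT[, .(m1 = max(x), m2 = max(tolower(x))), by = "group"]

# Restore the previous optimization level.
options(old_opts)

# With the same max() implementation used in both queries,
# m1 no longer depends on what else is in j.
same_m1 <- identical(a$m1, b$m1)
```

This trades away the GForce speed-up, so it is more useful as a diagnostic than as a permanent setting.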

understand this can be confusing and is basically exposing an implementation detail. for now your best bet is to remember to try verbose=TRUE to get some insight whenever you encounter something like this.
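For reference, `verbose = TRUE` can be passed directly inside `[`. The sketch below captures the diagnostic text for each query so the two can be compared side by side. The exact wording of the messages varies between data.table versions, so none is quoted here; the thing to look for is whether the output says j was GForce-optimized.

```r
library(data.table)

DT <- data.table(group = c("g1", "g1", "g2", "g2"),
                 x = c("alice", "Bob", "carol", "david"))

# Capture the verbose diagnostics for the single-column query.
log_simple <- capture.output(
  res_simple <- DT[, .(m1 = max(x)), by = "group", verbose = TRUE]
)

# Same again with the extra ad-hoc expression in j.
log_mixed <- capture.output(
  res_mixed <- DT[, .(m1 = max(x), m2 = max(tolower(x))),
                  by = "group", verbose = TRUE]
)

# Print both logs and compare what each says about optimizing j.
cat(log_simple, sep = "\n")
cat(log_mixed, sep = "\n")
```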

@MichaelChirico
Member

if you want consistency, I believe you can set Sys.setenv(LC_ALL="C")
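A hedged note on the suggestion above: as far as I know, R picks up the locale environment variables at startup, so `Sys.setenv()` may not change collation in a session that is already running. `Sys.setlocale()` changes it directly. A minimal base R sketch that switches collation to C, takes the max, and restores the previous setting:

```r
x <- c("alice", "Bob")

# Remember the current collation so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch string comparison to the C locale for this session.
Sys.setlocale("LC_COLLATE", "C")

# Under C collation "B" (66) sorts before "a" (97), so "alice" is the max.
c_locale_max <- max(x)  # "alice"

# Put the original collation back.
Sys.setlocale("LC_COLLATE", old_collate)
```

With collation set to C, base `max()` should agree with a C-locale ordering for this example, at the cost of non-locale-aware sorting everywhere else in the session.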

@markseeto
Contributor Author

Thanks for your replies @MichaelChirico.

I had tried verbose=TRUE but still wasn't sure about the reason until I read your explanation.
