Prevent the UTF-8 string from being collected by the garbage collector in forder() #2678
Conversation
@@ -866,7 +866,7 @@ static void csort(SEXP *x, int *o, int n)
   /* can't use otmp, since iradix might be called here and that uses otmp (and xtmp).
      alloc_csort_otmp(n) is called from forder for either n=nrow if 1st column,
      or n=maxgrpn if onwards columns */
-  for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(ENC2UTF8(x[i]));
+  for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(x[i]);
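The one-line change above takes the per-element ENC2UTF8() call out of csort()'s loop, because the strings are now converted (and protected) once up front. A minimal Python sketch of that convert-once pattern, under stated assumptions: expensive_convert() is a hypothetical stand-in for the encoding conversion, not data.table code.

```python
import functools

calls = {"n": 0}

def expensive_convert(s):
    # Hypothetical stand-in for ENC2UTF8(): pretend re-encoding is costly.
    calls["n"] += 1
    return s  # identity here; real code would re-encode the string

x = ["delta", "alpha", "echo", "bravo", "charlie"]

# Naive: the comparator converts on every comparison
# (analogous to calling ENC2UTF8() inside the sort's hot loop).
def compare(a, b):
    ca, cb = expensive_convert(a), expensive_convert(b)
    return (ca > cb) - (ca < cb)

calls["n"] = 0
naive = sorted(x, key=functools.cmp_to_key(compare))
naive_calls = calls["n"]

# Fixed: a pre-pass (analogous to csort_pre()) converts each unique
# string exactly once; the sort itself only reads the cache.
calls["n"] = 0
cache = {s: expensive_convert(s) for s in set(x)}
cached = sorted(x, key=lambda s: cache[s])
cached_calls = calls["n"]

assert naive == cached
print(cached_calls)                 # 5: one conversion per unique string
print(naive_calls > cached_calls)   # True: naive cost grows with comparisons
```

The caching also explains the speed-up reported below: the conversion cost becomes linear in the number of unique strings instead of proportional to the number of comparisons.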
great catch
@mattdowle It's the first time I've realized that the
Regarding performance, it improves significantly when there are lots of non-ASCII chars:
library(data.table)
nonascii_string <- function(n, utf8 = TRUE) {
x <- c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")
if (isTRUE(utf8)) x <- enc2utf8(x)
sample(x, n, TRUE)
}
# ascii 1
tmp <- data.table(x = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, x))
# ascii 2
tmp <- data.table(x = sample(letters, 1e8, TRUE), y = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, y, x))
# utf8 1
tmp <- data.table(x = nonascii_string(1e7))
system.time(setkey(tmp, x))
# utf8 2
tmp <- data.table(x = nonascii_string(1e7), y = nonascii_string(1e7))
system.time(setkey(tmp, y, x))
# native
tmp <- data.table(x = nonascii_string(1e5, FALSE))
system.time(setkey(tmp, x))
Codecov Report
@@ Coverage Diff @@
## master #2678 +/- ##
==========================================
- Coverage 93.32% 93.31% -0.01%
==========================================
Files 61 61
Lines 12225 12237 +12
==========================================
+ Hits 11409 11419 +10
- Misses 816 818 +2
Line 1227 in 4d8545e
library(data.table)
utf8_strings <- enc2utf8(c("红利收入", "价差收入"))
native_strings <- enc2native(utf8_strings)
mixed_strings <- c(utf8_strings, native_strings)
DT <- data.table(x = mixed_strings, y = 1)
DT[, .N, by = .(x, y)]
x y N
1: 红利收入 1 2
2: 价差收入 1 2
DT[, .N, by = .(y, x)]
y x N
1: 1 红利收入 1
2: 1 价差收入 1
3: 1 红利收入 1
4: 1 价差收入 1
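For readers outside R: the inconsistent output above is an encoding-identity problem — the same logical string, stored in two encodings, is treated as one group key or two depending on the grouping path. A minimal Python sketch of the issue and of the fix's idea (normalize every key to one canonical form before grouping). The canonical() helper and its encoding list are illustrative assumptions, not data.table code.

```python
from collections import Counter

texts = ["红利收入", "价差收入"]
# The same logical strings in two byte-level encodings (UTF-8 vs. GBK),
# mirroring the mixed UTF-8/native vector in the R example above.
mixed = [t.encode("utf-8") for t in texts] + [t.encode("gbk") for t in texts]

# Grouping on the raw representations sees 4 distinct keys...
assert len(Counter(mixed)) == 4

def canonical(b, encodings=("utf-8", "gbk")):
    # Illustrative helper: decode back to one canonical form,
    # analogous to converting everything to UTF-8 before comparing.
    for enc in encodings:
        try:
            return b.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("unknown encoding")

# ...while normalizing first yields the 2 groups the user expects.
groups = Counter(canonical(b) for b in mixed)
print(len(groups))        # 2
print(groups["红利收入"])  # 2
```

The fix must apply the same normalization on every grouping path, which is why `by = .(x, y)` and `by = .(y, x)` should agree.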
@mattdowle I've pushed more commits to fix the cases on grouping (a.k.a.
Thanks for all this! It's looking great to me. According to codecov,
I guess adding an example that uses the
The error log I downloaded from
I can't understand the failure, because the following code gives the correct answer:
utf8_strings <- c("\u00e7ile", "fa\u00e7ile", "El. pa\u00c5\u00a1tas", "\u00a1tas", "\u00de")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
mixed_strings <- c(utf8_strings, latin1_strings)
DT <- data.table(x = mixed_strings, y = c(latin1_strings, utf8_strings), z = 1)
nrow(DT[, .N, by = .(z, x, y)])
# 5

EDIT: Got it. Yes, it indeed fails on the x32 version of R... I'm investigating it now...

EDIT 2: Should have been fixed now. Also, the
Very nice. Thanks for the good comments. I read them a few times; it's indeed a better, cleaner approach.
…ed reminder to benchmarks.Rraw.
Closes #2674
This PR replaces PR #2675 (see comments there).
If the garbage collector happens to be triggered during sorting (e.g. when there are millions of non-ASCII characters), it leads to the collapse of data.table (see #2674 for details), because data.table assumes the converted UTF-8 chars are still in the global string pool. This PR fixes that issue.

In addition, it brings a performance improvement in the case of millions of non-ASCII chars, because each string now only needs to be converted to UTF-8 once. Before this PR, the strings were converted twice, in csort_pre() and csort() respectively, which can be a big cost for a large character vector (for example, on my computer, enc2utf8() takes about 20s on a Chinese character vector of length 1e7).

TODO

by = .(x, y) should return the same rows as by = .(y, x) (see comment below)