
prevent the utf8 string from being collected by the garbage collector in forder() #2678

Merged 14 commits into master from fix#2674 on Mar 30, 2018



@shrektan shrektan commented Mar 16, 2018

Closes #2674
This PR replaces PR #2675 (see comments there).

If the garbage collector happens to be triggered during sorting (e.g., when there are millions of non-ASCII characters), data.table collapses (see #2674 for details), because it assumes the converted UTF-8 strings are still in the global string pool. This PR fixes that issue.

In addition, it improves performance when there are millions of non-ASCII characters, because each string now needs to be converted to UTF-8 only once. Before this PR, the strings were converted twice, in csort_pre() and csort() respectively, which can be a big cost for a large character vector (for example, on my computer, enc2utf8() takes about 20s for a character vector of 1e7 Chinese strings).


  • a new test
  • entries in
  • by = .(x, y) should return the same row as by = .(y, x) (see comment below)
  • should not affect the performance if all the strings are ASCII or encoded in UTF-8.

@@ -866,7 +866,7 @@ static void csort(SEXP *x, int *o, int n)
 /* can't use otmp, since iradix might be called here and that uses otmp (and xtmp).
    alloc_csort_otmp(n) is called from forder for either n=nrow if 1st column,
    or n=maxgrpn if onwards columns */
-for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(ENC2UTF8(x[i]));
+for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(x[i]);

@MichaelChirico MichaelChirico Mar 16, 2018


great catch 👏


@shrektan shrektan commented Mar 16, 2018

@mattdowle This is the first time I've realized that R's radix sort is the same as data.table's. Thanks for the great work! I don't think I've encountered this issue before, so it should be a problem of data.table 1.10.5 only.

However, regarding the following line, are you sure it's OK, given that the xd here will not be converted to UTF-8? (UPDATE: fixed by be9a6de)

About the performance: it improves significantly when there are lots of non-ASCII characters, but yes, it becomes a little slower when all the strings are ASCII (UPDATE: with commit 84ca0b0, the overhead should be minimal). See the tests on my computer:

nonascii_string <- function(n, utf8 = TRUE) {
  x <- c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")
  if (isTRUE(utf8)) x <- enc2utf8(x)
  sample(x, n, TRUE)
}
# ascii 1
tmp <- data.table(x = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, x)) 
# ascii 2
tmp <- data.table(x = sample(letters, 1e8, TRUE), y = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, y, x)) 
# utf8 1
tmp <- data.table(x = nonascii_string(1e7))
system.time(setkey(tmp, x)) 
# utf8 2
tmp <- data.table(x = nonascii_string(1e7), y = nonascii_string(1e7))
system.time(setkey(tmp, y, x)) 
# native
tmp <- data.table(x = nonascii_string(1e5, FALSE))
system.time(setkey(tmp, x)) 
| version | ascii 1 | ascii 2 | utf8 1 | utf8 2 | native |
|---|---|---|---|---|---|
| v1.9.6 (before including ENC2UTF8) | 1.90s | 5.83s | 0.16s | 0.41s | 0.00s |
| master (4d8545e) | 1.90s | 5.83s | 0.16s | 0.34s | 0.47s (if not fail) |
| this version | 1.90s | 5.90s | 0.16s | 0.33s | 0.14s |

@shrektan shrektan requested a review from mattdowle Mar 16, 2018
@shrektan shrektan added this to the v1.10.6 milestone Mar 16, 2018

@codecov-io codecov-io commented Mar 16, 2018

Codecov Report

Merging #2678 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #2678      +/-   ##
- Coverage   93.32%   93.31%   -0.01%     
  Files          61       61              
  Lines       12225    12237      +12     
+ Hits        11409    11419      +10     
- Misses        816      818       +2
Impacted Files Coverage Δ
src/forder.c 97.92% <100%> (+0.03%) ⬆️
src/fsort.c 72.72% <0%> (-1.22%) ⬇️

Last update 437646f...1ca89ba.


@shrektan shrektan commented Mar 16, 2018

I confirm that we need to edit here as well. (UPDATE: this has been fixed by be9a6de.)

if (sortStr) { csort_pre(xd, n); alloc_csort_otmp(gsmax[1-flip]); g = &csort; }

utf8_strings <- enc2utf8(c("红利收入", "价差收入"))
native_strings <- enc2native(utf8_strings)
mixed_strings <- c(utf8_strings, native_strings)
DT <- data.table(x = mixed_strings, y = 1)
DT[, .N, by = .(x, y)]

          x y N
1: 红利收入 1 2
2: 价差收入 1 2

DT[, .N, by = .(y, x)]

   y        x N
1: 1 红利收入 1
2: 1 价差收入 1
3: 1 红利收入 1
4: 1 价差收入 1


@shrektan shrektan commented Mar 17, 2018

@mattdowle I've pushed more commits to fix the grouping cases (i.e., DT[, j, by = .(x, y)]) and the overhead when all the columns are ASCII/UTF-8.


@mattdowle mattdowle commented Mar 29, 2018

Thanks for all this! It's looking great to me. According to codecov, need2utf8 always returns false. Is there a way to add a test to cover those 4 lines?


@shrektan shrektan commented Mar 30, 2018

I guess adding an example that uses the Latin-1 encoding may work. I'll submit one later.


@shrektan shrektan commented Mar 30, 2018

The error log I downloaded from AppVeyor says:

Running test id 1896.4
Test 1896.4 ran without errors but failed check that x equals y:
> x = nrow(DT[, .N, by = .(z, x, y)]) 
First 6 of 1 (type 'integer'): [1] 10
> y = 5L 
First 6 of 1 (type 'integer'): [1] 5

I can't understand the failure, because the following code gives the correct answer 5 on both my Mac and Windows (I even tried setting my language and locale to the United States)...

utf8_strings <- c("\u00e7ile", "fa\u00e7ile", "El. pa\u00c5\u00a1tas", "\u00a1tas", "\u00de")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
mixed_strings <- c(utf8_strings, latin1_strings)
DT <- data.table(x = mixed_strings, y = c(latin1_strings, utf8_strings), z = 1)
nrow(DT[, .N, by = .(z, x, y)])
# 5


Got it. Yes, it indeed fails on the x32 version of R... I'm investigating it now...


It should be fixed now. Also, the need2utf8 condition is now covered.


@mattdowle mattdowle left a comment

Very nice. Thanks for the good comments. I read it a few times; indeed a better, cleaner approach.

@mattdowle mattdowle merged commit 3950774 into master Mar 30, 2018
6 checks passed
@mattdowle mattdowle deleted the fix#2674 branch Mar 30, 2018
@shrektan shrektan added the encoding label Sep 9, 2019