prevent the utf8 string from being collected by the garbage collector in forder() #2678

Merged
merged 14 commits into master from fix#2674 Mar 30, 2018

shrektan commented Mar 16, 2018

Closes #2674.
This PR replaces PR #2675 (see the comments there).

If the garbage collector happens to run during sorting (e.g., when there are millions of non-ASCII characters), data.table crashes (see #2674 for details), because it assumes the converted UTF-8 strings are still present in R's global string pool. This PR fixes that issue.

In addition, it improves performance when there are millions of non-ASCII characters, because each string now needs to be converted to UTF-8 only once. Before this PR, the strings were converted twice, in csort_pre() and again in csort(), which can be costly for a large character vector (for example, on my computer enc2utf8() takes about 20s for a vector of 1e7 Chinese strings).
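The gist of the single-conversion change can be sketched as a hypothetical Python analogy (the real fix is in forder.c's C code; `normalize()` below is a made-up stand-in for `enc2utf8()`, and the call counting only illustrates why converting once is cheaper):

```python
# Count how many times the (expensive) conversion runs.
calls = {"n": 0}

def normalize(s):
    # stand-in for enc2utf8(): idempotent but costly per-string conversion
    calls["n"] += 1
    return s.casefold()

def order_convert_twice(xs):
    # before the PR: the pre-pass converts every string (csort_pre),
    # then the ordering converts every string again (csort)
    _pool = sorted({normalize(s) for s in xs})
    return sorted(range(len(xs)), key=lambda i: normalize(xs[i]))

def order_convert_once(xs):
    # after the PR: convert exactly once up front, reuse the cached keys
    keys = [normalize(s) for s in xs]
    return sorted(range(len(xs)), key=keys.__getitem__)

xs = ["Beta", "alpha", "Gamma"]
calls["n"] = 0
o1 = order_convert_twice(xs)
twice = calls["n"]          # 6 conversions for 3 strings
calls["n"] = 0
o2 = order_convert_once(xs)
once = calls["n"]           # 3 conversions for 3 strings
assert o1 == o2 and twice == 2 * once
```

Both paths return the same ordering; the second simply halves the number of conversions, which matters when a single conversion pass already takes ~20s.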

TODO

  • a new test
  • entries in NEWS.md
  • by = .(x, y) should return the same rows as by = .(y, x) (see comment below)
  • performance should not be affected when all the strings are ASCII or already encoded in UTF-8
@@ -866,7 +866,7 @@ static void csort(SEXP *x, int *o, int n)
/* can't use otmp, since iradix might be called here and that uses otmp (and xtmp).
alloc_csort_otmp(n) is called from forder for either n=nrow if 1st column,
or n=maxgrpn if onwards columns */
- for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(ENC2UTF8(x[i]));
+ for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(x[i]);

MichaelChirico commented Mar 16, 2018

great catch 👏

shrektan commented Mar 16, 2018

@mattdowle This is the first time I've realized that the radix sort in R is the same as data.table's. Thanks for the great work! I don't think I've encountered this issue before, so it should be a problem of data.table 1.10.5 only.

However, regarding the following line, are you sure it's OK? The xd here will not be converted to UTF-8. (UPDATE: fixed by be9a6de.)

About the performance: it's significantly improved when there are lots of non-ASCII characters, but yes, it becomes a little slower when all the strings are ASCII. (UPDATE: with commit 84ca0b0, the overhead should be minimal.) See the tests on my computer:

library(data.table)
nonascii_string <- function(n, utf8 = TRUE) {
  x <- c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")
  if (isTRUE(utf8)) x <- enc2utf8(x)
  sample(x, n, TRUE)
}

# ascii 1
tmp <- data.table(x = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, x)) 
# ascii 2
tmp <- data.table(x = sample(letters, 1e8, TRUE), y = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, y, x)) 
# utf8 1
tmp <- data.table(x = nonascii_string(1e7))
system.time(setkey(tmp, x)) 
# utf8 2
tmp <- data.table(x = nonascii_string(1e7), y = nonascii_string(1e7))
system.time(setkey(tmp, y, x)) 
# native
tmp <- data.table(x = nonascii_string(1e5, FALSE))
system.time(setkey(tmp, x)) 
| version | ascii 1 | ascii 2 | utf8 1 | utf8 2 | native |
| --- | --- | --- | --- | --- | --- |
| v1.9.6 (before ENC2UTF8 was introduced) | 1.90s | 5.83s | 0.16s | 0.41s | 0.00s |
| master (4d8545e) | 1.90s | 5.83s | 0.16s | 0.34s | 0.47s (if it doesn't fail) |
| this version | 1.90s | 5.90s | 0.16s | 0.33s | 0.14s |

@shrektan shrektan requested a review from mattdowle Mar 16, 2018

@shrektan shrektan added this to the v1.10.6 milestone Mar 16, 2018

codecov-io commented Mar 16, 2018

Codecov Report

Merging #2678 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #2678      +/-   ##
==========================================
- Coverage   93.32%   93.31%   -0.01%     
==========================================
  Files          61       61              
  Lines       12225    12237      +12     
==========================================
+ Hits        11409    11419      +10     
- Misses        816      818       +2
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| src/forder.c | 97.92% <100%> | +0.03% ⬆️ |
| src/fsort.c | 72.72% <0%> | -1.22% ⬇️ |

Continue to review the full report at Codecov.
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 437646f...1ca89ba.

shrektan commented Mar 16, 2018

I confirm that we need to edit the following line as well. (UPDATE: this has been fixed by be9a6de.)

if (sortStr) { csort_pre(xd, n); alloc_csort_otmp(gsmax[1-flip]); g = &csort; }

library(data.table)
utf8_strings <- enc2utf8(c("红利收入", "价差收入"))
native_strings <- enc2native(utf8_strings)
mixed_strings <- c(utf8_strings, native_strings)
DT <- data.table(x = mixed_strings, y = 1)
DT[, .N, by = .(x, y)]

          x y N
1: 红利收入 1 2
2: 价差收入 1 2

DT[, .N, by = .(y, x)]

   y        x N
1: 1 红利收入 1
2: 1 价差收入 1
3: 1 红利收入 1
4: 1 价差收入 1
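The inconsistent outputs above come from comparing the same text under two different encodings. A hedged Python analogy (not data.table internals; Python's "gbk" codec stands in here for a Chinese native encoding, and the byte strings stand in for unconverted character values):

```python
# Same text, two encodings: byte-wise keys split one logical group in two.
texts = ["红利收入", "价差收入"]
utf8_keys = [t.encode("utf-8") for t in texts]
gbk_keys = [t.encode("gbk") for t in texts]    # stand-in for the native encoding

# grouping on raw bytes: 4 groups instead of the expected 2
assert len(set(utf8_keys + gbk_keys)) == 4

# decode everything to one representation first (what the fix does by
# converting to UTF-8 before ordering): 2 groups, regardless of key order
normalized = [b.decode("utf-8") for b in utf8_keys] + \
             [b.decode("gbk") for b in gbk_keys]
assert len(set(normalized)) == 2
```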

shrektan added some commits Mar 16, 2018

shrektan commented Mar 17, 2018

@mattdowle I've pushed more commits to fix grouping (i.e., DT[, j, by = .(x, y)]) and to remove the overhead for all-ASCII/UTF-8 columns.

shrektan and others added some commits Mar 20, 2018

@shrektan shrektan referenced this pull request Mar 27, 2018

Closed

Segfault during sorting #2707

mattdowle commented Mar 29, 2018

Thanks for all this! It's looking great to me. According to codecov, need2utf8 always returns false. Is there a way to add a test to cover those 4 lines?

shrektan commented Mar 30, 2018

I guess adding an example that uses the Latin-1 encoding may work. I'll submit one later.

shrektan commented Mar 30, 2018

The error log I downloaded from AppVeyor says:

Running test id 1896.4
Test 1896.4 ran without errors but failed check that x equals y:
> x = nrow(DT[, .N, by = .(z, x, y)]) 
First 6 of 1 (type 'integer'): [1] 10
> y = 5L 
First 6 of 1 (type 'integer'): [1] 5

I can't understand the failure, because the following code gives the correct answer (5) on both my Mac and my Windows machine (I even tried setting my language and locale to the United States)...

utf8_strings <- c("\u00e7ile", "fa\u00e7ile", "El. pa\u00c5\u00a1tas", "\u00a1tas", "\u00de")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
mixed_strings <- c(utf8_strings, latin1_strings)
DT <- data.table(x = mixed_strings, y = c(latin1_strings, utf8_strings), z = 1)
nrow(DT[, .N, by = .(z, x, y)])
# 5
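As a rough illustration of what this test exercises, here is a hypothetical Python analogy: the five strings exist under both Latin-1 and UTF-8 byte representations, and grouping should collapse each pair of representations into one group per string:

```python
# The five strings from the test, as Unicode text.
utf8_strings = ["\u00e7ile", "fa\u00e7ile", "El. pa\u00c5\u00a1tas", "\u00a1tas", "\u00de"]

latin1_keys = [s.encode("latin-1") for s in utf8_strings]
utf8_keys = [s.encode("utf-8") for s in utf8_strings]

# raw byte keys: every string shows up twice, giving 10 groups (the bug)
assert len(set(latin1_keys + utf8_keys)) == 10

# decoding each representation back to text: 5 groups (the expected answer)
decoded = [b.decode("latin-1") for b in latin1_keys] + \
          [b.decode("utf-8") for b in utf8_keys]
assert len(set(decoded)) == 5
```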

EDIT

Got it. Yes, it indeed fails on the 32-bit version of R... I'm investigating it now...

EDIT2

It should be fixed now. Also, the need2utf8 condition is now covered.

@mattdowle

Very nice. Thanks for the good comments. I read them a few times; it's indeed a better, cleaner approach.

mattdowle added some commits Mar 30, 2018

@mattdowle mattdowle merged commit 3950774 into master Mar 30, 2018

6 checks passed

codecov/patch: 100% of diff hit (target 93.32%)
codecov/project: Absolute coverage decreased by <.01% but relative coverage increased by +6.67% compared to 437646f
continuous-integration/appveyor/branch: AppVeyor build succeeded
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@mattdowle mattdowle deleted the fix#2674 branch Mar 30, 2018
