data.table breaks when there're millions of Chinese characters #2674
After hours of investigating, I think it's highly likely that some kind of protection is missing at line 902 in bb3ba9a.
related to #2566
I've found a serious issue: when there are millions of non-ASCII characters encoded in a non-UTF-8 encoding (see the example below), calling `setkey()` on that column makes `data.table` extremely slow, and it eventually throws the error `'translateCharUTF8' must be called on a CHARSXP`. After that error, all subsequent `data.table` function calls fail with another error, `Internal error: savetl_init checks failed`. I will investigate and report more details later. Hopefully, I can file a PR to fix this.
Example
NOTE: you have to execute this on a Windows machine with GB2312 as the default encoding (i.e., a Simplified Chinese Windows machine); otherwise it won't work. Also, if it doesn't fail the first time, try executing it twice. I've tried this on several machines in my office, and I'm quite confident it's reproducible.
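For reference, a minimal sketch of the kind of reproduction described above (my assumptions: a Simplified Chinese Windows locale whose native encoding is GB2312, roughly 1e7 rows, and illustrative column values):

```r
library(data.table)

# A Chinese string re-encoded into the native (GB2312) encoding, i.e. NOT UTF-8.
cn <- iconv("中文字符", from = "UTF-8", to = "GB2312")

# Millions of distinct non-UTF-8 character values.
n  <- 1e7
dt <- data.table(id = seq_len(n),
                 x  = paste0(cn, sample.int(1e5, n, replace = TRUE)))

# setkey() on the non-UTF-8 character column becomes extremely slow and
# eventually errors with: 'translateCharUTF8' must be called on a CHARSXP
setkey(dt, x)

# After that error, any further data.table call fails with:
# Internal error: savetl_init checks failed
dt[1]
```

On a non-Chinese locale the same construction just gives you UTF-8 or mis-marked strings, which is why the note above restricts this to a GB2312 Windows machine.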
session info
UPDATES
- `ENC2UTF8` is very slow for millions of chars, but I'm still not sure whether the issue is caused by this. Moreover, it's hard for me to understand why it's slow, because it seems like R itself implements `enc2utf8` in a similar way. EDIT: `enc2utf8()` also takes a long time (17 s) to convert 1e7 chars (see the timing sketch below), so this is not the issue.
- `savetl()`, `savetl_end()`. I'm not familiar with how the global string pool works in R. However, I suspect that the UTF-8 chars created by `data.table` get released when `gc()` is triggered (which would explain why it only occurs when the number of chars is large). If a char gets released and `savetl_end()` then tries to modify a non-existent char's truelength...
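For the first point, the timing I mean is along these lines (a rough sketch under the same GB2312-locale assumption; the 17 s figure is from my machine):

```r
# enc2utf8() alone on 1e7 native-encoded (GB2312) strings takes ~17 s here,
# so the slowness is not specific to data.table's ENC2UTF8 usage.
x <- iconv(paste0("中文", seq_len(1e7)), from = "UTF-8", to = "GB2312")
system.time(y <- enc2utf8(x))
```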