data.table breaks when there're millions of Chinese characters #2674
related to #2566
I found a serious issue: when there are millions of non-ASCII characters in a non-UTF-8 encoding (see the example below),
After that error, all the
I will investigate and report more details later. Hopefully, I can file a PR to fix this.
Note: you have to run this on a Windows machine with GB2312 as the default encoding (i.e., a Simplified Chinese Windows machine); otherwise it won't reproduce. Also, if it doesn't fail the first time, try running it twice. I've tried this on several machines in my office, and I'm quite confident it's reproducible.
library(data.table)

dt <- data.table(
  x = sample(
    c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
    1e7, TRUE
  ),
  z = 1
)
setkey(dt, x)
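For anyone trying to reproduce this, it may help to first confirm that the session really uses a non-UTF-8 native encoding, as the note above requires. The following is only a rough sketch using base R; the variable names `x` and `enc` are made up for illustration, and what it prints depends entirely on the machine.

```r
# Show the locale that controls character handling in this session.
print(Sys.getlocale("LC_CTYPE"))

# Show localization details; `UTF-8` is FALSE when the native
# encoding is not UTF-8 (as on the machines described in this report).
print(l10n_info())

# Check how R has marked a couple of the strings from the example.
# "unknown" means the string is assumed to be in the native encoding;
# "UTF-8" means it is explicitly marked as UTF-8.
x <- c("公允价值变动损益", "红利收入")
enc <- Encoding(x)
print(enc)
```

If `l10n_info()` reports `UTF-8` as `TRUE`, the machine probably does not match the conditions described here.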
After hours of investigating, I think it's highly likely that some kind of protection is missing in