[BUG REPORT] Query fails if the key contains Chinese Character on Windows7 #2462

Closed
shrektan opened this issue Nov 3, 2017 · 1 comment · Fixed by #2566
Labels
encoding

shrektan commented Nov 3, 2017

DESCRIPTION

Thanks for this great package; it helps us a lot. I'm currently working on Windows. If a data.table's key contains Chinese characters, a keyed query sometimes returns the wrong answer. Below is my minimal example. I only get the right answer after converting the Chinese-character key column to UTF-8.

@renkun-ken It would be great if you could confirm this bug for me, since you are one of the data.table users who probably has a similar working environment (Windows + Chinese characters). I believe this bug was actually introduced by 03cd45f, yet nobody else has reported it in almost two years, which seems quite odd to me.

Minimal reproducible example

Dataset

library(data.table)
## data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 20:06:10 UTC
## The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
##   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
##   Release notes, videos and slides: http://r-datatable.com
dt <- data.table(
  x = c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
  y = 1:5,
  key = "x"
)

Will fail (returns NA) if the encoding is native

dt[]
##                   x y
## 1: 公允价值变动损益 1
## 2:         红利收入 2
## 3:         价差收入 3
## 4:     其他业务支出 4
## 5:     资产减值损失 5
Encoding(dt$x) 
## [1] "unknown" "unknown" "unknown" "unknown" "unknown"
dt[J("公允价值变动损益")][]
##                   x  y
## 1: 公允价值变动损益 NA

Will succeed only if the encoding is converted to utf8

Now it returns the correct answer, 1.
Note that dt's row order has also changed, which is not supposed to happen.

dt[, x := enc2utf8(x)]
setkey(dt, x)

dt[]
##                   x y
## 1:         价差收入 3
## 2: 公允价值变动损益 1
## 3:     其他业务支出 4
## 4:         红利收入 2
## 5:     资产减值损失 5
Encoding(dt$x)
## [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
dt[J("公允价值变动损益")][]
##                   x y
## 1: 公允价值变动损益 1
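
The two runs above differ only in how the key strings are stored. As a minimal base R sketch (not part of the original report), the snippet below shows that the same text has different bytes in UTF-8 and in GBK, the encoding behind code page 936. It assumes the platform's iconv recognises the name "GBK"; the variable names are illustrative only.

```r
# "红利收入" written with \u escapes so the snippet does not depend on
# the encoding of the source file or the session locale.
key_utf8 <- "\u7ea2\u5229\u6536\u5165"

# Re-encode to GBK (assumption: this iconv build supports "GBK").
key_gbk <- iconv(key_utf8, from = "UTF-8", to = "GBK")

bytes_utf8 <- charToRaw(key_utf8)  # 3 bytes per character for this text
bytes_gbk  <- charToRaw(key_gbk)   # 2 bytes per character for this text

print(bytes_utf8)
print(bytes_gbk)

# Same text, different byte sequences.
print(identical(bytes_utf8, bytes_gbk))
```

Any code that compares or sorts strings by their raw bytes will therefore treat the native and UTF-8 forms of one key as different values unless it normalises them first.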

sessionInfo

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.5
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.1  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.1     htmltools_0.3.6 Rcpp_0.12.13    stringi_1.1.5  
##  [9] rmarkdown_1.8   knitr_1.17      stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10.1

UPDATE: The example above was run with the devel version.

shrektan commented Jan 13, 2018

This bug is due to these two lines (csort and csort_pre in forder.c): https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L869 and https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L902

They use the TRUELENGTH macro to convert a string to a number. However, if the same string is stored under different encodings, this number will differ. This means that data.table's ordering behavior still depends on the encoding.
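
To illustrate the dependence on encoding without touching data.table's internals, here is a rough base R sketch. It sorts the leading characters of the five keys from the example by their raw bytes, once as UTF-8 and once as GBK. The helper `byte_key` is hypothetical and exists only for this illustration, and the snippet assumes the platform's iconv supports "GBK".

```r
# Leading characters of the five keys, in the original row order:
# 公, 红, 价, 其, 资 (written as \u escapes to be locale independent).
first_chars <- c("\u516c", "\u7ea2", "\u4ef7", "\u5176", "\u8d44")

# Hypothetical helper: hex dump of a string's bytes, usable as a
# byte-wise sort key because every byte becomes two ASCII hex digits.
byte_key <- function(s) paste(as.character(charToRaw(s)), collapse = "")

keys_utf8 <- vapply(first_chars, byte_key, character(1), USE.NAMES = FALSE)
keys_gbk  <- vapply(iconv(first_chars, from = "UTF-8", to = "GBK"),
                    byte_key, character(1), USE.NAMES = FALSE)

# method = "radix" sorts character vectors in the C locale (plain bytes).
order_utf8 <- order(keys_utf8, method = "radix")
order_gbk  <- order(keys_gbk, method = "radix")

print(order_utf8)  # 3 1 4 2 5
print(order_gbk)   # 1 2 3 4 5
```

The two permutations correspond to the two row orders printed in the example: the y column reads 1, 2, 3, 4, 5 under the native encoding and 3, 1, 4, 2, 5 after conversion to UTF-8.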

I will try to file a PR later. Hopefully it can help settle this issue (finally...).

UPDATE: My original understanding was not accurate, so I modified this comment.
