test.data.table report forderv #3502

Closed
minemR opened this issue Apr 11, 2019 · 2 comments · Fixed by #6074
Labels
encoding issues related to Encoding tests

Comments

minemR commented Apr 11, 2019

sessionInfo()
# R version 3.5.3 (2019-03-11)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=Latvian_Latvia.1257  LC_CTYPE=Latvian_Latvia.1257    LC_MONETARY=Latvian_Latvia.1257 LC_NUMERIC=C                   
# [5] LC_TIME=Latvian_Latvia.1257    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] R.utils_2.8.0     R.oo_1.22.0       R.methodsS3_1.7.1 nanotime_0.2.3    xts_0.11-2        zoo_1.8-5         bit64_0.9-7      
# [8] bit_1.1-14        data.table_1.12.2
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.5.3  tools_3.5.3     RcppCCTZ_0.2.5  yaml_2.2.0      Rcpp_1.0.1      stringi_1.4.3   grid_3.5.3      lattice_0.20-38
sink(file = 'aut.txt')
require(data.table)
test.data.table()
sink()

I'm using sink() because there are too many errors to scan manually.
Also, at one point during testing, every test id starts to be printed, like:
.... Running test id 1964.2 Running test id 1964.3 Running test id 1964.4 Running test id 1965 Running test id 1966.1 Running test id 1966.2 Running test id 1966.3 Running test id 1966.4 Running test id 1966.5 Running test id 1966.6 Running test id 1967.1 ...
which floods the console.

The tests result in:

# Error in test.data.table() : 
  # 873 errors out of 8556 in 00:02:35 elapsed (00:01:43 cpu) on Thu Apr 11 11:54:40 2019. [endian==little, sizeof(long double)==16, sizeof(pointer)==8, TZ=Europe/Helsinki, locale='LC_COLLATE=Latvian_Latvia.1257;LC_CTYPE=Latvian_Latvia.1257;LC_MONETARY=Latvian_Latvia.1257;LC_NUMERIC=C;LC_TIME=Latvian_Latvia.1257', l10n_info()='MBCS=FALSE; UTF-8=FALSE; Latin-1=FALSE; codepage=1257', getDTthreads()='omp_get_num_procs()==8; R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50); R_DATATABLE_NUM_THREADS==""; omp_get_thread_limit()==2147483647; omp_get_max_threads()==8; OMP_THREAD_LIMIT==""; OMP_NUM_THREADS==""; data.table is using 4 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.; RestoreAfterFork==true']. Search tests/tests.Rraw for test numbers: 168.1, 168.2, 168.3, 610.1, 610.3, 1102.14, 1102.14, 1102.15, 1223.003, 1223.004, 1223.009, 1223.01, 1223.013, 1223.014, 1223.015, 1223.016, 1223.025, 1223.026, 1223.027, 1223.028, 1223.033, 1223.034, 1223.035, 1223.036, 

From looking at other issues, my initial guess is that errors mainly arise because of locale.
Extracting the ids of the failing tests:

report <- readLines('aut.txt')
i <- grep('errors', report)
er <- report[i]
length(er) # 841 509 873 # number of errors is not consistent
ids <- stringi::stri_extract_all_regex(er, '(?<=id )[0-9]+', simplify = TRUE)
as.integer(unique(ids))
# [1]  168  610 1102 1223 1253 1590
# [1]  168  610 1102 1223 1253 1590
# [1]  168  610 1102 1223 1253 1590
# but always the same tests

168 - definitely locale problem

610 :

set.seed(32)
x = sample(LETTERS)
chorder(x)
# [1] 24 23 26  5 12 22 10 17 13 21 11 18  9  1  2  7  4 25 20  3  6 15  8 14 19 16
base::order(x)
# [1] 24 23 26  5 12 22 10 17 13 19 21 11 18  9  1  2  7  4 25 20  3  6 15  8 14 16

What's happening here?

1102 - dcast column order, so possibly also forderv?
1223 - related to forderv?
1253 - related to forderv?
1590 - the locale is changed beforehand, so there should be no errors?

Maybe too much information for one issue, but my guess is that it could be tracked down to forderv?

@MichaelChirico MichaelChirico added the encoding issues related to Encoding label Apr 6, 2024
@MichaelChirico
Member

I confirm this on Linux: LC_ALL=lv_LV.utf8 Rscript -e "library(data.table); test.data.table()" generates tons of errors.

@MichaelChirico
Member

One way to accomplish this is with another feature like #5848 / #5842 -- an argument to set the locale temporarily within test().

OTOH, e.g. for test 610.1:

x = sample(LETTERS)
test(610.1, chorder(x), base::order(x))

What's really going wrong is that we're assuming something about base::order(LETTERS). In other words, it's not data.table code that's breaking the test (data.table always sorts in C locale, while base::order() on character vectors follows LC_COLLATE by default). For the env=, options= features, it's about making sure data.table code runs as expected, whereas this issue is about making base code run as expected. That tells me maybe we want something like:

x = sample(LETTERS)
test(610.1, chorder(x), match(LETTERS, x))
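
As a quick sanity check of why match(LETTERS, x) is a locale-independent expectation, here is a minimal sketch in base R only. It relies on the documented behaviour that order(method = "radix") sorts character vectors in C locale; the variable name c_order is just for illustration.

```r
set.seed(32)
x = sample(LETTERS)

# order(method = "radix") ignores LC_COLLATE and sorts in C locale,
# which is the collation chorder() is expected to follow
c_order = order(x, method = "radix")

# x is a permutation of LETTERS, so the i-th smallest element is LETTERS[i],
# and its position in x is match(LETTERS, x)[i]
stopifnot(identical(c_order, match(LETTERS, x)))
```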

Or to really keep the emphasis on the test of chorder ↔️ base::order:

x = sample(LETTERS)
base_order <- local({
  old = Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  Sys.setlocale("LC_COLLATE", "C")  # switch to C collation before ordering
  base::order(x)
})
test(610.1, chorder(x), base_order)

Or even add a helper to use throughout tests:

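with_c_collate = function(expr) {
  old = Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  Sys.setlocale("LC_COLLATE", "C")  # switch to C collation; restored on exit
  expr  # lazily evaluated here, i.e. after the locale switch
}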
test(610.1, chorder(x), with_c_collate(base::order(x)))
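
For illustration, a helper along these lines could be checked in isolation roughly as follows. This is only a sketch: the name with_c_collate_sketch and the sample vector are hypothetical, and it assumes the helper has to switch LC_COLLATE to "C" itself before the expression is forced.

```r
# Hypothetical stand-alone variant of the helper, for experimentation only
with_c_collate_sketch = function(expr) {
  old = Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  expr  # the promise is forced here, under C collation
}

before = Sys.getlocale("LC_COLLATE")
res = with_c_collate_sketch(base::order(c("b", "A", "a", "B")))
after = Sys.getlocale("LC_COLLATE")

# In C collation upper case sorts before lower case: "A" < "B" < "a" < "b"
stopifnot(identical(res, c(2L, 4L, 3L, 1L)))
# The session's collation is restored once the helper returns
stopifnot(identical(before, after))
```

Because expr is a promise, base::order() only runs once the function body reaches it, which is why the locale switch has to come first.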
