fwrite compression error on Solaris #3931

mattdowle · 2019-10-03T17:57:24Z

using R version 3.6.1 Patched (2019-10-02 r77254)
using platform: i386-pc-solaris2.10 (32-bit)
using session charset: UTF-8
using option ‘--no-stop-on-test-error’
checking for file ‘data.table/DESCRIPTION’ ... OK
this is package ‘data.table’ version ‘1.12.4’
...
checking tests ... [102s/121s] ERROR
  Running ‘autoprint.R’
  Comparing ‘autoprint.Rout’ to ‘autoprint.Rout.save’ ... OK
  Running ‘froll.R’ [37s/42s]
  Running ‘knitr.R’
  Comparing ‘knitr.Rout’ to ‘knitr.Rout.save’ ... OK
  Running ‘main.R’ [56s/67s]
  Running ‘nafill.R’
  Running ‘other.R’
  Running ‘types.R’
Running the tests in ‘tests/main.R’ failed.
Complete output:
  > require(data.table)
  Loading required package: data.table
  >
  > test.data.table() # runs the main test suite of 5,000+ tests in /inst/tests/tests.Rraw
  getDTthreads(verbose=TRUE):
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads. See ?setDTthreads.
  test.data.table() running: /home/ripley/R/Lib32/data.table/tests/tests.Rraw
  
  Running test id 1.1
  Running test id 1.2
  Running test id 2.1
...
  Running test id 1658.39
  Running test id 1658.4
  Running test id 1658.41 Error in fwrite(DT, file = f1 <- tempfile(fileext = ".gz")) :
    Error -2: one or more threads failed to allocate buffers or there was a compression error. Please try again with verbose=TRUE and try searching online for this error message.
  Calls: test.data.table -> sys.source -> eval -> eval -> fwrite
  Execution halted

philippechataignon · 2019-10-03T20:30:37Z

Error -2: the negative value in failed comes from a malloc error, where failed = -errno. On Linux, when malloc fails, errno is set to ENOMEM = 12 /* Out of memory */, whereas errno 2 is ENOENT /* No such file or directory */, which would indicate an open/write failure.
I see here that Solaris malloc sometimes returns an EAGAIN errno. But EAGAIN seems to be errno = 11.
Do you know if Solaris errno codes are the same as on Linux? I can't explain this -2 value.

mattdowle · 2019-10-04T02:54:06Z

Talking out loud/brainstorming ...

I'm not sure why I didn't spot this before, but the first thing that stands out when looking at the code in general is that failed is global. Two threads shouldn't be able to write to the same memory at the same time. Might the -2 be the result of a race? In other parallel regions I do allow threads to write "naked" (i.e. unprotected by an omp atomic directive) to a flag, but strictly only when that flag is bool AND the only value ever written is 1. In this case failed is a 4-byte int. I'm not sure how a race could weird-out the bytes within an int; perhaps if those 4 bytes are allowed to straddle a cache-line boundary on Solaris? Anyway, each thread should have its own failed error code, and whichever thread reaches the ordered region first in a failed state gets the honor of transferring its failure reason to the global failed. That way, the first failure reason is the one that gets reported, reliably.
But even if a race did occur and corrupt failed, at best that would be hiding the real problem.
It seems that it is running with 2 threads, so that's good. But it's 32-bit. Could anything be 32-bit related, like a 2GB or 4GB limit? It passes on Windows 32-bit though, and it's usually Windows 32-bit that is the most sensitive to memory problems in this area (a good thing, since Win 32-bit finds quite a lot for us).

mattdowle · 2019-10-04T03:39:16Z

Does errno need to be set to 0 before the call that might set it on error? There seems to be some debate about that, and it's something Solaris 10 could be different about: whether errno is set in a thread-safe way.
I'm hoping it was a straightforward malloc out-of-memory on a busy machine. We don't know the size of the buffer each of the 2 threads was requesting. That could be added to the error message to confirm. Maybe it being 32-bit is causing the buffer calculation to be unintentionally large due to an overflow or similar.

mattdowle added this to the 1.13.0 milestone Oct 3, 2019

mattdowle modified the milestones: 1.12.7, 1.12.5 Oct 8, 2019

mattdowle mentioned this issue Oct 9, 2019

trace fwrite-zlib on solaris #3954

Merged

mattdowle closed this as completed in #3954 Oct 9, 2019
