fwrite compression error on Solaris #3931

Closed
mattdowle opened this issue Oct 3, 2019 · 3 comments · Fixed by #3954

Comments

@mattdowle
Member

using R version 3.6.1 Patched (2019-10-02 r77254)
using platform: i386-pc-solaris2.10 (32-bit)
using session charset: UTF-8
using option ‘--no-stop-on-test-error’
checking for file ‘data.table/DESCRIPTION’ ... OK
this is package ‘data.table’ version ‘1.12.4’
...
checking tests ... [102s/121s] ERROR
  Running ‘autoprint.R’
  Comparing ‘autoprint.Rout’ to ‘autoprint.Rout.save’ ... OK
  Running ‘froll.R’ [37s/42s]
  Running ‘knitr.R’
  Comparing ‘knitr.Rout’ to ‘knitr.Rout.save’ ... OK
  Running ‘main.R’ [56s/67s]
  Running ‘nafill.R’
  Running ‘other.R’
  Running ‘types.R’
Running the tests in ‘tests/main.R’ failed.
Complete output:
  > require(data.table)
  Loading required package: data.table
  >
  > test.data.table() # runs the main test suite of 5,000+ tests in /inst/tests/tests.Rraw
  getDTthreads(verbose=TRUE):
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads. See ?setDTthreads.
  test.data.table() running: /home/ripley/R/Lib32/data.table/tests/tests.Rraw
  
  Running test id 1.1
  Running test id 1.2
  Running test id 2.1
...
  Running test id 1658.39
  Running test id 1658.4
  Running test id 1658.41 Error in fwrite(DT, file = f1 <- tempfile(fileext = ".gz")) :
    Error -2: one or more threads failed to allocate buffers or there was a compression error. Please try again with verbose=TRUE and try searching online for this error message.
  Calls: test.data.table -> sys.source -> eval -> eval -> fwrite
  Execution halted
@mattdowle mattdowle added this to the 1.13.0 milestone Oct 3, 2019
@philippechataignon
Contributor

Error -2: the negative value in failed comes from a malloc error, where failed = -errno. On Linux, when malloc fails, errno is set to ENOMEM = 12 /* Out of memory */, whereas errno 2 is ENOENT /* No such file or directory */, which indicates an open/write failure.
I see here that Solaris malloc sometimes returns an EAGAIN errno, but EAGAIN seems to be errno = 11.
Do you know whether Solaris errno codes are the same as on Linux? I can't explain this -2 value.
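
For reference, the failed = -errno convention described above can be sketched as follows. This is a minimal illustration only, not the code in fwrite: try_alloc and describe_failed are hypothetical names, and the fallback to ENOMEM when malloc leaves errno at 0 is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: try to allocate `size` bytes. On failure, store
 * -errno in *failed (assumption: fall back to -ENOMEM if errno is still 0). */
void *try_alloc(size_t size, int *failed)
{
    errno = 0;                       /* start from a known state */
    void *p = malloc(size);
    if (p == NULL && size > 0) {
        *failed = (errno != 0) ? -errno : -ENOMEM;
    }
    return p;
}

/* Hypothetical helper: map a `failed` code back to a readable reason.
 * Only negative values are treated as errno-derived. */
const char *describe_failed(int failed)
{
    if (failed == 0) return "no error";
    if (failed > 0)  return "not an errno-derived code";
    return strerror(-failed);        /* e.g. -ENOMEM gives the ENOMEM text */
}
```

Passing the reported -2 to describe_failed gives the platform's text for errno 2, which is the quickest way to compare what Linux and Solaris each think that code means.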

@mattdowle
Member Author

mattdowle commented Oct 4, 2019

Talking out loud/brainstorming ...

I'm not sure why I didn't spot this before, but the first thing I notice when looking at the code in general is that failed is global. Two threads shouldn't have the potential to write to the same memory at the same time. Might the -2 be the result of a race? In other parallel regions I do allow threads to write "naked" (i.e. unprotected by an omp atomic directive) to a flag, but strictly only when that flag is bool AND the only value ever written is 1. In this case failed is a 4-byte int. I'm not sure how a race could corrupt individual bytes within an int; perhaps if those 4 bytes are allowed to straddle a cache line boundary on Solaris? Anyway, each thread should have its own failed error code, and then whichever thread reaches the ordered region first in a failed state gets the honor of transferring its failed reason to the global failed. That way, only the first failed reason is reported, and reliably.
But even if a race did occur and corrupted failed, the best case would be that it's hiding the real problem.
It seems that it is running with 2 threads so that's good. But it's 32bit. Anything that could be 32bit related, like a 2GB or 4GB limit or something? It passes on Windows-32bit though and it's usually Windows-32bit that is the most sensitive to memory problems in this area (a good thing that Win-32bit finds quite a lot for us).
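
The per-thread error code idea above could look roughly like this. It is a sketch under assumptions, not the eventual fix: process_chunk and run_chunks are hypothetical names, and the failure is simulated rather than coming from a real allocation or compression call. Built without OpenMP, the pragmas are ignored and the loop simply runs serially.

```c
#include <assert.h>

/* Hypothetical stand-in for compressing and writing one chunk.
 * Returns 0 on success; chunks at or past fail_from return a negative code. */
int process_chunk(int i, int fail_from)
{
    return (fail_from >= 0 && i >= fail_from) ? -(i + 1) : 0;
}

/* Each iteration keeps its own my_failed. The shared `failed` is written
 * only inside the ordered region, which executes one iteration at a time
 * in loop order, so there are no concurrent writes and the first failure
 * in loop order is the one that gets reported. */
int run_chunks(int nchunks, int fail_from)
{
    int failed = 0;   /* shared, but only touched inside `ordered` */

    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < nchunks; i++) {
        int my_failed = process_chunk(i, fail_from);  /* private to this iteration */

        #pragma omp ordered
        {
            if (my_failed != 0 && failed == 0) {
                failed = my_failed;   /* first failure wins */
            }
        }
    }
    return failed;
}
```

Because only the ordered region touches the shared variable, no atomic directive is needed for it, and later failures cannot overwrite the first reason.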

@mattdowle
Member Author

mattdowle commented Oct 4, 2019

Does errno need to be set to 0 before the call that might set it on error? There seems to be some debate about that, and whether errno is set in a thread-safe way is something Solaris 10 could differ on.
I'm hoping it was a straightforward malloc out-of-memory on a busy machine. We don't know the size of the buffer each of the 2 threads was requesting; that could be added to the error message to confirm. Maybe being 32-bit causes the buffer calculation to come out unintentionally large due to an overflow or similar.
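
A rough sketch of what those three ideas might look like together: clear errno before malloc, guard the size calculation against overflow (on a 32-bit build size_t is 32 bits, so the product wraps at 4GB), and put the requested size in the message. The function name, its parameters, and the message wording are all hypothetical, and using -ENOMEM for the overflow case is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate nrows * row_bytes bytes for one thread's
 * buffer. On failure returns NULL, sets *failed to a negative code and
 * writes a message that includes the requested size. */
void *alloc_thread_buffer(size_t nrows, size_t row_bytes,
                          int *failed, char *msg, size_t msglen)
{
    *failed = 0;
    if (msglen > 0) msg[0] = '\0';

    /* Refuse sizes whose product would wrap around size_t. */
    if (row_bytes != 0 && nrows > SIZE_MAX / row_bytes) {
        *failed = -ENOMEM;   /* assumption: report overflow as out-of-memory */
        snprintf(msg, msglen,
                 "buffer size overflow: %zu rows x %zu bytes exceeds SIZE_MAX",
                 nrows, row_bytes);
        return NULL;
    }

    size_t size = nrows * row_bytes;
    errno = 0;               /* so a stale errno is never reported */
    void *buf = malloc(size);
    if (buf == NULL && size > 0) {
        int err = (errno != 0) ? errno : ENOMEM;
        *failed = -err;
        snprintf(msg, msglen, "failed to allocate %zu bytes: %s",
                 size, strerror(err));
        return NULL;
    }
    return buf;
}
```

With the size in the message, a report like the one above would show directly whether the request was plausible or absurdly large.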

@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.5 Oct 8, 2019