Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fwrite gzip on Solaris error #4099

Closed
mattdowle opened this issue Dec 9, 2019 · 15 comments · Fixed by #4845
Closed

fwrite gzip on Solaris error #4099

mattdowle opened this issue Dec 9, 2019 · 15 comments · Fixed by #4845

Comments

@mattdowle
Copy link
Member

Now that 1.12.8 is on CRAN, we have the output from the extra tracing in #4028.
https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/data.table-00check.html

using R version 3.6.2 RC (2019-12-06 r77544)
using platform: i386-pc-solaris2.10 (32-bit)
using session charset: UTF-8
using option ‘--no-stop-on-test-error’
checking for file ‘data.table/DESCRIPTION’ ... OK
this is package ‘data.table’ version ‘1.12.8’
checking package namespace information ... OK
checking package dependencies ... OK
checking if this is a source package ... OK
checking if there is a namespace ... OK
checking for executable files ... OK
checking for hidden files and directories ... OK
checking for portable file names ... OK
checking for sufficient/correct file permissions ... OK
checking whether package ‘data.table’ can be installed ... OK
checking installed package size ... OK
checking package directory ... OK
checking ‘build’ directory ... OK
checking DESCRIPTION meta-information ... OK
checking top-level files ... OK
checking for left-over files ... OK
checking index information ... OK
checking package subdirectories ... OK
checking R files for non-ASCII characters ... OK
checking R files for syntax errors ... OK
checking whether the package can be loaded ... OK
checking whether the package can be loaded with stated dependencies ... OK
checking whether the package can be unloaded cleanly ... OK
checking whether the namespace can be loaded with stated dependencies ... OK
checking whether the namespace can be unloaded cleanly ... OK
checking loading without being on the library search path ... OK
checking use of S3 registration ... OK
checking dependencies in R code ... OK
checking S3 generic/method consistency ... OK
checking replacement functions ... OK
checking foreign function calls ... OK
checking R code for possible problems ... [34s/41s] OK
checking Rd files ... OK
checking Rd metadata ... OK
checking Rd cross-references ... OK
checking for missing documentation entries ... OK
checking for code/documentation mismatches ... OK
checking Rd \usage sections ... OK
checking Rd contents ... OK
checking for unstated dependencies in examples ... OK
checking line endings in shell scripts ... OK
checking line endings in C/C++/Fortran sources/headers ... OK
checking line endings in Makefiles ... OK
checking compilation flags in Makevars ... OK
checking for GNU extensions in Makefiles ... OK
checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS) ... OK
checking include directives in Makefiles ... OK
checking compilation flags used ... OK
checking compiled code ... OK
checking installed files from ‘inst/doc’ ... OK
checking files in ‘vignettes’ ... OK
checking examples ... OK
checking for unstated dependencies in ‘tests’ ... OK
checking tests ... [102s/115s] ERROR
  Running ‘autoprint.R’
  Comparing ‘autoprint.Rout’ to ‘autoprint.Rout.save’ ... OK
  Running ‘froll.R’ [35s/38s]
  Running ‘knitr.R’
  Comparing ‘knitr.Rout’ to ‘knitr.Rout.save’ ... OK
  Running ‘main.R’ [58s/66s]
  Running ‘nafill.R’
  Running ‘other.R’
  Running ‘types.R’
Running the tests in ‘tests/main.R’ failed.
Complete output:
  > require(data.table)
  Loading required package: data.table
  >
  > test.data.table() # runs the main test suite of 5,000+ tests in /inst/tests/tests.Rraw
  getDTthreads(verbose=TRUE):
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads. See ?setDTthreads.
  test.data.table() running: /home/ripley/R/Lib32/data.table/tests/tests.Rraw.bz2
  Test 1658.421 produced 1 errors but expected 0
  Expected:
  Observed: zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please include the full output above and below this message in your data.table bug report.
  Output captured before unexpected warning/error/message:
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads. See ?setDTthreads.
  No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
  Column writers: 3 3
  args.doRowNames=0 args.rowNames=0 doQuote=-128 args.nrow=200 args.ncol=2 eolLen=1
  maxLineLen=51. Found in 0.000s
  Writing bom (false), yaml (0 characters) and column names (true) ... deflate input stream: e15b240 39 afdc2f8 4
  deflate returned 1 with stream->total_out==24; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1
  done in 0.000s
  Writing 200 rows in 1 batches of 200 rows (each buffer size 8MB, showProgress=0, nth=1)
  deflate input stream: d259448 8391193 f9f9fd8 800
  deflate returned -2 with stream->total_out==0; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1
  <simpleError in fwrite(DT, file = f1 <- tempfile(fileext = ".gz"), verbose = TRUE): zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please include the full output above and below this message in your data.table bug report.>
  Test 1658.43 ran without errors but failed check that x equals y:
  > x = fread(f1)
  Empty data.table (0 rows and 2 cols): a,b
   [Key= Types=log,log Classes=log,log]
  > y = DT
       a b [Key= Types=int,int Classes=int,int]
    1: 1 1
    2: 1 1
   ---
  199: 2 4
  200: 2 4
  Different number of rows
  Error in fwrite(DT, file = f3 <- tempfile(), compress = "gzip") :
    zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please retry fwrite() with verbose=TRUE and include the full output with your data.table bug report.
  Error in test.data.table() :
    Failed after test 1658.43 before the next test() call in /home/ripley/R/Lib32/data.table/tests/tests.Rraw.bz2
  Execution halted
checking for unstated dependencies in vignettes ... OK
checking package vignettes in ‘inst/doc’ ... OK
checking re-building of vignette outputs ... [139s/159s] OK
checking PDF version of manual ... OK
checking for detritus in the temp directory ... OK
DONE
Status: 1 ERROR
@mattdowle mattdowle added this to the 1.12.9 milestone Dec 9, 2019
@mattdowle
Copy link
Member Author

mattdowle commented Dec 9, 2019

The only thing I see so far to follow up on is in the 2nd deflate call :
deflate input stream: d259448 8391193 f9f9fd8 800
where that 8391193 comes from. That is the 8MB buffer (8 * 1024^2 = 8,388,608) - 2,585. Why *destLen would be 8MB - 2585 at that stage I'm not sure. Shouldn't it be the whole 8MB?

A: by reading the code it looks like 8391193 is returned by deflateBound when passed 8MB. Maybe it knows that no matter what the contents of the 8MB, it can always reduce a little bit (2,585 bytes). I'm not sure why we're using deflateBound for the 2nd part, because our buffers are fixed. The idea of deflateBound I thought was to provide an upper bound for the result of the compression, so that just that amount can be allocated. Important for the header, because for the very small header, the compressed output is most often larger. But not sure why for the data chunks.

Running locally I get the same output, absolutely no difference, including that 8391193. So that was barking up the wrong tree.

No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
Column writers: 3 3 
args.doRowNames=0 args.rowNames=0 doQuote=-128 args.nrow=200 args.ncol=2 eolLen=1
maxLineLen=51. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true) ... deflate input stream: 0x55df63204b70 39 0x55df5feff690 4
deflate returned 1 with stream->total_out==24; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1
done in 0.000s
Writing 200 rows in 1 batches of 200 rows (each buffer size 8MB, showProgress=1, nth=1)
deflate input stream: 0x55df69ade4b0 8391193 0x55df692de4a0 800
deflate returned 1 with stream->total_out==50; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1

@mattdowle
Copy link
Member Author

nth==1 in this test (output confirms that), otherwise it could be a thread safety issue.

@mattdowle
Copy link
Member Author

I asked on S.O. here :
https://stackoverflow.com/questions/59259256/z-stream-error-from-deflate-on-solaris-only

I fear that will get closed for being too vague. Just a quick review of the code or some pointers from @madler would be awesome! He's by far the top user in that tag too, so fingers crossed!

@madler
Copy link

madler commented Dec 10, 2019

Your arithmetic needs a little work. 8,391,193 is greater than 8,388,608.

@mattdowle
Copy link
Member Author

Thanks, @madler. Arithmetic was never my strong point :-) Just an error here in comments. I rechecked our code in fwrite.c and it does allocate that 8,391,193 size (slightly larger than 8MB) and looks ok, afaics.

@mattdowle
Copy link
Member Author

mattdowle commented Dec 13, 2019

Now extra tracing is merged, will revisit after next release to CRAN. See comments at the top of #4107. Hence bump to 1.12.11 milestone.

@mattdowle mattdowle modified the milestones: 1.12.9, 1.12.11 Dec 13, 2019
@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3, 1.13.5 Oct 17, 2020
@mattdowle
Copy link
Member Author

Latest Solaris output now showing on CRAN with the extra tracing to tackle.

using R version 4.0.3 Patched (2020-10-19 r79352)
using platform: i386-pc-solaris2.10 (32-bit)
using session charset: UTF-8
using option ‘--no-stop-on-test-error’
checking for file ‘data.table/DESCRIPTION’ ... OK
this is package ‘data.table’ version ‘1.13.2’
checking package namespace information ... OK
checking package dependencies ... OK
checking if this is a source package ... OK
checking if there is a namespace ... OK
checking for executable files ... OK
checking for hidden files and directories ... OK
checking for portable file names ... OK
checking for sufficient/correct file permissions ... OK
checking whether package ‘data.table’ can be installed ... OK
checking installed package size ... OK
checking package directory ... OK
checking ‘build’ directory ... OK
checking DESCRIPTION meta-information ... OK
checking top-level files ... OK
checking for left-over files ... OK
checking index information ... OK
checking package subdirectories ... OK
checking R files for non-ASCII characters ... OK
checking R files for syntax errors ... OK
checking whether the package can be loaded ... OK
checking whether the package can be loaded with stated dependencies ... OK
checking whether the package can be unloaded cleanly ... OK
checking whether the namespace can be loaded with stated dependencies ... OK
checking whether the namespace can be unloaded cleanly ... OK
checking loading without being on the library search path ... OK
checking use of S3 registration ... OK
checking dependencies in R code ... OK
checking S3 generic/method consistency ... OK
checking replacement functions ... OK
checking foreign function calls ... OK
checking R code for possible problems ... [39s/48s] OK
checking Rd files ... OK
checking Rd metadata ... OK
checking Rd cross-references ... OK
checking for missing documentation entries ... OK
checking for code/documentation mismatches ... OK
checking Rd \usage sections ... OK
checking Rd contents ... OK
checking for unstated dependencies in examples ... OK
checking line endings in shell scripts ... OK
checking line endings in C/C++/Fortran sources/headers ... OK
checking line endings in Makefiles ... OK
checking compilation flags in Makevars ... OK
checking for GNU extensions in Makefiles ... OK
checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS) ... OK
checking use of PKG_*FLAGS in Makefiles ... OK
checking include directives in Makefiles ... OK
checking compilation flags used ... OK
checking compiled code ... OK
checking installed files from ‘inst/doc’ ... OK
checking files in ‘vignettes’ ... OK
checking examples ... OK
checking for unstated dependencies in ‘tests’ ... OK
checking tests ... [112s/132s] ERROR
  Running ‘autoprint.R’
  Comparing ‘autoprint.Rout’ to ‘autoprint.Rout.save’ ... OK
  Running ‘froll.R’ [38s/44s]
  Running ‘knitr.R’
  Comparing ‘knitr.Rout’ to ‘knitr.Rout.save’ ... OK
  Running ‘main.R’ [64s/75s]
  Running ‘nafill.R’
  Running ‘other.R’
  Running ‘types.R’
Running the tests in ‘tests/main.R’ failed.
Complete output:
  > require(data.table)
  Loading required package: data.table
  >
  > test.data.table() # runs the main test suite of 5,000+ tests in /inst/tests/tests.Rraw
  getDTthreads(verbose=TRUE):
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    R_DATATABLE_THROTTLE unset (default 1024)
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads with throttle==1024. See ?setDTthreads.
  test.data.table() running: /home/ripley/R/Lib32/data.table/tests/tests.Rraw.bz2
  Test 1658.421 produced 1 errors but expected 0
  Expected:
  Observed: zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please include the full output above and below this message in your data.table bug report.
  Output captured before unexpected warning/error/message:
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    R_DATATABLE_THROTTLE unset (default 1024)
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads with throttle==1024. See ?setDTthreads.
  No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
  Column writers: 3 3
  args.doRowNames=0 args.rowNames=0 doQuote=-128 args.nrow=200 args.ncol=2 eolLen=1
  maxLineLen=51. Found in 0.000s
  Writing bom (false), yaml (0 characters) and column names (true) ... z_stream for header (1): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90 9d 47 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
  deflate input stream: ed20768 39 e11b548 4
  deflate returned 1 with stream->total_out==24; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1
  z_stream for header (2): 4c b5 11 0e 00 00 00 00 04 00 00 00 80 07 d2 0e 0f 00 00 00 18 00 00 00 00 00 00 00 90 9d 47 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 01 00 00 00 c5 10 97 24 00 00 00 00
  done in 0.000s
  Writing 200 rows in 1 batches of 200 rows (each buffer size 8MB, showProgress=0, nth=1)
  z_stream for data (1): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 98 9d c7 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
  z_stream for data (2): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 98 9d c7 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
  deflate input stream: f57a9b0 8391193 e479d90 800
  deflate returned -2 with stream->total_out==0; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1
  z_stream for data (3): 90 9d 47 0e 20 03 00 00 00 00 00 00 b0 a9 57 0f 19 0a 80 00 00 00 00 00 00 00 00 00 98 9d c7 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
  <simpleError in fwrite(DT, file = f1 <- tempfile(fileext = ".gz"), verbose = TRUE): zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please include the full output above and below this message in your data.table bug report.>
  Test 1658.43 ran without errors but failed check that x equals y:
  > x = fread(f1)
  Empty data.table (0 rows and 2 cols): a,b
   [Key= Types=log,log Classes=log,log]
  > y = DT
       a b [Key= Types=int,int Classes=int,int]
    1: 1 1
    2: 1 1
   ---
  199: 2 4
  200: 2 4
  Different number of rows
  Error in fwrite(DT, file = f3 <- tempfile(), compress = "gzip") :
    zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please retry fwrite() with verbose=TRUE and include the full output with your data.table bug report.
  
  Wed Oct 21 09:41:03 2020 endian==little, sizeof(long double)==12, longdouble.digits==64, sizeof(pointer)==4, TZ=='GB', Sys.timezone()=='GB', Sys.getlocale()=='/en_GB.UTF-8/C/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE', getDTthreads()=='omp_get_num_procs()==16; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2; omp_get_max_threads()==2; OMP_THREAD_LIMIT==2; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 2 threads with throttle==1024. See ?setDTthreads.'
  Error in test.data.table() :
    Failed after test 1658.43 before the next test() call in /home/ripley/R/Lib32/data.table/tests/tests.Rraw.bz2
  Execution halted
checking for unstated dependencies in vignettes ... OK
checking package vignettes in ‘inst/doc’ ... OK
checking re-building of vignette outputs ... [140s/156s] OK
checking PDF version of manual ... OK
checking for non-standard things in the check directory ... OK
checking for detritus in the temp directory ... OK
DONE
Status: 1 ERROR

@madler
Copy link

madler commented Oct 26, 2020

deflate returning -2 indicates that the z_stream structure is invalid. This is usually due to either not initializing it in the first place, pointing to where there isn't a z_stream structure, or the z_stream structure getting trampled on in memory somehow. Since this involves multiple threads, I'm guessing the last of those. I can't tell much else from that output.

@mattdowle
Copy link
Member Author

mattdowle commented Dec 1, 2020

I ran v1.13.2 on the excellent R-hub package builder, https://builder.r-hub.io/. Both of the following :

  • Oracle Solaris 10, x86, 32 bit, R-release log
  • Oracle Solaris 10, x86, 32 bit, R-release, Oracle Developer Studio 12.6 log

But surprisingly, and frustratingly, both pass fine. CRAN's Solaris could be different to R-hub's Solaris in some way. It could also be a fault that could happen on Windows or Linux too, but is just showing up on CRAN's Solaris. It could be 32bit related, but CI tests 32bit Windows regularly and we haven't seen a problem there. We've already established that the library and headers match on zlib 1.2.11 which is the latest version.

Looking at the source code of zlib 1.2.11, there are some #ifdef GZIP in zlib's internals, in particular in deflate.c where the stream validity is checked by deflateStateCheck :

local int deflateStateCheck (strm)
    z_streamp strm;
{
    deflate_state *s;
    if (strm == Z_NULL ||
        strm->zalloc == (alloc_func)0 || strm->zfree == (free_func)0)
        return 1;
    s = strm->state;
    if (s == Z_NULL || s->strm != strm || (s->status != INIT_STATE &&
#ifdef GZIP
                                           s->status != GZIP_STATE &&
#endif
                                           s->status != EXTRA_STATE &&
                                           s->status != NAME_STATE &&
    ...

We do request GZIP in fwrite.c:init_stream by passing 31 to deflateInit2. But let's say the zlib library on CRAN's Solaris box has not been compiled with GZIP support, by setting the provided NO_GZIP macro at compile time (see zlib's manual). Then GZIP would not be defined, and GZIP_STATE (which I see locally is the state at the point that fails on CRAN's Solaris) would not be considered a valid state. deflateStaeCheck() would then return 1, and deflate() would then return Z_STREAM_ERROR (-2) as we see. However, I've tentatively ruled this theory out because deflateInit2() would have already failed inside it: 31 would not be reduced by 16 in deflate.c:295, and then line 301 would return Z_STREAM_ERROR due to windowBits>15. But that doesn't happen because we see from the log that the init worked fine to write the header, and the init didn't fail in the data writing region. That indicates that CRAN's Solaris zlib is compiled with GZIP enabled.

We should be able to rule out multi-threaded because the data size is small and we see from the log that only one thread is being used by design which is correct. Not ruling anything out though.

I've traced the internal zlib structure locally and will extend the output further. The output in the log already contains the full z_stream contents, but that contains a pointer we need to follow to get to the state and other conditions that cause Z_STREAM_ERROR. These are not exposed by zlib but we can get to them for the purposes of tracing this fault.

sizeof(z_stream) is 112 on 64bit, and 56 on Solaris 32bit, structured as follows up to `*state`
                                     <= *next_in          => <= avail_in          => <= total_in          => <= *next_out         => <= avail_out         => <= total_out         => <= *msg              => <= *state            =>
local       > z_stream for data (3): e0 ee 4b 6b 4b 56 00 00 00 00 00 00 00 00 00 00 20 03 00 00 00 00 00 00 02 ec cb 6b 4b 56 00 00 e7 09 80 00 00 00 00 00 32 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30 99 0f 68 4b 56 00 00 90 90 b3 e8 2b 7f 00 00 a0 90 b3 e8 2b 7f 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 f8 f0 0e c1 00 00 00 00 00 00 00 00 00 00 00 00 
CRAN Solaris> z_stream for data (3): 90 9d 47 0e 20 03 00 00 00 00 00 00 b0 a9 57 0f 19 0a 80 00 00 00 00 00 00 00 00 00 98 9d c7 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
                                     < next_in > <avail_in > <total_in > <*next_out> <avail_out> <total_out> <*msg     > <*state   >

@mattdowle
Copy link
Member Author

mattdowle commented Dec 8, 2020

v1.13.4 is on CRAN today and we have the new output.
Pasting it here now in case it disappears from CRAN for any unlikely reason.
The extra tracing of zlib's internals seems to be working and I see the new output.
After a quick glance ...
I see zlib status is 57 which is GZIP_STATE. That's normal then and rules out some unusual status code, unfortunately.
zlib's internal deflate.c:deflateStateCheck function seems to be failing on the s->strm != strm condition of line 361.
The new output shows ...
strm==8029760 state->strm==8029990 ... (s->strm==strm)==0
These two values should match. I see (s->strm==strm)==1 locally and on all other Windows, Mac, Linux machines, both 32bit and 64bit, and even on R-hub's two Solaris instances. It is just CRAN's particular Solaris! That's still not to say it might not really be to do with CRAN's Solaris (random corruptions can occur consistently on one machine), but it sure makes it difficult that's the only machine we can reproduce it on.
First thought is to look further up the output to see if those pointers ever matched, and if so, all I can think is that Mark Adler was right (his comment higher up in this issue) and it's a corruption due to a fault in our code fwrite.c. It's odd that the two addresses are not far apart, but the corruption could be just in the minor bytes of the pointer address. Or, it could be some kind of strange write problem of pointer addresses on that Solaris machine; e.g. a write and read back not matching. We see that in faulty multi-threaded code, but in this case it's not multi-threaded. It's single threaded because the data size is small resulting in num_threads(1) being passed to the OpenMP region.
If we are into needing a hardware watchpoint to find it, we could prepare a well-tested script we could ask Prof Brian Ripley to run; i.e. we could create a deliberate corruption, make sure the hardware watchpoint finds it, before sending the script to him.
Just talking out loud.

> test.data.table() # runs the main test suite of 5,000+ tests in /inst/tests/tests.Rraw
  getDTthreads(verbose=TRUE):
    OpenMP version (_OPENMP) 201307
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    R_DATATABLE_THROTTLE unset (default 1024)
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads with throttle==1024. See ?setDTthreads.
  test.data.table() running: /home/ripley/R/Lib32/data.table/tests/tests.Rraw.bz2
  Test 1658.421 produced 1 errors but expected 0
  Expected:
  Observed: zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please include the full output above and below this message in your data.table bug report.
  Output captured before unexpected warning/error/message:
    OpenMP version (_OPENMP) 201307
    omp_get_num_procs() 16
    R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
    R_DATATABLE_NUM_THREADS unset
    R_DATATABLE_THROTTLE unset (default 1024)
    omp_get_thread_limit() 2
    omp_get_max_threads() 2
    OMP_THREAD_LIMIT 2
    OMP_NUM_THREADS unset
    RestoreAfterFork true
    data.table is using 2 threads with throttle==1024. See ?setDTthreads.
  No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
  Column writers: 3 3
  args.doRowNames=0 args.rowNames=0 doQuote=-128 args.nrow=200 args.ncol=2 eolLen=1
  maxLineLen=51. Found in 0.000s
  Writing bom (false), yaml (0 characters) and column names (true) ... z_stream for header (1): sizeof(z_stream)==56: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90 d1 59 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 30 9f 02 08 39 00 00 00 strm==8029f30 state->strm==8029f30 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==1 s->next_out==0 s->avail_in=0 s->next_in=0 deflates()'s checks (excluding status) would return -2 here
  deflate input stream: ce36e90 39 d65a8f8 4 z_stream: sizeof(z_stream)==56: f8 a8 65 0d 04 00 00 00 00 00 00 00 90 6e e3 0c 27 00 00 00 00 00 00 00 00 00 00 00 90 d1 59 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 30 9f 02 08 39 00 00 00 strm==8029f30 state->strm==8029f30 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==1 s->next_out==ce36e90 s->avail_in=4 s->next_in=d65a8f8 deflates()'s checks (excluding status) would be ok here
  deflate returned 1 with stream->total_out==24; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1 z_stream: sizeof(z_stream)==56: fc a8 65 0d 00 00 00 00 04 00 00 00 a8 6e e3 0c 0f 00 00 00 18 00 00 00 00 00 00 00 90 d1 59 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 01 00 00 00 c5 10 97 24 00 00 00 00 state: 30 9f 02 08 9a 02 00 00 strm==8029f30 state->strm==8029f30 state->status==666 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==1 s->next_out==ce36ea8 s->avail_in=0 s->next_in=d65a8fc deflates()'s checks (excluding status) would be ok here
  z_stream for header (2): sizeof(z_stream)==56: fc a8 65 0d 00 00 00 00 04 00 00 00 a8 6e e3 0c 0f 00 00 00 18 00 00 00 00 00 00 00 90 d1 59 0e 10 0a 95 fe 40 0a 95 fe 00 00 00 00 01 00 00 00 c5 10 97 24 00 00 00 00 state: 30 9f 02 08 9a 02 00 00 strm==8029f30 state->strm==8029f30 state->status==666 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==1 s->next_out==ce36ea8 s->avail_in=0 s->next_in=d65a8fc deflates()'s checks (excluding status) would be ok here
  done in 0.001s
  Writing 200 rows in 1 batches of 200 rows (each buffer size 8MB, showProgress=0, nth=1)
  zbuffSize=8391193 returned from deflateBound
  z_stream for data (1): sizeof(z_stream)==56: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 68 55 d7 0f 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 90 99 02 08 39 00 00 00 strm==8029990 state->strm==8029990 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==1 s->next_out==0 s->avail_in=0 s->next_in=0 deflates()'s checks (excluding status) would return -2 here
  z_stream for data (2): sizeof(z_stream)==56: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 68 55 d7 0f 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 90 99 02 08 39 00 00 00 strm==8029760 state->strm==8029990 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==0 s->next_out==0 s->avail_in=0 s->next_in=0 deflates()'s checks (excluding status) would return -2 here
  deflate input stream: 10611670 8391193 f575560 800 z_stream: sizeof(z_stream)==56: 60 55 57 0f 20 03 00 00 00 00 00 00 70 16 61 10 19 0a 80 00 00 00 00 00 00 00 00 00 68 55 d7 0f 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 90 99 02 08 39 00 00 00 strm==8029760 state->strm==8029990 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==0 s->next_out==10611670 s->avail_in=800 s->next_in=f575560 deflates()'s checks (excluding status) would return -2 here
  deflate returned -2 with stream->total_out==0; Z_FINISH==4, Z_OK==0, Z_STREAM_END==1 z_stream: sizeof(z_stream)==56: 60 55 57 0f 20 03 00 00 00 00 00 00 70 16 61 10 19 0a 80 00 00 00 00 00 00 00 00 00 68 55 d7 0f 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 90 99 02 08 39 00 00 00 strm==8029760 state->strm==8029990 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==0 s->next_out==10611670 s->avail_in=800 s->next_in=f575560 deflates()'s checks (excluding status) would return -2 here
  z_stream for data (3): sizeof(z_stream)==56: 60 55 57 0f 20 03 00 00 00 00 00 00 70 16 61 10 19 0a 80 00 00 00 00 00 00 00 00 00 68 55 d7 0f 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 90 99 02 08 39 00 00 00 strm==8029760 state->strm==8029990 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==0 s->next_out==10611670 s->avail_in=800 s->next_in=f575560 deflates()'s checks (excluding status) would return -2 here
  <simpleError in fwrite(DT, file = f1 <- tempfile(fileext = ".gz"), verbose = TRUE): zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please include the full output above and below this message in your data.table bug report.>
  Test 1658.43 ran without errors but failed check that x equals y:
  > x = fread(f1)
  Empty data.table (0 rows and 2 cols): a,b
   [Key= Types=log,log Classes=log,log]
  > y = DT
       a b [Key= Types=int,int Classes=int,int]
    1: 1 1
    2: 1 1
   ---
  199: 2 4
  200: 2 4
  Different number of rows
  Error in fwrite(DT, file = f3 <- tempfile(), compress = "gzip") :
    zlib 1.2.11 (zlib.h 1.2.11) deflate() returned error -2 with z_stream->msg=="" Z_FINISH=4 Z_BLOCK=5. Please retry fwrite() with verbose=TRUE and include the full output with your data.table bug report.
  
  Tue Dec 8 15:05:48 2020 endian==little, sizeof(long double)==12, longdouble.digits==64, sizeof(pointer)==4, TZ=='GB', Sys.timezone()=='GB', Sys.getlocale()=='/en_GB.UTF-8/C/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE', getDTthreads()=='OpenMP version (_OPENMP)==201307; omp_get_num_procs()==16; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2; omp_get_max_threads()==2; OMP_THREAD_LIMIT==2; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 2 threads with throttle==1024. See ?setDTthreads.', zlibVersion()==1.2.11 ZLIB_VERSION==1.2.11
  Error in test.data.table() :
    Failed after test 1658.43 before the next test() call in /home/ripley/R/Lib32/data.table/tests/tests.Rraw.bz2
  Execution halted

@mattdowle
Copy link
Member Author

mattdowle commented Dec 8, 2020

Ok so the strm address changes somewhere between z_stream for data (1) and z_stream for data (2) ...

z_stream for data (1): sizeof(z_stream)==56: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 68 55 d7 0f 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 90 99 02 08 39 00 00 00 strm==8029990 state->strm==8029990 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==1 s->next_out==0 s->avail_in=0 s->next_in=0 deflates()'s checks (excluding status) would return -2 here

z_stream for data (2): sizeof(z_stream)==56: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 68 55 d7 0f 10 0a 95 fe 40 0a 95 fe 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 state: 90 99 02 08 39 00 00 00 strm==8029760 state->strm==8029990 state->status==57 zalloc==fe950a10 zfree==fe950a40 (s->strm==strm)==0 s->next_out==0 s->avail_in=0 s->next_in=0 deflates()'s checks (excluding status) would return -2 here

The first deflates()'s checks (excluding status) would return -2 here is a red herring as that's normal to be the case at that point. It's the parts in bold that are the crux of it. Seems like strm is being copied / moved / realloc'd ...

@mattdowle
Copy link
Member Author

mattdowle commented Dec 8, 2020

_OPENMP == 201307 so that corresponds to OpenMP 4.0. Should be fine.

On first glance it seems the address of mystream is being changed when entering the parallel for here https://github.com/Rdatatable/data.table/blob/master/src/fwrite.c#L880
or a thread seeing another thread's stack (something like that).

However, mystream is local to that parallel region, declared on line 870. Each thread should see its own stack, and I don't see anything wrong with that. The region is single threaded anyway (!) in this small data example, as confirm by nth=1 in the output, and so num_threads(1) has been passed to the region on line 861. And we wouldn't be seeing any verbose output anyway if nth>1 because the output itself is not thread-safe, so the verbose is only on when nth==1.

@mattdowle
Copy link
Member Author

mattdowle commented Dec 8, 2020

There are no zlib calls between these two trace points: lines 877 and 909, so it's not zlib.
Since state->strm==8029990 does not change, I don't think it's a corruption by our code between those points. It is complicated code but it is well tested, has full coverage, and has been through ASAN on multiple versions of very latest compilers for several years.
I think mystream is being moved in memory by this OpenMP implementation when it starts the parallel for. If that is what's happening, that could happen even when single threaded.

An extra trace output on line 882 would confirm that.

Let's do one thing at a time, and add that trace point and resubmit. We have to release again anyway to get out of the way of the breaking change to the breaking change to all.equal in recent days in R-devel.
Maybe attempting a fix at the same time is worth doing, on the basis that my guess is right. That's manageable and would have a good chance of passing on the next submission. Otherwise we keep going.

@mattdowle mattdowle removed this from the 1.13.7 milestone Dec 8, 2020
@mattdowle
Copy link
Member Author

1.13.6 was accepted on CRAN today and we're now passing on its Solaris. This confirms that #4845 did fix it.

@MichaelChirico
Copy link
Member

awesome news! thanks for all that effort debugging!

alexchwong added a commit to alexchwong/ompBAM that referenced this issue Sep 11, 2021
NxtIRFcore is failing tests giving z_stream_error on inflate
Perhaps this is similar to this problem: Rdatatable/data.table#4099

Try creating z_stream vector on stack, outside OpenMP loop
alexchwong added a commit to alexchwong/ompBAM that referenced this issue Sep 16, 2021
NxtIRFcore is failing tests giving z_stream_error on inflate
Perhaps this is similar to this problem: Rdatatable/data.table#4099

Try creating z_stream vector on stack, outside OpenMP loop


Former-commit-id: 2226ecf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants