Sporadic segfaults with GCC on Ubuntu Linux #110

Closed
FrancescAlted opened this issue Apr 4, 2016 · 1 comment

Comments

@FrancescAlted
Member

When using GCC (tested with 4.9.3 and 5.2.1) on an Ubuntu 15.10 box, one gets sporadic but reproducible segfaults when running the test suite enough times:

$ for i in {1..10}; do nosetests --with-doctest blosc; done
........................
----------------------------------------------------------------------
Ran 24 tests in 5.054s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.368s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.122s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.184s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.123s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.753s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.343s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.133s

OK
........................
----------------------------------------------------------------------
Ran 24 tests in 5.487s

OK
Segmentation fault (core dumped)

I cannot get any segfault when using clang (tested with 3.6 and 3.7). Testing on a Mac OSX box does not show any problem either (this is expected, because Xcode ships clang/LLVM).

A detailed investigation using valgrind does not reveal anything obvious, except things like:

test_no_leaks (blosc.test.TestCodec) ... ==5330== Invalid read of size 4
==5330==    at 0x4ECEF73: PyObject_Free (obmalloc.c:1013)
==5330==    by 0x4EE2A72: tupledealloc (tupleobject.c:235)
==5330==    by 0x4F327C6: ext_do_call (ceval.c:4665)
==5330==    by 0x4F327C6: PyEval_EvalFrameEx (ceval.c:3026)
==5330==    by 0x4F35A2D: PyEval_EvalCodeEx (ceval.c:3582)
==5330==    by 0x4F34A54: fast_function (ceval.c:4446)
==5330==    by 0x4F34A54: call_function (ceval.c:4371)
==5330==    by 0x4F34A54: PyEval_EvalFrameEx (ceval.c:2987)
==5330==    by 0x4F35A2D: PyEval_EvalCodeEx (ceval.c:3582)
==5330==    by 0x4EB14A7: function_call (funcobject.c:526)
==5330==    by 0x4E81D22: PyObject_Call (abstract.c:2546)
==5330==    by 0x4F32796: ext_do_call (ceval.c:4663)
==5330==    by 0x4F32796: PyEval_EvalFrameEx (ceval.c:3026)
==5330==    by 0x4F35A2D: PyEval_EvalCodeEx (ceval.c:3582)
==5330==    by 0x4EB13A0: function_call (funcobject.c:526)
==5330==    by 0x4E81D22: PyObject_Call (abstract.c:2546)
==5330==  Address 0x428b9020 is 32 bytes before a block of size 80,002,976 in arena "client"

so perhaps there is a problem with reference counting, but I am not sure whether this is a red herring.

Anyway, as GCC is a very important compiler this ticket has high priority.

@FrancescAlted
Member Author

I ended up making a minimal example that crashes:

from __future__ import print_function
import numpy
import blosc

print("Blosc version info:", blosc.blosclib_version)
# Setting the number of threads to 3 makes the segfaults occur more often
blosc.set_nthreads(3)

a = numpy.arange(1e6)
parray = blosc.compress(a, clevel=9, shuffle=blosc.SHUFFLE, cname="blosclz")
ratio = len(a) * a.itemsize * 1. / len(parray)
print("Compression: %s -> %s (%4.1fx)" % (
    len(a) * a.itemsize, len(parray), ratio))

With that, it is quite easy to make the python-blosc wrapper crash:

$ time for i in {1..100}; do PYTHONPATH=. python segfault.py>p ; done
Segmentation fault (core dumped)

real    0m9.803s
user    0m8.416s
sys     0m1.380s
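
To put a number on how often the crash happens, the shell loop above can be mirrored by a small harness. This is only a sketch: the helper name `count_segfaults` is made up for illustration, it assumes a POSIX system (where a child killed by a signal reports a negative return code), and it uses `subprocess.run`, which requires Python 3.5 or later.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs=100):
    """Run `cmd` `runs` times and count the runs that were killed by SIGSEGV.

    Hypothetical helper, not part of python-blosc.
    """
    crashes = 0
    for _ in range(runs):
        # Discard the child's stdout, like the `>p` redirection in the shell loop.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL)
        # On POSIX, a return code of -N means the child died from signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

For the reproducer above, something like `count_segfaults([sys.executable, "segfault.py"], runs=100)` with `PYTHONPATH=.` set in the environment would report how many of the 100 runs ended in a segmentation fault.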

Then, during my investigations I found this:

  1. The crashes only happen when you combine Python + GCC + a high compiler optimization level (-O2 or higher) + threading. I have verified this on both Ubuntu 15.10 and Gentoo 2.2.

  2. The crashes do not happen when you replace GCC with Clang, or you don't use multi-threading, or you use a low optimization level (-O1 or less).

  3. The main C-Blosc library does not seem to be affected by this. See this equivalent example in pure C.

  4. When compiling python-blosc against an external C-Blosc library with this:

 $ python setup.py build_ext --inplace --blosc=/my_c-blosc_lib_path

everything is fine, even in the case 1) above.

So, that's a funny situation, and after thinking about this for a good amount of time, I propose to approach this issue as follows:

  1. In case the C-Blosc library is not found, print a visible warning saying that, for maximum performance, the user should install the C-Blosc library separately.

  2. In case the vendored library is to be compiled inside the extension, force the use of -O1 in setup.py for Linux platforms (Mac OSX is not affected that much because CLANG/LLVM is probably used there, and Windows/MSVC is definitely not an issue here).

  3. Add information about this issue early in the README file. If people are using Blosc, it is probably for speed reasons, so making this as apparent as possible seems reasonable.
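
Points 1 and 2 of this proposal could look roughly like the sketch below. The function `vendored_build_flags` and its arguments are hypothetical, not the actual python-blosc setup.py code. It assumes the returned list is passed as `extra_compile_args` to the extension, and that those flags end up after the default ones on the compiler command line, so that GCC's rule "the last -O option wins" makes -O1 effective.

```python
import sys
import warnings


def vendored_build_flags(platform=sys.platform, external_blosc=None):
    """Return extra compiler flags for the extension build (hypothetical helper).

    `external_blosc` stands for the path given via --blosc=..., or None.
    """
    if external_blosc:
        # Linking against an external C-Blosc: the vendored sources are not
        # compiled, so no optimization cap is needed.
        return []
    # Proposal point 1: make the missing external library visible to the user.
    warnings.warn(
        "C-Blosc library not found; building the vendored copy. "
        "For maximum performance, install C-Blosc separately.",
        UserWarning,
    )
    if platform.startswith("linux"):
        # Proposal point 2: cap the optimization level at -O1 on Linux.
        return ["-O1"]
    # Other platforms keep their default flags.
    return []
```

Keeping the decision in one small function makes it easy to drop the workaround later if the root cause of the crash is found.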

Addendum: here is what you can expect from using python-blosc with an external C-Blosc library:

$ PYTHONPATH=. python bench/compress_ptr.py 
Creating NumPy arrays with 10**8 int64/float64 elements:
  *** ctypes.memmove() *** Time for memcpy():   0.295 s (2.53 GB/s)

Times for compressing/decompressing with clevel=5 and 8 threads

*** the arange linear distribution ***
  *** blosclz , noshuffle  ***  0.455 s (1.64 GB/s) / 0.087 s (8.58 GB/s)       Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.108 s (6.93 GB/s) / 0.075 s (10.00 GB/s)      Compr. ratio:  57.1x
  *** blosclz , bitshuffle ***  0.120 s (6.19 GB/s) / 0.107 s (6.97 GB/s)       Compr. ratio:  74.0x
  *** lz4     , noshuffle  ***  0.342 s (2.18 GB/s) / 0.212 s (3.52 GB/s)       Compr. ratio:   2.0x
  *** lz4     , shuffle    ***  0.078 s (9.54 GB/s) / 0.093 s (8.02 GB/s)       Compr. ratio:  58.6x
  *** lz4     , bitshuffle ***  0.116 s (6.41 GB/s) / 0.135 s (5.53 GB/s)       Compr. ratio:  52.5x
  *** lz4hc   , noshuffle  ***  8.142 s (0.09 GB/s) / 0.212 s (3.52 GB/s)       Compr. ratio:   2.0x
  *** lz4hc   , shuffle    ***  0.140 s (5.33 GB/s) / 0.092 s (8.06 GB/s)       Compr. ratio: 137.2x
  *** lz4hc   , bitshuffle ***  1.572 s (0.47 GB/s) / 0.142 s (5.25 GB/s)       Compr. ratio: 208.9x
  *** snappy  , noshuffle  ***  0.381 s (1.95 GB/s) / 0.244 s (3.06 GB/s)       Compr. ratio:   2.0x
  *** snappy  , shuffle    ***  0.073 s (10.25 GB/s) / 0.136 s (5.48 GB/s)      Compr. ratio:  17.4x
  *** snappy  , bitshuffle ***  0.126 s (5.92 GB/s) / 0.177 s (4.22 GB/s)       Compr. ratio:  18.2x
  *** zlib    , noshuffle  ***  5.298 s (0.14 GB/s) / 0.401 s (1.86 GB/s)       Compr. ratio:   5.3x
  *** zlib    , shuffle    ***  0.974 s (0.76 GB/s) / 0.393 s (1.90 GB/s)       Compr. ratio: 237.3x
  *** zlib    , bitshuffle ***  1.026 s (0.73 GB/s) / 0.444 s (1.68 GB/s)       Compr. ratio: 305.4x

*** the linspace linear distribution ***
  *** blosclz , noshuffle  ***  0.434 s (1.72 GB/s) / 0.088 s (8.45 GB/s)       Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.298 s (2.50 GB/s) / 0.090 s (8.32 GB/s)       Compr. ratio:   2.0x
  *** blosclz , bitshuffle ***  0.476 s (1.56 GB/s) / 0.166 s (4.50 GB/s)       Compr. ratio:   2.8x
  *** lz4     , noshuffle  ***  0.219 s (3.41 GB/s) / 0.088 s (8.45 GB/s)       Compr. ratio:   1.0x
  *** lz4     , shuffle    ***  0.190 s (3.92 GB/s) / 0.112 s (6.63 GB/s)       Compr. ratio:   3.2x
  *** lz4     , bitshuffle ***  0.248 s (3.00 GB/s) / 0.149 s (5.00 GB/s)       Compr. ratio:   4.9x
  *** lz4hc   , noshuffle  ***  2.797 s (0.27 GB/s) / 0.211 s (3.53 GB/s)       Compr. ratio:   1.2x
  *** lz4hc   , shuffle    ***  0.528 s (1.41 GB/s) / 0.085 s (8.78 GB/s)       Compr. ratio:  24.1x
  *** lz4hc   , bitshuffle ***  2.918 s (0.26 GB/s) / 0.131 s (5.71 GB/s)       Compr. ratio:  35.0x
  *** snappy  , noshuffle  ***  0.088 s (8.49 GB/s) / 0.087 s (8.61 GB/s)       Compr. ratio:   1.0x
  *** snappy  , shuffle    ***  0.235 s (3.16 GB/s) / 0.176 s (4.24 GB/s)       Compr. ratio:   4.2x
  *** snappy  , bitshuffle ***  0.317 s (2.35 GB/s) / 0.198 s (3.76 GB/s)       Compr. ratio:   6.1x
  *** zlib    , noshuffle  ***  6.569 s (0.11 GB/s) / 0.718 s (1.04 GB/s)       Compr. ratio:   1.6x
  *** zlib    , shuffle    ***  1.313 s (0.57 GB/s) / 0.339 s (2.20 GB/s)       Compr. ratio:  27.0x
  *** zlib    , bitshuffle ***  1.348 s (0.55 GB/s) / 0.380 s (1.96 GB/s)       Compr. ratio:  35.2x

*** the random distribution ***
  *** blosclz , noshuffle  ***  0.517 s (1.44 GB/s) / 0.087 s (8.60 GB/s)       Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.212 s (3.52 GB/s) / 0.070 s (10.62 GB/s)      Compr. ratio:   3.9x
  *** blosclz , bitshuffle ***  0.181 s (4.13 GB/s) / 0.104 s (7.16 GB/s)       Compr. ratio:   6.1x
  *** lz4     , noshuffle  ***  0.373 s (2.00 GB/s) / 0.149 s (5.00 GB/s)       Compr. ratio:   2.1x
  *** lz4     , shuffle    ***  0.135 s (5.52 GB/s) / 0.101 s (7.36 GB/s)       Compr. ratio:   4.5x
  *** lz4     , bitshuffle ***  0.129 s (5.77 GB/s) / 0.138 s (5.39 GB/s)       Compr. ratio:   6.1x
  *** lz4hc   , noshuffle  ***  4.684 s (0.16 GB/s) / 0.101 s (7.36 GB/s)       Compr. ratio:   3.2x
  *** lz4hc   , shuffle    ***  3.223 s (0.23 GB/s) / 0.101 s (7.37 GB/s)       Compr. ratio:   5.4x
  *** lz4hc   , bitshuffle ***  0.429 s (1.74 GB/s) / 0.139 s (5.36 GB/s)       Compr. ratio:   6.2x
  *** snappy  , noshuffle  ***  0.461 s (1.62 GB/s) / 0.257 s (2.90 GB/s)       Compr. ratio:   2.2x
  *** snappy  , shuffle    ***  0.166 s (4.49 GB/s) / 0.160 s (4.66 GB/s)       Compr. ratio:   4.3x
  *** snappy  , bitshuffle ***  0.136 s (5.48 GB/s) / 0.167 s (4.45 GB/s)       Compr. ratio:   5.0x
  *** zlib    , noshuffle  ***  5.383 s (0.14 GB/s) / 0.499 s (1.49 GB/s)       Compr. ratio:   3.9x
  *** zlib    , shuffle    ***  2.903 s (0.26 GB/s) / 0.408 s (1.83 GB/s)       Compr. ratio:   6.1x
  *** zlib    , bitshuffle ***  1.403 s (0.53 GB/s) / 0.433 s (1.72 GB/s)       Compr. ratio:   6.3x

The above also has the advantage that the C-Blosc CMake infrastructure can detect the compiler's AVX2 support much more easily. Anyway, here is the output with the python-blosc extension compiled with the -O1 flag:

$ PYTHONPATH=. python bench/compress_ptr.py 
Creating NumPy arrays with 10**8 int64/float64 elements:
  *** ctypes.memmove() *** Time for memcpy():   0.295 s (2.52 GB/s)

Times for compressing/decompressing with clevel=5 and 8 threads

*** the arange linear distribution ***
  *** blosclz , noshuffle  ***  0.517 s (1.44 GB/s) / 0.086 s (8.67 GB/s)       Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.115 s (6.50 GB/s) / 0.083 s (8.95 GB/s)       Compr. ratio:  57.1x
  *** blosclz , bitshuffle ***  0.167 s (4.46 GB/s) / 0.172 s (4.33 GB/s)       Compr. ratio:  74.0x
  *** lz4     , noshuffle  ***  0.917 s (0.81 GB/s) / 0.246 s (3.03 GB/s)       Compr. ratio:   2.0x
  *** lz4     , shuffle    ***  0.109 s (6.84 GB/s) / 0.145 s (5.12 GB/s)       Compr. ratio:  58.6x
  *** lz4     , bitshuffle ***  0.198 s (3.77 GB/s) / 0.267 s (2.79 GB/s)       Compr. ratio:  52.5x
  *** lz4hc   , noshuffle  ***  8.224 s (0.09 GB/s) / 0.245 s (3.04 GB/s)       Compr. ratio:   2.0x
  *** lz4hc   , shuffle    ***  0.193 s (3.86 GB/s) / 0.144 s (5.19 GB/s)       Compr. ratio: 137.2x
  *** lz4hc   , bitshuffle ***  1.800 s (0.41 GB/s) / 0.206 s (3.62 GB/s)       Compr. ratio: 208.9x
  *** snappy  , noshuffle  ***  0.404 s (1.84 GB/s) / 0.251 s (2.97 GB/s)       Compr. ratio:   2.0x
  *** snappy  , shuffle    ***  0.110 s (6.78 GB/s) / 0.196 s (3.80 GB/s)       Compr. ratio:  17.4x
  *** snappy  , bitshuffle ***  0.191 s (3.90 GB/s) / 0.306 s (2.43 GB/s)       Compr. ratio:  18.2x
  *** zlib    , noshuffle  ***  5.167 s (0.14 GB/s) / 0.410 s (1.82 GB/s)       Compr. ratio:   5.3x
  *** zlib    , shuffle    ***  1.046 s (0.71 GB/s) / 0.523 s (1.42 GB/s)       Compr. ratio: 237.3x
  *** zlib    , bitshuffle ***  1.338 s (0.56 GB/s) / 0.721 s (1.03 GB/s)       Compr. ratio: 305.4x

*** the linspace linear distribution ***
  *** blosclz , noshuffle  ***  0.540 s (1.38 GB/s) / 0.088 s (8.44 GB/s)       Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.324 s (2.30 GB/s) / 0.103 s (7.25 GB/s)       Compr. ratio:   2.0x
  *** blosclz , bitshuffle ***  0.533 s (1.40 GB/s) / 0.234 s (3.19 GB/s)       Compr. ratio:   2.8x
  *** lz4     , noshuffle  ***  0.359 s (2.08 GB/s) / 0.088 s (8.43 GB/s)       Compr. ratio:   1.0x
  *** lz4     , shuffle    ***  0.351 s (2.13 GB/s) / 0.142 s (5.26 GB/s)       Compr. ratio:   3.2x
  *** lz4     , bitshuffle ***  0.396 s (1.88 GB/s) / 0.221 s (3.37 GB/s)       Compr. ratio:   4.9x
  *** lz4hc   , noshuffle  ***  3.223 s (0.23 GB/s) / 0.239 s (3.12 GB/s)       Compr. ratio:   1.2x
  *** lz4hc   , shuffle    ***  0.572 s (1.30 GB/s) / 0.104 s (7.18 GB/s)       Compr. ratio:  24.1x
  *** lz4hc   , bitshuffle ***  2.920 s (0.26 GB/s) / 0.203 s (3.67 GB/s)       Compr. ratio:  35.0x
  *** snappy  , noshuffle  ***  0.088 s (8.51 GB/s) / 0.088 s (8.49 GB/s)       Compr. ratio:   1.0x
  *** snappy  , shuffle    ***  0.262 s (2.85 GB/s) / 0.190 s (3.92 GB/s)       Compr. ratio:   4.2x
  *** snappy  , bitshuffle ***  0.418 s (1.78 GB/s) / 0.256 s (2.91 GB/s)       Compr. ratio:   6.1x
  *** zlib    , noshuffle  ***  6.463 s (0.12 GB/s) / 0.753 s (0.99 GB/s)       Compr. ratio:   1.6x
  *** zlib    , shuffle    ***  1.431 s (0.52 GB/s) / 0.351 s (2.12 GB/s)       Compr. ratio:  27.0x
  *** zlib    , bitshuffle ***  1.433 s (0.52 GB/s) / 0.451 s (1.65 GB/s)       Compr. ratio:  35.2x

*** the random distribution ***
  *** blosclz , noshuffle  ***  0.538 s (1.38 GB/s) / 0.088 s (8.47 GB/s)       Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.232 s (3.21 GB/s) / 0.081 s (9.18 GB/s)       Compr. ratio:   3.9x
  *** blosclz , bitshuffle ***  0.221 s (3.37 GB/s) / 0.160 s (4.67 GB/s)       Compr. ratio:   6.1x
  *** lz4     , noshuffle  ***  0.857 s (0.87 GB/s) / 0.218 s (3.42 GB/s)       Compr. ratio:   2.1x
  *** lz4     , shuffle    ***  0.278 s (2.68 GB/s) / 0.176 s (4.24 GB/s)       Compr. ratio:   4.5x
  *** lz4     , bitshuffle ***  0.232 s (3.22 GB/s) / 0.268 s (2.78 GB/s)       Compr. ratio:   6.1x
  *** lz4hc   , noshuffle  ***  5.000 s (0.15 GB/s) / 0.151 s (4.92 GB/s)       Compr. ratio:   3.2x
  *** lz4hc   , shuffle    ***  3.526 s (0.21 GB/s) / 0.124 s (6.02 GB/s)       Compr. ratio:   5.4x
  *** lz4hc   , bitshuffle ***  0.541 s (1.38 GB/s) / 0.206 s (3.61 GB/s)       Compr. ratio:   6.2x
  *** snappy  , noshuffle  ***  0.621 s (1.20 GB/s) / 0.260 s (2.86 GB/s)       Compr. ratio:   2.2x
  *** snappy  , shuffle    ***  0.196 s (3.80 GB/s) / 0.172 s (4.32 GB/s)       Compr. ratio:   4.3x
  *** snappy  , bitshuffle ***  0.174 s (4.29 GB/s) / 0.224 s (3.32 GB/s)       Compr. ratio:   5.0x
  *** zlib    , noshuffle  ***  5.319 s (0.14 GB/s) / 0.505 s (1.48 GB/s)       Compr. ratio:   3.9x
  *** zlib    , shuffle    ***  2.910 s (0.26 GB/s) / 0.415 s (1.80 GB/s)       Compr. ratio:   6.1x
  *** zlib    , bitshuffle ***  1.548 s (0.48 GB/s) / 0.492 s (1.52 GB/s)       Compr. ratio:   6.3x

So, although the -O1 case still performs very well, the external library can be more than 2 GB/s faster in some cases (especially with the bitshuffle filter, which benefits considerably from AVX2).
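
As a quick check of that figure, here is a minimal sketch that subtracts the -O1 throughput from the external-library throughput for a few rows of the "arange" tables above. The numbers are copied from the two benchmark outputs; the dictionary layout and the `speed_gaps` helper are just for illustration.

```python
# (compression GB/s, decompression GB/s) for the arange distribution,
# copied from the two benchmark runs above.
external = {
    ("blosclz", "bitshuffle"): (6.19, 6.97),
    ("lz4", "shuffle"): (9.54, 8.02),
    ("lz4", "bitshuffle"): (6.41, 5.53),
    ("snappy", "shuffle"): (10.25, 5.48),
}
o1_build = {
    ("blosclz", "bitshuffle"): (4.46, 4.33),
    ("lz4", "shuffle"): (6.84, 5.12),
    ("lz4", "bitshuffle"): (3.77, 2.79),
    ("snappy", "shuffle"): (6.78, 3.80),
}


def speed_gaps(fast, slow):
    """Return {key: (compression gap, decompression gap)} in GB/s."""
    gaps = {}
    for key, (fast_c, fast_d) in fast.items():
        slow_c, slow_d = slow[key]
        gaps[key] = (round(fast_c - slow_c, 2), round(fast_d - slow_d, 2))
    return gaps


for (codec, shuffle_mode), (gap_c, gap_d) in speed_gaps(external, o1_build).items():
    print("%-8s %-10s compress: +%.2f GB/s  decompress: +%.2f GB/s"
          % (codec, shuffle_mode, gap_c, gap_d))
```

These are single runs on one machine, so the gaps should be read as indicative rather than exact.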

Thoughts?
