Segfault on optimization process #676

Closed
saddy001 opened this issue Mar 20, 2018 · 21 comments

@saddy001

saddy001 commented Mar 20, 2018

I'm getting a segfault at some stage of the process. I got a traceback with gdb. Code as follows:

model = TPOTClassifier(
generations=1000, population_size=100, offspring_size=100, cv=cv - cv[0][0][0], verbosity=2, n_jobs=-1,
random_state=42, config_dict='data/football/tpot_config.py', subsample=1, scoring='precision',
periodic_checkpoint_folder=pathjoin(data_path, 'checkpoints'))
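
The backtrace below was captured under gdb; an invocation along these lines reproduces the setup (tpot_run.py is a placeholder for the actual script name):

$ gdb -ex run --args python tpot_run.py
# ... gdb stops on the fatal signal ...
(gdb) backtrace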

Traceback and context:

Generation 26 - Current best internal CV score: 0.6414797008547009
Optimization Progress: 3%|███▋ | 2777/100100 [4:43:23<985:05:20, 36.44s/pipeline][New Thread 0x7fff44ff9700 (LWP 8098)]
[New Thread 0x7fff457fa700 (LWP 8099)]
[New Thread 0x7fff65ffb700 (LWP 8100)]
[New Thread 0x7fff45ffb700 (LWP 8101)]
[New Thread 0x7fff8a7fc700 (LWP 8102)]
[New Thread 0x7fff89ffb700 (LWP 8103)]
[New Thread 0x7fff897fa700 (LWP 8104)]
[New Thread 0x7fff88ff9700 (LWP 8105)]
[New Thread 0x7fff67fff700 (LWP 8106)]
[New Thread 0x7fff677fe700 (LWP 8107)]
[New Thread 0x7fff66ffd700 (LWP 8108)]
[New Thread 0x7fff667fc700 (LWP 8109)]
[New Thread 0x7fff47fff700 (LWP 8110)]
[New Thread 0x7fff477fe700 (LWP 8111)]
[New Thread 0x7fff46ffd700 (LWP 8112)]
[New Thread 0x7fff467fc700 (LWP 8113)]
[New Thread 0x7fff2ffff700 (LWP 8114)]
[New Thread 0x7fff2f7fe700 (LWP 8115)]
[New Thread 0x7fff2effd700 (LWP 8116)]
[New Thread 0x7fff2e7fc700 (LWP 8117)]
[New Thread 0x7fff2dffb700 (LWP 8118)]
[New Thread 0x7fff2d7fa700 (LWP 8119)]
[New Thread 0x7fff2cff9700 (LWP 8120)]
[New Thread 0x7fff13fff700 (LWP 8121)]
[New Thread 0x7fff137fe700 (LWP 8122)]
[New Thread 0x7fff12ffd700 (LWP 8123)]
[New Thread 0x7fff127fc700 (LWP 8124)]
[New Thread 0x7fff11ffb700 (LWP 8125)]
[New Thread 0x7fff117fa700 (LWP 8126)]
[New Thread 0x7fff10ff9700 (LWP 8127)]
[New Thread 0x7ffefffff700 (LWP 8128)]
[New Thread 0x7ffeff7fe700 (LWP 8129)]
[New Thread 0x7ffefeffd700 (LWP 8130)]
[New Thread 0x7ffefe7fc700 (LWP 8131)]
[New Thread 0x7ffefdffb700 (LWP 8132)]
[New Thread 0x7ffefd7fa700 (LWP 8133)]
[Thread 0x7ffefeffd700 (LWP 8130) exited]
[New Thread 0x7ffefeffd700 (LWP 8134)]
[Thread 0x7ffefdffb700 (LWP 8132) exited]
[New Thread 0x7ffefdffb700 (LWP 8144)]
[Thread 0x7ffefd7fa700 (LWP 8133) exited]
[New Thread 0x7ffefd7fa700 (LWP 8145)]
[Thread 0x7fff2d7fa700 (LWP 8119) exited]
[New Thread 0x7fff2d7fa700 (LWP 8146)]
[Thread 0x7ffefeffd700 (LWP 8134) exited]
[New Thread 0x7ffefeffd700 (LWP 8165)]
[Thread 0x7fff2cff9700 (LWP 8120) exited]
[New Thread 0x7fff2cff9700 (LWP 8166)]
[Thread 0x7ffefd7fa700 (LWP 8145) exited]
[New Thread 0x7ffefd7fa700 (LWP 8176)]
[Thread 0x7fff13fff700 (LWP 8121) exited]
[New Thread 0x7fff13fff700 (LWP 8177)]
[Thread 0x7fff10ff9700 (LWP 8127) exited]
[New Thread 0x7fff10ff9700 (LWP 8178)]
[Thread 0x7ffefe7fc700 (LWP 8131) exited]
[New Thread 0x7ffefe7fc700 (LWP 8179)]
[Thread 0x7fff12ffd700 (LWP 8123) exited]
[New Thread 0x7fff12ffd700 (LWP 8180)]
[Thread 0x7fff2dffb700 (LWP 8118) exited]
[New Thread 0x7fff2dffb700 (LWP 8181)]
[Thread 0x7fff117fa700 (LWP 8126) exited]
[New Thread 0x7fff117fa700 (LWP 8182)]
[Thread 0x7fff137fe700 (LWP 8122) exited]
[New Thread 0x7fff137fe700 (LWP 8183)]
[Thread 0x7fff11ffb700 (LWP 8125) exited]
[New Thread 0x7fff11ffb700 (LWP 8193)]
[Thread 0x7fff127fc700 (LWP 8124) exited]
[New Thread 0x7fff127fc700 (LWP 8194)]
[Thread 0x7fff2cff9700 (LWP 8166) exited]
[New Thread 0x7fff2cff9700 (LWP 8195)]
[Thread 0x7fff127fc700 (LWP 8194) exited]
[New Thread 0x7fff127fc700 (LWP 8196)]
[Thread 0x7ffefffff700 (LWP 8128) exited]
[New Thread 0x7ffefffff700 (LWP 8197)]
[Thread 0x7fff2d7fa700 (LWP 8146) exited]
[New Thread 0x7fff2d7fa700 (LWP 8207)]
[Thread 0x7fff13fff700 (LWP 8177) exited]
[New Thread 0x7fff13fff700 (LWP 8208)]
[Thread 0x7fff2cff9700 (LWP 8195) exited]
[New Thread 0x7fff2cff9700 (LWP 8209)]
[Thread 0x7ffefeffd700 (LWP 8165) exited]
[New Thread 0x7ffefeffd700 (LWP 8219)]
[Thread 0x7ffefd7fa700 (LWP 8176) exited]
[New Thread 0x7ffefd7fa700 (LWP 8220)]
[Thread 0x7fff13fff700 (LWP 8208) exited]
*** Error in '.../miniconda3/bin/python': corrupted double-linked list: 0x00007fff2443b060 ***

Thread 4400 "python" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff897fa700 (LWP 8104)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) backtrace
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff780ef5d in __GI_abort () at abort.c:90
#2 0x00007ffff785728d in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff797e528 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff785e64a in malloc_printerr (action=<optimized out>, str=0x7ffff797ae0b "corrupted double-linked list", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5426
#4 0x00007ffff7862b4a in _int_malloc (av=av@entry=0x7fff24000020, bytes=bytes@entry=2000) at malloc.c:3930
#5 0x00007ffff7864f3e in __GI___libc_malloc (bytes=2000) at malloc.c:3086
#6 0x00007fffed6219a1 in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#7 0x00007fffed6849a3 in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#8 0x00007fffed684a6a in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#9 0x00007fffed7327ad in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#10 0x00007fffed7330fb in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#11 0x00007fffed735988 in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#12 0x00007fffc7255e58 in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/umath.cpython-36m-x86_64-linux-gnu.so
#13 0x00007fffc7256d82 in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/umath.cpython-36m-x86_64-linux-gnu.so
#14 0x00005555556631bb in _PyObject_FastCallDict ()
#15 0x000055555568140d in PyObject_CallFunctionObjArgs ()
#16 0x00007fffed626639 in ?? () from /home/user/miniconda3/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#17 0x0000555555660a3c in PyObject_RichCompare ()
#18 0x00005555557153bc in _PyEval_EvalFrameDefault ()
#19 0x00005555556ea7db in fast_function ()
#20 0x00005555556f0cc5 in call_function ()
#21 0x000055555571519a in _PyEval_EvalFrameDefault ()
#22 0x00005555556ea7db in fast_function ()
#23 0x00005555556f0cc5 in call_function ()
#24 0x000055555571519a in _PyEval_EvalFrameDefault ()
#25 0x00005555556e99a6 in _PyEval_EvalCodeWithName ()
#26 0x00005555556eaa11 in fast_function ()
#27 0x00005555556f0cc5 in call_function ()
#28 0x0000555555715eb1 in _PyEval_EvalFrameDefault ()
#29 0x00005555556e9c76 in _PyEval_EvalCodeWithName ()
#30 0x00005555556eaa11 in fast_function ()
#31 0x00005555556f0cc5 in call_function ()
#32 0x0000555555715eb1 in _PyEval_EvalFrameDefault ()
#33 0x00005555556eb529 in PyEval_EvalCodeEx ()
#34 0x00005555556ec456 in function_call ()
#35 0x0000555555662dde in PyObject_Call ()
#36 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#37 0x00005555556e99a6 in _PyEval_EvalCodeWithName ()
#38 0x00005555556eaa11 in fast_function ()
#39 0x00005555556f0cc5 in call_function ()
#40 0x000055555571519a in _PyEval_EvalFrameDefault ()
#41 0x00005555556eae4b in _PyFunction_FastCallDict ()
#42 0x000055555566339f in _PyObject_FastCallDict ()
#43 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#44 0x0000555555662dde in PyObject_Call ()
#45 0x00005555556be901 in slot_tp_call ()
#46 0x00005555556631bb in _PyObject_FastCallDict ()
#47 0x00005555556f0d3e in call_function ()
#48 0x000055555571519a in _PyEval_EvalFrameDefault ()
#49 0x00005555556eae4b in _PyFunction_FastCallDict ()
#50 0x000055555566339f in _PyObject_FastCallDict ()
#51 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#52 0x0000555555662dde in PyObject_Call ()
#53 0x00005555556bdf6b in slot_tp_init ()
#54 0x00005555556f0f27 in type_call ()
#55 0x00005555556631bb in _PyObject_FastCallDict ()
#56 0x00005555556f0d3e in call_function ()
#57 0x000055555571519a in _PyEval_EvalFrameDefault ()
#58 0x00005555556e99a6 in _PyEval_EvalCodeWithName ()
#59 0x00005555556eaa11 in fast_function ()
#60 0x00005555556f0cc5 in call_function ()
#61 0x0000555555715eb1 in _PyEval_EvalFrameDefault ()
#62 0x00005555556ea7db in fast_function ()
#63 0x00005555556f0cc5 in call_function ()
#64 0x000055555571519a in _PyEval_EvalFrameDefault ()
#65 0x00005555556ea7db in fast_function ()
#66 0x00005555556f0cc5 in call_function ()
#67 0x000055555571519a in _PyEval_EvalFrameDefault ()
#68 0x00005555556eae4b in _PyFunction_FastCallDict ()
#69 0x000055555566339f in _PyObject_FastCallDict ()
#70 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#71 0x0000555555662dde in PyObject_Call ()
#72 0x00005555556be901 in slot_tp_call ()
#73 0x00005555556631bb in _PyObject_FastCallDict ()
#74 0x00005555556f0d3e in call_function ()
#75 0x000055555571519a in _PyEval_EvalFrameDefault ()
#76 0x00005555556e9dfe in _PyEval_EvalCodeWithName ()
#77 0x00005555556eb108 in _PyFunction_FastCallDict ()
#78 0x000055555566339f in _PyObject_FastCallDict ()
#79 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#80 0x0000555555662dde in PyObject_Call ()
#81 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#82 0x00005555556e99a6 in _PyEval_EvalCodeWithName ()
#83 0x00005555556eb108 in _PyFunction_FastCallDict ()
#84 0x000055555566339f in _PyObject_FastCallDict ()
#85 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#86 0x0000555555662dde in PyObject_Call ()
#87 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#88 0x00005555556e9dfe in _PyEval_EvalCodeWithName ()
#89 0x00005555556eaa11 in fast_function ()
#90 0x00005555556f0cc5 in call_function ()
#91 0x0000555555715eb1 in _PyEval_EvalFrameDefault ()
#92 0x00005555556e9dfe in _PyEval_EvalCodeWithName ()
#93 0x00005555556eaa11 in fast_function ()
#94 0x00005555556f0cc5 in call_function ()
#95 0x000055555571519a in _PyEval_EvalFrameDefault ()
#96 0x00005555556ebb6e in PyEval_EvalCodeEx ()
#97 0x00005555556ec456 in function_call ()
#98 0x0000555555662dde in PyObject_Call ()
#99 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#100 0x00005555556e9dfe in _PyEval_EvalCodeWithName ()
#101 0x00005555556eb108 in _PyFunction_FastCallDict ()
#102 0x000055555566339f in _PyObject_FastCallDict ()
#103 0x0000555555758032 in partial_call ()
#104 0x0000555555662dde in PyObject_Call ()
#105 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#106 0x00005555556e99a6 in _PyEval_EvalCodeWithName ()
#107 0x00005555556eaa11 in fast_function ()
#108 0x00005555556f0cc5 in call_function ()
#109 0x000055555571519a in _PyEval_EvalFrameDefault ()
#110 0x00005555556eae4b in _PyFunction_FastCallDict ()
#111 0x000055555566339f in _PyObject_FastCallDict ()
#112 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#113 0x0000555555662dde in PyObject_Call ()
#114 0x00005555556be901 in slot_tp_call ()
#115 0x0000555555662dde in PyObject_Call ()
#116 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#117 0x00005555556e99a6 in _PyEval_EvalCodeWithName ()
#118 0x00005555556eb108 in _PyFunction_FastCallDict ()
#119 0x000055555566339f in _PyObject_FastCallDict ()
#120 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#121 0x0000555555662dde in PyObject_Call ()
#122 0x00005555556be901 in slot_tp_call ()
#123 0x0000555555662dde in PyObject_Call ()
#124 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#125 0x00005555556eb529 in PyEval_EvalCodeEx ()
#126 0x00005555556ec456 in function_call ()
#127 0x0000555555662dde in PyObject_Call ()
#128 0x0000555555716994 in _PyEval_EvalFrameDefault ()
#129 0x00005555556ea7db in fast_function ()
#130 0x00005555556f0cc5 in call_function ()
#131 0x000055555571519a in _PyEval_EvalFrameDefault ()
#132 0x00005555556ea7db in fast_function ()
#133 0x00005555556f0cc5 in call_function ()
#134 0x000055555571519a in _PyEval_EvalFrameDefault ()
#135 0x00005555556eae4b in _PyFunction_FastCallDict ()
#136 0x000055555566339f in _PyObject_FastCallDict ()
#137 0x0000555555667ff3 in _PyObject_Call_Prepend ()
#138 0x0000555555662dde in PyObject_Call ()
#139 0x0000555555762426 in t_bootstrap ()
#140 0x00007ffff7bbd7fc in start_thread (arg=0x7fff897fa700) at pthread_create.c:465
#141 0x00007ffff78eab5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

Versions:

$ python -c "import numpy; print('numpy %s' % numpy.version)"
numpy 1.14.2
$ python -c "import scipy; print('scipy %s' % scipy.version)"
scipy 1.0.0
$ python -c "import sklearn; print('sklearn %s' % sklearn.version)"
sklearn 0.19.1
$ python -c "import deap; print('deap %s' % deap.version)"
deap 1.2
$ python -c "import xgboost; print('xgboost %s ' % xgboost.version)"
Traceback (most recent call last):
File "", line 1, in
ModuleNotFoundError: No module named 'xgboost'
$ python -c "import update_checker; print('update_checker %s ' % update_checker.version)"
update_checker 0.16
$ python -c "import tqdm; print('tqdm %s' % tqdm.version)"
tqdm 4.19.7
$ python -c "import pandas; print('pandas %s' % pandas.version)"
pandas 0.22.0

If you need further information, please let me know.

@saddy001
Author

While I'm looking at it: this is a pretty deep call stack. Could it have something to do with the maximum recursion depth, as in https://stackoverflow.com/q/10035541?
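
For reference, both limits can be inspected and adjusted from Python; raising them only helps if the crash really is a stack overflow (a sketch, not from the original report):

import sys
import threading

# CPython's recursion limit guards the interpreter stack; exceeding it
# raises RecursionError rather than segfaulting.
print(sys.getrecursionlimit())   # typically 1000
sys.setrecursionlimit(10000)     # raise with care

# Worker threads get a fixed native stack at creation time; deep C
# recursion can overflow it. Must be called before threads are started.
threading.stack_size(16 * 1024 * 1024)  # 16 MiB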

@weixuanfu
Contributor

weixuanfu commented Mar 23, 2018

I am not sure why this error showed up. What is the RAM size and the number of CPUs? I just limited the chunk size when n_jobs=-1 (or a very large number) in #677, which I will merge into the dev branch soon. The dev branch may be more stable with n_jobs=-1.

@saddy001
Author

RAM size is 16 GB and there are 16 CPUs (with hyperthreading). I will check out the dev branch and see if it fixes the issue.

@saddy001
Author

By the way, I think this line in the output above is significant:

Error in '.../miniconda3/bin/python': corrupted double-linked list: 0x00007fff2443b060

@weixuanfu
Contributor

16 GB of RAM may not be enough to handle n_jobs=-1. Could you please also try n_jobs=4 instead?

@saddy001
Author

saddy001 commented Mar 23, 2018

Of course. In the stable branch?

@weixuanfu
Contributor

Yep

@saddy001
Author

saddy001 commented Mar 24, 2018

Nope:

Generation 1 - Current best internal CV score: 0.514192598512397
Optimization Progress: 0%|▎ | 203/100100 [33:13<911:00:21, 32.83s/pipeline]
Segmentation fault (core dumped)

Also, I noticed that it used all available cores, even though n_jobs=4 was specified.

@saddy001
Author

saddy001 commented Mar 25, 2018

After running for 1.5 hours, it used all 16 cores for a short period, even though I specified n_jobs=1. It has seemed stable so far, though.

Could it be that TPOT spawns several jobs when n_jobs>1, and that some estimators used by TPOT spawn several jobs of their own (regardless of n_jobs=1)? I ran into trouble with that scenario some time ago too. I'm using the default config dict at the moment.
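
If that scenario is the cause, one way to rule it out is to pin the native thread pools to a single thread before NumPy is imported. This is a sketch using the standard environment variables read by OpenBLAS, MKL, and OpenMP, not something TPOT does itself:

import os

# Cap the native thread pools; these must be set before numpy/scipy/
# sklearn are imported, because the pools are sized at import time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # only now import the numeric stack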

Also, I made some memory measurements during a segfault:

# sample free memory every 0.5 s
while true; do free >> memory.log; sleep 0.5; done
# smallest "free" value seen; "Speicher" matches the German-locale row of free,
# and sort -n is needed for a numeric minimum
grep Speicher memory.log | awk '{print $4}' | sort -n | head -n1
10000444

Either the peak was shorter than 0.5 seconds, or memory usage is not the problem.
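
To catch peaks shorter than 0.5 seconds, a finer-grained sampler would be needed. A sketch using psutil (an extra dependency, not used elsewhere in this thread), sampling every 50 ms:

import time
import psutil

# Track the minimum available memory seen; short allocation spikes are
# easy to miss at 0.5 s resolution. Stop with Ctrl-C.
min_avail = float("inf")
while True:
    min_avail = min(min_avail, psutil.virtual_memory().available)
    print("min available: %.0f MiB" % (min_avail / 2**20), end="\r")
    time.sleep(0.05)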

I also tried a non-conda Python build, since I read about segfaults in conda's Python. Python itself does not seem to be the problem here; the non-conda build crashes too.

Another observation: other programs also become unstable while TPOT runs with n_jobs>1: Firefox (sometimes only tabs crash, sometimes the whole browser), conda (segfaults when creating environments), and so on.

@rhiever
Contributor

rhiever commented Mar 25, 2018 via email

@saddy001
Author

saddy001 commented Mar 25, 2018

[htop screenshot showing multiple cores in use]

But how can this be explained? The command is:

model = TPOTClassifier(
generations=1000, population_size=100, offspring_size=100, cv=cv - cv[0][0][0], verbosity=2, n_jobs=1,
random_state=42, subsample=1, scoring='f1',
periodic_checkpoint_folder=pathjoin(data_path, 'checkpoints'))

It only uses multiple cores for short periods, and this starts after ~2 hours of optimization.

@saddy001
Author

saddy001 commented Mar 25, 2018

I built a minimal working example that shows the problem:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import TimeSeriesSplit

digits = load_digits()
cv = TimeSeriesSplit(n_splits=20)
tpot = TPOTClassifier(
        generations=1000, population_size=100, offspring_size=100, cv=cv, verbosity=2, n_jobs=1,
        random_state=42, subsample=1, scoring='f1',
        periodic_checkpoint_folder='test_checkpoints')

tpot.fit(digits.data, digits.target.astype('bool'))

Now, after it runs for a minute or so, multiple processes (or threads?) are spawned:

[htop screenshot]
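
One way to tell processes from threads (a generic check, not from the original report) is the NLWP column of ps, which counts a process's threads:

$ ps -o pid,nlwp,cmd -p $(pgrep -f test2.py)
# NLWP = "number of lightweight processes", i.e. threads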

@rhiever
Contributor

rhiever commented Mar 25, 2018 via email

@rhiever
Contributor

rhiever commented Mar 26, 2018

I couldn't reproduce the issue on my MacBook with TPOT v0.9.2 that's available through pip. For reference, here are the package versions I'm running:

$ python -c "import numpy; print('numpy %s' % numpy.__version__)"
numpy 1.12.1
$ python -c "import scipy; print('scipy %s' % scipy.__version__)"
scipy 1.0.0
$ python -c "import sklearn; print('sklearn %s' % sklearn.__version__)"
sklearn 0.19.1
$ python -c "import deap; print('deap %s' % deap.__version__)"
deap 1.2
$ python -c "import xgboost; print('xgboost %s ' % xgboost.__version__)"
xgboost 0.7.post3 
$ python -c "import update_checker; print('update_checker %s ' % update_checker.__version__)"
update_checker 0.16 
$ python -c "import tqdm; print('tqdm %s' % tqdm.__version__)"
tqdm 4.19.5
$ python -c "import pandas; print('pandas %s' % pandas.__version__)"
pandas 0.22.0

@saddy001
Author

This is interesting. How did you count the threads? Here's what I came up with:

python ./test2.py
# wait a few seconds, then list every thread of the process
# (h: no header, H: show threads)
ps hH p $(pgrep -f test2.py)
67341 pts/3    Rl+    3:04 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:22 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:20 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:22 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:20 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:21 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:20 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:20 /usr/bin/python ./test2.py
67341 pts/3    Rl+    0:16 /usr/bin/python ./test2.py
67341 pts/3    Sl+    0:00 /usr/bin/python ./test2.py
67341 pts/3    Sl+    0:00 /usr/bin/python ./test2.py

To set up a fresh install, I did this:

conda config --remove channels conda-forge
conda clean --all
conda update conda python pip
conda create --name test python=3
source activate test
(test) pip install --no-cache-dir tpot

@weixuanfu
Contributor

I think the multi-threading seen in htop may be due to the threading backend used by some ensemble-based algorithms. I also found it happening with XGBoost, but I am not sure why yet. You can use the code below to reproduce the issue. Note that the CPU% in htop stayed at about 100% even with multiple threads, since n_jobs=1.

from sklearn.datasets import load_digits
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

digits = load_digits()
cv = TimeSeriesSplit(n_splits=20)
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=6, max_features=0.1, min_samples_leaf=1, min_samples_split=13, n_estimators=100, subsample=0.60)

for _ in range(1000):
    cv_scores = cross_val_score(gbc, digits.data, digits.target.astype('bool'), cv=cv)
    print(cv_scores)
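
If the threading backend is the culprit, clamping the native pools around the call should make the extra threads disappear. A sketch using threadpoolctl, which is an extra dependency and not part of TPOT:

from threadpoolctl import threadpool_limits

# Limit every detected BLAS/OpenMP pool to one thread for this block;
# if htop then shows a single busy thread, the backends were spawning them.
with threadpool_limits(limits=1):
    cv_scores = cross_val_score(gbc, digits.data, digits.target.astype('bool'), cv=cv)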

@saddy001
Author

I have found the cause of the initial problem, the segfaults:
my RAM is defective.

$ sudo memtester 12G
FAILURE: 0xee6eaae5e9499bb3 != 0xee6ebae5e9499bb3 at offset 0x44ae4678.

This issue can be closed.

@rhiever
Contributor

rhiever commented Mar 27, 2018

Glad you were able to get to the bottom of the issue! Guess that's about as low-level as an issue can go.

@saddy001
Author

saddy001 commented Mar 27, 2018

It's nasty. Due to its unpredictable nature, it makes you slowly distrust every piece of software on the system. Then I removed the module I distrusted most, and now suddenly everything is stable.

@Benjamin-Lee

I'm also running into this issue, on an AWS instance with 137 GB of RAM and 72 CPUs. I created a new instance with the same specs and ran into the same segfault after three generations. It seems pretty unlikely that two separate instances would both have defective RAM.

[ec2-user@ip-172-31-20-193 ~]$ python classification.py
/usr/local/lib/python2.7/site-packages/deap/tools/_hypervolume/pyhv.py:33: ImportWarning: Falling back to the python version of hypervolume module. Expect this to be very slow.
  "module. Expect this to be very slow.", ImportWarning)
/usr/local/lib64/python2.7/site-packages/scipy/spatial/__init__.py:96: ImportWarning: Not importing directory '/usr/local/lib64/python2.7/site-packages/scipy/spatial/qhull': missing __init__.py
  from .qhull import *
/usr/local/lib64/python2.7/site-packages/scipy/optimize/_minimize.py:37: ImportWarning: Not importing directory '/usr/local/lib64/python2.7/site-packages/scipy/optimize/lbfgsb': missing __init__.py
  from .lbfgsb import _minimize_lbfgsb
/usr/local/lib/python2.7/site-packages/tpot/operator_utils.py:1: ImportWarning: Not importing directory '/home/ec2-user/xgboost': missing __init__.py
  # -*- coding: utf-8 -*-
/usr/local/lib/python2.7/site-packages/xgboost-0.72-py2.7.egg/xgboost/training.py:11: ImportWarning: Not importing directory '/usr/local/lib/python2.7/site-packages/xgboost-0.72-py2.7.egg/xgboost/rabit': missing __init__.py
  from . import rabit
Generation 1 - Current best internal CV score: 0.870559367619
Generation 2 - Current best internal CV score: 0.875569109325
Optimization Progress:   3%|███▏                                                                                                        | 300/10100 [03:28<40:30:15, 14.88s/pipeline]Segmentation fault

@saddy001
Author

saddy001 commented Jul 8, 2018

Pretty unlikely that two separate instances would both have defective RAM

Really? I don't think so. Do a memory test.

PS: They seem to use ECC RAM. Nevertheless, I would try to rule out faulty RAM as the cause, although I may be a little biased now.
