We had a cluster fail in production. It turned out that a value in our key-value store was wrong, and our Python code was frequently exiting with an unhandled exception. This apparently caused a fast memory leak, and all the unbound servers in the cluster started being killed by the OOM killer.
I reproduced the problem in our test environment, and under valgrind I could clearly see Python leaking. We have had similarly high rates of unhandled exceptions during earlier incidents with previous versions of unbound, but the OOM killer never got involved. I wonder whether it is related to a change in the unbound codebase (we recently upgraded to 1.13.2 from a very old version), but it seems like a Python bug to me.
Below is the valgrind stack trace of the biggest leak. We tried adding gc.collect() (to force a manual garbage collection run) to our codebase, but it had no visible effect.
==4241== 52,405,144 bytes in 595,513 blocks are possibly lost in loss record 3,991 of 3,996
==4241== at 0x4C28BE3: malloc (vg_replace_malloc.c:299)
==4241== by 0x512E593: PyObject_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8D58: _PyObject_GC_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8E95: _PyObject_GC_NewVar (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x514012E: PyTuple_New (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518521F: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x5187609: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x511571F: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F0FE2: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F10C4: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F119D: PyObject_CallFunction (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x4B1D00: pythonmod_operate (pythonmod.c:530)
==4241== by 0x4655DF: mesh_run (mesh.c:1704)
==4241== by 0x42946B: worker_handle_request (worker.c:1573)
==4241== by 0x4AEB70: comm_point_udp_callback (netevent.c:769)
==4241== by 0x4E7C36: handle_select (mini_event.c:220)
==4241== by 0x4E7DEB: minievent_base_dispatch (mini_event.c:242)
==4241== by 0x4AED1B: comm_base_dispatch (netevent.c:246)
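On the gc.collect() attempt: if the lost blocks are pinned by a reference count that native code never drops, the collector would not be expected to help, because it only reclaims unreachable reference cycles. The sketch below simulates that situation from Python with ctypes. `ExceptionData` and `survives_collection` are hypothetical names invented for the illustration, and the Py_IncRef call merely stands in for whatever native code holds the extra reference; it is not taken from the unbound sources.

```python
import ctypes
import gc
import weakref

# Declare the C-API signatures so ctypes passes the object pointer correctly.
_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None
_decref = ctypes.pythonapi.Py_DecRef
_decref.argtypes = [ctypes.py_object]
_decref.restype = None


class ExceptionData(object):
    """Hypothetical stand-in for data owned by native code."""


def survives_collection(forget_decref):
    """Return True if the object is still alive after gc.collect()."""
    obj = ExceptionData()
    probe = weakref.ref(obj)   # lets us observe whether obj was freed
    _incref(obj)               # native code takes its own reference
    if not forget_decref:
        _decref(obj)           # native code releases it again
    del obj                    # drop the only Python-level reference
    gc.collect()               # cannot undo a missing decref
    return probe() is not None


print(survives_collection(forget_decref=True))
print(survives_collection(forget_decref=False))
```

With the decref skipped, the object stays allocated no matter how often the collector runs, which would be consistent with gc.collect() making no difference here.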
The 'recent' change was the addition of a log routine that writes the Python stack trace to syslog, for debugging convenience.
It turns out that this log routine does not Py_XDECREF the three arguments from PyErr_Fetch. That is likely the memory leak, by keeping extra references to the exception data. In the commit I have added Py_XDECREFs for the exception data and stack trace. That hopefully removes the memory leak that you have discovered.
Thanks for the detailed report. I hope the commit fixes your problem; please let me know if it does not. If it does not, I have no clue what else it could be, except further reference leaks in pythonmod/pythonmod.c:122 log_py_err().