Python Module Seems to Leak Memory if it Experiences an Unhandled Exception #506

drachs · 2021-07-01T22:30:56Z

We had a cluster fail in production. It turned out that a value in our key-value store was wrong, so our Python code was frequently exiting with an unhandled exception. This apparently caused a fast memory leak, and all the Unbound servers in the cluster started getting killed by the OOM killer.

I reproduced the problem in our test environment, and under valgrind I could clearly see Python leaking. We've had similarly high rates of unhandled exceptions during previous events under previous versions of Unbound, but never had the OOM killer get involved. I wonder if it's related to a change in the Unbound codebase (we just recently upgraded to 1.13.2 from a very old version)? But it seems like a Python bug to me.

Below is the valgrind stack trace of the biggest leak. We tried adding gc.collect() (manually triggering a garbage collection run) to our codebase, but it seemed to have no impact.

==4241== 52,405,144 bytes in 595,513 blocks are possibly lost in loss record 3,991 of 3,996
==4241== at 0x4C28BE3: malloc (vg_replace_malloc.c:299)
==4241== by 0x512E593: PyObject_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8D58: _PyObject_GC_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8E95: _PyObject_GC_NewVar (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x514012E: PyTuple_New (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518521F: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x5187609: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x511571F: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F0FE2: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F10C4: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F119D: PyObject_CallFunction (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x4B1D00: pythonmod_operate (pythonmod.c:530)
==4241== by 0x4655DF: mesh_run (mesh.c:1704)
==4241== by 0x42946B: worker_handle_request (worker.c:1573)
==4241== by 0x4AEB70: comm_point_udp_callback (netevent.c:769)
==4241== by 0x4E7C36: handle_select (mini_event.c:220)
==4241== by 0x4E7DEB: minievent_base_dispatch (mini_event.c:242)
==4241== by 0x4AED1B: comm_base_dispatch (netevent.c:246)
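
On the gc.collect() attempt mentioned above: one plausible reason it had no effect is that the collector only reclaims unreachable reference cycles, and cannot free an object whose reference count is held up by C code that never releases it. The following is a minimal CPython-only sketch of that behaviour. It uses ctypes to stand in for the embedding C program, and the names Payload, leak_one_reference and survives_collection are hypothetical, not part of Unbound.

```python
import ctypes
import gc
import weakref


class Payload(object):
    """Hypothetical stand-in for data owned by an embedding C program."""


def leak_one_reference(obj):
    # Simulate C code that takes a reference and never releases it.
    # Py_IncRef is the function form of Py_INCREF in the CPython C API.
    ctypes.pythonapi.Py_IncRef(ctypes.py_object(obj))


def survives_collection(leak):
    """Return True if the object is still alive after gc.collect()."""
    obj = Payload()
    probe = weakref.ref(obj)  # lets us observe the object without owning it
    if leak:
        leak_one_reference(obj)  # deliberately never undone in this sketch
    del obj                      # drop the only Python-level reference
    gc.collect()                 # a full collection run
    return probe() is not None


if __name__ == "__main__":
    print("leaked reference survives:", survives_collection(True))
    print("normal object survives:", survives_collection(False))
```

Under CPython the leaked object should still be alive after the collection run, while the normal one is freed as soon as its last reference goes away. That matches the observation that calling gc.collect() from the module did not reduce the leak.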


wcawijngaards · 2021-07-02T07:46:31Z

The 'recent' change was the addition of a log routine that logs the Python stack trace to syslog. This is for debugging convenience.

It turns out that this log routine does not Py_XDECREF the three objects (type, value and traceback) that PyErr_Fetch returns. That is likely the memory leak: the routine keeps extra references to the exception data. In the commit I have added Py_XDECREFs for the exception data and stack trace. That hopefully removes the memory leak that you have discovered.
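
The ownership rule described above can be modelled in Python for illustration. This is a simulation under stated assumptions, not the actual C patch: fetch_error is a hypothetical stand-in for PyErr_Fetch that hands the caller three owned references (via ctypes, CPython only), and log_error, fail_once, count_survivors and LookupFailed are hypothetical names as well.

```python
import ctypes
import gc
import sys
import weakref

_incref = ctypes.pythonapi.Py_IncRef  # function form of Py_INCREF
_decref = ctypes.pythonapi.Py_DecRef  # function form of Py_DECREF


class LookupFailed(Exception):
    """Hypothetical error raised by the module's callback."""


def fetch_error():
    # Model of PyErr_Fetch: the caller receives three references it now owns.
    triple = sys.exc_info()
    for item in triple:
        _incref(ctypes.py_object(item))
    return triple


def log_error(release):
    exc_type, exc_value, exc_tb = fetch_error()
    # ... a real logger would format and write the traceback here ...
    if release:
        # The fix: give back every reference obtained from the fetch.
        for item in (exc_type, exc_value, exc_tb):
            _decref(ctypes.py_object(item))


def fail_once(release):
    try:
        raise LookupFailed("bad value in key-value store")
    except LookupFailed as err:
        probe = weakref.ref(err)  # observe the exception without owning it
        log_error(release)
    return probe


def count_survivors(n, release):
    """Count exception objects still alive after n logged failures."""
    probes = [fail_once(release) for _ in range(n)]
    gc.collect()
    return sum(1 for probe in probes if probe() is not None)


if __name__ == "__main__":
    print("without release:", count_survivors(100, release=False))
    print("with release:", count_survivors(100, release=True))
```

Without the release step every logged failure should leave its exception object (and the traceback it points to) alive, so memory grows with the exception rate. With the release step none should survive.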

Thanks for the detailed report. I hope the fix solves your problem; please let me know if it does not. I have no clue what else it could be, though, except more reference leaks in log_py_err() at pythonmod/pythonmod.c:122.

drachs · 2021-07-02T21:05:24Z

I tried your code today and I can confirm it fixes the issue. Do you know when it will see a production-ready release?

wcawijngaards closed this as completed in f62994f Jul 2, 2021
