Python Module Seems to Leak Memory if it Experiences an Unhandled Exception #506

Closed
drachs opened this issue Jul 1, 2021 · 2 comments

drachs commented Jul 1, 2021

We had a cluster fail in production. It turns out a value in our key-value store was wrong, and our Python code was frequently exiting with an unhandled exception. This apparently caused a fast memory leak, and all the Unbound servers in the cluster started dying due to the OOM killer.

I reproduced the problem in our test environment, and under valgrind I could clearly see Python leaking. We've had similarly high rates of unhandled exceptions during previous events under previous versions of Unbound, but never had the OOM killer get involved. I wonder if it's related to a change in the Unbound codebase (we only recently upgraded to 1.13.2 from a very old version)? But it seems like a Python bug to me.

Below is the valgrind stack trace of the biggest leak. We tried adding gc.collect() (which manually triggers a garbage collection run) to our codebase, but it seemed to have no impact.

==4241== 52,405,144 bytes in 595,513 blocks are possibly lost in loss record 3,991 of 3,996
==4241== at 0x4C28BE3: malloc (vg_replace_malloc.c:299)
==4241== by 0x512E593: PyObject_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8D58: _PyObject_GC_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8E95: _PyObject_GC_NewVar (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x514012E: PyTuple_New (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518521F: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x5187609: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x511571F: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F0FE2: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F10C4: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F119D: PyObject_CallFunction (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x4B1D00: pythonmod_operate (pythonmod.c:530)
==4241== by 0x4655DF: mesh_run (mesh.c:1704)
==4241== by 0x42946B: worker_handle_request (worker.c:1573)
==4241== by 0x4AEB70: comm_point_udp_callback (netevent.c:769)
==4241== by 0x4E7C36: handle_select (mini_event.c:220)
==4241== by 0x4E7DEB: minievent_base_dispatch (mini_event.c:242)
==4241== by 0x4AED1B: comm_base_dispatch (netevent.c:246)
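
Separately from the leak itself, one way to keep a bad stored value from surfacing as an unhandled exception is to wrap the handler so failures are logged and turned into a fallback result. This is only a rough sketch: `guarded`, `lookup`, the logger name, and the fallback value are hypothetical names for illustration, not part of Unbound's Python module API.

```python
import functools
import logging

log = logging.getLogger("pymod-guard")  # hypothetical logger name


def guarded(fallback):
    """Decorator factory: log any exception and return `fallback` instead."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Record the full traceback, then swallow the exception so it
                # never propagates to the caller.
                log.exception("handler %s failed", func.__name__)
                return fallback
        return wrapper
    return decorate


@guarded(fallback=False)
def lookup(store, key):
    """Hypothetical handler: fails if the stored value is missing or malformed."""
    return int(store[key]) > 0


good = lookup({"ttl": "30"}, "ttl")        # valid value
bad = lookup({"ttl": "oops"}, "ttl")       # int() raises ValueError
missing = lookup({}, "ttl")                # dict lookup raises KeyError
```

Whether swallowing the exception is acceptable depends on what the caller expects the fallback to mean, so the fallback value should be chosen per handler.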

@wcawijngaards
Member

The 'recent' change was the addition of a log routine that logs the python stack trace to syslog. This is for debugging convenience.

It turns out that this log routine does not Py_XDECREF the three arguments from PyErr_Fetch. That is likely the memory leak, by keeping extra references to the exception data. In the commit I have added Py_XDECREFs for the exception data and stack trace. That hopefully removes the memory leak that you have discovered.

Thanks for the detailed report. I hope the fix resolves your problem; please let me know if it does not. I have no clue what else it could be, other than further reference leaks in log_py_err() at pythonmod/pythonmod.c:122.
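
The kind of reference leak described above can be imitated from pure Python. This is a rough sketch assuming CPython 3, not Unbound's actual code. It uses `ctypes.pythonapi` to add one stray reference to an exception object, standing in for a reference obtained from PyErr_Fetch that is never released. `Payload`, `simulate_handler`, `py_incref`, and `py_decref` are hypothetical names. The sketch also suggests why gc.collect() would not help: the cycle collector only reclaims unreachable objects, and an object with an unaccounted reference count is treated as still in use.

```python
import ctypes
import gc
import weakref

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
py_incref = ctypes.pythonapi.Py_IncRef
py_incref.argtypes = [ctypes.py_object]
py_incref.restype = None

py_decref = ctypes.pythonapi.Py_DecRef
py_decref.argtypes = [ctypes.c_void_p]  # called with a raw object address
py_decref.restype = None


class Payload(Exception):
    """Hypothetical exception type used only for this demonstration."""


def simulate_handler(leak_reference):
    """Raise and handle an exception, optionally leaving one stray reference."""
    try:
        raise Payload("bad value from the key-value store")
    except Payload as exc:
        probe = weakref.ref(exc)   # lets us observe whether exc is freed
        address = id(exc)          # CPython detail: id() is the object address
        if leak_reference:
            py_incref(exc)         # stands in for a missing Py_XDECREF
    return probe, address


# Leaky path: the exception stays alive even after a full collection.
probe, address = simulate_handler(leak_reference=True)
gc.collect()
leaked_survives_gc = probe() is not None

# Releasing the stray reference (the role Py_XDECREF plays) frees the object.
py_decref(address)
released_after_decref = probe() is None

# Clean path: with no stray reference the exception is freed as usual.
clean_probe, _ = simulate_handler(leak_reference=False)
gc.collect()
clean_is_freed = clean_probe() is None

print(leaked_survives_gc, released_after_decref, clean_is_freed)
```

Each leaked exception also keeps its traceback and the frames it refers to alive, so a high rate of exceptions can add up to a lot of memory quickly.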

@drachs
Author

drachs commented Jul 2, 2021

I tried your code today and I can confirm it fixes the issue. Do you know when it will see a production-ready release?
