Sporadic segfaults in TensorFlow during PyObject_GC_UnTrack when used through PyO3 #1623

Closed
awestlake87 opened this issue May 20, 2021 · 13 comments

@awestlake87
Contributor

I've been seeing some SEGV errors when calling TensorFlow ops through PyO3. I haven't been able to find a solid pattern for when they occur. As far as I can tell it's pretty random, although sometimes I can find a sweet spot by rearranging or splitting up some calls.

The backtrace consistently starts with the following frames:

#0  0x00007f9d45b5e55f in PyObject_GC_UnTrack () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
#1  0x00007f9c9affb8df in EagerTensor_dealloc () from /usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00005555b0f8b27f in pyo3::ffi::object::Py_DECREF (op=0x7f9c8dfe3b40) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/ffi/object.rs:825
#3  0x00005555b0f83396 in pyo3::gil::ReferencePool::update_counts (self=0x5555b1a65eb0 <pyo3::gil::POOL>, _py=...)
    at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:357
#4  0x00005555b0f834c9 in pyo3::gil::GILPool::new () at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:386
#5  0x00005555b0f82cff in pyo3::gil::GILGuard::acquire () at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:267
#6  0x00005555b0f83b99 in pyo3::gil::ensure_gil () at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:490
#7  0x00005555b00e1be1 in pyo3::python::Python::with_gil (f=...) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/python.rs:157

For reference, that EagerTensor_dealloc function is defined here:

// tp_dealloc for EagerTensor.
void EagerTensor_dealloc(EagerTensor* self) {
  // Unhook the object from python's GC so that the weakref deleter doesn't
  // try to re-delete this.
  PyObject_GC_UnTrack((PyObject*)self);

  // Clear weak references to self.
  // Needs to happen before any actual destruction.
  PyObject_ClearWeakRefs((PyObject*)self);

  Py_DECREF(self->handle_data);
  Py_DECREF(self->tensor_shape);
  // If an attribute dictionary has been created, release it. Note that this
  // is only ever created by CPython's attribute setting methods; we don't
  // create it ourselves.
  Py_CLEAR(self->dict);
  if (self->handle != nullptr) {
    TFE_DeleteTensorHandle(self->handle);
    self->handle = nullptr;
  }

  // Decref context after deleting the tensor handle.
  Py_XDECREF(self->context);

  // We have the global interpreter lock, so use this chance to perform delayed
  // refcount decrements.
  tensorflow::ClearDecrefCache();
  auto id = self->id;
  Py_TYPE(self)->tp_free(self);
  TFE_Py_TapeSetDeleteTrace(id);
}

This may not be the right spot to file this issue, so don't feel obligated to help with this if it doesn't seem related to PyO3. I just thought I'd check with you guys to see if these snippets raise any red flags.

Environment
Docker image: nvidia/cuda:11.0-base-ubuntu20.04
Python 3.8.5
PyO3 v0.13.2

@awestlake87
Contributor Author

Ok, it's possible I've found a fix for it after all.

The GC calls led me to some Python dev threads that mentioned errors like this. It seems like Python's GC calls have been going through some churn with the addition of asyncio, and the behavior of this code snippet seems very version-specific.

I had been using Python 3.8.5 ever since we were discussing some solutions in #1274, but it seems like this might have been a mistake. TensorFlow claims to support Python 3.6-3.8, but I'm not sure if that actually includes 3.8.5, and I'm not sure if they've tested that support along with asyncio. Their latest docker image only goes up to Python 3.6.9, so I decided to try to get that version working.

Because of the sys.argv error (#1241), I had to hack the initialization a bit by calling this function in my main:

/// Hack to fix the "module 'sys' has no attribute 'argv'" error in Python 3.6.
pub fn hack_py_sys_argv() {
    Python::with_gil(|_py| unsafe {
        // Decode an empty program name and install it as argv[0].
        let s = std::ffi::CString::new("").unwrap();
        let mut args = [pyo3::ffi::Py_DecodeLocale(s.as_ptr(), std::ptr::null_mut())];
        pyo3::ffi::PySys_SetArgv(1, args.as_mut_ptr());
    });
}

After that, the segfaults disappeared, so it looks like this was just a regression caused by the internal Python GC behavior changing a bit, and that TensorFlow may need to change its dealloc logic in a future version.

It's also possible that this error is racy and it's just slipped into a sweet spot temporarily (which has happened before), so I'm hesitant to call it completely fixed. I'll post an update if the error comes back.

@davidhewitt
Member

That tensorflow code looks correct on inspection; however, the stack trace doesn't seem conclusive enough to point the blame at us either.

Please do let us know if the problems reoccur.

@davidhewitt
Member

Btw, note that if this PyO3 code path is being exercised, you're dropping Py smart-pointers without the GIL held. PyO3 has to stash the reference to the object until the next time PyO3 enters GIL-bound code, which delays deallocation. Interestingly, it looks like tensorflow has a similar strategy (tensorflow::ClearDecrefCache()).
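
To illustrate, here's a minimal sketch of the two drop paths (assuming PyO3 0.13 and an already-initialized interpreter; drop_paths is just an illustrative function, not code from this project):

use pyo3::prelude::*;

fn drop_paths() {
    // An owned reference created under the GIL but kept after it is released.
    let obj: PyObject = Python::with_gil(|py| py.None());

    // Dropping here, with no GIL held, can't call Py_DECREF immediately:
    // PyO3 stashes the pointer in its internal reference pool and applies
    // the decrement the next time GIL-bound code runs (the update_counts
    // frame in the backtrace above).
    drop(obj);

    // Dropping while the GIL is held decrements the refcount right away,
    // so tp_dealloc (e.g. EagerTensor_dealloc) runs here instead of later.
    let obj2: PyObject = Python::with_gil(|py| py.None());
    Python::with_gil(|_py| drop(obj2));
}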

@awestlake87
Contributor Author

awestlake87 commented May 21, 2021

@davidhewitt unfortunately I think the errors have come back. They're definitely rarer now, so I haven't gotten a proper backtrace yet to verify that they're the same error, but it looks the same from the outside.

I've also been seeing new problems since moving back to Python 3.6. And even stranger, it seems like there are multiple different crashes. I just got this one after about 2 hours of running my program:

Fatal Python error: GC object already tracked
...

In some runs, it'll occasionally complain that some of my tensor shapes are off too, which is really strange, because it should be running the same set of training loop operations each time.

I'm really not sure which library has the issue at this point (although I'm fairly certain tensorflow is doing some weird stuff), so I'll keep trying some things.

That information about the Py smart pointers and the cache is helpful though. At one point I was trying to restrict the deallocation of tensors to the same thread they were created in (for no particular reason other than that examples of tensorflow usage are usually single-threaded), but since the deallocation was cached and delayed, it's likely that another thread would have taken over the GIL and deallocated them there instead.

Edit:
At least some of the errors that I thought were the segfaults were actually SIGKILLs because my PC was running out of RAM, so now I'm not certain I have been seeing the segfaults again. I've got everything running through GDB now, so if those segfaults happen again, I should know for sure.

@awestlake87
Contributor Author

awestlake87 commented May 22, 2021

Just as an update, I have seen new segfaults in different places since rolling back to 3.6.9 with traces like this:

Thread 81 "tokio-runtime-w" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fc1be7fc700 (LWP 106)]
0x00007fc25b0b8dc1 in TFE_TensorHandleDataType () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#0  0x00007fc25b0b8dc1 in TFE_TensorHandleDataType () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#1  0x00007fc25b04a436 in (anonymous namespace)::AddInputToOp((anonymous namespace)::FastPathOpExecInfo*, _object*, bool, tensorflow::OpDef_ArgDef const&, std::vector<std::unique_ptr<_object, tensorflow::detail::PyDecrefDeleter>, std::allocator<std::unique_ptr<_object, tensorflow::detail::PyDecrefDeleter> > >*, std::vector<std::unique_ptr<_object, tensorflow::detail::PyDecrefDeleter>, std::allocator<std::unique_ptr<_object, tensorflow::detail::PyDecrefDeleter> > >*, TFE_Op*, TF_Status*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007fc25b04b2d3 in TFE_Py_FastPathExecute_C(_object*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007fc24d741957 in void pybind11::cpp_function::initialize<pybind11_init__pywrap_tfe(pybind11::module&)::{lambda(pybind11::args)#43}, pybind11::object, pybind11::args, pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init__pywrap_tfe(pybind11::module&)::{lambda(pybind11::args)#43}&&, pybind11::object (*)(pybind11::args), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tfe.so
#4  0x00007fc24d7661d4 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tfe.so
#5  0x00007fc29e2e658a in _PyCFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#6  0x00007fc29e24f1ec in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#7  0x00007fc29e255303 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#8  0x00007fc29e24ea0f in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#9  0x00007fc29e24f0fc in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#10 0x00007fc29e2540d5 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#11 0x00007fc29e24ea0f in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#12 0x00007fc29e24f4ce in PyEval_EvalCodeEx () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#13 0x00007fc29e30ee36 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#14 0x00007fc29e33a768 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#15 0x00007fc29e254225 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#16 0x00007fc29e24ea0f in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#17 0x00007fc29e24f0fc in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#18 0x00007fc29e255303 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#19 0x00007fc29e24d803 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#20 0x00007fc29e24ee25 in _PyFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#21 0x00007fc29e339fd1 in _PyObject_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#22 0x00007fc29e33a6fd in _PyObject_Call_Prepend () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#23 0x00007fc29e33a768 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0
#24 0x0000564fff376dfa in pyo3::types::any::PyAny::call_method::{{closure}} (name=0x7fc1da503b58) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/types/any.rs:276
#25 0x0000564fff37e548 in pyo3::conversion::ToBorrowedObject::with_borrowed_ptr (self=0x7fc1be7f2360, py=..., f=...) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/conversion.rs:97
#26 0x0000564fff376c68 in pyo3::types::any::PyAny::call_method (self=0x7fc1da521a40, name=..., args=(), kwargs=...) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/types/any.rs:268
#27 0x0000564fff376eef in pyo3::types::any::PyAny::call_method0 (self=0x7fc1da521a40, name=...) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/types/any.rs:297
...

These segfaults happen maybe once every 30 minutes. Also, I'm seeing a litany of new exceptions being thrown in tensorflow code that I've never seen before. Most seem distinct from one another, and they only happen every once in a while.

I've been trying to constrain my tensorflow logic even more by running the pyo3 tensorflow calls in a single thread, but I'm even having issues with that. It seems to work fine until it reaches a certain point in the code, then it always deadlocks in the same place, on a call to PyEval_AcquireThread.

I thought maybe it was because I was creating the thread in Rust, so I tried creating one with the Python threading.Thread class. That turned out to be even worse: it deadlocked in PyEval_RestoreThread when I called the start function.

Adding a call to pyo3::prepare_freethreaded_python did not help in either case.

Upgrading the version back to 3.8.5 avoids all of the issues listed above, but I just can't shake the original segfaults in PyObject_GC_UnTrack. I must still be deleting a Py<T> reference outside the GIL lock somewhere, but I haven't been able to figure out where. If TensorFlow is also keeping a cache of references, maybe PyO3 and TensorFlow are both deallocating the same tensor and causing a double-free error?

In short, I've just been having rotten luck with this all around. I just can't seem to win with it.

@davidhewitt
Member

Yikes. Are you able to share a sample of this code? There are enough issues here that I'd like to run it myself if possible to understand the cause. (If you're not able to share it in the public domain, could you potentially reach out to me via Gitter to arrange something confidential?)

@awestlake87
Contributor Author

Yeah, it's not a work project so there's no problem sharing it, although it's not at a point yet where I want to fully open source it. It's around 20k lines of code too, so I'd like to strip it down as much as I can. That should help us narrow the problem down a bit.

I'm not sure how much time I'll have to work on it this weekend, but I'll post an update once I have the sample code.

@davidhewitt
Member

Sure thing, whenever suits you! Thanks!

@awestlake87
Contributor Author

I did manage to track down the spots where I was deleting Py<T> references outside of the GIL context. After I changed them to drop inside the GIL context, the segfaults on 3.8.5 disappeared. I ran the test that was failing 100 times in a row with no issues and my long running training tasks have been running for 11 hours without crashing, so it seems stable enough that I'm unblocked for the time being.
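
For reference, the change amounts to something like this sketch (a hypothetical TensorHolder wrapper, not the project's actual code):

use pyo3::prelude::*;

// Hypothetical wrapper that owns a reference to a Python-side tensor.
struct TensorHolder {
    tensor: Option<PyObject>,
}

impl Drop for TensorHolder {
    fn drop(&mut self) {
        // Take the reference out and drop it while the GIL is held, so the
        // Py_DECREF (and possibly EagerTensor_dealloc) runs right here
        // instead of being deferred to PyO3's reference pool.
        if let Some(tensor) = self.tensor.take() {
            Python::with_gil(|_py| drop(tensor));
        }
    }
}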

As for the deadlock on PyEval_AcquireThread, it actually happens on 3.8.5 too. It was just segfaulting before it got to that point apparently. I've worked around it by executing the tensorflow code on a Python ThreadPoolExecutor instead of a single Rust thread. Not sure why that makes a difference, but apparently it does.
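
Roughly, that workaround looks like the following sketch (train_step is a hypothetical callable standing in for the real training code; ThreadPoolExecutor, submit, and result are the standard concurrent.futures API):

use pyo3::prelude::*;

// Run a Python callable on a worker thread owned by Python rather than by a
// Rust/tokio thread.
fn run_on_python_executor(train_step: PyObject) -> PyResult<PyObject> {
    Python::with_gil(|py| {
        let futures = py.import("concurrent.futures")?;
        let executor = futures.call_method1("ThreadPoolExecutor", (1,))?;
        let future = executor.call_method1("submit", (train_step,))?;
        // result() blocks until the worker finishes; the wait releases the
        // GIL internally, so the worker thread can acquire it and run.
        let result = future.call_method0("result")?;
        Ok(result.into())
    })
}

In practice the executor would be created once and reused rather than built per call.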

These are just hacks though, so I'd still like to get a better understanding of the underlying bugs since I feel like they're going to come back to bite me at some point. I've reduced the code that reproduces the problem down to ~4k lines so far, but I think I can reduce that down further. I'll try to push that sample to GitLab within a few days.

@davidhewitt
Member

> After I changed them to drop inside the GIL context, the segfaults on 3.8.5 disappeared.

Awesome! Especially the deadlock is very interesting so would be curious to investigate that too.

> I'll try to push that sample to GitLab within a few days.

👍 thanks.

@awestlake87
Contributor Author

Just letting you know, I haven't forgotten about this, I've just been busier than I expected lately. I'll try to get back to this soon.

@davidhewitt
Member

No problem, will help debug once it's ready

@davidhewitt
Member

It's been a year since the last action on this issue, so I'm going to close it on the assumption that whatever conditions were causing the problem have changed.

If there's still a problem, please ping and we can continue investigation.
