Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyExc_TypeError has ob_type == NULL #32

Open
CalumFreeman opened this issue Aug 15, 2018 · 25 comments
Open

PyExc_TypeError has ob_type == NULL #32

CalumFreeman opened this issue Aug 15, 2018 · 25 comments

Comments

@CalumFreeman
Copy link
Contributor

CalumFreeman commented Aug 15, 2018

I have been investigating #13, and I found that numpy.random.randint() segfaults occasionally. I'm not sure why it occurs so rarely or what the root cause is, but (after finding myself reading assembly) the actual seg fault is connected to PyObject_Call being passed PyExc_TypeError as the function to be called. This then causes a segfault as soon as a property of the ob_type is accessed (in this case: ob_type->tp_flags). I have written a simple C function that reliably causes this bug:

static PyObject* test_TypeError(PyObject *self, PyObject *args){
	PyObject* tup = PyTuple_Pack(1, Py_BuildValue("s","Hello"));
	PyObject_Call(PyExc_TypeError, tup, NULL);
	return tup;
}

CPython happily runs this (when it is part of a C extension), while JyNI segfaults. In JyNI, for PyExc_TypeError->ob_type==NULL, I think this is the cause of the segfault. I haven't been able to work out if PyExc_TypeError in CPython has ob_type==NULL. However, I can't find any checks for ob_type==NULL throughout the python codebase, which suggests the property should never be NULL. I did find a check in JyNI for this, and it has this comment just before it:

//JyNI note: if ob_type is NULL, we assume that base is a Type object that was
//not initialized because of our use-on-demand philosophy regarding PyType_Ready.

I don't quite understand what this means but it seems related, could you (@Stewori) possibly explain a bit about what is happening/what this means?

This could be solved in this case by working around it in the PyObject_Call function, it's possible to work out you are dealing with an error/exception and figure out what type without accessing any properties of ob_type. But if ob_type should never be NULL, perhaps it would be better to solve this more generally instead of trying to work around it.

With regards to numpy.random.randint(), it is possible that this error is always caught internally since it doesn't seem to set an error message or call the type error properly. Maybe in JyNI it just segfaults before it can be caught. But I'm not entirely sure, I haven't been able to track the particular problem down as it seems to be affected by timing (so adding print statements or running in debug mode make it less likely to occur), and it is random when it occurs(so I need to run numpy.random.randint() millions of times to get the error to appear).

@Stewori
Copy link
Owner

Stewori commented Aug 16, 2018

Calum, that's a brilliant finding. For some objects, especially type objects (PyType) themselves, the ob_type field is not inititalized at definition time, but by PyType_Ready. IIRC there was not really a good location in JyNI to run PyType_Ready. Running it on load was too early because other stuff it depends on was not yet initialized. Additionally, JyNI rarely needs the stuff it initializes. I think I ended up with running it on demand when something was detected to be missing. Maybe we should reconsider this. Please try to add this to PyObject_Call (just as a workability check):

if (!whatever->ob_type) {
  PyType_Ready(whatever);
}

If that solves it we can still consider if there's a better place for this. You might also want to add some logging to PyType_Ready to see when it was first called for PyExc_TypeError if at all. This would help to understand why it works sometimes.

@CalumFreeman
Copy link
Contributor Author

CalumFreeman commented Aug 17, 2018

tl;dr:

I have no idea 😃! I'm posting this so that I can remember where I got to on monday, any comments would be helpful but don't spend too long on it since I'll continue investigating on monday.

First fix

I added the code you suggested to JyNI_GetJythonDelegate() (it doesn't seem to matter if it is there or in PyObject_Call()). Although it still segfaults, it is a different error, this is the new problematic frame:

V  [libjvm.so+0x697a28]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1008

The problem seems to occur when calling dest = (*env)->CallObjectMethod(env, type, pyObject___call__); (line 1746 of JyNI.c). Everything prior to this seems fairly normal/acceptable. I've compared what I can to CPython and nothing has seemed obviously wrong. I think the two most likely sources of error are:

  • type = JyNI_JythonExceptionType_FromPyExceptionType((PyObject*) Py_TYPE(src)); is returning an incorrect jobject
  • the error types in jython don't have a __call__ method

Second "fix"

I have found a sort of fix for the new segfault, I changed the if statement(line 1743 onwards):

type = JyNI_JythonExceptionType_FromPyExceptionType((PyObject*) Py_TYPE(src));
if (type)
	{
		dest = (*env)->CallObjectMethod(env, type, pyObject___call__);
	} else {
		type = JyNI_JythonPyObject_FromPyObject((PyObject*) Py_TYPE(src));
		dest = (*env)->NewObject(env, pyBaseExceptionClass, pyBaseException_subTypeConstructor, type);
	}

To simply call the else section:

type = JyNI_JythonPyObject_FromPyObject((PyObject*) Py_TYPE(src));
dest = (*env)->NewObject(env, pyBaseExceptionClass, pyBaseException_subTypeConstructor, type);

This may not be a good fix, while it does seem to stop the particular segfault caused by using PyObject_Call() to raise an error, it also seems to cause some trouble for numpy.random.randint() which previously worked most of the time (and occasionally segfaulted) and now mostly fails(with error message: TypeError: Unsupported dtype "int64" for randint) and still occasionally segfaults.

I suspect the TypeError may be an error that was hidden before and "fixing" this bug has made them visible. I think the segfault that is occuring now did occur before this change so it may be unrelated (or it may not, I don't know). It's problematic frame is:
J 3456 C2 JyNI.JyNI.clearNativeHandle(Lorg/python/core/PyObject;)V (37 bytes) @ 0x00007f759f67de6b [0x00007f759f67dde0+0x8b]

So it may be that this fix works fine and all the new errors are really just hidden/pre-existing ones that are more visible, or it may be that this "fix" causes lots of errors. I don't know.

Also, when calling exceptions from C the normal way (PyErr_SetStrin()), the JyNI_InitJythonPyException function doesn't seem to be called, I don't know how relevant that is but I thought I would mention it.

@Stewori
Copy link
Owner

Stewori commented Aug 18, 2018

I remember it is not easy to find the actual C-source code of random.rand, it is in a cython-generated file. For easier discussion I'll simply post it here. It boils down to

/* Python wrapper */
static PyObject *__pyx_pw_6mtrand_11RandomState_29rand(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static char __pyx_doc_6mtrand_11RandomState_28rand[] = "\n        rand(d0, d1, ..., dn)\n\n        Random values in a given shape.\n\n        Create an array of the given shape and populate it with\n        random samples from a uniform distribution\n        over ``[0, 1)``.\n\n        Parameters\n        ----------\n        d0, d1, ..., dn : int, optional\n            The dimensions of the returned array, should all be positive.\n            If no argument is given a single Python float is returned.\n\n        Returns\n        -------\n        out : ndarray, shape ``(d0, d1, ..., dn)``\n            Random values.\n\n        See Also\n        --------\n        random\n\n        Notes\n        -----\n        This is a convenience function. If you want an interface that\n        takes a shape-tuple as the first argument, refer to\n        np.random.random_sample .\n\n        Examples\n        --------\n        >>> np.random.rand(3,2)\n        array([[ 0.14022471,  0.96360618],  #random\n               [ 0.37601032,  0.25528411],  #random\n               [ 0.49313049,  0.94909878]]) #random\n\n        ";
static PyObject *__pyx_pw_6mtrand_11RandomState_29rand(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  PyObject *__pyx_v_args = 0;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("rand (wrapper)", 0);
  if (unlikely(__pyx_kwds) && unlikely(PyDict_Size(__pyx_kwds) > 0) && unlikely(!__Pyx_CheckKeywordStrings(__pyx_kwds, "rand", 0))) return NULL;
  __Pyx_INCREF(__pyx_args);
  __pyx_v_args = __pyx_args;
  __pyx_r = __pyx_pf_6mtrand_11RandomState_28rand(((struct __pyx_obj_6mtrand_RandomState *)__pyx_v_self), __pyx_v_args);

  /* function exit code */
  __Pyx_XDECREF(__pyx_v_args);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

Do you know which line causes the segfault?
Then, is it possible to write this in numpy-independent fashion? Is the C-code you posted in #32 (comment) still suitable to reproduce the follow-up issue?

The code

type = JyNI_JythonExceptionType_FromPyExceptionType((PyObject*) Py_TYPE(src));
if (!type)
{
	dest = (*env)->CallObjectMethod(env, type, pyObject___call__);
} else {
	type = JyNI_JythonPyObject_FromPyObject((PyObject*) Py_TYPE(src));
	dest = (*env)->NewObject(env, pyBaseExceptionClass, pyBaseException_subTypeConstructor, type);
}

somehow looks like if and else sections are accidentially swapped. Can it be that this did not cause issues earlier? Please try to change if (!type) to if (type) as a replacement to your second fix.

I think the two most likely sources of error are...

Can you add something like

if (!whatever->tp_call) jputs("tp_call is NULL");

And something similar to check type = JyNI_JythonExceptionType_FromPyExceptionType((PyObject*) Py_TYPE(src)); ?

@Stewori
Copy link
Owner

Stewori commented Aug 19, 2018

I checked about JyNI_InitJythonPyException. That function is to be understood like the stuff in JySync.c. It is a data converter that is only to be used for initially converting a C-level exception to a Java-level exception. It is called from JyNI_JythonPyObject_FromPyObject. Now note that exceptions created at C-level are not necessarily converted to Java. They are stored in the thread state. Also CPython distincts C-level exceptions from Python-level exceptions:

    PyObject *curexc_type;
    PyObject *curexc_value;
    PyObject *curexc_traceback;

    PyObject *exc_type;
    PyObject *exc_value;
    PyObject *exc_traceback;

curexc are the fields for the C-exception, exc are the fields for the Python exception. A C-level exception can be cleared by C-code before returning control to Python. Only if a C-level exception is uncleared at that point it is copied into the exc fields. In JyNI the macro LEAVE_JyNI takes care of that. If it finds an exception in curexc it is converted to Java and indeed then JyNI_InitJythonPyException should be called. However, JyNI never populates the exc fields, but instead hands the Java-exception to Jython (by throwing it; in JyNI.java maybeExc takes care of that), which creates a PyException and stores it in Jython's notion of a thread state.

@Stewori
Copy link
Owner

Stewori commented Aug 21, 2018

Regarding JyNI.JyNI.clearNativeHandle it is crucial to know in what context it was called and what type of object was passed in. Strangely, that's a Java method, so how can it possibly segfault? It does not even trigger a native call:

public static void clearNativeHandle(PyObject object) {
	if (object == null) {
		System.err.println("JyNI-Warning: clearNativeHandle called with null!");
		return;
	}
	if (object instanceof PyCPeer)
		((PyCPeer) object).objectHandle = 0;
	else
		JyAttribute.delAttr(object, JyAttribute.JYNI_HANDLE_ATTR);
}

Is the warning triggered? Given that it nulls an object handle, it can potentially cause a subsequent segfault if used unproperly. That's why we need to know how it is called. I suspect it is called as part of a gc run. The original code contains some out-commented printouts. Maybe it helps to check their output. Also consider to print a stacktrace there.

@Stewori
Copy link
Owner

Stewori commented Aug 21, 2018

clearNativeHandle is only called by JyNI_CleanUp_JyObject in JyNI.c. JyNI_CleanUp_JyObject in turn is called from various places in JyNI-C project, also from gcmodule.c. To track down this issue it would be crucial to know which place calls JyNI_CleanUp_JyObject in a segfaulting way. Can you see this from the error log?
I suspect it is the call from gcmodule.c. Can you confirm this (e.g. by surrounding it with printouts of __LINE__)?

@CalumFreeman
Copy link
Contributor Author

The !type vs type was a mistake in the github post, sorry. The actual code is if(type), but it segfaulted, changing it to !type fixed some errors but not still had some issues and just having the else block seemed to lead to the fewest errors.

The code from earlier doesn't reproduce the subsequent errors. The error seems to occur around line 1400 of my mtrand.c (shown at bottom of this post)

There are a few different errors going on, I found there was one error related to this line:

__pyx_t_2 = (__Pyx_PySequence_Contains(__pyx_v_key, __pyx_t_3, Py_NE)); if (unlikely(__pyx_t_2 < 0)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 970; __pyx_clineno = __LINE__; goto __pyx_L1_error;}

I think this was raising the TypeError, I tried setting PySequence_Contains to just return true, that meant an error was thrown later (__pyx_t_5 = PyObject_GetItem(__pyx_t_3, __pyx_v_key); if (!__pyx_t_5) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 973; __pyx_clineno = __LINE__; goto __pyx_L1_error;}). This made it a little easier to work out what was happening. It seems that the dictionary _randint_type is being emptied at some point, when I run PyString_AsString(mp->ob_type->tp_repr(mp)) on the PyDict represenging _randint_type, it returns "{}". The dictionary in question is defined on line 580 of mtrand.pyx.

At some point I also put a print statement that printed what PySequence_Contains returned, it seemed that it would return 1 many times, the unit test would pass a few times then it would switch to returning 0, after that every time it was called it would return 0 (yielding an error), here's an illustrative example where the unittest has been ran 5 times:
Each time it runs test_nprand it does this:

import numpy as np
for i in range(50): # actuall number is 10000
	np.random.randint(10)

And the results look like this:

running test_nprand...11111111111111111111111111111111111111111111111111
ok
OK
running test_nprand...11111111111111111111111111111111111111111111111111
ok
OK
running test_nprand...11111111111110
FAILED
running test_nprand...0
FAILED
running test_nprand...0
FAILED

In each test np.random.randint is called many times, it seems to call PySequence_Contains on a dictionary each time it is called, it works consistently until it fails, then it fails consistently. It seems to fail between unittests/import statements.

It is random when the first 0 occurs, even if I put np.random.seed(0) on the line after the import (to set the seed so that the numbers generated are the same each time). I feel like this is GC related, possibly the objects in the dictionary are being GC'd or the dictionary is somehow being partially GC'd, but I'm not sure.

If it helps, the original code doesn't seem to cause this, my understanding of the changes was that it was something to do with GC, further supporting the idea that the bugs are GC related.

I haven't looked into clearNativeHandle much since it was rare and the type error looked easier/more promising to fix. Also, it may be that all the errors are related to something being GC'd when it shouldn't be and it's just when that occurs that determines what error is thrown.

At this point I've spent more time that I probably should have on this so I'm going to work on the other iterators for a bit (more on that in #13) before spending too much more time on this. (I also have a poster and presentation to prepare)

mtrand.c:

 /* "mtrand.pyx":970
 *         # 'generate_randint_helpers.py'
 *         key = np.dtype(dtype).name
 *         if key not in _randint_type:             # <<<<<<<<<<<<<<
 *             raise TypeError('Unsupported dtype "%s" for randint' % key)
 * 
 */
  __pyx_t_3 = __Pyx_GetModuleGlobalName(__pyx_n_s___randint_type); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 970; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_2 = (__Pyx_PySequence_Contains(__pyx_v_key, __pyx_t_3, Py_NE)); if (unlikely(__pyx_t_2 < 0)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 970; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_1 = (__pyx_t_2 != 0);
  if (__pyx_t_1) {

    /* "mtrand.pyx":971
 *         key = np.dtype(dtype).name
 *         if key not in _randint_type:
 *             raise TypeError('Unsupported dtype "%s" for randint' % key)             # <<<<<<<<<<<<<<
 * 
 *         lowbnd, highbnd, randfunc = _randint_type[key]
 */
    __pyx_t_3 = PyNumber_Remainder(((PyObject *)__pyx_kp_s_56), __pyx_v_key); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 971; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(((PyObject *)__pyx_t_3));
    __pyx_t_5 = PyTuple_New(1); if (unlikely(!__pyx_t_5)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 971; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_5);
    PyTuple_SET_ITEM(__pyx_t_5, 0, ((PyObject *)__pyx_t_3));
    __Pyx_GIVEREF(((PyObject *)__pyx_t_3));
    __pyx_t_3 = 0;
    __pyx_t_3 = PyObject_Call(__pyx_builtin_TypeError, ((PyObject *)__pyx_t_5), NULL); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 971; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_3);
    __Pyx_DECREF(((PyObject *)__pyx_t_5)); __pyx_t_5 = 0;
    __Pyx_Raise(__pyx_t_3, 0, 0, 0);
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
    {__pyx_filename = __pyx_f[0]; __pyx_lineno = 971; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    goto __pyx_L4;
  }
  __pyx_L4:;

  /* "mtrand.pyx":973
 *             raise TypeError('Unsupported dtype "%s" for randint' % key)
 * 
 *         lowbnd, highbnd, randfunc = _randint_type[key]             # <<<<<<<<<<<<<<
 * 
 *         # TODO: Do not cast these inputs to Python int
 */
  __pyx_t_3 = __Pyx_GetModuleGlobalName(__pyx_n_s___randint_type); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 973; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_5 = PyObject_GetItem(__pyx_t_3, __pyx_v_key); if (!__pyx_t_5) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 973; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_5);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;

@CalumFreeman
Copy link
Contributor Author

And also:

clearNativeHandle could be given a bad argument when it is called from C, that could be because the argument it is supposed to be passed has been GC'd in some way so it's getting a bad pointer or something like that.

@Stewori
Copy link
Owner

Stewori commented Aug 23, 2018

There are a few different errors going on

Let's strictly work through them by order of appearance. Feel free to open separate issues here if appropriate.

just having the else block seemed to lead to the fewest errors.

That feels like a bit too much of trial'n error. We need to tackle this more systematically.

If it helps, the original code doesn't seem to cause this

I suspect that solution would leak memory, at least under some circumstances. Of course, then it does not segfault too easily.

I think that you are right with the GC cause. Please assert this by

  • adding a print out to JyNIInitializer.SentinelFinalizer.finalize. That print out should be visible before the first 0 appears or at least close by because of async run.

  • triggering GC actively via System.gc() after receiving a bunch of 1s. Run System.gc(); time.sleep(2). After that only 0 should appear. The printout from the bullet above should pop up during gc run.

Then, do you have an idea about the reference relationship? There must be some object A depending on an object B, but B is GC'ed because of some flaw although still in use by A. Then A segfaults. We need to identify A and B. It sounds like B might be _randint_type. What is A? And how would A keep B alive?

Please drop me some notes how to reproduce this. Is the whole code uploaded in a branch of your fork? Can I just clone it and run?

Finally I am curious how reading assembler helped you to pin down the issue. You mentioned in one of the first posts I think. Did you decompile the binary? Can you share how you did that?

I know issues like this are the challenging part when working on JyNI. I've been through various of these. There is no other way than solving them one by one. But it finally pays off. NumPy support (i.e. what works so far) was achieved in this painful manner. Issue #2 tells part of this story but only the tip of the ice berg. Don't give up!

@CalumFreeman
Copy link
Contributor Author

That feels like a bit too much of trial'n error

I suspect that solution would leak memory

Yes, both of these would be terrible solutions, they were just to help narrow down the problem a bit more.

do you have an idea about the reference relationship?

I agree that B is probably _randint_type but I don't know what A is, _randint_type is declared outside of any classes, its needed by the RandomState class (the one calling it). It could simply be that a ref increment is missing from somewhere.

how to reproduce this

The branch Iterator_Support has the changes to dictionary that cause this, to cause the segfault, simply running numpy.random.randint(10) loads of times should be sufficient. The actual test I'm running is in the Building_Tests branch, in test_JyNI_Iterators.py. I'm running the test_nprand test ~200 times, it should break before it makes it to 200 (although it sometimes goes for a while so I terminate and re-run it).

reading assembler

I simply used the eclipse Window->Show View->Dissasembly when in the debugger perspective, if debug symbols are on then it maps the current breakpoint/instruction to where you are in assembly. I think this just allowed us to work out which variable was actually set to null and causing the segfault, but I don't remember the details.

triggering GC actively

I set a breakpoint in PySequence_Contains (with ignore count: ~50 to skip to the part we care about), at this point I used gdb to check the dictionary: print PyString_AsString(seq->ob_type->tp_repr(seq)) and this gave the correct dictionary. Then I used jVisualVM to run GC(I also used JNI calls from gdb once but jvisualVM is easier), the next time PySeqence_Contains was called, the dictionary was still correct, the time after it was not. This behaviour happens consistently. I don't know why it doesn't fail the first time after GC, but the consistent failure on the second call means it must be GC related.

Don't give up!

Don't worry, I was thinking I would shift focus to some other parts of iterators where I could be more effective then return to this in a few days. There will be a post on iterators with some questions, but as it turns out, basically I don't understand enough of GC to solve this problem or implement the other iterators. So it's all tied together really. My gut feeling is that somewhere a reference isn't being incremented and that leads to things being deallocated when they shouldn't, that's sort of the focus of my questions on the Iterators issue.

@Stewori
Copy link
Owner

Stewori commented Aug 24, 2018

I agree that B is probably _randint_type but I don't know what A is, _randint_type is declared outside of any classes, its needed by the RandomState class (the one calling it).

Okay. So RandomState needs it but does not keep it alive. Then A is probably the module dict. I just suspect where the issue might originate. Please take a look at PyImport_AddModule in JyNI.java. You will notice that it inserts a PyNativeRefHoldingStringMap as module dict.
I don't remember the details right now but for some reason the module dict needs to hold an extra reference to its elements to keep them alive. Yes, every dict usually does that, but not strictly target the native counterparts. E.g. in mirror mode the native part can be cleared independently from its Java pendant. Or if the native part is just a view on the Java part it can also be cleared while the Java backend lives on. So PyNativeRefHoldingStringMap puts explicit emphasis on keeping the native counterparts of its elements alive. That is required as the root of the native reference tree.
The native reference tree has some immortal roots. These are module dicts, but also JyNI.nativeStaticPyObjectHeads and JyNI.nativeInternedStrings which account for native objects that are not member of any module dict, but are still in a sense roots of the native reference graph.
Such roots are kind of immortal and everything on native side must be reachable from such root in order to be kept alive.

Okay, back to the module dict. I suspect that for some reason this mechanism is spoofed for the random module. Please observe how PyImport_AddModule operates when the random module is loaded. Is the PyNativeRefHoldingStringMap properly initialized? You can insert code into PyNativeRefHoldingStringMap.add to assert that _randint_type is really hosted there.
Then, is the module dict somehow replaced by the random module at some point? Maybe they want to hook into module membership somehow. Finally it might be a bug in PyNativeRefHoldingStringMap. I just skimmed through the code and saw a couple of todo notes.

@CalumFreeman
Copy link
Contributor Author

Sorry I haven't had much time to give to this recently, but I did try a few things:

  • PyNativeRefHoldingStringMap's __setitem__ does get called with _randint_type as the key, I don't know how to check if it is properly initialized though.
  • changing AS_JY_NO_GC to AS_JY_WITH_GC, didn't seem to help, but it was a very random guess I tried before asking what it did.
  • incrementing the reference count on creating the iterator (I think this is actually done in JyObject_New) doesn't seem to help either.

I also noticed that the reference count on the dictionary is not 0(it's ~10-20) when it is GC'd, I'm also still puzzled by it surviving a whole loop before segfaulting. I'm not confident I know what is resulting in it being emptied.

I'm wondering if it is maybe the java object backend that is getting GC'd. I tried to force GC on just that object by calling the finalize method on it during the PyDict_Contains method via JNI

I used the gdb Debugger Console to do this at a breakpoint just after env(-1);:
call (*env)->CallVoidMethod(env, dict, (*env)->GetMethodID(env, (*env)->GetObjectClass(env, dict), "finalize", "()V"))

This should get the jclass with (*env)->GetObjectClass(env, dict) then use that to get the jmethodID with (*env)->GetMethodID(env, (*env)->GetObjectClass(env, dict), "finalize", "()V") and then call the method. Sorry it's a nested call, but I can't store variables in the console.

But after calling finalize() on the jobject dict, it didn't segfault after many runs through the loop, then I called GC and after a loop it failed as per usual.

I'm away for the next little while then I have a presentation on the 10th of September and then I'm back to uni quite rapidly so I'm unlikely to have much more time on this (what time I have for JyNI will probably go to getting everything else tidied up and ready for PR's). Hopefully I can, I really want to solve this, but just so you know I might not be able to do much more on this.

@Stewori
Copy link
Owner

Stewori commented Sep 1, 2018

I just found time to attempt to reproduce this myself. It seems I don't even get to the point of the segfault. I cloned the branch "Iterator_Support", using NumPy 1.13.3:

import sys
sys.path.insert(0, '/data/workspace/linux/numpy/1.13.3')
import numpy as np
np.random.randint(10)

I get the following output:

JyNI warning: _PySequence_IterSearch not yet implemented.
Traceback (most recent call last):
  File "/data/workspace/linux/JyNI/calum/JyNI/JyNI-Demo/src/JyNI_numpy_rand.py", line 16, in <module>
    np.random.randint(10)
TypeError: 'builtin_function_or_method' object is not callable

Were there additional changes recently? Can you reproduce this on a fresh clone?
I am using the current trunk version of Jython.
What puzzles me is that I seem to get the same result using original JyNI. Maybe the branch wasn't checked out correctly...

@CalumFreeman
Copy link
Contributor Author

Hmm, I'm not sure. I think I removed the JyNI warning since I had _PySequence_IterSearch "implemented" (ie commented in since the code worked with dictionaries once tp_as_sequence was implemented). That would point to an issue with cloning or pushing.

But, I may have seen that before, I'm not sure if it was this problem or somewhere else. I'm not at work (and won't be for a bit) so I don't have my list of errors, my laptop also needs reconfiguring, if I have time once I've fixed my laptop, I'll try a fresh clone.

If this is a different manifestation of the same error then re-running the same code should give one of the other errors. Also does it appear every single time you run it? If so then that probably isn't my error since my error isn't consistent.

@Stewori
Copy link
Owner

Stewori commented Sep 1, 2018

Sorry, I had eclipse somehow using an outdated Jython.jar for execution. Now this error is no more. Will continue with attempting to reproduce the actual error...
Sorry for the confusion.

@Stewori
Copy link
Owner

Stewori commented Sep 1, 2018

I think I can reproduce it now.

import sys

sys.path.insert(0, '/data/workspace/linux/numpy/1.13.3')

import numpy as np
#from numpy import random as rn
from java.lang import System
import time

for i in range(200):
	print i
	np.random.randint(10)
System.gc()
time.sleep(2)

seems to reproduce it reliably.
Interestingly this is still a bit fragile if we iterate less, e.g. only 100 times. Maybe System.gc wouldn't actually run then, after all it's just a request, not a guarantee to run GC.
Otherwise something undeterministic might cause this in randint. Some unlucky memory fault brings JyNI in a state where gc will crash.

@Stewori
Copy link
Owner

Stewori commented Sep 2, 2018

From the logfile:
(Do you see roughly the same in your logfile?)

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 2143 C1 org.python.core.JyAttribute.delAttr(Lorg/python/core/PyObject;B)V (169 bytes) @ 0x00007f19297499bb [0x00007f19297496c0+0x2fb]
J 2148 C1 JyNI.JyNI.clearNativeHandle(Lorg/python/core/PyObject;)V (37 bytes) @ 0x00007f192974d02c [0x00007f192974cd60+0x2cc]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x666a0b]
V  [libjvm.so+0x6886fe]
V  [libjvm.so+0x68b02f]
C  [libJyNI.so+0x46f7e]  JyNI_CleanUp_JyObject+0xfc
C  [libJyNI.so+0xf42fe]  meth_dealloc+0x240
C  [libJyNI.so+0xfc87b]  JyGC_clearNativeReferences+0xefa
C  [libJyNI-Loader.so+0x561c]  Java_JyNI_JyNI_JyGC_1clearNativeReferences+0x34
j  JyNI.JyNI.JyGC_clearNativeReferences([JJ)Z+0
j  JyNI.gc.JyWeakReferenceGC$GCReaperThread.run()V+251
v  ~StubRoutines::call_stub
V  [libjvm.so+0x666a0b]
V  [libjvm.so+0x663fe4]
V  [libjvm.so+0x6645c7]
V  [libjvm.so+0x6a78a4]
V  [libjvm.so+0xa15547]
V  [libjvm.so+0xa15a0c]
V  [libjvm.so+0x8bae12]
C  [libpthread.so.0+0x76ba]  start_thread+0xca

This clearly shows that it is triggered by a GC-run. Somehow meth_dealloc is involved. Then it is e.g. useful to print method names on dealloc to get a clue what method object it is...
Also, we can see that weak references are involved. Probably something is only weakly referenced and cleaned up although still needed. We can add some output to weak reference creation to find out what objects this is about. Are there any weak references used by the random module?
Will continue.

@Stewori
Copy link
Owner

Stewori commented Sep 2, 2018

Hmm sometimes I get a different stack:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [mtrand.so+0xb43c]  __Pyx_PyObject_Call+0xc

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 2936  JyNI.JyNI.callPyCPeer(JLorg/python/core/PyObject;Lorg/python/core/PyObject;J)Lorg/python/core/PyObject; (0 bytes) @ 0x00007f1b0d115d6c [0x00007f1b0d115d00+0x6c]
J 3318 C2 JyNI.PyCFunction.__call__([Lorg/python/core/PyObject;[Ljava/lang/String;)Lorg/python/core/PyObject; (207 bytes) @ 0x00007f1b0d207b50 [0x00007f1b0d2076c0+0x490]
J 3333 C2 org.python.core.PyObject.__call__(Lorg/python/core/PyObject;)Lorg/python/core/PyObject; (16 bytes) @ 0x00007f1b0d1ebdb0 [0x00007f1b0d1ebd00+0xb0]
J 3159 C2 org.python.core.PyObject.__call__(Lorg/python/core/ThreadState;Lorg/python/core/PyObject;)Lorg/python/core/PyObject; (6 bytes) @ 0x00007f1b0dbb6ae0 [0x00007f1b0dbb6aa0+0x40]
j  org.python.pycode._pyx0.f$0(Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+397
j  org.python.pycode._pyx0.call_function(ILorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+24
J 2026 C1 org.python.core.PyTableCode.call(Lorg/python/core/ThreadState;Lorg/python/core/PyFrame;Lorg/python/core/PyObject;)Lorg/python/core/PyObject; (361 bytes) @ 0x00007f1b0d6af5cc [0x00007f1b0d6aea80+0xb4c]
j  org.python.core.PyCode.call(Lorg/python/core/ThreadState;Lorg/python/core/PyFrame;)Lorg/python/core/PyObject;+4
j  org.python.core.Py.runCode(Lorg/python/core/PyCode;Lorg/python/core/PyObject;Lorg/python/core/PyObject;)Lorg/python/core/PyObject;+154
j  org.python.util.PythonInterpreter.execfile(Ljava/io/InputStream;Ljava/lang/String;)V+21
j  org.python.util.jython.run([Ljava/lang/String;)V+780
j  org.python.util.jython.main([Ljava/lang/String;)V+5
v  ~StubRoutines::call_stub

I hope these issues have the same origin. Cannot tell yet.

@Stewori
Copy link
Owner

Stewori commented Sep 2, 2018

I think the second type of error is what you observe. I get it if the gc run passes and then call randint again twice. I suspect the faulty gc run has a chance of passing if the corrupted memory is still valid access for our process or something like that. The second type of error is a consequence of the invalid state. Well that's a guess so far. However given that the first type of error occurs during gc while the second type occurs after gc I clearly suspect that the first type of error is closer to the origin. We should definitely start there.

@Stewori
Copy link
Owner

Stewori commented Sep 2, 2018

Okay, after some investigation I came to the conclusion that at least the first type of error, i.e. the one involving JyAttribute.delAttr is not related to your changes.
I can reproduce it without using random module:

import sys

sys.path.insert(0, '/data/workspace/linux/numpy/1.13.3')

import numpy as np
from java.lang import System
import time

for i in range(2000):
	print i
	np.array(range(7))
print 'run gc...'
System.gc()
time.sleep(2)

One can circumvent this by modifying JyNI.clearNativeHandle as follows:

clearNativeHandle(PyObject object) {
		if (object == null) {
			System.err.println("JyNI-Warning: clearNativeHandle called with null!");
			return;
		}
		if (object instanceof PyCPeer)
			((PyCPeer) object).objectHandle = 0;
		else {
			if (!(object instanceof PyObject))
				System.out.println("clearNativeHandle received Non-PyObject!");
			else
				JyAttribute.delAttr(object, JyAttribute.JYNI_HANDLE_ATTR);

Obviously that's not the solution, although I would keep this sanity check. We'l have to find the root cause of having a non-PyObject going in (Interesting: despite method declaration, a type-invalid call from JNI level is apparently possible).

JyNI's gc machinery is really complex and so far I never investigated it in the complex NumPy setting. It seems some workload above basic tests can trigger some bugs...

I think I will switch back to JyNI master branch for this. Just to be sure it is really unrelated to your work.

@Stewori
Copy link
Owner

Stewori commented Sep 3, 2018

Did not yet switch back to master yet. But I observed the following:
The first type of error goes away if I run GC right after import numpy, e.g.:

import numpy as np
System.gc()
time.sleep(2)

So I guess the obscure invalid jobject is due to some memory issue. Importing numpy in JyNI context is a serious memory demand. Maybe the JVM runs out of native jobject references.
E.g. see EnsureLocalCapacity in https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html. Also, JVM does not take native memory into account when deciding whether to run GC. Maybe we will have to keep track of this in JyNI somehow and launch GC explicitly form time to time.

That said, I still encounter the second type of error. Will now shift focus to that.

@Stewori
Copy link
Owner

Stewori commented Sep 5, 2018

I really had not much time for this the last days. Just spent about 15 minutes for investigation and started looking at __Pyx_PyObject_Call, given that the stack starts with C [mtrand.so+0xb43c] __Pyx_PyObject_Call+0xc:

#if CYTHON_COMPILING_IN_CPYTHON
static CYTHON_INLINE PyObject* __Pyx_PyObject_Call(PyObject *func, PyObject *arg, PyObject *kw);
#else
#define __Pyx_PyObject_Call(func, arg, kw) PyObject_Call(func, arg, kw)
#endif

This smells like Cython would usually inline CPython's PyObject_Call, which could be disastrous in JyNI case. We should - for now - ensure (by hand, i.e. commenting out the #if branch) that no inlining takes place. However, I don't think this causes the issue. It doesn't fit with the GC involvement we observe. Additionally JyNI's PyObject_Call isn't too different from CPython's. It only prepends a potential delegation to Jython for certain functions:

PyObject *
PyObject_Call(PyObject *func, PyObject *arg, PyObject *kw)
{
	jobject delegate = JyNI_GetJythonDelegate(func);
	if (delegate)
	{
		return JyNI_PyObject_Call(delegate, arg, kw);
	}
	else
	{
		/* ... CPython's PyObject_Call code ...*/
	}
}

I guess that __Pyx_PyObject_Call simply is not used with functions that would delegate to Jython. I think it is only called with internal functions of the pyx module which are all entirely generated C code. If this wasn't the case we would have encountered issues on this front much earlier. So we're probably safe here (otherwise we might have to give up binary compatibility for Cython-generated stuff). Still, for further investigation of this issue we should nuke this inlining.

@Stewori
Copy link
Owner

Stewori commented Sep 5, 2018

Okay, without the inlining I get an error even earlier:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 2448 C1 org.python.core.ArgParser.check()V (208 bytes) @ 0x00007f98f58d1e40 [0x00007f98f58d1b20+0x320]
j  org.python.core.ArgParser.<init>(Ljava/lang/String;[Lorg/python/core/PyObject;[Ljava/lang/String;Ljava/lang/String;)V+21
j  org.python.core.PyBaseException.BaseException___init__([Lorg/python/core/PyObject;[Ljava/lang/String;)V+15
j  org.python.core.PyBaseException$BaseException___init___exposer.__call__([Lorg/python/core/PyObject;[Ljava/lang/String;)Lorg/python/core/PyObject;+9
J 2442 C1 org.python.core.Deriveds.dispatch__init__(Lorg/python/core/PyObject;[Lorg/python/core/PyObject;[Ljava/lang/String;)V (70 bytes) @ 0x00007f98f58c773c [0x00007f98f58c7640+0xfc]
j  org.python.core.PyBaseExceptionDerived.dispatch__init__([Lorg/python/core/PyObject;[Ljava/lang/String;)V+3
J 2478 C1 org.python.core.PyType.type___call__([Lorg/python/core/PyObject;[Ljava/lang/String;)Lorg/python/core/PyObject; (93 bytes) @ 0x00007f98f58f306c [0x00007f98f58f2120+0xf4c]
J 1970 C1 org.python.core.PyType.__call__([Lorg/python/core/PyObject;[Ljava/lang/String;)Lorg/python/core/PyObject; (7 bytes) @ 0x00007f98f5697a34 [0x00007f98f5697940+0xf4]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x666a0b]
V  [libjvm.so+0x68071a]
V  [libjvm.so+0x68223d]
C  [libJyNI.so+0x4511c]  JyNI_InitJythonPyException+0xa3
C  [libJyNI.so+0x4573b]  JyNI_JythonPyObject_FromPyObject+0x4fe
C  [libJyNI.so+0x744e6]  BaseException_init+0x141
C  [libJyNI.so+0x7b941]  type_call+0x191
C  [libJyNI.so+0x9e93e]  PyObject_Call+0xb7
C  [mtrand.so+0x192e2]  __pyx_pw_6mtrand_11RandomState_21randint+0x1422

In principle this should also be investigated but for sake of priority I'll switch back to the original implementation for now and attempt to pin the root cause under the assumption that it's not the inlining.

Note that the inlined PyObject_Call implementation can be found somewhere at the bottom of mtrand.c, in NumPy 1.13.3 at line 45125:

/* PyObjectCall */
  #if CYTHON_COMPILING_IN_CPYTHON
static CYTHON_INLINE PyObject* __Pyx_PyObject_Call(PyObject *func, PyObject *arg, PyObject *kw) {
    PyObject *result;
    ternaryfunc call = func->ob_type->tp_call; // func->ob_type is NULL in the segfault case
    if (unlikely(!call))
        return PyObject_Call(func, arg, kw);
    if (unlikely(Py_EnterRecursiveCall((char*)" while calling a Python object")))
        return NULL;
    result = (*call)(func, arg, kw);
    Py_LeaveRecursiveCall();
    if (unlikely(!result) && unlikely(!PyErr_Occurred())) {
        PyErr_SetString(
            PyExc_SystemError,
            "NULL result without error in PyObject_Call");
    }
    return result;
}
#endif

The segfault occurs in line 2 where func->ob_type is NULL.
I also identified that the other near-top lines in the stacktrace

J 2936  JyNI.JyNI.callPyCPeer(JLorg/python/core/PyObject;Lorg/python/core/PyObject;J)Lorg/python/core/PyObject; (0 bytes) @ 0x00007f1b0d115d6c [0x00007f1b0d115d00+0x6c]
J 3318 C2 JyNI.PyCFunction.__call__([Lorg/python/core/PyObject;[Ljava/lang/String;)Lorg/python/core/PyObject; (207 bytes) @ 0x00007f1b0d207b50 [0x00007f1b0d2076c0+0x490]

refer to the function call randint. Not a surprise actually. But good to know.

In case that func->ob_type is NULL I printed a stacktrace:

/data/workspace/linux/numpy/1.13.3/numpy/random/mtrand.so(+0xb47fc) [0x7f803036d7fc]
/data/workspace/linux/numpy/1.13.3/numpy/random/mtrand.so(+0xb48d6) [0x7f803036d8d6]
/data/workspace/linux/numpy/1.13.3/numpy/random/mtrand.so(+0x3ece3) [0x7f80302f7ce3]
/data/workspace/linux/numpy/1.13.3/numpy/random/mtrand.so(+0x3e2a2) [0x7f80302f72a2]
/data/workspace/linux/JyNI/calum/JyNI/build/libJyNI.so(PyCFunction_Call+0xfb) [0x7f8038330faf]
/data/workspace/linux/JyNI/calum/JyNI/build/libJyNI.so(JyNI_callPyCPeer+0xae) [0x7f803827d2fd]
/data/workspace/linux/JyNI/calum/JyNI/build/libJyNI-Loader.so(Java_JyNI_JyNI_callPyCPeer+0x44) [0x7f80385b026c]
[0x7f807d19d626]

So far I failed to convince the NumPy build system to put function names into mtrand.so.
I tried to add -g and -rdynamic according to https://stackoverflow.com/questions/6934659/how-to-make-backtrace-backtrace-symbols-print-the-function-names but that didn't do the trick. Maybe I should look into libunwind as mentioned there, but actually @CalumFreeman already identified the failing line to be line 16280 __pyx_t_5 = __Pyx_GetModuleGlobalName(__pyx_n_s_randint_type); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 970, __pyx_L1_error) , so I will instead try to confirm that now.

@Stewori
Copy link
Owner

Stewori commented Sep 5, 2018

Okay, the exact failure happens in line __pyx_t_5 = __Pyx_PyObject_Call(__pyx_builtin_TypeError, __pyx_t_3, NULL); but that's just because the inlined PyObject_Call does not contain the logic to repair the not filled in exception type. This was discussed earlier in this thread. It refers to the key error if key not in _randint_type:. So this clearly confirms what we already knew: Somehow mtrand's module dict gets broken by GC.

Todo: We should really take care to initialize the exception types properly. (Where?) (Source this out as separate issue?)

@Stewori
Copy link
Owner

Stewori commented Sep 5, 2018

When __Pyx_GetModuleGlobalName calls PySequence_Contains it seems not to find the key - in my case int64 - in _randint_type. However by some surrounding checks I am convinced now that the dictionary _randint_type is still in place and well populated. I can print it right before the key in question is said to be not found:
{'bool': (0, 2, <built-in function _rand_bool>), 'int16': (-32768, 32768, <built-in function _rand_int16>), 'int32': (-2147483648, 2147483648L, <built-in function _rand_int32>), 'int64': (-9223372036854775808L, 9223372036854775808L, <built-in function _rand_int64>), 'int8': (-128, 128, <built-in function _rand_int8>), 'uint64': (0, 18446744073709551616L, <built-in function _rand_uint64>), 'uint32': (0, 4294967296L, <built-in function _rand_uint32>), 'uint8': (0, 256, <built-in function _rand_uint8>), 'uint16': (0, 65536, <built-in function _rand_uint16>)}

Whatever goes wrong here is not because the dictionary is broken. Somehow the iteration process gets broken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants