PyExc_TypeError has ob_type == NULL #32
Calum, that's a brilliant finding. For some objects, especially type objects (
If that solves it we can still consider if there's a better place for this. You might also want to add some logging to PyType_Ready to see when it was first called for |
tl;dr: I have no idea 😃! I'm posting this so that I can remember where I got to on Monday. Any comments would be helpful, but don't spend too long on it since I'll continue investigating on Monday.

First fix

I added the code you suggested to
The problem seems to occur when calling
Second "fix"

I have found a sort of fix for the new segfault: I changed the if statement (line 1743 onwards):
To simply call the else section:
This may not be a good fix, while it does seem to stop the particular segfault caused by using I suspect the So it may be that this fix works fine and all the new errors are really just hidden/pre-existing ones that are more visible, or it may be that this "fix" causes lots of errors. I don't know. Also, when calling exceptions from C the normal way ( |
I remember it is not easy to find the actual C-source code of
Do you know which line causes the segfault? The code
somehow looks as if the if and else sections were accidentally swapped. Can it be that this did not cause issues earlier? Please try to change
Can you add something like
And something similar to check |
I checked about
|
Regarding
Is the warning triggered? Given that it nulls an object handle, it can potentially cause a subsequent segfault if used improperly. That's why we need to know how it is called. I suspect it is called as part of a gc run. The original code contains some commented-out printouts. Maybe it helps to check their output. Also consider printing a stack trace there. |
|
The The code from earlier doesn't reproduce the subsequent errors. The error seems to occur around line 1400 of my mtrand.c (shown at bottom of this post) There are a few different errors going on, I found there was one error related to this line:
I think this was raising the TypeError, I tried setting At some point I also put a print statement that printed what
And the results look like this:
In each test It is random when the first 0 occurs, even if I put If it helps, the original code doesn't seem to cause this. My understanding of the changes was that it was something to do with GC, further supporting the idea that the bugs are GC related. I haven't looked into At this point I've spent more time than I probably should have on this, so I'm going to work on the other iterators for a bit (more on that in #13) before spending too much more time on this. (I also have a poster and presentation to prepare.) mtrand.c:
|
And also:
|
Let's strictly work through them by order of appearance. Feel free to open separate issues here if appropriate.
That feels like a bit too much trial and error. We need to tackle this more systematically.
I suspect that solution would leak memory, at least under some circumstances. Of course, then it does not segfault too easily. I think that you are right with the GC cause. Please assert this by
Then, do you have an idea about the reference relationship? There must be some object A depending on an object B, but B is GC'ed because of some flaw although still in use by A. Then A segfaults. We need to identify A and B. It sounds like B might be Please drop me some notes on how to reproduce this. Is the whole code uploaded in a branch of your fork? Can I just clone it and run? Finally, I am curious how reading assembler helped you to pin down the issue. You mentioned it in one of the first posts, I think. Did you decompile the binary? Can you share how you did that? I know issues like this are the challenging part when working on JyNI. I've been through several of these. There is no other way than solving them one by one. But it finally pays off. NumPy support (i.e. what works so far) was achieved in this painful manner. Issue #2 tells part of this story, but only the tip of the iceberg. Don't give up! |
Yes, both of these would be terrible solutions, they were just to help narrow down the problem a bit more.
I agree that B is probably
The branch Iterator_Support has the changes to dictionary that cause this, to cause the segfault, simply running
I simply used the eclipse
I set a breakpoint in
Don't worry, I was thinking I would shift focus to some other parts of iterators where I could be more effective then return to this in a few days. There will be a post on iterators with some questions, but as it turns out, basically I don't understand enough of GC to solve this problem or implement the other iterators. So it's all tied together really. My gut feeling is that somewhere a reference isn't being incremented and that leads to things being deallocated when they shouldn't, that's sort of the focus of my questions on the Iterators issue. |
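To make the suspected failure mode concrete, here is a minimal sketch in plain C. It does not use the CPython or JyNI headers; `RefObj`, `Holder` and all function names are hypothetical stand-ins invented for illustration. It shows an object A that stores a pointer to B without taking its own reference, so B is torn down while A still points at it. To keep the sketch safe to run, "deallocation" only clears an `alive` flag instead of freeing memory. This illustrates the hypothesis only; it is not a diagnosis of the actual bug.

```c
/* Hypothetical reference-counted object; not the real PyObject layout. */
typedef struct ref_obj {
    long refcnt;
    int alive;   /* cleared on "deallocation" so the state stays inspectable */
} RefObj;

/* Hypothetical object A that depends on some object B. */
typedef struct holder {
    RefObj *dep;
} Holder;

void ref_incref(RefObj *o)
{
    o->refcnt++;
}

void ref_decref(RefObj *o)
{
    /* When the last reference goes away, the object is torn down. */
    if (--o->refcnt == 0)
        o->alive = 0;
}

/* Correct: A takes its own reference, so B outlives the caller's decref. */
void holder_set_owned(Holder *a, RefObj *b)
{
    ref_incref(b);
    a->dep = b;
}

/* Flawed: A only borrows the pointer; B may die while A still uses it. */
void holder_set_borrowed(Holder *a, RefObj *b)
{
    a->dep = b;
}
```

With `holder_set_borrowed`, a single `ref_decref` by the original owner leaves `a->dep` pointing at a dead object; with `holder_set_owned`, the object stays alive until A drops its reference as well.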
Okay. So Okay, back to the module dict. I suspect that for some reason this mechanism is spoofed for the random module. Please observe how |
Sorry I haven't had much time to give to this recently, but I did try a few things:
I also noticed that the reference count on the dictionary is not 0 (it's ~10-20) when it is GC'd, and I'm still puzzled by it surviving a whole loop before segfaulting. I'm not confident I know what is resulting in it being emptied. I'm wondering if it is maybe the Java object backend that is getting GC'd. I tried to force GC on just that object by calling the finalize method on it during the I used the gdb Debugger Console to do this at a breakpoint just after This should get the But after calling I'm away for the next little while, then I have a presentation on the 10th of September, and then I'm back to uni quite rapidly, so I'm unlikely to have much more time on this (what time I have for JyNI will probably go to getting everything else tidied up and ready for PRs). Hopefully I can, I really want to solve this, but just so you know I might not be able to do much more on this. |
I just found time to attempt to reproduce this myself. It seems I don't even get to the point of the segfault. I cloned the branch "Iterator_Support", using NumPy 1.13.3:
I get the following output:
Were there additional changes recently? Can you reproduce this on a fresh clone? |
Hmm, I'm not sure. I think I removed the JyNI warning since I had _PySequence_IterSearch "implemented" (ie commented in since the code worked with dictionaries once tp_as_sequence was implemented). That would point to an issue with cloning or pushing. But, I may have seen that before, I'm not sure if it was this problem or somewhere else. I'm not at work (and won't be for a bit) so I don't have my list of errors, my laptop also needs reconfiguring, if I have time once I've fixed my laptop, I'll try a fresh clone. If this is a different manifestation of the same error then re-running the same code should give one of the other errors. Also does it appear every single time you run it? If so then that probably isn't my error since my error isn't consistent. |
Sorry, I had eclipse somehow using an outdated Jython.jar for execution. Now this error is no more. Will continue with attempting to reproduce the actual error... |
I think I can reproduce it now.
seems to reproduce it reliably. |
From the logfile:
This clearly shows that it is triggered by a GC-run. Somehow |
Hmm sometimes I get a different stack:
I hope these issues have the same origin. Cannot tell yet. |
I think the second type of error is what you observe. I get it if the gc run passes and then call |
Okay, after some investigation I came to the conclusion that at least the first type of error, i.e. the one involving
One can circumvent this by modifying
Obviously that's not the solution, although I would keep this sanity check. We'll have to find the root cause of having a non-PyObject going in (Interesting: despite method declaration, a type-invalid call from JNI level is apparently possible). JyNI's gc machinery is really complex and so far I never investigated it in the complex NumPy setting. It seems some workload above basic tests can trigger some bugs... I think I will switch back to JyNI master branch for this. Just to be sure it is really unrelated to your work. |
Did not switch back to master yet. But I observed the following:
So I guess the obscure invalid That said, I still encounter the second type of error. Will now shift focus to that. |
I really have not had much time for this in the last few days. I just spent about 15 minutes investigating and started looking at __Pyx_PyObject_Call, given that the stack starts with
This smells like Cython would usually inline CPython's
I guess that |
Okay, without the inlining I get an error even earlier:
In principle this should also be investigated but for sake of priority I'll switch back to the original implementation for now and attempt to pin the root cause under the assumption that it's not the inlining. Note that the inlined
The segfault occurs in line 2 where
refer to the function call In case that
So far I failed to convince the NumPy build system to put function names into mtrand.so. |
Okay, the exact failure happens in line Todo: We should really take care to initialize the exception types properly. (Where?) (Source this out as separate issue?) |
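As a rough illustration of what "initializing the exception types properly" could mean, here is a self-contained sketch in plain C. It is modelled loosely on my understanding that CPython's PyType_Ready fills in a NULL ob_type from the base type; `LazyType` and `lazy_type_ready` are hypothetical names, and nothing here is JyNI's or CPython's actual code. Where such a step would belong in JyNI is exactly the open question above.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced type object; not the real PyTypeObject. */
typedef struct lazy_type {
    struct lazy_type *ob_type;  /* the type's own type (its metatype) */
    struct lazy_type *tp_base;  /* base type, or NULL for a root type */
    int ready;                  /* set once initialization has completed */
} LazyType;

/*
 * Make sure a type is usable before anything dereferences its ob_type.
 * Returns 0 on success, -1 if ob_type cannot be determined.
 */
int lazy_type_ready(LazyType *t)
{
    if (t == NULL)
        return -1;
    if (t->ready)
        return 0;

    /* Bases are readied first so their ob_type can be inherited. */
    if (t->tp_base != NULL && lazy_type_ready(t->tp_base) < 0)
        return -1;

    /* A NULL ob_type is filled in from the base type, if there is one. */
    if (t->ob_type == NULL && t->tp_base != NULL)
        t->ob_type = t->tp_base->ob_type;

    if (t->ob_type == NULL)
        return -1;

    t->ready = 1;
    return 0;
}
```

The point of the sketch is that a statically declared type may legitimately start out with ob_type == NULL, so some step has to run before the first caller reads ob_type->tp_call or ob_type->tp_flags.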
When Whatever goes wrong here is not because the dictionary is broken. Somehow the iteration process gets broken. |
I have been investigating #13, and I found that numpy.random.randint() segfaults occasionally. I'm not sure why it occurs so rarely or what the root cause is, but (after finding myself reading assembly) the actual segfault is connected to PyObject_Call being passed PyExc_TypeError as the function to be called. This then causes a segfault as soon as a property of the ob_type is accessed (in this case: ob_type->tp_flags). I have written a simple C function that reliably causes this bug:

CPython happily runs this (when it is part of a C extension), while JyNI segfaults. In JyNI, PyExc_TypeError->ob_type==NULL, and I think this is the cause of the segfault. I haven't been able to work out if PyExc_TypeError in CPython has ob_type==NULL. However, I can't find any checks for ob_type==NULL throughout the Python codebase, which suggests the property should never be NULL. I did find a check in JyNI for this, and it has this comment just before it:

I don't quite understand what this means but it seems related. Could you (@Stewori) possibly explain a bit about what is happening and what this means?

This could be solved in this case by working around it in the PyObject_Call function: it's possible to work out that you are dealing with an error/exception, and to figure out what type, without accessing any properties of ob_type. But if ob_type should never be NULL, perhaps it would be better to solve this more generally instead of trying to work around it.

With regards to numpy.random.randint(), it is possible that this error is always caught internally, since it doesn't seem to set an error message or call the type error properly. Maybe in JyNI it just segfaults before it can be caught. But I'm not entirely sure. I haven't been able to track the particular problem down, as it seems to be affected by timing (so adding print statements or running in debug mode makes it less likely to occur), and it is random when it occurs (so I need to run numpy.random.randint() millions of times to get the error to appear).
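The crash mechanism described above can be sketched without any Python headers. The structs and names below (`MockType`, `MockObject`, `MOCK_TPFLAGS_BASE_EXC_SUBCLASS`, `mock_is_exception_class`) are hypothetical stand-ins, not the real CPython or JyNI definitions. Reading ob_type->tp_flags through a NULL ob_type is a NULL pointer dereference, so the sketch checks the pointer first and reports the uninitialized state instead of crashing. This corresponds to the workaround idea only, not to a proper fix.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the type and object headers. */
typedef struct mock_type {
    unsigned long tp_flags;
} MockType;

typedef struct mock_object {
    long ob_refcnt;
    MockType *ob_type;
} MockObject;

/* Assumed flag bit marking exception classes in this mock model. */
#define MOCK_TPFLAGS_BASE_EXC_SUBCLASS (1UL << 30)

/*
 * Returns 1 if obj is flagged as an exception class, 0 if it is not,
 * and -1 if obj or its ob_type is NULL. An unguarded
 * obj->ob_type->tp_flags would dereference NULL in the last case.
 */
int mock_is_exception_class(const MockObject *obj)
{
    if (obj == NULL || obj->ob_type == NULL)
        return -1;
    return (obj->ob_type->tp_flags & MOCK_TPFLAGS_BASE_EXC_SUBCLASS) != 0;
}
```

A guard like this only hides the symptom at one call site. If ob_type is never supposed to be NULL, every other access to ob_type remains unprotected, which is the argument for solving it more generally.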