Shutdown issues with exceptions #15

Closed
max3-2 opened this issue Feb 11, 2021 · 20 comments

Labels
fixed Fix for issue has been released.
Comments

@max3-2
Contributor

max3-2 commented Feb 11, 2021

I have to "reopen" the issue with the JVM shutdown again.
Partly to document an underlying problem which might help someone else with debugging, partly to stress the importance of fixing the current solution.

With the current setup, atexit calls .exit(0). This shuts down the Python interpreter with exit status 0. Additionally, as stated in the atexit docs:

If an exception is raised during execution of the exit handlers, a traceback is printed (unless SystemExit is raised) and the exception information is saved. After all exit handlers have had a chance to run the last exception to be raised is re-raised.

Thus, any exception raised by virtually ANY code after importing the client will be printed, but is lost afterwards. A standard use case, at least for me, is building a Python script which runs a model: loading it, making some adaptations, solving, etc. Another case would be running a script using subprocesses.
In both cases, neither stderr nor the exit code is set. This is quite drastic...
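
To make the mechanism concrete, here is a minimal, Python-only sketch (no Comsol or JPype involved; os._exit stands in for the hard JVM exit that the handler triggers):

import atexit
import os

# An exit handler like this masks the interpreter's own exit status:
atexit.register(lambda: os._exit(0))

raise RuntimeError('traceback is printed, but the process exits with 0')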

Have you had any issues running on JPype regarding the changes between 0.7.5 and >1.0? I did a backward test, and with 0.7.5 everything works smoothly, as you stated in the original issues (this also holds on macOS).

@john-hen
Collaborator

Can you elaborate what you mean by this:

Thus, any exception raised by virtually ANY code after importing the client will be printed, but is lost afterwards.

Maybe with a code example. I don't quite see the problem with exceptions. I also don't understand how that quote from the atexit docs applies, as it only talks about exceptions raised inside the exit handler(s). As far as I can tell, exceptions (anywhere else in the code) are handled properly, even those raised inside the JVM.

I do see your point concerning the exit code. Currently, a script like this one

import mph, sys
client = mph.Client()
sys.exit(1)

exits with code 0, and not 1 as intended. It surprises me a bit as I didn't think Python would pay any attention to the exit code passed to the JVM inside the exit handler, which is indeed 0. But apparently it does. Maybe some JPype magic takes care of that?

Anyway, according to this Stack Overflow question, this issue can be fixed with a little bit of monkey-patching.
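
A minimal sketch of what that monkey-patch could look like (illustrative names, not the final code): wrap sys.exit so that the requested status is remembered, and have the exit handler pass it on to the JVM instead of the hard-coded 0.

import sys

exit_code = 0                     # remembered for the atexit handler
_original_exit = sys.exit

def exit(code=0):
    global exit_code
    exit_code = code if isinstance(code, int) else 1
    _original_exit(code)          # raises SystemExit as usual

sys.exit = exit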

@max3-2
Contributor Author

max3-2 commented Feb 11, 2021

I have to patch up some code for examples, but I will need some days, sorry. To give you a quick answer: running something like the above, but

import mph, sys
client = mph.Client()
raise RuntimeError('Something')

and connecting it to a subprocess will not only yield exit code 0 but also an empty stderr. Leaving the client out will give you an exception in stderr and a proper exit code. The empty stderr and exit code 0 make it virtually impossible to check if a subprocess raised an exception after the client = ... line.
If atexit finishes normally, i.e. without forcing an interpreter shutdown, the exception is in stderr...
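
To illustrate, a parent process checking on such a script (hypothetical file name script.py) sees nothing wrong:

import subprocess

result = subprocess.run(['python', 'script.py'],
                        capture_output=True, text=True)
print(result.returncode)    # 0, despite the RuntimeError
print(repr(result.stderr))  # '', the traceback is lost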

@john-hen
Collaborator

No worries, take your time.

I was planning on adding an example to the documentation that demonstrates running multiple processes, either via the subprocess module or via multiprocessing, but also haven't found time for that yet. It would be an instructive example, that's for sure, so the more we learn about it, the better.

As hinted at in #10, I have built a job scheduler based on MPh before. (I'm not at liberty to publish it, as I coded that on company time.) It's basically a Qt front-end for running simulations in parallel. The different "workers" were started as subprocesses, but the "scheduler" that started them didn't rely on stderr or exit codes. I instead used file-based communication to relay status information between the processes. So I may have unintentionally worked around this problem.
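
In essence, the relay worked something like this (a purely illustrative sketch; the actual scheduler code is not public): each worker writes its status to a file, and the scheduler polls those files.

import json
from pathlib import Path

# Worker side: write the current status to a file.
status = Path('worker-1.json')
status.write_text(json.dumps({'state': 'solving', 'progress': 0.42}))

# Scheduler side: poll the file periodically.
print(json.loads(status.read_text()))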

Regarding the JPype issue you opened: It remains a mystery why JPype 0.7.5 worked so much better with Comsol. But, as explained in #1, going back to the old JPype isn't really an option, as it won't work with the latest Python versions. Calling .exit() is our best option at this point, I think. Calling .halt() would not change things for the better, unless it somehow returned control to Python afterwards, which I don't think it does. I've also tinkered with Comsol threads (in this old test script, which I've since removed), but got nowhere. Though I also didn't really know what I was doing there. After all, the Comsol implementation is a black box for us.

@Thrameos

For reference, here are the differences in question.

From native/common/jp_context.cpp

void JPContext::shutdownJVM()
{
        JP_TRACE_IN("JPContext::shutdown");
        if (m_JavaVM == NULL)
                JP_RAISE(PyExc_RuntimeError, "Attempt to shutdown without a live JVM");
        //      if (m_Embedded)
        //              JP_RAISE(PyExc_RuntimeError, "Cannot shutdown from embedded Python");

        // Wait for all non-demon threads to terminate
        JP_TRACE("Destroy JVM");
        {
                JPPyCallRelease call;
====== old
                JPJavaFrame frame(this);
                JPPyCallRelease release;
                if (m_JavaContext.get() != 0)
                        frame.CallVoidMethodA(m_JavaContext.get(), m_ShutdownMethodID, 0);
====== new
                m_JavaVM->DestroyJavaVM();
======
        }

        JP_TRACE("Delete resources");
        for (std::list<JPResource*>::iterator iter = m_Resources.begin();
                        iter != m_Resources.end(); ++iter)
        {
                delete *iter;
        }
        m_Resources.clear();

        // unload the jvm library
        JP_TRACE("Unload JVM");
        m_JavaVM = NULL;
        JPPlatformAdapter::getAdapter()->unloadLibrary();
        JP_TRACE_OUT;
}

In the old version, there is a message to Java which tells it to free resources; then it goes straight to unloading the library, which effectively halts a running JVM without any cleanup. Crashes could happen because any calls between the deletion of resources and the termination would access dead objects.

In the new version, we first issue a destroy, which must not return until all Java threads and proxies are terminated. Then, once everything is dead, we clean up the resources and finally unload the JVM.

There is a possible argument that we should unload before we kill the resources. That would have made it a bit safer in the old version, but there is still the possibility of an active proxy which would crash if it were waiting with a Java resource on the call stack.

@john-hen
Collaborator

john-hen commented Feb 12, 2021

@Thrameos:
I'm pretty sure it's the m_JavaVM->DestroyJavaVM() call where the delay occurs. That's the conclusion I came to in issue #1.

I don't really know what proxies Comsol might create. (Or what proxies really are, due to my very limited knowledge of Java.) But I did try to terminate the non-daemon threads that Comsol creates. Only that didn't solve the issue either.

Here's how far I got. I took this test script and added the following debug code right before calling jpype.shutdownJVM():

# List all Java threads, sorted by thread id.
threads = jpype.java.lang.Thread.getAllStackTraces()
current = jpype.java.lang.Thread.currentThread()
threads = sorted(threads, key=lambda thread: thread.getId())

for thread in threads:
    id = int(thread.getId())
    name = str(thread.getName())
    daemon = '(daemon)' if thread.isDaemon() else ''
    print(f'{id:2}: {name:<25} {daemon}', flush=True)

# Force all non-daemon threads, except the current one, to stop.
for thread in threads:
    if thread != current and not thread.isDaemon():
        print(f'Stopping thread {thread.getId()}.')
        thread.stop()
        while not thread.isInterrupted():
            pass

# Report which threads are still alive.
for thread in threads:
    id = int(thread.getId())
    name = str(thread.getName())
    state = '(alive)' if thread.isAlive() else '(stopped)'
    print(f'{id:2}: {name:<25} {state}', flush=True)

If we comment out the lines in the test script where the Comsol client loads the simulation model from a file, the output is this:

Starting Comsol's Java VM via JPype 1.2.1.
Java VM started in 0.547 seconds.
Starting stand-alone Comsol client.
Client started in 2.632 seconds.
 1: main
 2: Reference Handler         (daemon)
 3: Finalizer                 (daemon)
 4: Signal Dispatcher         (daemon)
 5: Attach Listener           (daemon)
10: Python Reference Queue    (daemon)
 1: main                      (alive)
 2: Reference Handler         (alive)
 3: Finalizer                 (alive)
 4: Signal Dispatcher         (alive)
 5: Attach Listener           (alive)
10: Python Reference Queue    (alive)
Shutting down Java VM.
Java VM shut down in 0.275 seconds.

So here the shutdown works fine. But once the model is actually loaded, I get this:

Starting Comsol's Java VM via JPype 1.2.1.
Java VM started in 0.556 seconds.
Starting stand-alone Comsol client.
Client started in 2.671 seconds.
Loading model file.
Model loaded in 14.944 seconds.
 1: main
 2: Reference Handler         (daemon)
 3: Finalizer                 (daemon)
 4: Signal Dispatcher         (daemon)
 5: Attach Listener           (daemon)
10: Python Reference Queue    (daemon)
13: Thread-2                  (daemon)
15: EMF Reference Cleaner     (daemon)
18: pool-5-thread-1
19: pool-6-thread-1
20: pool-7-thread-1
Stopping thread 18.
Stopping thread 19.
Stopping thread 20.
 1: main                      (alive)
 2: Reference Handler         (alive)
 3: Finalizer                 (alive)
 4: Signal Dispatcher         (alive)
 5: Attach Listener           (alive)
10: Python Reference Queue    (alive)
13: Thread-2                  (alive)
15: EMF Reference Cleaner     (alive)
18: pool-5-thread-1           (stopped)
19: pool-6-thread-1           (stopped)
20: pool-7-thread-1           (stopped)
Shutting down Java VM.
Java VM shut down in 60.070 seconds.

So there are three extra non-daemon threads that Comsol apparently creates when it loads the model into memory. And two daemon threads as well. However, forcing the non-daemon threads to stop doesn't eliminate the shutdown delay. I would conclude from this that the threads as such are not the issue. It must be some other kind of resource.

@Thrameos

Seems like a deadlock issue. There is a Java utility to dump all thread states. Unfortunately, as this is commercial software, it would be difficult to replicate.
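
For what it's worth, the same information that utility (presumably jstack) reports can also be pulled from inside the process via JPype. A sketch, assuming the JVM is up:

import jpype

ManagementFactory = jpype.JClass('java.lang.management.ManagementFactory')
bean = ManagementFactory.getThreadMXBean()
for info in bean.dumpAllThreads(True, True):
    print(info)  # thread state plus held monitors and synchronizers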

@max3-2
Contributor Author

max3-2 commented Feb 12, 2021

I'll check the other options and the detailed discussion later on, but just to give you a quick reproducer of my main concern:

Running the following via python simpleExample.py > testout 2> testerr should come close to your file-based solution. The same applies to running it as a subprocess, obviously.

from mph import Server, Client

# Uncomment one of the raise statements below per run to see where the
# exception gets swallowed:

# raise RuntimeError("I'm very lazy today")

print('Starting')
cServer = Server()
port = cServer.port
host = 'localhost'
print('Local server created')

# raise RuntimeError("I'm lazy today")

cClient = Client()

# raise RuntimeError('I am at least trying')

model = cClient.load('testVNew_caeModel.mph')

# raise RuntimeError('I loaded a model')

Depending on where the exception is raised (specifically, after the Client() part, where the JVM starts), the return code will be 0 AND the stderr file will be empty. That's a bad combination...

@max3-2
Contributor Author

max3-2 commented Feb 12, 2021

I had a call with Comsol support. They will report the bug, but state that this is neither a commitment to a solution nor a timeframe for one.
Regarding the differing behavior on Windows (stopping after some time) and Unix (running indefinitely): support stated that there are two completely different frameworks for Windows and Unix, which can lead to this.

@john-hen
Collaborator

john-hen commented Feb 12, 2021

Yeah, we shouldn't hold our breath. There is no money in it for Comsol, not even in the long run, because there are certain "contractual reasons" why they cannot work on a Python API.

There are clearly differences in behavior between Windows and Unix. The issue you report here is a good example: I can reproduce it on Linux, but not on Windows.

If I run this test

import mph
client = mph.Client()
raise RuntimeError

on Windows (python test.py 2>stderr.txt), it's all good and I get the expected

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    raise RuntimeError
RuntimeError

in the stderr output file. If I run it on Linux though, there is indeed nothing in stderr, just like you report for macOS.

So this (part of the) issue cannot be fixed with a little bit of monkey-patching, it seems. And it's almost certainly related to the JVM shutdown, as starting the client is definitely what triggers this behavior, and it's also at that point that we register the hard .exit() of the JVM.

@max3-2
Contributor Author

max3-2 commented Feb 12, 2021

I guess MathWorks has them by their best parts... a last stand in the fight against Python...

So, with all the information collected, I forked JPype and built in a flag which allows the old shutdown behavior. This actually works quite well and does not interfere too much with the rest, since you have to register an atexit handler to change the flag in shutdownJVM() from its default value.
I asked them if they would like to merge it. If not, we can think about either building wheels (I can do macOS) and distributing them with MPh, or some other way of distribution. However, I would not add a Git address to the requirements, since that would mean compilation, and at least on Windows compiling C for Python is a headache for a normal user...

@john-hen
Collaborator

john-hen commented Feb 12, 2021

Yes, that's true. I myself refuse anything that requires a C compiler on Windows. It's just too much of a headache.

I saw your JPype PR and read Thrameos's excellent explanation. I finally understand what the issue is: It's who goes down first, Python or Java. If they do add some way to configure the shutdown behavior, then that's a pretty easy choice for us since MPh only accesses Java objects from the Python side, not the other way around, so we want Python to die last.

Also, in my last comment, I spoke too soon. There is actually a very easy fix for that too. The reason we see no output in the file (but do see it in the console) is the "lazy writing" on Unix-like systems. So all it takes is to flush stdout and stderr right before exiting.

Edit: I've just committed this fix to master: eebb0cb
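
In essence, the exit handler now does something like this (a sketch with illustrative names; see the commit for the actual change):

import sys

def exit_handler():
    # Flush buffered output so it survives the hard JVM exit.
    sys.stdout.flush()
    sys.stderr.flush()
    # ... then trigger the JVM's exit as before.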

@max3-2
Contributor Author

max3-2 commented Feb 12, 2021

Yes, I was aiming for a sort of fallback solution for locked Java processes, and by hiding it like this it should not interfere too much. If there's a superior solution down the road on either end, changing back would be very easy.

...I think I tried that during debugging, to no avail, however I am not 100% sure. I'll check that on Monday; I have left my work for the weekend and intend to not code too much... we will see... ;)

If this works, shouldn't it be possible to check if something is in stderr and, if so, call .exit(1)? I had that idea, but tried it using sys.exc_info(), which failed for the reasons discussed above...

@john-hen
Collaborator

I agree, this is not a clean solution. But see if it works for you when you get a chance next week.

I've also just committed the monkey-patch for the exit-code issue: 7feee56. This only hooks into sys.exit() for now. I don't know what's supposed to happen after an exception. Does that automatically change the exit code? If so, yes, we'd have to add a second hook for unhandled exceptions. Just checking stderr, if that's even possible, is probably not a good solution, because stderr also receives all kinds of warnings that can often be ignored (from Qt, for example) but shouldn't necessarily change the exit status.

@max3-2
Contributor Author

max3-2 commented Feb 12, 2021

I did forget about that use of stderr. Not a good idea then.

I don't know what's supposed to happen after an exception. Does that automatically change the exit code?

Terminating a script with an unhandled raise usually sets an exit code of 1. However, sys.exc_info() only reports the exception inside the except clause, so it's gone by the time the shutdown routine runs.
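
A quick demonstration:

import sys

try:
    raise RuntimeError
except RuntimeError:
    print(sys.exc_info())  # (<class 'RuntimeError'>, RuntimeError(), <traceback ...>)

print(sys.exc_info())      # (None, None, None) once the except clause is left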

@max3-2
Contributor Author

max3-2 commented Feb 12, 2021

By the way, Thrameos offered a fix over at JPype. I will test it on Monday, too...

@john-hen
Collaborator

Great. I'll happily revert all changes if we get a clean solution in the end.

Though in the meantime, I have committed the missing monkey-patch with the exception hook: a88cf7f.
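
In essence, the hook works like this (a sketch with illustrative names; see the commit for the actual code): remember that an unhandled exception occurred, so the exit handler can report a non-zero status to the JVM.

import sys

exit_code = 0

def exception_hook(exc_type, exc_value, traceback):
    global exit_code
    exit_code = 1                                       # picked up by the exit handler
    sys.__excepthook__(exc_type, exc_value, traceback)  # print the traceback as usual

sys.excepthook = exception_hook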

@max3-2
Contributor Author

max3-2 commented Feb 12, 2021

First of all here too: Thanks for the effort and the great work!

I managed to run some tests via VPN, which confirmed that the hooks and the exit codes are working. I will do some more testing on Monday. If this is confirmed, my suggestion would be to leave this in as a fix; if/when JPype adds a shutdown config, it can be rolled back. This would prevent having to deal with custom dependencies and compile issues. Also, I feel like Thrameos's solution might be vastly superior to my patching.

@john-hen
Collaborator

john-hen commented Feb 12, 2021

Thanks to you for getting the ball rolling again on this issue.

I've also run some tests with the latest JPype PR. As far as I can tell (by running the test suite on Windows and Linux), this solves the issue of extended shutdown delays if all we do is jpype.config.destroy_jvm = False and then just use the regular jpype.shutdownJVM(). (Or not call it at all as it would run automatically anyway. But I like to have some logging around it.)
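
In code, the usage is simply this (requires a JPype build that includes the pull request):

import jpype
import jpype.config

jpype.config.destroy_jvm = False  # skip the blocking DestroyJavaVM() call

jpype.startJVM()
# ... work with the JVM ...
jpype.shutdownJVM()               # returns promptly now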

I've created a branch named shutdown_fix which implements the new (old) shutdown behavior, and also reverts all the work-arounds discussed above. I haven't tested exceptions and exit codes with it, but I trust that it would work as Python is now in full control again after the JVM shutdown (as the logs of the test suite show).

Yes, vendoring in whatever version of JPype is a lot of hassle and also not future-proof. We're better off waiting for its next release. I could release the work-arounds earlier if you want. Or we just discard them and wait for JPype.

@max3-2
Contributor Author

max3-2 commented Feb 13, 2021

I think this fix is okay and will not interfere too much with anything else, so I suggest releasing it. Given the complexity of a package like JPype, I think it will take some time until a new release with the fix and with wheels is available.

If you want to wait a few days, I have some convenience functions in the model class for common tasks, like toggling boundary conditions or adding new file data to interpolation functions. I think I could offer a PR starting next week... However, I have to work out how to catch exceptions from Java in Python; I am still looking for the appropriate classes...

@john-hen
Collaborator

Okay, I released 0.8.2 with the work-arounds. Yeah, the next JPype release may be a while, especially as it would introduce new API features.

Speaking of new API features, I'm open to that, but also want to keep the scope manageable. Like anything needed for general scripting, yes. But nothing too project-specific. So importing data for interpolation functions is a definite yes, but changing boundary conditions maybe not. But we can discuss/negotiate that in a new issue one of these days.

@john-hen john-hen added the fixed Fix for issue has been released. label Feb 15, 2021
@john-hen john-hen changed the title atexit issues with exceptions Shutdown issues with exceptions Apr 23, 2021