Java InterruptedException at end of Python session #38

Closed
john-hen opened this issue May 4, 2021 · 10 comments

Labels
fixed Fix for issue has been released.

Comments
@john-hen
Collaborator

john-hen commented May 4, 2021

On Linux and macOS, we occasionally encounter this error:

WARN:oejuc.AbstractLifeCycle:FAILED ClientWebSocketCommandManager.factory{0<=0<=0/0,0}:
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.doStop(QueuedThreadPool.java:122)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
    at org.eclipse.jetty.util.component.AggregateLifeCycle.doStop(AggregateLifeCycle.java:107)
    at org.eclipse.jetty.websocket.WebSocketClientFactory.doStop(WebSocketClientFactory.java:221)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
    at com.comsol.client.interfaces.ClientWebSocketCommandManager$c.run(SourceFile:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The error is harmless, in that it occurs at the very end of the Python session, but never during simulations or any other normal interaction with Comsol. It is not 100% reproducible, though it does occur about half the time the test suite is run on Linux. The error happens when the client disconnects from the server during the shutdown sequence of the Python/Java session. A regular disconnect, pre-shutdown, does not produce the error.

All of this points to some kind of race condition in JPype's shutdown sequence of the Java VM. JPype pull request #937 would fix this, as repeated tests have shown, but work on JPype is currently on hold and it is unclear when that fix will be released. Hence this notice here.

@john-hen john-hen added the bug Something isn't working. label May 4, 2021
@john-hen john-hen mentioned this issue Jun 15, 2021
Closed
@john-hen
Collaborator Author

john-hen commented Aug 1, 2021

So, first of all, my assessment in the original post was, to some extent, wrong: the problem was not reliably solved with that development version (1.2.2-dev0) of JPype. It seems, in hindsight, that I simply did not repeat the tests often enough back then for the new problems to become evident. Mea culpa.

On that note, as far as I can tell there is no observable difference in behavior between that dev version and the current JPype release, 1.3. I did, at some point, go back to the old dev build just to verify that.

To recap… In the current release of MPh, 1.0.4, we force a "hard exit" of the Java VM by calling jpype.java.lang.Runtime.getRuntime().exit() in a special exit handler, registered via @atexit in the session module. This avoids the inexplicable shutdown delay, typically one long minute, we'd otherwise have during a regular shutdown of the Java VM, as described in issue #1 and further discussed in #10 and #15. But it is obviously not a clean solution: The Python session ends abruptly and other exit handlers, possibly defined in application code, may never get a chance to run. Plus, we occasionally encounter the InterruptedException described in the original post here.
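For reference, a minimal sketch of such a hard-exit handler (the handler name and the exit code 0 are placeholders; MPh's actual handler goes to some length to preserve the real exit code):

import atexit

import jpype

@atexit.register
def hard_exit():
    # Terminate the JVM, and with it the whole process, immediately,
    # bypassing JPype's regular shutdown sequence. Because atexit runs
    # handlers in reverse registration order, any handler registered
    # before this one never gets a chance to run.
    if jpype.isJVMStarted():
        jpype.java.lang.Runtime.getRuntime().exit(0)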

The idea, therefore, was to leverage the new config options introduced in JPype 1.3, in particular by setting destroy_jvm from its default True to False. (The other two available options seem to be of no help as far as this particular problem is concerned.) This gets rid of those sporadic InterruptedExceptions as well as the shutdown delay, so that we can perform a regular shutdown by calling jpype.shutdownJVM() (or letting JPype run it automatically).
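In code, the change amounts to something like this (a sketch; the option merely has to be set before the JVM shuts down):

import jpype
import jpype.config

jpype.config.destroy_jvm = False   # skip the DestroyJavaVM() JNI call

# ... start the JVM and interact with Comsol as usual ...

jpype.shutdownJVM()   # or let JPype trigger this automatically at exit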

However, in about 1 in 10 runs on Linux, 2 out of 10 on Windows (according to my tests), and something like 6 out of 10 on macOS (according to Max), we now get access violations / segmentation faults.

I've narrowed it down to the following minimal test script:

import mph
client = mph.start()
model = client.create('empty')
parameters = model/'parameters'
(parameters/'Parameters 1').rename('parameters')

All the script does is start the Comsol client (a Java singleton), create an empty model (a Java container object), and rename one of the model nodes that Comsol creates automatically from its default name to something else. Notably, no I/O is involved. On Windows, it would actually suffice to just start the client in order to observe the crashes, but on Linux I only see them if I also modify the model object, however slightly.

I then wrote a second script that runs the above test a number of times as a subprocess and reports non-zero exit codes as well as any unexpected output on stdout or stderr.

On Windows, in the 10 or so percent of runs where it crashes, the test script exits with error code 1, reports an access violation on stdout, and dumps a Java "fatal error log", i.e. a file named hs_err_pid*.log in the working directory, such as this one: hs_err_pid13300.log.

On Linux (Ubuntu 20.04 LTS), the test script exits with code -6 and FATAL: exception not rethrown is printed to stderr. No error log is dumped, at least not in my latest tests. It may have been the case in earlier tests I ran weeks ago, but I'm not sure.

For macOS, Max has reported in chat that the behavior is pretty much the same as on Windows: crashes (due to a SIGSEGV) leave a fairly similar fatal error log behind.

So the error manifests itself somewhat differently across platforms, but is probably the same in nature.

Additionally, I can no longer create the coverage report successfully. Something strange is happening there: towards the end of the test suite's run (when subprocesses are tested), pytest seems to forget its own configuration and somehow adds the test folder to the code coverage, even though that folder should be excluded. (We only want to measure coverage of the actual library code.) I have no explanation for this, and no solution either. It's just frustrating that something so basic doesn't work as expected. It is certainly related to the crashes described above, but may also have something to do with pytest turning on "faulthandlers" in its own exit handler. I tried to work around this, but to no avail.

At this point, I'm quite pessimistic this issue will ever be resolved. I could try and narrow down further what triggers the crashes, like having only direct calls to the JPype API in the test script. And maybe activate tracing to gain further insight. But in the end, Comsol remains a fairly opaque black box to us, and we cannot expect any "upstream" support whatsoever. But if you have any ideas, @Thrameos, I could try and dig a little deeper.

It should also be noted that Comsol itself is not the most stable piece of software. There have been many bugs in the past, and certainly many which still remain, that would cause even the GUI to crash and leave one of those fatal error log files in the user's working directory. So this is not unheard of, and it could all just be Comsol's fault somehow.

In order to get unstuck, however, I think we should leave things as they are now (and have been for a good while), i.e. keep performing the hard JVM exit and hold on to all the work-arounds preserving the exit code. The InterruptedException errors are a bit annoying, but can be safely ignored. The crashes however, especially the error logs littering the working directory, make for a more disconcerting user experience, in my opinion. @max3-2, how do you see this?

@Thrameos

Thrameos commented Aug 2, 2021

I am not aware of any good solution here. The best would simply be if the Comsol client had a shutdown routine that the user could call to politely release any resources, so that you could then proceed to destroying the JVM. But that would need to be a defined API. My general take is that crashes like these arise where something holds both Python and Java on the stack at the same time: shutting down pulls the rug out, so some method fails and has no way to return because its caller is no longer available. But since I don't see any usage of JImplements/JProxy in the example, that doesn't really fit.

I can give some details on the shutdown sequence for Python/JVM. It is of course very complicated, as it depends on how the sequence was triggered. If shutdown was invoked manually, then it is straightforward to work out the order of operations, and you may be able to manually run the Python atexit routines before the shutdown proceeds. If called as part of the Python shutdown, it gets complicated, as Python is processing the atexit routines and then we start terminating around it.
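For the manual case, running the atexit routines first might look something like this (a sketch: atexit._run_exitfuncs() is a private CPython function, so this is an illustration of the idea, not a recommendation):

import atexit

import jpype

# Run Python's registered exit handlers while the JVM is still fully
# alive, then shut the JVM down explicitly.
atexit._run_exitfuncs()   # private CPython API
jpype.shutdownJVM()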

Other than exit(), the next best thing would be for the JVM to not get terminated at all. If we disable the shutdown hook on the atexit, then Python will shut itself down cleanly, which then calls exit. Unfortunately, if Java calls a Python method during that period, it will fail, though perhaps there is a way to determine if that were the case and just fail the method with an exception. Once Python is complete, it will terminate the process with exit. No Java shutdown, close, or atexit routines would take place.

Unfortunately, as this is proprietary software, there is very little advice I can give. If you knew exactly where the Java code was going wrong and needed to fix a bug in Comsol itself, then you could use a class loader and the ASM library to "patch" the method to, say, catch the exception or handle termination gracefully. But that is a lot of effort. I did something similar to patch byte codes to catch NaN in a variable, but the effort and knowledge level required is very high, and it only works if the library is pure Java. Mixed code with native components would be pretty much hopeless.

@john-hen
Collaborator Author

john-hen commented Aug 2, 2021

Thanks for the explanations. Yeah, Comsol does have native components besides the Java code. Lots, actually. There's almost certainly no API method to terminate the client: I've spent hours introspecting the Java object hierarchy. I don't see how Java would ever call anything in Python, the way everything is set up. All I know is that, without the explicit exit() call, the session will (usually) hang at the end. And at this point there should only be Python objects that hold mere references to Java objects.

But I think I'll give it one more shot and try to come up with a simple demo that does not use any MPh code, only direct JPype calls. Should be easier to reason about…

@max3-2
Contributor

max3-2 commented Aug 3, 2021

I did some testing here and there and found this really hard to reproduce consistently - hence my idea that something is happening inside COMSOL that just needs more time. This is in agreement with what we saw with earlier versions.

I have little to no hope that anything will happen on the COMSOL side. They are heavily blocking anything that is not Mathworks from their API, and in recent times have been pretty clear that they have obligations to do so. Regarding bugs, I can confirm: COMSOL is just a massive black box, and everything that comes out of it needs more than serious scrutiny (guess that's the case with any (commercial) simulation software anyway).

To move forward, I would reinforce the thought of not trying to fix what is not fixable on our end, and rather spending the time and motivation on improving the functionality and stability on our side. I also think it is better to keep the exit routines and the above exception, which occurs rarely, instead of having the SEGFAULT and the error log files.

@Thrameos

Thrameos commented Aug 3, 2021

I am still not clear on the issue. What specific options lead to a SEGFAULT, and what leads to the exception?

The current path for shutdown:

  • Java starts the shutdown by informing all threads that it is time to terminate.
  • JPypeContext.shutdown gets called and starts the shutdown.
  • shutdown waits 5 times with yield to try to get itself to the back of the queue, so that others can proceed first, avoiding race conditions.
  • shutdown executes all of its user-installed hooks and waits for each to complete. (Here is where you can customize the sequence.)
  • shutdown deactivates future Python calls, which prevents another proxy from executing.
  • shutdown queries for any threads that remain alive at this point and sends them the sleep interrupt (which could generate an exception).
  • shutdown disables the JPype module (at which point further interaction will lead to a crash).
  • shutdown destroys all Python resources held by Java.
  • shutdown executes post-cleanup actions.

So the easiest path here is to add a user-installed hook that simply waits for 0.5 seconds, so that anything else has time to clear before shutdown proceeds to poking the remaining threads with the sleep interrupt.

@Thrameos

Thrameos commented Aug 3, 2021

Example. This program installs a hook that adds one second to the shutdown sequence to avoid a race condition.

import jpype
import jpype.imports
jpype.startJVM()
ctxt = jpype.JClass('org.jpype.JPypeContext').getInstance()

@jpype.JImplements("java.lang.Runnable")
class Wait():
    @jpype.JOverride
    def run(self):
        print("B")
        import time
        time.sleep(1)
        print("C")

ctxt.addShutdownHook(jpype.java.lang.Thread(Wait()))
print("A")

You can take other actions, like scanning the Comsol threads to find a specific thread, sending it the interrupt signal, then waiting, etc.
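For illustration, a sketch of such a hook, following the pattern above (the thread-name prefix 'comsol-worker' is made up; the actual names would have to be looked up via getAllStackTraces() first):

import jpype

@jpype.JImplements("java.lang.Runnable")
class InterruptWorkers():
    @jpype.JOverride
    def run(self):
        Thread = jpype.JClass('java.lang.Thread')
        for thread in Thread.getAllStackTraces().keySet():
            # 'comsol-worker' is a hypothetical name, for illustration only.
            if str(thread.getName()).startswith('comsol-worker'):
                thread.interrupt()     # send the interrupt signal
                thread.join(500)       # wait up to half a second

ctxt = jpype.JClass('org.jpype.JPypeContext').getInstance()
ctxt.addShutdownHook(jpype.java.lang.Thread(InterruptWorkers()))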

Basically, I tried to make it so that something with a complex race condition can be resolved by adding a hook, allowing the user to force the issue in a particular direction.

@john-hen
Collaborator Author

john-hen commented Sep 13, 2021

So I spent some more time on this, and I've come to the conclusion that it must be some kind of bug in that Java VM which ships with Comsol.

Here is the test script I used. It makes all calls into the Comsol class library directly via JPype, no MPh business.

import jpype
import jpype.imports
import jpype.config

print('# Configuring JPype.')
jpype.config.onexit         = True     # default: True
jpype.config.destroy_jvm    = False    # default: True
jpype.config.free_resources = True     # default: True

print('# Starting JVM.')
root = 'C:/Program Files/COMSOL/COMSOL56/Multiphysics'
jvm  = root + '/java/win64/jre/bin/server/jvm.dll'
jars = root + '/plugins/*'
jpype.startJVM(jvm, classpath=jars)

print('# Importing client.')
from com.comsol.model.util import ModelUtil

print('# Initializing client.')
ModelUtil.initStandalone(True)

print('# Creating model.')
model = ModelUtil.createUnique('model')

print('# Shutting down.')

I have a second script that repeats the test until a crash occurs. On Windows, the test script crashes with a segmentation fault about 1 out of 10 times. Here, like in the MPh library code, we start Comsol's own Java VM, which is a recent build of the old JDK 8.

C:\Program Files\COMSOL\COMSOL56\Multiphysics\java\win64\jre\bin
$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_265-b01)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.265-b01, mixed mode)

However, no crashes occur (over more than 100 runs) with the current JDK 16.

C:\programs\Java\bin
$ java -version
openjdk version "16.0.2" 2021-07-20
OpenJDK Runtime Environment Temurin-16.0.2+7 (build 16.0.2+7)
OpenJDK 64-Bit Server VM Temurin-16.0.2+7 (build 16.0.2+7, mixed mode, sharing)

So, well, there's that.

Ultimately though, it is some kind of design flaw in Comsol's architecture. The only reason we have to skip the DestroyJavaVM() JNI call is that Comsol keeps creating certain threads after, for example, loading a model, and these threads can't be interrupted and somehow even "respawn" when forcefully stopped.

In issue #15, specifically this comment, I overlooked that that's what's going on. It is possible to stop the non-daemon threads, as I observed there; they are in fact dead. But if we call getAllStackTraces() a second time, we see that they have been replaced by new threads with the same name and ID. I don't know what that's all about, but it's the reason why the DestroyJavaVM() call blocks (for a while or even indefinitely), producing the shutdown delay.
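The respawning can be observed with direct JPype calls along these lines (a sketch; Thread.stop() is deprecated and unsafe and is used here only to reproduce the observation):

import time

import jpype

Thread = jpype.JClass('java.lang.Thread')

def live_threads():
    return {str(t.getName()): t for t in Thread.getAllStackTraces().keySet()}

before = live_threads()
for name, thread in before.items():
    if not thread.isDaemon() and name != 'main':
        thread.stop()

time.sleep(1)
after = live_threads()

# Threads that reappear under the same name, but as new Thread objects:
respawned = [name for name in after
             if name in before and not after[name].equals(before[name])]
print(respawned)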

I've also just now noticed that the same happens with pure Java code. The Java equivalent of the above Python test script is this:

import com.comsol.model.*;
import com.comsol.model.util.*;

public class test {
   public static void main(String[] args) {
      System.out.println("# Initializing client.");
      ModelUtil.initStandalone(false);
      System.out.println("# Creating model.");
      Model model = ModelUtil.create("model");
      System.out.println("# Shutting down.");
      // Without a System.exit(0) call here, the JVM hangs at exit (see below).
   }
}

It is compiled with javac -classpath "%ComsolDir%\plugins\*" test.java and run via java -classpath ".;%ComsolDir%\plugins\*" test. Whichever JDK we use, it also hangs for about a minute at the very end. This is most likely the reason why the Comsol documentation enigmatically says:

You also need to call System.exit(0) at the end of the Java program to terminate the process.

Adding the line System.exit(0); at the end of main() does in fact eliminate the shutdown delay. And it's why I ended up doing the same thing in Python. But it's really strange that this is needed.

Now it's too bad that skipping the DestroyJavaVM() call doesn't remedy the situation. But we wouldn't need to skip if those threads didn't do whatever it is they are doing. And the issue seems to be compounded by some bug in the Java 8 run-time. After all, the "problematic frame" is always in VM code (jvm.dll or libjvm), never in compiled Java code, and the offending VM operation is always RevokeBias, mode: safepoint, which has something to do with how locks are negotiated between threads. And there are a number of bug reports (1, 2, 3, 4, probably more) with a very similar stack trace, all for JDK 8, none consistently reproducible, all unresolved.

For the record, here is the trace log I get when the test script crashes: trace 1. I don't know how accurate these logs are towards the very end. I did run Python in unbuffered mode (with the -u flag) to make sure stderr gets flushed as quickly as possible, but not sure if all lines down to the very last one actually made it to the output file when the whole thing went down. I also decided to comment out the line JPJavaFrame frame = JPJavaFrame::outer(context); in PyJPValue_finalize(), as that object seems to never be used inside that function, and that adds a few more lines to the trace: trace 2. I also tried unloading the model and removing the Python reference (del model), but then it crashes when the value org.jpype.JPypeContext is finalized, specifically when env->functions->DeleteGlobalRef(env, obj) gets called in JPContext::ReleaseGlobalRef(). And not deleting the global reference is also not an option, that leads to even more frequent crashes.

Anyway, I think this is case closed. Ultimately it is Comsol's fault and there's not much we can do about it, except adding a dependency on a newer Java run-time, which I don't want to do. It should work with Comsol's very own JVM build. Importantly, @Thrameos , the issue is not with JPype. The only reason it once worked with an earlier version, I believe, is that the shutdown procedure was less clean back then, so it glossed over this issue.

@Thrameos

Okay, good to hear that the options I provided are helping. It is likely that the offending code is simply a loop which catches all exceptions and then attempts to restart the threads, which would typically happen if someone wants code to respawn no matter what went wrong. Given your heroic effort in tracking down the issue, there isn't anything else that can be done. Their documentation makes it pretty clear there is no shutdown procedure other than pulling the plug with System.exit. You may want to see if there is a way to disable the JPype shutdown sequence entirely and add an atexit hook that calls System.exit directly. At least that will keep the user from having to fight with it.

@john-hen
Collaborator Author

Yeah, that might explain the behavior of these threads. They seem to be doing some kind of book-keeping, as they often do eventually die on their own.

Calling System.exit() directly in the exit handler is the work-around that's currently in place. We still get this InterruptedException every now and then, as described in the original post. They are different from the VM crashes, and less annoying as they don't produce fatal error logs. I've never tried to catch these Java exceptions, maybe worth a shot, but since they happen after the System.exit() call, I don't know if that's possible or even sensible.

I did try out the three JPype config options, but no combination led to an entirely crash-free shutdown. Though in these tests there was no System.exit() call, as I was hoping to get rid of our hacky exit handler.

@john-hen john-hen added can't fix Issue cannot be resolved. and removed bug Something isn't working. labels Oct 7, 2021
john-hen added a commit that referenced this issue Oct 16, 2021
Tests have shown that this has no effect on Linux. With or without
this, the `InterruptedException` discussed in #38 still occurs at exit
about the same number of times. So it's best not to touch this.
@john-hen
Collaborator Author

This issue seems to be resolved with the release of Comsol 6.0. I can no longer reproduce these errors in repeated tests, which I could before with Comsol 5.6.

I'm pretty sure this is because Comsol upgraded its vendored Java environment from JDK 8 to JDK 11. As I mentioned in an earlier comment, all bug reports on the OpenJDK issue tracker with similar error traces were filed for JDK 8, never for later versions.

For the record, these were the test scripts I used (now) to confirm this...

test.py:

import mph

version = '6.0'

mph.option('session', 'client-server')
client = mph.start(cores=1, version=version)
assert client.version == version

model = client.load('capacitor.mph')
model.solve('static')

repeat.py:

from subprocess import run, PIPE, STDOUT
from textwrap   import indent
from sys        import executable as python

n = f = 0
while n < 20:
    n += 1
    print(f'Test run {n} started.')
    command = [python, '-u', 'test.py']
    # stderr is merged into stdout so that the Java exception trace,
    # which goes to stderr, also shows up in `process.stdout`.
    process = run(command, stdout=PIPE, stderr=STDOUT, text=True)
    if process.returncode:
        print(f'  Failed with return code {process.returncode}.')
    if process.stdout:
        if 'java.lang.InterruptedException' in process.stdout:
            print('  Failed with InterruptedException.')
            f += 1
        else:
            print('  Text printed to stdout or stderr.')
            print(indent(process.stdout, '  '))

print(f'{f} runs failed with InterruptedException.')

@john-hen john-hen added fixed Fix for issue has been released. and removed can't fix Issue cannot be resolved. labels Aug 27, 2022