
Running Comsol via MPh on the cluster crashes during solving #128

Closed
wacht02 opened this issue Mar 8, 2023 · 8 comments
Labels
general General ideas and feedback.

Comments


wacht02 commented Mar 8, 2023

Hello, I recently moved my COMSOL efforts to a cluster and want to use the MPh package to run and evaluate the simulations.
The cluster runs CentOS 7 and has COMSOL 6.1.0.252 and MPh 1.2.2 installed. When I run a job on the cluster, it sometimes crashes with the following error message in the log file:

/PATH/TO/software/COMSOL/6.1.0.252/bin/glnxa64/comsol: line 237: 39864 Killed                  ${FLROOT}/bin/comsol "$@"
Traceback (most recent call last):
  File "SourceFile", line 192, in com.comsol.clientapi.impl.StudyClient.run
Exception: Java Exception

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/scripts/python/Run_COMSOL_from_Python.py", line 285, in <module>
    model.solve(f'Study')
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/model.py", line 349, in solve
    node.run()
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/node.py", line 546, in run
    java.run()
java.lang.java.lang.NullPointerException: java.lang.NullPointerException
Error while disconnecting client at session clean-up.
Traceback (most recent call last):
  File "SourceFile", line 290, in com.comsol.model.util.ModelUtil.disconnect
Exception: Java Exception

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/session.py", line 154, in cleanup
    client.disconnect()
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/client.py", line 438, in disconnect
    self.java.disconnect()
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Tried to disconnect while not connected.

I can, however, run the exact same code on the Windows machine that has the same installations without any problems. Could this be an OS issue?

Furthermore, I think the job only crashes with more complex models, as I have not seen this error with simpler ones.

Are there any things I need to watch out for when running MPH and COMSOL on the cluster?

If you need further information, please let me know. Thank you in advance for your efforts.


john-hen commented Mar 8, 2023

What's the 39864 Killed in the log file? Doesn't that mean that the batch scheduler (Slurm or whatever your cluster is using) killed the job because it ran for too long? That would explain why you don't see it with simpler models. I haven't tested what kind of error traceback would be produced if the job was killed, but what I see there is roughly what I'd expect.

If it's not that, then I don't know. The exceptions there are super generic: "Java Exception", IllegalStateException.

Are there any things I need to watch out for when running MPH and COMSOL on the cluster?

In theory, no. Because it's Linux. But that theory has rarely been tested, by me at least. So there could always be issues with external shared libraries, which Comsol depends on for linear algebra as well as image processing. And not all Linux distros are doing things the same way. Though that doesn't seem to be the problem here. And to be clear: these would be issues with the Comsol setup, not with MPh per se.


wacht02 commented Mar 8, 2023

I don't know what the 39864 Killed means exactly. It's not the job ID. Also, in this example the job only ran for a couple of minutes before crashing, so it wasn't canceled by the scheduler. The job report also only states "Failed". I have also had jobs crash after some hours of runtime. Those have a different statement in the first line, e.g.:

terminate called after throwing an instance of 'flbase::Exception'
Magick: abort due to signal 6 (SIGABRT) "Abort"...
/PATH/TO/software/COMSOL/6.1.0.252/bin/glnxa64/comsol: line 237: 202727 Aborted                 ${FLROOT}/bin/comsol "$@"
Traceback (most recent call last):
  File "SourceFile", line 192, in com.comsol.clientapi.impl.StudyClient.run
Exception: Java Exception

But the error also ends with the exception

Traceback (most recent call last):
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/session.py", line 154, in cleanup
    client.disconnect()
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/client.py", line 438, in disconnect
    self.java.disconnect()
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Tried to disconnect while not connected.

I will ask the cluster support if they know what's going on. Thank you for the quick response.


john-hen commented Mar 8, 2023

Okay, so there is a difference between aborted and killed. If not the run time, it could also be some other resource that might have been exhausted. Like memory or disk space (i.e., "scratch" space, the one for temporary files). But your cluster support will know more.
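
As a quick sanity check, the free space in the scratch directory can be queried before launching the job. A minimal sketch using only the Python standard library (the path and the 50 GB threshold are placeholders, not values from your setup):

```python
import shutil

def free_scratch_gb(path='/tmp', required_gb=50):
    """Return free space in GB and warn if below the given threshold."""
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < required_gb:
        print(f'Warning: only {free_gb:.1f} GB free on {path}.')
    return free_gb

free_scratch_gb()
```

Calling this at the top of the batch script would at least make a full scratch disk show up in the log before the solver dies.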

That the last exception is the same in both cases is not surprising. Each time, the process is terminated abruptly while it's solving. The other exceptions are a consequence of that, as the traceback states. MPh does run a clean-up routine whenever Python exits. It has to, because Comsol's way of ending its own Java process is, well... peculiar. Usually "we" tell the Java VM to shut down, in the Python exit handler. But before that, we disconnect the client from the server. Any exceptions that might occur while doing that are caught and silently ignored. But that assumes that the Java VM (that we're shutting down) is still running at that point. Though in your case, it seems to be in an "illegal state", whatever that may be. Probably because it was already terminated by some external means. So the exceptions that follow the first one just get added to the traceback, as they are a direct consequence. (Not a hundred percent sure this explanation is correct, just to explain how I read the error message.)


wacht02 commented Mar 16, 2023

Hello, I've heard back from the cluster support and tried some things. Your guess was correct: after the solving of the simulation is done, COMSOL writes a large amount of data to disk. If there is no more space available, the simulation crashes. The directory for this is on the SSD that is attached to the node. I changed this directory to somewhere with far more space, but COMSOL seems to just ignore this and still writes to the node's SSD.

I also took a look at the command line for running simulations. When I start a job with comsol batch -inputfile PATH/TO/FILE -study NAME_OF_STUDY, I get the same error as before. The progress is indicated as 100% though, so it's probably only writing the solved study to disk?

With that command I can also set the directory for temporary files using the -tmpdir PATH/TO/TEMP/ option. When I do this, the simulation works fine. Is there an option to do this with MPh, or could one be added? I would still like to use MPh to evaluate the results of solved studies, but I can't open them, since COMSOL still writes to the wrong directory.


john-hen commented Mar 16, 2023

Great, that explains that. You must be running a pretty big model if it maxes out the local scratch space. Comsol does produce a lot of temporary files. I've often had that problem, where the local hard drive would fill up and eventually everything came to a crashing halt. But then I'd usually solved the same model many times over, with different parameters. And Comsol only seems to remove the temp files at the end of the session. So solving just the one model, one time, should work, provided the SSD has enough free disk space to accommodate the RAM.

But okay, it is what it is. MPh currently provides no option to set the directory for Comsol's temporary files. In principle, this option could be added. I am hesitant to do that though. I want the distinction between "stand-alone client" and "client-server mode" to be as seamless as possible. Ideally, users should just call mph.start() when running simulations locally, and not worry about what happens behind the scenes. But the setting for the temporary directory would have to be implemented differently in those two cases. So this is not as trivial as I would hope. Would require quite a bit of testing across platforms. Which is always a problem, because then I need access to these platforms.

In client-server mode, yes, we could pass that argument to the server process as it starts. Much like you do with comsol batch. In the end, when we start the server, MPh just runs the command comsol server as a subprocess. So there's not much to it. We could just let users pass whatever arguments they want there. That's something I'm considering. But then they'd have to instantiate mph.Server() directly, not via mph.start(). So this is where it gets more complicated, if we want the same thing to work for stand-alone clients too. Which we should: It makes for a consistent user experience.
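
To illustrate the idea (this is a simplified sketch, not MPh's actual code, and the option values are only examples): starting the server amounts to assembling a command line and spawning it as a subprocess, so pass-through options would simply be appended to that list before the process is launched.

```python
def server_command(cores=1, extra_arguments=()):
    """Assemble the command line that starts a Comsol server."""
    command = ['comsol', 'server', '-login', 'auto', '-np', str(cores)]
    command += list(extra_arguments)   # user-supplied pass-through options
    return command

command = server_command(cores=4, extra_arguments=['-tmpdir', '/scratch/tmp'])
# subprocess.Popen(command) would then start it (requires a Comsol install).
print(command)
```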

Anyway, there seems to be a simpler solution for you, and it doesn't require any code changes to MPh. You can set the temp dir via an environment variable: COMSOL_TMPDIR. For some reason, the Comsol documentation only mentions that possibility for the Windows platform. But I've just tested it on the Linux cluster I have access to, which also runs CentOS 7, and it works there as well.

You could set the variable directly in the shell (export COMSOL_TMPDIR=...), or in your .bashrc configuration file, or even in the Python code itself, as long as that happens before mph.start() launches the Comsol process. Like so:

import mph
import os

os.environ['COMSOL_TMPDIR'] = os.environ['HOME'] + '/temp'

client = mph.start()
model = client.load('capacitor.mph')
model.solve()


wacht02 commented Mar 17, 2023

Thanks for the reply. Yes, my model is not small. But I am not really sure what exactly COMSOL is doing. In the monitoring of the cluster you can see the RAM usage of the node. There you see that COMSOL uses a large amount of RAM at the beginning of the simulation (~70 GB). This decreases after some minutes and stays like that for the rest of the simulation (~15 GB). And then at the end it writes even more data to the drives.

I just realised I forgot to mention something in my previous reply: I also pass the argument -autosave off when starting the comsol batch command. This turns off the writing of recovery files. Do you know if this is also possible with an environment variable?

I've tested your suggestion by setting export COMSOL_TMPDIR=.... It seems to work, but now I get an error about not being able to write the recovery file. Here is the error message, if you are interested:

Traceback (most recent call last):
  File "java.lang.Thread.java", line -1, in java.lang.Thread.run
Exception: Java Exception

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/scripts/python/Run_COMSOL_from_Python.py", line 290, in <module>
    model.solve(f'Study')
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/model.py", line 349, in solve
    node.run()
  File "/home//miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/node.py", line 546, in run
    java.run()
com.comsol.util.exceptions.com.comsol.util.exceptions.FlException: Exception:
	com.comsol.util.exceptions.FlException: The following feature has encountered a problem
Messages:
	The following feature has encountered a problem:
	- Feature: Time-Dependent Solver 1 (sol3/t1)

	Cannot create native file /home/.comsol/v61/recoveries/MPHRecovery3252date Mar 17 2023 12-29 PM.mph/solution17754061549040876287.mphbin6221711134361404387.

Error while disconnecting client at session clean-up.
Traceback (most recent call last):
  File "SourceFile", line 290, in com.comsol.model.util.ModelUtil.disconnect
Exception: Java Exception

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/session.py", line 154, in cleanup
    client.disconnect()
  File "/home/miniconda3/envs/COMSOL_Env/lib/python3.9/site-packages/mph/client.py", line 438, in disconnect
    self.java.disconnect()
com.comsol.util.exceptions.com.comsol.util.exceptions.FlException: Exception:
	com.comsol.util.exceptions.FlException: Disk quota exceeded
Messages:
	Disk quota exceeded

Also, just out of curiosity: if I wanted to add an option to MPh, would I just add the arguments to line 81 in the mph/server.py file? Like arguments = ['-login', 'auto', '-graphics', '-tmpdir', 'PATH/TO/TMP/', '-autosave', 'off']?


john-hen commented Mar 17, 2023

Also, just out of curiosity: if I wanted to add an option to MPh, would I just add the arguments to line 81 in the mph/server.py file? Like arguments = ['-login', 'auto', '-graphics', '-tmpdir', 'PATH/TO/TMP/', '-autosave', 'off']?

Yes. That's exactly what you'd do if you wanted to "hot-patch" the library code. Just make sure the file you're changing is really the one that Python eventually runs. So then you'd want to install from source in "editable" mode, with pip install -e . after git clone-ing the source code. (There are other ways of doing this, of course.)

This is also where I'd consider having users just add whatever options they want. Though then they'd have to call mph.Server(...) directly. Whereas you, when you do the hot patch, can just call mph.start() as usual.

The -autosave off option I have no problem adding in the library. Maybe it's actually just missing, and should have been there all along. When we start the client (not the server), we already try to turn off the creation of recovery files:

MPh/mph/client.py

Lines 209 to 216 in 3a95842

# Override certain settings not useful in headless operation.
preferences = (
    ('updates.update.check',                  'off'),
    ('tempfiles.saving.warnifoverwriteolder', 'off'),  # issue #50
    ('tempfiles.recovery.autosave',           'off'),
    ('tempfiles.recovery.checkforrecoveries', 'off'),  # issue #39
    ('tempfiles.saving.optimize',             'filesize'),
)

That's because recovery files are pretty useless when you're scripting things. They are something GUI users want. As far as MPh is concerned, recovery files should not be created, ever. If they are, then that's something we should fix. But maybe the above only prevents them for stand-alone clients, and has no effect in client-server mode. (I have a blind spot there.) Then we should just hard-code -autosave off.

Do you know if this is also possible with an environment variable?

Not that I'm aware of. It should, however, be configurable via the *.prefs files inside the ~/.comsol folder, if I'm not mistaken.
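
If the prefs route pans out, toggling the setting could even be scripted. A sketch, assuming the *.prefs files are plain key=value lines (worth verifying against your installation) and using the tempfiles.recovery.autosave key from the snippet above:

```python
from pathlib import Path

def set_preference(prefs_file, key, value):
    """Rewrite (or append) one key=value entry in a Comsol *.prefs file."""
    path = Path(prefs_file)
    lines = path.read_text().splitlines()
    found = False
    for (index, line) in enumerate(lines):
        if line.split('=')[0].strip() == key:
            lines[index] = f'{key}={value}'
            found = True
    if not found:
        lines.append(f'{key}={value}')
    path.write_text('\n'.join(lines) + '\n')

# Hypothetical usage; the exact prefs file name depends on the installation:
# set_preference('path/to/your.prefs', 'tempfiles.recovery.autosave', 'off')
```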

@john-hen john-hen changed the title Running COMSOL with MPH on cluster crashes frequently during solving Running Comsol via MPh on the cluster crashes during solving Mar 17, 2023
john-hen added a commit that referenced this issue Mar 18, 2023
See discussion in issue #128. To what extent this command-line option
is effective is however a different matter.
john-hen added a commit that referenced this issue Mar 18, 2023
See discussion in issue #128. This should only be used as a last resort
if there is an important argument that MPh is not providing already.
Before, users would have to hot-patch the library in such a scenario.
@john-hen

The changes discussed are in MPh 1.2.3, released today.

@john-hen john-hen added the general General ideas and feedback. label Mar 19, 2023