Running Comsol via MPh on the cluster crashes during solving #128
Comments
What's the run-time limit of your cluster job? If it's not that, then I don't know. The exceptions there are super generic: "Java Exception".
In theory, no. Because it's Linux. But that theory has rarely been tested, by me at least. So there could always be issues with external shared libraries, which Comsol depends on for linear algebra as well as image processing. And not all Linux distros are doing things the same way. Though that doesn't seem to be the problem here. And to be clear: these would be issues with the Comsol setup, not with MPh per se.
I don't know what the cause is. But the error also ends with the same exception. I will ask the support of the cluster if they know what's going on. Thank you for the quick response.
Okay, so there is a difference between aborted and killed. If not the run time, it could also be some other resource that might have been exhausted. Like memory or disk space (i.e., "scratch" space, the one for temporary files). But your cluster support will know more.

That the last exception is the same in both cases is not surprising. Each time, the process is terminated abruptly while it's solving. The other exceptions are a consequence of that, as the traceback states.

MPh does run a clean-up routine whenever Python exits. It has to, because Comsol's way of ending its own Java process is, well... peculiar. Usually "we" tell the Java VM to shut down, in the Python exit handler. But before that, we disconnect the client from the server. Any exceptions that might occur while doing that are caught and silently ignored. But that assumes that the Java VM (that we're shutting down) is still running at that point. Though in your case, it seems to be in an "illegal state", whatever that may be. Probably because it was already terminated by some external means. So the exceptions that follow the first one just get added to the traceback, as they are a direct consequence. (Not a hundred percent sure this explanation is correct, just to explain how I read the error message.)
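The clean-up sequence described above can be sketched roughly like this. This is a minimal illustration with stand-in objects, not MPh's actual code: the class and method names (`FakeClient`, `FakeJVM`, `disconnect`, `shutdown`) are invented for the example.

```python
import atexit

class FakeClient:
    """Stand-in for the Comsol client. Simulates the case where the
    Java VM was already killed externally: disconnecting fails."""
    def disconnect(self):
        raise RuntimeError('Java VM is in an illegal state')

class FakeJVM:
    """Stand-in for the Java VM handle."""
    running = True
    def shutdown(self):
        self.running = False

client, jvm = FakeClient(), FakeJVM()

def cleanup():
    try:
        client.disconnect()   # may raise if the JVM is already gone
    except Exception:
        pass                  # caught and silently ignored, as described
    jvm.shutdown()            # then tell the Java VM to shut down

# Run the clean-up whenever Python exits.
atexit.register(cleanup)
```

The point is only that the disconnect step is wrapped in a try/except, so a killed server produces noise in the traceback but does not stop the shutdown itself.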
Hello, I've heard back from the cluster support and tried some things. Your guess was correct. After the solving of the simulation is done, COMSOL writes a large amount of data to disk. If there is no more space available, the simulation crashes. The directory for this is the SSD that is connected to the node. I changed this directory to somewhere with way more space, but COMSOL seems to just ignore this and still writes to the SSD of the node. I also took a look at the command-line options for running simulations. When starting a job from the command line, I can also set the directory for temporary files using the corresponding option.
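As an aside, you can check from Python how much space is left in the scratch directory before solving. A minimal sketch, assuming the `COMSOL_TMPDIR` environment variable is used for the temp directory; the 100 GB threshold is a made-up example value:

```python
import os
import shutil
import tempfile

# Scratch directory: COMSOL_TMPDIR if set, else the system temp dir.
scratch = os.environ.get('COMSOL_TMPDIR', tempfile.gettempdir())

# Query the file system the scratch directory lives on.
total, used, free = shutil.disk_usage(scratch)
free_gb = free / 1e9

# Example threshold only; pick whatever margin your models need.
if free_gb < 100:
    print(f'Warning: only {free_gb:.0f} GB free in {scratch}')
```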
Great, that explains that. You must be running a pretty big model if it maxes out the local scratch space. Comsol does produce a lot of temporary files. I've often had that problem, where the local hard drive would fill up and eventually everything came to a crashing halt. But then I'd usually solved the same model many times over, with different parameters. And Comsol only seems to remove the temp files at the end of the session. So solving just the one model, one time, should work, provided the SSD has enough free disk space to accommodate the RAM. But okay, it is what it is.

MPh currently provides no option to set the directory for Comsol's temporary files. In principle, this option could be added. I am hesitant to do that though. I want the distinction between "stand-alone client" and "client-server mode" to be as seamless as possible. Ideally, users should just call `mph.start()` and not worry about the difference. In client-server mode, yes, we could pass that argument to the server process as it starts, much like the arguments that are already passed along.

Anyway, there seems to be a simpler solution for you, and it doesn't require any code changes to MPh. You can set the temp dir via the environment variable `COMSOL_TMPDIR`. You could set the variable directly in the shell before starting Python, or from within the Python session itself:

```python
import mph
import os

os.environ['COMSOL_TMPDIR'] = os.environ['HOME'] + '/temp'

client = mph.start()
model = client.load('capacitor.mph')
model.solve()
```
Thanks for the reply. Yes, my model is not small. But I am not really sure what exactly COMSOL is doing. In the monitoring of the cluster you can see the RAM usage of the node. There you see that COMSOL uses a large amount of RAM in the beginning of the simulation (~70 GB). This decreases after some minutes and stays like this for the rest of the simulation (~15 GB). And then at the end it writes even more data to the drives.

I just realised that I forgot to mention something in my previous reply: I also pass an extra argument when starting the server. I've tested your suggestion of setting the `COMSOL_TMPDIR` environment variable.

Also, just out of curiosity: when I want to add an option to MPh, would I just add the arguments to line 81 in the MPh/server.py file?
Yes. That's exactly what you'd do if you wanted to "hot-patch" the library code. Just make sure the file you're changing is really the one that Python eventually runs. So then you'd want to install from source in "editable" mode.

This is also where I'd consider having users just add whatever options they want, though then they'd have to pass those extra arguments along themselves.

The relevant code is in lines 209 to 216 of server.py, at commit 3a95842.
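To illustrate what such a hot-patch amounts to, here is a hypothetical sketch of how the server's command line might be assembled with an optional pass-through for extra arguments. The function name, parameter names, and default port are invented for the example and are not MPh's actual implementation; `-tmpdir` is shown as one plausible extra option.

```python
def build_server_command(executable='comsol', port=2036, extra=None):
    """Assemble a Comsol server command line (illustrative only)."""
    command = [executable, 'mphserver', '-port', str(port)]
    if extra:
        # Splice in whatever extra options the user supplied,
        # e.g. '-tmpdir /scratch/temp'.
        command += extra.split()
    return command

print(build_server_command(extra='-tmpdir /scratch/temp'))
# → ['comsol', 'mphserver', '-port', '2036', '-tmpdir', '/scratch/temp']
```

Splitting a user-supplied string and appending it to the argument list is the simplest form of such a pass-through; a real implementation would want to validate the options.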
That's because recovery files are pretty useless when you're scripting things. They are something GUI users want. As far as MPh is concerned, recovery files should not be created, ever. If they are, then that's something we should fix. But maybe the above only prevents them for stand-alone clients, and has no effect in client-server mode. (I have a blind spot there.) Then we should just hard-code the corresponding option for the server as well.
Not that I'm aware of. It should, however, be configurable via the option mentioned above.
See discussion in issue #128. To what extent this command-line option is effective is however a different matter.
See discussion in issue #128. This should only be used as a last resort if there is an important argument that MPh is not providing already. Before, users would have to hot-patch the library in such a scenario.
The changes discussed are in MPh 1.2.3, released today.
Hello, I recently moved my COMSOL efforts to a cluster and want to use the MPh package to run and evaluate the simulations.
The cluster is running CentOS 7 and has COMSOL 6.1.0.252 and MPH 1.2.2 installed. When I now run a job on the cluster it sometimes crashes with the following error message in the log file:
I can, however, run the exact same code on the Windows machine that has the same installations without any problems. Could this be an OS issue?
Furthermore, I think the job only crashes with more complex models, as I have not seen this error with simpler ones.
Are there any things I need to watch out for when running MPH and COMSOL on the cluster?
If you need further information, please let me know. Thank you in advance for your efforts.