remote_job_data.json file missing #26
Hi @FabiPi3, as for why the calculation fails I can make a few guesses. Some additional information could be extracted by checking the content of the FW_offline.json file: if an exception was raised inside the code, it should be registered there.
Hi @gpetretto, indeed I have a problem with my conda env. Here is the slurm info output:
The file …
But as you can see, the … Looking in there, I can find the … Why are the …? PS: The …
Thanks for the detailed report. I don't know why you get a different workdir. In all the clusters I have tested, the workdir was correctly set, and thus also the location of the queue output files. Before submitting, the code should change the directory, so it is hard to say why this is not happening in your case without access to the system. To solve the problem of the location of the files, I think the best solution would be to explicitly set the full path for the queue output files.
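For SLURM this would amount to absolute paths in the generated script header, along these lines (a sketch; the paths are placeholders):

```shell
#!/bin/bash
# With relative paths, SLURM resolves --output/--error against the
# directory where sbatch was invoked, not where the script lives,
# which is why absolute paths pin the files to the run directory.
#SBATCH --output=/full/path/to/run_dir/queue.out
#SBATCH --error=/full/path/to/run_dir/queue.err
```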
Interestingly not:
I somehow have the feeling it is mysteriously combining the home dir and the run dir 😅
So I tested it, and if you do not set the full path, the out and err files will be written in the directory from where you executed the submission command.
And another comment on this: is there a specific reason why you are not including the job id in the out file name? Do you think it is not necessary? Something like:
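With plain SLURM this is usually done with the `%j` filename pattern (a sketch; as discussed in the next comment, a scheduler-specific pattern like this cannot be passed portably through qtoolkit):

```shell
# %j expands to the SLURM job id, giving a unique output file per job
#SBATCH --output=queue-%j.out
#SBATCH --error=queue-%j.err
```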
This is indeed the standard behavior that I have always encountered. Before submitting the script, it should change directory to the folder where the script is copied.
There are two somewhat related issues. The filename is decided by jobflow-remote and passed to qtoolkit to generate the script, and this should work for all the possible schedulers. This means that I cannot directly pass a scheduler-specific placeholder in the filename.
I guess each directory will be used only once for execution? So if there is no risk of overwriting these files, it should be fine. Thanks for your answers so far, but I am still struggling to get things running. Should I continue in this thread (maybe not the right place?) or do you prefer a more private communication? It would be really great if you could keep helping me and answering my questions.
I pushed a change to set the absolute paths for the queue files: 0d539fe. I will close this and we can continue the discussion privately. |
I am testing jobflow_remote in a very simple case; I use an add job defined in some module:
and in the main file I am importing this job and using it in a flow:
Using a local worker (simple shell execution), everything works as expected. Now I tried to use a remote worker, which I defined in the config file. Checking the config file with `jf project check` leads to only green ticks. Now I submit the flow and inspect the job status with `jf job list`, and the job state goes from `READY` to `ONGOING [CHECKED_OUT]` to `ONGOING [UPLOADED]` to `ONGOING [RUNNING]` to `REMOTE_ERROR [FAILED]`. Looking at the error message with `jf job info -err 1` leads to the error message: … And indeed, checking the specified `run_dir` in the error message, no such file is present:
As I said, locally it works fine. Probably it is not a bug, but I don't know what the issue is. Please help me.
For your reference, here is the definition of the worker:
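For comparison, a remote SLURM worker section in the project YAML typically looks like this (a sketch with placeholder values; field names may differ between jobflow-remote versions):

```yaml
workers:
  my_remote_worker:
    type: remote
    scheduler_type: slurm
    host: cluster.example.com
    user: myuser
    work_dir: /home/myuser/jfr_runs
    pre_run: source ~/.bashrc; conda activate jfr
```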