
remote_job_data.json file missing #26

Closed
FabiPi3 opened this issue Sep 21, 2023 · 9 comments

@FabiPi3 (Collaborator) commented Sep 21, 2023

I am testing jobflow_remote in a very simple case. I use an add job defined in some module:

from jobflow import job

@job
def add(a: int, b: int):
    print("The sum is", a + b)
    return a + b

and in the main file I am importing this job and using it in a flow:

from jobflow import Flow
from jobflow_remote import submit_flow
# add is imported from the module where it is defined (see above)

add_job = add(1, 2)
second_job = add(2, add_job.output)
flow = Flow([add_job, second_job])
submit_flow(flow, "my_worker")

Using a local worker (simple shell execution), everything works as expected. Now I tried to use a remote worker, which I defined in the config file. Checking the config file with jf project check gives only green ticks. When I submit the flow and inspect the job status with jf job list, the job state goes from READY to ONGOING [CHECKED_OUT] to ONGOING [UPLOADED] to ONGOING [RUNNING] to REMOTE_ERROR [FAILED]. Looking at the error with jf job info -err 1 gives the following message:

error_remote = file /mnt/beegfs2018/scratch/peschelf/jobflow-remote/f7/bd/e6/f7bde64b-695c-49d6-b8bd-350e7bd9765d/remote_job_data.json for job f7bde64b-695c-49d6-b8bd-350e7bd9765d does not exist 

And indeed, checking the run_dir specified in the error message, no such file is present:

$ ls
>>> FW.json  FW_offline.json  submit.sh

As I said, locally it works fine. It is probably not a bug, but I don't know what the issue is. Please help me.

For your reference, here is the definition of the worker:

my_remote_worker:
    scheduler_type: slurm
    work_dir: /mnt/beegfs2018/scratch/peschelf/jobflow-remote
    resources:
    pre_run: conda activate exwoflow
    post_run:
    timeout_execute: 60
    type: remote
    host: dune3.physik.hu-berlin.de
    user: peschelf
    port:
    password:
    key_filename:
    passphrase:
    gateway:
    forward_agent:
    connect_timeout:
    connect_kwargs:
    inline_ssh_env:
    keepalive: 60
    shell_cmd: bash
    login_shell: true
@gpetretto (Contributor)

Hi @FabiPi3,
I think the file is missing because the calculation was not executed correctly and the output was thus never generated. I will see if I can provide a clearer error in that case.

As for why the calculation fails, I can make a few guesses:

  • the calculation did not even start in Slurm, since the queue.out and queue.err files that should be there are missing from your folder
  • the conda environment on the remote system was not recognized properly
  • the module that contains the add function is not present in the remote environment.

Some additional information could be extracted by checking the content of the FW_offline.json file. If an exception was raised inside the code, it should be recorded there.
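For example (just a generic way to inspect it, not a jobflow-remote command), you could pretty-print the file in the run directory reported in the error:

$ python -m json.tool /mnt/beegfs2018/scratch/peschelf/jobflow-remote/f7/bd/e6/f7bde64b-695c-49d6-b8bd-350e7bd9765d/FW_offline.json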

@FabiPi3 (Collaborator, Author) commented Sep 22, 2023

Hi @gpetretto,
thanks for your answer.

Indeed, I have a problem with my conda env and also with the add function. After solving at least the add issue, I see a job running in the queue, but still no output and error files. After checking the queue system, I found out that the error and output files are generated in a completely different place, and I do not understand why.

Here is the slurm info output:

JobId=29132 JobName=add
   UserId=peschelf(7631) GroupId=nobody(501) MCS_label=N/A
   Priority=4294878591 Nice=0 Account=(null) QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=127:0
   RunTime=00:00:35 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2023-09-22T10:15:53 EligibleTime=2023-09-22T10:15:53
   AccrueTime=Unknown
   StartTime=2023-09-22T10:15:53 EndTime=2023-09-22T10:16:28 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2023-09-22T10:15:53
   Partition=debug AllocNode:Sid=dune3:48782
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node109
   BatchHost=node109
   NumNodes=1 NumCPUs=32 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1,billing=32
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/beegfs2018/scratch/peschelf/jobflow-remote/4e/6f/ab/4e6fab41-8ab8-4dd6-9cbd-de754a946ff0/submit.sh
   WorkDir=/mnt/beegfs2018/users/stud/peschelf
   StdErr=/mnt/beegfs2018/users/stud/peschelf/queue.err
   StdIn=/dev/null
   StdOut=/mnt/beegfs2018/users/stud/peschelf/queue.out
   Power=

The file submit.sh is in the correct place, the work dir specified by jobflow-remote. It contains:

#!/bin/bash

#SBATCH --job-name=add
#SBATCH --output=queue.out
#SBATCH --error=queue.err
cd /mnt/beegfs2018/scratch/peschelf/jobflow-remote/4e/6f/ab/4e6fab41-8ab8-4dd6-9cbd-de754a946ff0
conda activate exwoflow
export OMP_NUM_THREADS=4
rlaunch singleshot --offline

But as you can see, WorkDir=/mnt/beegfs2018/users/stud/peschelf is something different, and it is not a directory I have specified anywhere. I guess that should not be the case.

Looking in there, I can find the queue.err file, which indicates that I have an issue with my conda env, and therefore the rlaunch command could not be found. But that's the next step.

Why are the queue.out and queue.err files in the wrong place?

PS: The FW_offline.json file only contains the launch_id, I guess because FireWorks hasn't even started.

@gpetretto (Contributor)

Thanks for the detailed report. I don't know why you get a different workdir. On all the clusters I have tested, the WorkDir was set correctly, and thus so was the location of the queue output files. Before submitting, the code should change to the run directory, so without access to your system it is hard to say why this is not happening in your case.
Just to better understand, I suppose that /mnt/beegfs2018/users/stud/peschelf is your home folder. Can you confirm that?

To solve the problem of the location of the files, I think the best solution would be to explicitly set the full path in #SBATCH --output=, instead of just the file name. It will be safer in any case. Since a cd is done at the beginning of the submission script, this should not be a problem for the rest of the execution. I will update the code to address this.
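As a sketch (using the run directory from your submit.sh above), the generated header would then look something like:

#SBATCH --output=/mnt/beegfs2018/scratch/peschelf/jobflow-remote/4e/6f/ab/4e6fab41-8ab8-4dd6-9cbd-de754a946ff0/queue.out
#SBATCH --error=/mnt/beegfs2018/scratch/peschelf/jobflow-remote/4e/6f/ab/4e6fab41-8ab8-4dd6-9cbd-de754a946ff0/queue.err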

@FabiPi3 (Collaborator, Author) commented Sep 22, 2023

Interestingly not:

$ echo $HOME
>>> /users/stud/peschelf

I somehow have the feeling it is mysteriously combining the home dir and the run dir 😅

@FabiPi3 (Collaborator, Author) commented Sep 22, 2023

So I tested it: if you do not set the full path, the out and err files will be written in the directory from which you executed the sbatch command. Maybe that is the case here?
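As a hypothetical illustration (reusing the paths from this thread), the behaviour can be reproduced like this:

$ cd /tmp
$ sbatch /mnt/beegfs2018/scratch/peschelf/jobflow-remote/4e/6f/ab/4e6fab41-8ab8-4dd6-9cbd-de754a946ff0/submit.sh
# with only --output=queue.out and --error=queue.err in the script,
# queue.out and queue.err are created in /tmp, not in the directory the script cd's into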

@FabiPi3 (Collaborator, Author) commented Sep 22, 2023

And another comment on this: Is there a specific reason why you are not including the job-id in the out file name? Do you think it is not necessary?

Something like:
#SBATCH --output queue-%j.out

@gpetretto (Contributor)

So I tested it: if you do not set the full path, the out and err files will be written in the directory from which you executed the sbatch command. Maybe that is the case here?

This is indeed the standard behavior that I have always encountered. Before submitting the script, the code should change directory to the folder where the script is copied (see with self.connection.cd(workdir): in the source). In my case it is working, so I have no good hypothesis about why it does not happen on yours.

And another comment on this: Is there a specific reason why you are not including the job-id in the out file name? Do you think it is not necessary?

Something like: #SBATCH --output queue-%j.out

There are two somewhat related issues. The filename is decided by jobflow-remote and passed to qtoolkit to generate the script, and this should work for all the possible schedulers. This means that I cannot directly pass queue-%j.out, as it would work for Slurm but not for other schedulers (it certainly will not work for the Shell scheduler).
In addition, the CLI offers an option to retrieve the information from those files (and we may consider storing the output in the DB as well). In order to retrieve the files it is more convenient to know their names exactly, without needing to check which kind of scheduler generated them.
All this could be handled by partially delegating the handling of the names to qtoolkit, but it would probably be rather involved and doesn't seem worth the effort. Do you think it would be important to have it?
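As a hypothetical illustration of why the fixed names are convenient: fetching the files by hand only requires knowing the run directory, regardless of the scheduler, e.g.

$ scp peschelf@dune3.physik.hu-berlin.de:/mnt/beegfs2018/scratch/peschelf/jobflow-remote/4e/6f/ab/4e6fab41-8ab8-4dd6-9cbd-de754a946ff0/queue.out .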

@FabiPi3 (Collaborator, Author) commented Sep 22, 2023

I guess each directory will be used only once for execution? So if there is no risk of overwriting these files, it should be fine.

Thanks for your answers so far, but I am still struggling to get things running. Should I continue in this thread (maybe not the right place?) or do you prefer a more private communication? It would be really great if you could keep helping me and answering my questions.

@gpetretto (Contributor)

I pushed a change to set the absolute paths for the queue files: 0d539fe.

I will close this and we can continue the discussion privately.
