
Add guard for missing additional_store before executing job #76

Closed
ml-evs opened this issue Feb 19, 2024 · 2 comments · Fixed by #59
Comments

@ml-evs
Member

ml-evs commented Feb 19, 2024

As tested in #59, if a job requires an additional store for its output, and that store is not defined in the JFR config, then the job will only fail at the point of trying to access that store.

I think we should add a guard in the runner to prevent such a job from being submitted. I'm not sure what JobState it should be left with (perhaps a new one, HELD or some such, so it can easily be rerun once the runner config is updated?). Thoughts, @gpetretto?
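For context, a minimal sketch of the failure mode being described; the store name ("data") and output field ("trajectory") are illustrative, not taken from #59:

```python
from jobflow import job


# Route the (potentially large) "trajectory" key of the output to an
# additional store named "data". If the jobflow-remote project config does
# not define a "data" store, nothing currently complains at submission time;
# the job only fails remotely when it tries to write to that store.
@job(data="trajectory")
def compute(structure):
    return {"energy": -1.0, "trajectory": [structure] * 100}
```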

@gpetretto
Contributor

Thanks for raising this point. It would indeed be better to stop it before the execution of the Job, also because I am afraid that one could not just rerun the last "remote" step, but would need to rerun the whole Job.

I would consider a few points:

  • I am open to discussing this, but I would not introduce a new state, as there are already quite a lot. I think the error state is fine. I would raise a RemoteError with no_retry=True here (a rough sketch follows this list). This prevents the runner from retrying multiple times.
  • I think the check in Add testing and runtime checks for additional stores #59 is good; we can refine it if needed. I still need to go through the additional store logic again.
  • Maybe the same check could already be performed when the jobs are inserted, directly refusing to add them to the DB? This would inform the user of the issue immediately, instead of waiting for the job to be submitted. The other check needs to stay in place as well, since a Job with a specific additional store may be added dynamically, or the user could change the config file, and it could thus reach submission anyway.
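A rough sketch of the guard suggested in the first bullet above, raising `RemoteError` with `no_retry=True` when a required additional store is missing. The import path, the `RemoteError` constructor signature, the `_kwargs` introspection, and the `additional_stores` attribute are assumptions for illustration, not a description of the actual jobflow-remote/jobflow internals:

```python
# NOTE: the import path below is a guess; use jobflow-remote's actual RemoteError.
from jobflow_remote.remote.errors import RemoteError


def get_additional_store_names(job) -> set[str]:
    """Collect the additional store names a Job was decorated with,
    e.g. {"data"} for @job(data="trajectory").

    Assumption: the decorator kwargs are kept on ``Job._kwargs``; the real
    check from #59 may extract this information differently.
    """
    return set(getattr(job, "_kwargs", {}) or {})


def check_additional_stores(job, jobstore) -> None:
    """Fail fast if the Job references additional stores that the configured
    JobStore does not provide."""
    requested = get_additional_store_names(job)
    # Assumption: JobStore exposes its extra stores as a name -> Store mapping.
    available = set(jobstore.additional_stores or {})
    missing = requested - available
    if missing:
        raise RemoteError(
            f"Job {job.uuid} requires additional store(s) {sorted(missing)} "
            "that are not defined in the project configuration.",
            no_retry=True,  # a configuration problem; retrying will not help
        )
```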

@ml-evs
Member Author

ml-evs commented Feb 20, 2024

* I am open to discussing this, but I would not introduce a new state, as there are already quite a lot. I think the error state is fine. I would raise a `RemoteError` with `no_retry=True` here. This prevents the runner from retrying multiple times.

I agree a new state is only worth introducing if we then act on it in a specific way; the remote error state would have the correct behaviour (in that the job can be rerun after a config update without issue), but it feels like odd naming for a problem with the "local" config (or the runner config, at least). I will implement it with `RemoteError` for now (and thanks for the `no_retry` tip).

* Maybe the same check could already be performed when the jobs are inserted, directly refusing to add them to the DB? This would inform the user of the issue immediately, instead of waiting for the job to be submitted. The other check needs to stay in place as well, since a Job with a specific additional store may be added dynamically, or the user could change the config file, and it could thus reach submission anyway.

Yes, I thought about this; it would need to be run immediately after the call to `job.to_db_dict()` (as far as I can recall). If I can extract a simple function from the existing check, I will try calling it in both places.
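A sketch of reusing that same helper at insertion time; the surrounding function, its arguments, and the way documents are written here are purely illustrative, with only `job.to_db_dict()` taken from the comment above:

```python
def insert_flow(flow, jobstore, queue_store):
    """Illustrative insertion path: run the same guard before writing each
    Job document, refusing the whole submission on a misconfiguration."""
    docs = []
    for job, _parents in flow.iterflow():
        # Same check as in the runner, surfaced at insertion time so the user
        # learns about the missing store immediately.
        check_additional_stores(job, jobstore)
        docs.append(job.to_db_dict())
    queue_store.update(docs, key="uuid")
```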
