
MultiBackendJobManager.run_jobs() doesn't add new jobs to existing job_tracker #558

Open
VincentVerelst opened this issue Apr 16, 2024 · 2 comments


@VincentVerelst (Contributor)

The MultiBackendJobManager.run_jobs() method takes two inputs: df, a DataFrame describing all the jobs to run, and output_file, the path to a CSV file used to track the status of those jobs.
If the output_file already exists, however, run_jobs() ignores the df input and simply resumes from the jobs already listed in the output_file, as seen in the code below:

output_file = Path(output_file)
if output_file.exists() and output_file.is_file():
    # Resume from existing CSV
    _log.info(f"Resuming `run_jobs` from {output_file.absolute()}")
    df = pd.read_csv(output_file)
    status_histogram = df.groupby("status").size().to_dict()
    _log.info(f"Status histogram: {status_histogram}")

As a result, once a MultiBackendJobManager has been run with a given output_file, a second run with the same output_file cannot add any new jobs.
Would it be possible, when output_file already exists, for run_jobs() to create the union of the input df and the existing output_file? Or is there a good reason not to? A sketch of what that could look like is shown below.
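For illustration, a minimal sketch of such a union, assuming each job can be identified by some unique key column (the "id" column below is hypothetical, not something run_jobs guarantees):

import pandas as pd
from pathlib import Path

def merge_job_dfs(df: pd.DataFrame, output_file: Path) -> pd.DataFrame:
    """Union of the input job DataFrame and an existing job tracker CSV.

    Rows already tracked in the CSV keep their recorded status;
    rows from `df` that are not in the CSV yet are appended as new jobs.
    """
    existing = pd.read_csv(output_file)
    # Hypothetical assumption: an "id" column uniquely identifies each job.
    new_rows = df[~df["id"].isin(existing["id"])]
    return pd.concat([existing, new_rows], ignore_index=True)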

@soxofaan (Member) commented Apr 22, 2024

I haven't played a lot with MultiBackendJobManager myself and, to be honest, don't know the practical usage details.

@jdries (Collaborator) commented Apr 22, 2024

@VincentVerelst this is certainly a possibility. I suggest that data engineering is free to extend this job manager as needed.
The main reason not to do it would be to avoid unexpected behaviour: you really don't want your job CSV to get corrupted and lose all its info.
What I have sometimes done in the past is use a separate script to make the necessary updates to the CSV while the job manager script is stopped, and then, after verifying the CSV, restart the job manager with the updated job list. A sketch of such a script follows below.
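For illustration, a minimal sketch of such an offline update script (the file names are hypothetical, and the assumption that rows with status "not_started" are picked up as new jobs should be verified against the tracker CSV at hand):

import pandas as pd

TRACKER_CSV = "jobs.csv"       # the run_jobs output_file (hypothetical name)
NEW_JOBS_CSV = "new_jobs.csv"  # extra jobs to append (hypothetical name)

# Run this only while the job manager script is stopped.
tracker = pd.read_csv(TRACKER_CSV)
new_jobs = pd.read_csv(NEW_JOBS_CSV)

# Mark the added rows as not started yet (assumed status convention).
new_jobs["status"] = "not_started"

updated = pd.concat([tracker, new_jobs], ignore_index=True)

# Write to a separate file first, so the original tracker CSV cannot get
# corrupted; rename it over TRACKER_CSV only after manual verification.
updated.to_csv(TRACKER_CSV + ".tmp", index=False)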
