The MultiBackendJobManager.run_jobs() method takes as input a df, a DataFrame with information about all the jobs to run, and an output_file, the path to a CSV file used to track the status of all the jobs.
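For context, a minimal usage sketch (the connection URL, backend name, collection id and start_job body below are placeholders, not taken from this issue):

import pandas as pd
import openeo
from openeo.extra.job_management import MultiBackendJobManager

# Hypothetical job list: one row per job to run
jobs_df = pd.DataFrame({"tile_id": ["A1", "A2"], "year": [2022, 2022]})

connection = openeo.connect("openeo.example.org").authenticate_oidc()

manager = MultiBackendJobManager()
manager.add_backend("example-backend", connection=connection, parallel_jobs=2)

def start_job(row, connection, **kwargs):
    # Placeholder: build and return the batch job for this row
    cube = connection.load_collection("SOME_COLLECTION")
    return cube.create_job(title=f"job-{row['tile_id']}")

manager.run_jobs(df=jobs_df, start_job=start_job, output_file="jobs.csv")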
If the output_file already exists, however, the run_jobs() method will ignore the df input and continue from the existing jobs in the output_file, as seen in the code below:
output_file = Path(output_file)
if output_file.exists() and output_file.is_file():
    # Resume from existing CSV
    _log.info(f"Resuming `run_jobs` from {output_file.absolute()}")
    df = pd.read_csv(output_file)
    status_histogram = df.groupby("status").size().to_dict()
    _log.info(f"Status histogram: {status_histogram}")
As a result, once a MultiBackendJobManager has been run with a given output_file, it is not possible to add new jobs on a subsequent run with the same output_file.
Would it be possible for run_jobs(), when output_file already exists, to take the union of the input df and the existing output_file? Or is there a good reason not to?
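For illustration, such a union-based resume could look roughly like the sketch below. This is a hypothetical variation on the snippet above, not existing openeo behaviour, and it assumes a column such as tile_id that uniquely identifies a job:

output_file = Path(output_file)
if output_file.exists() and output_file.is_file():
    _log.info(f"Resuming `run_jobs` from {output_file.absolute()}")
    existing = pd.read_csv(output_file)
    # Keep the jobs already tracked (and their status), and append rows from the
    # input df that are not in the CSV yet, matched on a hypothetical unique key.
    new_rows = df[~df["tile_id"].isin(existing["tile_id"])]
    df = pd.concat([existing, new_rows], ignore_index=True)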
@VincentVerelst this is certainly a possibility. I suggest that data engineering is free to extend this job manager as needed.
The main reason not to do it would be to avoid unexpected behaviour: you really don't want your job CSV to get corrupted and lose all its info.
What I have sometimes done in the past is use a separate script to make the necessary updates to the CSV while the job manager script is stopped, and then, after verifying the CSV, restart the job manager with the updated job list.
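A sketch of that kind of update script (the file names and the unique tile_id key are assumptions for the example; verify the resulting CSV before restarting the job manager):

import pandas as pd

tracked = pd.read_csv("jobs.csv")
extra = pd.read_csv("extra_jobs.csv")  # hypothetical file listing the new jobs to add

# Only append jobs that are not tracked yet, so existing status info is preserved.
extra = extra[~extra["tile_id"].isin(tracked["tile_id"])]
pd.concat([tracked, extra], ignore_index=True).to_csv("jobs.csv", index=False)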