info.json is not being saved #830

Sud0x67 · 2021-07-01T12:20:17Z

Hi, I use sacred for my AI experiments and it helps me a lot. Recently, though, I found a problem. I use the info dict to save some results of my experiments, and usually it works well. But sometimes info.json is not saved, or only half of it is saved. Is there any solution?
I use sacred 0.8.2 on Python 3.7.

thequilo · 2021-07-01T12:30:38Z

Hi, I assume you use the FileStorageObserver. The info dict is only saved in the heartbeat events, not in the interrupted, failed, or completed events. The heartbeat should usually be stopped correctly so that all data is written, but what you report suggests that this is not always the case. Does this happen only for failed/interrupted experiments, or also for experiments that finished correctly?

Sud0x67 · 2021-07-01T13:39:26Z

Yes, I use the FileStorageObserver. This happens for experiments that finished correctly, sometimes but not always. The other files, including cout.txt, config.json, and run.json, are saved correctly; only info.json is affected.

thequilo · 2021-07-01T14:01:37Z

Do you have a minimal example that reproduces this issue? It seems to work for me.

The heartbeat events are processed in a background thread. It could be that this thread dies, for some reason, before it can perform the final write.

Sud0x67 · 2021-07-01T14:20:16Z

Thanks for your reply. I am sorry that I can't provide a minimal example, because I am working on a complex MARL project.
My project is based on this repo, and the command python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=3m can reproduce this issue. However, it is not easy to understand the code and reproduce the issue. I will comment here if I have any ideas about it. Thanks for your help!

stale · 2022-04-16T16:30:50Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

vnmabus · 2022-07-01T08:10:43Z

I have the same issue (Sacred 0.8.2). Is the info dict not saved on completion? That sounds like a bug.

thequilo · 2022-07-01T09:14:23Z

The info dict is not saved on completion. It is not passed to the completed_event of the observer. I don't know exactly why this is the case. I guess the idea was that if the heartbeat event is executed correctly, there is no need to save the info dict in the completion event, because it does not change between the last heartbeat and the completion. But if the heartbeat fails, this assumption no longer holds.

@vnmabus do you have a minimal example to reproduce the issue? Or does it only appear in larger experiments?

vnmabus · 2022-07-01T09:27:18Z

So far it has happened only a few times, in medium to large experiments on the cluster. I have added a sleep(11) call as a workaround for now, but that is not ideal, and I still have to relaunch the failed experiments.

vnmabus · 2023-08-23T10:19:40Z

This should be saved on completion. I have lost countless hours of human and computing time relaunching half-completed experiments because of this.

thequilo · 2023-08-28T07:11:04Z

That's really unfortunate. Do you have extremely large data in your info.json? Maybe the heartbeat thread gets killed if its write takes longer than processing the completed event. In that case, we could make the main thread wait longer for the background heartbeat thread. There is currently a timeout of 2s in sacred/sacred/run.py (line 288 in 17c5306):

self._heartbeat.join(timeout=2)

Increasing or removing this timeout could solve the issue.
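To illustrate why a join with a timeout can leave a write unfinished, here is a minimal standard-library sketch. It does not use Sacred's code: `slow_final_write` is a hypothetical stand-in for a heartbeat that is still writing a large info.json, the durations are shortened, and `daemon=True` is an assumption made only for this sketch.

```python
import threading
import time


def slow_final_write(done):
    # Hypothetical stand-in for a heartbeat that is still writing a large file.
    time.sleep(0.5)
    done.set()


done = threading.Event()
# daemon=True is an assumption for this sketch: a daemon thread does not
# keep the interpreter alive once the main thread exits.
worker = threading.Thread(target=slow_final_write, args=(done,), daemon=True)
worker.start()

# join() with a timeout returns once the timeout expires,
# whether or not the thread has finished its work.
worker.join(timeout=0.1)
still_running_after_timeout = worker.is_alive()

# join() without a timeout blocks until the thread has really finished.
worker.join()
finished_after_full_join = done.is_set()
```

In this sketch the first join returns while the worker is still running. If the main thread exited at that point, the write would be cut short, which would match the half-saved files reported above.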

Saving this information on completed is not as easy as it sounds because it is a breaking change and could create a race condition with the background thread (right?). But it could still be better than half-saved files.
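As a side note on half-saved files: a general technique, not taken from Sacred's code, is to write to a temporary file in the same directory and then rename it over the target. The helper name `atomic_write_json` and the example data are hypothetical. This sketch only prevents a truncated file; it addresses neither the missing final save nor the possible race with the background thread.

```python
import json
import os
import tempfile


def atomic_write_json(data, path):
    """Write data as JSON so that path holds either the old or the complete new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # so that os.replace can swap it in as a single rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temporary file and re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage with made-up data in a scratch directory.
out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "info.json")
atomic_write_json({"test_scores": [0.91, 0.93]}, target)
```

If the process dies during the write, only the temporary file is incomplete and any earlier info.json stays intact.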

vnmabus · 2023-08-28T07:17:14Z

Yes, I have large data in info (I store all of train and test scores and times).

My proposal was to join the heartbeat thread. I was not aware that this was done using a timeout. What is the reason for that? Can the heartbeat not stop?

thequilo · 2023-08-28T07:24:59Z

I don't know the reason. It was introduced in 95234cd, which seems to address issue #273.

I believe that there is no reason for the FileStorageObserver to hang on heartbeat, but the MongoObserver seems to have issues where it sometimes doesn't exit. I only use the FileStorageObserver, so I can't confirm that. Even in that case, I would argue that a hanging experiment script is better than broken files. At least then it is obvious that something went wrong.
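One possible middle ground between hanging forever and silently losing data is to keep a timeout but fail loudly when the thread has not finished. The sketch below uses only the standard library. `stop_and_check` and `worker` are hypothetical names, not part of Sacred, and the durations are shortened for illustration.

```python
import threading
import time


def stop_and_check(thread, stop_event, timeout):
    """Signal the worker to stop, wait up to timeout seconds, and raise if it is still running."""
    stop_event.set()
    # timeout=None would block until the thread finishes, however long that takes.
    thread.join(timeout=timeout)
    if thread.is_alive():
        raise RuntimeError(
            "background thread did not finish within the timeout; "
            "its final write may be incomplete"
        )


def worker(stop_event):
    stop_event.wait()  # idle until asked to stop
    time.sleep(0.3)    # stand-in for a slow final write


stop = threading.Event()
t = threading.Thread(target=worker, args=(stop,))
t.start()

try:
    stop_and_check(t, stop, timeout=0.05)
    outcome = "finished"
except RuntimeError:
    outcome = "timed out"

t.join()  # let the stand-in write complete before the script ends
```

Here the timeout is shorter than the simulated write, so the helper raises instead of returning quietly, which makes the problem visible in the same way a hanging script would.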

stale bot added the stale label Apr 16, 2022

stale bot closed this as completed Apr 28, 2022

thequilo reopened this Jul 1, 2022

stale bot removed the stale label Jul 1, 2022
