info.json is not be saved #830

Open
Sud0x67 opened this issue Jul 1, 2021 · 12 comments

Comments

@Sud0x67

Sud0x67 commented Jul 1, 2021

Hi, I use sacred for my AI experiments and it helps me a lot. But recently I found that something is wrong with sacred. I use the info dict to save some results of my experiments, and usually it works well. But sometimes info.json is not saved, or only half of it is saved. Is there any solution?
I use sacred 0.8.2 on Python 3.7.

@thequilo
Collaborator

thequilo commented Jul 1, 2021

Hi, I assume you use the FileStorageObserver. The info dict is only saved in the heartbeat events, but not on interruption, failed, or completed events. The heartbeat should usually be stopped correctly so that all data is written, but what you report looks like this is not always the case. Does this happen only for failed/interrupted experiments or also for experiments that finished correctly?

@Sud0x67
Author

Sud0x67 commented Jul 1, 2021

Yes, I use the FileStorageObserver. This happens for experiments that finished correctly, sometimes but not always. The other files, including cout.txt, config.json, and run.json, are saved correctly; only info.json is affected.

@thequilo
Collaborator

thequilo commented Jul 1, 2021

Do you have a minimal example that reproduces this issue? It seems to work for me.

The heartbeat events are processed in a background thread. It could be that this thread dies, for some reason, before it can perform the final write.
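To make the mechanism concrete, here is a minimal standard-library sketch of a heartbeat-style writer. It is a toy model, not Sacred's actual code: the class name `HeartbeatWriter`, its parameters, and the 10-second default interval are all assumptions made for illustration. The point it shows is that the last update to the info dict only reaches disk if the background thread gets to run its final write after being asked to stop.

```python
import json
import threading
from pathlib import Path


class HeartbeatWriter:
    """Toy model of a periodic background writer (NOT Sacred's implementation)."""

    def __init__(self, info, path, interval=10.0):
        self.info = info                      # dict shared with the main thread
        self.path = Path(path)
        self.interval = interval              # seconds between periodic writes
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _write(self):
        self.path.write_text(json.dumps(self.info))

    def _loop(self):
        # Event.wait returns False on timeout (time for a periodic write)
        # and True once a stop has been requested.
        while not self._stop_event.wait(self.interval):
            self._write()
        # Final write: if the thread never gets here, the latest
        # contents of `info` are missing from the file.
        self._write()

    def start(self):
        self._thread.start()

    def stop(self, timeout=None):
        """Request a stop and wait; return True if the thread finished."""
        self._stop_event.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

In this model, a value stored in `info` after the last periodic write is only persisted by the final `_write()` call, which is why the thread finishing cleanly matters.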

@Sud0x67
Author

Sud0x67 commented Jul 1, 2021

Thanks for your reply. I am sorry that I can't provide a minimal example, because I am dealing with a complex project about MARL.
My project is based on this repo, and the command python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=3m can reproduce this issue. However, it is not easy to understand the code and reproduce the issue. I will comment here if I have any ideas about it. Thanks for your help!

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022
@stale stale bot closed this as completed Apr 28, 2022
@vnmabus
Contributor

vnmabus commented Jul 1, 2022

I have the same issue (Sacred 0.8.2). Is the info dict not saved on completion? That sounds like a bug.

@thequilo
Collaborator

thequilo commented Jul 1, 2022

The info dict is not saved on completion. It is not passed to the completed_event of the observer. I don't know exactly why this is the case. I guess the idea was that if the heartbeat event is executed correctly, then there is no need to save the info dict in the completion event, because it does not change between the last heartbeat and the completion. But if the heartbeat fails, this assumption no longer holds.

@vnmabus do you have a minimal example to reproduce the issue? Or does it only appear in larger experiments?

@thequilo thequilo reopened this Jul 1, 2022
@stale stale bot removed the stale label Jul 1, 2022
@vnmabus
Contributor

vnmabus commented Jul 1, 2022

So far it has happened only a few times, and in medium to large experiments on the cluster. I have added a sleep(11) call to patch it for now, but that is not ideal, and I still have to relaunch the failed experiments.

@vnmabus
Contributor

vnmabus commented Aug 23, 2023

This should be saved on completion. I have lost countless hours of human and computing time relaunching half-completed experiments because of this.

@thequilo
Collaborator

That's really unfortunate. Do you have extremely large data in your info.json? Maybe the heartbeat thread gets killed if its write takes longer than processing the completed event. In that case, we could make the main thread wait longer for the background heartbeat thread. There is currently a timeout of 2s in the call self._heartbeat.join(timeout=2). Increasing or removing this could solve the issue.
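The effect of a join timeout can be illustrated with a small standard-library sketch. It does not use Sacred; `slow_final_write` is a hypothetical stand-in for a large info.json dump, and the durations are arbitrary. It relies on the fact that `Thread.join(timeout)` always returns `None`, so the caller has to check `is_alive()` to learn whether the thread actually finished.

```python
import threading
import time


def slow_final_write(done, duration):
    """Hypothetical stand-in for a large final write; sleep simulates slow I/O."""
    time.sleep(duration)
    done.set()


def finished_within(duration, timeout):
    """Return True if a write lasting `duration` seconds completes within `timeout`."""
    done = threading.Event()
    worker = threading.Thread(
        target=slow_final_write, args=(done, duration), daemon=True
    )
    worker.start()
    worker.join(timeout=timeout)      # returns after `timeout` even if still running
    finished_in_time = not worker.is_alive()
    worker.join()                     # wait fully so the demo leaves no stray thread
    return finished_in_time
```

When the simulated write outlasts the timeout, `finished_within` reports `False`: the main thread would have moved on while the write was still in progress, which is consistent with the half-written files described in this thread.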

Saving this information on completed is not as easy as it sounds because it is a breaking change and could create a race condition with the background thread (right?). But it could still be better than half-saved files.
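One common way to address both concerns (a race between two writers, and half-saved files) is to serialize writes with a lock and replace the file atomically. The sketch below is an assumption about how this could be done in general, not a description of Sacred's code; `save_info_atomically` and `_write_lock` are hypothetical names.

```python
import json
import os
import tempfile
import threading

# Hypothetical lock that both the heartbeat path and the completion path would share.
_write_lock = threading.Lock()


def save_info_atomically(info, path):
    """Write `info` as JSON so readers see either the old or the new file, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    with _write_lock:
        # Write to a temporary file in the same directory, so that the final
        # rename stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(info, f)
                f.flush()
                os.fsync(f.fileno())
            # os.replace swaps the new file into place in a single step.
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)       # clean up the partial temporary file
            raise
```

With this pattern, an interrupted write leaves the previous info.json intact instead of a truncated one, although the most recent values would still be lost.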

@vnmabus
Contributor

vnmabus commented Aug 28, 2023

Yes, I have large data in info (I store all of train and test scores and times).

My proposal was to join the heartbeat thread. I was not aware that this was done using a timeout. What is the reason for that? Can the heartbeat not stop?

@thequilo
Collaborator

I don't know the reason. It was introduced here: 95234cd which seems to be addressing this issue: #273.

I believe that there is no reason for the FileStorageObserver to hang on heartbeat, but the MongoObserver seems to have issues where it sometimes doesn't exit. I only use the FileStorageObserver, though, so I can't confirm. Even in that case, I would argue that a hanging experiment script is better than broken files. At least then it is obvious that something went wrong.
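A middle ground between hanging forever and silently producing broken files would be to keep a timeout but fail loudly when it expires. The helper below is only a sketch of that idea using the standard library; `join_or_raise` is a hypothetical name and not part of Sacred.

```python
import threading


def join_or_raise(thread, timeout):
    """Wait up to `timeout` seconds for `thread`; raise if it is still running."""
    thread.join(timeout)
    if thread.is_alive():
        # Make the failure visible instead of continuing with a possibly
        # incomplete write.
        raise RuntimeError(
            f"background thread did not finish within {timeout} seconds; "
            "output files may be incomplete"
        )
```

A caller could catch this error to mark the run as failed, so that an incomplete info.json is not mistaken for a finished result.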
