[Core] Async aggregator freezes during evaluation #148
Comments
I will take a look at this later. In my previous experiments, the training loss was reported at the end of every aggregation. I will also look into the freeze.
Thank you, @AmberLJC. I've attached a screenshot of my TB output.
I found that the freeze happens because of checkin_period (the simulated client arrival): clients check in too infrequently, so the aggregator hangs when no client checks in. I will modify the config.
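To make the failure mode above concrete, here is a minimal sketch of a buffered aggregator that waits for client updates. It is not FedScale's code: `collect_updates`, `buffer_size`, and `timeout_s` are hypothetical names chosen for illustration. The sketch assumes the aggregator needs a fixed number of updates before it can proceed, and it shows how a bounded wait turns a silent hang into a detectable stall when too few clients arrive.

```python
import queue


def collect_updates(arrivals, buffer_size, timeout_s):
    """Pull up to buffer_size updates; stop early if none arrives within timeout_s.

    Hypothetical helper, not part of FedScale. A plain arrivals.get() with no
    timeout would block forever once clients stop checking in.
    """
    collected = []
    while len(collected) < buffer_size:
        try:
            collected.append(arrivals.get(timeout=timeout_s))
        except queue.Empty:
            # No client checked in during the wait window: report a stall
            # instead of hanging.
            break
    return collected


# Simulate only 3 client check-ins while the aggregator wants 5 updates.
arrivals = queue.Queue()
for client_id in range(3):
    arrivals.put({"client_id": client_id})

updates = collect_updates(arrivals, buffer_size=5, timeout_s=0.1)
stalled = len(updates) < 5
print(f"collected={len(updates)} stalled={stalled}")
```

With an unbounded wait, the same scenario would block indefinitely and produce no error, which matches the symptom reported in this issue.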
As for the training loss, I think it depends on the buffer size and other configs, though the original paper suggests
Thanks for the investigation. From looking at the code, load_global_model() in the async executor expects a model_id, but the Test() function and testing_handler() aren't customized in the async executor. I'm now seeing this error:
@AmberLJC Can you please take a look when you are free? Thanks!
Sure
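For readers following the thread, the executor mismatch described in the comments (a loader that expects a model_id, called from a testing path that was not customized) can be illustrated with a small sketch. The classes below are hypothetical stand-ins, not FedScale's actual executors, and the exception shown is not the error from the original report. Only the method names load_global_model and Test are taken from the discussion; `PatchedAsyncExecutor` and `current_model_id` are assumptions made for the example.

```python
class BaseExecutor:
    """Hypothetical base executor whose test path loads the model with no arguments."""

    def load_global_model(self):
        return "latest-model"

    def Test(self):
        # The inherited testing path calls the loader without a model_id.
        return self.load_global_model()


class AsyncExecutor(BaseExecutor):
    """Hypothetical async executor: the loader now requires a model_id."""

    def load_global_model(self, model_id):
        return f"model-{model_id}"


class PatchedAsyncExecutor(AsyncExecutor):
    """One possible fix: override Test() so it passes the model_id through."""

    def __init__(self, current_model_id):
        self.current_model_id = current_model_id

    def Test(self):
        return self.load_global_model(self.current_model_id)


# The inherited Test() calls the overridden loader without its required argument.
try:
    AsyncExecutor().Test()
    error = None
except TypeError as exc:
    error = exc

patched_result = PatchedAsyncExecutor(current_model_id=7).Test()
print(type(error).__name__, patched_result)
```

The design point is that overriding a method with a stricter signature requires overriding every inherited caller as well, here the testing path.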
What happened + What you expected to happen
Hi FedScale team, I tried to run the async aggregator locally, but no test metrics are generated. Training seems to work fine, but the system freezes at round 50 without any error.
Here are the last events from the aggregator:
Here's the tail of the executor log:
Versions / Dependencies
Latest
Reproduction script
Here's my config for the async_aggregator.py example:
Issue Severity
No response