Fix Async #158
Conversation
Committer: fanlai0990 <fanlai0@outlook.com>
Thanks @fanlai0990. Quick question: does it address the event mis-ordering issue? (Fine if not, but we should open a separate issue for that.)
Not yet. I need more time to think about it (regarding overhead and fidelity). I will probably push a new fix in early September. The current async example makes some sense, but it relies on clairvoyant knowledge of client completion times and breaks our client arrival traces.
Does the current async simulation still produce somewhat valid results, despite not being entirely correct?
To the best of my knowledge, it provides valid and correct results, barring any other unexpected bugs. The only deficiency is that client arrivals within the buffer_size do not use the system trace and are constant arrivals instead; note that cross-buffer arrivals still follow the system trace. So it can still provide more realistic evaluations than other existing simulators. Please feel free to test it, and let us know if you find bugs. We plan to implement a much more sophisticated version in the future, which should fix both this deficiency and the requirement of clairvoyant client completion times. Understandably, this requires reordering events on the fly, among many other pieces. Please stay tuned.
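To make the deficiency described above concrete, here is a minimal, hypothetical sketch (not FedScale's actual implementation; the function and parameter names are invented for illustration). Within each buffer of `buffer_size` clients, arrivals are spaced by a constant gap, while the gap between consecutive buffers is drawn from the real system trace:

```python
def buffered_arrival_times(trace_gaps, buffer_size, constant_gap=1.0):
    """Toy model of the current behavior: within-buffer arrivals use a
    constant inter-arrival gap; cross-buffer arrivals use the trace gap.

    trace_gaps: per-client inter-arrival gaps from the system trace
    buffer_size: number of client updates aggregated per buffer
    """
    times = []
    t = 0.0
    for i, trace_gap in enumerate(trace_gaps):
        if i > 0 and i % buffer_size == 0:
            t += trace_gap      # first client of a new buffer: follow the trace
        elif i > 0:
            t += constant_gap   # within the buffer: constant arrivals
        times.append(t)
    return times

# Example: with buffer_size=2, clients 0-1 share a buffer (constant gap),
# then client 2 starts a new buffer using its trace gap.
print(buffered_arrival_times([2.0, 3.0, 4.0, 5.0], buffer_size=2))
```

A fully trace-driven simulator would use `trace_gap` on every step; the sketch shows why within-buffer timing deviates from the trace while cross-buffer timing does not.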
Hi @fanlai0990, I'm still not seeing any model test results using the latest async code. Were you able to see test outputs when you run it? Here are the params I used:
Hi @ewenw, thanks for trying it out! I pulled the latest code and tested it with your configuration (more details attached below). I can see the test results; besides the tensorboard, you can also try other ways of checking them. However, I did notice some weird test accuracy and am working on it. In the meantime, please let us know if you have any other concerns or features you want.
Thank you, @fanlai0990, for the prompt response! I can see the test results now after changing the number of executors to 1. With an increasing number of executors, I see fewer and fewer test data points.
Thanks for confirming it! I am fixing it and will get back to you soon. |
Why are these changes needed?
In the async FedScale example, (i) training stalls after a while; (ii) there is an API mismatch in Test.
Related issue number
Closes #148
Checks