[distributed training] Program is stuck after the last round of training #33
Running FedAvg with the following configuration: 20 rounds, 10 local epochs, 2 clients, CIFAR-10 dataset, ResNet-56. The program always hangs after the last round of training. The error output:

![image](https://user-images.githubusercontent.com/15953968/95531007-01c04900-0a12-11eb-87d2-d6f57089c5e7.png)

Comments

Hi yikang, that's not an error. Your training finished successfully; the log is just somewhat misleading. Let me run a test and adjust the logging a bit.

`INFO:root:#######training########### round_id = 3`

I checked with 4 rounds, and it works.

@weiyikang Hi, I've fixed the issue you reported. It happened because we didn't call MPI_Abort() after finishing training. Please update to the latest code and try again; you can test with 2 rounds and 1 local epoch. Thank you for your valuable feedback.

…d training Former-commit-id: 7dba82c
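The fix described in the thread, calling MPI_Abort() once training completes so that worker processes don't block forever, follows a general pattern in distributed training: the launcher must explicitly tear down workers whose communication loops would otherwise never return. The sketch below illustrates that pattern with Python's standard `multiprocessing` module as a stand-in for MPI; it is an analogy, not FedML's actual code, and the `worker` function here is purely hypothetical.

```python
import multiprocessing as mp
import time

def worker(rounds):
    """Simulated client: trains for a fixed number of rounds, then returns.

    In the reported bug, the real worker's message loop never returned
    after the last round, so the whole MPI job appeared to hang even
    though training had finished.
    """
    for _ in range(rounds):
        time.sleep(0.01)  # stand-in for one round of local training

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(2,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join(timeout=10)
    # Analogue of MPI_Abort(): force-terminate any worker that is still
    # alive so the launcher can exit instead of hanging indefinitely.
    for p in procs:
        if p.is_alive():
            p.terminate()
    print("all workers finished; launcher can exit")
```

With mpi4py, the equivalent teardown would be `MPI.COMM_WORLD.Abort()` (or a clean `MPI.Finalize()` when every rank reaches the end of training on its own).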