
[distributed training] Program is stuck after the last round of training #33

Closed
weiyikang opened this issue Oct 9, 2020 · 5 comments
Labels: good first issue (Good for newcomers)

Comments

@weiyikang

Running fedavg with this configuration: 20 rounds, 10 epochs, 2 clients, CIFAR-10 dataset, ResNet-56, but the program always crashes! The errors are as follows:
[screenshot of the error output]

@weiyikang
Author

I ran fedavg again as a background job with nohup, and the same problem occurs:

nohup sh run_fedavg_distributed_pytorch.sh 4 4 1 4 resnet56 homo 10 5 32 0.001 cifar10 "./../../../data/cifar10" > ./fedavg-resnet-homo-cifar10.txt 2>&1 &

[screenshot of the nohup log]

Why does the error always occur in the final round? GPU memory is still available, yet the aggregation in the final round breaks off.
[screenshot of GPU memory usage]

@chaoyanghe
Member

Hi yikang, that's not an error. Your training finished successfully; the log is just somewhat misleading. Let me run a test and adjust the logging a bit.

@chaoyanghe
Member

INFO:root:#######training########### round_id = 3
INFO:root:(client 642. Local Training Epoch: 0 Loss: 2.123745
INFO:root:#######finished###########
INFO:root:sys.exit(0)
INFO:root:add_model. index = 3
INFO:root:b_all_received = False
INFO:root:add_model. index = 1
INFO:root:b_all_received = False
INFO:root:add_model. index = 0
INFO:root:b_all_received = True
INFO:root:len of self.model_dict[idx] = 4
INFO:root:aggregate time cost: 0
INFO:root:################local_test_on_all_clients : 3
INFO:root:{'training_acc': 0.33828489880643486, 'training_loss': 2.1656083437750917}
INFO:root:{'test_acc': 0.3422873422873423, 'test_loss': 2.165701602372462}
INFO:root:__finish server
INFO:root:sys.exit(0)
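
For context, here is a hypothetical sketch of the bookkeeping the add_model / b_all_received lines above suggest: the server caches each client's model by index and only aggregates once every expected client has reported. The names are inferred from the log, not copied from the FedML source.

```python
# Hypothetical aggregator bookkeeping inferred from the log above;
# not the actual FedML implementation.
class Aggregator:
    def __init__(self, worker_num):
        self.worker_num = worker_num
        self.model_dict = {}          # client index -> local model parameters

    def add_model(self, index, model_params):
        self.model_dict[index] = model_params
        # b_all_received becomes True only once every client has reported,
        # at which point the server can run the FedAvg aggregation step.
        return len(self.model_dict) == self.worker_num
```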

@chaoyanghe
Member

I checked with 4 rounds; it works.

@chaoyanghe chaoyanghe changed the title Program crashed! [distributed training] Program is stuck after the last round of training Oct 9, 2020
@chaoyanghe chaoyanghe added the good first issue Good for newcomers label Oct 9, 2020
@chaoyanghe
Member

@weiyikang Hi, I've fixed the issue you reported. It happened because we didn't call MPI_Abort() after training finished. Please update to the latest code and try again. You can test with 2 rounds and 1 local epoch. Thank you for your valuable feedback.
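
For anyone landing here later, a minimal sketch of what the fix amounts to, assuming an mpi4py-based launcher (the helper name below is mine, not FedML's actual API): once the server has aggregated the last round, explicitly abort the MPI job so idle processes don't leave it hanging.

```python
# Minimal sketch, assuming the distributed runner uses mpi4py.
from mpi4py import MPI

def finish_training(comm=MPI.COMM_WORLD):
    # sys.exit(0) in one rank does not stop the other ranks; they may block
    # forever waiting for messages. MPI_Abort tears down the whole
    # communicator, so every process exits and the job terminates cleanly.
    comm.Abort(0)  # error code 0: training completed normally
```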
