
[distributed training] Program is stuck after the last round of training #33

Closed
weiyikang opened this issue Oct 9, 2020 · 5 comments
Labels: good first issue (Good for newcomers)

Comments

@weiyikang

Running fedavg with this configuration: 20 rounds, 10 epochs, 2 clients, CIFAR-10 dataset, ResNet-56, but the program always crashes! The errors are as follows:
[screenshot of the error output]

@weiyikang
Author

I ran fedavg again as a background job with nohup, and the same problem occurs:

nohup sh run_fedavg_distributed_pytorch.sh 4 4 1 4 resnet56 homo 10 5 32 0.001 cifar10 "./../../../data/cifar10" > ./fedavg-resnet-homo-cifar10.txt 2>&1 &

[screenshot of the nohup log]

Why does the error always occur in the final round? GPU memory is still available, yet the aggregation in the final round breaks off.
[screenshot of GPU memory usage]

@chaoyanghe
Member

Hi yikang, that's not an error. Your training finished successfully; the log is just somewhat misleading. Let me run a test and adjust the logging a bit.

@chaoyanghe
Member

INFO:root:#######training########### round_id = 3
INFO:root:(client 642. Local Training Epoch: 0 Loss: 2.123745
INFO:root:#######finished###########
INFO:root:sys.exit(0)
INFO:root:add_model. index = 3
INFO:root:b_all_received = False
INFO:root:add_model. index = 1
INFO:root:b_all_received = False
INFO:root:add_model. index = 0
INFO:root:b_all_received = True
INFO:root:len of self.model_dict[idx] = 4
INFO:root:aggregate time cost: 0
INFO:root:################local_test_on_all_clients : 3
INFO:root:{'training_acc': 0.33828489880643486, 'training_loss': 2.1656083437750917}
INFO:root:{'test_acc': 0.3422873422873423, 'test_loss': 2.165701602372462}
INFO:root:__finish server
INFO:root:sys.exit(0)
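
For context, here is a hypothetical sketch of the bookkeeping the add_model / b_all_received lines above suggest: the server caches each client's model by index and only aggregates once every expected client has reported. The names are inferred from the log, not copied from the FedML source.

```python
# Hypothetical aggregator bookkeeping inferred from the log above;
# not the actual FedML implementation.
class Aggregator:
    def __init__(self, worker_num):
        self.worker_num = worker_num
        self.model_dict = {}          # client index -> local model parameters

    def add_model(self, index, model_params):
        self.model_dict[index] = model_params
        # b_all_received becomes True only once every client has reported,
        # at which point the server can run the FedAvg aggregation step.
        return len(self.model_dict) == self.worker_num
```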

@chaoyanghe
Member

I checked with 4 rounds; it works.

@chaoyanghe chaoyanghe changed the title Program crashed! [distributed training] Program is stuck after the last round of training Oct 9, 2020
@chaoyanghe chaoyanghe added the good first issue Good for newcomers label Oct 9, 2020
@chaoyanghe
Member

@weiyikang Hi, I've fixed the issue you reported. It happened because we didn't call MPI_Abort() after training finished. Please update to the latest code and try again. You can test with 2 rounds and 1 local epoch. Thank you for your valuable feedback.
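
For anyone landing here later, a minimal sketch of what the fix amounts to, assuming an mpi4py-based launcher (the helper name below is mine, not FedML's actual API): once the server has aggregated the last round, explicitly abort the MPI job so idle processes don't leave it hanging.

```python
# Minimal sketch, assuming the distributed runner uses mpi4py.
from mpi4py import MPI

def finish_training(comm=MPI.COMM_WORLD):
    # sys.exit(0) in one rank does not stop the other ranks; they may block
    # forever waiting for messages. MPI_Abort tears down the whole
    # communicator, so every process exits and the job terminates cleanly.
    comm.Abort(0)  # error code 0: training completed normally
```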
