
How to gather all validation_step_outputs at validation_epoch_end and run in rank_zero properly without deadlock? #13041


The solution is as follows.
Do not guard the `all_gather` call with a rank-zero check like this:

if self.trainer.is_global_zero:
    all_val_outs = self.all_gather(...)

The code above hangs because `all_gather` is a collective operation: rank 0 waits for the other ranks (GPUs) to join the call, but they never reach it.

Write the code this way instead, so that every rank calls `all_gather` and only the processing is restricted to rank 0:

all_val_outs = self.all_gather(...)

if self.trainer.is_global_zero:
    # merge the gathered outputs and process them on rank 0 only
    pass

self.trainer.strategy.barrier()  # make the other ranks wait until rank 0 has finished
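
To see why the order matters, here is a toy simulation in plain Python. It is only a sketch and does not use Lightning or `torch.distributed`: threads stand in for ranks, and `ToyGroup`, `run_guarded`, and `run_correct` are hypothetical names made up for this illustration. The `all_gather` here is modelled with `threading.Barrier`, on the assumption that a real collective likewise returns only after every rank has joined it.

```python
import threading


class ToyGroup:
    """Toy stand-in for a process group; each thread plays one rank."""

    def __init__(self, world_size):
        self.world_size = world_size
        self._slots = [None] * world_size
        self._sync = threading.Barrier(world_size)

    def all_gather(self, rank, value, timeout=None):
        # Collective call: it only returns once every rank has joined.
        self._slots[rank] = value
        self._sync.wait(timeout)  # all ranks have written their value
        gathered = list(self._slots)
        self._sync.wait(timeout)  # all ranks have copied the values
        return gathered

    def barrier(self, timeout=None):
        self._sync.wait(timeout)


def _run_ranks(worker, world_size):
    # Start one thread per simulated rank and wait for all of them.
    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_guarded(world_size=4, timeout=0.2):
    """Anti-pattern: only rank 0 calls the collective."""
    group = ToyGroup(world_size)
    outcome = {"status": "not run"}

    def worker(rank):
        if rank == 0:
            try:
                group.all_gather(rank, rank * 10, timeout=timeout)
                outcome["status"] = "ok"
            except threading.BrokenBarrierError:
                # The timeout only keeps this demo from blocking forever.
                outcome["status"] = "hung"

    _run_ranks(worker, world_size)
    return outcome["status"]


def run_correct(world_size=4):
    """Recommended order: gather on every rank, process on rank 0, then barrier."""
    group = ToyGroup(world_size)
    result = {}

    def worker(rank):
        outs = group.all_gather(rank, rank * 10, timeout=5)  # every rank joins
        if rank == 0:
            result["merged"] = sum(outs)  # rank-0-only processing
        group.barrier(timeout=5)  # the other ranks wait for rank 0

    _run_ranks(worker, world_size)
    return result["merged"]


if __name__ == "__main__":
    print(run_guarded())
    print(run_correct())
```

In the guarded version rank 0 is the only participant in the collective, so it can never complete; in the other version every rank joins the gather and only the merge step is rank-specific.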

Answer selected by allanchan339