How to gather all validation_step_outputs at validation_epoch_end and run in rank_zero properly without deadlock? #13041
Hi, I am trying to gather all the output and label pairs in `validation_epoch_end` to run a simple validation process.
To calculate accurate metrics, I need to gather all outputs onto a single device and log on rank 0 only, as follows:
The program works fine. However, the metric result always differs from the one obtained with DP, so I need to gather all the data first.
Replies: 2 comments 22 replies
With DP, you need to configure the metric computation differently. With DDP, to gather all outputs you can use
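One way to do this gathering, sketched under assumptions: `LightningModule.all_gather` is an existing Lightning method, but the helper name `gather_epoch_outputs` and the `"pred"` / `"label"` keys are hypothetical and stand in for whatever `validation_step` returns. The sketch also assumes every rank holds tensors of the same shape, which `all_gather` requires.

```python
import torch


def _merge_ranks(local, gathered):
    # In multi-process runs all_gather adds a leading world_size dimension;
    # in single-process runs the tensor may come back unchanged.
    if gathered.dim() == local.dim() + 1:
        return gathered.flatten(0, 1)
    return gathered


def gather_epoch_outputs(module, outputs):
    """Concatenate step outputs locally, then gather them from every rank.

    `module` is assumed to be a LightningModule (anything exposing
    `all_gather` works); `outputs` is the list passed to
    `validation_epoch_end`, assumed to hold dicts with "pred" and "label".
    """
    preds = torch.cat([o["pred"] for o in outputs])
    labels = torch.cat([o["label"] for o in outputs])
    # Must be called on every rank, never inside a rank-0-only branch.
    all_preds = _merge_ranks(preds, module.all_gather(preds))
    all_labels = _merge_ranks(labels, module.all_gather(labels))
    return all_preds, all_labels
```

Inside `validation_epoch_end(self, outputs)` this would be called as `gather_epoch_outputs(self, outputs)` on all ranks, so each rank ends up with the full set of predictions and labels.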
The solution is as follows:
The original function will hang because card 0 is trying to communicate with the other cards, and a collective call only returns once every rank has joined it. Instead, you should write the code this way:
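A minimal sketch of that ordering, written with plain `torch.distributed` so it also runs in a single process: every rank enters the gather first, and only afterwards does rank 0 compute the metric. The function name `gather_then_score` and the use of accuracy as the metric are assumptions for illustration, and tensors are assumed to have the same shape on every rank.

```python
import torch
import torch.distributed as dist


def gather_then_score(preds, labels):
    """Gather on all ranks, then compute accuracy on rank 0 only."""
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
        pred_list = [torch.zeros_like(preds) for _ in range(world_size)]
        label_list = [torch.zeros_like(labels) for _ in range(world_size)]
        # Collective calls: reached by every rank, outside any rank check.
        dist.all_gather(pred_list, preds)
        dist.all_gather(label_list, labels)
        preds, labels = torch.cat(pred_list), torch.cat(label_list)
        rank = dist.get_rank()
    else:
        # No process group: behave as a single rank 0.
        rank = 0

    # Only now branch on the rank; no communication happens below.
    if rank != 0:
        return None
    return (preds == labels).float().mean().item()
```

The pattern that hangs is the reverse one, where the gather sits inside the `rank == 0` branch and the other ranks never reach it. In a LightningModule the same ordering would apply with `self.all_gather` followed by a check on `self.trainer.is_global_zero`.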