How to gather all validation_step_outputs at validation_epoch_end and run in rank_zero properly without deadlock? #13041

allanchan339 · 2022-05-12T03:50:18Z

Hi,

I am trying to gather all output and label pairs at the end of the validation epoch so that I can run a simple validation procedure.
First, the validation data is split across devices (handled by DDP).
The validation step is simple:

def validation_step(self, batch, batch_idx):
    # use valid_metric
    feat, label = batch
    output = self.model(feat)
    del batch, feat
    torch.cuda.empty_cache()

    # Update one metric per output column
    for i, metric in enumerate(self.metrics):
        metric.update(output[:, i], label[:, i])

    # These dicts are collected into validation_step_outputs
    return {"label": label, "output": output}

To compute accurate metrics, I need to gather all outputs on a single device and log on rank 0 only, as follows:

def validation_epoch_end(self, validation_step_outputs):
    def _valid_epoch_end(validation_step_outputs):
        enable_Flag = False
        # AP refers to average_precision_score in torchmetrics
        all_labels = list(map(itemgetter('label'), validation_step_outputs))
        all_labels = torch.cat(all_labels).cpu().detach().numpy()
        all_outputs = list(map(itemgetter('output'), validation_step_outputs))
        all_outputs = torch.cat(all_outputs).cpu().detach().numpy()

        AP = []
        for i in range(1, 17+1):
            AP.append(np.nan_to_num(
                average_precision_score(all_labels[:, i], all_outputs[:, i])))
        mAP = np.mean(AP)
        tmp = EVENT_DICTIONARY_V2
        tmp = tmp.copy()
        for k, v in tmp.items():
            # dict["kick-off"] = AP[0]
            tmp[k] = AP[v]

        self.log('Valid/mAP', mAP, logger=True, prog_bar=True,
                 rank_zero_only=True if self.args.strategy != 'dp' and enable_Flag else False)

        label_cls = (list(EVENT_DICTIONARY_V2.keys()))
        zip_iterator = zip(label_cls, AP)
        AP_dictionary = dict(zip_iterator)
        self.log('Valid/AP', AP_dictionary, logger=True, prog_bar=False,
                 rank_zero_only=True if self.args.strategy != 'dp' and enable_Flag else False)

    _valid_epoch_end(validation_step_outputs)
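
For context, here is a minimal sketch of why a ranking metric computed per rank can differ from the same metric computed over the whole validation set. Under DDP each rank only sees its own shard, so the hook above scores each shard separately. The `average_precision` helper and the toy shards are hypothetical and for illustration only: the helper is a simplified, tie-free version of the metric, not the library function used above.

```python
def average_precision(labels, scores):
    """Simplified, tie-free average precision (illustrative helper)."""
    # Rank samples from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision at this positive sample
    return total / hits if hits else 0.0


# Toy shards, as two DDP ranks might see them (made-up numbers)
shard0 = ([0, 1], [0.9, 0.8])
shard1 = ([1, 0], [0.3, 0.2])

# Metric computed separately on each shard, then averaged
per_rank = [average_precision(*shard0), average_precision(*shard1)]
mean_of_shards = sum(per_rank) / len(per_rank)

# Same samples, scored together after gathering them in one place
all_labels = shard0[0] + shard1[0]
all_scores = shard0[1] + shard1[1]
global_ap = average_precision(all_labels, all_scores)

print(mean_of_shards, global_ap)
```

The two printed values differ, which is why the outputs have to be gathered before the metric is computed.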

The program runs fine. However, the metric results always differ from those under DP, so I need to gather all the data first.
But then the program does not even get past the validation step because of a deadlock. Any good ideas on how to modify this program so that I can gather

Answered by allanchan339

Mar 26, 2023

rohitgr7 · 2022-05-15T05:55:01Z

With DP, you need to configure the metric computation differently:
https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#validating-with-dataparallel

With DDP, to gather all outputs you can use self.all_gather under the rank-zero condition:

if self.trainer.is_global_zero:
    all_val_outs = self.all_gather(...)

20 replies

ecolss Jan 29, 2023

Same issue here, is_global_zero and all_gather make the code hang forever.

furkanbiten Feb 18, 2023

Same issue here. I have a metric that needs to be computed across the entire dataset. Branching with if self.trainer.is_global_zero leads to a deadlock in the other ranks. In my case, EarlyStopping on that process complains that the metric it is tracking is not available. This also raises the question of why early stopping is triggered on each rank at all. Shouldn't it be a global check?

The same happened to me with ModelCheckpoint.

vgthengane Mar 23, 2023

Has anybody solved this yet?

Vishwas-Venkatachalapathy Mar 26, 2023

I am running into the same issue. Let me know if there is a fix for it.

akihironitta Mar 26, 2023

Just to clarify, the collective all_gather needs to be called in all processes, not only in the main process; otherwise it hangs.
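
To illustrate the point, here is a toy simulation that uses Python threads and `threading.Barrier` in place of real processes and a real collective. `ToyGather`, `run`, and the 0.5 s timeout are hypothetical names and values made up for this sketch. The timeout exists only so that the demo terminates; a real collective call would keep blocking instead.

```python
import threading

WORLD_SIZE = 4


class ToyGather:
    """Toy stand-in for a collective: every rank must call all_gather()."""

    def __init__(self, world_size, timeout=0.5):
        self.slots = [None] * world_size
        self.barrier = threading.Barrier(world_size)
        self.timeout = timeout

    def all_gather(self, rank, value):
        self.slots[rank] = value
        # Blocks until ALL ranks have arrived (or the demo timeout expires)
        self.barrier.wait(timeout=self.timeout)
        return list(self.slots)


def run(only_rank_zero_calls):
    comm = ToyGather(WORLD_SIZE)
    results = {}

    def worker(rank):
        if only_rank_zero_calls and rank != 0:
            return  # this rank skips the collective, as in the buggy pattern
        try:
            results[rank] = comm.all_gather(rank, rank * 10)
        except threading.BrokenBarrierError:
            results[rank] = "hang"  # nobody else showed up

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run(only_rank_zero_calls=True))   # rank 0 waits alone
    print(run(only_rank_zero_calls=False))  # every rank gets all values
```

When only rank 0 makes the call, it waits for peers that never arrive; when every rank calls it, each one receives the full list.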

allanchan339 · 2023-03-26T11:29:17Z

Author

The solution is as follows.
Instead of using

if self.trainer.is_global_zero:
    all_val_outs = self.all_gather(...)

The code above hangs because card 0 tries to communicate with the other cards, which never reach the all_gather call.

You should write the code this way:

all_val_out = self.all_gather(...)  # called on every rank

if self.trainer.is_global_zero:
    # merge output and process
    pass

self.trainer.strategy.barrier()  # make the other cards wait
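
As a rough sketch of how this pattern could be packaged, the hypothetical helper `gather_then_process` below accepts any object that exposes `all_gather`, `trainer.is_global_zero`, and `trainer.strategy.barrier()`, which are the three calls used above. The `fake_module` and `merge` functions are stand-ins invented for the demo so that the snippet runs without GPUs or Lightning; inside a real LightningModule you would pass `self` instead.

```python
from types import SimpleNamespace


def gather_then_process(module, local_outputs, process_fn):
    """Gather on every rank, process on rank 0 only, then synchronize."""
    # 1) Collective call: every rank must reach this line.
    gathered = module.all_gather(local_outputs)

    result = None
    # 2) Only global rank 0 merges and processes the gathered outputs.
    if module.trainer.is_global_zero:
        result = process_fn(gathered)

    # 3) All ranks wait here until rank 0 has finished.
    module.trainer.strategy.barrier()
    return result


def fake_module(is_rank_zero, world_size=2):
    """Stand-in for a LightningModule (assumption, demo only)."""
    strategy = SimpleNamespace(barrier=lambda: None)
    trainer = SimpleNamespace(is_global_zero=is_rank_zero, strategy=strategy)
    # Pretend every rank contributed the same local list
    return SimpleNamespace(
        trainer=trainer,
        all_gather=lambda x: [x for _ in range(world_size)],
    )


def merge(gathered):
    # Flatten [rank][sample] into one flat list of samples
    return [sample for rank_out in gathered for sample in rank_out]


print(gather_then_process(fake_module(True), [1, 2], merge))
print(gather_then_process(fake_module(False), [1, 2], merge))
```

In a multi-process run with tensors, `self.all_gather` returns a tensor with an extra leading dimension of size world_size, so a real `process_fn` would typically begin by flattening the first two dimensions.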

2 replies

kenanseyidov Apr 9, 2023

Can you please provide documentation which explains how data moves from card to card? Right now I struggle to understand what all_gather, is_global_zero and barrier() achieve.

akihironitta Apr 10, 2023

For collective operations, such as all gather/reduce, these blog posts helped me understand how they work.

is_global_zero is just a bool value that represents whether the process is running on the 0th device of all devices or not:

https://lightning.ai/docs/pytorch/2.0.1/common/trainer.html#is-global-zero

barrier() makes the running process wait until other processes get to the same code line. Some references I find helpful are:
