
MetaTensor and DistributedDataParallel. bug (SyncBatchNormBackward is a view and is being modified inplace) #5283

Open
myron opened this issue Oct 7, 2022 · 3 comments
myron (Collaborator) commented Oct 7, 2022

When upgrading from MONAI 0.9.0 to 1.0.0, my 3D segmentation code fails when using DistributedDataParallel (multi-GPU), most likely due to the new MetaTensor returned by the transforms.

The error is:
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.

The same issue was reported (but for 2D MIL classification) in #5081 and #5198.

I've traced it down to commit 63e36b6 (prior to it, the code works fine).

It seems the issue is that the dataloader now returns data as MetaTensor (and not torch.Tensor as before); e.g. here https://github.com/Project-MONAI/tutorials/blob/main/pathology/multiple_instance_learning/panda_mil_train_evaluate_pytorch_gpu.py#L51 both data and target are MetaTensor.

If I convert explicitly (on GPU or CPU):

data = torch.Tensor(data)
target = torch.Tensor(target)

then the code runs fine, although a bit slower. It seems there is something wrong with MetaTensor.
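
For context, a minimal sketch of how such a conversion might sit in a typical DDP training step (the loop structure and the loader, model, loss_function, optimizer and device names are illustrative placeholders, not from the original code; MetaTensor.as_tensor() is another way to obtain a plain torch.Tensor):

from monai.data import MetaTensor

for batch_data in loader:
    data, target = batch_data["image"], batch_data["label"]
    # strip the MetaTensor subclass before the DDP/SyncBatchNorm forward pass
    if isinstance(data, MetaTensor):
        data = data.as_tensor()
    if isinstance(target, MetaTensor):
        target = target.as_tensor()
    data, target = data.to(device), target.to(device)

    optimizer.zero_grad()
    loss = loss_function(model(data), target)
    loss.backward()
    optimizer.step()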

@myron myron added the bug Something isn't working label Oct 7, 2022
@myron myron added this to the Auto3D Seg framework [P0 v1.0] milestone Oct 7, 2022
@myron myron assigned wyli and Nic-Ma Oct 7, 2022
@wyli wyli removed this from the Auto3D Seg framework [P0 v1.0] milestone Oct 7, 2022
wyli (Contributor) commented Oct 7, 2022

Thanks for reporting. I'm able to reproduce it with torchrun --nnodes=1 --nproc_per_node=2 test.py using this test.py:

import torch.distributed as dist

import torch
from torchvision import models
from monai.data import MetaTensor

# surface the exact operation that triggers the view/inplace error during backward
torch.autograd.set_detect_anomaly(True)

def run():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"rank {rank}")
    device = rank

    mod = models.resnet50(pretrained=True).to(device)
    optim = torch.optim.Adam(mod.parameters(), lr=1e-3)
    z1 = MetaTensor(torch.zeros(1, 3, 128, 128)).to(device)  # MetaTensor input, as returned by MONAI 1.0 dataloaders

    mod = torch.nn.SyncBatchNorm.convert_sync_batchnorm(mod)  # replace BatchNorm layers with SyncBatchNorm
    mod = torch.nn.parallel.DistributedDataParallel(mod, device_ids=[rank], output_device=rank)

    out = mod(z1)
    print(out.shape)
    loss = (out**2).mean()

    optim.zero_grad()
    loss.backward()
    optim.step()

    print("Stepped.")

if __name__ == "__main__":
    run()
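
For comparison (not part of the original script; same setup assumed), swapping in a plain torch.Tensor input helps confirm that the MetaTensor subclass is the trigger:

# replace the forward pass above with either of these to compare behaviour
out = mod(z1.as_tensor())                              # strip the MetaTensor subclass
out = mod(torch.zeros(1, 3, 128, 128, device=device))  # plain tensor from the start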

I'll submit a PR to fix this.

wyli (Contributor) commented Oct 7, 2022

This looks like a PyTorch issue; I created an upstream bug report (pytorch/pytorch#86456).

@wyli wyli changed the title MetaTensor and DistributedDataParallel. bug MetaTensor and DistributedDataParallel. bug (SyncBatchNormBackward is a view and is being modified inplace) Oct 7, 2022
@vikashg vikashg closed this as completed Dec 19, 2023
KumoLiu (Contributor) commented Dec 20, 2023

Since the upstream bug has not yet been fixed, this ticket should be kept open.

@KumoLiu KumoLiu reopened this Dec 20, 2023