
DataLoader with num_workers > 1, and Rand[Zoom/Rotate/Flip]d transforms #398

Closed
hjmjohnson opened this issue May 18, 2020 · 9 comments · Fixed by #423

Comments

@hjmjohnson
Contributor

Describe the bug
When using a DataLoader with num_workers > 1 and a Rand[Zoom/Rotate/Flip]d transform, all of the workers share the same random state.

To Reproduce

With train_ds being a dataset that applies randomly parameterized transforms:

    train_loader: DataLoader = DataLoader(
        train_ds,  # <-- This is a dataset of both the input raw data filenames + definition of transforms
        batch_size=1,
        shuffle=True,
        num_workers=88,
        collate_fn=list_data_collate,
    )

This is particularly disturbing when running on a machine with 40+ CPUs, because huge numbers of images end up with identical augmentation parameters.
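
For illustration, here is a minimal sketch (hypothetical, not MONAI code) of why forked workers draw identical values: each worker process inherits a copy of the parent's NumPy random state.

    import numpy as np
    from torch.utils.data import DataLoader, Dataset

    class RandomAngleDataset(Dataset):
        # Hypothetical dataset that draws a random angle per item, like RandRotated.
        def __len__(self):
            return 8

        def __getitem__(self, idx):
            # np.random's state is copied into every forked worker (the default on Linux),
            # so different workers produce the same sequence of angles.
            return np.random.uniform(0, 20)

    loader = DataLoader(RandomAngleDataset(), batch_size=1, num_workers=4)
    for angle in loader:
        print(f"Rotating by {angle.item()}")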

Expected behavior
Each transform should have its own random parameters chosen, regardless of the number of workers.

Screenshots
NOTE: The number of replicated rotation values is always equal to the num_workers specified.

Rotating by 19.367042973517755
Rotating by 19.367042973517755
Rotating by 19.367042973517755
Rotating by 19.367042973517755
Rotating by 4.039486469720721
Rotating by 4.039486469720721
Rotating by 4.039486469720721
Rotating by 4.039486469720721
Rotating by 13.13047017599905
Rotating by 13.13047017599905
Rotating by 13.13047017599905
Rotating by 13.13047017599905
@Nic-Ma
Contributor

Nic-Ma commented May 18, 2020

Hi @hjmjohnson ,

Thanks for your bug report.
This is a known issue with NumPy + PyTorch multiprocessing.
You can easily fix it by adding the following logic to your DataLoader initialization:

import torch

def worker_init_fn(worker_id):
    # Each worker receives a distinct seed from PyTorch; use it to re-seed the
    # dataset's transform chain so numpy-based randomness differs per worker.
    worker_info = torch.utils.data.get_worker_info()
    worker_info.dataset.transform.set_random_state(worker_info.seed % (2 ** 32 - 1))

dataloader = torch.utils.data.DataLoader(..., worker_init_fn=worker_init_fn)
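
For reference, a sketch of how this fits together with the loader from the report above (assuming the same train_ds, and that list_data_collate comes from monai.data):

import torch
from torch.utils.data import DataLoader
from monai.data import list_data_collate

def worker_init_fn(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    # re-seed the dataset's transform chain with the per-worker seed
    worker_info.dataset.transform.set_random_state(worker_info.seed % (2 ** 32 - 1))

train_loader = DataLoader(
    train_ds,  # dataset with randomly parameterized transforms, as above
    batch_size=1,
    shuffle=True,
    num_workers=88,
    collate_fn=list_data_collate,
    worker_init_fn=worker_init_fn,  # gives each worker its own random state
)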

Thanks.

@hjmjohnson
Contributor Author

FYI: I also came across documentation indicating that the real problem is outside the scope of MONAI.

https://pytorch.org/docs/stable/data.html

@hjmjohnson
Contributor Author

@Nic-Ma THANK YOU! Sorry for the invalid MONAI bug report. Your solution worked wonderfully!

@Nic-Ma
Contributor

Nic-Ma commented May 19, 2020

You are welcome.
Thanks.

@tvercaut
Member

tvercaut commented May 19, 2020

Should a note about this be put in our wiki or somewhere similar, to start collating an FAQ?
Linking to other resources such as https://pytorch.org/docs/stable/notes/faq.html#my-data-loader-workers-return-identical-random-numbers would of course make sense in such an FAQ.
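
For such an FAQ entry, the generic (non-MONAI) fix suggested by that PyTorch note is to re-seed NumPy inside each worker; a minimal sketch:

import numpy as np
import torch

def worker_init_fn(worker_id):
    # torch.initial_seed() already differs per worker; fold it into NumPy's
    # 32-bit seed range so numpy-based augmentations diverge across workers.
    np.random.seed(torch.initial_seed() % (2 ** 32))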

@Nic-Ma
Contributor

Nic-Ma commented May 19, 2020

Hi @tvercaut ,

Good idea, maybe @atbenmurray can help add this to our wiki page?
Ben spent a lot of time setting up our detailed wiki pages previously.
Thanks.

@hjmjohnson
Contributor Author

@Nic-Ma It would be nice if the FAQ were indexed in a way that the Sphinx documentation could reference the FAQ content.

@Nic-Ma
Contributor

Nic-Ma commented May 19, 2020

I will raise the topic of setting up an FAQ with the core team.
Maybe we can discuss it this Friday.
Thanks.

@atbenmurray
Contributor

@Nic-Ma @tvercaut I can certainly help with the wiki stuff

@wyli wyli mentioned this issue May 25, 2020