Auto-set DataLoader.worker_init_fn with seed_everything #6960

Merged
41 commits merged into master from feature/worker_init_fn on Apr 19, 2021

Conversation

@awaelchli (Member) commented Apr 11, 2021

What does this PR do?

Proposal:
Make Lightning add a default worker init function that seeds NumPy when the user also calls seed_everything.
This prevents nasty bugs where data augmentations are identical across workers (when using e.g. NumPy rather than torch as the random number generator).

As suggested by Ethan in #6957 (reply in thread), this feature can be turned on or off with seed_everything(workers=True|False).
The default is currently workers=False (off). It doesn't affect parity and is therefore a safe feature to introduce.
However, the user has to remember to turn it on, which is easy to forget even though the feature is convenient. We may want to consider turning it on in the future.
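
For illustration, a minimal usage sketch of the proposed flag (assuming the workers argument introduced in this PR):

from pytorch_lightning import seed_everything

# Seeds python.random, numpy and torch; with workers=True, Lightning will also
# attach a worker_init_fn to the DataLoaders it configures so that each
# dataloader worker (and each rank) gets a distinct, deterministic seed.
seed_everything(42, workers=True)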

Vote:

  • 🚀 Default seed_everything(workers=True)
  • 🎉 Default seed_everything(workers=False)

Addresses: #6957

Explanation

To understand the seeds and randomness in the workers, consider the following example:

import os

import numpy as np
import torch
from torch.utils.data import Dataset

from pytorch_lightning import LightningModule, Trainer, seed_everything


class RandomDataset(Dataset):
    def __getitem__(self, index):
        # numpy random state is the same in all workers!
        return np.random.randint(0, 10, 3)

    def __len__(self):
        return 16


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        print("rank:", self.global_rank, "batch:", batch.view(-1).tolist())
        return None

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(), batch_size=2, num_workers=2)
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        gpus=2,
        accelerator="ddp",
        limit_train_batches=4,
        max_epochs=2,
        weights_summary=None,
        progress_bar_refresh_rate=0,
    )
    trainer.fit(model, train_data)


if __name__ == '__main__':
    # ---------------------------------------------------
    # comment the line below to get duplicated batches!
    # ---------------------------------------------------
    seed_everything(1)
    run()

Running the above code on the master branch, we get output like this:

rank: 0 batch: [5, 4, 5, 4, 9, 9]
rank: 0 batch: [5, 4, 5, 4, 9, 9] <---- repeated 
rank: 0 batch: [7, 5, 2, 2, 6, 5]
rank: 0 batch: [7, 5, 2, 2, 6, 5] <---- repeated 

rank: 1 batch: [8, 9, 4, 6, 8, 8]
rank: 1 batch: [8, 9, 4, 6, 8, 8] <---- repeated 
rank: 1 batch: [0, 0, 8, 6, 9, 7]
rank: 1 batch: [0, 0, 8, 6, 9, 7] <---- repeated 

As you can see, the batches within each rank are repeated, meaning both workers produce the same batch/augmentation!
As suggested in the PyTorch docs and this blog post, we can fix this by providing a custom worker init function:

import random

import numpy
import torch

def worker_init_fn(worker_id):
    # derive a distinct, deterministic seed for each worker from the base seed
    worker_seed = (torch.initial_seed() + worker_id) % (2**32)
    torch.manual_seed(worker_seed)
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

dataloader.worker_init_fn = worker_init_fn
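
(For reference, the same function can also be passed at construction time, e.g. DataLoader(dataset, num_workers=2, worker_init_fn=worker_init_fn); worker_init_fn is a standard DataLoader argument.)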

This fixes the problem for training on a single GPU. See the output below: within each rank we now get different samples. Good!

rank: 0 batch: [2, 2, 3, 0, 8, 9]
rank: 0 batch: [7, 2, 0, 0, 2, 2]
rank: 0 batch: [9, 6, 5, 0, 3, 1]

rank: 1 batch: [2, 2, 3, 0, 8, 9] <---- repeated from rank 0
rank: 1 batch: [7, 2, 0, 0, 2, 2] <---- repeated from rank 0
rank: 1 batch: [9, 6, 5, 0, 3, 1] <---- repeated from rank 0

However, we still have a problem with multi-GPU: the ranks produce identical samples! We need to add the global rank to the seed as well. The (hopefully) correct init function is:

import random

import numpy
import torch
from pytorch_lightning.utilities import rank_zero_only

def worker_init_fn(worker_id):
    global_rank = rank_zero_only.rank
    worker_seed = (torch.initial_seed() + worker_id + global_rank) % (2**32)
    torch.manual_seed(worker_seed)
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

dataloader.worker_init_fn = worker_init_fn

And our script produces random samples without duplicates across all ranks and workers:

rank: 0 batch: [2, 2, 3, 0, 8, 9]
rank: 0 batch: [7, 2, 0, 0, 2, 2]
rank: 0 batch: [9, 6, 5, 0, 3, 1]
rank: 0 batch: [0, 8, 2, 6, 3, 2]

rank: 1 batch: [1, 6, 2, 5, 5, 7]
rank: 1 batch: [9, 0, 4, 8, 1, 4]
rank: 1 batch: [7, 2, 2, 1, 6, 6]
rank: 1 batch: [5, 3, 6, 5, 6, 2]

Note: To reproduce this you need to run on Linux, where you will see the duplicated samples. On the other platforms (macOS, Windows) you won't get duplicates, but the seeds in the worker processes are not chosen deterministically.
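
As a rough way to compare the two behaviours on one machine, a sketch (multiprocessing_context is a standard DataLoader argument; RandomDataset is the dataset from the example above):

from torch.utils.data import DataLoader

# Linux defaults to the "fork" start method: workers inherit the parent's numpy
# state, so without a worker_init_fn they all draw the same numbers.
# Forcing "spawn" reproduces the macOS/Windows behaviour instead: distinct but
# non-deterministic numpy state in each worker.
loader = DataLoader(RandomDataset(), batch_size=2, num_workers=2, multiprocessing_context="spawn")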

References:
[1] https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli added the feature label (Is an improvement or enhancement) Apr 11, 2021
@pep8speaks commented Apr 11, 2021

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-16 18:14:00 UTC

@codecov bot commented Apr 11, 2021

Codecov Report

Merging #6960 (8514821) into master (f645df5) will decrease coverage by 3%.
The diff coverage is 55%.

@@           Coverage Diff           @@
##           master   #6960    +/-   ##
=======================================
- Coverage      92%     89%    -3%     
=======================================
  Files         194     196     +2     
  Lines       12386   13219   +833     
=======================================
+ Hits        11414   11824   +410     
- Misses        972    1395   +423     

@awaelchli added the data handling label (Generic data-related topic) Apr 11, 2021
@carmocca (Member) left a comment

Love the description!

Can you try to give this functionality more visibility throughout the docs? Maybe also set it where seed_everything is used.

@Borda added this to the 1.3 milestone Apr 12, 2021
@awaelchli (Member, Author) commented

@rkern thanks again for your feedback. For torch < 1.6, manual_seed only accepts 32-bit integers, so I had to make a slight modification. I also noticed that on macOS and Windows our custom worker init function for numpy is not necessary: there seems to be no duplication of random state there. I couldn't find a clear answer why, but I think the reason is that on these platforms the worker processes are created with a different method.
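
As a rough illustration of the kind of adjustment meant here (a sketch, not the exact code in this PR):

import torch

# torch.manual_seed in torch < 1.6 rejects seeds wider than 32 bits,
# so a derived 64-bit seed is folded into the 32-bit range first.
derived_seed = torch.initial_seed() + 3  # e.g. base seed plus a worker offset
torch.manual_seed(derived_seed % (2**32))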

@rkern commented Apr 16, 2021

multiprocessing uses spawn instead of fork on Windows (because there is no fork()) and on macOS in recent Pythons (I think because fork() is a little wonky when interacting with other macOS facilities). This means that instead of copying the state of the process, a new Python process is started up from scratch. That means that the global PRNG underneath np.random gets initialized with fresh OS entropy in each child process. Thus, you will get distinct states, but not deterministic states. You still need the worker_init_fn to give them appropriate seeds if you want determinism.
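
To make the point concrete, a small standalone sketch (independent of DataLoader) of "distinct but not deterministic" under spawn:

import multiprocessing as mp

import numpy as np


def sample(_):
    # Under "spawn" each child is a fresh Python process, so numpy's global RNG
    # is seeded from new OS entropy: values differ between children and between runs.
    return np.random.randint(0, 10, 3).tolist()


if __name__ == "__main__":
    with mp.get_context("spawn").Pool(2) as pool:
        print(pool.map(sample, range(2)))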

@awaelchli (Member, Author) commented

@rkern One light bulb after another turns on. You are of course right, I should have seen this in my testing. I was so focused on avoiding duplicate states that I was completely blind to the non-deterministic state. Thanks a lot for your comment. Appreciate your contribution!

@carmocca added the ready label (PRs ready to be merged) Apr 19, 2021
@tchaton (Contributor) left a comment

LGTM !

@awaelchli merged commit 60c1c8f into master Apr 19, 2021
@awaelchli deleted the feature/worker_init_fn branch April 19, 2021 14:28
@awaelchli (Member, Author) commented

Thanks everyone for your constructive feedback. It was fun to implement this :)

ejguan added a commit to pytorch/pytorch that referenced this pull request Apr 26, 2021
After adding a default seeding strategy for the NumPy random module within each worker of DataLoader in #56488, two concerns were raised:
- We dropped support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide a seed for `numpy.random`?
  - The first option is to set the same seed as `random`. The problem is that `numpy.random` and `random` share the same algorithm, so with the same seed they produce exactly the same state sequence. Thanks to @rkern, we noticed these so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Since most users are not aware of this problem, we can provide a better seed for `numpy.random` by default, using the same `SeedSequence` algorithm as NumPy. This is just a workaround: a hard-coded function generates an array of four int32 values as the seed.

To better cope with this problem, given that many third-party libraries (not just NumPy) have their own random modules, we may eventually need to implement a `SeedSequence` within the `torch.random` module so that users can `spawn` a new `SeedSequence` for each library.

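As a rough illustration of the SeedSequence-style derivation described above (a sketch using NumPy's own SeedSequence, not the hard-coded reimplementation the commit refers to):

import random

import numpy as np

base_seed, worker_id = 42, 3

# Derive an independent, well-mixed child seed for this worker from the base seed.
ss = np.random.SeedSequence([base_seed, worker_id])

# Four 32-bit words, suitable as a seed for numpy's legacy global RNG.
np.random.seed(ss.generate_state(4))

# Spawn a separate child sequence for another library's RNG, so the two
# generators do not end up with the same state sequence.
(py_ss,) = ss.spawn(1)
random.seed(int(py_ss.generate_state(1)[0]))
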
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Apr 27, 2021
crcrpar pushed a commit to crcrpar/pytorch that referenced this pull request May 7, 2021
@robogast commented May 11, 2021

Why was the default not actually changed?
The votes indicate that the default for workers should be True, and the 1.3 release notes mention this PR.

However, if I look at seed_everything on master, the default is workers=False.

@awaelchli (Member, Author) commented

It was decided that we have to keep it False to keep parity between versions (it could influence reproducibility from one version to the next).

krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021