Different DataLoader workers share the same seed and lose randomness #37932


Closed
4 tasks
gathierry opened this issue May 2, 2025 · 6 comments · Fixed by #37980
Comments

@gathierry
Contributor

System Info

The train dataloader uses seed_worker to set the seed in each worker. But with distributed training (e.g. DDP, DeepSpeed, etc.), different ranks end up with the same seed, so any random data augmentation is identical across ranks.

dataloader_params["worker_init_fn"] = seed_worker
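
For reference, seed_worker is essentially a worker_init_fn along these lines (a paraphrased sketch, not the exact transformers source): inside a worker, torch.initial_seed() is base_seed + worker_id, so it differs between workers of one process, but the base seed is identical on every rank because each rank was seeded with the same value.

import random

import numpy as np
import torch


def seed_worker(worker_id):
    # Inside a DataLoader worker, torch.initial_seed() is base_seed + worker_id.
    # The base seed is derived from the main-process RNG, which every rank
    # seeded with the same value, so worker k gets the same seed on every rank.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)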

Who can help?

@zach-huggingface @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

print(os.environ["LOCAL_RANK"], torch.randint(low=0, high=len(self), size=()).item())

in the __getitem__ method of the dataset, and the same random value is printed across different ranks.

Expected behavior

Expect different values across different ranks.

@gathierry gathierry added the bug label May 2, 2025
@Rocketknight1
Member

This seems significant, yeah - cc @SunMarc

@SunMarc
Member

SunMarc commented May 6, 2025

Hi @gathierry, thanks for the report. Can you share a reproducer? The seed should not be the same for each worker: torch.initial_seed() shouldn't return the same seed.

@gathierry
Contributor Author

gathierry commented May 6, 2025

Hi @SunMarc, torch.initial_seed() returns a different seed for each worker. However, across ranks, the same worker index gets the same seed.
For example, with 2 GPUs and dataloader num_workers = 2, the seeds could be:
Rank 0 worker 0: seed = 26593574
Rank 0 worker 1: seed = 26593575
Rank 1 worker 0: seed = 26593574
Rank 1 worker 1: seed = 26593575

@SunMarc
Member

SunMarc commented May 6, 2025

Then we can potentially modify the seed a bit depending on the rank of the machine. This looks like a past issue; if you have time, can you check whether the reproducer shared in the description works with / without transformers? See huggingface/accelerate#789 and huggingface/accelerate#786.
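
For illustration, a rank-aware worker_init_fn could look roughly like this (an untested sketch; it assumes the RANK environment variable set by torchrun, and is not the actual fix from the linked PR):

import os
import random

import numpy as np
import torch


def rank_aware_seed_worker(worker_id):
    # Untested sketch: mix the process rank (RANK env var, as set by torchrun)
    # into the per-worker seed so that worker k on rank 0 and worker k on
    # rank 1 no longer share the same seed.
    rank = int(os.environ.get("RANK", "0"))
    worker_seed = (torch.initial_seed() + rank) % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)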

@gathierry
Contributor Author

gathierry commented May 6, 2025

Here's a reproducer

import os
import random

import torch
from torch.utils.data import Dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Dummy dataset
class DummyDataset(Dataset):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        print(f"Rank: {os.environ.get('LOCAL_RANK')}", random.random())
        inputs = self.tokenizer("This is a dummy input", padding="max_length", truncation=True, max_length=32, return_tensors="pt")
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        inputs["labels"] = torch.tensor(1)
        return inputs

def main():
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    dataset = DummyDataset(tokenizer)

    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=8,
        num_train_epochs=2,
        logging_steps=10,
        logging_dir="./logs",
        save_steps=1000,
        report_to="none",  # disable WandB, etc.
        ddp_find_unused_parameters=False,  # Required in many cases
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,
    )

    trainer.train()

if __name__ == "__main__":
    main()
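
Run it on a 2-GPU machine, e.g. with torchrun --nproc_per_node=2 reproducer.py (the script name here is arbitrary).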

And part of the logs. The same numbers show up twice, once per rank:

Rank: 1 0.6394267984578837

Rank: 1 0.025010755222666936
Rank: 0 0.6394267984578837
Rank: 1 0.27502931836911926
Rank: 1 0.22321073814882275
Rank: 0 0.025010755222666936
Rank: 1 0.7364712141640124
Rank: 0 0.27502931836911926
Rank: 1 0.6766994874229113
Rank: 0 0.22321073814882275
Rank: 1 0.8921795677048454
Rank: 0 0.7364712141640124
Rank: 1 0.08693883262941615
Rank: 0 0.6766994874229113
Rank: 0 0.8921795677048454
Rank: 0 0.08693883262941615
Rank: 1 0.4219218196852704
Rank: 1 0.029797219438070344

@gathierry
Contributor Author

Also raised a PR for this.
