Different DataLoader workers share the same seed and lose randomness #37932


Closed
4 tasks
gathierry opened this issue May 2, 2025 · 6 comments · Fixed by #37980
Comments

@gathierry
Contributor

System Info

The train dataloader uses seed_worker to set the seed in each worker. But with distributed training (e.g. DDP, DeepSpeed, etc.), different ranks end up with the same seed, so any random data augmentation is identical across ranks.

dataloader_params["worker_init_fn"] = seed_worker
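
For reference, seed_worker is essentially a worker_init_fn along these lines (a paraphrased sketch, not the exact transformers source): inside a worker, torch.initial_seed() is base_seed + worker_id, so it differs between workers of one process, but the base seed is identical on every rank because each rank was seeded with the same value.

import random

import numpy as np
import torch


def seed_worker(worker_id):
    # Inside a DataLoader worker, torch.initial_seed() is base_seed + worker_id.
    # The base seed is derived from the main-process RNG, which every rank
    # seeded with the same value, so worker k gets the same seed on every rank.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)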

Who can help?

@zach-huggingface @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

print(os.environ["LOCAL_RANK"], torch.randint(low=0, high=len(self), size=()).item())

in the __getitem__ method of the dataset, and the same random value is printed across different ranks.

Expected behavior

Expect different values across different ranks.

@gathierry gathierry added the bug label May 2, 2025
@Rocketknight1
Member

This seems significant, yeah - cc @SunMarc

@SunMarc
Member

SunMarc commented May 6, 2025

Hi @gathierry, thanks for the report. Can you share a reproducer? The seed should not be the same for each worker: torch.initial_seed() shouldn't return the same seed.

@gathierry
Contributor Author

gathierry commented May 6, 2025

Hi @SunMarc, torch.initial_seed() returns a different seed for each worker. However, across ranks, the same worker index gets the same seed.
For example, with 2 GPUs and dataloader num_workers = 2, the seeds could be:
Rank 0 worker 0: seed = 26593574
Rank 0 worker 1: seed = 26593575
Rank 1 worker 0: seed = 26593574
Rank 1 worker 1: seed = 26593575

@SunMarc
Member

SunMarc commented May 6, 2025

Then we can potentially modify the seed a bit depending on the rank of the machine. This looks like a past issue; if you have time, can you check whether the reproducer shared in the description works with / without transformers? See huggingface/accelerate#789 and huggingface/accelerate#786.
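
For illustration, a rank-aware worker_init_fn could look roughly like this (an untested sketch; it assumes the RANK environment variable set by torchrun, and is not the actual fix from the linked PR):

import os
import random

import numpy as np
import torch


def rank_aware_seed_worker(worker_id):
    # Untested sketch: mix the process rank (RANK env var, as set by torchrun)
    # into the per-worker seed so that worker k on rank 0 and worker k on
    # rank 1 no longer share the same seed.
    rank = int(os.environ.get("RANK", "0"))
    worker_seed = (torch.initial_seed() + rank) % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)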

@gathierry
Contributor Author

gathierry commented May 6, 2025

Here's a reproducer

import os
import random

import torch
from torch.utils.data import Dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Dummy dataset
class DummyDataset(Dataset):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        print(f"Rank: {os.environ.get('LOCAL_RANK')}", random.random())
        inputs = self.tokenizer("This is a dummy input", padding="max_length", truncation=True, max_length=32, return_tensors="pt")
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        inputs["labels"] = torch.tensor(1)
        return inputs

def main():
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    dataset = DummyDataset(tokenizer)

    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=8,
        num_train_epochs=2,
        logging_steps=10,
        logging_dir="./logs",
        save_steps=1000,
        report_to="none",  # disable WandB, etc.
        ddp_find_unused_parameters=False,  # Required in many cases
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,
    )

    trainer.train()

if __name__ == "__main__":
    main()
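
Run it on a 2-GPU machine, e.g. with torchrun --nproc_per_node=2 reproducer.py (the script name here is arbitrary).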

And part of the logs. The same numbers show up twice, once per rank:

Rank: 1 0.6394267984578837

Rank: 1 0.025010755222666936
Rank: 0 0.6394267984578837
Rank: 1 0.27502931836911926
Rank: 1 0.22321073814882275
Rank: 0 0.025010755222666936
Rank: 1 0.7364712141640124
Rank: 0 0.27502931836911926
Rank: 1 0.6766994874229113
Rank: 0 0.22321073814882275
Rank: 1 0.8921795677048454
Rank: 0 0.7364712141640124
Rank: 1 0.08693883262941615
Rank: 0 0.6766994874229113
Rank: 0 0.8921795677048454
Rank: 0 0.08693883262941615
Rank: 1 0.4219218196852704
Rank: 1 0.029797219438070344

@gathierry
Contributor Author

Also raised a PR for this.
