Different DataLoader workers share the same seed and lose randomness #37932
Comments
This seems significant, yeah - cc @SunMarc
Hi @gathierry, thanks for the report. Can you share a reproducer? The seed should not be the same for each worker.
Hi @SunMarc,
Then we can potentially modify the seed a bit depending on the rank of the machine. This looks like this past issue; if you have time, can you check whether the reproducer shared in the description works with / without transformers? huggingface/accelerate#789 huggingface/accelerate#786
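A minimal sketch of that idea with a plain PyTorch DataLoader, assuming a `worker_init_fn`-style hook; the function name, the `RANK` lookup, and the `1024` offset are illustrative, not the Trainer's actual code:

```python
import os
import random

import numpy as np
import torch


def rank_aware_worker_init_fn(worker_id: int) -> None:
    """Hypothetical worker_init_fn that mixes the DDP rank into the worker seed."""
    # Inside a worker, torch.initial_seed() is base_seed + worker_id, but the base
    # seed is identical on every rank when each process was seeded with the same
    # value, so offset it by the rank to break the symmetry.
    rank = int(os.environ.get("RANK", 0))
    worker_seed = (torch.initial_seed() + 1024 * rank) % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)
    torch.manual_seed(worker_seed)


# Usage with a plain DataLoader:
# loader = DataLoader(dataset, num_workers=4, worker_init_fn=rank_aware_worker_init_fn)
```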
Here's a reproducer:

```python
import os
import random

import torch
from torch.utils.data import Dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)


# Dummy dataset
class DummyDataset(Dataset):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        print(f"Rank: {os.environ.get('LOCAL_RANK')}", random.random())
        inputs = self.tokenizer("This is a dummy input", padding="max_length", truncation=True, max_length=32, return_tensors="pt")
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        inputs["labels"] = torch.tensor(1)
        return inputs


def main():
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    dataset = DummyDataset(tokenizer)

    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=8,
        num_train_epochs=2,
        logging_steps=10,
        logging_dir="./logs",
        save_steps=1000,
        report_to="none",  # disable WandB, etc.
        ddp_find_unused_parameters=False,  # required in many cases
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```

And part of the logs. We can see the same numbers show up on both ranks:

```text
Rank: 1 0.6394267984578837
  0%|          | 0/14 [00:00<?, ?it/s]Rank: 1 0.025010755222666936
Rank: 0 0.6394267984578837
Rank: 1 0.27502931836911926
Rank: 1 0.22321073814882275
Rank: 0 0.025010755222666936
Rank: 1 0.7364712141640124
Rank: 0 0.27502931836911926
Rank: 1 0.6766994874229113
Rank: 0 0.22321073814882275
Rank: 1 0.8921795677048454
Rank: 0 0.7364712141640124
Rank: 1 0.08693883262941615
Rank: 0 0.6766994874229113
Rank: 0 0.8921795677048454
Rank: 0 0.08693883262941615
Rank: 1 0.4219218196852704
Rank: 1 0.029797219438070344
```
Also raised a PR for this.
System Info
train_dataloader uses seed_worker to set the seed in each DataLoader worker. But when working with distributed training (e.g. DDP, DeepSpeed, etc.), different ranks share the same seed, so any random data augmentation is identical across ranks.
transformers/src/transformers/trainer.py
Line 1031 in 2932f31
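For context (not the Trainer's exact code path, just an illustration of the mechanism): because every rank is seeded with the same value, the global RNGs on all ranks produce identical streams, and any per-worker seed derived from them collides across ranks. The `seed_everything` helper below is a stand-in for that per-process seeding:

```python
import random

import torch


def seed_everything(seed: int) -> None:
    # Stand-in for the identical seeding every rank performs at startup.
    random.seed(seed)
    torch.manual_seed(seed)


# Simulate two ranks that are both seeded with 42.
seed_everything(42)
rank0_values = [random.random() for _ in range(3)]

seed_everything(42)
rank1_values = [random.random() for _ in range(3)]

print(rank0_values == rank1_values)  # True: the "augmentation" randomness is identical
```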
Who can help?
@zach-huggingface @SunMarc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Add
`print(os.environ["LOCAL_RANK"], torch.randint(low=0, high=len(self), size=()).item())`
in the `__getitem__` function of the dataset, and observe the same random values across different ranks.
Expected behavior
Expect different values across different ranks.
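Until this is fixed in the Trainer, a possible user-side workaround (a sketch; the class name and the `LOCAL_RANK` lookup are assumptions) is to give the dataset its own per-rank generator instead of relying on the globally seeded `random` module:

```python
import os
import random

from torch.utils.data import Dataset


class RankAwareAugmentDataset(Dataset):
    """Toy dataset that keeps its own RNG, seeded with the process rank."""

    def __init__(self, data, base_seed: int = 42):
        self.data = data
        rank = int(os.environ.get("LOCAL_RANK", 0))
        # A dedicated generator is unaffected by later global set_seed() calls
        # and differs on every rank because of the rank offset.
        self.rng = random.Random(base_seed + rank)
        # Note: with num_workers > 0 each worker receives a copy of this RNG,
        # so re-seed it per worker (e.g. in a worker_init_fn) if that matters.

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        noise = self.rng.random()  # per-rank augmentation randomness
        return {"value": self.data[idx], "noise": noise}
```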