System Info
transformers==4.41.0

Who can help?
@pacman100 @muellerzr

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
In an extraordinary case of "things that can't possibly be interacting, yet somehow they are", it seems that in 4.41.0, logging to wandb breaks distributed training with FSDP!
Taking the trl SFT example as a basis:
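(The script itself didn't survive the copy here, so as a stand-in, the following is a minimal sketch of that kind of setup; the model, dataset, and FSDP flags are illustrative assumptions, not the reporter's actual configuration.)

```python
# Hypothetical stand-in for the reporter's sft.py, modeled on the trl
# SFT quickstart. Model/dataset names and FSDP flags are placeholders.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    report_to="wandb",            # the flag implicated in the crash on 4.41.0
    fsdp="full_shard auto_wrap",  # illustrative FSDP settings
)

trainer = SFTTrainer(
    model="facebook/opt-350m",
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()  # run across GPUs, e.g. via `accelerate launch sft.py`
```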
This works with transformers==4.40.2, but crashes with transformers==4.41.0. Setting --report_to none instead, it still works in 4.41.0.
The error is pretty nondescript:
```
Traceback (most recent call last):
  File "sft.py", line 159, in <module>
    trainer.train()
  File "/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 841, in forward
    args, kwargs = _root_pre_forward(self, self, args, kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 510, in _root_pre_forward
    _lazy_init(state, module)
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 138, in _lazy_init
    _share_state_and_init_handle_attrs(state, root_module)
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 207, in _share_state_and_init_handle_attrs
    _p_assert(
  File "/lib/python3.11/site-packages/torch/distributed/utils.py", line 146, in _p_assert
    raise AssertionError(s)
AssertionError: Non-root FSDP instance's `_is_root` should not have been set yet or should have been set to `False`
```
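To restate the delta: the only change between the crashing run and the working run is the reporting backend, i.e. the report_to argument of TrainingArguments. A minimal illustration (all other arguments held fixed; output_dir is a placeholder):

```python
from transformers import TrainingArguments

# Crashes under FSDP on transformers==4.41.0:
crashing = TrainingArguments(output_dir="out", report_to="wandb")

# Works on 4.40.2 and 4.41.0 alike (what --report_to none maps to):
working = TrainingArguments(output_dir="out", report_to="none")
```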
Expected behavior
I'd expect this to not crash.