System Info
transformers==4.41.0

Who can help?
@pacman100 @muellerzr

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
In an extraordinary case of "things that can't possibly be interacting, yet somehow they are", it seems that in 4.41.0, logging to wandb breaks distributed training with FSDP!
Taking the trl SFT example as a basis:
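(The script itself didn't survive the copy here, so as a stand-in, the following is a minimal sketch of that kind of setup; the model, dataset, and FSDP flags are illustrative assumptions, not the reporter's actual configuration.)

```python
# Hypothetical stand-in for the reporter's sft.py, modeled on the trl
# SFT quickstart. Model/dataset names and FSDP flags are placeholders.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    report_to="wandb",            # the flag implicated in the crash on 4.41.0
    fsdp="full_shard auto_wrap",  # illustrative FSDP settings
)

trainer = SFTTrainer(
    model="facebook/opt-350m",
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()  # run across GPUs, e.g. via `accelerate launch sft.py`
```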
This works with transformers==4.40.2, but crashes with transformers==4.41.0. Setting --report_to none instead, it still works in 4.41.0.
The error is pretty nondescript:
```
Traceback (most recent call last):
  File "sft.py", line 159, in <module>
    trainer.train()
  File "/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 841, in forward
    args, kwargs = _root_pre_forward(self, self, args, kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 510, in _root_pre_forward
    _lazy_init(state, module)
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 138, in _lazy_init
    _share_state_and_init_handle_attrs(state, root_module)
  File "/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 207, in _share_state_and_init_handle_attrs
    _p_assert(
  File "/lib/python3.11/site-packages/torch/distributed/utils.py", line 146, in _p_assert
    raise AssertionError(s)
AssertionError: Non-root FSDP instance's `_is_root` should not have been set yet or should have been set to `False`
```
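To restate the delta: the only change between the crashing run and the working run is the reporting backend, i.e. the report_to argument of TrainingArguments. A minimal illustration (all other arguments held fixed; output_dir is a placeholder):

```python
from transformers import TrainingArguments

# Crashes under FSDP on transformers==4.41.0:
crashing = TrainingArguments(output_dir="out", report_to="wandb")

# Works on 4.40.2 and 4.41.0 alike (what --report_to none maps to):
working = TrainingArguments(output_dir="out", report_to="none")
```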
Expected behavior
I'd expect this to not crash.