System Info
transformers version: 4.51.3

Who can help?
@SunMarc @zach-huggingface

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Use Seq2SeqTrainingArguments with save_total_limit=4, batch_size=12, epochs=8, and push_to_hub=True. The task was summarization, the model was google/mt5-small, and the dataset was xlsum-en-es.
Pass the training args to Seq2SeqTrainer with a custom compute_metrics.
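For reference, a minimal sketch of this setup, assuming the Chapter 7 summarization recipe. The output_dir name is hypothetical, eval_strategy and predict_with_generate are assumptions matching the course, and train_dataset, eval_dataset, and compute_metrics stand in for the tokenized xlsum splits and the custom metric function (not shown):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-xlsum-en-es",  # hypothetical output/repo name
    save_total_limit=4,
    per_device_train_batch_size=12,      # "batch_size=12" above
    num_train_epochs=8,                  # "epochs=8" above
    push_to_hub=True,                    # triggers the background checkpoint push
    eval_strategy="epoch",               # assumption, matching the course setup
    predict_with_generate=True,          # assumption, matching the course setup
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # tokenized xlsum 'english' + 'spanish' data (not shown)
    eval_dataset=eval_dataset,       # (not shown)
    compute_metrics=compute_metrics, # custom metric function (not shown)
)
trainer.train()
```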
Expected behavior
I followed the summarization topic in Chapter 7 of the LLM Course with a dataset of my own choice. The training ran overnight for about 14 hours. After evaluation completed for the 8th epoch, the trainer kept running for hours. After I force-stopped the execution, I got the following error trace:
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[27], line 1
----> 1 trainer.train()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2236, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2233 try:
   2234     # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
   2235     hf_hub_utils.disable_progress_bars()
-> 2236     return inner_training_loop(
   2237         args=args,
   2238         resume_from_checkpoint=resume_from_checkpoint,
   2239         trial=trial,
   2240         ignore_keys_for_eval=ignore_keys_for_eval,
   2241     )
   2242 finally:
   2243     hf_hub_utils.enable_progress_bars()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2728, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2725 self.control = self.callback_handler.on_train_end(args, self.state, self.control)
   2727 # Wait for the checkpoint to be uploaded.
-> 2728 self._finish_current_push()
   2730 # After training we make sure to retrieve back the original forward pass method
   2731 # for the embedding layer by removing the forward post hook.
   2732 if self.neftune_noise_alpha is not None:

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:4773, in Trainer._finish_current_push(self)
   4771 if self.push_in_progress is not None and not self.push_in_progress.is_done():
   4772     logger.info("Waiting for the current checkpoint push to be finished, this might take a couple of minutes.")
-> 4773     self.push_in_progress.wait_until_done()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/utils/hub.py:1185, in PushInProgress.wait_until_done(self)
   1184 def wait_until_done(self):
-> 1185     futures.wait(self.jobs)

File /usr/lib/python3.12/concurrent/futures/_base.py:305, in wait(fs, timeout, return_when)
    301     return DoneAndNotDoneFutures(done, not_done)
    303 waiter = _create_and_install_waiters(fs, return_when)
--> 305 waiter.event.wait(timeout)
    306 for f in fs:
    307     with f._condition:

File /usr/lib/python3.12/threading.py:655, in Event.wait(self, timeout)
    653 signaled = self._flag
    654 if not signaled:
--> 655     signaled = self._cond.wait(timeout)
    656 return signaled

File /usr/lib/python3.12/threading.py:355, in Condition.wait(self, timeout)
    353 try:    # restore state no matter what (e.g., KeyboardInterrupt)
    354     if timeout is None:
--> 355         waiter.acquire()
    356         gotit = True
    357     else:

KeyboardInterrupt:
The last frame shows the main thread parked in waiter.acquire(), waiting on a lock that is never released because the upload futures never complete. Smaller training runs completed without issue.
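For context, PushInProgress.wait_until_done() calls futures.wait(self.jobs) with no timeout, so if an upload job never finishes the call never returns. A minimal sketch of that blocking pattern, with a stand-in job replacing the actual Hub upload:

```python
import concurrent.futures
import threading

# Stand-in for a Hub upload that never finishes (e.g., a stalled connection).
stall = threading.Event()

def stuck_upload():
    stall.wait()  # 'stall' is never set, so this future never completes

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
jobs = [executor.submit(stuck_upload)]

# wait_until_done() does futures.wait(jobs) with timeout=None, which parks
# the main thread in Condition.wait -> waiter.acquire() -- the exact frame
# where the KeyboardInterrupt above landed. With a timeout we regain control:
done, not_done = concurrent.futures.wait(jobs, timeout=2)
print(f"done={len(done)}, still pending={len(not_done)}")  # done=0, still pending=1

stall.set()  # release the worker so this script can exit cleanly
executor.shutdown()
```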
Update: I tried manually calling trainer.push_to_hub() and it behaved the same; the upload had not started even after 40 minutes, despite a stable internet connection. Upon KeyboardInterrupt, the stack trace was the same, with the underlying wait in waiter.acquire().
EDIT 1:
The same happens with trainer.save_model().
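If it helps narrow this down, the state of the background push can be inspected directly instead of blocking on it. A diagnostic sketch, run in the same session as the hung trainer; the attribute and method names (push_in_progress, is_done, jobs) are taken from the trace above:

```python
# Inspect the pending upload jobs rather than waiting on them.
push = trainer.push_in_progress  # set by Trainer when push_to_hub=True
if push is not None:
    print("push done:", push.is_done())
    for job in push.jobs:  # concurrent.futures.Future objects
        print(job, "running:", job.running(), "done:", job.done())
```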
Hi, thanks for getting back to me. I was following the LLM Course, Chapter 7; the topic was summarization, as listed here. The only change I made was using the xlsum dataset for 'english' and 'spanish'. I've listed the training arguments I used above.