Description
System Info
- `transformers` version: 4.51.3
- Platform: Linux-6.11.0-25-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu126 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Trainer API's parallelism
- Using GPU in script?: Yes
- GPU type: Quadro RTX 5000 with Max-Q Design
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Use `Seq2SeqTrainingArguments` with `save_total_limit=4`, `batch_size=12`, `epochs=8`, and `push_to_hub=True`. The task was summarization, the model was google/mt5-small, and the dataset was xlsum-en-es.
- Pass the training args to `Seq2SeqTrainer` with a custom `compute_metrics` (a rough sketch of the setup is included below).
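
For reference, this is roughly the setup, reconstructed from the course. It is a sketch rather than my exact notebook code: `tokenized_datasets`, the `output_dir` name, the per-epoch evaluation, and the ROUGE-based `compute_metrics` are assumptions.

```python
import numpy as np
import evaluate
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

rouge = evaluate.load("rouge")  # requires the `rouge_score` package

def compute_metrics(eval_preds):
    # Course-style ROUGE metric: decode predictions and labels before scoring.
    preds, labels = eval_preds
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-finetuned-xlsum-en-es",  # assumed output/repo name
    eval_strategy="epoch",
    save_total_limit=4,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    num_train_epochs=8,
    predict_with_generate=True,
    push_to_hub=True,  # checkpoints are pushed to the Hub asynchronously during training
)

# `tokenized_datasets` is assumed to be the pre-tokenized xlsum-en-es DatasetDict;
# the tokenization step from the course is omitted here for brevity.
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```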
Expected behavior
I followed the summarization section in Chapter 7 of the LLM Course with a dataset of my own choice. Training ran overnight for about 14 hours. After the evaluation for the 8th epoch completed, the trainer kept running for several more hours. After I force-stopped the execution, I got the following error trace:
KeyboardInterrupt Traceback (most recent call last)
Cell In[27], line 1
----> 1 trainer.train()
File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2236, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
2233 try:
2234 # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
2235 hf_hub_utils.disable_progress_bars()
-> 2236 return inner_training_loop(
2237 args=args,
2238 resume_from_checkpoint=resume_from_checkpoint,
2239 trial=trial,
2240 ignore_keys_for_eval=ignore_keys_for_eval,
2241 )
2242 finally:
2243 hf_hub_utils.enable_progress_bars()
File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2728, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2725 self.control = self.callback_handler.on_train_end(args, self.state, self.control)
2727 # Wait for the checkpoint to be uploaded.
-> 2728 self._finish_current_push()
2730 # After training we make sure to retrieve back the original forward pass method
2731 # for the embedding layer by removing the forward post hook.
2732 if self.neftune_noise_alpha is not None:
File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:4773, in Trainer._finish_current_push(self)
4771 if self.push_in_progress is not None and not self.push_in_progress.is_done():
4772 logger.info("Waiting for the current checkpoint push to be finished, this might take a couple of minutes.")
-> 4773 self.push_in_progress.wait_until_done()
File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/utils/hub.py:1185, in PushInProgress.wait_until_done(self)
1184 def wait_until_done(self):
-> 1185 futures.wait(self.jobs)
File /usr/lib/python3.12/concurrent/futures/_base.py:305, in wait(fs, timeout, return_when)
301 return DoneAndNotDoneFutures(done, not_done)
303 waiter = _create_and_install_waiters(fs, return_when)
--> 305 waiter.event.wait(timeout)
306 for f in fs:
307 with f._condition:
File /usr/lib/python3.12/threading.py:655, in Event.wait(self, timeout)
653 signaled = self._flag
654 if not signaled:
--> 655 signaled = self._cond.wait(timeout)
656 return signaled
File /usr/lib/python3.12/threading.py:355, in Condition.wait(self, timeout)
353 try: # restore state no matter what (e.g., KeyboardInterrupt)
354 if timeout is None:
--> 355 waiter.acquire()
356 gotit = True
357 else:
KeyboardInterrupt:
The last frames suggest that the main thread was stuck waiting for the checkpoint push to finish and the lock was never acquired. Smaller training runs completed without issue.
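
For what it's worth, here is a minimal, self-contained sketch (not transformers code) of the waiting pattern the trace points at: `PushInProgress.wait_until_done` calls `concurrent.futures.wait(self.jobs)` with no timeout, so if a push job never completes, the main thread blocks until it is interrupted.

```python
import concurrent.futures as futures
import threading

# Stand-in for an upload job that never completes.
never_set = threading.Event()

pool = futures.ThreadPoolExecutor(max_workers=1)
job = pool.submit(never_set.wait)  # this future never resolves on its own

# With no timeout (as in PushInProgress.wait_until_done) the next call would block
# until the process is interrupted. Passing a timeout shows the job is still pending:
done, not_done = futures.wait([job], timeout=2)
print(f"done={len(done)}, not_done={len(not_done)}")  # prints: done=0, not_done=1

never_set.set()           # unblock the worker so the script can exit cleanly
pool.shutdown(wait=True)
```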