Trainer API doesnt stop after the training has been completed

### System Info

- `transformers` version: 4.51.3
- Platform: Linux-6.11.0-25-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: 1.6.0
- Accelerate config:    not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu126 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Trainer API's parallelism
- Using GPU in script?: Yes
- GPU type: Quadro RTX 5000 with Max-Q Design

### Who can help?

@SunMarc @zach-huggingface

### Information

- [x] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

1. Use `Seq2SeqTrainingArguments`,  with `save_total_limit=4`, `batch_szie=12`, `epochs=8`, `push_to_hub=True`. The task was summarization, the model used was google/mt5-small and the dataset was xlsum-en-es
2. Pass the training args to `Seq2SeqTrainer` with custom `compute_metrics`.

### Expected behavior

I followed the summarization topic in Chapter 7 of LLM Course with the dataset of my own choice. The training continued overnight for about 14 hours. After the evaluation was completed for 8th epoch the trainer continued running for another hours. After I force stopped the execution, I got the following error trace:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[27], line 1
----> 1 trainer.train()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2236, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2233 try:
   2234     # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
   2235     hf_hub_utils.disable_progress_bars()
-> 2236     return inner_training_loop(
   2237         args=args,
   2238         resume_from_checkpoint=resume_from_checkpoint,
   2239         trial=trial,
   2240         ignore_keys_for_eval=ignore_keys_for_eval,
   2241     )
   2242 finally:
   2243     hf_hub_utils.enable_progress_bars()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2728, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2725 self.control = self.callback_handler.on_train_end(args, self.state, self.control)
   2727 # Wait for the checkpoint to be uploaded.
-> 2728 self._finish_current_push()
   2730 # After training we make sure to retrieve back the original forward pass method
   2731 # for the embedding layer by removing the forward post hook.
   2732 if self.neftune_noise_alpha is not None:

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:4773, in Trainer._finish_current_push(self)
   4771 if self.push_in_progress is not None and not self.push_in_progress.is_done():
   4772     logger.info("Waiting for the current checkpoint push to be finished, this might take a couple of minutes.")
-> 4773     self.push_in_progress.wait_until_done()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/utils/hub.py:1185, in PushInProgress.wait_until_done(self)
   1184 def wait_until_done(self):
-> 1185     futures.wait(self.jobs)

File /usr/lib/python3.12/concurrent/futures/_base.py:305, in wait(fs, timeout, return_when)
    301         return DoneAndNotDoneFutures(done, not_done)
    303     waiter = _create_and_install_waiters(fs, return_when)
--> 305 waiter.event.wait(timeout)
    306 for f in fs:
    307     with f._condition:

File /usr/lib/python3.12/threading.py:655, in Event.wait(self, timeout)
    653 signaled = self._flag
    654 if not signaled:
--> 655     signaled = self._cond.wait(timeout)
    656 return signaled

File /usr/lib/python3.12/threading.py:355, in Condition.wait(self, timeout)
    353 try:    # restore state no matter what (e.g., KeyboardInterrupt)
    354     if timeout is None:
--> 355         waiter.acquire()
    356         gotit = True
    357     else:

KeyboardInterrupt:
----------------------------------------------------

The last trace suggests that the thread kept waiting and the lock was not acquired. Smaller training runs were completed without issue. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Trainer API doesnt stop after the training has been completed #38039

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

KeyboardInterrupt:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Trainer API doesnt stop after the training has been completed #38039

Description

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

KeyboardInterrupt:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions