Trainer API doesn't stop after the training has been completed #38039

Open · 2 of 4 tasks
Awaisn25 opened this issue May 9, 2025 · 3 comments

Awaisn25 commented May 9, 2025

System Info

  • transformers version: 4.51.3
  • Platform: Linux-6.11.0-25-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 0.30.2
  • Safetensors version: 0.5.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.6.0+cu126 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Trainer API's parallelism
  • Using GPU in script?: Yes
  • GPU type: Quadro RTX 5000 with Max-Q Design

Who can help?

@SunMarc @zach-huggingface

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Use Seq2SeqTrainingArguments with save_total_limit=4, batch_size=12, 8 epochs, and push_to_hub=True (sketched below). The task was summarization, the model was google/mt5-small, and the dataset was xlsum-en-es.
  2. Pass the training arguments to Seq2SeqTrainer together with a custom compute_metrics.
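
Roughly, the setup looks like the sketch below. This is not the exact notebook cell: the output_dir name, eval_strategy="epoch", and the tokenized_datasets / compute_metrics objects (prepared as in the course chapter) are placeholders I am filling in from memory.

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-finetuned-xlsum-en-es",  # placeholder name
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    num_train_epochs=8,
    save_total_limit=4,
    eval_strategy="epoch",        # evaluation ran once per epoch
    predict_with_generate=True,   # needed for the ROUGE compute_metrics
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],       # prepared as in the course chapter
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,                 # custom ROUGE metric
)

trainer.train()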

Expected behavior

I followed the summarization section in Chapter 7 of the LLM Course with a dataset of my own choice. The training ran overnight for about 14 hours. After evaluation for the 8th epoch completed, the trainer kept running for several more hours. When I force-stopped the execution, I got the following error trace:


KeyboardInterrupt Traceback (most recent call last)
Cell In[27], line 1
----> 1 trainer.train()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2236, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
2233 try:
2234 # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
2235 hf_hub_utils.disable_progress_bars()
-> 2236 return inner_training_loop(
2237 args=args,
2238 resume_from_checkpoint=resume_from_checkpoint,
2239 trial=trial,
2240 ignore_keys_for_eval=ignore_keys_for_eval,
2241 )
2242 finally:
2243 hf_hub_utils.enable_progress_bars()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:2728, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2725 self.control = self.callback_handler.on_train_end(args, self.state, self.control)
2727 # Wait for the checkpoint to be uploaded.
-> 2728 self._finish_current_push()
2730 # After training we make sure to retrieve back the original forward pass method
2731 # for the embedding layer by removing the forward post hook.
2732 if self.neftune_noise_alpha is not None:

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/trainer.py:4773, in Trainer._finish_current_push(self)
4771 if self.push_in_progress is not None and not self.push_in_progress.is_done():
4772 logger.info("Waiting for the current checkpoint push to be finished, this might take a couple of minutes.")
-> 4773 self.push_in_progress.wait_until_done()

File ~/.pyenv/versions/llm/lib/python3.12/site-packages/transformers/utils/hub.py:1185, in PushInProgress.wait_until_done(self)
1184 def wait_until_done(self):
-> 1185 futures.wait(self.jobs)

File /usr/lib/python3.12/concurrent/futures/_base.py:305, in wait(fs, timeout, return_when)
301 return DoneAndNotDoneFutures(done, not_done)
303 waiter = _create_and_install_waiters(fs, return_when)
--> 305 waiter.event.wait(timeout)
306 for f in fs:
307 with f._condition:

File /usr/lib/python3.12/threading.py:655, in Event.wait(self, timeout)
653 signaled = self._flag
654 if not signaled:
--> 655 signaled = self._cond.wait(timeout)
656 return signaled

File /usr/lib/python3.12/threading.py:355, in Condition.wait(self, timeout)
353 try: # restore state no matter what (e.g., KeyboardInterrupt)
354 if timeout is None:
--> 355 waiter.acquire()
356 gotit = True
357 else:

KeyboardInterrupt:

The last frames suggest that the thread kept waiting and the lock was never acquired, i.e. the checkpoint push never finished. Smaller training runs completed without issue.
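
For what it's worth, the pending upload can still be inspected after interrupting the run, using only the attributes visible in the trace above (Trainer.push_in_progress, its jobs list of concurrent.futures objects, and is_done()). A diagnostic sketch:

# Inspect the stuck checkpoint push after the KeyboardInterrupt.
push = trainer.push_in_progress
if push is None:
    print("No push in progress")
else:
    print("Push done?", push.is_done())
    for job in push.jobs:
        print(job, "| running:", job.running(), "| done:", job.done())
        # exception() would block on a pending future, so only query finished jobs
        if job.done() and job.exception() is not None:
            print("  job failed with:", job.exception())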

Awaisn25 added the bug label May 9, 2025

Awaisn25 commented May 9, 2025

Update: I tried calling trainer.push_to_hub() manually and it behaved the same as before; the upload did not start even after 40 minutes, despite a stable internet connection. Upon KeyboardInterrupt, the stack trace was the same, with the underlying wait in waiter.acquire().

EDIT 1:
The same thing happens with trainer.save_model().
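
As a possible interim workaround (untested here, not a fix), saving the weights to disk directly and uploading them separately should bypass the Trainer's hub-push wait. A sketch, with the output directory as a placeholder:

# Local save that does not go through Trainer.save_model() / push_to_hub().
output_dir = "mt5-small-xlsum-local"       # placeholder path
trainer.model.save_pretrained(output_dir)  # writes config + weights
tokenizer.save_pretrained(output_dir)      # writes tokenizer files alongside

# The folder can then be uploaded on its own with huggingface_hub, e.g.:
# from huggingface_hub import HfApi
# HfApi().upload_folder(folder_path=output_dir, repo_id="<user>/<repo>")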


SunMarc commented May 12, 2025

Hey @Awaisn25, can you share a simple minimal reproducer, since the issue seems to come from push_to_hub?

Awaisn25 commented

Hi, thanks for getting back. I was following the LLM Course, Chapter 7; the topic was summarization, as listed here. The only change I made was using the xlsum dataset for 'english' and 'spanish'. I've listed the training arguments I used above.
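
For completeness, the data side looks roughly like the sketch below (the exact dataset identifier may differ; I am using the csebuetnlp/xlsum configs for English and Spanish here). Tokenization and compute_metrics then follow the course chapter.

from datasets import concatenate_datasets, load_dataset

# Depending on the datasets version, trust_remote_code=True may be required.
english = load_dataset("csebuetnlp/xlsum", "english")
spanish = load_dataset("csebuetnlp/xlsum", "spanish")

# Merge the two languages split by split and shuffle.
raw_datasets = {
    split: concatenate_datasets([english[split], spanish[split]]).shuffle(seed=42)
    for split in ("train", "validation", "test")
}
# Tokenization with the mt5-small tokenizer and the ROUGE compute_metrics
# then follow the summarization section of Chapter 7 of the LLM Course.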
