
During both stage 1 and stage 2 training, the training stops unexpectedly for unknown reasons. #118

wintercat1994 opened this issue Apr 19, 2024 · 0 comments

Thank you for your work!
However, I have run into an issue: during both stage 1 and stage 2 training, the run stops unexpectedly for unknown reasons. This typically happens after several thousand training steps.

Attached is the log from the point where my training run abruptly stopped.
I would really appreciate any guidance on what might be causing this.

Steps: 58%|█████▊ | 5831/10000 [3:27:42<2:04:26, 1.79s/it, lr=1e-5, step_loss=0.0255, td=0.06s]
Steps: 58%|█████▊ | 5832/10000 [3:27:44<2:04:41, 1.80s/it, lr=1e-5, step_loss=0.0255, td=0.06s]
Steps: 58%|█████▊ | 5832/10000 [3:27:44<2:04:41, 1.80s/it, lr=1e-5, step_loss=0.0366, td=0.05s]WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68694 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68695 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68696 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68697 closing signal SIGHUP
Traceback (most recent call last):
File "/data/Moore-AnimateAnyone/.venv/bin/accelerate", line 8, in
sys.exit(main())
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 68556 got signal: 1
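
For context: signal 1 is SIGHUP, which the torch.distributed.elastic signal handler turns into the SignalException above and then shuts down all workers. SIGHUP is most commonly delivered when the terminal or SSH session that launched the job closes or times out, which would match a run dying after a few hours. If that is the case here, a minimal sketch of detaching the launcher from the terminal is below; the script and config names are assumptions based on the repo's README, so substitute your actual launch command.

# Keep the launcher alive after the terminal/SSH session closes,
# redirecting all output to a log file (hypothetical script/config names).
nohup accelerate launch train_stage_1.py --config configs/train/stage1.yaml \
    > train_stage1.log 2>&1 &

# Alternatively, launch inside a tmux (or screen) session so the job
# survives SSH disconnects and you can reattach to watch progress.
tmux new -s stage1
accelerate launch train_stage_1.py --config configs/train/stage1.yaml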
