Thank you for your work!
However, I have encountered an issue: during both stage 1 and stage 2 training, the run stops unexpectedly for unknown reasons, typically after several thousand steps.
Below is the log from the point where training aborted.
I would really appreciate any help.
Steps: 58%|█████▊ | 5831/10000 [3:27:42<2:04:26, 1.79s/it, lr=1e-5, step_loss=0.0255, td=0.06s]
Steps: 58%|█████▊ | 5832/10000 [3:27:44<2:04:41, 1.80s/it, lr=1e-5, step_loss=0.0255, td=0.06s]
Steps: 58%|█████▊ | 5832/10000 [3:27:44<2:04:41, 1.80s/it, lr=1e-5, step_loss=0.0366, td=0.05s]
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68694 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68695 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68696 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68697 closing signal SIGHUP
Traceback (most recent call last):
  File "/data/Moore-AnimateAnyone/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    time.sleep(monitor_interval)
  File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 68556 got signal: 1
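
For what it's worth, the final exception shows the elastic agent received signal 1 (SIGHUP), which is usually delivered when the controlling terminal (e.g. an SSH session) disconnects, rather than raised by the training code itself. Below is a minimal sketch of a detached launch that survives a dropped terminal; the entry point train_stage_1.py and config path configs/train/stage1.yaml are assumptions, so substitute your actual command:

    # nohup makes the launcher ignore SIGHUP when the terminal goes away;
    # stdout/stderr are redirected so the progress bar is still captured.
    # NOTE: script and config names are assumed, not verified against the repo.
    nohup accelerate launch train_stage_1.py --config configs/train/stage1.yaml \
        > stage1_train.log 2>&1 &
    disown  # drop the job from the shell's job table so no SIGHUP is forwarded

Running the same command inside tmux or screen achieves the same effect; either way the agent process (68556 above) no longer receives SIGHUP when the session closes.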