2024-01-18 11:37:34,621 ERROR tune_controller.py:1374 -- Trial task failed for trial TorchTrainer_10734_00000
Traceback (most recent call last):
File "d:\MLOps\mlenv\lib\site-packages\ray\air\execution\_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(PermissionError): ray::_Inner.train() (pid=16812, ip=127.0.0.1, actor_id=9fe2423e0cd57b52c144a08a01000000, repr=TorchTrainer)
File "python\ray\_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\tune\trainable\trainable.py", line 342, in train
raise skipped from exception_cause(skipped)
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\utils.py", line 43, in check_for_failure
ray.get(object_ref)
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(PermissionError): ray::_RayTrainWorker__execute.get_next() (pid=9492, ip=127.0.0.1, actor_id=bc7d4a1c3843a673dd0e184f01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x0000023D4BB79090>)
File "python\ray\_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "d:\MLOps\mlenv\lib\site-packages\ray\_private\function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\utils.py", line 118, in discard_return_wrapper
train_func(*args, **kwargs)
File "C:\Users\YASH\AppData\Local\Temp\ipykernel_10888\3384315726.py", line 44, in train_loop_per_worker
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\session.py", line 644, in wrapper
return fn(*args, **kwargs)
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\session.py", line 706, in report
_get_session().report(metrics, checkpoint=checkpoint)
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\session.py", line 417, in report
persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\storage.py", line 558, in persist_current_checkpoint
_pyarrow_fs_copy_files(
File "d:\MLOps\mlenv\lib\site-packages\ray\train\_internal\storage.py", line 110, in _pyarrow_fs_copy_files
return pyarrow.fs.copy_files(
File "d:\MLOps\mlenv\lib\site-packages\pyarrow\fs.py", line 244, in copy_files
_copy_files_selector(source_fs, source_sel,
File "pyarrow\_fs.pyx", line 1229, in pyarrow._fs._copy_files_selector
File "pyarrow\error.pxi", line 110, in pyarrow.lib.check_status
PermissionError: [WinError 32] Failed copying 'C:/Users/YASH/AppData/Local/Temp/tmpsm8_7sn1/model.pt' to 'D:/MLOps/Made-With-ML/notebooks/llm/TorchTrainer_10734_00000_0_2024-01-18_11-31-47/checkpoint_000000/model.pt'. Detail: [Windows error 32] The process cannot access the file because it is being used by another process.
2024-01-18 11:37:38,459 ERROR tune.py:1038 -- Trials did not complete: [TorchTrainer_10734_00000]
2024-01-18 11:37:38,466 INFO tune.py:1042 -- Total run time: 351.06 seconds (347.16 seconds for the tuning loop).
Hello,
I am getting a PermissionError while running trainer.fit(). I am running the code on a personal laptop (Windows); the full error log is shown above.
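The original "Code:" section was not captured here, so purely for context, below is a minimal sketch of the kind of TorchTrainer setup that exercises the code path in the traceback (a train_loop_per_worker reporting a checkpoint via train.report). The model, config values, and worker count are placeholders and are not the poster's actual code:

```python
import os
import tempfile

import torch
import torch.nn as nn
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    model = nn.Linear(8, 1)  # placeholder model; the real training code is omitted
    for epoch in range(config["num_epochs"]):
        # ... forward/backward/optimizer steps would go here ...
        metrics = {"epoch": epoch, "loss": 0.0}
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            # This call is what eventually reaches pyarrow.fs.copy_files in the traceback.
            train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```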
I think this is related to Windows file locking, which is more restrictive than on macOS/Linux. I'm wondering if it has to do with multiple Ray workers: if at least one of them is still accessing the checkpoint file, Windows may block it from being copied.
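If the problem really is another worker or a lingering handle keeping model.pt open, one workaround worth trying is to write the file via a path rather than an open file object and to attach the checkpoint from a single worker only, so nothing else touches it while Ray copies the directory. A sketch of that idea follows; it is not a confirmed fix, and the rank-0 check and file layout are assumptions:

```python
import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint


def report_checkpoint(model, metrics):
    """Report metrics from every worker, but attach the checkpoint only from rank 0."""
    if train.get_context().get_world_rank() != 0:
        train.report(metrics)
        return
    with tempfile.TemporaryDirectory() as tmpdir:
        # torch.save opens and closes the file itself when given a path, so no
        # handle from this worker stays open while pyarrow copies the directory.
        torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
        train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))
```

Calling train.report from every worker (with the checkpoint attached only on rank 0) also keeps the workers in step, since report acts as a synchronization point across the worker group.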