terminate called after throwing an instance of 'c10::Error' #44

Open

Fritzyuan opened this issue Nov 16, 2023 · 1 comment

@Fritzyuan
How can I resolve the error described in the title, which occurs during training?
Traceback (most recent call last):
File "/workspace/AnomalyGPT/code/train_mvtec.py", line 149, in
main(args)
File "/workspace/AnomalyGPT/code/train_mvtec.py", line 124, in main
agent.train_model(
File "/workspace/AnomalyGPT/code/model/agent.py", line 84, in train_model
self.ds_engine.step()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2041, in step
self.tput_timer.stop(global_step=self.is_gradient_accumulation_boundary(), report_speed=report_progress)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/timer.py", line 191, in stop
get_accelerator().synchronize()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/accelerator/cuda_accelerator.py", line 63, in synchronize
return torch.cuda.synchronize(device_index)
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/init.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[!] loss: 1.1797; token_acc: 68.0: 23%|███████████████████████████▎ | 20490/90725 [3:23:40<11:38:10, 1.68it/s]
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9246211457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f92461db3ec in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f927127dc64 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7f92712550dc in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f9271258054 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d7d63 (0x7f929c148d63 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f92461f19e0 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f92461f1af9 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: + 0x735788 (0x7f929c3a6788 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a5 (0x7f929c3a6a75 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: /opt/conda/bin/python() [0x4e9dc8]
frame #11: /opt/conda/bin/python() [0x4df132]
frame #12: _PyModule_ClearDict + 0x14d (0x55a86d in /opt/conda/bin/python)
frame #13: /opt/conda/bin/python() [0x5c49a3]
frame #14: Py_FinalizeEx + 0x143 (0x5c3433 in /opt/conda/bin/python)
frame #15: Py_RunMain + 0x109 (0x5b5229 in /opt/conda/bin/python)
frame #16: Py_BytesMain + 0x39 (0x585639 in /opt/conda/bin/python)
frame #17: __libc_start_main + 0xf3 (0x7f92bdf1a083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: /opt/conda/bin/python() [0x5854ee]
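
As the error message suggests, one way to narrow this down is to rerun with CUDA_LAUNCH_BLOCKING=1 so the failing kernel is reported synchronously at the real call site rather than at the later synchronize(). A minimal sketch of how that could be wired in (the variable name is standard PyTorch; placing it at the very top of train_mvtec.py is an assumption about this repo, since it must be set before the CUDA context is created):

# Debugging sketch (not part of the repo): force synchronous kernel launches so
# the CUDA error surfaces at the call that actually failed.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # import torch (and deepspeed) only after the variable is set

Alternatively, the variable can simply be exported in the shell before launching the training command, which avoids touching the code.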

@tenderzada

Hello, could you share the Vicuna weights with me?
