Add OF_CUDA_CHECK/OF_CUDNN_CHECK/OF_CUBLAS_CHECK/OF_CURAND_CHECK #3446

Merged
merged 12 commits into master from dev_of_cuda_check on Aug 8, 2020

liujuncheng
Collaborator

CudaCheck:

F0808 14:02:36.887184 29334 cuda_util.cpp:88] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
    @     0x7feeceb683ad  google::LogMessage::Fail()
    @     0x7feeceb6c56c  google::LogMessage::SendToLog()
    @     0x7feeceb67ed3  google::LogMessage::Flush()
    @     0x7feeceb6cfbe  google::LogMessageFatal::~LogMessageFatal()
    @     0x7feecd9fde50  oneflow::CudaCheck<>()
    @     0x7feecdf98436  oneflow::(anonymous namespace)::ConvGpuKernel<>::Compute()
    @     0x7feecdd3ba09  oneflow::UserKernel::ForwardDataContent()
    @     0x7feecdccb50c  oneflow::Kernel::Forward()
    @     0x7feecdcca9e7  oneflow::Kernel::Launch()
    @     0x7feecd99ab18  oneflow::Actor::AsyncLaunchKernel()
    @     0x7feecd9ae93c  oneflow::NormalForwardCompActor::Act()
    @     0x7feecd99c07e  oneflow::Actor::TryLogActEvent()
    @     0x7feecd99ec36  oneflow::Actor::ActUntilFail()
    @     0x7feecd99ed25  oneflow::Actor::HandlerNormal()
    @     0x7feecde9f051  oneflow::Thread::PollMsgChannel()
    @     0x7feecde9cee8  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7oneflow9GpuThreadC1EllEUlvE_vEEE6_M_runEv
    @     0x7feef2b84421  execute_native_thread_routine_compat
    @     0x7feefac36304  start_thread
    @     0x7feefa975d1d  __clone
    @              (nil)  (unknown)
Fatal Python error: Aborted

OF_CUDA_CHECK:

F0808 14:04:04.819490 32547 conv_cudnn_kernels.cpp:164] Check failed: cudaMalloc(&ptr, 1024LL * 1024LL * 1024LL * 32LL ) : out of memory (2) 
*** Check failure stack trace: ***
    @     0x7f66eb96887d  google::LogMessage::Fail()
    @     0x7f66eb96ca3c  google::LogMessage::SendToLog()
    @     0x7f66eb9683a3  google::LogMessage::Flush()
    @     0x7f66eb96d48e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f66ead98c22  oneflow::(anonymous namespace)::ConvGpuKernel<>::Compute()
    @     0x7f66eab3ba09  oneflow::UserKernel::ForwardDataContent()
    @     0x7f66eaacb50c  oneflow::Kernel::Forward()
    @     0x7f66eaaca9e7  oneflow::Kernel::Launch()
    @     0x7f66ea79ab18  oneflow::Actor::AsyncLaunchKernel()
    @     0x7f66ea7ae93c  oneflow::NormalForwardCompActor::Act()
    @     0x7f66ea79c07e  oneflow::Actor::TryLogActEvent()
    @     0x7f66ea79ec36  oneflow::Actor::ActUntilFail()
    @     0x7f66ea79ed25  oneflow::Actor::HandlerNormal()
    @     0x7f66eac9f051  oneflow::Thread::PollMsgChannel()
    @     0x7f66eac9cee8  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7oneflow9GpuThreadC1EllEUlvE_vEEE6_M_runEv
    @     0x7f670f984421  execute_native_thread_routine_compat
    @     0x7f6717a36304  start_thread
    @     0x7f6717775d1d  __clone
    @              (nil)  (unknown)
Fatal Python error: Aborted

CudaCheck:

F0808 13:49:52.191977 13059 cuda_util.cpp:93] Check failed: error : CUDNN_STATUS_BAD_PARAM 
*** Check failure stack trace: ***
    @     0x7ff06e3d3a8d  google::LogMessage::Fail()
    @     0x7ff06e3d7c4c  google::LogMessage::SendToLog()
    @     0x7ff06e3d35b3  google::LogMessage::Flush()
    @     0x7ff06e3d869e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7ff06d269c84  oneflow::CudaCheck<>()
    @     0x7ff06d803e98  oneflow::(anonymous namespace)::ConvGpuKernel<>::Compute()
    @     0x7ff06d5a7689  oneflow::UserKernel::ForwardDataContent()
    @     0x7ff06d53718c  oneflow::Kernel::Forward()
    @     0x7ff06d536667  oneflow::Kernel::Launch()
    @     0x7ff06d206998  oneflow::Actor::AsyncLaunchKernel()
    @     0x7ff06d21a7bc  oneflow::NormalForwardCompActor::Act()
    @     0x7ff06d207efe  oneflow::Actor::TryLogActEvent()
    @     0x7ff06d20aab6  oneflow::Actor::ActUntilFail()
    @     0x7ff06d20aba5  oneflow::Actor::HandlerNormal()
    @     0x7ff06d70acd1  oneflow::Thread::PollMsgChannel()
    @     0x7ff06d708b68  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7oneflow9GpuThreadC1EllEUlvE_vEEE6_M_runEv
    @     0x7ff0923ef421  execute_native_thread_routine_compat
    @     0x7ff09a4a1304  start_thread
    @     0x7ff09a1e0d1d  __clone
    @              (nil)  (unknown)
Fatal Python error: Aborted

OF_CUDNN_CHECK:

F0808 14:00:59.760102 26053 conv_cudnn_kernels.cpp:181] Check failed: cudnnConvolutionForward( ctx->device_ctx()->cudnn_handle(), CudnnSPOnePtr<T>(), args.xdesc.Get(), in->dptr(), nullptr, weight->dptr(), args.cdesc.Get(), algo_perf.algo, buf->mut_dptr(), args.params.max_ws_size, CudnnSPZeroPtr<T>(), args.ydesc.Get(), out->mut_dptr()) : CUDNN_STATUS_BAD_PARAM (3) 
*** Check failure stack trace: ***
    @     0x7f68efa5824d  google::LogMessage::Fail()
    @     0x7f68efa5c40c  google::LogMessage::SendToLog()
    @     0x7f68efa57d73  google::LogMessage::Flush()
    @     0x7f68efa5ce5e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f68eee88727  oneflow::(anonymous namespace)::ConvGpuKernel<>::Compute()
    @     0x7f68eec2ba09  oneflow::UserKernel::ForwardDataContent()
    @     0x7f68eebbb50c  oneflow::Kernel::Forward()
    @     0x7f68eebba9e7  oneflow::Kernel::Launch()
    @     0x7f68ee88ab18  oneflow::Actor::AsyncLaunchKernel()
    @     0x7f68ee89e93c  oneflow::NormalForwardCompActor::Act()
    @     0x7f68ee88c07e  oneflow::Actor::TryLogActEvent()
    @     0x7f68ee88ec36  oneflow::Actor::ActUntilFail()
    @     0x7f68ee88ed25  oneflow::Actor::HandlerNormal()
    @     0x7f68eed8f051  oneflow::Thread::PollMsgChannel()
    @     0x7f68eed8cee8  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7oneflow9GpuThreadC1EllEUlvE_vEEE6_M_runEv
    @     0x7f6913a74421  execute_native_thread_routine_compat
    @     0x7f691bb26304  start_thread
    @     0x7f691b865d1d  __clone
    @              (nil)  (unknown)
Fatal Python error: Aborted
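
In the logs above, the `CudaCheck` failures are attributed to `cuda_util.cpp` and print only the status, while the `OF_CUDA_CHECK` / `OF_CUDNN_CHECK` failures are attributed to `conv_cudnn_kernels.cpp` and print the failing call itself. A minimal sketch of why a macro can report this and a helper function cannot is given below. It is not OneFlow's implementation: `FakeStatus`, `FakeAlloc`, `DescribeInHelper`, `DescribeAtCallSite`, and `DESCRIBE_CHECK` are hypothetical names, and a mock status type stands in for `cudaError_t` so the example builds without the CUDA toolkit.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for cudaError_t (assumption, not a CUDA type).
enum class FakeStatus { kSuccess = 0, kOutOfMemory = 2 };

const char* FakeGetErrorString(FakeStatus status) {
  switch (status) {
    case FakeStatus::kSuccess: return "no error";
    case FakeStatus::kOutOfMemory: return "out of memory";
  }
  return "unknown error";
}

// Hypothetical call that always fails, playing the role of cudaMalloc.
FakeStatus FakeAlloc(long long /*bytes*/) { return FakeStatus::kOutOfMemory; }

// Helper-function style: the helper receives only the status value, so its
// message cannot mention the expression or the location that produced it.
std::string DescribeInHelper(FakeStatus status) {
  if (status == FakeStatus::kSuccess) { return ""; }
  std::ostringstream os;
  os << "Check failed: error == kSuccess (" << static_cast<int>(status)
     << " vs. 0) " << FakeGetErrorString(status);
  return os.str();
}

// Formats a message from information captured at the call site.
std::string DescribeAtCallSite(FakeStatus status, const char* expr,
                               const char* file, int line) {
  if (status == FakeStatus::kSuccess) { return ""; }
  std::ostringstream os;
  os << file << ":" << line << "] Check failed: " << expr << " : "
     << FakeGetErrorString(status) << " (" << static_cast<int>(status) << ")";
  return os.str();
}

// Macro style: #expr, __FILE__ and __LINE__ are expanded where the macro is
// used, so the message names the caller's file, line, and expression.
#define DESCRIBE_CHECK(expr) \
  DescribeAtCallSite((expr), #expr, __FILE__, __LINE__)
```

With these definitions, `DescribeInHelper(FakeAlloc(1024LL * 32LL))` yields only the status comparison, whereas `DESCRIBE_CHECK(FakeAlloc(1024LL * 32LL))` yields a message that includes the text `FakeAlloc(1024LL * 32LL)` together with the file and line of the call.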

const char* CurandGetErrorString(curandStatus_t error);

#define OF_CUDA_CHECK(condition) \
for (cudaError_t _of_cuda_check_status = (condition); _of_cuda_check_status != cudaSuccess;) \
Contributor

The trick of using this `for` as a temporary scope to define a temporary variable is really interesting.
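
The trick discussed in this comment can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual macro: `DemoStatus`, `DemoCall`, `DEMO_CHECK`, and `RunChecks` are hypothetical names, and the loop body throws an exception where the original presumably ends in a fatal log. The point is that the `for` initializer declares a variable that lives only for that one statement, evaluates the checked expression exactly once, and still behaves as a single statement inside an unbraced `if`/`else`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t (assumption, not a CUDA type).
enum DemoStatus { kDemoSuccess = 0, kDemoOutOfMemory = 2 };

// The status variable is scoped to the for statement, and (condition) is
// evaluated once in the initializer. There is no increment clause, so the
// body must leave the loop: it throws here, and would abort in a fatal log.
#define DEMO_CHECK(condition)                                               \
  for (DemoStatus _demo_check_status = (condition);                         \
       _demo_check_status != kDemoSuccess;)                                 \
  throw std::runtime_error(                                                 \
      std::string("Check failed: " #condition " (") +                       \
      std::to_string(static_cast<int>(_demo_check_status)) + ")")

static int g_call_count = 0;

// Hypothetical checked call; counts invocations to show single evaluation.
DemoStatus DemoCall(bool ok) {
  ++g_call_count;
  return ok ? kDemoSuccess : kDemoOutOfMemory;
}

// Returns the failure message, or an empty string when every check passes.
std::string RunChecks(bool second_ok) {
  try {
    // The same variable name is reused by each expansion without clashing.
    DEMO_CHECK(DemoCall(true));
    // Each expansion is one statement, so unbraced if/else works as expected.
    if (second_ok) DEMO_CHECK(DemoCall(true)); else DEMO_CHECK(DemoCall(false));
    return "";
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}
```

Compared with wrapping the check in `do { ... } while (0)`, the `for` form leaves the statement open at the end, which is what would allow a logging stream to be appended after the macro in a real implementation.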

lixinqi marked this pull request as ready for review August 8, 2020 07:02
liujuncheng merged commit 142b862 into master Aug 8, 2020
liujuncheng deleted the dev_of_cuda_check branch August 8, 2020 12:29
jackalcooper added this to the 0.1.9 milestone Aug 13, 2020