-
Notifications
You must be signed in to change notification settings - Fork 622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Operator origin information to most errors #2065
Conversation
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
dali/pipeline/executor/executor.h
Outdated
HandleError(make_string("Error when executing GPU Operator ", op_node.op->name(), | ||
", instance name: \"", op_node.instance_name, "\" encountered:\n", | ||
e.what())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be generic. As only CPU/Mixed/GPU changes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
dali/pipeline/graph/graph_descr.cc
Outdated
", instance name: \"", op_nodes_[op_id].instance_name, "\" encountered:\n", e.what(), | ||
"\nCurrent pipeline object is no longer valid.")); | ||
} catch (...) { | ||
throw std::runtime_error("Unknown Critical error in pipeline"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw std::runtime_error("Unknown Critical error in pipeline"); | |
throw std::runtime_error("Unknown Critical error building pipeline"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build |
CI MESSAGE: [1430786]: BUILD STARTED |
dali/pipeline/executor/executor.h
Outdated
void HandleError(const char *message = "Unknown exception") { | ||
void HandleError(const std::string &stage, const std::string &op_name, | ||
const std::string &instance_name, const std::string &message) { | ||
HandleError(make_string("Error when executing ", stage, " Operator ", op_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HandleError(make_string("Error when executing ", stage, " Operator ", op_name, | |
HandleError(make_string("Error when executing ", stage, " operator ", op_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Trim leading underscores from op_name, if there are any (ExternalSource and TFRecordReader use them).
- Some operators already add their name to error messages - sometimes going to great lengths to get it correct (Reshape, Reinterpret, Warp operator family). Sometimes there's kernel name added, which may coincide with operator name - in either of these cases (especially kernels) it would be nice to also trim
"<op_name>: "
if (op_name[0] == '_') op_name = op_name.substr(1); // trim leading underscore
if (message.rfind(op_name + ": ", 0) == 0) // trim "<op_name>: "
message = message.substr(op_name.length() + 2);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Done
- Not done -> I tried it, but it isn't that straightforward, as per example from PR description. The name is in the middle of error message (and it's not that bad).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And done Operator -> operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that you need to account for operator name already being there in the message and for operator names starting with underscore.
Also, I'd consider adding instance name only if there are multiple instances of given operator in the pipeline - otherwise it's just clutter.
CI MESSAGE: [1430786]: BUILD PASSED |
if (ws.has_stream() && ws.has_event()) { | ||
CUDA_CALL(cudaEventRecord(ws.event(), ws.stream())); | ||
} | ||
CUDA_CALL(cudaGetLastError()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just curious, wouldn't CUDA_CALL(cudaEventRecord(ws.event(), ws.stream()));
trigger the error as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Note that this function may also return error codes from previous, asynchronous launches.
- Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
It could
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would, but it's in if
statement, I guess workspace not having a stream would be odd, but I want to have not-conditional check here.
Done, I check if there are more than two instances of the same op and only in such case add the instance name. |
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build |
CI MESSAGE: [1460288]: BUILD STARTED |
CI MESSAGE: [1460288]: BUILD FAILED |
CI MESSAGE: [1463848]: BUILD STARTED |
CI MESSAGE: [1463848]: BUILD FAILED |
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build |
CI MESSAGE: [1464067]: BUILD STARTED |
@@ -100,7 +100,7 @@ class DLL_PUBLIC AsyncPipelinedExecutor : public PipelinedExecutor { | |||
SignalStop(); | |||
mixed_work_cv_.notify_all(); | |||
gpu_work_cv_.notify_all(); | |||
throw std::runtime_error("Unknown critical error in pipeline"); | |||
throw std::runtime_error("Unknown critical error in Pipeline."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? It's not an error in the Pipeline class (at least we hope so), but in some user-defined pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done everywhere.
@@ -89,7 +89,7 @@ class DLL_PUBLIC AsyncSeparatedPipelinedExecutor : public SeparatedPipelinedExec | |||
} catch (...) { | |||
exec_error_ = true; | |||
SignalStop(); | |||
throw std::runtime_error("Unknown critical error in pipeline"); | |||
throw std::runtime_error("Unknown critical error in Pipeline."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dali/pipeline/pipeline.cc
Outdated
} catch (...) { | ||
throw std::runtime_error("Unknown Critical error in pipeline"); | ||
throw std::runtime_error("Unknown critical error when building Pipeline."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dali/pipeline/pipeline.cc
Outdated
@@ -536,11 +553,11 @@ void Pipeline::Outputs(DeviceWorkspace *ws) { | |||
try { | |||
executor_->Outputs(ws); | |||
} catch (std::exception &e) { | |||
throw std::runtime_error("Critical error in pipeline: " | |||
throw std::runtime_error("Critical error in Pipeline:\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
CI MESSAGE: [1464067]: BUILD PASSED |
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build |
CI MESSAGE: [1464217]: BUILD STARTED |
CI MESSAGE: [1464217]: BUILD FAILED |
CI MESSAGE: [1464217]: BUILD PASSED |
Signed-off-by: Krzysztof Lecki klecki@nvidia.com
Why we need this PR?
For better error messages.
What happened in this PR?
try/catch around every Operator invocation, so we can pinpoint all the errors to the particular op (coming from the DALI_ENFORCE).
Add similar try/catch around instantiation and graph building for all Ops.
Add CUDA error checking in Mixed and GPU.
Executor, OpGraph
Any downsides?
👀, force errors in tests and check
Examples of error messages:
New:
Operators with their name in the error - it's still there as we would need to look after the "Assert on .... failed: "
Now only with multiple instances of the same operator errors include instance name:
Previously (only for the first two examples):
This partially overlaps with docs.
JIRA TASK: [Use DALI-1490 or NA]