
Add Operator origin information to most errors #2065

Merged: 7 commits merged into NVIDIA:master on Jul 13, 2020

Conversation

klecki
Contributor

@klecki klecki commented Jun 26, 2020

Signed-off-by: Krzysztof Lecki klecki@nvidia.com

Why we need this PR?

For better error messages.

What happened in this PR?

  • What solution was applied:
    A try/catch block around every Operator invocation, so every error coming from DALI_ENFORCE can be pinpointed to the particular op.
    A similar try/catch around operator instantiation and graph building for all Ops.
    CUDA error checking added in the Mixed and GPU stages.
    (A minimal sketch of the wrapping is shown below this list.)
  • Affected modules and functionalities:
    Executor, OpGraph
  • Key points relevant for the review:
    Any downsides?
  • Validation and testing:
    👀, forcing errors in tests and checking the resulting messages
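
A minimal sketch of the wrapping idea (the helper name and signature below are illustrative, not the actual DALI code):

#include <stdexcept>
#include <string>

// Illustrative sketch: wrap a single operator invocation so any exception
// (e.g. one thrown by DALI_ENFORCE) is rethrown with the operator's identity
// prepended to the message.
template <typename Op, typename Workspace>
void RunOpWithErrorContext(Op &op, Workspace &ws, const std::string &op_name) {
  try {
    op.Run(ws);
  } catch (const std::exception &e) {
    throw std::runtime_error("Error when executing operator " + op_name +
                             " encountered:\n" + e.what());
  } catch (...) {
    throw std::runtime_error("Unknown critical error in operator " + op_name);
  }
}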

Examples of error messages:
New:

RuntimeError: Critical error in Pipeline:
Error when executing GPU operator SequenceRearrange encountered:
[/home/klecki/work/salvador/dali/dali/pipeline/data/views.h:53] Assert on "shape.sample_dim() == ndim" failed: Input with dimension (2) cannot be converted to dimension (1).
Stacktrace (11 entries):
....
RuntimeError: Critical error when building Pipeline:                                                                                                                   
Error when constructing operator: SequenceRearrange encountered:                                                                                                       
[/home/klecki/work/salvador/dali/dali/operators/sequence/sequence_rearrange.h:56] Assert on "!new_order_.empty()" failed: Empty result sequences are not allowed.      
Stacktrace (100 entries):                                                                                                                                                                   
...

Operators that already put their name in the error message still have it duplicated - to strip it we would need to look for the name after the "Assert on .... failed: " part.

RuntimeError: Critical error in Pipeline:
Error when executing GPU operator Reshape encountered:
[/home/klecki/work/salvador/dali/dali/operators/generic/reshape.cc:302] Assert on "actual_volume * input_element_size == requested_volume * output_element_size" failed: Reshape: Input and output samples must occupy the same size in bytes.
Sample index:     0
Actual volume:    151200
     in bytes:    151200
Requested volume: 158760
     in bytes:    158760
Input shape:    200 x 252 x 3
Requested shape:        420 x 126 x 3

Now the instance name is included in the error only when there are multiple instances of the same operator:

Error when executing GPU operator SequenceRearrange, instance name: "aa", encountered:
[/home/klecki/work/salvador/dali/dali/pipeline/data/views.h:53] Assert on "shape.sample_dim() == ndim" failed: Input with dimension (2) cannot be converted to dimension (1).

Previously (only for the first two examples):

RuntimeError: Critical error in pipeline: [/home/klecki/work/salvador/dali/dali/pipeline/data/views.h:53] Assert on "shape.sample_dim() == ndim" failed: Input with dimension (2) cannot be converted to dimension (1).
Stacktrace (11 entries):
...
RuntimeError: [/home/klecki/work/salvador/dali/dali/operators/sequence/sequence_rearrange.h:56] Assert on "!new_order_.empty()" failed: Empty result sequences are not allowed.
Stacktrace (100 entries):                
...
  • Documentation (including examples):
    This partially overlaps with docs.

JIRA TASK: [Use DALI-1490 or NA]

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Comment on lines 559 to 561
HandleError(make_string("Error when executing GPU Operator ", op_node.op->name(),
", instance name: \"", op_node.instance_name, "\" encountered:\n",
e.what()));
Contributor

I think this can be made generic, as only the CPU/Mixed/GPU part changes...

Contributor Author

Done

", instance name: \"", op_nodes_[op_id].instance_name, "\" encountered:\n", e.what(),
"\nCurrent pipeline object is no longer valid."));
} catch (...) {
throw std::runtime_error("Unknown Critical error in pipeline");
Contributor

Suggested change
throw std::runtime_error("Unknown Critical error in pipeline");
throw std::runtime_error("Unknown Critical error building pipeline");

Contributor Author

Done

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jun 29, 2020

!build

@klecki klecki marked this pull request as ready for review June 29, 2020 11:54
@klecki klecki changed the title [WIP] add operator information to most common errors Add Operator origin information to most errors Jun 29, 2020
@dali-automaton
Collaborator

CI MESSAGE: [1430786]: BUILD STARTED

void HandleError(const char *message = "Unknown exception") {
void HandleError(const std::string &stage, const std::string &op_name,
const std::string &instance_name, const std::string &message) {
HandleError(make_string("Error when executing ", stage, " Operator ", op_name,
Contributor

Suggested change
HandleError(make_string("Error when executing ", stage, " Operator ", op_name,
HandleError(make_string("Error when executing ", stage, " operator ", op_name,

Contributor

  1. Trim leading underscores from op_name, if there are any (ExternalSource and TFRecordReader use them).
  2. Some operators already add their name to error messages - sometimes going to great lengths to get it correct (Reshape, Reinterpret, Warp operator family). Sometimes there's kernel name added, which may coincide with operator name - in either of these cases (especially kernels) it would be nice to also trim "<op_name>: "
if (op_name[0] == '_') op_name = op_name.substr(1);  // trim leading underscore
if (message.rfind(op_name + ": ", 0) == 0)  // trim "<op_name>: "
  message = message.substr(op_name.length() + 2);

Contributor Author

@klecki klecki Jul 10, 2020

  1. Done
  2. Not done -> I tried it, but it isn't that straightforward, as shown in the example from the PR description. The name is in the middle of the error message (and it's not that bad).

Contributor Author

And done Operator -> operator.

Contributor

@mzient mzient left a comment

I think that you need to account for the operator name already being present in the message and for operator names starting with an underscore.
Also, I'd consider adding the instance name only if there are multiple instances of a given operator in the pipeline - otherwise it's just clutter.

@dali-automaton
Collaborator

CI MESSAGE: [1430786]: BUILD PASSED

if (ws.has_stream() && ws.has_event()) {
CUDA_CALL(cudaEventRecord(ws.event(), ws.stream()));
}
CUDA_CALL(cudaGetLastError());
Contributor

I am just curious, wouldn't CUDA_CALL(cudaEventRecord(ws.event(), ws.stream())); trigger the error as well?

Contributor

- Note that this function may also return error codes from previous, asynchronous launches.
- Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.

It could

Contributor Author

It would, but it's inside an if statement. I guess a workspace not having a stream would be odd, but I want an unconditional check here.
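
For context, a minimal sketch of the check-and-throw pattern such a CUDA_CALL-style wrapper boils down to (the actual DALI macro may differ); turning the error code into an exception is what lets the surrounding try/catch attribute asynchronous launch failures to the operator:

#include <cuda_runtime_api.h>
#include <stdexcept>
#include <string>

// Simplified sketch: convert a CUDA error code into an exception so the
// executor's try/catch can prepend the operator name to the message.
inline void CheckCuda(cudaError_t status, const char *expr) {
  if (status != cudaSuccess) {
    throw std::runtime_error(std::string("CUDA error in \"") + expr + "\": " +
                             cudaGetErrorString(status));
  }
}

#define CUDA_CHECK(expr) CheckCuda((expr), #expr)

// Usage mirroring the snippet above: record the event only when one is
// present, but always pick up errors from earlier asynchronous launches.
//   if (ws.has_stream() && ws.has_event())
//     CUDA_CHECK(cudaEventRecord(ws.event(), ws.stream()));
//   CUDA_CHECK(cudaGetLastError());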

@klecki
Contributor Author

klecki commented Jul 10, 2020

Also, I'd consider adding the instance name only if there are multiple instances of a given operator in the pipeline - otherwise it's just clutter.

Done, I check if there is more than one instance of the same op and only in that case add the instance name.
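
A minimal sketch of such a check (a hypothetical helper, not necessarily how the PR implements it), counting instances per operator name so the instance name is only appended when it actually disambiguates:

#include <string>
#include <unordered_map>

// Hypothetical helper: count how many nodes in the graph share an operator
// name; the instance name is added to the error only when the count exceeds 1.
class OpInstanceCounter {
 public:
  void Register(const std::string &op_name) { ++counts_[op_name]; }

  bool HasMultipleInstances(const std::string &op_name) const {
    auto it = counts_.find(op_name);
    return it != counts_.end() && it->second > 1;
  }

 private:
  std::unordered_map<std::string, int> counts_;
};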

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jul 10, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1460288]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1460288]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [1463848]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1463848]: BUILD FAILED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jul 13, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1464067]: BUILD STARTED

@@ -100,7 +100,7 @@ class DLL_PUBLIC AsyncPipelinedExecutor : public PipelinedExecutor {
  SignalStop();
  mixed_work_cv_.notify_all();
  gpu_work_cv_.notify_all();
- throw std::runtime_error("Unknown critical error in pipeline");
+ throw std::runtime_error("Unknown critical error in Pipeline.");
Contributor

Why? It's not an error in the Pipeline class (at least we hope so), but in some user-defined pipeline.

Contributor Author

Done everywhere.

@@ -89,7 +89,7 @@ class DLL_PUBLIC AsyncSeparatedPipelinedExecutor : public SeparatedPipelinedExec
  } catch (...) {
    exec_error_ = true;
    SignalStop();
-   throw std::runtime_error("Unknown critical error in pipeline");
+   throw std::runtime_error("Unknown critical error in Pipeline.");
Contributor

likewise

Contributor Author

done

  } catch (...) {
-   throw std::runtime_error("Unknown Critical error in pipeline");
+   throw std::runtime_error("Unknown critical error when building Pipeline.");
Contributor

again...

Contributor Author

done

@@ -536,11 +553,11 @@ void Pipeline::Outputs(DeviceWorkspace *ws) {
  try {
    executor_->Outputs(ws);
  } catch (std::exception &e) {
-   throw std::runtime_error("Critical error in pipeline: "
+   throw std::runtime_error("Critical error in Pipeline:\n"
Contributor

and again...

Contributor Author

done

@dali-automaton
Collaborator

CI MESSAGE: [1464067]: BUILD PASSED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jul 13, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1464217]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1464217]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [1464217]: BUILD PASSED

@klecki klecki merged commit e47eff6 into NVIDIA:master Jul 13, 2020
@klecki klecki deleted the error-msg branch July 13, 2020 15:39