
Add Operator origin information to most errors #2065

Merged: 7 commits merged into NVIDIA:master on Jul 13, 2020

Conversation

klecki
Contributor

@klecki klecki commented Jun 26, 2020

Signed-off-by: Krzysztof Lecki klecki@nvidia.com

Why we need this PR?

For better error messages.

What happened in this PR?

  • What solution was applied:
    A try/catch block around every Operator invocation, so every error coming from DALI_ENFORCE can be pinpointed to the particular op.
    A similar try/catch around operator instantiation and graph building for all Ops.
    CUDA error checking added in the Mixed and GPU stages.
    (A minimal sketch of the wrapping is shown below this list.)
  • Affected modules and functionalities:
    Executor, OpGraph
  • Key points relevant for the review:
    Any downsides?
  • Validation and testing:
    👀, forcing errors in tests and checking the resulting messages
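
A minimal sketch of the wrapping idea (the helper name and signature below are illustrative, not the actual DALI code):

#include <stdexcept>
#include <string>

// Illustrative sketch: wrap a single operator invocation so any exception
// (e.g. one thrown by DALI_ENFORCE) is rethrown with the operator's identity
// prepended to the message.
template <typename Op, typename Workspace>
void RunOpWithErrorContext(Op &op, Workspace &ws, const std::string &op_name) {
  try {
    op.Run(ws);
  } catch (const std::exception &e) {
    throw std::runtime_error("Error when executing operator " + op_name +
                             " encountered:\n" + e.what());
  } catch (...) {
    throw std::runtime_error("Unknown critical error in operator " + op_name);
  }
}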

Examples of error messages:
New:

RuntimeError: Critical error in Pipeline:
Error when executing GPU operator SequenceRearrange encountered:
[/home/klecki/work/salvador/dali/dali/pipeline/data/views.h:53] Assert on "shape.sample_dim() == ndim" failed: Input with dimension (2) cannot be converted to dimension (1).
Stacktrace (11 entries):
....
RuntimeError: Critical error when building Pipeline:                                                                                                                   
Error when constructing operator: SequenceRearrange encountered:                                                                                                       
[/home/klecki/work/salvador/dali/dali/operators/sequence/sequence_rearrange.h:56] Assert on "!new_order_.empty()" failed: Empty result sequences are not allowed.      
Stacktrace (100 entries):                                                                                                                                                                   
...

Operators that already put their name in the error message still have it duplicated - to strip it we would need to look for the name after the "Assert on .... failed: " part.

RuntimeError: Critical error in Pipeline:
Error when executing GPU operator Reshape encountered:
[/home/klecki/work/salvador/dali/dali/operators/generic/reshape.cc:302] Assert on "actual_volume * input_element_size == requested_volume * output_element_size" failed: Reshape: Input and output samples must occupy the same size in bytes.
Sample index:     0
Actual volume:    151200
     in bytes:    151200
Requested volume: 158760
     in bytes:    158760
Input shape:    200 x 252 x 3
Requested shape:        420 x 126 x 3

Now the instance name is included in the error only when there are multiple instances of the same operator:

Error when executing GPU operator SequenceRearrange, instance name: "aa", encountered:
[/home/klecki/work/salvador/dali/dali/pipeline/data/views.h:53] Assert on "shape.sample_dim() == ndim" failed: Input with dimension (2) cannot be converted to dimension (1).

Previously (only for the first two examples):

RuntimeError: Critical error in pipeline: [/home/klecki/work/salvador/dali/dali/pipeline/data/views.h:53] Assert on "shape.sample_dim() == ndim" failed: Input with dimension (2) cannot be converted to dimension (1).
Stacktrace (11 entries):
...
RuntimeError: [/home/klecki/work/salvador/dali/dali/operators/sequence/sequence_rearrange.h:56] Assert on "!new_order_.empty()" failed: Empty result sequences are not allowed.
Stacktrace (100 entries):                
...
  • Documentation (including examples):
    This partially overlaps with docs.

JIRA TASK: [Use DALI-1490 or NA]

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Comment on lines 559 to 561
HandleError(make_string("Error when executing GPU Operator ", op_node.op->name(),
", instance name: \"", op_node.instance_name, "\" encountered:\n",
e.what()));
Contributor

I think this can be made generic, as only the CPU/Mixed/GPU part changes...

Contributor Author

Done

", instance name: \"", op_nodes_[op_id].instance_name, "\" encountered:\n", e.what(),
"\nCurrent pipeline object is no longer valid."));
} catch (...) {
throw std::runtime_error("Unknown Critical error in pipeline");
Contributor

Suggested change
throw std::runtime_error("Unknown Critical error in pipeline");
throw std::runtime_error("Unknown Critical error building pipeline");

Contributor Author

Done

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jun 29, 2020

!build

@klecki klecki marked this pull request as ready for review June 29, 2020 11:54
@klecki klecki changed the title [WIP] add operator information to most common errors Add Operator origin information to most errors Jun 29, 2020
@dali-automaton
Collaborator

CI MESSAGE: [1430786]: BUILD STARTED

void HandleError(const char *message = "Unknown exception") {
void HandleError(const std::string &stage, const std::string &op_name,
const std::string &instance_name, const std::string &message) {
HandleError(make_string("Error when executing ", stage, " Operator ", op_name,
Contributor

Suggested change
HandleError(make_string("Error when executing ", stage, " Operator ", op_name,
HandleError(make_string("Error when executing ", stage, " operator ", op_name,

Contributor

  1. Trim leading underscores from op_name, if there are any (ExternalSource and TFRecordReader use them).
  2. Some operators already add their name to error messages - sometimes going to great lengths to get it correct (Reshape, Reinterpret, Warp operator family). Sometimes there's kernel name added, which may coincide with operator name - in either of these cases (especially kernels) it would be nice to also trim "<op_name>: "
if (op_name[0] == '_') op_name = op_name.substr(1);  // trim leading underscore
if (message.rfind(op_name + ": ", 0) == 0)  // trim "<op_name>: "
  message = message.substr(op_name.length() + 2);

Contributor Author

@klecki klecki Jul 10, 2020

  1. Done
  2. Not done -> I tried it, but it isn't that straightforward, as shown in the example from the PR description. The name is in the middle of the error message (and it's not that bad).

Contributor Author

And done Operator -> operator.

Contributor

@mzient mzient left a comment

I think that you need to account for the operator name already being present in the message and for operator names starting with an underscore.
Also, I'd consider adding the instance name only if there are multiple instances of a given operator in the pipeline - otherwise it's just clutter.

@dali-automaton
Collaborator

CI MESSAGE: [1430786]: BUILD PASSED

if (ws.has_stream() && ws.has_event()) {
CUDA_CALL(cudaEventRecord(ws.event(), ws.stream()));
}
CUDA_CALL(cudaGetLastError());
Contributor

I am just curious, wouldn't CUDA_CALL(cudaEventRecord(ws.event(), ws.stream())); trigger the error as well?

Contributor

- Note that this function may also return error codes from previous, asynchronous launches.
- Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.

It could

Contributor Author

It would, but it's inside an if statement. I guess a workspace not having a stream would be odd, but I want an unconditional check here.
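
For context, a minimal sketch of the check-and-throw pattern such a CUDA_CALL-style wrapper boils down to (the actual DALI macro may differ); turning the error code into an exception is what lets the surrounding try/catch attribute asynchronous launch failures to the operator:

#include <cuda_runtime_api.h>
#include <stdexcept>
#include <string>

// Simplified sketch: convert a CUDA error code into an exception so the
// executor's try/catch can prepend the operator name to the message.
inline void CheckCuda(cudaError_t status, const char *expr) {
  if (status != cudaSuccess) {
    throw std::runtime_error(std::string("CUDA error in \"") + expr + "\": " +
                             cudaGetErrorString(status));
  }
}

#define CUDA_CHECK(expr) CheckCuda((expr), #expr)

// Usage mirroring the snippet above: record the event only when one is
// present, but always pick up errors from earlier asynchronous launches.
//   if (ws.has_stream() && ws.has_event())
//     CUDA_CHECK(cudaEventRecord(ws.event(), ws.stream()));
//   CUDA_CHECK(cudaGetLastError());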

@klecki
Contributor Author

klecki commented Jul 10, 2020

Also, I'd consider adding the instance name only if there are multiple instances of a given operator in the pipeline - otherwise it's just clutter.

Done, I check if there is more than one instance of the same op and only in that case add the instance name.
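
A minimal sketch of such a check (a hypothetical helper, not necessarily how the PR implements it), counting instances per operator name so the instance name is only appended when it actually disambiguates:

#include <string>
#include <unordered_map>

// Hypothetical helper: count how many nodes in the graph share an operator
// name; the instance name is added to the error only when the count exceeds 1.
class OpInstanceCounter {
 public:
  void Register(const std::string &op_name) { ++counts_[op_name]; }

  bool HasMultipleInstances(const std::string &op_name) const {
    auto it = counts_.find(op_name);
    return it != counts_.end() && it->second > 1;
  }

 private:
  std::unordered_map<std::string, int> counts_;
};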

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jul 10, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1460288]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1460288]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [1463848]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1463848]: BUILD FAILED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jul 13, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1464067]: BUILD STARTED

@@ -100,7 +100,7 @@ class DLL_PUBLIC AsyncPipelinedExecutor : public PipelinedExecutor {
  SignalStop();
  mixed_work_cv_.notify_all();
  gpu_work_cv_.notify_all();
- throw std::runtime_error("Unknown critical error in pipeline");
+ throw std::runtime_error("Unknown critical error in Pipeline.");
Contributor

Why? It's not an error in the Pipeline class (at least we hope so), but in some user-defined pipeline.

Contributor Author

Done everywhere.

@@ -89,7 +89,7 @@ class DLL_PUBLIC AsyncSeparatedPipelinedExecutor : public SeparatedPipelinedExec
  } catch (...) {
    exec_error_ = true;
    SignalStop();
-   throw std::runtime_error("Unknown critical error in pipeline");
+   throw std::runtime_error("Unknown critical error in Pipeline.");
Contributor

likewise

Contributor Author

done

  } catch (...) {
-   throw std::runtime_error("Unknown Critical error in pipeline");
+   throw std::runtime_error("Unknown critical error when building Pipeline.");
Contributor

again...

Contributor Author

done

@@ -536,11 +553,11 @@ void Pipeline::Outputs(DeviceWorkspace *ws) {
  try {
    executor_->Outputs(ws);
  } catch (std::exception &e) {
-   throw std::runtime_error("Critical error in pipeline: "
+   throw std::runtime_error("Critical error in Pipeline:\n"
Contributor

and again...

Contributor Author

done

@dali-automaton
Collaborator

CI MESSAGE: [1464067]: BUILD PASSED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Jul 13, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1464217]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1464217]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [1464217]: BUILD PASSED

@klecki klecki merged commit e47eff6 into NVIDIA:master Jul 13, 2020
@klecki klecki deleted the error-msg branch July 13, 2020 15:39