Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend context and name propagation in errors #5396

Merged
merged 7 commits into from
Apr 11, 2024

Conversation

klecki
Copy link
Contributor

@klecki klecki commented Mar 26, 2024

Category: Refactoring

Description:

Add error context related to particular operator during graph construction and pipeline build:

  • error propagation in Pipeline::Build() is adjusted
  • Python calls to backend: Pipeline::AddOperator now augment errors with context.
  • OpSpec building errors propagate the proper operator name.

Replace most of the SchemaName() occurrences that were used to indicate the name operator
with the fully formatted operator name in the correct API.

Make sure that the naming information is added as soon as possible, so it can be accessed in partially
constructed schema with the correct values present.

Note: This PR mostly adds the context like:

Error in <device> operator `nvidia.dali.fn.operator_name`,
which was used in the pipeline definition with the following traceback:
<traceback>
encountered:
<Original error message>

to places where it was not previously used, but we are processing a single operator.

Error messages are adjusted to show the user-facing input/output/argument name (in uniform way) rather than the internal one.
Otherwise the checks and messages are preserved.
The types of error messages are not adjusted.

Additional information:

Affected modules and functionalities:

Pipeline and spec building

Key points relevant for the review:

Typos, broken formatting or conditions.

Tests:

  • Existing tests apply
    Some of those error conditions are behind double or triple layer of error checks and are not observable by the Python user.
    Will use CI to determine how much adjusting the wording affects the tests.
  • New tests added
    The context added in AddOperator is verified to work, not all conditions are easily verified from Python tests. I won't build loads of broken GTest pipelines just to match the exact error messages.
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@klecki klecki force-pushed the less-schema-name-more-display-name branch 2 times, most recently from 786f7e4 to 2584a08 Compare March 26, 2024 18:18
@klecki
Copy link
Contributor Author

klecki commented Mar 26, 2024

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [13808683]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [13808683]: BUILD FAILED

@klecki
Copy link
Contributor Author

klecki commented Mar 26, 2024

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [13812812]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [13812812]: BUILD PASSED

Comment on lines 59 to 69
"Operator `" + GetOpDisplayName(spec, true) + "` does not support in-place execution.");
DALI_ENFORCE(spec.NumRegularInput() <= schema.MaxNumInput(),
"Operator '" + spec.SchemaName() +
"' supports a maximum of " + std::to_string(schema.MaxNumInput()) + " inputs, "
"Operator `" + GetOpDisplayName(spec, true) +
"` supports a maximum of " + std::to_string(schema.MaxNumInput()) + " inputs, "
"but was passed " + std::to_string(spec.NumRegularInput()) + ".");
DALI_ENFORCE(spec.NumRegularInput() >= schema.MinNumInput(),
"Operator '" + spec.SchemaName() +
"' supports a minimum of " + std::to_string(schema.MinNumInput()) + " inputs, "
"Operator `" + GetOpDisplayName(spec, true) +
"` supports a minimum of " + std::to_string(schema.MinNumInput()) + " inputs, "
"but was passed " + std::to_string(spec.NumRegularInput()) + ".");
DALI_ENFORCE(spec.NumOutput() == schema.CalculateOutputs(spec) + additional_outputs,
"Operator '" + spec.SchemaName() + "' supports "
"Operator `" + GetOpDisplayName(spec, true) + "` supports "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrade to make_string? You'll be able to drop all those std::to_string...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -96,7 +96,7 @@ void PropagateError(ErrorInfo error) {
}
}

std::string GetErrorContextMessage(const OpSpec &spec) {
std::string GetErrorContextMessage(const OpSpec &spec, const std::string &message_name) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a string_view in such functions?

Suggested change
std::string GetErrorContextMessage(const OpSpec &spec, const std::string &message_name) {
std::string GetErrorContextMessage(const OpSpec &spec, std::string_view message_name) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 374 to 369
DALI_FAIL(make_string("Error for ", FormatArgument(spec, arg_name),
". Named arguments inputs to operators must be CPU data nodes. "
"However, a GPU data node was provided."));
Copy link
Contributor

@mzient mzient Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this message. It says that the input is GPU, but we're not really sure, are we? We know that it's not CPU. Perhaps there's some internal error and it's just empty? I think we should stick to the known facts and report those.

Suggested change
DALI_FAIL(make_string("Error for ", FormatArgument(spec, arg_name),
". Named arguments inputs to operators must be CPU data nodes. "
"However, a GPU data node was provided."));
DALI_FAIL(make_string("Error for ", FormatArgument(spec, arg_name),
". Named arguments inputs to operators must be CPU data nodes. ""));

or otherwise make sure that we're saying the truth

Suggested change
DALI_FAIL(make_string("Error for ", FormatArgument(spec, arg_name),
". Named arguments inputs to operators must be CPU data nodes. "
"However, a GPU data node was provided."));
assert(it->second.has_gpu);
DALI_FAIL(make_string("Error for ", FormatArgument(spec, arg_name),
". Named arguments inputs to operators must be CPU data nodes. "
"However, a GPU data node was provided."));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 390 to 391
make_string("Error for ", FormatOutput(spec, i), ". Output name \"", output_name,
"\" conflicts with existing intermediate result name."));
Copy link
Contributor

@mzient mzient Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
make_string("Error for ", FormatOutput(spec, i), ". Output name \"", output_name,
"\" conflicts with existing intermediate result name."));
make_string("Error while specifying ", FormatOutput(spec, i), ". Output name \"", output_name,
"\" conflicts with an existing intermediate result name."));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 525 to 519
eptr = std::current_exception();
}
if (eptr) {
PropagateError({eptr,
"Critical error when building pipeline:\n" + GetErrorContextMessage(op_spec),
"\nCurrent pipeline object is no longer valid."});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's wrong with the following?

Suggested change
eptr = std::current_exception();
}
if (eptr) {
PropagateError({eptr,
"Critical error when building pipeline:\n" + GetErrorContextMessage(op_spec),
"\nCurrent pipeline object is no longer valid."});
PropagateError({std::current_exception(),
"Critical error when building pipeline:\n" + GetErrorContextMessage(op_spec),
"\nCurrent pipeline object is no longer valid."});
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, much simpler, adjusted the other occurrences as well.

@mzient mzient assigned mzient and unassigned szalpal Mar 27, 2024
@klecki
Copy link
Contributor Author

klecki commented Mar 27, 2024

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [13836164]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [13836164]: BUILD FAILED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki klecki force-pushed the less-schema-name-more-display-name branch from 71096a0 to e8d2079 Compare April 10, 2024 12:17
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Copy link
Contributor Author

klecki commented Apr 10, 2024

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [14134128]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [14134128]: BUILD PASSED

@klecki klecki merged commit 3b1b5f8 into NVIDIA:main Apr 11, 2024
7 checks passed
@klecki klecki deleted the less-schema-name-more-display-name branch April 11, 2024 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants