[EPIC] ErrorManager #2940

MelReyCG · 2024-01-17T17:00:31Z

Main Goal

The goal of this EPIC is to add a component to GEOS that centralizes and manages errors (and exceptions), provides structured error data, and produces clear, comprehensive error outputs that are suitable for everyone (users and developers). The EPIC also aims to define a policy regarding errors and exceptions.

Issues in this EPIC

1. Complete the error unit tests (which must test every type of error GEOS can encounter):
- Numeric errors, memory overflow, IO errors,
- All exception types in use in GEOS,
- std errors ("map::at" user reported error, std::vector resizing...),
- Unknown exceptions,
- MPI errors,
- Exceptions / errors raised while catching an exception,
- Unexpected program exits (we should at least have the stack trace),
- ...
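As an illustration of the kind of cases such a unit test has to cover, here is a minimal sketch in standard C++ only. The names `classify`, `faultyMapAccess` and `throwUnknown` are hypothetical and do not exist in GEOS; a real test would use the GEOS test framework and exception types.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical faulty operation: reproduces the "map::at" error on a missing key.
void faultyMapAccess()
{
  std::map< std::string, int > values;
  (void) values.at( "missing" );  // throws std::out_of_range
}

// Hypothetical faulty operation: throws something that is not a std::exception.
void throwUnknown()
{
  throw 42;
}

// Runs the operation and reports which category of exception escaped from it.
// Handlers are ordered from the most specific type to the catch-all.
std::string classify( std::function< void() > const & operation )
{
  try
  {
    operation();
  }
  catch( std::out_of_range const & )
  {
    return "std::out_of_range";
  }
  catch( std::exception const & )
  {
    return "std::exception";
  }
  catch( ... )
  {
    return "unknown";
  }
  return "none";
}
```

Returning a category rather than the `what()` text keeps the checks independent of the standard library implementation, whose messages differ between compilers.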

2. Create the ErrorManager class, which:
- Provides a centralized point to throw and manage the GEOS errors / exceptions,
- Is based on structured error data rather than only texts,
- Must be reliable,
- Produces clear console outputs (not necessarily comprehensive, with a level of detail that depends on the user type),
- Produces a generated error data file that contains all the error data (JSON format? One file per rank, grouped in a subfolder?),
- Has only GEOS_HOST methods, to ensure that only CPUs can throw / manage errors.

The error data structure can contain:

- Error message,
- Timestamp,
- Location in the code, stack trace,
- Group / Wrapper that sent the message, if applicable (name + XML location / path in hierarchy),
- GEOS loading / simulating phase,
- Time step, convergence step and converged attribute,
- MPI rank,
- Parent exception data,
- … (don't hesitate to suggest more data)
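To make the idea of structured error data more concrete, here is a minimal sketch of what such a record and its JSON output could look like. Everything in it is an assumption for discussion: the `ErrorData` struct, its field names, and the `escapeJson` / `toJson` helpers are hypothetical, only a subset of the fields listed above is shown, and a real implementation would more likely rely on a JSON library.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical structured error record (field names are illustrative only).
struct ErrorData
{
  std::string message;
  std::string timestamp;                  // e.g. an ISO 8601 string
  std::string file;                       // location in the code
  int line = 0;
  int rank = 0;                           // MPI rank that raised the error
  std::vector< std::string > stackTrace;  // one entry per stack frame
};

// Minimal escaping: handles quotes, backslashes and newlines only.
std::string escapeJson( std::string const & in )
{
  std::string out;
  for( char c : in )
  {
    switch( c )
    {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      default:   out += c;
    }
  }
  return out;
}

// Serializes one record as a single-line JSON object.
std::string toJson( ErrorData const & e )
{
  std::ostringstream os;
  os << "{\"message\":\"" << escapeJson( e.message ) << "\""
     << ",\"timestamp\":\"" << escapeJson( e.timestamp ) << "\""
     << ",\"location\":\"" << escapeJson( e.file ) << ":" << e.line << "\""
     << ",\"rank\":" << e.rank
     << ",\"stackTrace\":[";
  for( std::size_t i = 0; i < e.stackTrace.size(); ++i )
  {
    os << ( i == 0 ? "" : "," ) << "\"" << escapeJson( e.stackTrace[i] ) << "\"";
  }
  os << "]}";
  return os.str();
}
```

Keeping the record as plain data, separate from its formatting, is what would let the same error be printed briefly on the console and written in full to the per-rank file.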

3. Factorize errors that come from multiple ranks, either synchronously or by postprocessing the generated error data file.

The goal here is to solve this classic problem: suppose GEOS runs on 2048 ranks and rank 407 throws an error because of a local issue. Then ranks 203, 358, 1017 and 1502 throw another error because of ghost cells, and all the other ranks emit MPI_ABORT errors. In this situation, we can only hope that everything is output in that order in the log, but this is not guaranteed.

The solution I would like to propose is to process the error data files either:
a) If possible, when a crash occurs, by having rank 0 collect & factorize the error data files from the other ranks and write the result to stdout,
b) After the complete GEOS shutdown, by launching geos or a dedicated executable / script on the folder of generated error data files.
Because of HPC considerations, method a) could be opt-in, enabled by a command line parameter.
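The factorization step itself could look like the following sketch, which assumes that the per-rank error records have already been gathered (by rank 0 or by a postprocessing tool). `RankError`, `factorize` and `formatReport` are hypothetical names, and grouping on the exact message text is a simplification; real records would probably be compared on several structured fields.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal record: which rank reported which message.
struct RankError
{
  int rank;
  std::string message;
};

// Groups identical messages and collects the ranks that reported each of them.
std::map< std::string, std::set< int > > factorize( std::vector< RankError > const & errors )
{
  std::map< std::string, std::set< int > > groups;
  for( RankError const & e : errors )
  {
    groups[e.message].insert( e.rank );
  }
  return groups;
}

// Prints the groups, least frequent first: an error seen on a single rank is
// more likely to be the root cause than one reported by every rank.
std::string formatReport( std::map< std::string, std::set< int > > const & groups )
{
  std::vector< std::pair< std::string, std::set< int > > > sorted( groups.begin(), groups.end() );
  std::stable_sort( sorted.begin(), sorted.end(),
                    []( auto const & a, auto const & b ) { return a.second.size() < b.second.size(); } );
  std::ostringstream os;
  for( auto const & [message, ranks] : sorted )
  {
    os << ranks.size() << " rank(s): " << message << " [";
    bool first = true;
    for( int r : ranks )
    {
      os << ( first ? "" : "," ) << r;
      first = false;
    }
    os << "]\n";
  }
  return os.str();
}
```

Ordering by ascending rank count is only one possible heuristic; ordering by timestamp would be another option once the records carry one.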

4. Properly manage TPL errors by (1) adding a human-readable explanation of what GEOS was trying to do, and (2) if possible, mentioning why the call failed:
- GEOS_LAI_CHECK_ERROR() macro failures,
- GEOS_PARMETIS_CHECK() / GEOS_SCOTCH_CHECK() macro failures,
- CUDA errors (GEOS_HYPRE_CHECK_DEVICE_ERRORS(), cudaGetLastError())
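One possible shape for such a check is sketched below. It is not the existing GEOS_LAI_CHECK_ERROR() or GEOS_PARMETIS_CHECK() implementation: `checkTplCall` and `CHECK_TPL_CALL` are hypothetical names, and the sketch assumes that a return code of 0 means success, which does not hold for every library.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns a failing third-party return code into an
// exception that says what the caller was trying to do.
// Assumption: 0 means success (some libraries use another value).
inline void checkTplCall( int returnCode, char const * call, std::string const & intent,
                          char const * file, int line )
{
  if( returnCode != 0 )
  {
    std::ostringstream os;
    os << "Third-party call failed while " << intent << ": "
       << call << " returned " << returnCode
       << " (" << file << ":" << line << ")";
    throw std::runtime_error( os.str() );
  }
}

// Captures the text of the call and the source location automatically.
#define CHECK_TPL_CALL( call, intent ) \
  checkTplCall( ( call ), #call, ( intent ), __FILE__, __LINE__ )
```

A call site would then read like `CHECK_TPL_CALL( someLibraryCall( args ), "partitioning the mesh" );`, where `someLibraryCall` stands for any third-party function returning an integer status.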

5. All errors from the unit tests must be properly interfaced with Python / pygeos.

6.1. Add a section in the documentation to describe "How to generate an error / an exception". It should state what is acceptable in the GEOS code and what is not.

The following practices are banned:

- Recovering from an exception. Exceptions can only be caught by a higher function in the call stack to add more information to them (and potentially stack exceptions).
- Throwing any error / exception or writing any log from a GEOS_HOST_DEVICE context.
  - If any code can run on GPU, the error / warning state should be reported to the CPU. For instance, if a variable should throw an error if negative, the good practice is to collect its minimal value with RAJA::ReduceMin and read it from the host context to write a proper contextualized message.
  - Because of the memory impact, any call to CUDA printf() is banned.
- ... (don't hesitate to suggest more)
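The first rule (catching only to add information, never to recover) can be illustrated with the nested exception facilities of the C++ standard library. This is a sketch of the pattern, not a proposal for the final GEOS API: `readTable`, `initializeSolver` and `describe` are hypothetical names.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical low-level function that fails.
void readTable( std::string const & name )
{
  throw std::runtime_error( "table '" + name + "' not found" );
}

// Hypothetical higher-level function: it does not recover, it only adds
// context and lets the error continue up the call stack.
void initializeSolver()
{
  try
  {
    readTable( "relperm" );
  }
  catch( ... )
  {
    std::throw_with_nested( std::runtime_error( "while initializing the solver" ) );
  }
}

// Unwinds the chain of nested exceptions into an indented, readable text.
std::string describe( std::exception const & e, int depth = 0 )
{
  std::string text = std::string( depth * 2, ' ' ) + e.what() + "\n";
  try
  {
    std::rethrow_if_nested( e );
  }
  catch( std::exception const & nested )
  {
    text += describe( nested, depth + 1 );
  }
  catch( ... )
  {
    text += std::string( ( depth + 1 ) * 2, ' ' ) + "unknown exception\n";
  }
  return text;
}
```

Only the top-level handler (for instance in main) would call `describe` and terminate; intermediate levels only enrich the exception.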
6.2. Ensure that the errors / exceptions practices are in place in GEOS.
- Find the places in the code where warnings should be used rather than plain logs,
- Remove any possibility to add an error / a log from the GPU (too cache heavy),
- Control the flow of every exception in GEOS.


jeannepellerin · 2024-02-21T19:06:15Z

@rrsettgast

MelReyCG added the EPIC The agile epic concept label Jan 17, 2024

MelReyCG self-assigned this Jan 18, 2024

paveltomin mentioned this issue Feb 1, 2024

XSD Schema versioning #2971

Open
