Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash detection #69

Merged
merged 5 commits into from
Jan 25, 2023
Merged

Crash detection #69

merged 5 commits into from
Jan 25, 2023

Conversation

AliceHarang
Copy link
Collaborator

@AliceHarang AliceHarang commented Aug 8, 2022

Creation of a Crash detection to stop the run if the time-step collapses.

  • Detection of the collapsing time-steps
  • Output field variables
  • Test on CI tests
  • Test on real case

@AliceHarang
Copy link
Collaborator Author

Need to improve the crash detection:
Calculate the critical time-step function of the previous ones.

src/Mainloop.cu Outdated
// Stop calculation if crashed
if (XLoop.dt < XParam.dtmin)
{
XLoop.totaltime = XParam.endtime + 1;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would make more sense to say :
XParam.endtime = XLoop.totaltime
with the same outcome.
i.e. when looking at the model status after the crash it will look like the model reached the expected endtime when in reality it stopped earlier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comments, I will make these changes

src/Mainloop.cu Outdated
if (XLoop.dt < XParam.dtmin)
{
XLoop.totaltime = XParam.endtime + 1;
log(" \n ");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add :
XLoop.nextoutputtime = XLoop.totaltime

this will enforce the request for output without changing the if statement below. same effect but cleaner I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add this too.

@CyprienBosserelle
Copy link
Owner

CyprienBosserelle commented Jan 23, 2023

Few things with this solution ATM:

  1. This method doesn't see +INF crashes. Because dt will not collapse but instead grow and stabilize
  2. I don't think I agree with the weighting. if anything I would simplify and not weighting (dtmin=dt)
  3. We need a mechanism for selecting which dt criteria we use:
    the dt<XParam.dtmin or dt<0.1*Xloop.dtmin

On one end the dt<XParam.dtmin is simple and easy for user to control

On the other end dt<0.1*Xloop.dtmin may be more sensitive and detect a wider range of crash. but without a test crash model it is difficult to know .

Possible solutions

Some crash do not affect the model timestep/minimum timestep (e.g. +INF) but we can probably look at the largest time step as well as the smallest to find crazy values.
We have a user defined dtmin and dtmax that detects the crashes using both the minimum timestep and maximum timestep.
Timestep is a good value to look at because locally dt approx to (sqrt(g*h) + hypot(u,v))/dx

Implementing that technique would move the crash detection inside the timestep calculation but that is a more complex operation and more code changes. Should that detection be moved to post v1.0?

Do you agree?

I suggest we do:

  • Simplify the min time step detection to be as defined by the user or default as is
  • Add task to improve crash detection for version 1.XX

@CyprienBosserelle CyprienBosserelle added this to the v1.0.0 RC milestone Jan 25, 2023
@AliceHarang
Copy link
Collaborator Author

This development is useful to detect the code crashes, locate the source and understand the origin of the crash.
It detects the crash based on the time steps and create a map of the variables at the last steps.
It has been test on a time steps getting too small during the simulation, with different zones output and multi-resolution grid.

@AliceHarang AliceHarang merged commit ba62f5b into development Jan 25, 2023
@AliceHarang AliceHarang deleted the Crash_identification branch January 25, 2023 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants