Look into BackTracking effect on new test model #1705

Closed
visr opened this issue Aug 8, 2024 · 20 comments
Labels
core: Issues related to the computational core in Julia
performance: Relates to runtime performance or convergence

Comments

@visr
Member

visr commented Aug 8, 2024

#1697 increased the performance considerably, and passed the tests with only slightly higher tolerances.
Surprisingly @Fati-Mon observed this behavior on a Basin running dry:

[image]

[image]

We need to look into what is going on here.

@visr added the performance, core, bug (Indicates an unexpected problem or unintended behavior), and v1.0 Release! labels on Aug 8, 2024
@Huite
Contributor

Huite commented Aug 8, 2024

I guess it's possible that storage still ends up just barely negative somehow and one of the du terms blows up when it sees a negative storage?

@visr
Member Author

visr commented Aug 8, 2024

We reject any timesteps that lead to negative storage (#1406). Or perhaps that comes too late here, since this is triggered during the nonlinear iterations?

@Huite
Contributor

Huite commented Aug 8, 2024

Hmm good point. It shouldn't be too late: if it converges with a negative storage, it should still be rejected.
Both the callback and isoutofdomain should be triggered. So in this case I guess there's really no negative storage involved.

Maybe some term that isn't smooth enough around S = 0?

Does any other water balance term jump?

@visr
Member Author

visr commented Aug 8, 2024

Yeah perhaps we are stepping completely over a reduction factor here. The inflow is 29 m3/s and the outflow 62 m3/s. The outflow should go down to the inflow due to the reduction factors, but all I see is a brief minor dip in the outflow over the Outlet (50 m3/s), and no reduced flow to the UserDemand (12 m3/s).

The complete water balance at the start of the anomaly:

time node_id storage level inflow_rate outflow_rate storage_rate balance_error
2023-11-22 3 5.3E+06 6.1E+01 2.9E+01 6.2E+01 -3.3E+01 8.2E-14
2023-11-23 3 2.4E+06 4.1E+01 2.9E+01 6.2E+01 -2.7E+01 -5.9E+00
2023-11-24 3 9.5E+04 7.9E+00 2.9E+01 6.2E+01 1.1E+04 -1.1E+04
2023-11-25 3 9.4E+08 8.2E+02 2.9E+01 6.2E+01 3.8E+04 -3.8E+04
2023-11-26 3 4.2E+09 1.7E+03 2.9E+01 6.2E+01 6.5E+04 -6.5E+04

The model looks like this:

[image]

And can be downloaded here:

crystal.zip

The results with v2024.10.0 look much better. When in > out, the storage rises from 0; when out > in, the storage goes down to 0. When the storage is 0, out never exceeds in, so they are equal.

[image]
[image]

@visr
Member Author

visr commented Aug 8, 2024

A few more observations. QNDF, TRBDF2 and KenCarp4 all have similar issues. If you tighten the tolerances the solve errors out as unstable. ImplicitEuler does seem to produce the correct results with backtracking. On the HWS model I couldn't directly eyeball large convergence issues with QNDF, though there are no empty Basins there. HWS with backtracking runs in 47s with QNDF and in 2m19s with ImplicitEuler.

Perhaps for now we should make backtracking opt-in while we explore this issue. We can also look at what is happening with backtracking in more detail, based on SciML/OrdinaryDiffEq.jl#1792.

This made me realize #1413 needs a v1.0 label, since erroring out on these issues would protect users.

@SouthEndMusic
Collaborator

@visr you can also try different linesearch algorithms for relaxation (https://github.com/JuliaNLSolvers/LineSearches.jl) but some more insight would be very helpful.

Also, we've probably discussed this before, but what about incorporating the water balance error in isoutofdomain?

@visr
Member Author

visr commented Aug 9, 2024

Yeah I can do that next week. I think the water balance error is important to address (#1413). Not sure if isoutofdomain should be used for this (at least the name sounds wrong), though perhaps one of the step control callbacks at https://docs.sciml.ai/DiffEqCallbacks/stable/step_control/ makes sense.
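
For reference, this is roughly how such a guard plugs into solve as the standard DifferentialEquations.jl isoutofdomain option, sketched on a toy problem; the rhs, tolerances and solver choice are placeholders, not Ribasim's actual wiring:

```julia
using OrdinaryDiffEq

# Toy linear decay standing in for the water balance rhs.
f!(du, u, p, t) = (du .= -p .* u)
prob = ODEProblem(f!, [1.0], (0.0, 10.0), 0.5)

# isoutofdomain returns true for states the solver must reject;
# here: any negative storage. Rejected steps are retried with a smaller dt.
sol = solve(prob, QNDF();
    isoutofdomain = (u, p, t) -> any(x -> x < 0, u),
    abstol = 1e-8, reltol = 1e-8)
```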

@Huite
Contributor

Huite commented Aug 9, 2024

Just to make sure I'm reading properly: the upper plots show timeseries that are totally wrong? It flies off course immediately, right?

And secondly: the same solver doesn't run into this problem when there's no relaxation?

In my understanding, the relaxation only limits the update of the Newton-Raphson step. The check for convergence should be the same -- so I don't understand how it finds such a radically different solution.
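
To make the distinction concrete, a minimal sketch of a damped Newton iteration on a scalar problem, assuming a fixed relaxation factor α (real line searches pick α adaptively); the convergence check on the residual does not involve α at all:

```julia
# Damped Newton for a scalar g(u) = 0: only the update is scaled by α,
# the convergence test on the residual stays the same.
function damped_newton(g, dg, u0; α = 0.5, tol = 1e-10, maxiter = 100)
    u = u0
    for _ in 1:maxiter
        r = g(u)
        abs(r) < tol && return u   # same convergence check for any α
        u -= α * r / dg(u)         # relaxed (damped) Newton update
    end
    error("no convergence")
end

damped_newton(u -> u^2 - 2, u -> 2u, 1.0)   # ≈ sqrt(2)
```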

@SouthEndMusic
Collaborator

SouthEndMusic commented Aug 9, 2024

@Huite you say it limits the step and I thought so too, but I haven't seen anything in the linesearch code that indicates that the step cannot be bigger than the original Newton step.

@Huite
Contributor

Huite commented Aug 9, 2024

Doing some homework on the solvers:

  • Newton-Raphson with ImplicitEuler is straightforward: for each Newton iteration, compute the Jacobian at $u_i$, then solve for $u_{i + 1}$.
  • QNDF is a quasi-constant step size NDF (Numerical Differentiation Formula) method. Its implicit equations are solved with a quasi-Newton iteration that reuses an approximate Jacobian, trying to avoid a full Jacobian update. When the approximation degrades too much (not sure how this is detected), it should do a full Jacobian reformulation.
  • TRBDF2 and KenCarp4 are so-called ESDIRK methods: Explicit first stage, Singly (single diagonal coefficient) Diagonally Implicit Runge-Kutta.

The DiffEq docs mention:

For asymptotically large systems of ODEs (N>1000?) where f is very costly, and the complex eigenvalues are minimal (low oscillations), in that case QNDF or FBDF will be the most efficient. QNDF and FBDF will also do surprisingly well if the solution is smooth. However, this method can handle less stiffness than other methods and its Newton iterations may fail at low accuracy situations. Other choices to consider in this regime are CVODE_BDF and lsoda.

Curious whether CVODE_BDF or lsoda fail in a similar way.
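
For reference, a minimal sketch of how those two could be tried on a plain ODEProblem, assuming the Sundials.jl and LSODA.jl wrappers; the toy stiff problem is a placeholder, not the Ribasim model:

```julia
using OrdinaryDiffEq, Sundials, LSODA

# Toy stiff problem: two decay rates that are far apart.
f!(du, u, p, t) = (du[1] = -p[1] * u[1]; du[2] = -p[2] * u[2]; nothing)
prob = ODEProblem(f!, [1.0, 1.0], (0.0, 100.0), [1e-2, 1e4])

sol_cvode = solve(prob, CVODE_BDF(); abstol = 1e-8, reltol = 1e-8)
sol_lsoda = solve(prob, lsoda(); abstol = 1e-8, reltol = 1e-8)
```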

@Huite
Contributor

Huite commented Aug 9, 2024

@Huite you say it limits the step and I thought so too, but I haven't seen anything in the linesearch code that indicates that the step cannot be bigger than the original Newton step.

I think it's this line:
https://github.com/JuliaNLSolvers/LineSearches.jl/blob/3259cd240144b96a5a3a309ea96dfb19181058b2/src/backtracking.jl#L22

α is the size of the update; the default is 1:

α_0::Tα = real(T)(1)

α = 1 is a standard Newton step.

Then it's updated:

α_0 = min(α_0, min(alphamax, ls.maxstep / norm(s, Inf)))

So α_0 shouldn't exceed 1.0 (unless you fiddle with it?).

Then in the actual function, it interpolates with either a quadratic or a cubic and takes the minimizer of that model.
I think these lines mean it's limited:

α_1, α_2 = αinitial, αinitial

# interpolation
# ...

α_tmp = NaNMath.min(α_tmp, α_2*ρ_hi) # avoid too small reductions
α_2 = NaNMath.max(α_tmp, α_2*ρ_lo) # avoid too big reductions

Defined above:

    ρ_hi::TF = 0.5
    ρ_lo::TF = 0.1

So after the first iteration 0.1 <= α_2 <= 0.5.

Afterwards, it should only get smaller. (Unless something's changing defaults.)
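
In other words the safeguard amounts to clamping the interpolated candidate between fixed fractions of the previous α; a minimal sketch, assuming the quoted defaults ρ_lo = 0.1 and ρ_hi = 0.5:

```julia
# Safeguarded backtracking: the interpolated candidate is clamped to
# [ρ_lo * α, ρ_hi * α], so with the defaults each backtracking iteration
# shrinks the step by at least a factor 2 and at most a factor 10.
next_alpha(α, α_interp; ρ_lo = 0.1, ρ_hi = 0.5) = clamp(α_interp, ρ_lo * α, ρ_hi * α)

next_alpha(1.0, 0.9)    # 0.5, capped at ρ_hi * α
next_alpha(1.0, 0.01)   # 0.1, floored at ρ_lo * α
next_alpha(0.5, 0.2)    # 0.2, already within bounds
```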

@Huite
Contributor

Huite commented Aug 9, 2024

To get a slightly better feeling for these methods: https://www.stochasticlifestyle.com/differences-between-methods-for-solving-stiff-odes/

@SouthEndMusic
Collaborator

SouthEndMusic commented Aug 11, 2024

TL;DR: I still think we should give this a try.

The outflow should go down to the inflow due to the reduction factors

@visr that is not how it is implemented now, but that is what is suggested here. Currently outflows of a basin don't depend on inflows (only on the storage), but to me it makes sense that they would. Note that this adds new non-zeros to the Jacobian, which is much easier to play with now that #1606 is in place (at least we can test with dense Jacobians).

Additionally, with the current implementation it is much more finicky to find a steady state solution for depleted basins with in- and outflow (the solver has to find exactly the storage where the reduction factor makes the outflow match the inflow; in hindsight that's probably what you meant, Martijn) than with the aforementioned suggestion.
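
For concreteness, a minimal sketch of that steady-state condition with an illustrative linear reduction factor; the ramp shape and the threshold are assumptions for the example, not Ribasim's exact implementation (the in- and outflow numbers are the ones from this model):

```julia
# Illustrative storage-based reduction factor: linear ramp from 0 to 1
# over [0, threshold], and 1 above it.
reduction_factor(S, threshold) = clamp(S / threshold, 0.0, 1.0)

# With outflow depending only on storage, a depleted basin is at steady
# state only at the exact storage S* where
#   reduction_factor(S*, threshold) * Q_out_max == Q_in,
# which is the point the nonlinear solver has to pinpoint.
Q_in, Q_out_max, threshold = 29.0, 62.0, 10.0
S_star = threshold * Q_in / Q_out_max   # ≈ 4.68 for this linear ramp
```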

I played around a bit with writing a custom line search algorithm, from which I learned that:

  • The solver is happy to work with negative storages as intermediate results in the non-linear solver. That makes sense now that I observed it, and hints at when isoutofdomain is triggered.
  • I tried enforcing non-negative storages with a custom line search, but that does not work because the line search is not always triggered, I guess because sometimes the initial guess for the nonlinear solver is already good enough.
  • The fraction α of the step can get extremely small, on the order of 1e-19.

I also learned (with the help of claude.ai) that QNDF does a predictor step based on previous results, which I can imagine can be far off with an abrupt change in dynamics due to the reduction factor.

@Huite
Contributor

Huite commented Aug 12, 2024

Currently outflows of a basin don't depend on inflows (only on the storage), but to me it makes sense that that would be the case. Note that this adds new non-zeros to the Jacobian, which is much easier to play with with #1606 in place (at least we can test with dense Jacobians).

But the storage does depend on the inflows, and the outflows depend on the storage. So in the (linearized) system of equations the outflows clearly do depend on the inflows.

The solver is happy to work with negative storages as intermediate results in the non-linear solver

So the result of a Newton iteration might contain negative storages?

Does this mean it's calling the rhs function (water_balance!) with negative storages?
How many of our du/dt terms are robust for negative storage? Now that I think of it, it might be smart to have unit tests in place for each du/dt term so that we know they don't accidentally explode. I reckon every potential outflow should return 0 for negative storage (?).
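
A minimal sketch of what such a unit test could look like, with a hypothetical outflow_term standing in for an individual du term (not an existing Ribasim function):

```julia
using Test

# Hypothetical stand-in for one outflow term in water_balance!;
# whatever the formulation, it should stay finite and return 0
# for non-positive storage.
outflow_term(S) = S <= 0 ? 0.0 : 62.0 * clamp(S / 10.0, 0.0, 1.0)

@testset "outflow terms are robust for negative storage" begin
    for S in (-1e6, -1.0, -1e-12, 0.0)
        @test outflow_term(S) == 0.0
        @test isfinite(outflow_term(S))
    end
end
```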

it is much more finicky to find a steady state solution for depleted basins with in- and outflow (the solver has to find exactly the storage where the reduction factor makes the outflow match the inflow; in hindsight that's probably what you meant, Martijn)

I don't follow why it is more finicky. Isn't a basin always above depletion if there's still outflow? The outflow is distributed across different terms, each very small. The linearized problem provides enough information that the linear solver can find such a point, although it may result in negative storage. If the next rhs evaluation results in setting all outflows to 0, the next iteration should result in a small positive storage again.

Apparently the solver will dip into negative storage, re-formulate, and then converge with (wrong) positive storages -- otherwise the solution would be rejected. Shouldn't that go away if outflow becomes zero with negative storage, then?

@evetion
Member

evetion commented Aug 12, 2024

Does this mean it's calling the rhs function (water_balance!) with negative storages? How many of our du/dt terms are robust for negative storage? Now that I think of it, it might be smart to have unit tests in place for each du/dt term so that we know they don't accidentally explode. I reckon every potential outflow should return 0 for negative storage (?).

Might be relevant, from the docstring of the PositiveDomain callback:

Please note, that the system should be defined also outside the positive domain, since even with these approaches, negative variables might occur during the calculations. Moreover, one should follow Shampine's et al. advice and set the derivative $x_i'$ of a negative component $x_i$ to $\max\{0, f_i(x, t)\}$, where $t$ denotes the current time point with state vector $x$ and $f_i$ is the $i$-th component of function $f$ in an ODE system $x' = f(x, t)$.
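
For context, a minimal sketch of attaching that callback via DiffEqCallbacks.jl on a toy problem; this is not how Ribasim currently guards negative storage (that goes through isoutofdomain and the callback mentioned earlier in this thread):

```julia
using OrdinaryDiffEq, DiffEqCallbacks

f!(du, u, p, t) = (du .= -p .* u)   # toy decay toward zero storage
prob = ODEProblem(f!, [1.0], (0.0, 50.0), 0.5)

# PositiveDomain rejects steps that leave the positive domain and retries
# them with a smaller dt, following the Shampine et al. approach above.
sol = solve(prob, QNDF(); callback = PositiveDomain())
```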

@Huite
Contributor

Huite commented Aug 12, 2024

Good point, I remember reading this. The Shampine paper is this one:

https://www.sciencedirect.com/science/article/abs/pii/S0096300304009683

From the PDF, with non-garbled math:

[image]

My first suggestion now would be to add some unit tests to enforce du=0 for S <= 0, and potentially add a sigmoid to sidestep resulting discontinuities.

@visr
Member Author

visr commented Aug 12, 2024

Ha, we had implemented that advice but removed it when we started using isoutofdomain, see https://github.com/Deltares/Ribasim/pull/639/files. I'll try adding that back and see the effect.

EDIT: For the test model I see negative storage, but not with a negative du, so it had no effect.

@Huite
Contributor

Huite commented Aug 12, 2024

Uh, I think setting the total du to zero might not be the best solution. I'm not familiar enough with the fancier solvers which track several previous steps, but if you're using ImplicitEuler, it will base du only on the last step. In that case, if u < 0, then du for those states will be set to zero as well, and every iteration afterwards will also set du = 0. Only in the next time step will it then reformulate and become dynamic again.

Instead, I think you should allow positive individual terms (inflows, e.g. precipitation), but truncate individual negative terms. That way, if a storage ends up below zero it can still go back into the positive.

EDIT: pretty sure this is what MODFLOW 6 does for the NPF (internodal flows) Newton formulation too.
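
A minimal sketch of the difference, with hypothetical per-term values taken from the numbers earlier in this thread (not Ribasim's actual formulation):

```julia
# Per-term truncation: keep positive terms (precipitation, upstream
# inflow) active, but zero each outflow term once storage is
# non-positive, so a state that dipped below zero can still climb back.
function du_per_term(S; precipitation, upstream_in, outlet_out, user_demand_out)
    gate = S > 0 ? 1.0 : 0.0   # truncate only the outflow terms
    return precipitation + upstream_in - gate * (outlet_out + user_demand_out)
end

# With a slightly negative storage, du is +29 m3/s, so the storage
# recovers instead of freezing at du = 0:
du_per_term(-0.1; precipitation = 0.0, upstream_in = 29.0,
            outlet_out = 50.0, user_demand_out = 12.0)
```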

@Huite
Contributor

Huite commented Aug 12, 2024

Other suggestions:

  1. Just try a scalar relaxation value.

  2. Evaluate the residual and repeatedly apply a scalar reduction (both SWAP and MODFLOW 6 use this approach):

SWAP:
[image]

MODFLOW:
[image]
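
A minimal sketch of suggestion 2 as a generic residual-based scalar damping, in the spirit of the SWAP and MODFLOW 6 schemes referenced above but not a transcription of either:

```julia
using LinearAlgebra

# Damp a Newton-type update Δu with a single scalar θ: halve θ until
# the residual norm no longer increases, then accept the damped step.
function relaxed_update(residual, u, Δu; θ = 1.0, θmin = 1e-4)
    r0 = norm(residual(u))
    while θ > θmin
        u_trial = u + θ * Δu
        norm(residual(u_trial)) <= r0 && return u_trial
        θ /= 2   # repeatedly apply the scalar reduction
    end
    return u + θmin * Δu   # fall back to a tiny step
end
```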

visr added a commit that referenced this issue Aug 13, 2024
Fixes a part of #1705.

This no longer has the huge speedup of #1697, but it still gives a good 10-20% performance improvement, and no breakdowns. If we go over to NonlinearSolve.jl it probably makes sense to try out other algorithms again.
@visr removed the bug and v1.0 Release! labels on Aug 22, 2024
@SouthEndMusic changed the title from "Look into BackTracking affect on new test model" to "Look into BackTracking effect on new test model" on Aug 26, 2024
@visr
Member Author

visr commented Sep 3, 2024

Fixed by #1761.

@visr closed this as completed Sep 3, 2024