Look into BackTracking effect on new test model #1705

Closed
visr opened this issue Aug 8, 2024 · 20 comments
Labels
core: Issues related to the computational core in Julia
performance: Relates to runtime performance or convergence

Comments

@visr
Member

visr commented Aug 8, 2024

#1697 increased the performance considerably, and passed the tests with only slightly higher tolerances.
Surprisingly @Fati-Mon observed this behavior on a Basin running dry:

[image]

[image]

We need to look into what is going on here.

@visr added the performance, core, bug (Indicates an unexpected problem or unintended behavior), and v1.0 Release! labels on Aug 8, 2024
@Huite
Contributor

Huite commented Aug 8, 2024

I guess it's possible that storage still ends up just barely negative somehow and one of the du terms blows up when it sees a negative storage?

@visr
Member Author

visr commented Aug 8, 2024

We reject any timesteps that lead to negative storage (#1406). Or perhaps that comes too late here, since this is triggered during the nonlinear iterations?

@Huite
Contributor

Huite commented Aug 8, 2024

Hmm good point. It shouldn't be too late: if it converges with a negative storage, it should still be rejected.
Both the callback and isoutofdomain should be triggered. So in this case I guess there's really no negative storage involved.

Maybe some term that isn't smooth enough around S = 0?

Does any other water balance term jump?

@visr
Member Author

visr commented Aug 8, 2024

Yeah perhaps we are stepping completely over a reduction factor here. The inflow is 29 m3/s and the outflow 62 m3/s. The outflow should go down to the inflow due to the reduction factors, but all I see is a brief minor dip in the outflow over the Outlet (50 m3/s), and no reduced flow to the UserDemand (12 m3/s).

The complete water balance at the start of the anomaly:

time node_id storage level inflow_rate outflow_rate storage_rate balance_error
2023-11-22 3 5.3E+06 6.1E+01 2.9E+01 6.2E+01 -3.3E+01 8.2E-14
2023-11-23 3 2.4E+06 4.1E+01 2.9E+01 6.2E+01 -2.7E+01 -5.9E+00
2023-11-24 3 9.5E+04 7.9E+00 2.9E+01 6.2E+01 1.1E+04 -1.1E+04
2023-11-25 3 9.4E+08 8.2E+02 2.9E+01 6.2E+01 3.8E+04 -3.8E+04
2023-11-26 3 4.2E+09 1.7E+03 2.9E+01 6.2E+01 6.5E+04 -6.5E+04

The model looks like this:

[image]

And can be downloaded here:

crystal.zip

The results with v2024.10.0 look much better. When in > out, the storage rises from 0; when out > in, the storage goes down to 0. When the storage is 0, out never exceeds in, so they are equal.

[image]
[image]

@visr
Member Author

visr commented Aug 8, 2024

A few more observations. QNDF, TRBDF2 and KenCarp4 all have similar issues. If you tighten the tolerances the solve errors out as unstable. ImplicitEuler does seem to produce the correct results with backtracking. On the HWS model I couldn't directly eyeball large convergence issues with QNDF, though there are no empty Basins there. HWS with backtracking runs in 47s with QNDF and in 2m19s with ImplicitEuler.

Perhaps for now we should make backtracking opt-in while we explore this issue. We can also look at what is happening with backtracking in more detail, based on SciML/OrdinaryDiffEq.jl#1792.

This made me realize #1413 needs a v1.0 label, since erroring out on these issues would protect users.

@SouthEndMusic
Collaborator

@visr you can also try different linesearch algorithms for relaxation (https://github.com/JuliaNLSolvers/LineSearches.jl) but some more insight would be very helpful.

Also, we've probably discussed this before, but what about incorporating the water balance error in isoutofdomain?

@visr
Member Author

visr commented Aug 9, 2024

Yeah I can do that next week. I think the water balance error is important to address (#1413). Not sure if isoutofdomain should be used for this (at least the name sounds wrong), though perhaps one of the step control callbacks at https://docs.sciml.ai/DiffEqCallbacks/stable/step_control/ makes sense.
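
For reference, this is roughly how such a guard plugs into solve as the standard DifferentialEquations.jl isoutofdomain option, sketched on a toy problem; the rhs, tolerances and solver choice are placeholders, not Ribasim's actual wiring:

```julia
using OrdinaryDiffEq

# Toy linear decay standing in for the water balance rhs.
f!(du, u, p, t) = (du .= -p .* u)
prob = ODEProblem(f!, [1.0], (0.0, 10.0), 0.5)

# isoutofdomain returns true for states the solver must reject;
# here: any negative storage. Rejected steps are retried with a smaller dt.
sol = solve(prob, QNDF();
    isoutofdomain = (u, p, t) -> any(x -> x < 0, u),
    abstol = 1e-8, reltol = 1e-8)
```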

@Huite
Contributor

Huite commented Aug 9, 2024

Just to make sure I'm reading properly: the upper plots show timeseries that are totally wrong? It flies off course immediately, right?

And secondly: the same solver doesn't run into this problem when there's no relaxation?

In my understanding, the relaxation only limits the update of the Newton-Raphson step. The check for convergence should be the same -- so I don't understand how it finds such a radically different solution.
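
To make the distinction concrete, a minimal sketch of a damped Newton iteration on a scalar problem, assuming a fixed relaxation factor α (real line searches pick α adaptively); the convergence check on the residual does not involve α at all:

```julia
# Damped Newton for a scalar g(u) = 0: only the update is scaled by α,
# the convergence test on the residual stays the same.
function damped_newton(g, dg, u0; α = 0.5, tol = 1e-10, maxiter = 100)
    u = u0
    for _ in 1:maxiter
        r = g(u)
        abs(r) < tol && return u   # same convergence check for any α
        u -= α * r / dg(u)         # relaxed (damped) Newton update
    end
    error("no convergence")
end

damped_newton(u -> u^2 - 2, u -> 2u, 1.0)   # ≈ sqrt(2)
```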

@SouthEndMusic
Collaborator

SouthEndMusic commented Aug 9, 2024

@Huite you say it limits the step and I thought so too, but I haven't seen anything in the linesearch code that indicates that the step cannot be bigger than the original Newton step.

@Huite
Contributor

Huite commented Aug 9, 2024

Doing some homework on the solvers:

  • Newton-Raphson with ImplicitEuler is straightforward: for each Newton iteration, compute the Jacobian at $u_i$, then solve for $u_{i + 1}$.
  • QNDF is a quasi-constant step size NDF (Numerical Differentiation Formula) method. Its implicit equations are solved with a quasi-Newton iteration that reuses an approximate Jacobian, trying to avoid a full Jacobian update. When the approximation degrades too much (not sure how this is detected), it should do a full Jacobian reformulation.
  • TRBDF2 and KenCarp4 are so-called ESDIRK methods: Explicit first stage, Singly (single diagonal coefficient) Diagonally Implicit Runge-Kutta.

The DiffEq docs mention:

For asymptotically large systems of ODEs (N>1000?) where f is very costly, and the complex eigenvalues are minimal (low oscillations), in that case QNDF or FBDF will be the most efficient. QNDF and FBDF will also do surprisingly well if the solution is smooth. However, this method can handle less stiffness than other methods and its Newton iterations may fail at low accuracy situations. Other choices to consider in this regime are CVODE_BDF and lsoda.

Curious whether CVODE_BDF or lsoda fail in a similar way.
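
For reference, a minimal sketch of how those two could be tried on a plain ODEProblem, assuming the Sundials.jl and LSODA.jl wrappers; the toy stiff problem is a placeholder, not the Ribasim model:

```julia
using OrdinaryDiffEq, Sundials, LSODA

# Toy stiff problem: two decay rates that are far apart.
f!(du, u, p, t) = (du[1] = -p[1] * u[1]; du[2] = -p[2] * u[2]; nothing)
prob = ODEProblem(f!, [1.0, 1.0], (0.0, 100.0), [1e-2, 1e4])

sol_cvode = solve(prob, CVODE_BDF(); abstol = 1e-8, reltol = 1e-8)
sol_lsoda = solve(prob, lsoda(); abstol = 1e-8, reltol = 1e-8)
```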

@Huite
Contributor

Huite commented Aug 9, 2024

@Huite you say it limits the step and I thought so too, but I haven't seen anything in the linesearch code that indicates that the step cannot be bigger than the original Newton step.

I think it's this line:
https://github.com/JuliaNLSolvers/LineSearches.jl/blob/3259cd240144b96a5a3a309ea96dfb19181058b2/src/backtracking.jl#L22

α is the size of the update; the default is 1:

α_0::Tα = real(T)(1)

α = 1 is a standard Newton step.

Then it's updated:

α_0 = min(α_0, min(alphamax, ls.maxstep / norm(s, Inf)))

So α_0 shouldn't exceed 1.0 (unless you fiddle with it?).

Then in the actual function, it interpolates with either a quadratic or a cubic and takes the minimizer of that model.
I think these lines mean it's limited:

α_1, α_2 = αinitial, αinitial

# interpolation
# ...

α_tmp = NaNMath.min(α_tmp, α_2*ρ_hi) # avoid too small reductions
α_2 = NaNMath.max(α_tmp, α_2*ρ_lo) # avoid too big reductions

Defined above:

    ρ_hi::TF = 0.5
    ρ_lo::TF = 0.1

So after the first iteration 0.1 <= α_2 <= 0.5.

Afterwards, it should only get smaller. (Unless something's changing defaults.)
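
In other words the safeguard amounts to clamping the interpolated candidate between fixed fractions of the previous α; a minimal sketch, assuming the quoted defaults ρ_lo = 0.1 and ρ_hi = 0.5:

```julia
# Safeguarded backtracking: the interpolated candidate is clamped to
# [ρ_lo * α, ρ_hi * α], so with the defaults each backtracking iteration
# shrinks the step by at least a factor 2 and at most a factor 10.
next_alpha(α, α_interp; ρ_lo = 0.1, ρ_hi = 0.5) = clamp(α_interp, ρ_lo * α, ρ_hi * α)

next_alpha(1.0, 0.9)    # 0.5, capped at ρ_hi * α
next_alpha(1.0, 0.01)   # 0.1, floored at ρ_lo * α
next_alpha(0.5, 0.2)    # 0.2, already within bounds
```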

@Huite
Contributor

Huite commented Aug 9, 2024

To get a slightly better feeling for these methods: https://www.stochasticlifestyle.com/differences-between-methods-for-solving-stiff-odes/

@SouthEndMusic
Collaborator

SouthEndMusic commented Aug 11, 2024

TL;DR: I still think we should give this a try.

The outflow should go down to the inflow due to the reduction factors

@visr that is not how it is implemented now, but that is what is suggested here. Currently outflows of a basin don't depend on inflows (only on the storage), but to me it makes sense that they would. Note that this adds new non-zeros to the Jacobian, which is much easier to play with now that #1606 is in place (at least we can test with dense Jacobians).

Additionally, with the current implementation it is much more finicky to find a steady state solution for depleted basins with in- and outflow (the solver has to find exactly the storage where the reduction factor makes the outflow match the inflow; in hindsight that's probably what you meant, Martijn) than with the aforementioned suggestion.
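
For concreteness, a minimal sketch of that steady-state condition with an illustrative linear reduction factor; the ramp shape and the threshold are assumptions for the example, not Ribasim's exact implementation (the in- and outflow numbers are the ones from this model):

```julia
# Illustrative storage-based reduction factor: linear ramp from 0 to 1
# over [0, threshold], and 1 above it.
reduction_factor(S, threshold) = clamp(S / threshold, 0.0, 1.0)

# With outflow depending only on storage, a depleted basin is at steady
# state only at the exact storage S* where
#   reduction_factor(S*, threshold) * Q_out_max == Q_in,
# which is the point the nonlinear solver has to pinpoint.
Q_in, Q_out_max, threshold = 29.0, 62.0, 10.0
S_star = threshold * Q_in / Q_out_max   # ≈ 4.68 for this linear ramp
```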

I played around a bit with writing a custom line search algorithm, from which I learned that:

  • The solver is happy to work with negative storages as intermediate results in the non-linear solver. That makes sense now that I observed it, and hints at when isoutofdomain is triggered.
  • I tried enforcing non-negative storages with a custom line search, but that does not work because the line search is not always triggered, I guess because sometimes the initial guess for the nonlinear solver is already good enough.
  • The fraction α of the step can get extremely small, on the order of 1e-19.

I also learned (with the help of claude.ai) that QNDF does a predictor step based on previous results, which I can imagine can be far off with an abrupt change in dynamics due to the reduction factor.

@Huite
Contributor

Huite commented Aug 12, 2024

Currently outflows of a basin don't depend on inflows (only on the storage), but to me it makes sense that that would be the case. Note that this adds new non-zeros to the Jacobian, which is much easier to play with with #1606 in place (at least we can test with dense Jacobians).

But the storage does depend on the inflows, and the outflows depend on the storage. So in the (linearized) system of equations the outflows clearly do depend on the inflows.

The solver is happy to work with negative storages as intermediate results in the non-linear solver

So the result of a Newton iteration might contain negative storages?

Does this mean it's calling the rhs function (water_balance!) with negative storages?
How many of our du/dt terms are robust for negative storage? Now that I think of it, it might be smart to have unit tests in place for each du/dt term so that we know they don't accidentally explode. I reckon every potential outflow should return 0 for negative storage (?).
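
A minimal sketch of what such a unit test could look like, with a hypothetical outflow_term standing in for an individual du term (not an existing Ribasim function):

```julia
using Test

# Hypothetical stand-in for one outflow term in water_balance!;
# whatever the formulation, it should stay finite and return 0
# for non-positive storage.
outflow_term(S) = S <= 0 ? 0.0 : 62.0 * clamp(S / 10.0, 0.0, 1.0)

@testset "outflow terms are robust for negative storage" begin
    for S in (-1e6, -1.0, -1e-12, 0.0)
        @test outflow_term(S) == 0.0
        @test isfinite(outflow_term(S))
    end
end
```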

it is much more finicky to find a steady state solution for depleted basins with in- and outflow (the solver has to find exactly the storage where the reduction factor makes the outflow match the inflow; in hindsight that's probably what you meant, Martijn)

I don't follow why it is more finicky. Isn't a basin always above depletion if there's still outflow? The outflow is distributed across different terms, each very small. The linearized problem provides enough information that the linear solver can find such a point, although it may result in negative storage. If the next rhs evaluation results in setting all outflows to 0, the next iteration should result in a small positive storage again.

Apparently the solver will dip into negative storage, re-formulate, and then converge with (wrong) positive storages -- otherwise the solution would be rejected. Shouldn't that go away if outflow becomes zero with negative storage, then?

@evetion
Member

evetion commented Aug 12, 2024

Does this mean it's calling the rhs function (water_balance!) with negative storages? How many of our du/dt terms are robust for negative storage? Now that I think of it, it might be smart to have unit tests in place for each du/dt term so that we know they don't accidentally explode. I reckon every potential outflow should return 0 for negative storage (?).

Might be relevant, from the docstring of the PositiveDomain callback:

Please note, that the system should be defined also outside the positive domain, since even with these approaches, negative variables might occur during the calculations. Moreover, one should follow Shampine's et al. advice and set the derivative $x_i'$ of a negative component $x_i$ to $\max\{0, f_i(x, t)\}$, where $t$ denotes the current time point with state vector $x$ and $f_i$ is the $i$-th component of function $f$ in an ODE system $x' = f(x, t)$.
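
For context, a minimal sketch of attaching that callback via DiffEqCallbacks.jl on a toy problem; this is not how Ribasim currently guards negative storage (that goes through isoutofdomain and the callback mentioned earlier in this thread):

```julia
using OrdinaryDiffEq, DiffEqCallbacks

f!(du, u, p, t) = (du .= -p .* u)   # toy decay toward zero storage
prob = ODEProblem(f!, [1.0], (0.0, 50.0), 0.5)

# PositiveDomain rejects steps that leave the positive domain and retries
# them with a smaller dt, following the Shampine et al. approach above.
sol = solve(prob, QNDF(); callback = PositiveDomain())
```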

@Huite
Contributor

Huite commented Aug 12, 2024

Good point, I remember reading this. The Shampine paper is this one:

https://www.sciencedirect.com/science/article/abs/pii/S0096300304009683

From the PDF, with non-garbled math:

[image]

My first suggestion now would be to add some unit tests to enforce du=0 for S <= 0, and potentially add a sigmoid to sidestep resulting discontinuities.

@visr
Member Author

visr commented Aug 12, 2024

Ha, we had implemented that advice but removed it when we started using isoutofdomain, see https://github.com/Deltares/Ribasim/pull/639/files. I'll try adding that back and see the effect.

EDIT: For the test model I see negative storage, but not with a negative du, so it had no effect.

@Huite
Contributor

Huite commented Aug 12, 2024

Uh, I think setting the total du to zero might not be the best solution. I'm not familiar enough with the fancier solvers which track several previous steps, but if you're using ImplicitEuler, it will base du only on the last step. In that case, if u < 0, then du for those states will be set to zero as well, and every iteration afterwards will also set du = 0. Only in the next time step will it then reformulate and become dynamic again.

Instead, I think you should allow positive individual terms (inflows, e.g. precipitation), but truncate individual negative terms. That way, if a storage ends up below zero it can still go back into the positive.

EDIT: pretty sure this is what MODFLOW 6 does for the NPF (internodal flows) Newton formulation too.
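
A minimal sketch of the difference, with hypothetical per-term values taken from the numbers earlier in this thread (not Ribasim's actual formulation):

```julia
# Per-term truncation: keep positive terms (precipitation, upstream
# inflow) active, but zero each outflow term once storage is
# non-positive, so a state that dipped below zero can still climb back.
function du_per_term(S; precipitation, upstream_in, outlet_out, user_demand_out)
    gate = S > 0 ? 1.0 : 0.0   # truncate only the outflow terms
    return precipitation + upstream_in - gate * (outlet_out + user_demand_out)
end

# With a slightly negative storage, du is +29 m3/s, so the storage
# recovers instead of freezing at du = 0:
du_per_term(-0.1; precipitation = 0.0, upstream_in = 29.0,
            outlet_out = 50.0, user_demand_out = 12.0)
```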

@Huite
Contributor

Huite commented Aug 12, 2024

Other suggestions:

  1. Just try a scalar relaxation value.

  2. Evaluate the residual and repeatedly apply a scalar reduction (both SWAP and MODFLOW 6 use this approach):

SWAP:
[image]

MODFLOW:
[image]
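
A minimal sketch of suggestion 2 as a generic residual-based scalar damping, in the spirit of the SWAP and MODFLOW 6 schemes referenced above but not a transcription of either:

```julia
using LinearAlgebra

# Damp a Newton-type update Δu with a single scalar θ: halve θ until
# the residual norm no longer increases, then accept the damped step.
function relaxed_update(residual, u, Δu; θ = 1.0, θmin = 1e-4)
    r0 = norm(residual(u))
    while θ > θmin
        u_trial = u + θ * Δu
        norm(residual(u_trial)) <= r0 && return u_trial
        θ /= 2   # repeatedly apply the scalar reduction
    end
    return u + θmin * Δu   # fall back to a tiny step
end
```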

visr added a commit that referenced this issue Aug 13, 2024
Fixes a part of #1705.

This no longer has the huge speedup of #1697, but it still gives a good 10-20% performance improvement, and no breakdowns. If we go over to NonlinearSolve.jl it probably makes sense to try out other algorithms again.
@visr removed the bug and v1.0 Release! labels on Aug 22, 2024
@SouthEndMusic changed the title from "Look into BackTracking affect on new test model" to "Look into BackTracking effect on new test model" on Aug 26, 2024
@visr
Member Author

visr commented Sep 3, 2024

Fixed by #1761.

@visr closed this as completed Sep 3, 2024