Show convergence bottleneck nodes to users #452

Closed
visr opened this issue Jul 24, 2023 · 6 comments · Fixed by #1440

visr commented Jul 24, 2023

Related to #419, if a run doesn't converge or is slower than expected, we need good tools to investigate the issue.

One thing that will help is to look at the solver stats:

julia> using Ribasim
[ Info: Precompiling Ribasim [aac5e3d9-0b8f-4d4f-8241-b1a7a9632635]

julia> model = Ribasim.run("nl.toml")
┌ Warning: Newton steps could not converge and algorithm is not adaptive. Use a lower dt.
└ @ SciMLBase C:\Users\visser_mn\.julia\packages\SciMLBase\wvDeR\src\integrator_interface.jl:612
Model(ts: 2, t: 2020-01-01T00:00:00)

julia> model.integrator.sol.stats
DiffEqBase.Stats
Number of function 1 evaluations:                  8803
Number of function 2 evaluations:                  0
Number of W matrix evaluations:                    1
Number of linear solves:                           10
Number of Jacobians created:                       1
Number of nonlinear solver iterations:             10
Number of nonlinear solver convergence failures:   1
Number of rootfind condition calls:                0
Number of accepted steps:                          0
Number of rejected steps:                          0
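
For scripted checks, something like this could flag a problematic run from the same stats. A minimal sketch: the field names nnonlinconvfail and nreject are assumed to match the stats object printed above and may differ between SciMLBase versions.

# Flag the run if the solver reported nonlinear convergence failures or
# rejected steps (field names assumed; adjust if the stats object differs).
stats = model.integrator.sol.stats
if stats.nnonlinconvfail > 0 || stats.nreject > 0
    @warn "Potential convergence problem" failures = stats.nnonlinconvfail rejected = stats.nreject
end
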

Additionally it would be nice to easily locate the nodes in the network that are causing the issues. One thing @Huite suggested was to calculate some stability criterion, like the Basin retention time or the Courant number.

The goal of this issue is to collect some tools, some of which could be useful for the documentation.

visr added the performance label Jul 24, 2023

visr commented Aug 4, 2023

We already depend on TimerOutputs but currently don't really use it. If we hook it up properly, the results can be really helpful, as they are for https://github.com/trixi-framework/Trixi.jl:

 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              595ms /  84.3%           29.3MiB /  50.9%

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 I/O                            3    290ms   57.8%  96.7ms   6.91MiB   46.4%  2.30MiB
   ~I/O~                        3    189ms   37.6%  63.0ms    948KiB    6.2%   316KiB
   save solution                2    101ms   20.1%  50.4ms   5.84MiB   39.2%  2.92MiB
   get element variables        2    380μs    0.1%   190μs    151KiB    1.0%  75.7KiB
   save mesh                    2    400ns    0.0%   200ns     0.00B    0.0%    0.00B
 analyze solution               2    188ms   37.4%  93.9ms   7.86MiB   52.7%  3.93MiB
 rhs!                          91   24.3ms    4.8%   267μs    137KiB    0.9%  1.51KiB
   reset ∂u/∂t                 91   21.7ms    4.3%   239μs     0.00B    0.0%    0.00B
   volume integral             91    776μs    0.2%  8.53μs   25.6KiB    0.2%     288B
   ~rhs!~                      91    459μs    0.1%  5.04μs   9.33KiB    0.1%     105B
   interface flux              91    437μs    0.1%  4.80μs   29.9KiB    0.2%     336B
   prolong2interfaces          91    366μs    0.1%  4.03μs   27.0KiB    0.2%     304B
   surface integral            91    264μs    0.1%  2.90μs   24.2KiB    0.2%     272B
   Jacobian                    91    223μs    0.0%  2.45μs   21.3KiB    0.1%     240B
   prolong2boundaries          91   10.9μs    0.0%   120ns     0.00B    0.0%    0.00B
   prolong2mortars             91   10.3μs    0.0%   113ns     0.00B    0.0%    0.00B
   mortar flux                 91   9.50μs    0.0%   104ns     0.00B    0.0%    0.00B
   source terms                91   6.50μs    0.0%  71.4ns     0.00B    0.0%    0.00B
   boundary flux               91   6.20μs    0.0%  68.1ns     0.00B    0.0%    0.00B
 calculate dt                  19   6.60μs    0.0%   347ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────
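
For reference, hooking it up could look roughly like the sketch below, using the TimerOutputs API directly. water_balance! and write_results are placeholders for whichever Ribasim functions we decide to instrument, not the actual integration.

using TimerOutputs

const to = TimerOutput()

# Wrap the hot code paths in named sections (function names are placeholders).
@timeit to "rhs" water_balance!(du, u, p, t)
@timeit to "output" write_results(model)

# At the end of the run, print a table like the Trixi.jl one above.
print_timer(to)
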

visr added the enhancement and epic labels Sep 18, 2023
visr changed the title from "debugging slow runs" to "Debugging convergence issues" Jan 18, 2024

Huite commented Mar 14, 2024

We have fairly extensive experience with slow-running models in variable-density SEAWAT simulations, since the solute transport component MT3D does adaptive timestepping. Without exception, large flows in small cells are a problem. We analyze those by looking at the flow output and computing flow velocities.

The same is probably true for Ribasim: if you have output, you can color edges by computed flows. You can also compute derived data, such as average residence times (current storage / current outflow). These can easily be plotted on a map and will likely help identify which nodes are problematic.
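
As a rough sketch of the residence-time idea (storage, outflow, and node_ids below are placeholder vectors taken from the output at one timestep, not actual Ribasim column names):

# Rank basins by residence time = current storage / current outflow;
# very small values point at basins that empty quickly and tend to be stiff.
residence_time = storage ./ max.(outflow, eps())
order = sortperm(residence_time)
for i in first(order, 10)
    println("Basin ", node_ids[i], ": residence time ≈ ", residence_time[i], " s")
end
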


gijsber commented Mar 14, 2024

Flow velocities are also appreciated output from an ecological perspective (as indicated by Ellis Penning).


Huite commented Mar 22, 2024

Another thought: we're currently relying on DifferentialEquations.jl machinery to do the timestepping for us.

As far as I know, it compares a higher- and a lower-order solution to estimate whether the time step size has to be adjusted; essentially, I expect it to compare du_lower and du_higher. If we can investigate these a bit and get access to them, we should be able to output them and interpret them spatially.

This should help us and users identify which nodes and connections are problematic.
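
For context, the standard embedded-pair error measure is a per-state quantity, so it can indeed be mapped back to nodes. A sketch of the textbook formula (not the exact DifferentialEquations.jl internals), where u_low and u_high are the lower- and higher-order solutions:

# err[i] >> 1 means state i is the one forcing the step size down.
err = @. (u_high - u_low) / (abstol + reltol * max(abs(u), abs(u_high)))
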

visr changed the title from "Debugging convergence issues" to "Show convergence bottleneck nodes to users" Apr 23, 2024
Jingru923 removed the epic label Apr 25, 2024
SouthEndMusic commented

I asked about this on the Julia discourse:

https://discourse.julialang.org/t/find-out-which-state-s-cause-convergence-problems/113738


SouthEndMusic commented May 2, 2024

Chris (the man, the myth, the legend) has responded. Something like this works (for certain algorithms):

(; cache, p) = model.integrator
(; basin) = p
# Some algorithm caches expose the embedded error estimate as `atmp`.
if hasproperty(cache, :atmp)
    # Per-basin magnitude of the storage error estimate.
    storage_error = abs.(cache.atmp.storage)
    perm = sortperm(storage_error; rev = true)
    println("Basins in descending order of being a convergence bottleneck:")
    for i in perm
        node_id = basin.node_id.values[i]
        err = storage_error[i]
        println("$node_id (error = $err)")
    end
end

though this should probably be logged, and we should have a sensible threshold for when the error is large enough to mention. That also relates to how these error values should be interpreted, which I'm not sure how to do yet.

Edit: they come from calculate_residuals. Maybe the aforementioned threshold can simply be reltol.
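
For example, only reporting errors above that threshold could look like this (a sketch assuming the integrator exposes opts.reltol, and reusing perm and storage_error from the snippet above):

# Report only the basins whose error estimate exceeds the relative tolerance;
# perm is sorted in descending order, so we can stop at the first small error.
threshold = model.integrator.opts.reltol
for i in perm
    storage_error[i] > threshold || break
    println(basin.node_id.values[i], " (error = ", storage_error[i], ")")
end
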

SouthEndMusic added a commit that referenced this issue May 8, 2024

Fixes #452.

To do:
- [x] Implement #452 (comment),
- [x] Add some more basins to the model and test that the quickly emptying one is the bottleneck.