Show convergence bottleneck nodes to users #452

Closed
visr opened this issue Jul 24, 2023 · 6 comments · Fixed by #1440

visr commented Jul 24, 2023

Related to #419, if a run doesn't converge or is slower than expected, we need good tools to investigate the issue.

One thing that will help is to look at the solver stats:

julia> using Ribasim
[ Info: Precompiling Ribasim [aac5e3d9-0b8f-4d4f-8241-b1a7a9632635]

julia> model = Ribasim.run("nl.toml")
┌ Warning: Newton steps could not converge and algorithm is not adaptive. Use a lower dt.
└ @ SciMLBase C:\Users\visser_mn\.julia\packages\SciMLBase\wvDeR\src\integrator_interface.jl:612
Model(ts: 2, t: 2020-01-01T00:00:00)

julia> model.integrator.sol.stats
DiffEqBase.Stats
Number of function 1 evaluations:                  8803
Number of function 2 evaluations:                  0
Number of W matrix evaluations:                    1
Number of linear solves:                           10
Number of Jacobians created:                       1
Number of nonlinear solver iterations:             10
Number of nonlinear solver convergence failures:   1
Number of rootfind condition calls:                0
Number of accepted steps:                          0
Number of rejected steps:                          0
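
For scripted checks, something like this could flag a problematic run from the same stats. A minimal sketch: the field names nnonlinconvfail and nreject are assumed to match the stats object printed above and may differ between SciMLBase versions.

# Flag the run if the solver reported nonlinear convergence failures or
# rejected steps (field names assumed; adjust if the stats object differs).
stats = model.integrator.sol.stats
if stats.nnonlinconvfail > 0 || stats.nreject > 0
    @warn "Potential convergence problem" failures = stats.nnonlinconvfail rejected = stats.nreject
end
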

Additionally it would be nice to easily locate the nodes in the network that are causing the issues. One thing @Huite suggested was to calculate some stability criterion, like the Basin retention time or the Courant number.

The goal of this issue is to collect some tools, some of which could be useful for the documentation.

visr added the performance label Jul 24, 2023

visr commented Aug 4, 2023

We already depend on TimerOutputs but currently don't really use it. If we hook it up properly, the results can be really helpful, as they are for https://github.com/trixi-framework/Trixi.jl:

 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              595ms /  84.3%           29.3MiB /  50.9%

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 I/O                            3    290ms   57.8%  96.7ms   6.91MiB   46.4%  2.30MiB
   ~I/O~                        3    189ms   37.6%  63.0ms    948KiB    6.2%   316KiB
   save solution                2    101ms   20.1%  50.4ms   5.84MiB   39.2%  2.92MiB
   get element variables        2    380μs    0.1%   190μs    151KiB    1.0%  75.7KiB
   save mesh                    2    400ns    0.0%   200ns     0.00B    0.0%    0.00B
 analyze solution               2    188ms   37.4%  93.9ms   7.86MiB   52.7%  3.93MiB
 rhs!                          91   24.3ms    4.8%   267μs    137KiB    0.9%  1.51KiB
   reset ∂u/∂t                 91   21.7ms    4.3%   239μs     0.00B    0.0%    0.00B
   volume integral             91    776μs    0.2%  8.53μs   25.6KiB    0.2%     288B
   ~rhs!~                      91    459μs    0.1%  5.04μs   9.33KiB    0.1%     105B
   interface flux              91    437μs    0.1%  4.80μs   29.9KiB    0.2%     336B
   prolong2interfaces          91    366μs    0.1%  4.03μs   27.0KiB    0.2%     304B
   surface integral            91    264μs    0.1%  2.90μs   24.2KiB    0.2%     272B
   Jacobian                    91    223μs    0.0%  2.45μs   21.3KiB    0.1%     240B
   prolong2boundaries          91   10.9μs    0.0%   120ns     0.00B    0.0%    0.00B
   prolong2mortars             91   10.3μs    0.0%   113ns     0.00B    0.0%    0.00B
   mortar flux                 91   9.50μs    0.0%   104ns     0.00B    0.0%    0.00B
   source terms                91   6.50μs    0.0%  71.4ns     0.00B    0.0%    0.00B
   boundary flux               91   6.20μs    0.0%  68.1ns     0.00B    0.0%    0.00B
 calculate dt                  19   6.60μs    0.0%   347ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────
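
For reference, hooking it up could look roughly like the sketch below, using the TimerOutputs API directly. water_balance! and write_results are placeholders for whichever Ribasim functions we decide to instrument, not the actual integration.

using TimerOutputs

const to = TimerOutput()

# Wrap the hot code paths in named sections (function names are placeholders).
@timeit to "rhs" water_balance!(du, u, p, t)
@timeit to "output" write_results(model)

# At the end of the run, print a table like the Trixi.jl one above.
print_timer(to)
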

visr added the enhancement and epic labels Sep 18, 2023
visr changed the title from "debugging slow runs" to "Debugging convergence issues" Jan 18, 2024

Huite commented Mar 14, 2024

We have fairly extensive experience with slow-running models in variable-density SEAWAT simulations, since the solute transport component MT3D does adaptive timestepping. Without exception, large flows in small cells are a problem. We analyze those by looking at the flow output and computing flow velocities.

The same is probably true for Ribasim: if you have output, you can color edges by computed flows. You can also compute derived data, such as average residence times (current storage / current outflow). These can easily be plotted on a map and will likely help identify which nodes are problematic.
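
As a rough sketch of the residence-time idea (storage, outflow, and node_ids below are placeholder vectors taken from the output at one timestep, not actual Ribasim column names):

# Rank basins by residence time = current storage / current outflow;
# very small values point at basins that empty quickly and tend to be stiff.
residence_time = storage ./ max.(outflow, eps())
order = sortperm(residence_time)
for i in first(order, 10)
    println("Basin ", node_ids[i], ": residence time ≈ ", residence_time[i], " s")
end
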


gijsber commented Mar 14, 2024

Flow velocities are also appreciated output from an ecological perspective (as indicated by Ellis Penning).


Huite commented Mar 22, 2024

Another thought: we're currently relying on DifferentialEquations.jl machinery to do the timestepping for us.

As far as I know, it compares a higher- and a lower-order solution to estimate whether the time step size has to be adjusted; essentially, I expect it to compare du_lower and du_higher. If we can investigate these a bit and get access to them, we should be able to output them and interpret them spatially.

This should help us and users identify which nodes and connections are problematic.
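
For context, the standard embedded-pair error measure is a per-state quantity, so it can indeed be mapped back to nodes. A sketch of the textbook formula (not the exact DifferentialEquations.jl internals), where u_low and u_high are the lower- and higher-order solutions:

# err[i] >> 1 means state i is the one forcing the step size down.
err = @. (u_high - u_low) / (abstol + reltol * max(abs(u), abs(u_high)))
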

visr changed the title from "Debugging convergence issues" to "Show convergence bottleneck nodes to users" Apr 23, 2024
Jingru923 removed the epic label Apr 25, 2024
SouthEndMusic commented

I asked about this on the Julia discourse:

https://discourse.julialang.org/t/find-out-which-state-s-cause-convergence-problems/113738


SouthEndMusic commented May 2, 2024

Chris (the man, the myth, the legend) has responded. Something like this works (for certain algorithms):

(; cache, p) = model.integrator
(; basin) = p
# Some algorithm caches expose the embedded error estimate as `atmp`.
if hasproperty(cache, :atmp)
    # Per-basin magnitude of the storage error estimate.
    storage_error = abs.(cache.atmp.storage)
    perm = sortperm(storage_error; rev = true)
    println("Basins in descending order of being a convergence bottleneck:")
    for i in perm
        node_id = basin.node_id.values[i]
        err = storage_error[i]
        println("$node_id (error = $err)")
    end
end

though this should probably be logged, and we should have a sensible threshold for when the error is large enough to mention. That also relates to how these error values should be interpreted, which I'm not sure how to do yet.

Edit: they come from calculate_residuals. Maybe the aforementioned threshold can simply be reltol.
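
For example, only reporting errors above that threshold could look like this (a sketch assuming the integrator exposes opts.reltol, and reusing perm and storage_error from the snippet above):

# Report only the basins whose error estimate exceeds the relative tolerance;
# perm is sorted in descending order, so we can stop at the first small error.
threshold = model.integrator.opts.reltol
for i in perm
    storage_error[i] > threshold || break
    println(basin.node_id.values[i], " (error = ", storage_error[i], ")")
end
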

SouthEndMusic added a commit that referenced this issue May 8, 2024

Fixes #452.

To do:
- [x] Implement #452 (comment),
- [x] Add some more basins to the model and test that the quickly emptying one is the bottleneck.