
"Interrupted. Larger maxiters is needed." warning when garbage collection is added #1001

Closed
novalex98 opened this issue Nov 21, 2023 · 7 comments · Fixed by SciML/OrdinaryDiffEq.jl#2071

@novalex98

I am solving a boundary value problem using a shooting method, repeatedly calling an ODE solver from within an outer ODE solver. This code runs fine on my PC, but when I moved it to an HPC environment that uses Slurm, it crashed.

I discovered (https://discourse.julialang.org/t/is-there-a-way-to-limit-memory-as-reported-by-sys-total-memory/93909) that triggering garbage collection periodically prevents the crash by keeping the code from exceeding the amount of memory Slurm allocates to it.

However, when I add the garbage collection calls I see a type of warning I haven't seen before. Similar to issue #739, I get messages saying "Interrupted. Larger maxiters is needed. If you are using an integrator for non-stiff ODEs or an automatic switching algorithm (the default), you may want to consider using a method for stiff equations. See the solver pages for more details". I'm using RadauIIA5(), so stiffness isn't the issue, and I don't get this warning on my PC (or on the login node of the HPC system).

This doesn't seem to fix the problem: #680

@novalex98 novalex98 added the bug label Nov 21, 2023
@novalex98 novalex98 changed the title "Interrupted. Larger maxiters is needed." warning during garbage collection "Interrupted. Larger maxiters is needed." warning when garbage collection is added Nov 21, 2023
@ChrisRackauckas
Member

Please share the MWE.

Are you appropriately setting the heap size when moving it to the HPC setting?

And BTW, did you try other BVP solvers like the MIRK ones? Generally those will scale better.

@novalex98
Author

novalex98 commented Nov 22, 2023

Dear Dr. Rackauckas,

I'll develop an MWE. I can't share the code publicly, but I strongly suspect I can replicate it with a generic stiff problem.

The heap size is appropriate (I've tried 1G, 10G, and 50G; the total memory available to the program is 100G, and they all behave the same), and I've shown that the problem is with garbage collection, not with the HPC environment itself: I can induce the same issue on my laptop by adding GC.gc() calls to the code. It only showed up on the HPC system because the limited heap size forced garbage collection to run.

Regarding BVP solvers, this system is highly unstable and will often have a lot of +Inf/-Inf values around the correct local minimum, so, for example, the Regula Falsi method doesn't find it. I'll look into the BVP solvers more, but I'm skeptical because the system is so unstable.
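To be concrete, here is a minimal sketch of the kind of setup I mean; the toy ODE, tolerances, and names are placeholders, not my actual code:

```julia
using OrdinaryDiffEq

# Placeholder stiff right-hand side (not my real system).
function rhs!(du, u, p, t)
    du[1] = -1000.0 * u[1] + 3000.0 - 2000.0 * exp(-t)
end

prob = ODEProblem(rhs!, [0.0], (0.0, 1.0))

# Repeatedly re-solve, forcing a full garbage collection between runs,
# the way I do between shooting iterations on the cluster.
for i in 1:10
    sol = solve(prob, RadauIIA5(); abstol = 1e-10, reltol = 1e-10)
    GC.gc()   # this call is what correlates with the maxiters warning for me
end
```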

@novalex98
Author

I haven't quite replicated the issue, but I'm seeing a situation where garbage collection causes a change that seems related. The two attached scripts are identical except that one runs garbage collection and the other doesn't.

The one without garbage collection gives a lot of

"Warning: dt() <= dtmin() at t = 0.0"

The one with garbage collection gives far fewer warnings like this, and none after garbage collection runs the first time. In this case that doesn't stop it from pushing through (if anything it improves performance), but it's not clear to me why the appearance of these warnings depends on a garbage collection call later in the code.
Stiff_GC_MWE_3.txt
Stiff_GC_MWE.txt

@ChrisRackauckas
Member

I'm not sure I understand the MWE:

  • tol_solve_ivp = 10^(-30): that's impossible to achieve with Float64 if relative values are around 10^4 for some things.
  • The Jacobian function is not correct.

My immediate guess, just from looking at it, is that there must be some uninitialized memory being used in the RadauIIA5 Jacobian, because it's an odd one. So two questions:

  1. Does this only happen with RadauIIA5 and no other solver?
  2. What happens if you add jac_mat .= 0 to the start of the Jacobian function?
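Something along these lines is what I mean; the entries here are placeholders, only the jac_mat .= 0 pattern matters:

```julia
# In-place Jacobian that zeroes the buffer first, so stale values in
# entries it never writes can't leak through from previous memory contents.
function jacobian!(jac_mat, u, p, t)
    jac_mat .= 0                  # clear any uninitialized/stale memory
    jac_mat[1, 1] = -p[1]         # then fill only the nonzero entries
    jac_mat[1, 2] =  u[2]
    jac_mat[2, 1] =  u[1]
    jac_mat[2, 2] = -p[2]
    return nothing
end
```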

@novalex98
Author

"tol_solve_ivp = 10^(-30) that's impossible to achieve with Float64 if relative values are around 10^4 for some things."

That's odd to me because I've had it successfully converge with Radau.

I used Matlab's symbolic math toolbox to get my Jacobian. Where's the error?

@novalex98
Author

novalex98 commented Nov 25, 2023

Adding jac_mat .= 0 to the start of the Jacobians in my original code fixed the problem, at least on my PC, alongside fixing a typo: in one of my Jacobians I had a mistake of the form

function jacobian(jac_u, u, p, t)
    jac_v = [stuff]
    return jac_v
end

where the variable in the function signature and the variable assigned and returned were different.
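For comparison, the corrected form mutates the buffer the solver passes in rather than allocating and returning a new matrix under a different name. (The entries below are placeholders, not my real Jacobian.)

```julia
# Corrected in-place form: write into the argument that was passed in.
function jacobian!(jac_u, u, p, t)
    jac_u .= 0            # clear first, so unset entries are zero, not stale memory
    jac_u[1, 1] = -u[2]   # placeholder entries
    jac_u[1, 2] = -u[1]
    return nothing        # in-place functions return nothing; the solver reads jac_u
end
```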

@ChrisRackauckas
Member

ChrisRackauckas commented Nov 26, 2023

Okay awesome, this narrows it down. For one of the Jacobians in RadauIIA5, uninitialized memory was being used. In almost any normal case someone would fill the whole J, but for safety we normally default Jacobian memory to zero. Because this memory was uninitialized, running GC vs. not running GC changed which random values ended up in there. This is fixed by SciML/OrdinaryDiffEq.jl#2071, which adds the zeroing to W2, effectively doing the jac_mat .= 0 at construction time so you don't have to. That means any value not specified in the Jacobian defaults to zero, which is true in all other methods, but RadauIIA5 is unusual in that it defines its own Jacobian, and we missed adding that safety step in that spot.

That's odd to me because I've had it successfully converge with Radau.

In theory it's possible, though Float64 can only guarantee a relative accuracy of around 1e-16, i.e. 1 + 1e-16 == 1 in floating-point arithmetic, so if you have values near 1 (which you seem to have) then making something converge to a tolerance of 1e-30 isn't likely to happen. It's possible, but not in general.
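You can see the resolution limit directly from eps:

```julia
# Float64 resolution near 1.0: eps(Float64) ≈ 2.22e-16, so adding
# anything much smaller than that to 1.0 is lost entirely.
@assert 1.0 + 1e-16 == 1.0    # the 1e-16 is rounded away
@assert 1.0 + 1e-15 != 1.0    # above eps, the increment survives
@assert eps(Float64) == 2.220446049250313e-16
```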

I used Matlab's symbolic math toolbox to get my Jacobian. Where's the error?

The file that you sent had:

function Jacobian_func(jac_mat, u, p, t)
	p1, p2, p3, p4 = p
	
end

Not sure if that was intentional, but it did expose the fact that jac_mat was never written to, so its memory stayed uninitialized; in that sense the example did reproduce the cause.

where the variable in the definition of the function and the variable returned were different.

Yup, so it was using the uninitialized memory. Awesome, this should at least give deterministic behavior now, treating unset Jacobian entries as zero (and we should find a better way to error in these cases).
