Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NetCDFOutputWriter sometimes outputs twice at approximately the same time step #3056

Open
tomchor opened this issue Apr 9, 2023 · 18 comments

Comments

@tomchor
Copy link
Collaborator

tomchor commented Apr 9, 2023

I noticed that occasionally the NetCDFOutputWriter will output twice at what is virtually the the same time.

As an example, I'm showing below the time interval between time steps of one of my netcdf files created with a nominal TimeInterval of approximately 1236. However, you see that (while most of the times are separate by the correct amount) there are some instances with outputs separated by something much smaller:

julia> diff(ds[:time])
255-element Vector{Float64}:
 1236.3673926631677
 1236.3673926631677
 1236.3673926631677
 1236.3673926631677
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.367392663169
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
 1236.3673926631673
    
 1236.3673926631454
 1236.3673926631454
 1236.3673926632036
    5.820766091346741e-11
 1236.3673926631454
 1236.3673926631454
 1236.3673926631454
 1236.3673926631454
 1236.3673926631454
 1236.3673926631454
 1236.3673926632036
    5.820766091346741e-11
 1236.3673926631454
 1236.3673926632036
   73.29669670568546
 1163.07069595746
 1236.3673926631454
 1236.3673926631454
 1236.3673926631454
 1236.3673926631454
 1236.3673926631454

Furthermore, In this specific case, if the time steps were done correctly we would have 251 time steps, but we have 255.

Unfortunately I haven't been able to reproduce this in a MWE. My guess is that it needs a TimeStepWizard (which I am using) and complex dynamics which make the wizard increase and decrease the time step relatively often (which also happens in my case).

Two things to note:

  • At no point while making this NetCDF file I restarted the simulation. It has always been one go.
  • This seems to happen more often with a Float32 type than with a Float64.
@glwagner
Copy link
Member

glwagner commented Apr 9, 2023

Sounds like a roundoff error. I've noticed similar problems.

@glwagner
Copy link
Member

glwagner commented Apr 9, 2023

Maybe related to #2321

@tomchor
Copy link
Collaborator Author

tomchor commented Apr 9, 2023

You're right, it does sound the same problem. My "extra time steps" also tend to happen at the end of long runs.

@mncrowe
Copy link

mncrowe commented Jun 5, 2024

MWE attached. This works for me on v0.91.0 and the issue seemingly doesn't require any TimeStepWizard.

(It seems to take around 8 mins to run which feels slow given the problem size?)

MWE.zip

@mncrowe
Copy link

mncrowe commented Jun 5, 2024

@tomchor, same results on the tc/timestepping branch

@tomchor
Copy link
Collaborator Author

tomchor commented Jun 5, 2024

@tomchor, same results on the tc/timestepping branch

Ah, bummer. I guess we need to investigate a little more.

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Can you copy paste the minimal code here? Hopefully it's sort (otherwise its' not minimal)

@mncrowe
Copy link

mncrowe commented Jun 6, 2024

Can you copy paste the minimal code here? Hopefully it's sort (otherwise its' not minimal)

using Oceananigans, Printf

Nt = 200		# number of time saves
T = 8e5*π/7		# simulation stop time (s)
Δt = 16/15		# timestep (s)
filename = "MWE_data"	# save name of data file
architecture = CPU()	# GPU() or CPU()

grid = RectilinearGrid(architecture, size = (), topology=(Flat, Flat, Flat))

model = NonhydrostaticModel(; grid, advection = CenteredFourthOrder(), timestepper = :RungeKutta3)

simulation = Simulation(model, Δt=Δt, stop_time=T)

progress_message(sim) = @printf("Iteration: %03d, time: %s, Δt: %s, wall time: %s\n",
	iteration(sim), prettytime(sim), prettytime(sim.Δt), prettytime(sim.run_wall_time))

add_callback!(simulation, progress_message, IterationInterval(1000))

fields = Dict("u" => model.velocities.u);

simulation.output_writers[:field_writer] = NetCDFOutputWriter(model, fields, 
	filename = filename * ".nc", schedule = TimeInterval(T/Nt), overwrite_existing = true)

run!(simulation)

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Nice, thank you! I don't think it's necessary to use an output writer because schedules work the same for output and callbacks. Here's another possible MWE:

using Oceananigans

Ns = 200    # number of time saves
T = 8e5*π/7 # simulation stop time (s)
Δt = 16/15  # timestep (s)

grid = RectilinearGrid(size = (), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt, stop_time=T)

captured_times = []
capture_time(sim) = push!(captured_times, time(sim))
add_callback!(simulation, capture_time, TimeInterval(T/Ns))

run!(simulation)

@show captured_times

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Hmm, this actually doesn't reproduce the problem for me, although it reveals another problem wherein captured_times sometimes has 200 elements and sometimes has 201. I think the expected number is 201 here, when it's 200, it's because the simulation has ended before the 201st output occurs.

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Here's an idea: rather than tracking previous_actuation_time, perhaps TimeInterval should track the number of times it's been actuated. Then we compute the next time as just (N + 1) * T where N is the number of actuations and T is the time interval. I think this will fix the current MWE but not sure about other situations.

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

How about #3616 @mncrowe ?

@mncrowe
Copy link

mncrowe commented Jun 6, 2024

How about #3616 @mncrowe ?

As in I should test on glw/time-interval-fix branch?

@mncrowe
Copy link

mncrowe commented Jun 6, 2024

How about #3616 @mncrowe ?

As in I should test on glw/time-interval-fix branch?

Done, to conclude:

My MWE replicates the issue on 'main' but your fix on 'glw/time-interval-fix' works.

Your MWE does not replicate the issue for me on either branch.

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Oh! maybe the time-stepper matters...

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Hmm, @tomchor had a notion for how to minimize the time-stepper effect too.

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Ok, updated MWE:

using Oceananigans

FT = Float64
Ns = 200    # number of time saves
T = 8e5*π/7 # simulation stop time (s)
Δt = 16/15  # timestep (s)

grid = RectilinearGrid(FT, size = (), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid, timestepper=:RungeKutta3)
simulation = Simulation(model; Δt, stop_time=T)

captured_times = []
capture_time(sim) = push!(captured_times, time(sim))
callback = Callback(capture_time, TimeInterval(T/Ns))
add_callback!(simulation, callback)

run!(simulation)

@show time(simulation) iteration(simulation)
@show length(captured_times)
@show time(simulation) == T

I added an FT parameter since @tomchor remarked that this issue happens more often with Float32 (which makes sense if we attribute this to roudn off error)

@glwagner
Copy link
Member

glwagner commented Jun 6, 2024

Ok yes with this change, I get 212 captured_times on main, but the expected 201 on glw/time-interval-fix.

PS @mncrowe single backticks "`" format in-line code (not single quotes) and for code blocks you want to surround the block with triple backticks "```" (not double backticks)

PPS compilation seems to stall for FT=Float32, not sure why.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants