Restart from checkpoint loses particles (diagnostic file size decreases significantly) #3203

Closed
sheroytata opened this issue Jun 27, 2022 · 3 comments

@sheroytata

WarpX@21.12

22.01 had some library issues while running

I am running on GPUs and using the restart-from-checkpoint feature, but I end up losing particles upon restart.

It looks like the simulation time also falls:

Running normally, starting from step 0 without a restart (killed since the job's time limit was reached):

STEP 30122 starts ...
STEP 30122 ends. TIME = 2.559001466e-12 DT = 8.495456697e-17
Evolve time = 77712.85459 s; This step = 2.491132788 s; Avg. per step = 2.579936743 s

Simulation time after restarting at step 30000:

STEP 30122 starts ...
STEP 30122 ends. TIME = 2.559001466e-12 DT = 8.495456697e-17
Evolve time = 1115.337932 s; This step = 2.220336455 s; Avg. per step = 9.142114198 s

It seems that the simulation time per step also falls from 2.5 s to 2.2 s!

I am also dumping out the fields with openPMD, and the file size drops from 410 GB to 280 GB (with just 100 steps after the restart)!

Without the restart a smaller version of the simulation seems to work well.

Also, upon restart I had to invert amrex.abort_on_out_of_gpu_memory from its default; otherwise the simulation errors out with a GPU out-of-memory error and WarpX does not run after restarting from the checkpoint.

I also have exclusive access to the GPUs on the node; no one else is currently running there.

I have tried the restart example in the code and have not yet been able to recreate the issue.

In the end it seems like the simulation itself runs fine, but writing particles after the restart does not work correctly, since the fields are still there.

At the start of the simulation I also get the warning below:
Multiple GPUs are visible to each MPI rank, but the number of GPUs per socket or node has not been provided.
This may lead to incorrect or suboptimal rank-to-GPU mapping.

Attached is an example of the electron density (the issue seems to affect all the slices); notice the particles at the front of the moving window.
[two images: electron density slices]

But the field seems to be good!
[two images: field slices]

Some of the parameters are:

amr.n_cell = 384 384 13440   

amr.max_grid_size_x = 128   
amr.max_grid_size_y = 128
amr.max_grid_size_z = 128

amr.blocking_factor_x = 128  
amr.blocking_factor_y = 128
amr.blocking_factor_z = 128

boundary.field_lo = pml pml pml
boundary.field_hi = pml pml pml
warpx.pml_ncell = 8


electrons.do_continuous_injection = 1
ions.do_continuous_injection = 1

For restart:

amr.restart = "/.<pathToCheckpoints>/checkpoints/30000/"
amrex.abort_on_out_of_gpu_memory = 1   # the restart only works sometimes with the default value

The full input script is below.

For saving the checkpoints I use:

checkpoint.file_prefix = "/<PathToCheckpoints>/checkpoints/"
checkpoint.format = "checkpoint"
checkpoint.intervals = 20000:200000:10000
checkpoint.diag_type = Full
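
If this follows the usual WarpX intervals syntax (start:stop:period), 20000:200000:10000 should request a checkpoint every 10000 steps between steps 20000 and 200000, which is consistent with the step-30000 checkpoint used for the restart above.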

WarpX Version info:
CUDA initialized with 1 GPU per MPI rank; 8 GPU(s) used in total
MPI initialized with 8 MPI processes
MPI initialized with thread support level 3
AMReX (21.12) initialized
WarpX (21.12-nogit)
PICSAR (7b5449f92a4b)

@RemiLehe (Member)

Thanks for reporting this issue!
Would you be able to share the full input script for the first simulation and the restarted simulation? (or a modified version thereof that would still allow us to reproduce this issue)

atmyers self-assigned this Jun 28, 2022
@sheroytata (Author)

I tried a much smaller simulation with fewer cells and the problem repeated.
Attached are the input files with different grids. The output dump at step 31000 is much smaller.

inputs_3d_LongTime.txt

inputs_3d_LongTime_GridChange.txt

For the first run, just comment out the restart line and change max_step to 30000; for the next run, uncomment the restart line and change max_step to 60k (see the sketch below).
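
For illustration, a minimal sketch of the difference between the two runs, reusing the amr.restart and max_step parameters shown earlier in this issue (the checkpoint path is a placeholder for wherever the first run wrote its checkpoints):

Run 1 (fresh start, restart line commented out):
max_step = 30000
# amr.restart = "/<pathToCheckpoints>/checkpoints/30000/"

Run 2 (restart from the checkpoint written at step 30000):
max_step = 60000
amr.restart = "/<pathToCheckpoints>/checkpoints/30000/"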

@sheroytata (Author)

I finally recompiled 22.05 with the new fixes to LibAblastr. With this version and the input files above, restart seems fine.

Thanks
