Restart from checkpoint loses particles (diagnostic file size decreases significantly) #3203

Closed
sheroytata opened this issue Jun 27, 2022 · 3 comments

@sheroytata

WarpX@21.12

22.01 had some library issues while running

I am running on GPUs and using the restart-from-checkpoint feature, but I end up losing particles upon restart.

It looks like the simulation time also falls:

Running normally, starting from step 0 without a restart (killed since the job's time limit was reached):

STEP 30122 starts ...
STEP 30122 ends. TIME = 2.559001466e-12 DT = 8.495456697e-17
Evolve time = 77712.85459 s; This step = 2.491132788 s; Avg. per step = 2.579936743 s

Simulation time after restarting at step 30000:

STEP 30122 starts ...
STEP 30122 ends. TIME = 2.559001466e-12 DT = 8.495456697e-17
Evolve time = 1115.337932 s; This step = 2.220336455 s; Avg. per step = 9.142114198 s

It seems that the simulation time per step also falls from 2.5 s to 2.2 s!

I am also dumping out the fields with openPMD, and the file size drops from 410 GB to 280 GB (with just 100 steps after the restart)!

Without the restart a smaller version of the simulation seems to work well.

Also, upon restart I had to invert amrex.abort_on_out_of_gpu_memory from its default; otherwise the simulation errors out with a GPU out-of-memory error and WarpX does not run after restarting from the checkpoint.

I also have exclusive access to the GPUs on the node; no one else is currently running there.

I have tried the restart example in the code and have not yet been able to recreate the issue.

In the end it seems like the simulation itself runs fine, but writing particles after the restart does not work correctly, since the fields are still there.

At the start of the simulation I also get the warning below:
Multiple GPUs are visible to each MPI rank, but the number of GPUs per socket or node has not been provided.
This may lead to incorrect or suboptimal rank-to-GPU mapping.

Attached is an example of the electron density (the issue seems to affect all the slices); notice the particles at the front of the moving window.
[two images: electron density slices]

But the field seems to be good!
[two images: field slices]

Some of the parameters are:

amr.n_cell = 384 384 13440   

amr.max_grid_size_x = 128   
amr.max_grid_size_y = 128
amr.max_grid_size_z = 128

amr.blocking_factor_x = 128  
amr.blocking_factor_y = 128
amr.blocking_factor_z = 128

boundary.field_lo = pml pml pml
boundary.field_hi = pml pml pml
warpx.pml_ncell = 8


electrons.do_continuous_injection = 1
ions.do_continuous_injection = 1

For restart:

amr.restart = "/.<pathToCheckpoints>/checkpoints/30000/"
amrex.abort_on_out_of_gpu_memory = 1   # the restart only works sometimes with the default value

The full input script is below.

For saving the checkpoints I use:

checkpoint.file_prefix = "/<PathToCheckpoints>/checkpoints/"
checkpoint.format = "checkpoint"
checkpoint.intervals = 20000:200000:10000
checkpoint.diag_type = Full
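
If this follows the usual WarpX intervals syntax (start:stop:period), 20000:200000:10000 should request a checkpoint every 10000 steps between steps 20000 and 200000, which is consistent with the step-30000 checkpoint used for the restart above.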

WarpX Version info:
CUDA initialized with 1 GPU per MPI rank; 8 GPU(s) used in total
MPI initialized with 8 MPI processes
MPI initialized with thread support level 3
AMReX (21.12) initialized
WarpX (21.12-nogit)
PICSAR (7b5449f92a4b)

@RemiLehe (Member)

Thanks for reporting this issue!
Would you be able to share the full input script for the first simulation and the restarted simulation? (or a modified version thereof that would still allow us to reproduce this issue)

atmyers self-assigned this Jun 28, 2022
@sheroytata (Author)

I tried a much smaller simulation with fewer cells and the problem repeated.
Attached are the input files with different grids. The output dump at step 31000 is much smaller.

inputs_3d_LongTime.txt

inputs_3d_LongTime_GridChange.txt

For the first run, just comment out the restart line and change max_step to 30000; for the next run, uncomment the restart line and change max_step to 60k (see the sketch below).
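
For illustration, a minimal sketch of the difference between the two runs, reusing the amr.restart and max_step parameters shown earlier in this issue (the checkpoint path is a placeholder for wherever the first run wrote its checkpoints):

Run 1 (fresh start, restart line commented out):
max_step = 30000
# amr.restart = "/<pathToCheckpoints>/checkpoints/30000/"

Run 2 (restart from the checkpoint written at step 30000):
max_step = 60000
amr.restart = "/<pathToCheckpoints>/checkpoints/30000/"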

@sheroytata (Author)

I finally recompiled 22.05 with the new fixes to LibAblastr. With this version and the input files above, restart seems fine.

Thanks
