
ice_dyn_vp: use global_sum_* for bit-for-bit reproducibility #763

Closed
wants to merge 11 commits

Conversation

phil-blain
Member

I'm opening this PR as a draft mainly so we can discuss the performance regression I'm hitting, so I have not filled in the whole PR template. I'll do that when we are ready to merge the code (if we do).

Summary:
I refactored the VP solver code so that global sums are computed using the existing CICE global_sum_* subroutines, which means bit-for-bit reproducibility can be achieved if desired using the existing bfbflag namelist setting.

My adventures in doing so are detailed at:

As you can see from the changes in the PR, each call to global_sum is replaced by two calls to global_sum_prod, one for the array holding X components and a second for the array holding Y components.

I used the Intel VTune profiler to profile both the old and new code, compared them, and got these results: phil-blain#40 (comment).

Seeing that the time spent in the MPI_ALLREDUCE in the new code was twice that of the old code, I tried to refactor the code again to do the global sums for both dimensions at the same time:

but that did not really help, which was disappointing...

So I guess I'd like to have some input from others :)

When the implicit VP solver was added in f7fd063 (dynamics: add implicit
VP solver (CICE-Consortium#491), 2020-09-22), it had not yet been tested with OpenMP
enabled.

The OpenMP implementation was carefully reviewed and then fixed in
d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), which led to all runs of the
'decomp' suite completing and all restart tests passing. The 'bfbcomp'
tests are still failing, but this is due to the code not using the CICE
global sum implementation correctly, which will be fixed in the next
commits.

Update the documentation accordingly.
The system installation of OpenMPI at /usr/mpi/gcc/openmpi-4.1.2a1/ is
not compiled with support for PBS. This leads to failures as the MPI
runtime does not have the same view of the number of available processors
as the job scheduler.

Use our own build of OpenMPI, compiled with PBS support, for the
'ppp6_gnu' environment, which uses OpenMPI.
Intel MPI 2021.5.1, which comes with oneAPI 2022.1.2, seems to have an
intermittent bug where a call to 'MPI_Waitall' fails with:

    Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code

and no core dump is produced. This affects at least these cases of the
'decomp' suite:

- *_*_restart_gx3_16x2x1x1x800_droundrobin
- *_*_restart_gx3_16x2x2x2x200_droundrobin

This was reported to Intel and they suggested setting the variable
'I_MPI_FABRICS' to 'ofi' (the default being 'shm:ofi' [1]). This
disables shared memory transport and indeed fixes the failures.

Set this variable for all ECCC machine files using Intel MPI.

[1] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-fabrics-control/communication-fabrics-control.html
Intel MPI, in contrast to OpenMPI (as far as I was able to test, and see
[1], [2]), does not (by default) guarantee that repeated runs of the same
code on the same machine with the same number of MPI ranks yield the
same results when collective operations (e.g. 'MPI_ALLREDUCE') are used.

Since the VP solver uses MPI_ALLREDUCE in its algorithm, this leads to
repeated runs of the code giving different answers, and baseline
comparing runs with code built from the same commit failing.

When generating a baseline or comparing against an existing baseline,
set the environment variable 'I_MPI_CBWR' to 1 for ECCC machine files
using Intel MPI [3], so that (processor) topology-aware collective
algorithms are not used and results are reproducible.

Note that we do not need to set this variable on robert or underhill, on
which jobs have exclusive node access and thus job placement (on
processors) is guaranteed to be reproducible.

[1] https://stackoverflow.com/a/45916859/
[2] https://scicomp.stackexchange.com/a/2386/
[3] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html#i-mpi-adjust-family-environment-variables_GUID-A5119508-5588-4CF5-9979-8D60831D1411
If starting a run with "ice_ic='none'" (no ice), the linearized
problem for the ice velocity A x = b will have b = 0, since all terms in
the right hand side vector will be zero:

- strint[xy] is zero because the velocity is zero
- tau[xy] is zero because the ocean velocity is also zero
- [uv]vel_init is zero
- strair[xy] is zero because the concentration is zero
- strtlt[xy] is zero because the ocean velocity is zero

We thus have a linear system A x = b with b=0, so we
must have x=0.

In the FGMRES linear solver, this special case is not taken into
account, and so we end up with an all-zero initial residual since
workspace_[xy] is also zero because of the all-zero initial guess
'sol[xy]', which corresponds to the initial ice velocity. This then
leads to a division by zero when normalizing the first Arnoldi vector.

Fix this special case by computing the norm of the right-hand-side
vector before starting the iterations, and exiting early if it is zero.
This is in line with the GMRES implementation in SciPy [1].

[1] https://github.com/scipy/scipy/blob/651a9b717deb68adde9416072c1e1d5aa14a58a1/scipy/sparse/linalg/_isolve/iterative.py#L620-L628
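
Schematically, the fix looks something like the following sketch (variable names and argument lists are illustrative, not the exact in-tree code; 'bx', 'by' hold the right-hand-side components and 'solx', 'soly' the solution):

    ! compute ||b|| reproducibly before starting the Arnoldi iterations (sketch)
    norm_rhs = sqrt(global_sum_prod(bx, bx, distrb_info, field_loc_NEcorner) + &
                    global_sum_prod(by, by, distrb_info, field_loc_NEcorner))
    if (norm_rhs == c0) then
       ! b = 0 implies x = 0: return the all-zero solution without iterating
       solx = c0
       soly = c0
       return
    endif
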
The VP solver uses a linear solver, FGMRES, as part of the non-linear
iteration. The FGMRES algorithm involves computing the norm of a
distributed vector field, thus performing global sums.

These norms are computed by first summing the squared X and Y components
of a vector field in subroutine 'calc_L2norm_squared', summing these
over the local blocks, and then doing a global (MPI) sum using
'global_sum'.

This approach does not lead to reproducible results when the MPI
distribution, or the number of local blocks, is changed, for reasons
explained in the "Reproducible sums" section of the Developer Guide
(mostly, floating point addition is not associative). This was partly
pointed out in [1] but I failed to realize it at the time.

Make the results of the VP solver more reproducible by using two calls
to 'global_sum_prod' to individually compute the squares of the X and Y
components when computing norms, and then summing these two reproducible
scalars.
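
Schematically, the old and new patterns compare roughly as follows (a sketch with illustrative names; the in-tree code passes the actual CICE distribution and field-location arguments):

    ! old: per-block local sums of squares, then a scalar global sum (not reproducible)
    !   call calc_L2norm_squared(..., workspace_x, workspace_y, norm_squared(iblk))
    !   norm = sqrt(global_sum(sum(norm_squared), distrb_info))

    ! new: two reproducible global sums (respecting bfbflag), then add the two scalars
    norm = sqrt(global_sum_prod(workspace_x, workspace_x, distrb_info, field_loc_NEcorner) + &
                global_sum_prod(workspace_y, workspace_y, distrb_info, field_loc_NEcorner))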

The same pattern appears in the FGMRES solver (subroutine 'fgmres'), the
preconditioner 'pgmres' which uses the same algorithm, and the
Classical and Modified Gram-Schmidt algorithms in 'orthogonalize'.

These changes result in twice the number of global sums for fgmres,
pgmres and the MGS algorithm. For the CGS algorithm, the performance
impact is higher as 'global_sum_prod' is called inside the loop, whereas
previously we called 'global_allreduce_sum' after the loop to compute
all 'initer' sums at the same time.

To keep that optimization, we would have to implement a new interface
'global_allreduce_sum_prod' which would take two arrays of shape
(nx_block,ny_block,max_blocks,k) and sum these over their first three
dimensions before performing the global reduction over the k dimension.
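
For reference, such an interface might look roughly like this (a hypothetical declaration sketch only; 'global_allreduce_sum_prod' does not exist in CICE):

    ! hypothetical, not implemented: k reproducible sums of products in one call
    function global_allreduce_sum_prod(array1, array2, dist, field_loc) result(sums)
       use ice_kinds_mod,    only: dbl_kind, int_kind
       use ice_distribution, only: distrb
       real(dbl_kind), dimension(:,:,:,:), intent(in) :: array1, array2 ! (nx_block,ny_block,max_blocks,k)
       type(distrb),                       intent(in) :: dist           ! block distribution
       integer(int_kind),                  intent(in) :: field_loc      ! staggering location
       real(dbl_kind), dimension(size(array1,4))      :: sums           ! one reproducible sum per k
       ! body would sum array1*array2 over the first three dimensions for each k,
       ! then reduce all k partial results in a single MPI call
    end function global_allreduce_sum_prod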

We choose not to go that route for now, mostly because the CGS algorithm is
(by default) only used for the PGMRES preconditioner, so the cost should be
relatively low: 'initer' corresponds to 'dim_pgmres' in the namelist, which
should be kept low for efficiency (default 5).

These changes lead to bit-for-bit reproducibility (the decomp_suite
passes) when using 'precond=ident' and 'precond=diag' along with
'bfbflag=reprosum'. 'precond=pgmres' is still not bit-for-bit because
some halo updates are skipped for efficiency. This will be addressed in
a following commit.

Note that calc_bvec loops only over ice points to compute b[xy], so
zero-initialize b[xy] since global_sum_prod loops over the whole array.
The arnoldi_basis_[xy] arrays are already zero-initialized in fgmres and
pgmres.

[1] CICE-Consortium#491 (comment)
The 'pgmres' subroutine implements a separate GMRES solver and is used
as a preconditioner for the FGMRES linear solver. Since it is only a
preconditioner, it was decided to skip the halo updates after computing
the matrix-vector product (in 'matvec'), for efficiency.

This leads to non-reproducibility since the content of the non-updated
halos depends on the block / MPI distribution.

Add the required halo updates, but only perform them when we are
explicitly asking for bit-for-bit global sums, i.e. when 'bfbflag' is
set to something other than 'not'.

Adjust the interfaces of 'pgmres' and 'precondition' (from which
'pgmres' is called) to accept 'halo_info_mask', since it is needed for
masked updates.
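
A minimal sketch of the guarded updates (argument names follow the CICE halo interface, but the exact call sites, and whether the two components are stacked into a single array, differ in the actual code):

    if (bfbflag /= 'not') then
       ! only needed for reproducible global sums; skipped otherwise for speed
       if (maskhalo_dyn) then
          call ice_HaloUpdate(workspace_x, halo_info_mask, field_loc_NEcorner, field_type_vector)
          call ice_HaloUpdate(workspace_y, halo_info_mask, field_loc_NEcorner, field_type_vector)
       else
          call ice_HaloUpdate(workspace_x, halo_info,      field_loc_NEcorner, field_type_vector)
          call ice_HaloUpdate(workspace_y, halo_info,      field_loc_NEcorner, field_type_vector)
       endif
    endif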

Closes CICE-Consortium#518
In the previous commits we ensured bit-for-bit reproducibility of the
outputs when using the VP solver.

Some global norms computed during the nonlinear iteration still use the
same non-reproducible pattern of summing over blocks locally before
performing the reduction. However, these norms are used only to monitor
the convergence in the log file, as well as to exit the iteration when
the required convergence level is reached ('nlres_norm'). Only
'nlres_norm' could (in theory) influence the output, but it is unlikely
that a difference due to floating point errors would influence the 'if
(nlres_norm < tol_nl)' condition used to exit the nonlinear iteration.

Change these remaining cases to also use 'global_sum_prod' to compute
the global norm, leading to bit-for-bit log reproducibility.

Make sure to zero-initialize the arrays F[xy] corresponding to the
non-linear residual vector (Ax - b), once again because global_sum_prod
loops over the whole array, and only ice points are currently
initialized (in 'residual_vec') for these arrays.

In debug mode, this avoids NaNs being involved in some computations
during the 'global_sum_prod' calls used to compute the norm of the
residual just after it's computed. This was only triggered in certain
tests of the 'decomp_suite'.
The previous commit removed the last caller of 'calc_L2norm_squared'.
Remove the subroutine.

Also, do not compute 'sum_squared' in 'residual_vec', since the variable
'L2norm' which receives this value is also unused in 'anderson_solver'
since the previous commit. Remove that variable, and adjust the
interface of 'residual_vec' accordingly.
The previous commits made sure that the model outputs as well as the log
file output are bit-for-bit reproducible when using the VP solver by
refactoring the code to use the existing 'global_sum_prod' subroutine.

Add a note in the documentation mentioning that 'bfbflag' is required to
get bit-for-bit reproducible results under different decompositions /
MPI count when using the VP solver.

While at it, correct a typo in cice_index.rst.
@apcraig
Contributor

apcraig commented Sep 29, 2022

I had a quick look at your notes and implementation. First,

I think you can reduce the new implementation from two MPI sums to one. You should be able to compute "Fx*Fx + Fy*Fy" locally and put it in a temporary array (nx,ny,iblk). Then call global_sum on that array. I hope I understand that correctly. And you can do that over icells saving some mult-adds. That should save the chaos (masks, tripoles, etc) in global_sum_prod as well as reduce the MPI_Allreduce calls in half. And that should still work on tripole because all the tripole logic for the global sums is in global_sum.
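
In code, the suggestion amounts to something roughly like this (a sketch with illustrative names; 'work' is a temporary array sized like a single field):

    work(:,:,:) = c0
    do iblk = 1, nblocks
       do ij = 1, icellu(iblk)
          i = indxui(ij,iblk)
          j = indxuj(ij,iblk)
          ! local product over ice points only; everything else stays zero
          work(i,j,iblk) = Fx(i,j,iblk)*Fx(i,j,iblk) + Fy(i,j,iblk)*Fy(i,j,iblk)
       enddo
    enddo
    ! one reduction instead of two global_sum_prod calls
    norm = sqrt(global_sum(work, distrb_info, field_loc_NEcorner))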

Separately, it might be worth recoding global_sum and global_sum_prod so the masking is outside the loops. I suspect those methods could perform better with some refactoring, but we'd have to explore that. That may be a second step to come later.

I think an important difference in the original and new implementation is that the old implementation was using global_sum_scalar for the reduction after computing a local sum. Circling back, part of the difference is probably the cost of the local sum which does not seem particularly efficient for arrays in the global_sum method because of how the loops and if tests are structured.

@apcraig
Contributor

apcraig commented Sep 30, 2022

One other comment, is it possible the time accounted in MPI is partly due to load imbalance elsewhere such that the AllReduce is waiting for all pes to arrive?

@apcraig
Contributor

apcraig commented Sep 30, 2022

Just for fun, I ran a few cases with this branch and

./cice.setup -m cheyenne -e intel -g gx1 -p 72x1 -s dynpicard --test smoke --testid rv01

Here are the dynamics timings I got,

Timer   3:  Dynamics       5.45 seconds    ! old version of the code (current CICE main)
Timer   3:  Dynamics      10.85 seconds    ! out of the box (bfbflag='off')
Timer   3:  Dynamics       9.72 seconds    ! out of the box + simpler global_sum_prod_dble
Timer   3:  Dynamics      23.07 seconds    ! bfbflag='reprosum'
Timer   3:  Dynamics      22.79 seconds    ! bfbflag='reprosum'  + simpler global_sum_prod_dble

So, getting the if statements out of global_sum_prod_dble (I just commented them out for testing purposes) reduces the dynamics time by 10% for bfbflag='off'. It has less impact with bfbflag='reprosum'. We should recode the global_sum and global_sum_prod at some point. Also, 'reprosum' is quite a bit more expensive overall, but it doesn't have to be on in production, just for some test cases. The next thing I might try to do is combine the two global_sum_prod calls into a single global_sum call in ice_dyn_vp and see the effect.

I think a main point is that global reductions have never been evaluated for performance because they were just used for diagnostics and never were a bottleneck. It's good we're looking at them now.

@apcraig
Contributor

apcraig commented Sep 30, 2022

A few more timing data points for this version of the code,

Timer   3:  Dynamics       5.45 seconds    ! old version of the code (current CICE main)
Timer   3:  Dynamics      10.85 seconds    ! out of the box (bfbflag='off')
Timer   3:  Dynamics       9.72 seconds    ! bfbflag='off' + simpler global_sum_prod_dbl
Timer   3:  Dynamics      23.07 seconds    ! bfbflag='reprosum'
Timer   3:  Dynamics      22.79 seconds    ! bfbflag='reprosum'  + simpler global_sum_prod_dbl
Timer   3:  Dynamics       8.38 seconds    ! bfbflag='off' + 2->1 VP AllReduce
Timer   3:  Dynamics       7.24 seconds    ! bfbflag='off' + 2->1 VP AllReduce + simpler global_sum_dbl

So we were at 5.45s with old VP implementation, with 2->1 VP Allreduce calls and simpler global_sum (removed if checks in loops), it's now 7.24s. I don't know if those differences could be explained by other changes in the VP implementation, but it suggests if we can refactor the two global_sum_prod calls to a single global_sum and improve the performance of the global_sum, we're starting to get close. Also, the temporary arrays passed into global_sum are computed at all points, they could be masked saving still more time. We still have to pay for some overhead in global_reductions, copying the array and then the call to compute_sums_dbl. And maybe there are other things that could be done.

Separately, those Intel performance tools look cool. Maybe we should use them to help us focus on other issues like vector performance. Historically, those tools have required a bunch of spinup time and they didn't work well on real MPI applications (even though they work great on sample codes), but maybe CICE is a small enough application, and the tools are getting better, that it'd be very useful to go through that exercise.

@phil-blain
Member Author

Thanks a lot for taking a look.

I had a quick look at your notes and implementation. First,

I think you can reduce the new implementation from two MPI sums to one. You should be able to compute "Fx*Fx + Fy*Fy" locally and put it in a temporary array (nx,ny,iblk). Then call global_sum on that array. I hope I understand that correctly. And you can do that over icells saving some mult-adds. That should save the chaos (masks, tripoles, etc) in global_sum_prod as well as reduce the MPI_Allreduce calls in half. And that should still work on tripole because all the tripole logic for the global sums is in global_sum.

Yes, I think that's right.

Separately, it might be worth recoding global_sum and global_sum_prod so the masking is outside the loops. I suspect those methods could perform better with some refactoring, but we'd have to explore that. That may be a second step to come later.

Yes, I was suspecting that also. I haven't yet confirmed it.

I think an important difference in the original and new implementation is that the old implementation was using global_sum_scalar for the reduction after computing a local sum. Circling back, part of the difference is probably the cost of the local sum which does not seem particularly efficient for arrays in the global_sum method because of how the loops and if tests are structured.

Yes, indeed. Another thing is that the computation of the local sum was OpenMP-accelerated in the old code, but this is not the case for global_sum, which makes sense as we want to be able to get the same answer with different thread counts.

@phil-blain
Member Author

I'll make some more tweaks and test on my side with a single AllReduce instead of two, and see if I can tweak global_sum to make use of the existing icellu array to sum only on ice points...

Thanks for your input!

@apcraig
Contributor

apcraig commented Oct 3, 2022

Hi @phil-blain, thanks for continuing to push on this.

I don't know if it'll make sense to push the icellu mask down into the global sum. That might require some pretty significant changes in those methods all the way down to ice_repro_sum. My suggestion is to use the icellu mask first just to reduce the local Fx*Fx + Fy*Fy computation but to leave the array passed into global_sum the regular size without passing icellu. Using icellu as a "mask" or "lmask" in the global_sum is also just going to slow it down. Those masks just zero out the cells, they don't reduce the size of the array to sum over.

It seems to me the first thing is to convert the 2 global_sum_prods into 1 global_sum by computing Fx*Fx + Fy*Fy efficiently outside the global_sum. The second step is to refactor the global_sum methods to pull the if checks outside the loops. That second step could be done as a separate PR over the next few weeks if you prefer, I think we've identified the problem there and know we need to fix it for VP at least. Otherwise, it might be reasonable to merge this PR soon? I think it's unlikely we'll be able to duplicate the baseline performance.

However, let me propose another option. That would be to use the original approach when we don't need bit-for-bit, but use the standard global_sum (of the array) when we need bit-for-bit reproducibility. That will ensure the production runs can go as fast as possible while still retaining the ability to do all the checks we want in a bit-for-bit mode. That would probably require a "vp_global_sum" subroutine in VP to leverage code reuse.
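
In pseudo-Fortran, such a wrapper could look roughly like this (a sketch only; 'vp_global_norm' and the 'local_sums_squared' helper are hypothetical, not the actual implementation):

    function vp_global_norm(Fx, Fy) result(norm)
       real(dbl_kind), dimension(:,:,:), intent(in) :: Fx, Fy   ! (nx_block,ny_block,max_blocks)
       real(dbl_kind) :: norm
       real(dbl_kind), dimension(size(Fx,1),size(Fx,2),size(Fx,3)) :: work

       if (bfbflag == 'off') then
          ! fast path: keep the original per-block local sums + scalar reduction
          norm = sqrt(global_sum(local_sums_squared(Fx, Fy), distrb_info))   ! local_sums_squared: hypothetical helper
       else
          ! reproducible path: sum the full array with the bfbflag-aware global_sum
          work = Fx*Fx + Fy*Fy
          norm = sqrt(global_sum(work, distrb_info, field_loc_NEcorner))
       endif
    end function vp_global_norm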

@phil-blain
Member Author

I don't know if it'll make sense to push the icellu mask down into the global sum. That might require some pretty significant changes in those methods all the way down to ice_repro_sum. My suggestion is to use the icellu mask first just to reduce the local Fx*Fx + Fy*Fy computation but to leave the array passed into global_sum the regular size without passing icellu. Using icellu as a "mask" or "lmask" in the global_sum is also just going to slow it down. Those masks just zero out the cells, they don't reduce the size of the array to sum over.

I was not thinking I would have to pass icellu down to ice_reprosum. My idea was that since a lot of blocks have only ocean and no ice, we are summing a lot of zeros in global_sum_dbl before handing the work off to compute_sums_dbl. If the loop is done on icellu, then we really do skip the ocean cells rather than only masking them. So I have a feeling it could be faster. I'd like to try it just to measure the difference. The array that would be passed to compute_sums_dbl would stay the same shape, so no further changes down the call stack would be necessary.

It seems to me the first thing is to convert the 2 global_sum_prods into 1 global_sum by computing Fx*Fx + Fy*Fy efficiently outside the global_sum. The second step is to refactor the global_sum methods to pull the if checks outside the loops. That second step could be done as a separate PR over the next few weeks if you prefer, I think we've identified the problem there and know we need to fix it for VP at least. Otherwise, it might be reasonable to merge this PR soon? I think it's unlikely we'll be able to duplicate the baseline performance.

Yes, if we do refactor the global_sum, I agree it could be in a subsequent PR. And then we can merge this one soon-ish, once we are confident we do not degrade the timings too much.

However, let me propose another option. That would be to use the original approach when we don't need bit-for-bit, but use the standard global_sum (of the array) when we need bit-for-bit reproducibility. That will ensure the production runs can go as fast as possible while still retaining the ability to do all the checks we want in a bit-for-bit mode. That would probably require a "vp_global_sum" subroutine in VP to leverage code reuse.

This is exactly what @JFLemieux73 suggested when we talked last week, and then he suggested reaching out to you to get a second opinion. So I think this is an option worth exploring, too. I think if I refactor things a bit, we can keep the code clean and still retain the same performance, or very close (we would have an additional if somewhere to dispatch appropriately to the original or new code depending on bfbflag, but apart from that the performance should be pretty similar).

@apcraig
Contributor

apcraig commented Oct 3, 2022

That all sounds great. I think global_sum is just doing a copy into a 1d array so I'm not sure icellu would help much. It might help if we were still using global_sum_prod. I think we're also on the same page with respect to supporting two global sum approaches within vp. As I noted before, the global sum has not been on the critical path for performance nor has it played a role in results beyond just diagnostics. Having a fast custom global sum where it is on the critical path is a totally reasonable approach.

@phil-blain
Member Author

That all sounds great. I think global_sum is just doing a copy into a 1d array so I'm not sure icellu would help much. It might help if we were still using global_sum_prod.

Oh, you are totally right, this is what I had in mind indeed. If we simply use global_sum and not global_sum_prod, then it does not make sense to use icellu.

One thing that I thought of to speed up global_sum and friends is to just add more specific subroutines to the generic interfaces (module procedures). We could have one version without any mask, one with the multiplicative mask, one with the logical mask, etc. Then the dispatching would be done at compile time, which is even better. And we could gather the common code (dealing with the tripole blocks) in a separate subroutine and simply call it from all global_sum_* subroutines.
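
For instance, something along these lines (a sketch; the specific procedure names are hypothetical):

    interface global_sum
       module procedure global_sum_dbl_nomask   ! no mask: plain reproducible sum
       module procedure global_sum_dbl_mmask    ! multiplicative (real) mask
       module procedure global_sum_dbl_lmask    ! logical mask
       ! ... plus the real, integer and scalar variants as today
    end interface global_sum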

@apcraig
Contributor

apcraig commented Oct 3, 2022

Very good ideas for refactoring the global_sum routines. My thought was just to pull out the if blocks and do the masking separately only when needed, but having more interfaces might work even better; we just have to see how much duplication we end up with and how much code reuse we can have.

@phil-blain
Member Author

use the original approach when we don't need bit-for-bit, but use the standard global_sum (of the array) when we need bit-for-bit reproducibility.

This is what I've implemented. I'm finalizing the QC testing right now. The timings also look good. I'll open a new PR later today or tomorrow. Stay tuned!

Thanks again,

@phil-blain phil-blain closed this Oct 6, 2022
@phil-blain
Member Author

Final PR is here: #774

@phil-blain phil-blain deleted the repro-vp-no-pairs branch February 19, 2024 20:29