Add Modular fp #119

Draft: belericant wants to merge 16 commits into master
Conversation

belericant (Contributor)

Addresses issue #42

sbryngelson (Member) commented Mar 10, 2023

I found a few things:

We are still using dble, which converts to double precision: https://gcc.gnu.org/onlinedocs/gfortran/DBLE.html

You can get rid of the calls to dble:

[I]shb-m1pro: MFC/src $ grep -iR 'dble' ./*
./post_process/m_global_parameters.f90:        Re_trans = dble(trans)
./post_process/m_global_parameters.f90:                      + dble(ir - 1)*DLOG(R0mx/R0mn)/dble(Npt - 1)
./pre_process/m_global_parameters.fpp:        Re_trans = dble(trans)
./pre_process/m_global_parameters.fpp:                      + dble(ir - 1)*DLOG(R0mx/R0mn)/dble(nb - 1)
./simulation/m_global_parameters.fpp:        Re_trans = dble(trans)
./simulation/m_global_parameters.fpp:                      + dble(ir - 1)*DLOG(R0mx/R0mn)/dble(nb - 1)
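
For example, the first of these could become a kind-aware conversion instead (a minimal sketch, assuming wp is the working-precision kind parameter this PR introduces):

    ! before: Re_trans = dble(trans)
    Re_trans = real(trans, wp)   ! convert to the working precision instead of forcing double

The same real(..., wp) pattern applies to the dble(ir - 1) and dble(nb - 1) factors, and DLOG can become the generic log.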

Please change these to lowercase exp/sqrt:

[I]shb-m1pro: MFC/src $ grep -R 'EXP(' ./*
./post_process/m_global_parameters.f90:        c3 = (CEXP(c2) - CEXP(-c2))/(CEXP(c2) + CEXP(-c2)) ! TANH(c2)
./post_process/m_global_parameters.f90:        R0mn = 0.8_wp*EXP(-2.8_wp*sd)
./post_process/m_global_parameters.f90:        R0mx = 0.2_wp*EXP(9.5_wp*sd) + 1._wp
./post_process/m_global_parameters.f90:            R0(ir) = EXP(phi(ir))
./post_process/m_global_parameters.f90:            tmp = EXP(-0.5_wp*(phi(ir)/sd)**2)/SQRT(2._wp*pi)/sd
./post_process/m_global_parameters.f90:        tmp = EXP(-0.5_wp*(phi(1)/sd)**2)/SQRT(2._wp*pi)/sd
./post_process/m_global_parameters.f90:        tmp = EXP(-0.5_wp*(phi(Npt)/sd)**2)/SQRT(2._wp*pi)/sd
./pre_process/m_global_parameters.fpp:        c3 = (CEXP(c2) - CEXP(-c2))/(CEXP(c2) + CEXP(-c2)) ! TANH(c2)
./pre_process/m_global_parameters.fpp:        R0mn = 0.8_wp*EXP(-2.8_wp*sd)
./pre_process/m_global_parameters.fpp:        R0mx = 0.2_wp*EXP(9.5_wp*sd) + 1._wp
./pre_process/m_global_parameters.fpp:            R0(ir) = EXP(phi(ir))
./pre_process/m_global_parameters.fpp:            tmp = EXP(-0.5_wp*(phi(ir)/sd)**2)/SQRT(2._wp*pi)/sd
./pre_process/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(1)/sd)**2)/SQRT(2._wp*pi)/sd
./pre_process/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(nb)/sd)**2)/SQRT(2._wp*pi)/sd
./simulation/m_global_parameters.fpp:        c3 = (CEXP(c2) - CEXP(-c2))/(CEXP(c2) + CEXP(-c2)) ! TANH(c2)
./simulation/m_global_parameters.fpp:        R0mn = 0.8_wp*EXP(-2.8_wp*sd)
./simulation/m_global_parameters.fpp:        R0mx = 0.2_wp*EXP(9.5_wp*sd) + 1._wp
./simulation/m_global_parameters.fpp:            R0(ir) = EXP(phi(ir))
./simulation/m_global_parameters.fpp:            tmp = EXP(-0.5_wp*(phi(ir)/sd)**2)/SQRT(2._wp*pi)/sd
./simulation/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(1)/sd)**2)/SQRT(2._wp*pi)/sd
./simulation/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(nb)/sd)**2)/SQRT(2._wp*pi)/sd

And these:

[I]shb-m1pro: MFC/src $ grep -R 'SQRT(' ./*
./common/m_helper.f90:        ntmp = SQRT((4._wp*pi/3._wp)*nR3/vftmp)
./post_process/m_global_parameters.f90:        uu = SQRT(pl0/rhol0)
./post_process/m_global_parameters.f90:        phi_vn = (1._wp + SQRT(mu_v/mu_n)*(M_n/M_v)**(0.25_wp))**2 &
./post_process/m_global_parameters.f90:                 /(SQRT(8._wp)*SQRT(1._wp + M_v/M_n))
./post_process/m_global_parameters.f90:        phi_nv = (1._wp + SQRT(mu_n/mu_v)*(M_v/M_n)**(0.25_wp))**2 &
./post_process/m_global_parameters.f90:                 /(SQRT(8._wp)*SQRT(1._wp + M_n/M_v))
./post_process/m_global_parameters.f90:        omegaN = SQRT(3._wp*k_poly*Ca + 2._wp*(3._wp*k_poly - 1._wp)/(Web*R0))/R0
./post_process/m_global_parameters.f90:        c2 = CSQRT(c1)
./post_process/m_global_parameters.f90:            tmp = EXP(-0.5_wp*(phi(ir)/sd)**2)/SQRT(2._wp*pi)/sd
./post_process/m_global_parameters.f90:        tmp = EXP(-0.5_wp*(phi(1)/sd)**2)/SQRT(2._wp*pi)/sd
./post_process/m_global_parameters.f90:        tmp = EXP(-0.5_wp*(phi(Npt)/sd)**2)/SQRT(2._wp*pi)/sd
./pre_process/m_global_parameters.fpp:        uu = SQRT(pl0/rhol0)
./pre_process/m_global_parameters.fpp:        phi_vn = (1._wp + SQRT(mu_v/mu_n)*(M_n/M_v)**(0.25_wp))**2 &
./pre_process/m_global_parameters.fpp:                 /(SQRT(8._wp)*SQRT(1._wp + M_v/M_n))
./pre_process/m_global_parameters.fpp:        phi_nv = (1._wp + SQRT(mu_n/mu_v)*(M_v/M_n)**(0.25_wp))**2 &
./pre_process/m_global_parameters.fpp:                 /(SQRT(8._wp)*SQRT(1._wp + M_n/M_v))
./pre_process/m_global_parameters.fpp:        omegaN = SQRT(3._wp*k_poly*Ca + 2._wp*(3._wp*k_poly - 1._wp)/(Web*R0))/R0
./pre_process/m_global_parameters.fpp:        c2 = CSQRT(c1)
./pre_process/m_global_parameters.fpp:            tmp = EXP(-0.5_wp*(phi(ir)/sd)**2)/SQRT(2._wp*pi)/sd
./pre_process/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(1)/sd)**2)/SQRT(2._wp*pi)/sd
./pre_process/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(nb)/sd)**2)/SQRT(2._wp*pi)/sd
./pre_process/m_patches.f90:                    !    q_prim_vf(E_idx)%sf(i,j,0) = (1._wp * (10._wp ** 5)) + 1._wp/SQRT(x_cc(i)**2+y_cc(j)**2) &
./pre_process/m_patches.f90:                    ! radius_pressure = SQRT(x_cc(i)**2) ! 1D
./pre_process/m_patches.f90:                    ! radius_pressure = SQRT(x_cc(i)**2 + cart_y**2) ! 2D
./pre_process/m_patches.f90:                    ! radius_pressure = SQRT(x_cc(i)**2 + cart_y**2 + cart_z**2) ! 3D
./simulation/m_qbmm.fpp:                            c = SQRT(c)
./simulation/m_qbmm.fpp:        fup(1) = bu - SQRT(c2)
./simulation/m_qbmm.fpp:        fup(2) = bu + SQRT(c2)
./simulation/m_global_parameters.fpp:        uu = SQRT(pl0/rhol0)
./simulation/m_global_parameters.fpp:        phi_vn = (1._wp + SQRT(mu_v/mu_n)*(M_n/M_v)**(0.25_wp))**2 &
./simulation/m_global_parameters.fpp:                 /(SQRT(8._wp)*SQRT(1._wp + M_v/M_n))
./simulation/m_global_parameters.fpp:        phi_nv = (1._wp + SQRT(mu_n/mu_v)*(M_v/M_n)**(0.25_wp))**2 &
./simulation/m_global_parameters.fpp:                 /(SQRT(8._wp)*SQRT(1._wp + M_n/M_v))
./simulation/m_global_parameters.fpp:        omegaN = SQRT(3._wp*k_poly*Ca + 2._wp*(3._wp*k_poly - 1._wp)/(Web*R0))/R0
./simulation/m_global_parameters.fpp:        c2 = CSQRT(c1)
./simulation/m_global_parameters.fpp:            tmp = EXP(-0.5_wp*(phi(ir)/sd)**2)/SQRT(2._wp*pi)/sd
./simulation/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(1)/sd)**2)/SQRT(2._wp*pi)/sd
./simulation/m_global_parameters.fpp:        tmp = EXP(-0.5_wp*(phi(nb)/sd)**2)/SQRT(2._wp*pi)/sd
./simulation/m_data_output.fpp:                        nbub = SQRT((4._wp*pi/3._wp)*nR3/alf)
./simulation/m_data_output.fpp:                            nbub = SQRT((4._wp*pi/3._wp)*nR3/alf)
./simulation/m_bubbles.fpp:                            c_liquid = SQRT(n_tait*(myP + B_tait)/(myRho*(1._wp - alf)))
./simulation/m_bubbles.fpp:        f_cgas = SQRT(tmp + (fntait - 1._wp)*fH)
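
The lowercase change is purely stylistic for EXP and SQRT (Fortran is case-insensitive), but CEXP and CSQRT are the specific names for default-kind complex arguments, so moving to the generic lowercase intrinsics also lets those calls follow the working precision. A sketch of two of the lines above:

    tmp = exp(-0.5_wp*(phi(ir)/sd)**2)/sqrt(2._wp*pi)/sd   ! was EXP(...)/SQRT(...)
    c2 = sqrt(c1)                                          ! was CSQRT(c1); the generic sqrt accepts complex arguments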

sbryngelson (Member)

@anshgupta1234 would you mind giving this PR a review as well, just to check for typos that might be around? I looked, but another set of eyes would be useful. Thanks!

anshgupta1234 (Contributor) left a comment

Didn't find anything else. Looking good!

henryleberre (Member) commented Mar 11, 2023

@sbryngelson What do you think about moving the contents of m_precision_select into m_constants?

I wonder how NVFortran will handle precision selection because we specify -Mr8:

-Mr8: (Fortran only) The compiler promotes REAL variables and constants to DOUBLE PRECISION variables and constants, respectively. DOUBLE PRECISION elements are 8 bytes in length.

-Mnor8: (Fortran only) The compiler does not promote REAL variables and constants to DOUBLE PRECISION. REAL variables will be single precision (4 bytes in length).

From https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-ref-guide/#cmdln-m-opts-optimize. It's unclear to me when and how this "promotion" happens.

In any case, should this feature be made available to the build system? I feel like another way we could have implemented this would have been to use real everywhere and let CMake set the appropriate compiler flags. We already use the flags that turn real into double everywhere. Setting single precision could (possibly) be as simple as not specifying these flags.

The benefit of the approach @belericant took is that you could elect not to use working precision for kernels that need double precision. It's modularity at the cost of verbosity. Note that we could still achieve this with the other approach by adding/removing compiler flags depending on what file is being compiled.
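
For reference, the module-based approach boils down to a single kind parameter that the rest of the code uses everywhere (a hypothetical sketch; the kind names and selection mechanism are assumptions, not the PR's actual m_precision_select contents):

    module m_precision_select
        implicit none
        integer, parameter :: sp = selected_real_kind(6, 37)    ! single precision
        integer, parameter :: dp = selected_real_kind(15, 307)  ! double precision
        ! Working precision: every real(wp) declaration and _wp literal in the
        ! code base follows this one parameter. Flipping it (by hand or via a
        ! preprocessor/fypp conditional) switches the whole solver's precision.
        integer, parameter :: wp = dp
    end module m_precision_select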

One last remark, we may lose precision in a few places by being aggressive in the way we enforce working precision. Consider this example:

x_cc(i) = x_domain%beg + (5._wp * (10._wp ** -(1)))*dx*real(2*i + 1, wp)

(5._wp * (10._wp ** -(1))) is a constant evaluated at compile-time. All operations that lead to its evaluation are done in working precision. It might be preferable to compute this in double precision and then cast it to single working precision for better fidelity.
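
One way to do that (a sketch; dp here denotes the double-precision kind and wp the working one) is to evaluate the literal expression in double precision and cast only the result:

    x_cc(i) = x_domain%beg + real(5._dp*(10._dp**(-1)), wp)*dx*real(2*i + 1, wp)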

henryleberre linked an issue Mar 11, 2023 that may be closed by this pull request
henryleberre added the enhancement and good first issue labels Mar 11, 2023
henryleberre self-requested a review March 11, 2023 23:32
sbryngelson (Member) commented Mar 12, 2023

> @sbryngelson What do you think about moving the contents of m_precision_select into m_constants?

@henryleberre I don't have a particular preference about this; it seems fine.

> I wonder how NVFortran will handle precision selection because we specify -Mr8:
>
> -Mr8: (Fortran only) The compiler promotes REAL variables and constants to DOUBLE PRECISION variables and constants, respectively. DOUBLE PRECISION elements are 8 bytes in length.
>
> -Mnor8: (Fortran only) The compiler does not promote REAL variables and constants to DOUBLE PRECISION. REAL variables will be single precision (4 bytes in length).
>
> From [docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-ref-guide/#cmdln-m-opts-optimize](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-ref-guide/#cmdln-m-opts-optimize). It's unclear to me when and how this "promotion" happens.
>
> In any case, should this feature be made available to the build system? I feel like another way we could have implemented this would have been to use real everywhere and let CMake set the appropriate compiler flags. We already use the flags that turn real into double everywhere. Setting single precision could (possibly) be as simple as not specifying these flags.
>
> The benefit of the approach @belericant took is that you could elect not to use working precision for kernels that need double precision. It's modularity at the cost of verbosity. Note that we could still achieve this with the other approach by adding/removing compiler flags depending on what file is being compiled.

This current setup might be messing with @belericant's tests, then, no?

But to your main point: I think we will probably have to abandon any forced precision flags. The reasoning is that we now have fine-grained control, which is what I would like. I'm not sure even compiling each module with different flags would do the trick here, because some "setup" routines like s_initialize_module_XXXXX should presumably be done in higher precision (since they only run once). Then we would only use a lower precision in the compute kernels.
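
As an illustration (hypothetical code, assuming the sp/dp/wp kinds come from m_precision_select), a setup routine could do its one-time arithmetic in double precision and only store the result in working precision for the kernels:

    subroutine s_initialize_module_example(sd, weight)
        use m_precision_select
        real(wp), intent(in)  :: sd
        real(wp), intent(out) :: weight
        real(dp) :: tmp
        ! one-time setup arithmetic done in double precision ...
        tmp = 1._dp/sqrt(2._dp*acos(-1._dp))/real(sd, dp)
        ! ... and stored in working precision for use in the compute kernels
        weight = real(tmp, wp)
    end subroutine s_initialize_module_example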

> One last remark, we may lose precision in a few places by being aggressive in the way we enforce working precision. Consider this example:
>
> x_cc(i) = x_domain%beg + (5._wp * (10._wp ** -(1)))*dx*real(2*i + 1, wp)
>
> (5._wp * (10._wp ** -(1))) is a constant evaluated at compile-time. All operations that lead to its evaluation are done in working precision. It might be preferable to compute this in double precision and then cast it to single working precision for better fidelity.

@henryleberre If I understand correctly, there shouldn't be a meaningful difference as long as the computation is backward stable and not a victim of catastrophic cancellation. But I'm not sure I am fully understanding.

henryleberre pushed a commit to henryleberre/ChemMFC that referenced this pull request Mar 21, 2023
sbryngelson (Member)

Is this ready for merge? @belericant

sbryngelson (Member)

We should also check performance on CPU and GPU; perhaps @wilfonba can pass along his script for that. I would also be curious to see the performance numbers for one of those cases using single precision on a V100 or A100.

belericant (Contributor, Author)

> Is this ready for merge? @belericant

@wilfonba found issues exporting 3D runs in single precision, but that isn't due to the modifications necessary to have modular working precision. I don't know whether it makes more sense for that to be fixed in this PR or in a new one.

sbryngelson (Member)

What happens if you use wp = single and precision (output) = 1 or 2, and if you use wp = double and precision = 1 or 2? Do any of these cases break? Are they documented? @belericant @wilfonba

wilfonba (Collaborator)

@sbryngelson That's the logic I still need to add.

sbryngelson (Member)

OK, I will merge once that is complete.

wilfonba (Collaborator)

Numerical Schlieren doesn't seem to work with single working precision. When I run with double working precision and save single-precision silo files, it's nearly uniformly 1 after the first 10 time steps, but if I run with single precision and save in single precision, it's uniformly zero after the first 10 time steps.

sbryngelson (Member)

> Numerical Schlieren doesn't seem to work with single working precision. When I run with double working precision and save single-precision silo files, it's nearly uniformly 1 after the first 10 time steps, but if I run with single precision and save in single precision, it's uniformly zero after the first 10 time steps.

Here is the relevant code: https://github.com/belericant/MFC/blob/4134a660e77c20e0639f5565e768768728caa39f/src/post_process/m_derived_variables.fpp#L573-L718. It could be sensitive to precision issues, as well as to the value of schlieren_alpha(i). What do you use for these?

wilfonba (Collaborator)

@sbryngelson
'schlieren_alpha(1)' : 40, 'schlieren_alpha(2)' : 400

sbryngelson (Member)

I don't know if these are reasonable or not; I haven't used the feature in a while.

wilfonba (Collaborator)

It's one of the example cases, so I'd assume they're reasonable, but I can run it longer to see.

wilfonba (Collaborator) commented May 10, 2023

@sbryngelson Silo output works for all combinations now. For a 3D shock droplet with 2 million cells on two V100s with case optimization enabled, I get the following times per time step (over a 1000-time-step run):

Double Precision: 3.288e-2 s/step (~109,500 steps/hr)
Single Precision: 2.499e-2 s/step (~144,000 steps/hr)

for a 31.5% increase in steps/hr.

I did have to comment out the calls to cufftexecd2z and cufftexecz2d in m_fftw.fpp for it to compile on GPUs with NVHPC in single precision; it works fine on CPUs with GNU, though. It looked like on GPUs fftw3 was compiled after I got these errors, not sure what's up with that.

Also, some of the test cases don't pass on double precision, likely a bug introduced somewhere in all the changes this PR has made or me merging it with upstream.

None of the test cases pass with wp = single_precision because they all request double-precision silo files and those options are incompatible.

Perhaps there's a way to modify the toolchain such that you pass the working precision as an option and m_precision_select.f90 is automatically modified and changes all the test cases to use the correct precision silo files? @henryleberre or @anshgupta1234 could probably add this far more quickly than I can, given they have experience with it.
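
For the cuFFT issue above, one possible fix is to dispatch on the working precision, since cufftExecD2Z/cufftExecZ2D only exist for double-precision transforms while cufftExecR2C/cufftExecC2R are their single-precision counterparts. A hypothetical fragment (the preprocessor symbol, plan, and buffer names are made up, and the plans would need to be created with the matching CUFFT_R2C/CUFFT_C2R transform types):

    #ifdef MFC_SINGLE_PRECISION
        ierr = cufftExecR2C(fwd_plan, real_buf, cmplx_buf)   ! single-precision real-to-complex
    #else
        ierr = cufftExecD2Z(fwd_plan, real_buf, cmplx_buf)   ! double-precision real-to-complex
    #endif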

sbryngelson (Member) commented May 10, 2023

@wilfonba

> It looked like on GPUs fftw3 was compiled after I got these errors, not sure what's up with that.

I've seen this recently as well. It doesn't build fftw when using GPUs (because I guess it finds cuFFT?), but it will build fftw for the post-processing code for CPU purposes.

> None of the test cases pass with wp = single_precision because they all request double-precision silo files and those options are incompatible.

I don't understand what you mean here. What options? We don't have silo golden files (though we do use h5dump to check for obviously incorrect hdf5/silo files).

> Also, some of the test cases don't pass on double precision, likely a bug introduced somewhere in all the changes this PR has made or me merging it with upstream.

This will have to be fixed, of course.

> Perhaps there's a way to modify the toolchain such that you pass the working precision as an option and m_precision_select.f90 is automatically modified and changes all the test cases to use the correct precision silo files?

Presumably, we would like the precision to be selected in the case.py file. In this situation, we would just double the number of test cases: a set for single and a set for double precision.

wilfonba (Collaborator) commented May 10, 2023

@sbryngelson

> I don't understand what you mean here. What options? We don't have silo golden files (though we do use h5dump to check for obviously incorrect hdf5/silo files).

I have a check in m_checker that says single working precision and double-precision silo files are incompatible, because running a simulation in single precision and then requesting a double-precision silo file doesn't make much sense. I could make it so that, if single working precision is used and double silo precision is requested, the single-precision data is cast to double precision, but I think it makes more sense to just disallow that combination for practical purposes. I suppose allowing it would fix the testing problem.
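
A hypothetical sketch of such a check (assuming the sp/wp kinds from m_precision_select, the existing precision output parameter with 2 meaning double, and an abort routine like s_mpi_abort):

    if (wp == sp .and. precision == 2) then
        call s_mpi_abort('precision = 2 (double-precision silo files) is '// &
                         'incompatible with single working precision. Exiting ...')
    end if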

> Presumably, we would like the precision to be selected in the case.py file. In this situation, we would just double the number of test cases: a set for single and a set for double precision.

I see how this would be desirable, but I don't think it's as straightforward as other case-file options. All of the working-precision differences appear to take effect at compile time, not run time, while all current case-file options are run-time ones. I'm guessing that trying to change precision at run time would cause a bunch of run-time errors, but I can check. (Edit: since the case file goes through mfc.sh, I think we could just have the toolchain catch this option and rebuild everything.)

I'll work on getting the test cases to pass given all the changes, and talk with Henry/Ansh about toolchain changes for testing.

wilfonba (Collaborator)

I came across a cool tool that takes math expressions and the expected ranges of each variable and outputs the best way to write the expression with the least loss of precision. It may be useful if there are any calculations in the code that we suspect might have problems with single precision.

https://herbie.uwplse.org/

sbryngelson (Member)

Seems like it could be useful!

Commits:
- added ci
- added ci
- things2
- things
- added single precision to test suite and generated golden files
- added working precision to case_dicts, testing, and documentation
- fixed code not passing using double working precision

sbryngelson marked this pull request as a draft December 20, 2023 20:53
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Successfully merging this pull request may close: "Switching to modern and modular precision declaration"

5 participants