SIGFPE: Floating-point exception in CH4Mod.F90 #6428

Closed
ndkeen opened this issue May 17, 2024 · 6 comments · Fixed by #6483
Labels: BGC, GNU (GNU compiler related issues), Land

Comments

@ndkeen
Contributor

ndkeen commented May 17, 2024

On pm-cpu (as well as gcp12) this DEBUG test
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_gnu.allactive-wcprodssp
fails with GNU.

364: #0  0x14b320651dbf in ???
364: #1  0x36f58e7 in ch4_tran
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/components/elm/src/biogeochem/CH4Mod.F90:3121
364: #2  0x37353e5 in __ch4mod_MOD_ch4
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/components/elm/src/biogeochem/CH4Mod.F90:1709
364: #3  0x2a16186 in __elm_driver_MOD_elm_drv
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/components/elm/src/main/elm_driver.F90:1189
364: #4  0x29dd153 in __lnd_comp_mct_MOD_lnd_run_mct
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/components/elm/src/cpl/lnd_comp_mct.F90:617
364: #5  0x49fdab in __component_mod_MOD_component_run
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/driver-mct/main/component_mod.F90:734
364: #6  0x483563 in __cime_comp_mod_MOD_cime_run
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/driver-mct/main/cime_comp_mod.F90:2968
364: #7  0x49d116 in cime_driver
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/driver-mct/main/cime_driver.F90:153
364: #8  0x49d179 in main
364:    at /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_pm-avoid-DVS-warning/driver-mct/main/cime_driver.F90:23
srun: error: nid004572: task 364: Floating point exception
srun: Terminating StepId=25683685.0

In components/elm/src/biogeochem/CH4Mod.F90

      ! Perform competition for oxygen and methane in each soil layer if demands over the course of the timestep
      ! exceed that available. Assign to each process in proportion to the quantity demanded in the absense of
      ! the limitation.
      do j = 1,nlevsoi
         do fc = 1, num_methc
            c = filter_methc (fc)

            o2demand = o2_decomp_depth(c,j) + o2_oxid_depth(c,j) ! o2_decomp_depth includes autotrophic root respiration
            if (o2demand > 0._r8) then
               o2stress(c,j) = min((conc_o2(c,j) / dtime + o2_aere_depth(c,j)) / o2demand, 1._r8)   ! <-- line 3121
            else
               o2stress(c,j) = 1._r8
@rljacob
Member

rljacob commented May 29, 2024

This would get more traction with a title that points to the routine. Fixed it. The test name doesn't help for these fully coupled water cycle cases since so many components are running.

@rljacob rljacob changed the title SIGFPE: Floating-point exception with SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_gnu.allactive-wcprodssp SIGFPE: Floating-point exception in CH4Mod.F90 May 29, 2024
@ndkeen
Contributor Author

ndkeen commented May 29, 2024

After adding prints, I see that o2demand=1.2750394683855957E-312 at the crash point. That is above 0._r8 and seems reasonable to me, but this compiler isn't happy with it.

If I try something like this:

           mach_eps       = epsilon(1.0_r8)
       
       ...
            !if (o2demand > 0._r8) then                                                                                                                                                                                                                             
            if (o2demand > mach_eps) then

the run continues.

I'm not sure if that's a good solution or if all compilers support epsilon(). Or maybe we already have a shared variable representing Fortran's intrinsic epsilon()? ... Or... we don't expect o2demand to have such a value, which may point to some other issue...
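
The behavior described above can be reproduced outside ELM with a small standalone program. This is only a sketch: the `o2demand` value is the one reported at the crash point, but `supply` is a hypothetical numerator chosen for illustration, and the program name is made up. It shows that a value this small still passes a `> 0._r8` test while lying below `tiny(1._r8)`, and that dividing a modest number by it exceeds `huge(1._r8)`.

```fortran
program subnormal_divide
   use, intrinsic :: iso_fortran_env, only: r8 => real64
   use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
   implicit none
   character(len=32) :: reported
   real(r8) :: o2demand, supply, ratio

   ! Value printed at the crash point; read at run time so the compiler
   ! does not have to accept a subnormal literal.
   reported = '1.2750394683855957E-312'
   read(reported, *) o2demand

   supply = 1.0e-3_r8   ! hypothetical numerator, not taken from the failing run

   print *, 'o2demand > 0._r8       :', o2demand > 0._r8
   print *, 'o2demand < tiny(1._r8) :', o2demand < tiny(1._r8)

   ! 1E-3 / 1.3E-312 is about 8E+308, which exceeds huge(1._r8) (about 1.8E+308).
   ratio = supply / o2demand
   print *, 'quotient is finite     :', ieee_is_finite(ratio)
end program subnormal_divide
```

Built without floating-point traps, the quotient should simply come out as infinity. In a build that traps on overflow (with gfortran, for example, via `-ffpe-trap=overflow`), the same division would be expected to raise SIGFPE instead, which matches the failure seen in the DEBUG test.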

@peterdschwartz
Contributor

@ndkeen Thanks for further looking into this. There have been a few cases of floating-point underflow/overflow like this, and it's good to get rid of them because, unless the machine is dividing a number by an exact multiple, the result is undetermined.

I'd have to look further into how epsilon works (there's also a tiny function). It should find the smallest representable number for your machine based on some criterion (so it would be architecture dependent and maybe compiler dependent).
We do use huge to initialize some variables, but for computations I wonder if it's better to explicitly use a small parameter value, as certain subroutines might be more or less sensitive to the value of that parameter in producing non-bfb results.

@ndkeen
Contributor Author

ndkeen commented Jun 10, 2024

I suppose we aren't hitting div-by-zero, but a different FP issue: overflow. It seems safe to use epsilon() here, though it raises the question of how many places will need something like this. Let me know if I should make a PR. I would prefer to leave the decision to the developers, as they may know more about reasonable values for these quantities (i.e., if some value should really never be below x, we could implement that instead).

@peterdschwartz
Contributor

@ndkeen Ok, I solved a similar issue in a separate module (PR 5828), and I will go through a similar process to see how dependent the code is on the value of the parameter.

@peterdschwartz
Contributor

Made a PR that addresses this issue. Point of note: the function we want is tiny instead of epsilon. epsilon provides the round-off error for a data type, so for doubles it's on the order of 1E-16. This is much too large for most calculations, as the variables may be on the order of 1E-24, and it causes DIFFs.

On the other hand, tiny provides the smallest positive normalized number for a given data type on the compiler+machine and is on the order of 1.E-308.
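
To make the difference concrete, here is a minimal sketch that prints both intrinsics for a 64-bit real and checks a demand of 1E-24 against each threshold. The variable `demand` and its value are illustrative (the magnitude is the one mentioned above), and `r8` is assumed here to be the IEEE double-precision kind.

```fortran
program eps_vs_tiny
   use, intrinsic :: iso_fortran_env, only: r8 => real64
   implicit none
   real(r8) :: demand

   ! epsilon: spacing of representable numbers near 1.0 (about 2.2E-16 for doubles).
   print *, 'epsilon(1._r8) =', epsilon(1._r8)
   ! tiny: smallest positive normalized number (about 2.2E-308 for doubles).
   print *, 'tiny(1._r8)    =', tiny(1._r8)

   ! Illustrative magnitude for a physically meaningful but small demand.
   demand = 1.0e-24_r8

   ! An epsilon threshold would reject this value; a tiny threshold keeps it.
   print *, 'demand > epsilon(1._r8) :', demand > epsilon(1._r8)
   print *, 'demand > tiny(1._r8)    :', demand > tiny(1._r8)
end program eps_vs_tiny
```

A guard based on epsilon would send a 1E-24 demand down the "no demand" branch, which is consistent with the DIFFs noted above, while a guard based on tiny only filters out zero and subnormal values.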

peterdschwartz added a commit that referenced this issue Jul 10, 2024
…flow' into next (PR #6483)

Replace >0._r8 check in CH4Mod with a small parameter set by tiny intrinsic function.
The exact value of tiny(1._r8) may depend on the compiler and machine but is on the order of 1.E-308

Tested on pm-cpu_intel and pm-cpu_gnu

Fixes #6428
[BFB]
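
As a rough sketch of the kind of guard the commit message describes (not the actual code merged in the PR), the check can be isolated in a small function. The module name, the function `o2_stress`, the parameter name `smallparameter`, and the test inputs are all hypothetical; only the threshold `tiny(1._r8)` and the form of the expression come from the thread.

```fortran
module o2stress_sketch
   use, intrinsic :: iso_fortran_env, only: r8 => real64
   implicit none
   private
   public :: o2_stress

   ! Hypothetical name; the threshold itself follows the commit message.
   real(r8), parameter :: smallparameter = tiny(1._r8)

contains

   pure function o2_stress(conc_o2, dtime, o2_aere, o2demand) result(stress)
      real(r8), intent(in) :: conc_o2, dtime, o2_aere, o2demand
      real(r8) :: stress

      ! Treat zero and subnormal demand as "no demand" instead of dividing by it.
      if (o2demand > smallparameter) then
         stress = min((conc_o2 / dtime + o2_aere) / o2demand, 1._r8)
      else
         stress = 1._r8
      end if
   end function o2_stress

end module o2stress_sketch

program o2stress_demo
   use, intrinsic :: iso_fortran_env, only: r8 => real64
   use o2stress_sketch, only: o2_stress
   implicit none
   character(len=32) :: reported
   real(r8) :: subnormal_demand

   ! Demand value reported at the crash point, read at run time.
   reported = '1.2750394683855957E-312'
   read(reported, *) subnormal_demand

   ! Illustrative inputs: the first call takes the else branch (stress = 1),
   ! the second performs the division with an ordinary demand.
   print *, o2_stress(1.0e-3_r8, 1800._r8, 0._r8, subnormal_demand)
   print *, o2_stress(1.0e-3_r8, 1800._r8, 0._r8, 1.0e-6_r8)
end program o2stress_demo
```

Note that this guard only excludes zero and subnormal demands. A demand just above tiny combined with a large enough numerator could in principle still overflow, so the threshold is a judgment about which magnitudes are physically meaningful.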