
SMP Release #30

Open
rgknox opened this issue Mar 6, 2015 · 57 comments

Comments

@rgknox
Contributor

rgknox commented Mar 6, 2015

Hi All,

I put the Shared Memory Parallelism commits on the master. This allows the radiation scattering, photosynthesis, and thermodynamics calculations for different patches to be split across different CPU cores.

This has been tested using RK4 and Hybrid integration.
This has had limited testing on gridded runs.
This has had no testing on coupled runs (but I don't expect any breakage).

If you don't want to use shared memory, just keep doing what you have done in the past and nothing should change.

If you do want to use it, follow these steps for a single polygon run:

  1. Compile the code with shared-memory directives enabled. If you are using OpenMP, the flag is '-fopenmp'.
  2. (optional) Increase your stack size. On Linux: "ulimit -s unlimited"
  3. Set the run-time environment variables. If you are using OpenMP, the key variable is OMP_NUM_THREADS, which defines how many shared-memory cores will be used. On Linux:
    "export OMP_NUM_THREADS=X", where X is the number of cores you wish to use. REMEMBER: these cores must share RAM, so you are limited by the number of cores on one node.
  4. Execute the simulation as you would normally.
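The steps above can be condensed into a shell sketch. The binary name and thread count here are illustrative assumptions, not taken from the release notes:

```shell
# Sketch of an OpenMP-enabled single-polygon run; the binary name
# (./ed_2.1-opt) and the thread count are assumptions.

# 1. Build with OpenMP directives enabled, e.g. add -fopenmp
#    to your build's Fortran flags.

# 2. (optional) Raise the stack limit -- OpenMP threads carry private stacks.
ulimit -s unlimited 2>/dev/null || true

# 3. Choose how many shared-memory cores to use (must fit on one node).
export OMP_NUM_THREADS=4

# 4. Run as usual.
# ./ed_2.1-opt
echo "running with $OMP_NUM_THREADS threads"
```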

This release is experimental for the time being. If you encounter trouble, crashes, or poor reproducibility of previous work, revert to commit 2a5d68e

i.e.:

git checkout 2a5d68e

@apourmok
Contributor

apourmok commented Mar 9, 2015

Hi Ryan,
I pulled your latest changes from the mainline. I am able to compile and run the model on my main branch, but in the management branch I am currently working in, I can't compile the model since the pull. Here is the error I get; any thoughts?

Error: Unclassifiable statement at (1)
canopy_struct_dynamics.f90:392.53:

     csite%rough(ipa) = snow_rough * snowfac_can
                                                 1

Warning: Nonconforming tab character at (1)
canopy_struct_dynamics.f90:591.53:

     csite%rough(ipa) = snow_rough * snowfac_can
                                                 1

Warning: Nonconforming tab character at (1)
canopy_struct_dynamics.f90:600.85:

urf_rough = soil_rough * (1.0 - snowfac_can)
&
1
Warning: Nonconforming tab character at (1)
canopy_struct_dynamics.f90:779.9:

     lad(:) = 0.0
     1

Error: Unclassifiable statement at (1)
canopy_struct_dynamics.f90:821.27:

              lad(k) = lad(k) + ladcohort
                       1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:832.35:

              lad(kapartial) = lad(kapartial)
                               1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:836.35:

              lad(kapartial) = lad(kapartial)
                               1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:839.35:

              lad(kzpartial) = lad(kzpartial)
                               1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:893.27:

              lad(k) = lad(k) + ladcohort
                       1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:904.35:

              lad(kapartial) = lad(kapartial)
                               1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:908.35:

              lad(kapartial) = lad(kapartial)
                               1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:911.35:

              lad(kzpartial) = lad(kzpartial)
                               1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:933.12:

        cdrag   (:) = cdrag1 + 0.5 * cdrag2
        1

Error: Unclassifiable statement at (1)
canopy_struct_dynamics.f90:949.67:

           cdrag   (k)  = cdrag1 + cdrag2 / (1.0 + exp(c3_lad))
                                                               1

Error: Unexpected STATEMENT FUNCTION statement at (1)
canopy_struct_dynamics.f90:950.32:

           pshelter(k)  = 1.
                            1

Error: Unexpected STATEMENT FUNCTION statement at (1)
canopy_struct_dynamics.f90:952.47:

           cumldrag(k)  = ldga_bk + lyrhalf
                                           1

Error: Unexpected STATEMENT FUNCTION statement at (1)
canopy_struct_dynamics.f90:958.17:

        cdrag   (:) = cdrag0
             1

Error: 'cdrag' at (1) is not a variable
canopy_struct_dynamics.f90:971.53:

           pshelter(k)  = 1. + alpha_m97 * lad(k)
                                                 1

Error: Unexpected STATEMENT FUNCTION statement at (1)
canopy_struct_dynamics.f90:973.47:

           cumldrag(k)  = ldga_bk + lyrhalf
                                           1

Error: Unexpected STATEMENT FUNCTION statement at (1)
canopy_struct_dynamics.f90:1049.61:

        windlyr(k) = max(ugbmin, uh * exp(- nn * nddfun))
                                                         1

Error: Unexpected STATEMENT FUNCTION statement at (1)
canopy_struct_dynamics.f90:1197.91:

                    ,csite%veg_displace(ipa),zzmid(k),csite%rough(ipa))
                                                                       1

Error: Unexpected STATEMENT FUNCTION statement at (1)
canopy_struct_dynamics.f90:1339.7:

end associate
   1

Error: Expecting END SUBROUTINE statement at (1)
canopy_struct_dynamics.f90:1588.6:

  associate(                               &
  1

Error: Unclassifiable statement at (1)
canopy_struct_dynamics.f90:2063.9:

     lad8(:) = 0.d0
     1

Error: Unclassifiable statement at (1)
canopy_struct_dynamics.f90:2106.28:

              lad8(k) = lad8(k) + ladcohort
                        1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:2117.36:

              lad8(kapartial) = lad8(kapartial)
                                1

Error: Statement function at (1) is recursive
canopy_struct_dynamics.f90:2121.36:

              lad8(kapartial) = lad8(kapartial)
                                1

Error: Statement function at (1) is recursive
Fatal Error: Error count reached limit of 25.
make[1]: *** [canopy_struct_dynamics.o] Error 1
make[1]: Leaving directory `/usr2/postdoc/apourmok/ED2-1/ED/build/bin'
make: *** [all] Error 2

Afshin Pourmokhtarian, Ph.D.
Postdoctoral Research Associate
Dietze Ecological Forecasting Lab
Boston University
Department of Earth & Environment, Rm 130
685 Commonwealth Avenue
Boston, MA 02215

@rgknox
Contributor Author

rgknox commented Mar 10, 2015

thanks Afshin, looking into this now


@apourmok
Contributor

Thanks Ryan. I did some research on "Error: Unclassifiable statement at (1)" and it seems it could be related to compilation options/flags.
On a separate note, my model crashes when I run it with the hybrid integrator; it crashes about two months into the run. Should I post the error here or under the hybrid integrator issue on GitHub?


@rgknox
Contributor Author

rgknox commented Mar 10, 2015

Is it possible that your version of Fortran does not like the associate statement? This is a new type of statement that we have not had in the code until now.

I think it might be part of a more recent Fortran standard. I only put it in because it helps with readability, but it might be problematic for portability.
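For reference, a minimal illustration of the construct in question. The `associate` block (Fortran 2003) creates a local alias for a longer expression; the variable names below are made up for the example, not taken from ED2:

```fortran
! Minimal sketch of the Fortran 2003 ASSOCIATE construct.
! Names here (rough, ipa, r) are illustrative, not from ED2.
program assoc_demo
   implicit none
   real    :: rough(3)
   integer :: ipa

   rough = 0.0
   ipa   = 2

   ! Inside the block, "r" is an alias for rough(ipa);
   ! assigning to r updates the underlying array element.
   associate (r => rough(ipa))
      r = 0.05
   end associate

   print *, rough(ipa)   ! prints 0.05
end program assoc_demo
```

Compilers predating Fortran 2003 support (e.g. older gfortran releases) reject the construct, which matches the cascade of "Unclassifiable statement" errors above.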

@apourmok
Contributor

I thought that might be the case, but surprisingly, when I pull the SMP into my mainline, I am able to compile it. Having said that, the only things I changed in my management branch are adding new western PFTs and a few functions for logging and planting, so I am confused about why it throws an error.


@rgknox
Copy link
Contributor Author

rgknox commented Mar 10, 2015

Maybe different compile flags in your branch?

Regarding the crash with the hybrid, I think this would make a good new issue. Keep in mind that while the hybrid integrator is fast, there are a few underlying issues. 1) The forward stepping in the hybrid is achieved via a simple Euler step, which is not ideal, so it is susceptible to both error and instability. 2) The backward step on leaves uses temperature as the state variable rather than enthalpy. Enthalpy is ideal because it allows a smooth transition through phase changes; here, instead, phase change is diagnosed from prognostic temperatures. The result is rapid step-like drops or gains in energy when crossing 0 degrees. It would be possible to rewrite the state variables to be enthalpy (or internal energy) instead of temperature, but this was an oversight during my thesis work, and since my thesis work was tropical, I never had time to get back to it.


@apourmok
Contributor

Hi Ryan,
It seems the problem with SMP on the BU geo cluster is an old version of the Fortran compiler, as you said. However, when I load the gcc/4.8.1 module and try to compile again, I get a new error (see below). I found on Stack Overflow that I can get around this problem by adding "-fno-whole-file" to my compilation flags (http://stackoverflow.com/questions/21307765/gfortran-attribute-that-requires-an-explicit-interface-for-this-procedure), but then I get another new error (see the bottom, after -----). Any ideas?

Error: Dummy argument 'cgrid' of procedure 'soil_default_fill' at (1) has
an attribute that requires an explicit interface for this procedure

ed_init.f90:465.29:

     call print_soil_info(edgrid_g(igr),igr)
                         1

Error: Dummy argument 'cgrid' of procedure 'print_soil_info' at (1) has an
attribute that requires an explicit interface for this procedure

make[1]: *** [ed_init.o] Error 1
make[1]: Leaving directory `/usr2/postdoc/apourmok/ED2-1/ED/build/bin'
make: *** [all] Error 2


Fatal Error: Cannot read module file 'hdf5.mod' opened at (1), because it
was created by a different version of GNU Fortran

make[1]: *** [hdf5_coms.o] Error 1
make[1]: Leaving directory `/usr2/postdoc/apourmok/ED2-1/ED/build/bin'
make: *** [all] Error 2

Thanks,
Afshin


@rgknox
Contributor Author

rgknox commented Mar 12, 2015

I went ahead and removed the associate statements, replacing the aliases with their original variables. This change should make the code compliant with your original compilers.

Perhaps we can discuss as a community during our next get-together whether we want to embrace the more recent Fortran standards for future releases.

@apourmok
Contributor

Thanks Ryan. It is working now.
We definitely need to talk about this issue in the next ED2 call/meeting.


@crollinson
Contributor

@rgknox

I just tried running the SMP version with the PalEON stuff and 5 out of 6 runs have crashed between 15 and 30 years into the simulations. The error I'm getting is pasted below. Sometimes the top function is [...]mmean_vars instead of dmean, but the rest is the same. I haven't tried digging into it yet and figured I'd ask you to see if you know what's going on first. I'm running things with the hybrid integrator and the new CBR_SCHEME = 0


Program received signal 8 (SIGFPE): Floating-point exception.

Backtrace for this error:

  /lib64/libc.so.6(+0x326a0) [0x2aad5f24e6a0]
  function __average_utils_MOD_integrate_ed_dmean_vars (0xD45100) at line 2229 of file average_utils.f90
  function ed_output_ (0xE5C12A) at line 87 of file edio.f90
  function ed_model_ (0x50B5EE) at line 444 of file ed_model.f90
  function ed_driver_ (0x434851) at line 274 of file ed_driver.f90
  in the main program at line 157 of file edmain.f90
  /lib64/libc.so.6(__libc_start_main+0xfd) [0x2aad5f23ad5d]

/var/spool/sge/scc-dd1/job_scripts/6758981: line 13: 87730 Quit (core dumped) ./ed_2.1-opt
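As a debugging aside (these flags are not from the thread, but are standard gfortran options): a SIGFPE like the one above can often be localized by building a debug binary that traps the first bad floating-point operation at its source, rather than letting the bad value propagate into the averaging routines:

```shell
# Sketch of gfortran debug flags for localizing a SIGFPE; adapt the
# variable name to whatever the ED2 build system actually uses.
#   -ffpe-trap=...   abort at the first invalid op / divide-by-zero / overflow
#   -fbacktrace      print a backtrace when the trap fires
#   -finit-real=snan poison uninitialized reals so they trap on first use
#                    (useful if the culprit is an uninitialized variable)
FFLAGS_DEBUG="-g -O0 -fbacktrace -ffpe-trap=invalid,zero,overflow -finit-real=snan"
echo "$FFLAGS_DEBUG"
```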

@crollinson
Contributor

Quick update @rgknox:

I don't know if this helps at all, but all 6 models have now crashed with the SIGFPE error. Four of the six reference a " * frqsum_o_daysec" line in average_utils; the other two reference " * ndaysi".

Thoughts?

@rgknox
Contributor Author

rgknox commented Mar 18, 2015

I can confirm similar problems, although I can get stable results using my local branch, which has a small set of differences from master. I hope to identify the culprit soon. It's possible this has something to do with the removal of the associate statement in the last commit, or perhaps it is related to how the snowfac changes were applied/merged. Sorry all, we will get this patched soon.

@crollinson
Contributor

Interesting that you mention snowfac... I just had my non-SMP ED (with CBR changes) crash (SIGFPE error 8) with it tracing back to snowfac in the radiate driver (line 757). That's the first time in about 2,000 years of runs with the normal version, but maybe I'm the one to blame... (sorry!)

@crollinson
Contributor

Sorry to flood the comments, but it looks like the error being tied to snowfac is likely. All of the SMP errors were tied to par_level variables.

At least this time, it doesn't seem to be a snow issue as most of my errors are being thrown in non-winter months.

@apourmok
Contributor

@crollinson did you change some part of the code in radiate_driver as part of your snow fix?

@crollinson
Contributor

@apourmok Nope. I steered clear of that one.

@mpaiao
Contributor

mpaiao commented Mar 18, 2015

@crollinson I remember seeing problems with frqsum_o_daysec and ndaysi, and I think it was related to -Q- files (or -Q- files turned off) that would cause division by 0. I thought we had fixed it, but maybe we didn't fix everything... Could you share the ED2IN that caused the problem so I check the configurations that created the problem? Did the problem occur right at the beginning, or at the beginning of a new month?

@crollinson
Contributor

@mpaiao It didn't crash right at the beginning. On the BU server there's a lag before output gets written to the out file, so I'm not exactly sure if it crashed at the beginning of a new month. Q files are turned off. My ED2IN files as well as the crash logs can be found in one of my github repositories: https://github.com/crollinson/ED_Processing/tree/master/spin_finish_smp Keep in mind that the last date in the log is not necessarily the date of the crash.

Restarting from a history file gets me past the crash point, so maybe it's at least partially a problem with an uninitialized variable?

@mpaiao
Contributor

mpaiao commented Mar 19, 2015

@crollinson It seems the crash is always happening when it's integrating these par_level_diffu/par_level_diffd variables. The radiation code has some substantial differences from the version I updated last time (there used to be par_level_diff only), but I checked the usual places where variables should be initialised and nothing stood out. Maybe the value is becoming too large and eventually overflows when average_utils accumulates it over the month? It may be worth checking the values of these variables in the -E- files the code generated before it crashed, I think they should be always between 0 and 1, unless their definition has changed.

@rgknox
Contributor Author

rgknox commented Mar 19, 2015

I was able to remove the crash by reverting to the previous %snowfac
formulation, but:

the trouble may specifically involve line 489 in rk4_derivs.f90:

   rk4aux(ibuff)%h_flux_g(mzg+1) = - avg_th_cond                                   &
                                 * ( initp%sfcwater_tempk(1)                       &
                                   - initp%soil_tempk(mzg) )                       &
                                 / ( 5.d-1 * initp%sfcwater_depth(1) - slzt8(mzg) )

It was when I changed this line back to the original that things started
working again.


@crollinson
Contributor

Thanks @rgknox. That's unfortunate, but not surprising that's where the problem is coming from. I'll take a closer look this afternoon, but the way the soil, snow, and air were interacting was causing major problems in the northeast. I think I'd tried reverting this spot back to the original, and it was one of the key spots that made snow break. I'll admit, though, I got turned around as to where the problem was coming from.

I can spend some time on it probably tomorrow morning (maybe this afternoon), but maybe @mpaiao could take a look and double check places I've changed and argue the case for reverting them?

@crollinson
Contributor

@rgknox I just tried setting this line back to the old version where h_flux_g(mzg+1) is scaled by snowfac & it made things worse, not better in my branch.

@mpaiao this is a spot where I could follow your logic & it makes sense, but it causes really weird fluxes in my snow layers and getting rid of snowfac in that statement made the hflux in the first snow layer sensible.

@mpaiao
Contributor

mpaiao commented Mar 19, 2015

@crollinson this is rarely relevant in the tropics (snow shows up only as short-lived puddles), so if removing it improves results in snowy areas, then I'm totally fine with getting rid of snowfac. I don't think it violates energy conservation either, which would be my only concern.

@crollinson
Contributor

I've done a couple more tests with things that have made snow more stable in the past, and I really don't think the problem is rooted in snowfac. In the snowfac tweaking, I've gotten a ton of other SIGFPE errors (with no change in frequency), and a couple were not in average_utils. Every time it ties back to a line group with par_level_diffu, par_level_diffd, or par_level_beam. I'm not sure when/why these came into the mainline, but they weren't in the version I was using for the CBR fixes, so I'm having a hard time tracking down what's going on with them. Any thoughts?

@crollinson
Contributor

And yet another update: I was going through all of the output printed during compilation and found 5 warnings for potentially uninitialized variables. I haven't tracked each of them down thoroughly and would appreciate it if anybody who knows about these sections could chime in. The warnings are (in order of what I think are potential breaking points):

  1. rk4_misc.f90: In function 'adjust_sfcw_properties':
     rk4_misc.f90:1565: warning: 'depth_available' may be used uninitialized in this function
     (This might be tied to the snow problems I've been dealing with. This bug exists in much older versions of ED (c. 2013 at least), but may have been less of an issue until the more recent change in snowfac, depending on the order in which certain things are done. That's just a hunch at this point, but could fit in with the par stuff as well.)

  2. ed_state_vars.f90: In function 'copy_sitetype_mask':
     ed_state_vars.f90:8795: warning: 'i' may be used uninitialized in this function

  3. ed_read_ed21_history.F90: In function 'read_ed21_history_file':
     ed_read_ed21_history.F90:447: warning: 'si_index' may be used uninitialized in this function

  4. heun_driver.f90: In function 'heun_stepper':
     heun_driver.f90:807: warning: 'combh' may be used uninitialized in this function
     (note: I've been running with the hybrid driver, so this is almost certainly not the issue)

  5. events.f90: In function 'event_irrigate':
     events.f90:649: warning: 'soil_temp' may be used uninitialized in this function
     (note: this shouldn't occur in my runs at all, so this is almost certainly not the issue)
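As a hedged sketch of how to collect such warnings from a build log (make.log is an assumed filename; the two sample lines below are fabricated so the command is self-contained):

```shell
# Fabricate a two-line sample build log, then filter it the same way one
# would filter a real captured compiler log (e.g. `make ... 2>&1 | tee make.log`).
printf "%s\n" \
  "rk4_misc.f90:1565: warning: 'depth_available' may be used uninitialized in this function" \
  "rk4_misc.f90:1570: note: some unrelated message" > make.log
grep "may be used uninitialized" make.log
```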

@crollinson
Contributor

I've tracked down the source of the depth_available uninitialization (#1 above). It actually dates back to @mpaiao in Jan 2012 (Jan 5). I've tried to adapt it to how things work now, based on my best guess of what was going on in the version before that commit. What I have now is:

!---------------------------------------------------------------------------------!
!     There is not enough water vapour.  Dry down to the minimum, and hope for    !
! the best.                                                                       !
!---------------------------------------------------------------------------------!
energy_available = wmass_available * (alvi8 - fracliq_needed * alli8)
depth_available  = wmass_available * ( fracliq_needed          * wdnsi8           &
                                     + (1.d0 - fracliq_needed) * fdnsi8 )
!---------------------------------------------------------------------------------!

For now, I'm letting energy_available be overwritten by what is already in the code (energy_available = wmass_available * energy_needed / wmass_needed).

@mpaiao, since you're the one that made this change, could you double check the new depth_avail initialization and see if it makes sense? The old version was:
depth_available = wmass_available * ( initp%soil_fracliq(nzg)          * wdnsi8   &
                                    + (1.d0 - initp%soil_fracliq(nzg)) * fdnsi8 )

@crollinson
Contributor

While the uninitialized variables still need to be sorted out, fixing the snow_depth issue alone has not fixed the SMP crashes. Everything continues to point back to whatever was done to introduce par_level_diffu/par_level_diffd

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

I will look into that; those are my diagnostics.

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

Christy, what radiation scheme are you using? It will help me track this down. ICANRAD = ?

@crollinson
Contributor

I'm running icanrad=0.

All of my ED2INs with my settings can be found in one of my github repos: https://github.com/crollinson/ED_Processing/tree/master/spin_finish_smp


@rgknox
Contributor Author

rgknox commented Mar 20, 2015

I am having trouble reproducing errors regarding par_level variables, is there any crash report info you could provide, like tracebacks etc?

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

I am thinking we should really deprecate ICANRAD=0 anyway, Marcos and I have both gone through the code and theory for two-stream with a fine-tooth comb, and get more sensible and consistent answers with the updated two-stream (ICANRAD=2).
Christy, could you test with ICANRAD=2 and ICANRAD=1 if possible?

@crollinson
Contributor

There are examples of the first couple crashes in that folder I linked to above. Ryan, I'll start an ICANRAD=2 run now.

If you think it might be a problem with my starting conditions, I just uploaded my .pss & .css files that you could try running. https://github.com/crollinson/ED_Processing/tree/master/phase1a_spinup.v2

These were created from an SAS solution after 150 years of a non-SMP run with disturbance off (those ED2INs are also on my github), which shouldn't affect how things run from an initial start, but I suppose it's possible.

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

I found and fixed a potential bug that may be your problem with the par_level variables. The history reads were not including those variables; I have updated this. I will branch off your master, commit, and send you a pull request.


@crollinson
Contributor

Fantastic! Thanks Ryan!

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

For some reason I could not branch off your master, so I put the changes into the mainline. Could you try merging the changes into your local branch, Christy?

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

I went back to archives from 2012 and then 2011, when the Heun integrator implementation was in its infancy, and never found an instance where the local variable "combh" was initialized before it was used.

I personally have no experience using the Heun integrator, and unless someone is invested in this option, I propose we just disable it until that person steps up.

crollinson added a commit to crollinson/ED2 that referenced this issue Mar 20, 2015
par_level fixes per @rgknox, issue EDmodel#30

(git ED2 master pull request EDmodel#38)
@crollinson
Contributor

Thanks Ryan! I was able to pull the mainline into my branch and have things running now. I'm about 15 years in at 3 sites and so far so good. I'll let you know how things turn out.

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

OK, I will pull a clone and test as well

@crollinson
Contributor

@rgknox SMP is still a no-go for me. Currently running ICANRAD = 1; one run only made it 4 years, with the backtrace as follows:

Program received signal 8 (SIGFPE): Floating-point exception.

Backtrace for this error:

  • /lib64/libc.so.6(+0x326a0) [0x2b531e66d6a0]
  • function __average_utils_MOD_integrate_ed_dmean_vars (0xD44DB2)
    at line 2223 of file average_utils.f90
  • function ed_output_ (0xE5C1FE)
    at line 87 of file edio.f90
  • function ed_model_ (0x50B5EE)
    at line 444 of file ed_model.f90
  • function ed_driver_ (0x434851)
    at line 274 of file ed_driver.f90
  • in the main program
    at line 157 of file edmain.f90
  • /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b531e659d5d]
    /var/spool/sge/scc-pg6/job_scripts/6829264: line 13: 113752 Quit (core dumped) ./ed_2.1-opt

another run made it 30 years, but still got SIGFPE failures tied to the par_level vars:

Program received signal 8 (SIGFPE): Floating-point exception.

Backtrace for this error:

  • /lib64/libc.so.6(+0x326a0) [0x2b63c7d416a0]
  • function __fuse_fiss_utils_MOD_fuse_2_cohorts (0x95561F)
    at line 1608 of file fuse_fiss_utils.f90
  • function __fuse_fiss_utils_MOD_fuse_cohorts (0x9975B8)
    at line 740 of file fuse_fiss_utils.f90
  • function reproduction_ (0xEFDB19)
    at line 397 of file reproduction.f90
  • function vegetation_dynamics_ (0xC3BD5E)
    at line 91 of file vegetation_dynamics.f90
  • function ed_model_ (0x50B441)
    at line 398 of file ed_model.f90
  • function ed_driver_ (0x434851)
    at line 274 of file ed_driver.f90
  • in the main program
    at line 157 of file edmain.f90
  • /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b63c7d2dd5d]
    /var/spool/sge/scc-pg6/job_scripts/6829262: line 13: 112922 Quit (core dumped) ./ed_2.1-opt

Is there maybe some sort of min/max bound to keep the number from getting too small or something of the sort?
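As a toy illustration of the overflow hypothesis raised above (a sketch in awk; the numbers are made up and this is not ED code), note how a shrinking normalization denominator inflates the normalized level long before the division itself traps:

```shell
# Toy illustration (not ED code): par_level_* divides a cohort-centered flux
# by (par_diff_norm + par_beam_norm); as that denominator shrinks toward
# zero, the normalized value blows up and can eventually trip a SIGFPE.
awk 'BEGIN {
  swd_mid = 0.5 * (0.6 + 0.4)                 # made-up cohort-centered flux
  for (norm = 1.0; norm > 1.0e-13; norm /= 1.0e4)
    printf "norm=%.0e  par_level=%.3e\n", norm, swd_mid / norm
}'
```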

@mpaiao
Contributor

mpaiao commented Mar 20, 2015

The other items:

  1. ed_state_vars.f90: I checked the code and it looks fine; maybe it is complaining about this line:
     allind = (/ (i,i=1,isize) /)
     Kind of ugly, but I don't think it is wrong.

  2. ed_read_ed21_history.F90: this is likely to be a bug. I went back to a version from 2012, and I think the block near line 447 should be:

           csite => cpoly%site(isi)
           !------ Calculate the index of this site's data in the HDF. ----------------!
           si_index = pysi_id(py_index) + isi - 1
           if (sipa_n(si_index) > 0) then

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

Cohort fusion is not including the qmean, mmean, and dmean averages of the par_level diagnostics; that is a problem I am writing a fix for, but I would not expect it to be the cause of a crash. I will submit another mini commit to the master and keep looking.


@crollinson
Contributor

@rgknox I have an idea on why it's bonking and why it's a stochastic thing.

I keep coming back to lines like this (889-899 in multiple_scatter):
!------ Integrate the visible light levels. --------------------------------------!
!  NEEDS TO BE CHECKED (PARTICULARLY THE UPWARD)                                   !
!  THIS SHOULD BE THE LEVEL (COHORT) CENTERED FLUX OF PAR                          !
do i=1,ncoh
   ip1 = i + 1
   im1 = i - 1
   par_level_diffd(i) = 5.d-1 * (swd(i) + swd(ip1)) / (par_diff_norm + par_beam_norm)
   par_level_diffu(i) = 5.d-1 * (swu(i) + swu(im1)) / (par_diff_norm + par_beam_norm)
   par_level_beam (i) = 5.d-1 * (beam_down(i) + beam_down(ip1)) / (par_diff_norm + par_beam_norm)
end do
!---------------------------------------------------------------------------------!

Could it have to do with something being off with i+1 or i-1? It looks like swd & swu are okay, but changing the numbers of those & not initializing the values right would explain the random nature of the crashes I'm seeing.

@crollinson
Contributor

FYI, I'm now running a test with the ED2 mainline version (no changes to CBR) to make sure the problem really is there and not a weird artifact in my branch that's causing all of these issues.

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

The ip1 and im1 thing looks like it should be OK. I did a double check, and the code seems logical there. Note that the variable swu is allocated to allow a zero index, so when i=1 and im1=0, this should be fine. I do wonder if some compilers have issues with this, though.

Do you get the same issues with ICANRAD=2?

We are close! I'm sorry the par_level variables are being such a pain. They are pretty useful, though, because they are directly comparable to PAR flux sensors at different heights in a canopy.


@crollinson
Contributor

I guess one other thing to note is that my PFT settings are quite dramatically different from the ED defaults, and that could definitely be impacting PAR things. It doesn't explain the randomness of the errors, though. If you want to check out those settings anyway, they're also on github: https://github.com/crollinson/ED_Processing/blob/master/PalEON_Phase1a.v2.xml

Thanks for being so responsive and helping me figure out what's going wrong! Once we get SMP fully working, you'll be the hero of the PalEON team for speeding up our millennial runs so much.

@rgknox
Contributor Author

rgknox commented Mar 20, 2015

I took a quick look at one of your ED2INs, Christy:

https://raw.githubusercontent.com/crollinson/ED_Processing/master/spin_finish_ED2IN/ED2IN.PBL

One thing I noticed is that you have a relatively large timestep set for DTLSM (900) when using the hybrid integration (INTEGRATION_SCHEME=3). This may be a cause of stability problems. Although I don't think it would generate a seg fault or the things we are currently troubleshooting, I would be prepared to reduce this timestep if the model generates any complaints regarding its various self-checks. For my most recent research runs, for instance, I used a DTLSM of 180 in the tropics. Also try the RK4 integration if you have stability problems; that method (while potentially slower) forces the integration to meet an error criterion, which the hybrid does not (it can't).


@crollinson
Contributor

I just confirmed that I get the same errors with the GitHub mainline branch, with both ICANRAD = 2 and ICANRAD = 1.

I haven't been having stability issues and my gh24 pre-SMP branch is working fine, but I'll keep in mind bumping the timestep down if I start encountering issues.

@crollinson
Contributor

Pretty sure I just found the problem!!! The par_level variables were missing from ed_type_init.f90

Made the changes and am pulling them into my branch for testing, but this would explain everything.

@DanielNScott
Contributor

Hey Christy, did it turn out that was the issue? Are your SMP runs stable now with the hybrid integrator? If so, can you run w/ DTLSM > 180?

@crollinson
Contributor

Hey @DanielNScott . SMP with the Hybrid Integrator has been working fantastically for me. I completed a set of my 12000 yr runs at 6 northeastern sites over the weekend (~48 hrs at DTLSM = 900) and had one stability interruption that then went through just fine when I dropped DTLSM from 900 to 600 (15 min to 10).

I'm redoing them with the version from yesterday's updated pull request just to double check, but the numbers I got from the first run and the spin I just finished look completely reasonable based on quick glances at snapshots (no thorough analysis yet).

@rgknox
Contributor Author

rgknox commented Mar 25, 2015

The current EDmodel master branch has not shown any signs of instability during limited testing, so I am ever hopeful that this, along with Christy's new CBR changes (to be merged any moment now), will be considered a stable release and recommended for production/research runs. My testing cluster, edison.nersc.gov, was down for maintenance this morning, but I'm getting really excited to test everything out!


@apourmok
Contributor

apourmok commented Apr 2, 2015

Just want to report an issue with SMP. I can compile it with "-fopenmp", but when I try to run it, I get the error below; if I remove -fopenmp and compile, the model runs. Any thoughts on this?

+--- Parallel info: -------------------------------------+
 - Machnum  = 0
 - Machsize = 1
+--------------------------------------------------------+
 Reading namelist information
 Copying namelist
+------------------------------------------------------------+
|  Ecosystem Demography Model, version 2.2
+------------------------------------------------------------+
|  Input namelist filename is ED2IN
|
|  Single process execution on INITIAL run.
+------------------------------------------------------------+
 => Generating the land/sea mask.
 /projectnb/dietzelab/EDI/oge2OLD/OGE2_HEADER
 -> Getting file: /projectnb/dietzelab/EDI/oge2OLD/OGE2_30N090W.h5...
Segmentation fault

@crollinson
Contributor

I think I just realized what's going on! Before you try running the SMP ED, you need to set it up for multi-threading, which means logging on to the cluster with parallelization on.

To do this on the BU server:

  1. log on to geo
  2. type: qlogin
  3. export OMP_NUM_THREADS=8 (or 16 or whatever)
  4. run ED
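A minimal sketch of steps 2-4 as a shell session (the qlogin command above and the ed_2.1-opt binary name are site-specific assumptions; adjust for your cluster):

```shell
# After qlogin gives you an interactive node:
ulimit -s unlimited 2>/dev/null || true   # raise the stack limit; each OpenMP thread needs stack
export OMP_NUM_THREADS=8                  # export so the ED binary actually sees it
echo "OpenMP threads requested: $OMP_NUM_THREADS"
# ./ed_2.1-opt                            # then run ED as usual
```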


@apourmok
Contributor

apourmok commented Apr 2, 2015

These are my compilation flags:

Compile flags ------------------------------------------------

CMACH=PC_LINUX1
F_COMP=mpif90
#F_COMP = gfortran
#F_OPTS= -V -FR -O2 -recursive -static -Vaxlib -check all -g -fpe0 -ftz -debug extended \
#        -debug inline_debug_info -debug-parameters all -traceback -ftrapuv

#F_Opts= -03
F_OPTS= -g -Wall -W -ffpe-trap=invalid,zero,overflow -Wconversion -fbounds-check -fbacktrace -fdump-core -fopenmp
C_COMP=mpicc
#C_OPTS= -O2 -DLITTLE -g -static -traceback -debug extended
C_OPTS = -03 -DLITTLE
LOADER=mpif90
#LOADER = gfortran
LOADER_OPTS=${F_OPTS}
C_LOADER=mpicc
#C_LOADER_OPTS=-v -g -traceback -static
LIBS=
MOD_EXT=mod

MPI Flags ----------------------------------------------------

MPI_PATH=
PAR_INCS=
PAR_LIBS=
PAR_DEFS=-DRAMS_MPI

@rgknox
Contributor Author

rgknox commented Apr 2, 2015

Did you set your stack limit to unlimited?
On the run node:

ulimit -s unlimited

@apourmok
Contributor

apourmok commented Apr 3, 2015

Thanks @crollinson that was the issue and now it works.
