202204 release - exited on signal 11 (Segmentation fault) #11

Closed
StevePny opened this issue May 9, 2022 · 11 comments

@StevePny

StevePny commented May 9, 2022

Running on Linux (Ubuntu, GNU) in a Docker container:

I'm still working on getting the 202204 release to run successfully (i.e., to at least roughly replicate the pre-202204 version), and I'm currently hitting the segmentation fault below. An earlier segmentation fault was fixed by building the FMS package from the 'main' branch of the FMS repo. I've symlinked aerosol.txt, solarconstant_noaa_an.txt, co2historicaldata_*.txt and a few other key files into the INPUT/ directory (from their previous location in the main experiment directory):

  Updating solar constant with cycle approx
    Opened solar constant data file: INPUT/solarconstant_noaa_an.txt 
  CHECK: Solar constant data used for year        2020   1361.0400000000000        1361.0400000000000     
0 FORECAST DATE          26 AUG.  2020 AT 12 HRS  0.00 MINS
  JULIAN DAY             2459088  PLUS   0.000000
  RADIUS VECTOR          1.0104738
  RIGHT ASCENSION OF SUN  10.3754267 HRS, OR  10 HRS  22 MINS  31.5 SECS
  DECLINATION OF THE SUN  10.1408708 DEGS, OR   10 DEGS   8 MINS  27.1 SECS
  EQUATION OF TIME        -1.7063098 MINS, OR   -102.38 SECS, OR-0.007466 RADIANS
  SOLAR CONSTANT        1332.9711572 (DISTANCE AJUSTED)


    for cosz calculations: nswr,deltim,deltsw,dtswh =           8   450.00000000000000        3600.0000000000000        1.0000000000000000        anginc,nstp =   3.2724923474893676E-002           9
    Opened aerosol data file: INPUT/aerosol.dat               
   --- Reading  MONTH OF AUGUST    CLIMATOLOGICAL AEROSOL GLOBAL DISTRIBUTION                  
    Request volcanic date out of range, optical depth set to lowest value
  CHECK: Sample Volcanic data used for month, year:           8        2020
           1           1           1           1
    Opened co2 data file: INPUT/co2historicaldata_2020.txt
        2020  MONTHLY CO2 (PPMV)   24  12  LON/LAT (N-S/0-360E) IN 15 DEGREE RESOLUTION,  GLB ANNUAL MEAN =   412.81000000000000        GROWTH RATE =   2.5200000000000000     
    Global annual mean CO2 data for year        2020   4.1281000000000000E-004
  CHECK: Sample of selected months of CO2 data used for year:        2020
         Month =           1
   4.1894999999999996E-004   4.1873000000000002E-004   4.1708999999999995E-004   4.1537999999999997E-004   4.1341000000000001E-004   4.1173000000000002E-004   4.1005000000000002E-004   4.0923000000000001E-004   4.0920999999999997E-004   4.0912999999999995E-004   4.0892000000000001E-004   4.0863000000000000E-004
         Month =           4
   4.2148000000000001E-004   4.1961000000000000E-004   4.1841000000000003E-004   4.1831999999999997E-004   4.1779000000000002E-004   4.1539999999999996E-004   4.1255999999999997E-004   4.1018000000000001E-004   4.1001999999999998E-004   4.0969999999999998E-004   4.0936999999999999E-004   4.0924000000000001E-004
         Month =           7
   4.0852999999999994E-004   4.0848000000000002E-004   4.0861000000000001E-004   4.0970999999999998E-004   4.1144000000000000E-004   4.1177999999999994E-004   4.1160999999999997E-004   4.1099999999999996E-004   4.1077999999999997E-004   4.1047000000000002E-004   4.1013999999999997E-004   4.1000999999999999E-004
         Month =          10
   4.1172000000000002E-004   4.1114999999999994E-004   4.1237999999999995E-004   4.1209999999999999E-004   4.1077999999999997E-004   4.1110000000000002E-004   4.1175999999999995E-004   4.1212999999999997E-004   4.1164999999999995E-004   4.1120999999999996E-004   4.1104999999999999E-004   4.1089999999999996E-004
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node e90980d4b77e exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

For verification, I also tried running the regional_Laura test case and got a similar error:

   Updating solar constant with cycle approx
    Opened solar constant data file: INPUT/solarconstant_noaa_an.txt 
  CHECK: Solar constant data used for year        2020   1361.0400000000000        1361.0400000000000     
0 FORECAST DATE          26 AUG.  2020 AT 12 HRS  0.00 MINS
  JULIAN DAY             2459088  PLUS   0.000000
  RADIUS VECTOR          1.0104738
  RIGHT ASCENSION OF SUN  10.3754267 HRS, OR  10 HRS  22 MINS  31.5 SECS
  DECLINATION OF THE SUN  10.1408708 DEGS, OR   10 DEGS   8 MINS  27.1 SECS
  EQUATION OF TIME        -1.7063098 MINS, OR   -102.38 SECS, OR-0.007466 RADIANS
  SOLAR CONSTANT        1332.9711572 (DISTANCE AJUSTED)


    for cosz calculations: nswr,deltim,deltsw,dtswh =           8   450.00000000000000        3600.0000000000000        1.0000000000000000        anginc,nstp =   3.2724923474893676E-002           9
    Opened aerosol data file: INPUT/aerosol.dat               
   --- Reading  MONTH OF AUGUST    CLIMATOLOGICAL AEROSOL GLOBAL DISTRIBUTION                  
    Request volcanic date out of range, optical depth set to lowest value
  CHECK: Sample Volcanic data used for month, year:           8        2020
           1           1           1           1
    Opened co2 data file: INPUT/co2historicaldata_2020.txt
        2020  MONTHLY CO2 (PPMV)   24  12  LON/LAT (N-S/0-360E) IN 15 DEGREE RESOLUTION,  GLB ANNUAL MEAN =   412.81000000000000        GROWTH RATE =   2.5200000000000000     
    Global annual mean CO2 data for year        2020   4.1281000000000000E-004
  CHECK: Sample of selected months of CO2 data used for year:        2020
         Month =           1
   4.1894999999999996E-004   4.1873000000000002E-004   4.1708999999999995E-004   4.1537999999999997E-004   4.1341000000000001E-004   4.1173000000000002E-004   4.1005000000000002E-004   4.0923000000000001E-004   4.0920999999999997E-004   4.0912999999999995E-004   4.0892000000000001E-004   4.0863000000000000E-004
         Month =           4
   4.2148000000000001E-004   4.1961000000000000E-004   4.1841000000000003E-004   4.1831999999999997E-004   4.1779000000000002E-004   4.1539999999999996E-004   4.1255999999999997E-004   4.1018000000000001E-004   4.1001999999999998E-004   4.0969999999999998E-004   4.0936999999999999E-004   4.0924000000000001E-004
         Month =           7
   4.0852999999999994E-004   4.0848000000000002E-004   4.0861000000000001E-004   4.0970999999999998E-004   4.1144000000000000E-004   4.1177999999999994E-004   4.1160999999999997E-004   4.1099999999999996E-004   4.1077999999999997E-004   4.1047000000000002E-004   4.1013999999999997E-004   4.1000999999999999E-004
         Month =          10
   4.1172000000000002E-004   4.1114999999999994E-004   4.1237999999999995E-004   4.1209999999999999E-004   4.1077999999999997E-004   4.1110000000000002E-004   4.1175999999999995E-004   4.1212999999999997E-004   4.1164999999999995E-004   4.1120999999999996E-004   4.1104999999999999E-004   4.1089999999999996E-004
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node d312f888f66b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
@laurenchilutti
Contributor

@StevePny What test are you running that results in this seg fault failure? Is it one of the tests in the CI directory?

@laurenchilutti
Contributor

@StevePny To expand upon my previous comment: Is it one of the tests in the SHiELD_build repository CI directory?

@StevePny
Author

StevePny commented May 10, 2022

@laurenchilutti
The latter case is the regional_Laura case distributed with SHiELD-in-a-box:
https://www.gfdl.noaa.gov/shield/shield-in-a-box/
https://zenodo.org/record/5090124/files/regional_Laura.zip

I was able to install SHiELD and run these example cases (regional_Laura and global_nest_Laura) prior to the 202204 release.

@kaiyuan-cheng
Contributor

@StevePny
I have tested the latest SHiELD code with the regional Laura case. When I built SHiELD natively on a NOAA HPC system, the Laura case worked fine. However, it does not work with the containerized SHiELD, which is very strange.

@kaiyuan-cheng
Contributor

It looks like the NCEP library is causing the crash. The segmentation fault occurs at

call getgb(lugb,lugi,kdata,lskip,jpds,jgds,ndata,lskip,

However, I still don't understand why this is the case. Before this line, another NCEP library routine, getgbh(), works just fine. Also, the same compiler flags and arguments worked previously.

@StevePny
Author

@kaiyuan-cheng just checking in - has any progress been made on clearing up this issue, or should we continue with the pre-202204 version?

@kaiyuan-cheng
Contributor

kaiyuan-cheng commented Jul 19, 2022 via email

@kaiyuan-cheng
Contributor

@StevePny It turns out that the default stack size of 8 MB is too small to hold the large one-dimensional variable lbms. The solution is to set an unlimited stack size.
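
For readers who want to try this outside a container, here is a minimal sketch of checking and raising the stack limit in the shell that launches the model. The helper name raise_stack_limit is hypothetical and not part of SHiELD, and the sketch assumes the hard limit allows an unlimited soft limit; if it does not, the helper only reports the hard limit.

```shell
# Hypothetical helper: report the soft stack limit, try to raise it to
# unlimited, and report the outcome. It runs in the current shell, so any
# change applies to processes started from this shell afterwards.
raise_stack_limit() {
    echo "stack limit before: $(ulimit -s)"
    if ulimit -s unlimited 2>/dev/null; then
        echo "stack limit after: $(ulimit -s)"
    else
        # The soft limit cannot exceed the hard limit.
        echo "could not raise the soft limit; hard limit is $(ulimit -H -s)"
    fi
}

raise_stack_limit
```

The model should then be started from the same shell so that its processes inherit the new limit. In bash, ulimit -s reports the value in units of 1024 bytes, so an 8 MB default typically shows up as 8192.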

@lharris4
Contributor

lharris4 commented Sep 8, 2022 via email

@StevePny
Author

StevePny commented Dec 1, 2022

To provide a clarifying detail:
The Docker container does not inherit the system stack limit by default. The ulimit can be set on the command line when running the container, but 'unlimited' is not a permitted value. To specify an unlimited stack size in the container, add this option:

--ulimit stack=-1

With this setting I can run the regional_Laura_test case on an AWS c6g.8xlarge EC2 instance.

Note: to be safe, I also set the stack size on the EC2 instance itself with:
ulimit -s unlimited
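
As a rough sketch of how the option fits into a full command, the wrapper below passes --ulimit stack=-1 to docker run. The function name run_with_unlimited_stack and the image name in the usage comment are hypothetical placeholders, not names from the SHiELD distribution.

```shell
# Hypothetical wrapper: run a command in a container with the stack limit
# set to unlimited (docker run accepts -1 here, but not the word "unlimited").
# $1 is the image name; the remaining arguments are the command to run.
run_with_unlimited_stack() {
    image="$1"
    shift
    docker run --rm --ulimit stack=-1 "$image" "$@"
}

# Usage (assumed image name): print the stack limit seen inside the container.
# run_with_unlimited_stack my-shield-image sh -c 'ulimit -s'
```

Running ulimit -s inside the container, as in the usage comment, is a quick way to confirm the setting took effect before starting a long model run.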

@lharris4
Contributor

lharris4 commented Dec 1, 2022 via email
