Fix met-8.1 to compile using GNU 6.3.0 and later. #1139

Closed
JohnHalleyGotway opened this issue Jun 7, 2019 · 9 comments
Labels
MET: Library Code priority: blocker Blocker type: bug Fix something that is not working

Comments

@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Jun 7, 2019

SNAT attempted to compile met-8.1 in /usr/local/met-8.1 but was unable to.
See: https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=90507

The Debian stretch release includes GNU 6.3.0 and Debian Buster release has GNU 8.x.

Howard Soh found that even when met-8.1 can be made to compile on 6.3.0 there are still runtime errors! This task is to get MET compiling on the Debian stretch and buster releases and make sure the unit test output matches the output generated by GNU 4.9.2 (i.e. dakota).

Debian buster is available for testing on k15.rap.ucar.edu.

@JohnHalleyGotway
Collaborator Author

I made a corresponding branch from the master_v8.1 bugfix branch:
https://github.com/NCAR/MET/tree/bugfix_1139_GNU

And I'm testing in these locations:
GNU 6.3.0 on mohawk: /d2/projects/MET/MET_releases
GNU 8.3.0 on k15: /d2/MET/MET_releases

I've committed 3 changes to that bugfix branch so far:
(1) Update data_line.cc for GNU 8.3.0 compilation error by changing sstream to sstream.str().
(2) Update enum_to_string.cc for GNU 6.3.0 to avoid seg fault. Need to allocate (n+1) bytes instead of (n).
(3) Update structure of Makefile.am for plot_mode_field. This is the only example of an application which builds a library and compiles applications. In 6.3.0 and earlier, the Makefile produced by autoconf compiled the libraries followed by the applications:
mohawk's Makefile: all-am: Makefile $(LIBRARIES) $(PROGRAMS)
In 8.3.0, the order is swapped in that Makefile:
k15's Makefile: all-am: Makefile $(PROGRAMS) $(LIBRARIES)

To fix this, I got rid of the vx_cgraph library from Makefile.am and instead just linked to the object files directly.

These 3 changes enable MET to compile on mohawk and k15 but there are still runtime errors.
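
Changes (1) and (2) above can be illustrated with a minimal sketch. The helper names make_label and copy_c_string are hypothetical and do not come from data_line.cc or enum_to_string.cc; the sketch only shows the two general patterns involved: extracting text from a string stream with .str(), and reserving room for the terminating '\0' when allocating a C string.

```cpp
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical helper: builds a label in a string stream and returns it
// as a std::string. The stream object itself is not a string, so the
// buffered text is extracted explicitly with .str().
std::string make_label(int line_number) {
    std::ostringstream sstream;
    sstream << "line " << line_number;
    return sstream.str();
}

// Hypothetical helper: copies `text` into a newly allocated C string.
// strlen() does not count the terminating '\0', so the buffer needs
// n + 1 bytes; with only n bytes the terminator is written past the end.
char *copy_c_string(const char *text) {
    const std::size_t n = std::strlen(text);
    char *buffer = new char[n + 1];
    std::memcpy(buffer, text, n + 1);  // copies the terminator as well
    return buffer;                     // caller releases with delete[]
}
```

An off-by-one allocation of this kind is undefined behavior, which may explain why it can go unnoticed with one compiler version and crash with another.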

@JohnHalleyGotway
Collaborator Author

make test errors out immediately:
*** Running ASCII2NC to reformat ASCII point observations into NetCDF ***
../src/tools/other/ascii2nc/ascii2nc
../data/sample_obs/ascii/sample_ascii_obs.txt
../out/ascii2nc/sample_ascii.nc
-v 2
ERROR :
ERROR : CommandLine::is_switch(const char *) const -> empty string!
ERROR :

@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Jun 11, 2019

I updated the CommandLine class to use strings instead of character arrays, fixed the downstream ripple effects, and got MET to compile on both mohawk and k15. Running the full set of unit tests on mohawk required...

  • copying over the input "unit_test" directory (41GB)
  • installing the Perl XML/Parser module
  • manually setting MET_TEST_RSCRIPT
  • updating the Perl code in VxUtil.pm and unit.pl to escape regular expressions with ${
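
As a rough illustration of the CommandLine change mentioned at the start of this comment, here is a sketch of switch detection on std::string arguments. The ArgList class is a hypothetical stand-in and not MET's CommandLine interface, and treating an empty argument as "not a switch" is an assumption made for the example, not a description of the actual fix.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a command line container. MET's real
// CommandLine class has a different interface.
class ArgList {
public:
    explicit ArgList(const std::vector<std::string> &args) : args_(args) {}

    // An argument counts as a switch when it starts with '-' and has at
    // least one more character. An empty string is simply not a switch.
    // (Simplification: a negative number such as "-2" would also match.)
    static bool is_switch(const std::string &arg) {
        return arg.size() > 1 && arg[0] == '-';
    }

    // Counts the switches among the stored arguments.
    std::size_t switch_count() const {
        std::size_t count = 0;
        for (const std::string &arg : args_) {
            if (is_switch(arg)) ++count;
        }
        return count;
    }

private:
    std::vector<std::string> args_;  // each argument owns its own storage
};
```

Because each std::string owns its characters and knows its length, there is no fixed-size buffer to overrun and no terminator to forget.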

I ran into 2 random failures during the unit tests. Both went away after rerunning (which is worrisome). Both had the same error message when reading MET-formatted NetCDF files:

TEST: gen_vx_mask_POLY_SYMDIFF - FAIL - 0.170 sec
terminate called after throwing an instance of 'netCDF::exceptions::NcEdge'
what(): NetCDF: Start+count exceeds dimension bound
file: ncVar.cpp line:1614
Aborted

TEST: plot_data_plane_NC_MET - FAIL - 0.124 sec
terminate called after throwing an instance of 'netCDF::exceptions::NcEdge'
what(): NetCDF: Start+count exceeds dimension bound
file: ncVar.cpp line:1614
Aborted

@JohnHalleyGotway
Collaborator Author

Another runtime error, but this time repeatable. Problem when parsing tc_stat job command options. Updated code to use a string instead of a character pointer.
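
A minimal sketch of the string-based approach, assuming the job command arrives as a single whitespace-separated line. The function name split_job_command is hypothetical and this is not the tc_stat parsing code; it only shows tokens being returned as owned std::string values instead of pointers into a buffer that may be modified or freed later.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a job command line on whitespace and
// returns each option as an owned std::string, so no token refers to
// memory that belongs to the input buffer.
std::vector<std::string> split_job_command(const std::string &job) {
    std::vector<std::string> tokens;
    std::istringstream in(job);
    std::string token;
    while (in >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
```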

@JohnHalleyGotway
Collaborator Author

Updated met_file.cc to search for lat/lon dimensions using strings instead of const char * in the hopes of fixing the intermittent netCDF::exceptions::NcEdge runtime error.
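
One general reason to prefer strings for this kind of lookup: operator== on two const char * values compares addresses, while operator== on std::string compares the characters. Whether that was the actual cause here is not established. The sketch below uses a hypothetical find_dim helper and is not the met_file.cc code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns the index of the dimension called `name`
// in `dim_names`, or -1 when no dimension has that name. std::find uses
// std::string's operator==, which compares contents, not addresses.
int find_dim(const std::vector<std::string> &dim_names,
             const std::string &name) {
    auto it = std::find(dim_names.begin(), dim_names.end(), name);
    if (it == dim_names.end()) {
        return -1;
    }
    return static_cast<int>(it - dim_names.begin());
}
```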

@TaraJensen
Contributor

Charge 2702691

@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Jun 21, 2019

Making somewhat slow but steady progress. Working on mohawk (GNU 6.3.0) and k15 (GNU 8.3.0).
All unit tests now run on mohawk and k15 (except that k15 has no Python installation for testing unit_python.xml).
Remaining diffs are all now acceptable:

(1) All NetCDF files have a header difference. NetCDF4 output files include:
:_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
(2) The output of stat_analysis filter job is in a different order.
(3) 3 png output files from plot_mode_field all differ but look the same.
(4) [k15 only] grid_stat_GFS_FOURIER_240000L_20120410_000000V.stat contains many very small diffs... none larger than 10e-17. Presumably the Fourier decomposition is very slightly different.
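
Differences like those in item (4) are usually judged against a tolerance rather than by exact equality. The following is a minimal sketch of such a check; the function names and the choice of an absolute tolerance are assumptions for illustration and are not part of the MET test scripts.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest absolute element-wise difference between
// two columns of values. Returns -1.0 when the lengths differ, since the
// columns are then not comparable.
double max_abs_diff(const std::vector<double> &a,
                    const std::vector<double> &b) {
    if (a.size() != b.size()) {
        return -1.0;
    }
    double largest = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = std::fabs(a[i] - b[i]);
        if (diff > largest) {
            largest = diff;
        }
    }
    return largest;
}

// True when both columns have the same length and no pair of values
// differs by more than `tolerance`.
bool matches_within(const std::vector<double> &a,
                    const std::vector<double> &b, double tolerance) {
    const double largest = max_abs_diff(a, b);
    return largest >= 0.0 && largest <= tolerance;
}
```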

@JohnHalleyGotway
Collaborator Author

After merging the changes for this bugfix into develop and master_v8.1, the regression tests ran successfully. However, there is a sporadic, non-repeatable failure on mohawk when writing NetCDF output files. The most recent instance was from gen_vx_mask and is listed below. This might be a memory problem, perhaps caused by uninitialized variables somewhere. Will try compiling with Intel's -ftrapuv option to look for uninitialized variables.

/d2/projects/MET/MET_development/MET/met/share/met/../../bin/gen_vx_mask
/d2/projects/MET/MET_development/MET/test_output/gen_vx_mask/DATA_APCP_24_mask.nc
/d2/projects/MET/MET_development/MET/met/share/met/poly/LMV.poly
/d2/projects/MET/MET_development/MET/test_output/gen_vx_mask/POLY_PASS_THRU_APCP_24_LMV_mask.nc
-type poly -value 10 -v 2
DEBUG 1: Input File: /d2/projects/MET/MET_development/MET/test_output/gen_vx_mask/DATA_APCP_24_mask.nc
DEBUG 1: Mask File: /d2/projects/MET/MET_development/MET/met/share/met/poly/LMV.poly
terminate called after throwing an instance of 'netCDF::exceptions::NcEdge'
what(): NetCDF: Start+count exceeds dimension bound
file: ncVar.cpp line:1614
Aborted
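
The NcEdge message says that, for some dimension, the requested start plus count ran past the dimension's length. One way to narrow this down would be a guard that checks the values just before the NetCDF call and logs them when they are out of range; an uninitialized start or count would typically trip it. The sketch below is a hypothetical helper, not part of the netCDF C++ API or of MET.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical guard: true when, for every dimension i, the requested
// range [start[i], start[i] + count[i]) fits inside dim_sizes[i].
// The comparison is arranged to avoid overflow in start + count.
bool hyperslab_in_bounds(const std::vector<std::size_t> &start,
                         const std::vector<std::size_t> &count,
                         const std::vector<std::size_t> &dim_sizes) {
    if (start.size() != dim_sizes.size() ||
        count.size() != dim_sizes.size()) {
        return false;  // rank mismatch
    }
    for (std::size_t i = 0; i < dim_sizes.size(); ++i) {
        if (start[i] > dim_sizes[i] ||
            count[i] > dim_sizes[i] - start[i]) {
            return false;
        }
    }
    return true;
}
```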

@JohnHalleyGotway
Collaborator Author

Merged changes from this branch into both master_v8.1 and develop. Tested master_v8.1 on both mohawk (GNU 6.3.0) and k15 (GNU 8.3.0). Am also testing develop on them, but that's independent of this issue. Will create a separate issue for the sporadic runtime error.
