Rmpi doesn't shut down properly #43

Closed
dschlaep opened this issue Nov 11, 2016 · 3 comments

@dschlaep
Member

SWSF works well when run with Rmpi on my local computer -- except that at the very end, it hangs and doesn't close down the workers. This happens, e.g., with test project 5.

dschlaep pushed a commit that referenced this issue Nov 11, 2016
- snow::clusterCall returns a list and not a vector as assumed (see the
sketch after this commit message)
- passed test project 4
> [1] "Exporting 275 objects from master process to workers"
> [1] "Export of 275 objects took 8.72 secs"
>
>   has_run has_problems made_new_refs deleted_output
> 1    TRUE        FALSE         FALSE          FALSE
>                                                 referenceDB
> 1 dbTables_Test4_AllOverallAggregations_snow_v1.8.2.sqlite3

- passed test project 5 (but no report because it hangs as described in
issue #43)
> [1] "Exporting 275 objects from master process to workers"
> [1] "Export of 275 objects took 2.26 secs"
>
> [1] "SWSF: ended after 21.55 s"
> [1] "SWSF: ended with actions = create, execute, aggregate,
concatenate at 2016-11-11 12:34:50"
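
For reference, a minimal sketch of the clusterCall() behavior the first bullet accounts for; the cluster setup is illustrative, and the point is that snow's clusterCall() returns one list element per worker rather than an atomic vector:

    library(snow)
    cl <- makeCluster(2, type = "SOCK")

    # clusterCall() returns a list with one element per worker, not a vector
    res <- clusterCall(cl, function() Sys.getpid())
    is.list(res)         # TRUE
    pids <- unlist(res)  # collapse explicitly if a vector is needed

    stopCluster(cl)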
@alexreeder
Contributor

alexreeder commented Nov 15, 2016

Something like this happens on Yellowstone too; here is an example of the trace output. Because it only happens after the trace output says "run completed", it is not nice, but it is not an issue for execution, only for cleanup.

[1] "SWSF simulation runs: completed with 342 runs: ended after 4882.32 s"
[1] "SWSF: ended after 4907.64 s"
[1] "SWSF: ended with actions = create, execute, aggregate at 2016-11-03 09:18:09"
[1] "Detaching Rmpi. Rmpi cannot be used unless relaunching R."

BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
PID 4832 RUNNING AT ys0356-ib
EXIT CODE: 139
CLEANING UP REMAINING PROCESSES
YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
PID 4832 RUNNING AT ys0356-ib
EXIT CODE: 11
CLEANING UP REMAINING PROCESSES
YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

@dschlaep
Member Author

This is not an issue for the execution of one simulation run, agreed.

However, it is an issue when several test projects are run back to back, because later ones cannot start while an earlier one hangs. For instance, test projects 5 (parallelization with MPI) and 4 cannot be run together (e.g., with ./Test_projects/Run_all_test_projects.R -t=5,4) because 5 hangs and 4 never gets started.

It is also a problem because resources are not properly freed — at least on my local computer. When I run test project 5 and force quit it when it hangs, MPI workers are not released. This requires an additional step.

Thus, we still need to figure out how to properly clean up after running a simulation with MPI.

@dschlaep
Member Author

When Rmpi::mpi.close.Rslaves() is called, it hangs because of the call to Rmpi::mpi.comm.disconnect(comm), which is basically .Call("mpi_comm_disconnect", as.integer(comm), PACKAGE = "Rmpi").
Calling Rmpi::mpi.exit() instead removes all workers and works; for now, I am commenting out all calls to Rmpi::mpi.close.Rslaves().
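
A minimal sketch of this interim cleanup, assuming the workers were spawned with Rmpi::mpi.spawn.Rslaves(); the wrapper function name is illustrative, not actual SWSF code:

    library(Rmpi)

    clean_mpi_workers <- function() {
      # Rmpi::mpi.close.Rslaves() hangs in mpi.comm.disconnect(comm),
      # i.e., .Call("mpi_comm_disconnect", as.integer(comm), PACKAGE = "Rmpi"),
      # so it is left out for now:
      # mpi.close.Rslaves()

      # mpi.exit() removes all workers and detaches Rmpi;
      # note that Rmpi cannot be used again without relaunching R
      mpi.exit()
    }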

dschlaep pushed a commit that referenced this issue Dec 22, 2016
longer used for Rsoilwat (but not new function ‘sw_out_flags’ which
should eventually be moved to Rsoilwat)
- replaced .GlobalEnv with function globalenv(); in the hope that this
would work better on parallel workers (probably no change)
- moved ensemble code to file SWSF_Ensembles.R (all ensemble
functionality to be deprecated and removed?)
- code that checks whether all requested sites/runs (Include_YN column
in master input) have associated data (include_YN_* from external data
sources), moved to function check_requested_sites()
- function mpi_work(): explicit call to do_OneSite() replaced with
‘do.call’: changing function parameters will require less code
maintenance (see the sketch after this commit message)
- moved code to clean parallel simulation setup to function
clean_parallel_workers(), but this seems not yet to work for test
project 5
- moved code for ‘prior calculations’ to file SWSF_PriorCalculations.R:
new functions ‘calc_ExtendSoilDatafileToRequestedSoilLayers’,
‘calc_CalculateBareSoilEvaporationCoefficientsFromSoilTexture’,  and
‘do_prior_TableLookups’
- main function ‘do_OneSite()’:
    - moved to file SWSF_Simulation.R
    - gained new argument ‘SimParams’ which are all the arguments that
do not change between runs within a simulation experiment, i.e., all
variables that were previously exported to workers from ‘list.export’
    - because the Rcpp functions like ‘get_KilledBySoilLayers’ are no
longer defined in the same enclosure, ‘do_OneSite’ doesn’t find them
anymore (and Rcpp doesn’t currently work anymore), hence they need to
be re-defined from the global environment inside do_OneSite (see lines
18-20): this will be fixed and functional again, once SWSF is an R
package
    - new helper functions for do_OneSite(): ‘gather_args_do_OneSite’
to prepare the ‘SimParams’ argument of ‘do_OneSite’; and
‘run_simulation_experiment’ to execute the entire simulation
experiment, i.e., preparing the parallel workers and executing
do_OneSite with them
- moved timing functions to file SWSF_Timing.R
- moved all code and functions to determine sources of daily weather to
file SWSF_WeatherDB.R: ‘dw_LookupWeatherFolder’,
‘dw_Maurer2002_NorthAmerica’, ‘dw_DayMet_NorthAmerica’,
‘dw_NRCan_10km_Canada’, ‘dw_NCEPCFSR_Global’, ‘dw_determine_sources’,

- contributes to #49

- passes test projects 1-4
>       elapsed_s has_run has_problems made_new_refs deleted_output
> Test1    67.168    TRUE         TRUE         FALSE          FALSE
> Test2    60.695    TRUE         TRUE         FALSE          FALSE
> Test3     6.151    TRUE        FALSE         FALSE          FALSE
> Test4     0.622    TRUE        FALSE         FALSE          FALSE
>                                                      referenceDB
> Test1        dbTables_Test1_downscaling_overhaul_v1.10.1.sqlite3
> Test2        dbTables_Test2_LookupWeatherFolders_v1.10.1.sqlite3
> Test3         dbTables_Test3_OnlyMeanDailyOutput_v1.10.1.sqlite3
> Test4 dbTables_Test4_AllOverallAggregations_snow_v1.10.1.sqlite3

- test project 5 runs through as well but hangs again at the end when
code attempts to clean up workers (see #43)
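
As an aside, a minimal sketch of the do.call pattern referred to above for mpi_work(); do_OneSite() here is a toy stand-in, not the actual SWSF function:

    # toy stand-in for do_OneSite(); the real SWSF function differs
    do_OneSite <- function(i_sim, SimParams) {
      paste("run", i_sim, "uses", length(SimParams), "shared parameters")
    }

    # the worker assembles an argument list and dispatches via do.call();
    # adding or renaming arguments of do_OneSite() then only requires
    # changing this list, not an explicit call inside mpi_work()
    args_one_site <- list(i_sim = 1L, SimParams = list(a = 1, b = 2))
    do.call(do_OneSite, args_one_site)
    #> [1] "run 1 uses 2 shared parameters"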