Rmpi doesn't shut down properly #43
- snow::clusterCall returns a list, not a vector as assumed (see the sketch after this list)
- passed test project 4
  > [1] "Exporting 275 objects from master process to workers"
  > [1] "Export of 275 objects took 8.72 secs"
  >
  >   has_run has_problems made_new_refs deleted_output
  > 1    TRUE        FALSE         FALSE          FALSE
  >                                                  referenceDB
  > 1 dbTables_Test4_AllOverallAggregations_snow_v1.8.2.sqlite3
- passed test project 5 (but no report because it hangs as described in issue #43)
  > [1] "Exporting 275 objects from master process to workers"
  > [1] "Export of 275 objects took 2.26 secs"
  >
  > [1] "SWSF: ended after 21.55 s"
  > [1] "SWSF: ended with actions = create, execute, aggregate, concatenate at 2016-11-11 12:34:50"
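For illustration, a minimal sketch of the clusterCall behavior, assuming a small SOCK cluster; the coercion with unlist() is just one way to get a vector and is not code from SWSF:

```r
library(snow)

# Hypothetical two-worker SOCK cluster, only for illustration
cl <- makeCluster(2, type = "SOCK")

# clusterCall() returns a list with one element per worker ...
res <- clusterCall(cl, function() Sys.getpid())
is.list(res)    # TRUE

# ... so code that expects a vector has to simplify explicitly
pids <- unlist(res)

stopCluster(cl)
```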
Something like this happens on Yellowstone too; here is an example of the trace output. Because it only happens after the trace output says the runs completed, it is not nice, but it is no issue for execution, only for clean-up.

> [1] "SWSF simulation runs: completed with 342 runs: ended after 4882.32 s"
> BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
This is not an issue for the execution of one simulation run, agreed. However, it is an issue when several runs are linked, because they cannot finish when one of them hangs. For instance, test projects 5 (parallelization with MPI) and 4 cannot be run together (e.g., with ...). It is also a problem because resources are not properly freed, at least on my local computer: when I run test project 5 and force quit it when it hangs, the MPI workers are not released and removing them requires an additional step. Thus, we still need to figure out how to properly clean up after running a simulation with MPI.
When the function Rmpi::mpi.close.Rslaves() is called, it hangs because of the call to Rmpi::mpi.comm.disconnect(comm). This is basically ...
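For reference, a minimal sketch of the shutdown sequence involved, assuming workers were started with Rmpi::mpi.spawn.Rslaves(); this only illustrates where the hang occurs and is not the project's clean_parallel_workers():

```r
# Sketch of an Rmpi shutdown: mpi.close.Rslaves() internally calls
# mpi.comm.disconnect(comm), which is where the hang described above occurs.
shutdown_mpi <- function(comm = 1) {
  if (requireNamespace("Rmpi", quietly = TRUE) &&
      Rmpi::mpi.comm.size(comm) > 0) {
    Rmpi::mpi.close.Rslaves(dellog = FALSE, comm = comm)  # hangs here (#43)
  }
  Rmpi::mpi.finalize()  # finalize MPI without quitting the R session
}
```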
- [...] longer used for Rsoilwat (but not the new function 'sw_out_flags', which should eventually be moved to Rsoilwat)
- replaced .GlobalEnv with the function globalenv(), in the hope that this works better on parallel workers (probably no change)
- moved ensemble code to file SWSF_Ensembles.R (all ensemble functionality to be deprecated and removed?)
- moved the code that checks whether all requested sites/runs (Include_YN column in the master input) have associated data (include_YN_* from external data sources) to the function check_requested_sites()
- function mpi_work(): replaced the explicit call to do_OneSite() with 'do.call', so changing function parameters will require less code maintenance (see the sketch after this list)
- moved the code that cleans up the parallel simulation setup to the function clean_parallel_workers(), but this does not yet seem to work for test project 5
- moved the code for 'prior calculations' to file SWSF_PriorCalculations.R: new functions 'calc_ExtendSoilDatafileToRequestedSoilLayers', 'calc_CalculateBareSoilEvaporationCoefficientsFromSoilTexture', and 'do_prior_TableLookups'
- main function 'do_OneSite()':
  - moved to file SWSF_Simulation.R
  - gained a new argument 'SimParams', which bundles all the arguments that do not change between runs within a simulation experiment, i.e., all variables that were previously exported to workers from 'list.export'
  - because Rcpp functions like 'get_KilledBySoilLayers' are no longer defined in the same enclosure, 'do_OneSite' can no longer find them (and Rcpp currently doesn't work anyway); they need to be re-defined from the global environment inside do_OneSite (see lines 18-20); this will be fixed and functional again once SWSF is an R package
- new helper functions for do_OneSite(): 'gather_args_do_OneSite' to prepare the 'SimParams' argument of 'do_OneSite', and 'run_simulation_experiment' to execute the entire simulation experiment, i.e., prepare the parallel workers and execute do_OneSite with them
- moved timing functions to file SWSF_Timing.R
- moved all code and functions that determine sources of daily weather to file SWSF_WeatherDB.R: 'dw_LookupWeatherFolder', 'dw_Maurer2002_NorthAmerica', 'dw_DayMet_NorthAmerica', 'dw_NRCan_10km_Canada', 'dw_NCEPCFSR_Global', 'dw_determine_sources'
- contributes to #49
- passes test projects 1-4
  >       elapsed_s has_run has_problems made_new_refs deleted_output
  > Test1    67.168    TRUE         TRUE         FALSE          FALSE
  > Test2    60.695    TRUE         TRUE         FALSE          FALSE
  > Test3     6.151    TRUE        FALSE         FALSE          FALSE
  > Test4     0.622    TRUE        FALSE         FALSE          FALSE
  >                                                        referenceDB
  > Test1       dbTables_Test1_downscaling_overhaul_v1.10.1.sqlite3
  > Test2       dbTables_Test2_LookupWeatherFolders_v1.10.1.sqlite3
  > Test3       dbTables_Test3_OnlyMeanDailyOutput_v1.10.1.sqlite3
  > Test4 dbTables_Test4_AllOverallAggregations_snow_v1.10.1.sqlite3
- test project 5 runs through as well but hangs again at the end when the code attempts to clean up workers (see #43)
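To make the do.call() point concrete, here is a minimal sketch, assuming do_OneSite() takes a run identifier plus the 'SimParams' bundle; the argument name 'i_sim' and the function bodies are illustrative placeholders, not the actual SWSF implementations of mpi_work() or gather_args_do_OneSite():

```r
# Sketch: mpi_work() no longer spells out every do_OneSite() argument, so
# adding or renaming a parameter only requires updating the argument list
# built by gather_args_do_OneSite().
gather_args_do_OneSite <- function(i_sim, SimParams) {
  # 'i_sim' identifies the current run (placeholder name);
  # 'SimParams' bundles everything constant across runs
  list(i_sim = i_sim, SimParams = SimParams)
}

mpi_work <- function(i_sim, SimParams) {
  do.call(do_OneSite, gather_args_do_OneSite(i_sim, SimParams))
}
```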
SWSF works well when run with Rmpi on my local computer -- except that at the very end, it hangs and doesn't close down the workers. This happens, e.g., with test project 5.