Rmpi doesn't shut down properly #43

Closed
dschlaep opened this issue Nov 11, 2016 · 3 comments

@dschlaep
Member

SWSF works well when run with Rmpi on my local computer -- except that at the very end, it hangs and doesn't close down the workers. This happens, e.g., with test project 5.

dschlaep pushed a commit that referenced this issue Nov 11, 2016
- snow::clusterCall returns a list and not a vector as assumed (see the
sketch after this commit message)
- passed test project 4
> [1] "Exporting 275 objects from master process to workers"
> [1] "Export of 275 objects took 8.72 secs"
>
>   has_run has_problems made_new_refs deleted_output
> 1    TRUE        FALSE         FALSE          FALSE
>                                                 referenceDB
> 1 dbTables_Test4_AllOverallAggregations_snow_v1.8.2.sqlite3

- passed test project 5 (but no report because it hangs as described in
issue #43)
> [1] "Exporting 275 objects from master process to workers"
> [1] "Export of 275 objects took 2.26 secs"
>
> [1] "SWSF: ended after 21.55 s"
> [1] "SWSF: ended with actions = create, execute, aggregate,
concatenate at 2016-11-11 12:34:50"
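
For reference, a minimal sketch of the clusterCall() behavior the first bullet accounts for; the cluster setup is illustrative, and the point is that snow's clusterCall() returns one list element per worker rather than an atomic vector:

    library(snow)
    cl <- makeCluster(2, type = "SOCK")

    # clusterCall() returns a list with one element per worker, not a vector
    res <- clusterCall(cl, function() Sys.getpid())
    is.list(res)         # TRUE
    pids <- unlist(res)  # collapse explicitly if a vector is needed

    stopCluster(cl)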
@alexreeder
Contributor

alexreeder commented Nov 15, 2016

Something like this happens on Yellowstone too; here is an example of the trace output. Because it only happens after the trace output says "run completed", it is not nice, but it is not an issue for execution, only for cleanup.

[1] "SWSF simulation runs: completed with 342 runs: ended after 4882.32 s"
[1] "SWSF: ended after 4907.64 s"
[1] "SWSF: ended with actions = create, execute, aggregate at 2016-11-03 09:18:09"
[1] "Detaching Rmpi. Rmpi cannot be used unless relaunching R."

BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
PID 4832 RUNNING AT ys0356-ib
EXIT CODE: 139
CLEANING UP REMAINING PROCESSES
YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
PID 4832 RUNNING AT ys0356-ib
EXIT CODE: 11
CLEANING UP REMAINING PROCESSES
YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

@dschlaep
Member Author

This is not an issue for the execution of one simulation run, agreed.

However, it is an issue when several test projects are run back to back, because later ones cannot start while an earlier one hangs. For instance, test projects 5 (parallelization with MPI) and 4 cannot be run together (e.g., with ./Test_projects/Run_all_test_projects.R -t=5,4) because 5 hangs and 4 never gets started.

It is also a problem because resources are not properly freed — at least on my local computer. When I run test project 5 and force quit it when it hangs, MPI workers are not released. This requires an additional step.

Thus, we still need to figure out how to properly clean up after running a simulation with MPI.

@dschlaep
Member Author

When Rmpi::mpi.close.Rslaves() is called, it hangs because of the call to Rmpi::mpi.comm.disconnect(comm), which is basically .Call("mpi_comm_disconnect", as.integer(comm), PACKAGE = "Rmpi").
Calling Rmpi::mpi.exit() instead removes all workers and works; for now, I am commenting out all calls to Rmpi::mpi.close.Rslaves().
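
A minimal sketch of this interim cleanup, assuming the workers were spawned with Rmpi::mpi.spawn.Rslaves(); the wrapper function name is illustrative, not actual SWSF code:

    library(Rmpi)

    clean_mpi_workers <- function() {
      # Rmpi::mpi.close.Rslaves() hangs in mpi.comm.disconnect(comm),
      # i.e., .Call("mpi_comm_disconnect", as.integer(comm), PACKAGE = "Rmpi"),
      # so it is left out for now:
      # mpi.close.Rslaves()

      # mpi.exit() removes all workers and detaches Rmpi;
      # note that Rmpi cannot be used again without relaunching R
      mpi.exit()
    }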

dschlaep pushed a commit that referenced this issue Dec 22, 2016
longer used for Rsoilwat (but not new function ‘sw_out_flags’ which
should eventually be moved to Rsoilwat)
- replaced .GlobalEnv with function globalenv(); in the hope that this
would work better on parallel workers (probably no change)
- moved ensemble code to file SWSF_Ensembles.R (all ensemble
functionality to be deprecated and removed?)
- code that checks whether all requested sites/runs (Include_YN column
in master input) have associated data (include_YN_* from external data
sources), moved to function check_requested_sites()
- function mpi_work(): explicit call to do_OneSite() replaced with
‘do.call’: changing function parameters will require less code
maintenance (see the sketch after this commit message)
- moved code to clean parallel simulation setup to function
clean_parallel_workers(), but this seems not yet to work for test
project 5
- moved code for ‘prior calculations’ to file SWSF_PriorCalculations.R:
new functions ‘calc_ExtendSoilDatafileToRequestedSoilLayers’,
‘calc_CalculateBareSoilEvaporationCoefficientsFromSoilTexture’,  and
‘do_prior_TableLookups’
- main function ‘do_OneSite()’:
    - moved to file SWSF_Simulation.R
    - gained new argument ‘SimParams’ which are all the arguments that
do not change between runs within a simulation experiment, i.e., all
variables that were previously exported to workers from ‘list.export’
    - because the Rcpp functions like ‘get_KilledBySoilLayers’ are no
longer defined in the same enclosure, ‘do_OneSite’ doesn’t find them
anymore (and Rcpp doesn’t currently work anymore), hence they need to
be re-defined from the global environment inside do_OneSite (see lines
18-20): this will be fixed and functional again, once SWSF is an R
package
    - new helper functions for do_OneSite(): ‘gather_args_do_OneSite’
to prepare the ‘SimParams’ argument of ‘do_OneSite’; and
‘run_simulation_experiment’ to execute the entire simulation
experiment, i.e., preparing the parallel workers and executing
do_OneSite with them
- moved timing functions to file SWSF_Timing.R
- moved all code and functions to determine sources of daily weather to
file SWSF_WeatherDB.R: ‘dw_LookupWeatherFolder’,
‘dw_Maurer2002_NorthAmerica’, ‘dw_DayMet_NorthAmerica’,
‘dw_NRCan_10km_Canada’, ‘dw_NCEPCFSR_Global’, ‘dw_determine_sources’,

- contributes to #49

- passes test projects 1-4
>       elapsed_s has_run has_problems made_new_refs deleted_output
> Test1    67.168    TRUE         TRUE         FALSE          FALSE
> Test2    60.695    TRUE         TRUE         FALSE          FALSE
> Test3     6.151    TRUE        FALSE         FALSE          FALSE
> Test4     0.622    TRUE        FALSE         FALSE          FALSE
>                                                      referenceDB
> Test1        dbTables_Test1_downscaling_overhaul_v1.10.1.sqlite3
> Test2        dbTables_Test2_LookupWeatherFolders_v1.10.1.sqlite3
> Test3         dbTables_Test3_OnlyMeanDailyOutput_v1.10.1.sqlite3
> Test4 dbTables_Test4_AllOverallAggregations_snow_v1.10.1.sqlite3

- test project 5 runs through as well but hangs again at the end when
code attempts to clean up workers (see #43)
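
As an aside, a minimal sketch of the do.call pattern referred to above for mpi_work(); do_OneSite() here is a toy stand-in, not the actual SWSF function:

    # toy stand-in for do_OneSite(); the real SWSF function differs
    do_OneSite <- function(i_sim, SimParams) {
      paste("run", i_sim, "uses", length(SimParams), "shared parameters")
    }

    # the worker assembles an argument list and dispatches via do.call();
    # adding or renaming arguments of do_OneSite() then only requires
    # changing this list, not an explicit call inside mpi_work()
    args_one_site <- list(i_sim = 1L, SimParams = list(a = 1, b = 2))
    do.call(do_OneSite, args_one_site)
    #> [1] "run 1 uses 2 shared parameters"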