Adapt to new LSF situation on Summit #82
Closed
After the last update of Summit, many internal mechanisms of SmartSim stopped working. Here is a list of the issues and what I did to mitigate them:
- `rank_count`, but need rank id. Now the LSFOrchestrator will run each shard through the rank with the same id (rank 0 will run shard 0 and so on). It is the same as before, but we specify it.
- The `jsrun` process does not kill its spawned processes anymore. This caused most of the problems, as we were relying on it to stop applications. I turned `JsrunSteps` into managed ones. This means we use `jslist` to get the status of a `jsrun` call inside an allocation. It works, but if anything but SmartSim launches a `jsrun` command, the matching step id could be lost due to a race condition (ids are assigned incrementally, starting from 0; internally we mock the Slurm format of `alloc_id.step_id` to distinguish them from batch jobs). Within an allocation, users will need to launch `jsrun` commands only through SmartSim.
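
To illustrate the step id scheme described above, here is a minimal sketch and not the actual SmartSim code. The names `StepCounter`, `make_step_id`, `split_step_id` and `is_managed_step` are hypothetical, and the sketch assumes that allocation ids contain no dot. It shows how a locally kept, incrementing counter can be combined with the allocation id into a Slurm-like `alloc_id.step_id` string, and why the scheme relies on SmartSim being the only thing that launches `jsrun` in the allocation.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, Tuple


def make_step_id(alloc_id: str, step_index: int) -> str:
    """Build a Slurm-like id of the form alloc_id.step_id."""
    return f"{alloc_id}.{step_index}"


def split_step_id(job_id: str) -> Tuple[str, int]:
    """Split alloc_id.step_id back into its two parts."""
    alloc_id, _, step_index = job_id.rpartition(".")
    return alloc_id, int(step_index)


def is_managed_step(job_id: str) -> bool:
    """Assumption: batch job ids contain no dot, step ids always do."""
    return "." in job_id


@dataclass
class StepCounter:
    """Hypothetical local counter mirroring the incremental step ids.

    It starts at 0 and advances once per launch made through it. It has
    no way of noticing launches made by anything else, which is the race
    condition mentioned in the description.
    """

    alloc_id: str
    _indices: Iterator[int] = field(default_factory=count)

    def next_step_id(self) -> str:
        return make_step_id(self.alloc_id, next(self._indices))


counter = StepCounter("12345")
first = counter.next_step_id()   # id assumed for the first launch
second = counter.next_step_id()  # id assumed for the second launch
```

If another tool launched a `jsrun` command between these two calls, the real step would receive the next id while the local counter would still hand out the following one, so the locally stored id would no longer match the step it was meant to track.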