Adapt to new LSF situation on Summit #82

al-rigazzi · 2021-09-07T13:54:43Z

After the last update of Summit, many internal mechanisms of SmartSim stopped working. Here is a list of the issues and what I did to mitigate them:

ERF files used for MPMD no longer accept a simple rank_count; they need an explicit rank ID. The LSFOrchestrator now runs each shard on the rank with the same ID (rank 0 runs shard 0, and so on). The behavior is the same as before, but the mapping is now specified explicitly.
ERF files no longer accept more than one app on the same host. I suspect this is a bug, but in practice it means we cannot run more than one shard per host. This required no code change, but it limits our features.
Environment variables were handled incorrectly: when more than one env var was specified, everything was assigned to the first variable. We now store the formatted env vars as a list of strings, which is then parsed correctly.
Killing a jsrun process no longer kills the processes it spawned. This caused most of the problems, as we relied on it to stop applications. I turned JsrunSteps into managed steps, which means we use jslist to get the status of a jsrun call inside an allocation. It works, but if anything other than SmartSim launches a jsrun command, the matching step ID could be lost due to a race condition. IDs are assigned incrementally, starting from 0, and internally we mimic the Slurm format alloc_id.step_id to distinguish them from batch jobs. Within an allocation, users will need to use only SmartSim.
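
As a rough illustration of the environment-variable fix in the list above, the sketch below keeps one formatted string per variable instead of a single joined string. The helper name `format_env_vars`, the variable `MY_VAR`, and the choice to pass each variable as `-E NAME=value` are assumptions for illustration, not necessarily what the SmartSim code does.

```python
from typing import Dict, List


def format_env_vars(env_vars: Dict[str, str]) -> List[str]:
    """Return one argument pair per environment variable.

    Hypothetical helper: assumes each variable is passed to the
    launcher as `-E NAME=value`.
    """
    formatted: List[str] = []
    for name, value in env_vars.items():
        # One entry per variable, so no value can spill into another.
        formatted += ["-E", f"{name}={value}"]
    return formatted


print(format_env_vars({"OMP_NUM_THREADS": "4", "MY_VAR": "abc"}))
# ['-E', 'OMP_NUM_THREADS=4', '-E', 'MY_VAR=abc']
```

Keeping the entries separate means each variable can be handled on its own later, which is the property the joined form lost.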

Spartee · 2021-09-07T17:22:08Z

@al-rigazzi What happens if a user launches a couple of jobs without SmartSim in an interactive allocation and then uses SmartSim in the same interactive allocation? Is there a method for accounting for step IDs that are already used?

al-rigazzi · 2021-09-07T17:49:29Z

@Spartee Yes, good question! This is accounted for. Right after we launch a job step, we call jslist to find the highest task ID and use it in our mapping. So the only risk is that SmartSim launches a job and, before it fetches the task ID, something else launches another job, whose ID is then taken as the highest and causes an inconsistency.
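
The mapping and its race window can be sketched as follows. This is a minimal sketch, not the SmartSim implementation: `newest_step_id`, the allocation ID `1234`, and the ID lists are hypothetical, and the IDs are assumed to have already been read from the job step listing.

```python
from typing import Iterable


def newest_step_id(alloc_id: str, listed_ids: Iterable[int]) -> str:
    """Build a Slurm-style `alloc_id.step_id` from the highest ID seen."""
    ids = list(listed_ids)
    if not ids:
        raise ValueError("no job steps found in this allocation")
    return f"{alloc_id}.{max(ids)}"


# SmartSim launches a step, then sees IDs 1 and 2 in the listing.
print(newest_step_id("1234", [1, 2]))  # 1234.2

# Race: another jsrun started before the IDs were read, so ID 3 is
# wrongly attributed to the step SmartSim launched (which was 2).
print(newest_step_id("1234", [1, 2, 3]))  # 1234.3
```

Steps launched before SmartSim starts are harmless here, since they only raise the baseline; the problem is limited to launches that land between SmartSim's own launch and its ID lookup.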

al-rigazzi · 2021-09-07T17:56:36Z

Also, this is the typical jslist output:

     parent                cpus      gpus      exit
   ID   ID       nrs    per RS    per RS    status         status
===============================================================================
    1    0         1   various   various         0        Running
    2    0         1         1         0         0        Running
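
Output like the above can be turned into a mapping from step ID to status. The following is a minimal sketch, assuming the layout stays as shown (header lines, a separator row of `=`, then one row per step with the ID first and a single-word status last); `parse_jslist` is a hypothetical helper, not a SmartSim function.

```python
from typing import Dict

SAMPLE = """\
     parent                cpus      gpus      exit
   ID   ID       nrs    per RS    per RS    status         status
===============================================================================
    1    0         1   various   various         0        Running
    2    0         1         1         0         0        Running
"""


def parse_jslist(output: str) -> Dict[int, str]:
    """Map each step ID to its status string."""
    statuses: Dict[int, str] = {}
    in_table = False
    for line in output.splitlines():
        if line.strip().startswith("="):
            in_table = True  # data rows start after the separator
            continue
        fields = line.split()
        if in_table and fields and fields[0].isdigit():
            statuses[int(fields[0])] = fields[-1]
    return statuses


print(parse_jslist(SAMPLE))  # {1: 'Running', 2: 'Running'}
```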

…martSim into summit_update_fixes

Turn jsrun into managed, change lsf env and orc

63aee33

al-rigazzi added 4 commits September 8, 2021 12:40

Fix code block for Summit doc

dca85d6

Move step_id creation for LSF

3719b30

Merge branch 'summit_update_fixes' of https://github.com/al-rigazzi/S…

073fd85

…martSim into summit_update_fixes

Remove outdated comments

f6b0756

al-rigazzi linked an issue Sep 9, 2021 that may be closed by this pull request

New LSF launcher features #83

Closed

mellis13 closed this Sep 13, 2021
