Adapt to new LSF situation on Summit #82

Closed
wants to merge 5 commits into from

Conversation

al-rigazzi
Collaborator

After the last update of Summit, many internal mechanisms of SmartSim stopped working. Here is a list of the issues and what I did to mitigate them:

  • ERF files used for MPMD no longer accept a plain rank_count; they require an explicit rank ID. The LSFOrchestrator now runs each shard on the rank with the same ID (rank 0 runs shard 0, and so on). The behavior is the same as before, but the mapping is now stated explicitly.
  • ERF files no longer accept more than one app on the same host. I suspect this is a bug, but it means we cannot run more than one shard per host. This required no code change, but it limits our features.
  • Environment variables were handled incorrectly: when more than one env var was specified, everything was assigned to the first one. We now store the formatted env vars as a list of strings, which is then parsed correctly.
  • Killing a jsrun process no longer kills the processes it spawned. This caused most of the problems, since we relied on it to stop applications. I turned JsrunSteps into managed steps, meaning we use jslist to get the status of a jsrun call inside an allocation. It works, but if anything other than SmartSim launches a jsrun command, the matching step ID could be lost to a race condition. IDs are assigned incrementally, starting from 0; internally we mimic the Slurm alloc_id.step_id format to distinguish steps from batch jobs. Within an allocation, users will need to launch jsrun commands only through SmartSim.
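
The environment-variable fix in the third bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `format_env_vars` is a hypothetical helper, and it assumes each variable is handed to jsrun as its own `-E NAME=value` pair.

```python
from typing import Dict, List


def format_env_vars(env_vars: Dict[str, str]) -> List[str]:
    """Return one '-E', 'NAME=value' pair per variable (hypothetical helper)."""
    formatted: List[str] = []
    for name, value in env_vars.items():
        # Keep every variable as separate list items so that later parsing
        # cannot fold several variables into the first one.
        formatted += ["-E", f"{name}={value}"]
    return formatted


# Example values are made up for illustration.
print(format_env_vars({"OMP_NUM_THREADS": "4", "MY_FLAG": "on"}))
```

Keeping the arguments as a list, rather than joining them into one string, avoids having to split the string again when the command line is assembled.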

@Spartee
Contributor

Spartee commented Sep 7, 2021

@al-rigazzi What happens if a user launches a couple of jobs without SmartSim in an interactive allocation and then uses SmartSim in the same allocation? Is there a method for accounting for step IDs that are already in use?

@al-rigazzi
Collaborator Author

@Spartee Yes, good question! This is accounted for. Right after we launch a job step, we call jslist to find the highest task ID and use it in our mapping. So the only risk is that SmartSim launches a job and, before it fetches the task ID, something else launches another job, whose ID is then taken as the highest and causes an inconsistency.
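
A rough sketch of that mapping, under stated assumptions: `assign_step_id` is a hypothetical name, the allocation ID `"12345"` is made up, and the task IDs are assumed to have already been read from the job-step listing. It is not the implementation in this PR.

```python
from typing import Iterable


def assign_step_id(alloc_id: str, task_ids: Iterable[int]) -> str:
    """Build a Slurm-style 'alloc_id.step_id' from the highest task ID seen."""
    ids = list(task_ids)
    if not ids:
        raise ValueError("no job steps reported for this allocation")
    # The most recently launched step is assumed to carry the highest ID.
    return f"{alloc_id}.{max(ids)}"


# Steps 1 and 2 were launched outside SmartSim; SmartSim then launches step 3.
print(assign_step_id("12345", [1, 2, 3]))

# Race: another step (4) starts before the IDs are fetched, so SmartSim's
# step would be mapped to the wrong ID.
print(assign_step_id("12345", [1, 2, 3, 4]))
```

The second call shows why the window between launching a step and reading its ID matters: the function has no way to tell which of the listed IDs belongs to SmartSim.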

@al-rigazzi
Collaborator Author

Also, this is the typical jslist output:

    parent                cpus      gpus      exit
   ID   ID       nrs    per RS    per RS    status         status
===============================================================================
    1    0         1   various   various         0        Running
    2    0         1         1         0         0        Running
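
Output in this shape can be turned into a status lookup with a few lines of parsing. The sketch below is an illustration only: `parse_jslist` is a hypothetical helper, and it assumes that data rows start with an integer ID and that the status is the last whitespace-separated field.

```python
from typing import Dict

# Sample text copied from the listing above.
SAMPLE = """\
    parent                cpus      gpus      exit
   ID   ID       nrs    per RS    per RS    status         status
===============================================================================
    1    0         1   various   various         0        Running
    2    0         1         1         0         0        Running
"""


def parse_jslist(output: str) -> Dict[int, str]:
    """Map each step ID to its status string (hypothetical helper)."""
    statuses: Dict[int, str] = {}
    for line in output.splitlines():
        fields = line.split()
        # Header and separator rows do not start with an integer ID.
        if len(fields) >= 2 and fields[0].isdigit():
            statuses[int(fields[0])] = fields[-1]
    return statuses


steps = parse_jslist(SAMPLE)
print(steps)
print(max(steps))  # highest step ID currently listed
```

The highest key in the resulting dictionary is the value the ID mapping discussed above would pick up.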

@al-rigazzi al-rigazzi linked an issue Sep 9, 2021 that may be closed by this pull request
@mellis13 mellis13 closed this Sep 13, 2021
Development

Successfully merging this pull request may close these issues.

New LSF launcher features
3 participants