@amaslenn
Contributor

@amaslenn amaslenn commented Feb 19, 2025

Summary

Always create a rank-mapping file for Slurm jobs by running an extra srun command before the actual test.
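To illustrate the idea, here is a minimal sketch of how such an extra srun step could be assembled. It is not the code from this PR: the helper name `build_mapping_srun`, its parameters, and the exact echo command are assumptions, chosen only to reproduce the line format shown in the test plan. `SLURM_NODEID` and `SLURM_PROCID` are the environment variables srun sets for each task, and `--output`/`--error` are standard srun options.

```python
import shlex


def build_mapping_srun(srun_prefix: list[str], output_dir: str) -> str:
    """Return an srun command line that prints one mapping line per task.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Each task reports its host, node index, and global rank.
    # srun sets SLURM_NODEID and SLURM_PROCID in every task's environment.
    report = 'echo "$(date): $(hostname):node ${SLURM_NODEID}:rank ${SLURM_PROCID}."'
    cmd = [
        *srun_prefix,
        f"--output={output_dir}/mapping-stdout.txt",
        f"--error={output_dir}/mapping-stderr.txt",
        "bash",
        "-c",
        report,
    ]
    # Quote every part so the shell passes the echo command to bash unchanged.
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    print(build_mapping_srun(["srun"], "/tmp/out"))
```

The returned string could be written into the sbatch script on the line before the test's own srun invocation, so the mapping is collected with the same allocation.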

Test Plan

  1. CI
  2. Manual run with a modified sbatch file that excludes the actual test (for the sake of speed).
    Single node, mapping-stdout.txt content (mapping-stderr.txt is empty):

Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 0.
Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 3.
Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 2.
Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 7.
Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 6.
Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 5.
Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 4.
Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 1.

    Two nodes, mapping-stdout.txt content (mapping-stderr.txt is empty):

Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 14.
Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 9.
Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 13.
Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 8.
Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 15.
Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 12.
Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 11.
Wed Feb 19 04:54:51 PST 2025: <node1-name>:node 1:rank 10.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 7.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 2.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 5.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 1.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 6.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 3.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 4.
Wed Feb 19 04:54:52 PST 2025: <node2-name>:node 0:rank 0.
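
The mapping output above arrives in arbitrary order, so a small parser helps when checking which ranks landed on which host. This is a rough sketch that assumes the line format shown in these samples (`<date>: <host>:node <n>:rank <r>.`); the function name `parse_mapping` is hypothetical and not part of the PR.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Wed Feb 19 04:54:52 PST 2025: <node-name>:node 0:rank 3.
LINE_RE = re.compile(
    r"^(?P<date>.+): (?P<host>[^:]+):node (?P<node>\d+):rank (?P<rank>\d+)\.$"
)


def parse_mapping(text: str) -> dict[str, list[int]]:
    """Group ranks by host name; lines that do not match are skipped."""
    ranks_by_host: dict[str, list[int]] = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        ranks_by_host[match["host"]].append(int(match["rank"]))
    # Sort ranks because tasks write their lines in no particular order.
    return {host: sorted(ranks) for host, ranks in ranks_by_host.items()}
```

For example, `parse_mapping(open("mapping-stdout.txt").read())` would return a dictionary keyed by host name with a sorted list of ranks for each host.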

Additional Notes

I considered using a pre-hook for this task. Here is why I decided against it:

  1. A pre-hook is treated as a verification step, but here we need an action that collects extra information.
  2. A pre-hook would require a status validation function.
  3. With a pre-hook plus a SlurmContainer job (or any other job), we could not redirect the output into a file with "mapping" in its name; we can't control that.

@amaslenn amaslenn marked this pull request as ready for review February 19, 2025 16:40
@TaekyungHeo TaekyungHeo added the enhancement (New feature or request) and feature labels Feb 19, 2025
TaekyungHeo previously approved these changes Feb 19, 2025
@TaekyungHeo
Member

I will approve again once the unit tests are fixed.

@amaslenn amaslenn merged commit 92b1b61 into main Feb 27, 2025
2 checks passed
@amaslenn amaslenn deleted the am/ranks-mapping branch February 27, 2025 18:17