@TaekyungHeo TaekyungHeo commented Apr 1, 2025

Summary

This pull request adds RunAI support to CloudAI and enables submitting NCCL tests to RunAI through CloudAI.

With this change, I can submit an NCCL test to RunAI, monitor it, and detect job completion.

However, there is still much room for improvement.

  • MPI is currently not used due to a server misconfiguration.
  • Some features are currently unsupported:
    • Hooks
    • Report generation
    • Installation
  • More importantly, we should clarify the requirements.
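
As a rough illustration of the submit, monitor, and detect-completion flow mentioned in the summary, here is a minimal polling sketch. It is not the code in this PR: `JobPhase`, `wait_for_job`, and the `get_phase` callback are hypothetical names, and the phase values are assumptions rather than the actual RunAI status strings.

```python
import time
from enum import Enum
from typing import Callable


class JobPhase(Enum):
    # Hypothetical phases; the real RunAI status values may differ.
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


TERMINAL_PHASES = {JobPhase.COMPLETED, JobPhase.FAILED}


def wait_for_job(
    get_phase: Callable[[str], JobPhase],
    job_id: str,
    poll_interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> JobPhase:
    """Poll get_phase(job_id) until it reports a terminal phase or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase(job_id)
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {phase.value} after {timeout_s}s")
        sleep(poll_interval_s)
```

Passing `get_phase` and `sleep` as callables keeps the loop independent of any particular REST client and makes it easy to exercise with a fake in unit tests.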

Design Document: https://docs.google.com/document/d/1fqD_hBXmj0ikXX91iAVGvCLZk2tZdUKqK08TZdT7E0g/edit?tab=t.0#heading=h.p9oz8l1n9zma

Test Plan

  1. CI passes
  2. Ran on a server
$ pip install .
(venv) ➜  cloudai git:(runai) cloudai run \
    --system-config conf/common/system/example_runai_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/nccl_test.toml
[INFO] System Name: example-runai-cluster
[INFO] Scheduler: runai
[INFO] Test Scenario Name: nccl-test
[INFO] Checking if test templates are installed.
[INFO] Checking if DockerImage(url=nvcr.io/nvidia/pytorch:24.02-py3) is installed for RunAI.
[INFO] Test Scenario: nccl-test

Section Name: Tests.1
  Test Name: nccl_test_all_reduce
  Description: all_reduce
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating RunAIRunner
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted RunAI job: a84baafc-7b53-49c0-b89f-38a111a78ab1
[INFO] Job completed: Tests.1 (iteration 1 of 1)
[INFO] All test scenario results stored at: results/nccl-test_2025-04-03_06-45-00
[INFO] Generated scenario report at results/nccl-test_2025-04-03_06-45-00/nccl-test.html
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.
➜  0 git:(runai) pwd          
/Users/theo/cloudai/results/nccl-test_2025-04-03_06-45-00/Tests.1/0

➜  0 git:(runai) ls
cloudai_nccl_test_bokeh_report.html cloudai_nccl_test_csv_report.csv    events.txt                          stdout.txt

$ cat stdout.txt                                                                                                                                              
# nThread 1 nGpus 1 minBytes 128 maxBytes 17179869184 step: 2(factor) warmup iters: 50 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid      1 on nccl-test-20250403104500-1-0 device  0 [0x53] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         128            32     float     sum      -1     3.07    0.04    0.00      0     0.10    1.33    0.00      0
         256            64     float     sum      -1     2.82    0.09    0.00      0     0.09    2.76    0.00      0
         512           128     float     sum      -1     2.85    0.18    0.00      0     0.09    5.52    0.00      0
        1024           256     float     sum      -1     2.83    0.36    0.00      0     0.09   11.06    0.00      0
        2048           512     float     sum      -1     2.96    0.69    0.00      0     0.11   19.07    0.00      0
        4096          1024     float     sum      -1     2.84    1.44    0.00      0     0.09   44.28    0.00      0
        8192          2048     float     sum      -1     2.82    2.91    0.00      0     0.09   88.36    0.00      0
       16384          4096     float     sum      -1     2.85    5.76    0.00      0     0.09  177.30    0.00      0
       32768          8192     float     sum      -1     2.84   11.54    0.00      0     0.09  353.83    0.00      0
       65536         16384     float     sum      -1     2.84   23.06    0.00      0     0.09  703.86    0.00      0
      131072         32768     float     sum      -1     2.85   46.02    0.00      0     0.09  1415.46    0.00      0
      262144         65536     float     sum      -1     3.11   84.28    0.00      0     0.09  2827.87    0.00      0
      524288        131072     float     sum      -1     3.31  158.62    0.00      0     0.09  5613.36    0.00      0
     1048576        262144     float     sum      -1     5.17  202.82    0.00      0     0.09  11213.52    0.00      0
     2097152        524288     float     sum      -1     5.74  365.07    0.00      0     0.09  22501.63    0.00      0
     4194304       1048576     float     sum      -1     7.12  589.12    0.00      0     0.09  45387.99    0.00      0
     8388608       2097152     float     sum      -1    10.51  798.33    0.00      0     0.09  90103.20    0.00      0
    16777216       4194304     float     sum      -1    16.98  988.00    0.00      0     0.09  182559.48    0.00      0
    33554432       8388608     float     sum      -1    30.29  1107.87    0.00      0     0.09  363497.26    0.00      0
    67108864      16777216     float     sum      -1    55.56  1207.83    0.00      0     0.09  723857.88    0.00      0
   134217728      33554432     float     sum      -1    106.1  1265.43    0.00      0     0.09  1443201.38    0.00      0
   268435456      67108864     float     sum      -1    206.5  1299.95    0.00      0     0.09  2895743.86    0.00      0
   536870912     134217728     float     sum      -1    410.6  1307.60    0.00      0     0.09  5748082.57    0.00      0
  1073741824     268435456     float     sum      -1    813.6  1319.79    0.00      0     0.09  11581726.07    0.00      0
  2147483648     536870912     float     sum      -1   1619.7  1325.87    0.00      0     0.09  23291579.70    0.00      0
  4294967296    1073741824     float     sum      -1   3232.4  1328.70    0.00      0     0.09  45935479.10    0.00      0
  8589934592    2147483648     float     sum      -1   6457.2  1330.29    0.00      0     0.09  90601567.26    0.00      0
 17179869184    4294967296     float     sum      -1    12907  1331.10    0.00      0     0.10  179705744.60    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
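
For anyone who wants to post-process a `stdout.txt` like the one above, here is a minimal parsing sketch. It assumes the 13-column all_reduce layout shown in this output (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and in-place); `NcclRow` and `parse_nccl_stdout` are hypothetical names and not CloudAI's report code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class NcclRow:
    size_bytes: int
    count: int
    dtype: str
    redop: str
    root: int
    oop_time_us: float    # out-of-place time
    oop_algbw_gbs: float  # out-of-place algorithm bandwidth
    oop_busbw_gbs: float  # out-of-place bus bandwidth
    ip_time_us: float     # in-place time
    ip_algbw_gbs: float   # in-place algorithm bandwidth
    ip_busbw_gbs: float   # in-place bus bandwidth


def parse_nccl_stdout(lines: Iterable[str]) -> List[NcclRow]:
    """Collect data rows, skipping '#' comment lines and anything not 13 columns wide."""
    rows: List[NcclRow] = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        f = stripped.split()
        if len(f) != 13:
            continue
        try:
            rows.append(
                NcclRow(
                    int(f[0]), int(f[1]), f[2], f[3], int(f[4]),
                    float(f[5]), float(f[6]), float(f[7]),
                    # f[8] and f[12] are the #wrong columns, ignored here
                    float(f[9]), float(f[10]), float(f[11]),
                )
            )
        except ValueError:
            continue  # not a numeric data row
    return rows
```

Usage would be along the lines of `rows = parse_nccl_stdout(open("stdout.txt"))`, after which the largest message size is available as `rows[-1].size_bytes`.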

$ cat events.txt 
RunAIEvent(created_at=2025-04-03 10:45:05+00:00, id=5a5561e4-cbcd-4035-9edb-0a5cb7620c54, type=Normal, cluster_id=a69928cc-ccaa-48be-bda9-482440f4d855, message=Created pod: nccl-test-20250403104500-0-0, reason=SuccessfulCreate, source=runaijobcontroller, involved_object=InvolvedObject(uid=31ebd188-b4a7-4068-bc8c-b4cce7bedf0a, kind=RunaiJob, name=nccl-test-20250403104500, namespace=runai-cloudai))
...
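
The event record above suggests a simple nested structure. The sketch below mirrors the field names visible in that `events.txt` line; the types, the frozen dataclasses, and the `non_normal_events` helper are assumptions for illustration and may not match the classes added in this PR.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class InvolvedObject:
    uid: str
    kind: str
    name: str
    namespace: str


@dataclass(frozen=True)
class RunAIEvent:
    created_at: datetime
    id: str
    type: str
    cluster_id: str
    message: str
    reason: str
    source: str
    involved_object: InvolvedObject


def non_normal_events(events: Iterable[RunAIEvent]) -> List[RunAIEvent]:
    """Hypothetical helper: keep only events whose type is not 'Normal'."""
    return [e for e in events if e.type != "Normal"]
```

A filter like this could be a starting point for surfacing scheduling problems from the event stream instead of reading `events.txt` by hand.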

Additional Notes

@TaekyungHeo TaekyungHeo force-pushed the runai branch 29 times, most recently from 4922ce4 to 2b381b3 Compare April 1, 2025 19:36
@TaekyungHeo TaekyungHeo force-pushed the runai branch 6 times, most recently from 8334b79 to fd53ca0 Compare April 3, 2025 10:44
@TaekyungHeo TaekyungHeo changed the title [WIP] Add RunAI scheduler support and enable NCCL tests submission Add RunAI scheduler support and enable NCCL tests submission Apr 3, 2025
@TaekyungHeo TaekyungHeo marked this pull request as ready for review April 3, 2025 10:47
@TaekyungHeo
Member Author

@amaslenn, I resolved all of your comments, thanks. I retested after updating the code, and it works.

Contributor

@srivatsankrishnan srivatsankrishnan left a comment

LGTM for the purpose of enabling RunAI for acceptance tests using CloudAI. It's a pretty long PR, and we need more time to discuss the design and architecture and how it affects other core features in CloudAI, such as DSE and reporting. Based on the time I had to review this PR, I don't think it affects DSE yet, because we don't have a use case for DSE with this scheduler.

@TaekyungHeo TaekyungHeo merged commit ae17cf1 into NVIDIA:main Apr 4, 2025
2 checks passed

3 participants