PyTorch test that uses torchvision #130

Merged
merged 28 commits into from
Jul 30, 2024

Conversation

casparvl
Collaborator

Work in progress...

Caspar van Leeuwen added 6 commits March 29, 2024 12:00
…cher-agnostic way, by simply passing them to the Python script as arguments and having the script set them in the environment. Print a clear error if neither SLURM nor OpenMPI's mpirun is used - we still rely on these to get the local rank, as there is no other way
…uired environment variables have been set.
@casparvl
Collaborator Author

casparvl commented May 3, 2024

Test is still failing on multiple nodes:

...
  File "/sw/arch/RHEL8/EB_production/2022/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /tmp/jenkins/build/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.12.12
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc)
...

Not a very explicit error...

The generated jobscript is e.g.

#!/bin/bash
#SBATCH --job-name="rfm_PyTorch_torchvision_GPU_afffb4aa"
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=18
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load torchvision/0.13.1-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=18:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=18
export SLURM_CPU_BIND=verbose
mpirun -np 8 python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

Note that if I replace mpirun -np 8 with srun, this test runs fine. It may be related to how ranks are mapped to devices: perhaps with mpirun this somehow leads to e.g. 2 ranks being mapped to the same device?

Possibly related:

@casparvl
Collaborator Author

casparvl commented May 3, 2024

With

NCCL_DEBUG=INFO mpirun -np 8 python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

I get

gcn20:834301:834709 [2] init.cc:469 NCCL WARN Duplicate GPU detected : rank 2 and rank 6 both on CUDA device ca000
gcn20:834301:834709 [2] NCCL INFO init.cc:914 -> 5

gcn20:834302:834703 [0] init.cc:469 NCCL WARN Duplicate GPU detected : rank 4 and rank 0 both on CUDA device 31000
gcn20:834302:834703 [0] NCCL INFO init.cc:914 -> 5
gcn20:834301:834709 [2] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn20:834303:834711 [2] init.cc:469 NCCL WARN Duplicate GPU detected : rank 6 and rank 2 both on CUDA device ca000
gcn20:834303:834711 [2] NCCL INFO init.cc:914 -> 5
gcn20:834302:834703 [0] NCCL INFO group.cc:58 -> 5 [Async thread]
gcn20:834303:834711 [2] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583938:1584337 [3] init.cc:469 NCCL WARN Duplicate GPU detected : rank 3 and rank 7 both on CUDA device e3000
gcn35:1583938:1584337 [3] NCCL INFO init.cc:914 -> 5

gcn20:834300:834695 [0] init.cc:469 NCCL WARN Duplicate GPU detected : rank 0 and rank 4 both on CUDA device 31000
gcn35:1583938:1584337 [3] NCCL INFO group.cc:58 -> 5 [Async thread]
gcn20:834300:834695 [0] NCCL INFO init.cc:914 -> 5
gcn20:834300:834695 [0] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583940:1584335 [3] init.cc:469 NCCL WARN Duplicate GPU detected : rank 7 and rank 3 both on CUDA device e3000
gcn35:1583940:1584335 [3] NCCL INFO init.cc:914 -> 5
gcn35:1583940:1584335 [3] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583937:1584343 [1] init.cc:469 NCCL WARN Duplicate GPU detected : rank 1 and rank 5 both on CUDA device 32000
gcn35:1583937:1584343 [1] NCCL INFO init.cc:914 -> 5
gcn35:1583937:1584343 [1] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583939:1584345 [1] init.cc:469 NCCL WARN Duplicate GPU detected : rank 5 and rank 1 both on CUDA device 32000
gcn35:1583939:1584345 [1] NCCL INFO init.cc:914 -> 5
gcn35:1583939:1584345 [1] NCCL INFO group.cc:58 -> 5 [Async thread]

So yeah, it does seem to be about mapping ranks to devices...

@casparvl
Collaborator Author

casparvl commented May 3, 2024

Yep, that's where the problem is. With mpirun:

host: gcn20.local.snellius.surf.nl, rank: 0, local_rank: 0
host: gcn20.local.snellius.surf.nl, rank: 4, local_rank: 0

with srun:

host: gcn20.local.snellius.surf.nl, rank: 0, local_rank: 0
host: gcn35.local.snellius.surf.nl, rank: 4, local_rank: 0

srun by default maps ranks in a block fashion, i.e. if you have 2 nodes with 4 ranks per node, it puts ranks 0-3 on node 0 and ranks 4-7 on node 1. mpirun defaults to ranking by slot. Since we do

export OMPI_MCA_rmaps_base_mapping_policy=node:PE=18

Our first slot will be on node 0 (core 0-17), our second slot will be on node 1 (core 0-17), third slot on node 0 (core 18-35), etc. Then, MPI ranks by slot, so: first rank will be on node 0 (core 0-17), second rank will be on node 1 (core 0-17), third rank on node 0 (core 18-35), etc. This is different from srun, and it breaks the assumption made in calculating the local rank with

        local_rank = rank - visible_gpus * (rank // visible_gpus)

which assumes that consecutive ranks are mapped to the same node until that node is full, i.e. srun's block distribution. Note that with srun, we could actually use SLURM_LOCALID to determine the local rank, which would work irrespective of the distribution method used by srun.
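
As an aside, that formula is just local_rank = rank % visible_gpus. A more robust, launcher-agnostic approach (a sketch only, not what the test currently does) would be to prefer the local-rank variables that the launchers already export and only fall back to the modulo arithmetic:

import os

def get_local_rank(rank: int, visible_gpus: int) -> int:
    # Slurm (srun) and Open MPI (mpirun) both export the local rank directly,
    # independent of how ranks are distributed over the nodes.
    for var in ("SLURM_LOCALID", "OMPI_COMM_WORLD_LOCAL_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    # Fallback: equivalent to rank - visible_gpus * (rank // visible_gpus),
    # and only correct for a block distribution of ranks over nodes.
    return rank % visible_gpus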

It's probably better / cleaner to do something like what CSCS did at https://github.com/eth-cscs/cscs-reframe-tests/blob/9c076fe2c1904b8e41c56184dba58de657d8c3a7/checks/apps/pytorch/src/pt_distr_env.py , i.e. separate setting up the distributed environment into a different script. They also set CUDA_VISIBLE_DEVICES externally in https://github.com/eth-cscs/cscs-reframe-tests/blob/9c076fe2c1904b8e41c56184dba58de657d8c3a7/checks/apps/pytorch/src/set_visible_devices.sh . It's cleaner to separate those preparatory steps from the actual test. Note that the CSCS code only works with srun as the launcher; we would still have to add an implementation for mpirun.
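
The core of that approach is to pin each rank to a single GPU before CUDA is initialized, so NCCL can never see two ranks on the same device. Roughly (a sketch only; the linked CSCS scripts are srun-specific and differ in detail):

import os

# Restrict each process to the device matching its local rank, so that inside
# the process its GPU is always cuda:0 and duplicate-device errors are impossible.
local_rank = int(os.environ.get("SLURM_LOCALID", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")))
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)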

I'll first do a fix on my current code, since this:

mpirun -np 8 --rank-by hwthread --report-bindings python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

runs successfully. Using --rank-by hwthread, we distribute ranks according to the smallest allocatable entity (a hwthread), which will always result in block distribution. That will satisfy the assumption made in

local_rank = rank - visible_gpus * (rank // visible_gpus)

After that is done, I'll refactor to split off the preparation from the actual test.
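
To make the difference concrete, here is a toy illustration (not part of the test) of why block distribution satisfies that assumption and round-robin distribution does not, for 2 nodes with 4 GPUs each:

nodes, gpus_per_node = 2, 4

for rank in range(nodes * gpus_per_node):
    local_rank = rank % gpus_per_node
    node_block = rank // gpus_per_node   # block placement (srun default, --rank-by hwthread)
    node_cyclic = rank % nodes           # round-robin placement (ranking by node-mapped slots)
    print(rank, local_rank, node_block, node_cyclic)

# With block placement every (node, local_rank) pair is unique. With cyclic
# placement, ranks 0 and 4 both land on node 0 with local_rank 0: exactly the
# "Duplicate GPU detected : rank 0 and rank 4" NCCL warning shown above.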

@casparvl
Collaborator Author

casparvl commented May 3, 2024

Merging #137 and synchronizing this branch with main afterwards should solve the mapping issue.

@casparvl
Collaborator Author

casparvl commented May 3, 2024

Ok, fix_compact_process_binding has resolved the mapping issue. I managed to do a successful 2-node 8-GPU run with mixed precision. For some reason, the 2-node 8-GPU run with default precision seems to hang: I don't get any output after the initialization (though that might also be buffering?). The job script was:

#!/bin/bash
#SBATCH --job-name="rfm_PyTorch_torchvision_GPU_afffb4aa"
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=18
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load torchvision/0.13.1-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=18:compact
export OMPI_MCA_rmaps_base_mapping_policy=slot:PE=18
export SLURM_CPU_BIND=verbose
mpirun -np 8 python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

It's been going for 11 minutes now; the others all finished in ~2 minutes. I'll retest this one later - it might just be an issue on one of the nodes I got allocated, and not an issue with this test at all.

@casparvl
Collaborator Author

casparvl commented May 6, 2024

Retest succeeded:

{EESSI 2023.06} (eessi_test_venv) [casparl@int6 EESSI]$ reframe -t CI -c test-suite/eessi/testsuite/tests/apps/PyTorch/ -n PyTorch_torchvision_GPU -t "1_4_node|1_node|^2_nodes" --run
[ReFrame Setup]
  version:           4.6.0-dev.2+a3b495fa
  command:           "/gpfs/home4/casparl/EESSI/eessi_test_venv/bin/reframe -t CI -c test-suite/eessi/testsuite/tests/apps/PyTorch/ -n PyTorch_torchvision_GPU -t '1_4_node|1_node|^2_nodes' --run"
  launched by:       casparl@int6
  working directory: '/gpfs/home4/casparl/EESSI'
  settings files:    '<builtin>', '/gpfs/home4/casparl/EESSI/test-suite/config/surf_snellius.py'
  check search path: (R) '/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/PyTorch'
  stage directory:   '/scratch-shared/casparl/reframe_output/staging'
  output directory:  '/gpfs/home4/casparl/EESSI/reframe_runs/output'
  log files:         '/gpfs/home4/casparl/EESSI/reframe_runs/logs/reframe_20240506_105753.log'

[==========] Running 12 check(s)
[==========] Started on Mon May  6 10:58:07 2024+0200

[----------] start processing checks
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /9670834c @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /afffb4aa @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /9a097893 @snellius:gpu+default
[     SKIP ] ( 1/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (8)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /f38b922d @snellius:gpu+default
[     SKIP ] ( 2/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (8)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /b61de8e2 @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /57b7dcc3 @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /4f67a6ef @snellius:gpu+default
[     SKIP ] ( 3/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (4)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /b12bb24d @snellius:gpu+default
[     SKIP ] ( 4/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (4)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /aa1b6024 @snellius:gpu+default
[     SKIP ] ( 5/12) Skipping test: parallel strategy is ddp, but only one process is requested
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /4ee3891a @snellius:gpu+default
[     SKIP ] ( 6/12) Skipping test: parallel strategy is ddp, but only one process is requested
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /fab3ae32 @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /a9961662 @snellius:gpu+default
[       OK ] ( 7/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /9670834c @snellius:gpu+default
P: total_throughput: 6580.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 822.6 img/sec (r:0, l:None, u:None)
[       OK ] ( 8/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /afffb4aa @snellius:gpu+default
P: total_throughput: 5476.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 684.5 img/sec (r:0, l:None, u:None)
[       OK ] ( 9/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /b61de8e2 @snellius:gpu+default
P: total_throughput: 3395.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 848.9 img/sec (r:0, l:None, u:None)
[       OK ] (10/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /57b7dcc3 @snellius:gpu+default
P: total_throughput: 3090.6 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 772.7 img/sec (r:0, l:None, u:None)
[       OK ] (11/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /fab3ae32 @snellius:gpu+default
P: total_throughput: 984.5 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 984.5 img/sec (r:0, l:None, u:None)
[       OK ] (12/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /a9961662 @snellius:gpu+default
P: total_throughput: 816.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 816.7 img/sec (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 6/12 test case(s) from 12 check(s) (0 failure(s), 6 skipped, 0 aborted)
[==========] Finished on Mon May  6 11:01:52 2024+0200
Log file(s) saved in '/gpfs/home4/casparl/EESSI/reframe_runs/logs/reframe_20240506_105753.log'

@casparvl
Collaborator Author

casparvl commented May 6, 2024

The full run, including CPU tests, was successful:

[       OK ] (13/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /10d80ee6 @snellius:gpu+default
P: total_throughput: 6647.3 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 830.9 img/sec (r:0, l:None, u:None)
[       OK ] (14/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /857cd04a @snellius:gpu+default
P: total_throughput: 5818.4 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 727.3 img/sec (r:0, l:None, u:None)
[       OK ] (15/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /925f89d6 @snellius:gpu+default
P: total_throughput: 3373.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 843.4 img/sec (r:0, l:None, u:None)
[       OK ] (16/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /294549e0 @snellius:gpu+default
P: total_throughput: 3097.8 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 774.5 img/sec (r:0, l:None, u:None)
[       OK ] (17/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /f56603c8 @snellius:gpu+default
P: total_throughput: 963.9 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 963.9 img/sec (r:0, l:None, u:None)
[       OK ] (18/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /31e51c44 @snellius:gpu+default
P: total_throughput: 814.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 814.0 img/sec (r:0, l:None, u:None)
[       OK ] (19/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /3a25a79c @snellius:genoa+default
P: total_throughput: 344.3 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 21.5 img/sec (r:0, l:None, u:None)
[       OK ] (20/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /41f0b958 @snellius:genoa+default
P: total_throughput: 195.5 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 24.4 img/sec (r:0, l:None, u:None)
[       OK ] (21/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /6c4d7201 @snellius:genoa+default
P: total_throughput: 53.9 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 27.0 img/sec (r:0, l:None, u:None)
[       OK ] (22/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /3a25a79c @snellius:rome+default
P: total_throughput: 183.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 11.4 img/sec (r:0, l:None, u:None)
[       OK ] (23/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /41f0b958 @snellius:rome+default
P: total_throughput: 93.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 11.7 img/sec (r:0, l:None, u:None)
[       OK ] (24/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /6c4d7201 @snellius:rome+default
P: total_throughput: 27.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 13.5 img/sec (r:0, l:None, u:None)

(Note that the other 12 tests are correctly skipped since they are invalid combinations, e.g. parallel_strategy=ddp but with a process count of 1.)

@casparvl casparvl marked this pull request as ready for review May 6, 2024 14:55
@smoors
Collaborator

smoors commented Jun 13, 2024

@casparvl can you pls merge main, so it's easier to compare and see what changed?

@@ -57,7 +57,7 @@ def _assign_default_num_gpus_per_node(test: rfm.RegressionTest):

def assign_tasks_per_compute_unit(test: rfm.RegressionTest, compute_unit: str, num_per: int = 1):
"""
Assign one task per compute unit (COMPUTE_UNIT[CPU], COMPUTE_UNIT[CPU_SOCKET] or COMPUTE_UNIT[GPU]).
Assign one task per compute unit.
Collaborator

Suggested change
Assign one task per compute unit.
Assign one task per compute unit. More than 1 task per compute unit can be assigned with num_per for compute units that support it.

please also update line 83 to:

 if num_per != 1 and compute_unit not in [COMPUTE_UNIT[NODE]]:

Comment on lines 66 to 67
# Hybrid code, so launch 1 rank per socket.
# Probably, launching 1 task per NUMA domain is even better, but the current hook doesn't support it
Collaborator

i guess these comments no longer apply, as we are now launching 1 rank per numa node?

hooks.set_compact_process_binding(self)

@run_after('setup')
def set_ddp_env_vars(self):
Collaborator

Suggested change
def set_ddp_env_vars(self):
def set_ddp_options(self):

# Set environment variables for PyTorch DDP
if self.parallel_strategy == 'ddp':
# Set additional options required by DDP
self.executable_opts += ["--master-port $(python python_get_free_socket.py)"]
Collaborator

i was a bit confused by the name of this script. can we call it get_free_port.py instead?
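
For reference, such a helper can be tiny; a sketch of what a get_free_port.py could contain (the name is just the one suggested here, and the actual script in this PR may differ):

import socket

# Bind to port 0 so the OS picks a free ephemeral port, print it, then release it.
# There is a small race window before PyTorch rebinds the port, but this is the
# usual way to obtain a free port for the DDP master.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))
    print(s.getsockname()[1])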

# the compute threads. It was fixed, but seems to be broken again in Horovod 0.28.1
# The easiest workaround is to reduce the number of compute threads by 2
if self.compute_device == DEVICE_TYPES[CPU] and self.parallel_strategy == 'horovod':
self.env_vars['OMP_NUM_THREADS'] = max(self.num_cpus_per_task - 2, 2) # Never go below 2 compute threads
Collaborator

what if you have only 1 or 2 cores? is it still better to have 2 compute threads in that case?

Collaborator Author

I'll take it out. We don't have parallel_strategy=horovod in this test anyway, that was a remnant of our (internal) test that I ported.

If we ever deploy horovod (with PyTorch support), we could bring this back. But then it'd be a horovod test, not a pytorch test :)

@smoors
Collaborator

smoors commented Jun 16, 2024

i ran a test with 1 GPU but it failed. apparently OMP_NUM_THREADS is not defined in that case:

FAILURE INFO for EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed (run: 1/1)
  * Description: Benchmark that runs a selected torchvision model on synthetic data                                                                                             
  * System partition: hydra:ampere
  * Environment: default
  * Stage directory: /data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/stage/hydra/ampere/default/EESSI_PyTorch_torchvision_GPU_4e984066
  * Node list: node402
  * Job type: batch job (id=9582095)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /4e984066 -p default --system hydra:ampere -r'
  * Reason: sanity error: pattern 'Total img/sec on 1 .PU\\(s\\):.*' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
World size: 1
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
cpu-bind=MASK - node402, task  0  0 [1914655]: mask 0xffff set
Traceback (most recent call last):
  File "/vscmnt/brussel_pixiu_data/_data_brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/stage/hydra/ampere/default/EESSI_PyTorch_torchvision_GPU_4e984066/pytorch_s$
nthetic_benchmark.py", line 152, in <module>
    torch.set_num_threads(int(os.environ['OMP_NUM_THREADS']))
                              ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'OMP_NUM_THREADS'
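
A defensive lookup in the benchmark script would avoid this KeyError when the launcher does not export OMP_NUM_THREADS; a sketch (the fallback is an assumption, not necessarily what was merged later):

import os
import torch

# Fall back to the number of CPUs this process is allowed to use (Linux-only)
# when OMP_NUM_THREADS is not set, e.g. in a single-task, single-GPU job.
num_threads = int(os.environ.get("OMP_NUM_THREADS", len(os.sched_getaffinity(0))))
torch.set_num_threads(num_threads)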

@casparvl casparvl requested a review from smoors July 1, 2024 12:26
@casparvl
Collaborator Author

i ran a test with 1 GPU but it failed. apparently OMP_NUM_THREADS is not defined in that case:

Darn, I realize I didn't look into this yet... need to still fix it, but don't have time now :\

Caspar van Leeuwen added 4 commits July 25, 2024 16:32
@smoors
Collaborator

smoors commented Jul 30, 2024

single-GPU test:

[ReFrame Setup]                                                                                                                                                                                                    
  version:           4.3.2                                                                                                                                                                                         
  command:           '/apps/brussel/RL8/skylake-ib/software/ReFrame/4.3.2/bin/reframe -C /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../config_hydra.py -c /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py -t CI -t 1_2_node --setvar modules=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 --system hydra:ampere -r --performance-report -v'
  launched by:       vsc10009@login1.cerberus.os                                                                                                                                                                   
  working directory: '/vscmnt/brussel_pixiu_data/_data_brussel/100/vsc10009/reframe/eessitestsuite'                                                                                                                
  settings files:    '<builtin>', '/data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../config_hydra.py'                                                                                                
  check search path: '/vscmnt/brussel_pixiu_data/_data_brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py'                                           
  stage directory:   '/data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/stage'                                                                                                                      
  output directory:  '/data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/output'                                                                                                                     
  log files:         '/data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/logs/reframe_20240728_222949.log'                                                                                           
                                                                                                                                                                                                                   
                                                                                                                                                                                                                   
Loaded 3900 test(s)                                                                                                                                                                                                
Generated 60 test case(s)                                                                                                                                                                                          
Filtering test cases(s) by name: 60 remaining                                                                                                                                                                      
Filtering test cases(s) by tags: 4 remaining                                                                                                                                                                       
Filtering test cases(s) by other attributes: 4 remaining            
Final number of test cases: 4                         
[==========] Running 4 check(s)                                                                                                                                                                                   
[==========] Started on Sun Jul 28 22:30:38 2024                                                                                                                                                                  
                                                                                                                                                                                                                   
[----------] start processing checks                                                                                                                                                                              
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hydra:ampere+default    
[     SKIP ] (1/4) Skipping test: parallel strategy is ddp, but only one process is requested.                                                                                                                    
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hydra:ampere+default
[     SKIP ] (2/4) Skipping test: parallel strategy is ddp, but only one process is requested.                                                      
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hydra:ampere+default   
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hydra:ampere+default
[       OK ] (3/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hydra:ampere+default
P: total_throughput: 1250.9 img/sec (r:0, l:None, u:None)                                                                                      
P: througput_per_CPU: 1250.9 img/sec (r:0, l:None, u:None)                                                                                                                                                        
==> setup: 0.232s compile: 0.010s run: 44.738s sanity: 0.007s performance: 0.006s total: 45.267s                                                                                                                  
[       OK ] (4/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hydra:ampere+default
P: total_throughput: 828.2 img/sec (r:0, l:None, u:None)                                                                                                                                                          
P: througput_per_CPU: 828.2 img/sec (r:0, l:None, u:None)                                                                                                                                                         
==> setup: 0.248s compile: 0.010s run: 45.726s sanity: 0.006s performance: 0.003s total: 46.129s
[----------] all spawned checks have finished                                                                                                                                                                     
                                                                                                                                                                                                                  
[  PASSED  ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
[==========] Finished on Sun Jul 28 22:31:25 2024

================================================================================================================================================================================
PERFORMANCE REPORT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hydra:ampere:default]
  num_tasks: 1
  num_gpus_per_node: 1
  num_tasks_per_node: 1
  num_cpus_per_task: 16
  performance:
    - total_throughput: 1250.9 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 1250.9 img/sec (r: 0 img/sec l: -inf% u: +inf%)
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hydra:ampere:default]
  num_tasks: 1
  num_gpus_per_node: 1
  num_tasks_per_node: 1
  num_cpus_per_task: 16
  performance:
    - total_throughput: 828.2 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 828.2 img/sec (r: 0 img/sec l: -inf% u: +inf%)

@smoors
Collaborator

smoors commented Jul 30, 2024

2-GPU test:

[ReFrame Setup]
  version:           4.3.3
  command:           '/readonly/dodrio/apps/RHEL8/zen2-ib/software/ReFrame/4.3.3/bin/reframe -C /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../vsc_hortense.py -c /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py -t CI -t 1_2_node --setvar modules=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 --system hortense:gpu_rome_a100_40gb -r --performance-report -v'
  launched by:       vsc10009@login55.dodrio.os
  working directory: '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite'                                                                                                                                    
  settings files:    '<builtin>', '/data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../vsc_hortense.py'                                                                                               
  check search path: '/vscmnt/brussel_cerberus_data/_data_brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py'                                       
  stage directory:   '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite/newtest/prefix/stage'                                                                                                               
  output directory:  '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite/newtest/prefix/output'
  log files:         '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite/newtest/prefix/logs/reframe_20240729_175936.log'                                                                                    

Loaded 3120 test(s)
Generated 260 test case(s)
Filtering test cases(s) by name: 260 remaining
Filtering test cases(s) by tags: 4 remaining
Filtering test cases(s) by other attributes: 4 remaining                                                                                                                                                          
Final number of test cases: 4
[==========] Running 4 check(s)
[==========] Started on Mon Jul 29 18:03:38 2024

[----------] start processing checks
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[     SKIP ] (1/4) Skipping test: parallel strategy is 'None', but requested process count is larger than one (2).
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[     SKIP ] (2/4) Skipping test: parallel strategy is 'None', but requested process count is larger than one (2).
[       OK ] (3/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hortense:gpu_rome_a100_40gb+default
P: total_throughput: 2222.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 1111.0 img/sec (r:0, l:None, u:None)
==> setup: 0.627s compile: 0.004s run: 9646.885s sanity: 0.035s performance: 0.004s total: 9649.153s
[       OK ] (4/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hortense:gpu_rome_a100_40gb+default
P: total_throughput: 1543.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 771.5 img/sec (r:0, l:None, u:None)
==> setup: 0.185s compile: 0.004s run: 9664.221s sanity: 0.004s performance: 0.002s total: 9666.775s
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
[==========] Finished on Mon Jul 29 20:44:46 2024

================================================================================================================================================================================
PERFORMANCE REPORT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hortense:gpu_rome_a100_40gb:default]
  num_gpus_per_node: 2
  num_tasks: 2
  num_tasks_per_node: 2
  num_cpus_per_task: 12
  performance:
    - total_throughput: 2222.0 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 1111.0 img/sec (r: 0 img/sec l: -inf% u: +inf%)
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hortense:gpu_rome_a100_40gb:default]
  num_gpus_per_node: 2
  num_tasks: 2
  num_tasks_per_node: 2
  num_cpus_per_task: 12
  performance:
    - total_throughput: 1543.0 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 771.5 img/sec (r: 0 img/sec l: -inf% u: +inf%)

@smoors smoors left a comment

looking good!

@smoors smoors merged commit 6e05428 into EESSI:main Jul 30, 2024
10 checks passed