PyTorch test that uses torchvision #130

Merged
merged 28 commits into from
Jul 30, 2024

Conversation

casparvl
Collaborator

Work in progress...

Caspar van Leeuwen added 6 commits March 29, 2024 12:00
…cher-agnostic way, by simply passing them to the Python script as arguments and having the script set them in the environment. Print a clear error if neither SLURM nor OpenMPI's mpirun is used - we still rely on these to get the local rank, as there is no other way
…uired environment variables have been set.
@casparvl
Collaborator Author

casparvl commented May 3, 2024

Test is still failing on multiple nodes:

...
  File "/sw/arch/RHEL8/EB_production/2022/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /tmp/jenkins/build/PyTorch/1.12.0/foss-2022a-CUDA-11.7.0/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.12.12
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc)
...

Not a very explicit error...

The generated jobscript is e.g.

#!/bin/bash
#SBATCH --job-name="rfm_PyTorch_torchvision_GPU_afffb4aa"
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=18
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load torchvision/0.13.1-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=18:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=18
export SLURM_CPU_BIND=verbose
mpirun -np 8 python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

Note that if I replace mpirun -np 8 with srun, this test runs fine. It may be related to how ranks are mapped to devices: perhaps with mpirun this somehow leads to e.g. 2 ranks being mapped to the same device?

Possibly related:

@casparvl
Collaborator Author

casparvl commented May 3, 2024

With

NCCL_DEBUG=INFO mpirun -np 8 python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

I get

gcn20:834301:834709 [2] init.cc:469 NCCL WARN Duplicate GPU detected : rank 2 and rank 6 both on CUDA device ca000
gcn20:834301:834709 [2] NCCL INFO init.cc:914 -> 5

gcn20:834302:834703 [0] init.cc:469 NCCL WARN Duplicate GPU detected : rank 4 and rank 0 both on CUDA device 31000
gcn20:834302:834703 [0] NCCL INFO init.cc:914 -> 5
gcn20:834301:834709 [2] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn20:834303:834711 [2] init.cc:469 NCCL WARN Duplicate GPU detected : rank 6 and rank 2 both on CUDA device ca000
gcn20:834303:834711 [2] NCCL INFO init.cc:914 -> 5
gcn20:834302:834703 [0] NCCL INFO group.cc:58 -> 5 [Async thread]
gcn20:834303:834711 [2] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583938:1584337 [3] init.cc:469 NCCL WARN Duplicate GPU detected : rank 3 and rank 7 both on CUDA device e3000
gcn35:1583938:1584337 [3] NCCL INFO init.cc:914 -> 5

gcn20:834300:834695 [0] init.cc:469 NCCL WARN Duplicate GPU detected : rank 0 and rank 4 both on CUDA device 31000
gcn35:1583938:1584337 [3] NCCL INFO group.cc:58 -> 5 [Async thread]
gcn20:834300:834695 [0] NCCL INFO init.cc:914 -> 5
gcn20:834300:834695 [0] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583940:1584335 [3] init.cc:469 NCCL WARN Duplicate GPU detected : rank 7 and rank 3 both on CUDA device e3000
gcn35:1583940:1584335 [3] NCCL INFO init.cc:914 -> 5
gcn35:1583940:1584335 [3] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583937:1584343 [1] init.cc:469 NCCL WARN Duplicate GPU detected : rank 1 and rank 5 both on CUDA device 32000
gcn35:1583937:1584343 [1] NCCL INFO init.cc:914 -> 5
gcn35:1583937:1584343 [1] NCCL INFO group.cc:58 -> 5 [Async thread]

gcn35:1583939:1584345 [1] init.cc:469 NCCL WARN Duplicate GPU detected : rank 5 and rank 1 both on CUDA device 32000
gcn35:1583939:1584345 [1] NCCL INFO init.cc:914 -> 5
gcn35:1583939:1584345 [1] NCCL INFO group.cc:58 -> 5 [Async thread]

So yeah, it does seem to be about mapping ranks to devices...

@casparvl
Collaborator Author

casparvl commented May 3, 2024

Yep, that's where the problem is. With mpirun:

host: gcn20.local.snellius.surf.nl, rank: 0, local_rank: 0
host: gcn20.local.snellius.surf.nl, rank: 4, local_rank: 0

with srun:

host: gcn20.local.snellius.surf.nl, rank: 0, local_rank: 0
host: gcn35.local.snellius.surf.nl, rank: 4, local_rank: 0

srun by default maps ranks in a block fashion, i.e. if you have 2 nodes with 4 ranks per node, it puts ranks 0-3 on node 0 and ranks 4-7 on node 1. mpirun defaults to ranking by slot. Since we do

export OMPI_MCA_rmaps_base_mapping_policy=node:PE=18

Our first slot will be on node 0 (core 0-17), our second slot will be on node 1 (core 0-17), third slot on node 0 (core 18-35), etc. Then, MPI ranks by slot, so: first rank will be on node 0 (core 0-17), second rank will be on node 1 (core 0-17), third rank on node 0 (core 18-35), etc. This is different from srun, and it breaks the assumption made in calculating the local rank with

        local_rank = rank - visible_gpus * (rank // visible_gpus)

which assumes that consecutive ranks are mapped to the same node until that node is full, i.e. srun's block distribution. Note that with srun, we could actually use SLURM_LOCALID to determine the local rank, which would work irrespective of the distribution method used by srun.
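
As an aside, that formula is just local_rank = rank % visible_gpus. A more robust, launcher-agnostic approach (a sketch only, not what the test currently does) would be to prefer the local-rank variables that the launchers already export and only fall back to the modulo arithmetic:

import os

def get_local_rank(rank: int, visible_gpus: int) -> int:
    # Slurm (srun) and Open MPI (mpirun) both export the local rank directly,
    # independent of how ranks are distributed over the nodes.
    for var in ("SLURM_LOCALID", "OMPI_COMM_WORLD_LOCAL_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    # Fallback: equivalent to rank - visible_gpus * (rank // visible_gpus),
    # and only correct for a block distribution of ranks over nodes.
    return rank % visible_gpus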

It's probably better / cleaner to do something like what CSCS did at https://github.com/eth-cscs/cscs-reframe-tests/blob/9c076fe2c1904b8e41c56184dba58de657d8c3a7/checks/apps/pytorch/src/pt_distr_env.py , i.e. separate setting up the distributed environment into a different script. They also set CUDA_VISIBLE_DEVICES externally in https://github.com/eth-cscs/cscs-reframe-tests/blob/9c076fe2c1904b8e41c56184dba58de657d8c3a7/checks/apps/pytorch/src/set_visible_devices.sh . It's cleaner to separate those preparatory steps from the actual test. Note that the CSCS code only works with srun as the launcher; we would still have to add an implementation for mpirun.
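
The core of that approach is to pin each rank to a single GPU before CUDA is initialized, so NCCL can never see two ranks on the same device. Roughly (a sketch only; the linked CSCS scripts are srun-specific and differ in detail):

import os

# Restrict each process to the device matching its local rank, so that inside
# the process its GPU is always cuda:0 and duplicate-device errors are impossible.
local_rank = int(os.environ.get("SLURM_LOCALID", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")))
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)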

I'll first do a fix on my current code, since this:

mpirun -np 8 --rank-by hwthread --report-bindings python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

runs successfully. Using --rank-by hwthread, we distribute ranks according to the smallest allocatable entity (a hwthread), which will always result in block distribution. That will satisfy the assumption made in

local_rank = rank - visible_gpus * (rank // visible_gpus)

After that is done, I'll refactor to split off the preparation from the actual test.
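
To make the difference concrete, here is a toy illustration (not part of the test) of why block distribution satisfies that assumption and round-robin distribution does not, for 2 nodes with 4 GPUs each:

nodes, gpus_per_node = 2, 4

for rank in range(nodes * gpus_per_node):
    local_rank = rank % gpus_per_node
    node_block = rank // gpus_per_node   # block placement (srun default, --rank-by hwthread)
    node_cyclic = rank % nodes           # round-robin placement (ranking by node-mapped slots)
    print(rank, local_rank, node_block, node_cyclic)

# With block placement every (node, local_rank) pair is unique. With cyclic
# placement, ranks 0 and 4 both land on node 0 with local_rank 0: exactly the
# "Duplicate GPU detected : rank 0 and rank 4" NCCL warning shown above.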

@casparvl
Collaborator Author

casparvl commented May 3, 2024

Merging #137 and synchronizing this branch with main afterwards should solve the mapping issue.

@casparvl
Collaborator Author

casparvl commented May 3, 2024

Ok, fix_compact_process_binding has resolved the mapping issue. I managed to do a successful 2-node 8-GPU run with mixed precision. For some reason, the 2-node 8-GPU run with default precision seems to hang: I don't get any output after the initialization (though that might also be buffering?). The job script was:

#!/bin/bash
#SBATCH --job-name="rfm_PyTorch_torchvision_GPU_afffb4aa"
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=18
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu
#SBATCH --export=None
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load torchvision/0.13.1-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=18:compact
export OMPI_MCA_rmaps_base_mapping_policy=slot:PE=18
export SLURM_CPU_BIND=verbose
mpirun -np 8 python pytorch_synthetic_benchmark.py --model resnet50 --master-port $(python python_get_free_socket.py) --master-address $(hostname --fqdn) --world-size 8 --use-ddp

It's been going for 11 minutes now; the others all finished in ~2 minutes. I'll retest this one later - it might just be an issue on one of the nodes I got allocated, and not an issue with this test at all.

@casparvl
Collaborator Author

casparvl commented May 6, 2024

Retest succeeded:

{EESSI 2023.06} (eessi_test_venv) [casparl@int6 EESSI]$ reframe -t CI -c test-suite/eessi/testsuite/tests/apps/PyTorch/ -n PyTorch_torchvision_GPU -t "1_4_node|1_node|^2_nodes" --run
[ReFrame Setup]
  version:           4.6.0-dev.2+a3b495fa
  command:           "/gpfs/home4/casparl/EESSI/eessi_test_venv/bin/reframe -t CI -c test-suite/eessi/testsuite/tests/apps/PyTorch/ -n PyTorch_torchvision_GPU -t '1_4_node|1_node|^2_nodes' --run"
  launched by:       casparl@int6
  working directory: '/gpfs/home4/casparl/EESSI'
  settings files:    '<builtin>', '/gpfs/home4/casparl/EESSI/test-suite/config/surf_snellius.py'
  check search path: (R) '/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/PyTorch'
  stage directory:   '/scratch-shared/casparl/reframe_output/staging'
  output directory:  '/gpfs/home4/casparl/EESSI/reframe_runs/output'
  log files:         '/gpfs/home4/casparl/EESSI/reframe_runs/logs/reframe_20240506_105753.log'

[==========] Running 12 check(s)
[==========] Started on Mon May  6 10:58:07 2024+0200

[----------] start processing checks
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /9670834c @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /afffb4aa @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /9a097893 @snellius:gpu+default
[     SKIP ] ( 1/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (8)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /f38b922d @snellius:gpu+default
[     SKIP ] ( 2/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (8)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /b61de8e2 @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /57b7dcc3 @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /4f67a6ef @snellius:gpu+default
[     SKIP ] ( 3/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (4)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /b12bb24d @snellius:gpu+default
[     SKIP ] ( 4/12) Skipping test: parallel strategy is 'None', but requested process count is larger than one (4)
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /aa1b6024 @snellius:gpu+default
[     SKIP ] ( 5/12) Skipping test: parallel strategy is ddp, but only one process is requested
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /4ee3891a @snellius:gpu+default
[     SKIP ] ( 6/12) Skipping test: parallel strategy is ddp, but only one process is requested
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /fab3ae32 @snellius:gpu+default
[ RUN      ] PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /a9961662 @snellius:gpu+default
[       OK ] ( 7/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /9670834c @snellius:gpu+default
P: total_throughput: 6580.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 822.6 img/sec (r:0, l:None, u:None)
[       OK ] ( 8/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /afffb4aa @snellius:gpu+default
P: total_throughput: 5476.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 684.5 img/sec (r:0, l:None, u:None)
[       OK ] ( 9/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /b61de8e2 @snellius:gpu+default
P: total_throughput: 3395.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 848.9 img/sec (r:0, l:None, u:None)
[       OK ] (10/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /57b7dcc3 @snellius:gpu+default
P: total_throughput: 3090.6 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 772.7 img/sec (r:0, l:None, u:None)
[       OK ] (11/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /fab3ae32 @snellius:gpu+default
P: total_throughput: 984.5 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 984.5 img/sec (r:0, l:None, u:None)
[       OK ] (12/12) PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /a9961662 @snellius:gpu+default
P: total_throughput: 816.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 816.7 img/sec (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 6/12 test case(s) from 12 check(s) (0 failure(s), 6 skipped, 0 aborted)
[==========] Finished on Mon May  6 11:01:52 2024+0200
Log file(s) saved in '/gpfs/home4/casparl/EESSI/reframe_runs/logs/reframe_20240506_105753.log'

@casparvl
Collaborator Author

casparvl commented May 6, 2024

The full run, including CPU tests, was successful:

[       OK ] (13/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /10d80ee6 @snellius:gpu+default
P: total_throughput: 6647.3 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 830.9 img/sec (r:0, l:None, u:None)
[       OK ] (14/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /857cd04a @snellius:gpu+default
P: total_throughput: 5818.4 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 727.3 img/sec (r:0, l:None, u:None)
[       OK ] (15/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /925f89d6 @snellius:gpu+default
P: total_throughput: 3373.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 843.4 img/sec (r:0, l:None, u:None)
[       OK ] (16/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /294549e0 @snellius:gpu+default
P: total_throughput: 3097.8 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 774.5 img/sec (r:0, l:None, u:None)
[       OK ] (17/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=mixed /f56603c8 @snellius:gpu+default
P: total_throughput: 963.9 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 963.9 img/sec (r:0, l:None, u:None)
[       OK ] (18/24) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=None %module_name=torchvision/0.13.1-foss-2022a-CUDA-11.7.0 %precision=default /31e51c44 @snellius:gpu+default
P: total_throughput: 814.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 814.0 img/sec (r:0, l:None, u:None)
[       OK ] (19/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /3a25a79c @snellius:genoa+default
P: total_throughput: 344.3 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 21.5 img/sec (r:0, l:None, u:None)
[       OK ] (20/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /41f0b958 @snellius:genoa+default
P: total_throughput: 195.5 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 24.4 img/sec (r:0, l:None, u:None)
[       OK ] (21/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /6c4d7201 @snellius:genoa+default
P: total_throughput: 53.9 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 27.0 img/sec (r:0, l:None, u:None)
[       OK ] (22/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=2_nodes %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /3a25a79c @snellius:rome+default
P: total_throughput: 183.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 11.4 img/sec (r:0, l:None, u:None)
[       OK ] (23/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /41f0b958 @snellius:rome+default
P: total_throughput: 93.7 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 11.7 img/sec (r:0, l:None, u:None)
[       OK ] (24/24) EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_4_node %parallel_strategy=ddp %module_name=torchvision/0.13.1-foss-2022a /6c4d7201 @snellius:rome+default
P: total_throughput: 27.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 13.5 img/sec (r:0, l:None, u:None)

(Note that the other 12 tests are correctly skipped since they are invalid combinations, e.g. parallel_strategy=ddp but with a process count of 1.)

@casparvl casparvl marked this pull request as ready for review May 6, 2024 14:55
@smoors
Collaborator

smoors commented Jun 13, 2024

@casparvl can you pls merge main, so it's easier to compare and see what changed?

@@ -57,7 +57,7 @@ def _assign_default_num_gpus_per_node(test: rfm.RegressionTest):

def assign_tasks_per_compute_unit(test: rfm.RegressionTest, compute_unit: str, num_per: int = 1):
"""
Assign one task per compute unit (COMPUTE_UNIT[CPU], COMPUTE_UNIT[CPU_SOCKET] or COMPUTE_UNIT[GPU]).
Assign one task per compute unit.
Collaborator

Suggested change
Assign one task per compute unit.
Assign one task per compute unit. More than 1 task per compute unit can be assigned with num_per for compute units that support it.

please also update line 83 to:

 if num_per != 1 and compute_unit not in [COMPUTE_UNIT[NODE]]:

Comment on lines 66 to 67
# Hybrid code, so launch 1 rank per socket.
# Probably, launching 1 task per NUMA domain is even better, but the current hook doesn't support it
Collaborator

i guess these comments no longer apply, as we are now launching 1 rank per numa node?

hooks.set_compact_process_binding(self)

@run_after('setup')
def set_ddp_env_vars(self):
Collaborator

Suggested change
def set_ddp_env_vars(self):
def set_ddp_options(self):

# Set environment variables for PyTorch DDP
if self.parallel_strategy == 'ddp':
# Set additional options required by DDP
self.executable_opts += ["--master-port $(python python_get_free_socket.py)"]
Collaborator

i was a bit confused by the name of this script. can we call it get_free_port.py instead?
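
For reference, such a helper can be tiny; a sketch of what a get_free_port.py could contain (the name is just the one suggested here, and the actual script in this PR may differ):

import socket

# Bind to port 0 so the OS picks a free ephemeral port, print it, then release it.
# There is a small race window before PyTorch rebinds the port, but this is the
# usual way to obtain a free port for the DDP master.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))
    print(s.getsockname()[1])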

# the compute threads. It was fixed, but seems to be broken again in Horovod 0.28.1
# The easiest workaround is to reduce the number of compute threads by 2
if self.compute_device == DEVICE_TYPES[CPU] and self.parallel_strategy == 'horovod':
self.env_vars['OMP_NUM_THREADS'] = max(self.num_cpus_per_task - 2, 2) # Never go below 2 compute threads
Collaborator

what if you have only 1 or 2 cores? is it still better to have 2 compute threads in that case?

Collaborator Author

I'll take it out. We don't have parallel_strategy=horovod in this test anyway, that was a remnant of our (internal) test that I ported.

If we ever deploy horovod (with PyTorch support), we could bring this back. But then it'd be a horovod test, not a pytorch test :)

@smoors
Collaborator

smoors commented Jun 16, 2024

i ran a test with 1 GPU but it failed. apparently OMP_NUM_THREADS is not defined in that case:

FAILURE INFO for EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed (run: 1/1)
  * Description: Benchmark that runs a selected torchvision model on synthetic data                                                                                             
  * System partition: hydra:ampere
  * Environment: default
  * Stage directory: /data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/stage/hydra/ampere/default/EESSI_PyTorch_torchvision_GPU_4e984066
  * Node list: node402
  * Job type: batch job (id=9582095)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /4e984066 -p default --system hydra:ampere -r'
  * Reason: sanity error: pattern 'Total img/sec on 1 .PU\\(s\\):.*' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
World size: 1
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
cpu-bind=MASK - node402, task  0  0 [1914655]: mask 0xffff set
Traceback (most recent call last):
  File "/vscmnt/brussel_pixiu_data/_data_brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/stage/hydra/ampere/default/EESSI_PyTorch_torchvision_GPU_4e984066/pytorch_s$
nthetic_benchmark.py", line 152, in <module>
    torch.set_num_threads(int(os.environ['OMP_NUM_THREADS']))
                              ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'OMP_NUM_THREADS'
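
A defensive lookup in the benchmark script would avoid this KeyError when the launcher does not export OMP_NUM_THREADS; a sketch (the fallback is an assumption, not necessarily what was merged later):

import os
import torch

# Fall back to the number of CPUs this process is allowed to use (Linux-only)
# when OMP_NUM_THREADS is not set, e.g. in a single-task, single-GPU job.
num_threads = int(os.environ.get("OMP_NUM_THREADS", len(os.sched_getaffinity(0))))
torch.set_num_threads(num_threads)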

@casparvl casparvl requested a review from smoors July 1, 2024 12:26
@casparvl
Collaborator Author

i ran a test with 1 GPU but it failed. apparently OMP_NUM_THREADS is not defined in that case:

Darn, I realize I didn't look into this yet... need to still fix it, but don't have time now :\

Caspar van Leeuwen added 4 commits July 25, 2024 16:32
@smoors
Collaborator

smoors commented Jul 30, 2024

single-GPU test:

[ReFrame Setup]                                                                                                                                                                                                    
  version:           4.3.2                                                                                                                                                                                         
  command:           '/apps/brussel/RL8/skylake-ib/software/ReFrame/4.3.2/bin/reframe -C /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../config_hydra.py -c /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py -t CI -t 1_2_node --setvar modules=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 --system hydra:ampere -r --performance-report -v'
  launched by:       vsc10009@login1.cerberus.os                                                                                                                                                                   
  working directory: '/vscmnt/brussel_pixiu_data/_data_brussel/100/vsc10009/reframe/eessitestsuite'                                                                                                                
  settings files:    '<builtin>', '/data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../config_hydra.py'                                                                                                
  check search path: '/vscmnt/brussel_pixiu_data/_data_brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py'                                           
  stage directory:   '/data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/stage'                                                                                                                      
  output directory:  '/data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/output'                                                                                                                     
  log files:         '/data/brussel/100/vsc10009/reframe/eessitestsuite/newtest/prefix/logs/reframe_20240728_222949.log'                                                                                           
                                                                                                                                                                                                                   
                                                                                                                                                                                                                   
Loaded 3900 test(s)                                                                                                                                                                                                
Generated 60 test case(s)                                                                                                                                                                                          
Filtering test cases(s) by name: 60 remaining                                                                                                                                                                      
Filtering test cases(s) by tags: 4 remaining                                                                                                                                                                       
Filtering test cases(s) by other attributes: 4 remaining            
Final number of test cases: 4                         
[==========] Running 4 check(s)                                                                                                                                                                                   
[==========] Started on Sun Jul 28 22:30:38 2024                                                                                                                                                                  
                                                                                                                                                                                                                   
[----------] start processing checks                                                                                                                                                                              
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hydra:ampere+default    
[     SKIP ] (1/4) Skipping test: parallel strategy is ddp, but only one process is requested.                                                                                                                    
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hydra:ampere+default
[     SKIP ] (2/4) Skipping test: parallel strategy is ddp, but only one process is requested.                                                      
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hydra:ampere+default   
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hydra:ampere+default
[       OK ] (3/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hydra:ampere+default
P: total_throughput: 1250.9 img/sec (r:0, l:None, u:None)                                                                                      
P: througput_per_CPU: 1250.9 img/sec (r:0, l:None, u:None)                                                                                                                                                        
==> setup: 0.232s compile: 0.010s run: 44.738s sanity: 0.007s performance: 0.006s total: 45.267s                                                                                                                  
[       OK ] (4/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hydra:ampere+default
P: total_throughput: 828.2 img/sec (r:0, l:None, u:None)                                                                                                                                                          
P: througput_per_CPU: 828.2 img/sec (r:0, l:None, u:None)                                                                                                                                                         
==> setup: 0.248s compile: 0.010s run: 45.726s sanity: 0.006s performance: 0.003s total: 46.129s
[----------] all spawned checks have finished                                                                                                                                                                     
                                                                                                                                                                                                                  
[  PASSED  ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
[==========] Finished on Sun Jul 28 22:31:25 2024

================================================================================================================================================================================
PERFORMANCE REPORT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hydra:ampere:default]
  num_tasks: 1
  num_gpus_per_node: 1
  num_tasks_per_node: 1
  num_cpus_per_task: 16
  performance:
    - total_throughput: 1250.9 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 1250.9 img/sec (r: 0 img/sec l: -inf% u: +inf%)
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hydra:ampere:default]
  num_tasks: 1
  num_gpus_per_node: 1
  num_tasks_per_node: 1
  num_cpus_per_task: 16
  performance:
    - total_throughput: 828.2 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 828.2 img/sec (r: 0 img/sec l: -inf% u: +inf%)

@smoors
Collaborator

smoors commented Jul 30, 2024

2-GPU test:

[ReFrame Setup]
  version:           4.3.3
  command:           '/readonly/dodrio/apps/RHEL8/zen2-ib/software/ReFrame/4.3.3/bin/reframe -C /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../vsc_hortense.py -c /data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py -t CI -t 1_2_node --setvar modules=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 --system hortense:gpu_rome_a100_40gb -r --performance-report -v'
  launched by:       vsc10009@login55.dodrio.os
  working directory: '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite'                                                                                                                                    
  settings files:    '<builtin>', '/data/brussel/100/vsc10009/reframe/eessitestsuite/test-suite/../vsc_hortense.py'                                                                                               
  check search path: '/vscmnt/brussel_cerberus_data/_data_brussel/100/vsc10009/reframe/eessitestsuite/test-suite/eessi/testsuite/tests/apps/PyTorch/PyTorch_torchvision.py'                                       
  stage directory:   '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite/newtest/prefix/stage'                                                                                                               
  output directory:  '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite/newtest/prefix/output'
  log files:         '/dodrio/scratch/projects/badmin/vsc10009/eessitestsuite/newtest/prefix/logs/reframe_20240729_175936.log'                                                                                    

Loaded 3120 test(s)
Generated 260 test case(s)
Filtering test cases(s) by name: 260 remaining
Filtering test cases(s) by tags: 4 remaining
Filtering test cases(s) by other attributes: 4 remaining                                                                                                                                                          
Final number of test cases: 4
[==========] Running 4 check(s)
[==========] Started on Mon Jul 29 18:03:38 2024

[----------] start processing checks
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /4e984066 @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[     SKIP ] (1/4) Skipping test: parallel strategy is 'None', but requested process count is larger than one (2).
[ RUN      ] EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /498fd283 @hortense:gpu_rome_a100_40gb+default
WARNING: hooks.set_compact_process_binding does not support the current launcher (mympirun). The test will run, but using the default binding strategy of your parallel launcher. This may lead to suboptimal performance. Please expand the functionality of hooks.set_compact_process_binding for your parallel launcher.
[     SKIP ] (2/4) Skipping test: parallel strategy is 'None', but requested process count is larger than one (2).
[       OK ] (3/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hortense:gpu_rome_a100_40gb+default
P: total_throughput: 2222.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 1111.0 img/sec (r:0, l:None, u:None)
==> setup: 0.627s compile: 0.004s run: 9646.885s sanity: 0.035s performance: 0.004s total: 9649.153s
[       OK ] (4/4) EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hortense:gpu_rome_a100_40gb+default
P: total_throughput: 1543.0 img/sec (r:0, l:None, u:None)
P: througput_per_CPU: 771.5 img/sec (r:0, l:None, u:None)
==> setup: 0.185s compile: 0.004s run: 9664.221s sanity: 0.004s performance: 0.002s total: 9666.775s
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
[==========] Finished on Mon Jul 29 20:44:46 2024

================================================================================================================================================================================
PERFORMANCE REPORT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=mixed /f41aa0c4 @hortense:gpu_rome_a100_40gb:default]
  num_gpus_per_node: 2
  num_tasks: 2
  num_tasks_per_node: 2
  num_cpus_per_task: 12
  performance:
    - total_throughput: 2222.0 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 1111.0 img/sec (r: 0 img/sec l: -inf% u: +inf%)
[EESSI_PyTorch_torchvision_GPU %nn_model=resnet50 %scale=1_2_node %parallel_strategy=ddp %module_name=PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1 %precision=default /d7cc8bfe @hortense:gpu_rome_a100_40gb:default]
  num_gpus_per_node: 2
  num_tasks: 2
  num_tasks_per_node: 2
  num_cpus_per_task: 12
  performance:
    - total_throughput: 1543.0 img/sec (r: 0 img/sec l: -inf% u: +inf%)
    - througput_per_CPU: 771.5 img/sec (r: 0 img/sec l: -inf% u: +inf%)

@smoors smoors left a comment

looking good!

@smoors smoors merged commit 6e05428 into EESSI:main Jul 30, 2024
10 checks passed