Releases: ROCm/TransferBench
TransferBench v1.50
Added
- Adding new parallel copy preset benchmark (pcopy)
- Usage: ./TransferBench pcopy <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=#GPU-1>
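The usage line above can be turned into a small command builder. This is only a sketch: the helper name pcopy_command and its keyword names are hypothetical, and it assumes that trailing positional arguments may be omitted so that TransferBench applies its own default (maxGpus = #GPU-1).

```python
import shlex

def pcopy_command(num_bytes="64M", num_cus=8, src_gpu=0, min_gpus=1,
                  max_gpus=None, binary="./TransferBench"):
    """Build the argument list for the pcopy preset (hypothetical helper).

    Positional order follows the usage line:
    pcopy <numBytes> <#CUs> <srcGpu> <minGpus> <maxGpus>
    """
    args = [binary, "pcopy", str(num_bytes), str(num_cus),
            str(src_gpu), str(min_gpus)]
    # Assumption: leaving maxGpus off lets TransferBench use #GPU-1.
    if max_gpus is not None:
        args.append(str(max_gpus))
    return args

# Print the command for review instead of running it.
print(shlex.join(pcopy_command(max_gpus=3)))
```

The returned list can be passed to subprocess.run on a machine where the TransferBench binary is present.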
Fixed
- Removed non-copy DMA Transfers (these had previously been using hipMemset)
- Fixed CPU executor when operating on null destination
TransferBench v1.49
Fixes
- Enumerating previously missed DMA engines used only for CPU traffic in topology display
TransferBench v1.48
Fixes
- Various fixes for TransferBenchCuda
Additions
- Support for targeting specific DMA engines via executor subindex (e.g. D0.1)
- Printing warnings when executors are overcommitted
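To illustrate the executor subindex notation, the sketch below splits a token such as D0.1 into its parts. It is not TransferBench's own parser: the function name parse_executor and the regular expression are assumptions, and the token is read as one executor letter, an executor index, and an optional subindex after the dot.

```python
import re

# Assumed token shape: one letter, an index, and an optional ".subindex".
_EXECUTOR_RE = re.compile(r"^([A-Za-z])(\d+)(?:\.(\d+))?$")

def parse_executor(token):
    """Split an executor token into (type, index, subindex or None)."""
    match = _EXECUTOR_RE.match(token)
    if match is None:
        raise ValueError(f"unrecognized executor token: {token!r}")
    exe_type, index, subindex = match.groups()
    return exe_type, int(index), (int(subindex) if subindex is not None else None)

print(parse_executor("D0.1"))  # ('D', 0, 1)
print(parse_executor("D0"))    # ('D', 0, None)
```

Under this reading, D0.1 names subindex 1 of DMA executor 0, while D0 leaves the engine choice unspecified.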
Modifications
- USE_REMOTE_READ supported for rwrite preset benchmark
TransferBench v1.47
Fixes
- Fixing CUDA compilation
TransferBench v1.46
Fixes
- Fixing GFX_UNROLL being set to 13 (past 8) on gfx906 cards
Modifications
- GFX_SINGLE_TEAM=1 by default
- Adding field showing summation of individual Transfer bandwidths for Executors
TransferBench v1.45
Additions
- Adding A2A_MODE to a2a preset (0 = copy, 1 = read-only, 2 = write-only)
- Adding GFX_UNROLL to modify GFX kernel's unroll factor
- Adding GFX_WAVE_ORDER to modify order in which wavefronts process data
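The A2A_MODE values listed above can be selected by name when preparing a run. This is a minimal sketch that assumes A2A_MODE is read from the process environment; the helper a2a_env and the mode names used as dictionary keys are hypothetical.

```python
import os

# Mode numbers as listed in the release note for the a2a preset.
A2A_MODES = {"copy": 0, "read-only": 1, "write-only": 2}

def a2a_env(mode="copy", base_env=None):
    """Return a copy of the environment with A2A_MODE set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # Environment values must be strings.
    env["A2A_MODE"] = str(A2A_MODES[mode])
    return env

print(a2a_env("read-only", base_env={})["A2A_MODE"])  # 1
```

The resulting dictionary can be passed as env= to subprocess.run(["./TransferBench", "a2a"], ...) on a system where the binary is available.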
Modifications
- Rewrote the GFX reduction kernel to support new wave ordering
TransferBench v1.44
Additions
- Adding rwrite preset to benchmark remote parallel writes
- Usage: ./TransferBench rwrite <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=3>
TransferBench v1.43
Changes
- Modifying a2a to show executor timing, as well as executor min/max bandwidth
TransferBench v1.42
Fixes
- Fixing schmoo maxNumCus optional arg parsing
- Schmoo output modified to be easier to copy
TransferBench v1.41
Additions
- Adding schmoo preset config that benchmarks local/remote reads/writes/copies
- Usage: ./TransferBench schmoo <numBytes=64M> <localIdx=0> <remoteIdx=1> <maxNumCUs=32>
Fixes
- Fixing some misreported timings when running with non-fixed number of iterations