Tune warpspeed scan for B200 (sm100) #8348

@bernhardmgruber

Once #8342 is merged, we can finally tune the warpspeed scan implementation. There is currently only one benchmark: cub/benchmarks/bench/scan/exclusive/sum.warpspeed.cu. We should tune for at least the value types I8, I16, I32, I64, and I128. No offset type needs to be provided (it is always I64). The problem size should be at least 2^28 elements.
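As a rough sketch, the tuning matrix above could be enumerated with a small Python helper that prints one search invocation per value type. The flags and the relative script path are copied from the example invocation in this issue. The helper names (`search_command`, `all_commands`) and the default of 2^32 elements are assumptions for illustration, and the script only prints the commands rather than running them.

```python
import shlex

# Value types listed in this issue; the offset type is always I64,
# so no offset axis is passed.
VALUE_TYPES = ["I8", "I16", "I32", "I64", "I128"]


def search_command(value_type, elements_pow2=32):
    """Build the argv for one search.py run (hypothetical helper).

    The default of 2^32 elements mirrors the example in this issue;
    anything below 2^28 is rejected, as the issue asks for at least 2^28.
    """
    if elements_pow2 < 28:
        raise ValueError("problem size should be at least 2^28")
    return [
        "../benchmarks/scripts/search.py",
        "-R", ".*scan.exclusive.sum.warpspeed",
        "-a", f"T{{ct}}={value_type}",
        "-a", f"Elements{{io}}[pow2]={elements_pow2}",
    ]


def all_commands(elements_pow2=32):
    """Return one argv list per value type to tune."""
    return [search_command(t, elements_pow2) for t in VALUE_TYPES]


if __name__ == "__main__":
    # Print shell-quoted commands; CUDA_VISIBLE_DEVICES and PYTHONPATH
    # still need to be set as in the example invocation.
    for argv in all_commands():
        print(shlex.join(argv))
```

Printing instead of executing keeps the sketch free of side effects; the output lines can be pasted into a shell once the environment variables are set.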

Example using the random search:

$ CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../benchmarks/scripts ../benchmarks/scripts/search.py -R '.*scan.exclusive.sum.warpspeed' -a 'T{ct}=I8' -a 'Elements{io}[pow2]=32'
 ctk:  13.1.115
cccl:  v3.4.0.dev-433-ge76addccbd
cub.bench.scan.exclusive.sum.warpspeed.wrps_8.lbi_5.ipt_176 0.6979214053331619
cub.bench.scan.exclusive.sum.warpspeed.wrps_5.lbi_4.ipt_48 0.9386672015634218
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_5.ipt_208 0.8706167012214349
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_2.ipt_112 0.8389938131208066
...
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_8.ipt_160 1.2642544877327193

The last run (wrps_4.lbi_8.ipt_160) already shows a speedup of about 1.26 with 4 warps, 8 look-back items, and 160 items per thread.
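To pick the best variant out of a longer search log, the output can be parsed mechanically. The sketch below assumes that each result line has the form `<variant name> <score>`, that the name ends in `wrps_N.lbi_N.ipt_N` as in the log above, and that a higher score means a larger speedup. The function names are hypothetical and not part of the benchmark scripts.

```python
import re

# Matches result lines such as
# "cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_8.ipt_160 1.26..."
LINE_RE = re.compile(
    r"^(?P<name>\S+\.wrps_(?P<wrps>\d+)\.lbi_(?P<lbi>\d+)\.ipt_(?P<ipt>\d+))"
    r"\s+(?P<score>[0-9.eE+-]+)$"
)


def parse_results(text):
    """Extract tuning parameters and scores; skip non-result lines."""
    results = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # header lines such as "ctk: ..." or elided "..."
        results.append({
            "name": m.group("name"),
            "wrps": int(m.group("wrps")),
            "lbi": int(m.group("lbi")),
            "ipt": int(m.group("ipt")),
            "score": float(m.group("score")),
        })
    return results


def best_variant(results):
    """Return the entry with the highest score (assumed: higher is better)."""
    return max(results, key=lambda r: r["score"])


# Lines copied from the log in this issue (the elided part stays elided).
SAMPLE = """\
 ctk:  13.1.115
cccl:  v3.4.0.dev-433-ge76addccbd
cub.bench.scan.exclusive.sum.warpspeed.wrps_8.lbi_5.ipt_176 0.6979214053331619
cub.bench.scan.exclusive.sum.warpspeed.wrps_5.lbi_4.ipt_48 0.9386672015634218
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_5.ipt_208 0.8706167012214349
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_2.ipt_112 0.8389938131208066
...
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_8.ipt_160 1.2642544877327193
"""

if __name__ == "__main__":
    best = best_variant(parse_results(SAMPLE))
    print(best["name"], best["score"])
```

On the lines shown here, this selects the same wrps_4.lbi_8.ipt_160 variant called out above.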
