Make warpspeed scan tunable #8342

Open
bernhardmgruber wants to merge 5 commits into NVIDIA:main from bernhardmgruber:warpspeed_tuning_fixes

@bernhardmgruber bernhardmgruber commented Apr 9, 2026

Here are some more, hopefully final, fixes to make the warpspeed scan tunable.

With this, I can now run:

$ ../benchmarks/scripts/search.py -R '.*warpspeed.*' -a 'T{ct}=I16' -a 'Elements{io}[pow2]=28'
 ctk:  13.2.51
cccl:  v3.4.0.dev-431-g27a5856d94
cub.bench.scan.exclusive.sum.warpspeed.wrps_1.lbi_2.ipt_160 1.0000454437516246
cub.bench.scan.exclusive.sum.warpspeed.wrps_6.lbi_4.ipt_24 0.9971013976972155
cub.bench.scan.exclusive.sum.warpspeed.wrps_7.lbi_2.ipt_104 0.9173333009957451
cub.bench.scan.exclusive.sum.warpspeed.wrps_6.lbi_2.ipt_48 1.0
cub.bench.scan.exclusive.sum.warpspeed.wrps_2.lbi_8.ipt_168 1.0000454437516246
cub.bench.scan.exclusive.sum.warpspeed.wrps_2.lbi_6.ipt_16 0.6758141927287776
cub.bench.scan.exclusive.sum.warpspeed.wrps_6.lbi_2.ipt_120 1.0
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_8.ipt_80 1.0
cub.bench.scan.exclusive.sum.warpspeed.wrps_1.lbi_8.ipt_136 0.9971013976972155
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_8.ipt_32 0.9971013976972155
cub.bench.scan.exclusive.sum.warpspeed.wrps_5.lbi_5.ipt_128 0.9971013976972155
cub.bench.scan.exclusive.sum.warpspeed.wrps_2.lbi_6.ipt_32 0.9828571614182595
cub.bench.scan.exclusive.sum.warpspeed.wrps_7.lbi_8.ipt_96 0.9913544763005123
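For comparing variants, the result lines above can be split into their tuning parameters. The following is only a minimal sketch: it assumes each result line has the shape `<benchmark>.wrps_<n>.lbi_<n>.ipt_<n> <score>` as printed above, and `parse_variants` is a hypothetical helper, not part of search.py. The sort direction is arbitrary, since the output itself does not say how the score should be read.

```python
import re

# Assumed line shape, taken from the output above:
#   <benchmark>.wrps_<n>.lbi_<n>.ipt_<n> <score>
VARIANT_RE = re.compile(
    r"^(?P<bench>\S+?)\.wrps_(?P<wrps>\d+)\.lbi_(?P<lbi>\d+)\.ipt_(?P<ipt>\d+)"
    r"\s+(?P<score>\S+)$"
)


def parse_variants(text):
    """Return one dict per variant line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = VARIANT_RE.match(line.strip())
        if m is None:
            continue  # e.g. the "ctk:" and "cccl:" version lines
        rows.append(
            {
                "bench": m["bench"],
                "wrps": int(m["wrps"]),
                "lbi": int(m["lbi"]),
                "ipt": int(m["ipt"]),
                "score": float(m["score"]),
            }
        )
    return rows


# Three lines copied from the output above, plus one version line.
sample = """\
 ctk:  13.2.51
cub.bench.scan.exclusive.sum.warpspeed.wrps_1.lbi_2.ipt_160 1.0000454437516246
cub.bench.scan.exclusive.sum.warpspeed.wrps_2.lbi_6.ipt_16 0.6758141927287776
cub.bench.scan.exclusive.sum.warpspeed.wrps_6.lbi_2.ipt_48 1.0
"""

rows = parse_variants(sample)
for row in sorted(rows, key=lambda r: r["score"], reverse=True):
    print(row["wrps"], row["lbi"], row["ipt"], row["score"])
```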

In cccl_meta_bench.log I can see:

2026-04-09 20:05:45,458: starting build for cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_192: cmake --build . --target cub.bench.scan.exclusive.sum.warpspeed.variant
2026-04-09 20:05:53,485: finished build for cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_192 (exit code: 1) in 8.027s
2026-04-09 20:05:53,495: found cached base build for cub.bench.scan.exclusive.sum.warpspeed.base
2026-04-09 20:05:53,495: found cached base build for cub.bench.scan.exclusive.sum.warpspeed.base
2026-04-09 20:05:53,496: starting build for cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_40: cmake --build . --target cub.bench.scan.exclusive.sum.warpspeed.variant
2026-04-09 20:06:06,280: finished build for cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_40 (exit code: 0) in 12.785s
2026-04-09 20:06:06,292: found benchmark cub.bench.scan.exclusive.sum.warpspeed.base in cache
2026-04-09 20:06:06,292: starting benchmark cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_40 with ('T{ct}=I16', 'OffsetT{ct}=I64'): ./bin/cub.bench.scan.exclusive.sum.warpspeed.variant -a T{ct}=I16 -a OffsetT{ct}=I64 --jsonbin result.json --stopping-criterion entropy -d 0 -b base -a Elements{io}[pow2]=[28]
2026-04-09 20:06:06,957: finished benchmark cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_40 with ('T{ct}=I16', 'OffsetT{ct}=I64') (0) in 0.665s

The compilation failures (exit code: 1 above) happen whenever a tuning variant is produced that needs more than 48 KiB of shared memory (SMEM) for a single stage. Such a tuning is rejected at compile time.
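To list which variants failed to build, the log can be scanned for the "finished build" lines. This is a rough sketch that assumes the line format shown in the log excerpt; `failed_builds` is a hypothetical helper. A non-zero exit code only says that the build failed, so the SMEM limit is the likely cause here but not the only possible one.

```python
import re

# Assumed format, copied from the log excerpt:
#   <timestamp>: finished build for <variant> (exit code: <n>) in <secs>s
BUILD_RE = re.compile(
    r"finished build for (?P<variant>\S+) "
    r"\(exit code: (?P<code>\d+)\) in (?P<secs>[\d.]+)s"
)


def failed_builds(log_text):
    """Return the names of variants whose build exited with a non-zero code."""
    failed = []
    for line in log_text.splitlines():
        m = BUILD_RE.search(line)
        if m and int(m["code"]) != 0:
            failed.append(m["variant"])
    return failed


# Two lines copied from the log excerpt.
sample_log = """\
2026-04-09 20:05:53,485: finished build for cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_192 (exit code: 1) in 8.027s
2026-04-09 20:06:06,280: finished build for cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_3.ipt_40 (exit code: 0) in 12.785s
"""

print(failed_builds(sample_log))
```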

@bernhardmgruber bernhardmgruber requested a review from a team as a code owner April 9, 2026 18:07
@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 9, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 9, 2026
Comment on lines +42 to +62
+    def get_jsonlist(self, algname, listname):
+        benchmark_bin = os.path.join(".", "bin", algname + ".base")
+        if not os.path.exists(benchmark_bin):
+            raise Exception(f"Benchmark binary not found: {benchmark_bin}")
+        return subprocess.check_output([benchmark_bin, f"--jsonlist-{listname}"])
+
     def get_bench(self, algname):
         if algname not in self.bench_cache:
-            result = subprocess.check_output(
-                [os.path.join(".", "bin", algname + ".base"), "--jsonlist-benches"]
-            )
+            result = self.get_jsonlist(algname, "benches")
             self.bench_cache[algname] = json.loads(result)
         return self.bench_cache[algname]
 
     def get_device(self, algname):
         if algname not in self.device_cache:
-            result = subprocess.check_output(
-                [os.path.join(".", "bin", algname + ".base"), "--jsonlist-devices"]
-            )
-            devices = json.loads(result)["devices"]

+            result = self.get_jsonlist(algname, "devices")
+            data = json.loads(result)
+            if "devices" not in data:
+                raise Exception(
+                    "JSON returned from --jsonlist-devices does not contain 'devices' key"
+                )
+            devices = data["devices"]
Contributor Author

These don't change functionality; they should only improve diagnostics.
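The effect of the key check can be tried in isolation. This is a minimal sketch, not the code in the PR: `extract_devices` is a hypothetical stand-alone function, and both JSON payloads are made up for illustration rather than real benchmark output.

```python
import json


def extract_devices(raw):
    """Hypothetical stand-alone version of the 'devices' key check."""
    data = json.loads(raw)
    if "devices" not in data:
        raise Exception(
            "JSON returned from --jsonlist-devices does not contain 'devices' key"
        )
    return data["devices"]


# Made-up payloads: one with the expected key, one without.
print(extract_devices('{"devices": [{"id": 0}]}'))
try:
    extract_devices('{"error": "unexpected output"}')
except Exception as exc:
    print("diagnostic:", exc)
```

Without the check, the second call would fail with a bare KeyError that does not name the command that produced the output.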

#endif // !TUNE_BASE

#ifndef USES_WARPSPEED
# define USES_WARPSPEED() 0
Contributor

Is that still meant to be here?

Contributor Author

Yes


github-actions bot commented Apr 9, 2026

🥳 CI Workflow Results

🟩 Finished in 2h 14m: Pass: 100%/230 | Total: 9d 00h | Max: 2h 13m | Hits: 39%/133926

See results here.
