Change our warp-specialized kernel to something like:
if load-warp:
    asm volatile("{setmaxnreg.dec.sync.aligned.u32 56; \n\t}");
    do the work
    return; # Super important!
else:
    asm volatile("{setmaxnreg.inc.sync.aligned.u32 224; \n\t}");
    do the work
This seems to improve performance; see the experiment in #3566.
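
For reference, the pseudocode above could be fleshed out roughly as follows. This is only a sketch under assumptions that are not stated in this issue: the CTA has one load warpgroup and two compute warpgroups of 128 threads each, the kernel name and the placeholder work are hypothetical, and the target is sm_90a. The branch is taken per warpgroup rather than per warp because the PTX ISA requires `setmaxnreg` to be executed by all threads of a warpgroup. With this layout the budget adds up: 56 + 2 × 224 = 504 = 3 × 168 registers per thread slot, which fits in the 64K registers of an SM when one CTA is resident.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed layout (not from the issue): 1 load warpgroup + 2 compute warpgroups.
constexpr int kWarpGroupSize = 128;
constexpr int kNumWarpGroups = 3;
constexpr int kThreads = kWarpGroupSize * kNumWarpGroups;

// __launch_bounds__ gives the kernel a known per-thread register limit;
// per the PTX ISA, setmaxnreg may have no effect without one.
__global__ void __launch_bounds__(kThreads, 1)
warp_specialized_kernel(float* out) {  // hypothetical kernel name
  // Uniform across each warpgroup, so all 128 threads take the same branch.
  const bool is_load_group = (threadIdx.x / kWarpGroupSize) == 0;

  if (is_load_group) {
#if defined(__CUDA_ARCH_FEAT_SM90_ALL)  // defined when compiling for sm_90a
    // Release registers back to the CTA pool.
    asm volatile("setmaxnreg.dec.sync.aligned.u32 56;\n");
#endif
    // ... placeholder: issue the loads here ...
    return;  // early exit, as the pseudocode stresses
  } else {
#if defined(__CUDA_ARCH_FEAT_SM90_ALL)
    // Claim the registers released by the load warpgroup.
    asm volatile("setmaxnreg.inc.sync.aligned.u32 224;\n");
#endif
    // ... placeholder: the real compute work goes here ...
    out[threadIdx.x - kWarpGroupSize] = 1.0f;
  }
}

int main() {
  const int num_compute_threads = kThreads - kWarpGroupSize;
  float* out = nullptr;
  cudaMalloc(&out, num_compute_threads * sizeof(float));
  cudaMemset(out, 0, num_compute_threads * sizeof(float));

  warp_specialized_kernel<<<1, kThreads>>>(out);
  cudaError_t err = cudaDeviceSynchronize();
  printf("kernel status: %s\n", cudaGetErrorString(err));

  cudaFree(out);
  return err == cudaSuccess ? 0 : 1;
}
```

Compiling with `nvcc -arch=sm_90a` enables the `setmaxnreg` instructions; on other architectures the guards skip them and the kernel runs with the default register allocation. The register counts must be multiples of 8 in the range 24 to 256, which 56 and 224 both satisfy.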