# Best practices for reducing host-side latency

Overall performance depends on both device performance (i.e., that of the kernel) and host performance (i.e., that of the runtime).
This notebook focuses on the latter: techniques to minimize any overheads incurred from the CUTLASS API and underlying
DSL runtimes.

This notebook does not discuss techniques for improving device-side performance. A future notebook may cover this topic.

**Note**: Latency measurements can vary from system to system. You may see different results on your system than shown
in the pre-populated fields of this notebook.

In [None]:
import time
import torch
import cutlass_api

In [None]:
if not (status := cutlass_api.utils.is_device_cc_supported({80, 89, 90, 100, 103})):
    print(
        f"This notebook requires a GPU with compute capability >= 80.\n{status.error}"
    )
    import sys

    sys.exit(0)

We start with boilerplate initial setup to create tensors and pick a kernel.

For the purposes of this notebook, we use a very small GEMM size of M=N=K=128
and L=1. This small size is chosen to magnify the impact of host latency on
end-to-end performance so as to better illustrate the effect of the techniques
described below.

In [None]:
warmup_iterations = 10
profiling_iterations = 100
total_iterations = warmup_iterations + profiling_iterations

# Use a small problem size to showcase host overheads
L, M, N, K = 1, 128, 128, 128

# We use different operands in each iteration. Though not particularly relevant for
# host latency, this is a best practice when benchmarking GPU kernels to avoid
# unrealistic caching effects.
As = [
    torch.randint(-1, 2, (M, K), device="cuda", dtype=torch.float16)
    for _ in range(total_iterations)
]
Bs = [
    torch.randint(-1, 2, (K, N), device="cuda", dtype=torch.float16)
    for _ in range(total_iterations)
]
outs = [
    torch.empty((M, N), device="cuda", dtype=torch.float16)
    for _ in range(total_iterations)
]

# Construct arguments outside of the benchmarking loop. We will later also consider
# cases in which they are constructed inside the benchmarking loop.
args = [
    cutlass_api.arguments.GemmArguments(
        A=As[i], B=Bs[i], out=outs[i], accumulator_type=torch.float32
    )
    for i in range(total_iterations)
]

references = [(As[i] @ Bs[i]).to(outs[i].dtype) for i in range(total_iterations)]

cc = cutlass_api.utils.device_cc()
kernels = cutlass_api.get_kernels(args[0], cc=cc)
assert len(kernels) > 0

kernel = kernels[0]

We next set up a basic benchmarking routine.

In [None]:
def benchmark(
    label, code, warmup_it=warmup_iterations, profiling_it=profiling_iterations
):
    total_it = warmup_it + profiling_it
    assert total_it <= total_iterations, (
        f"Benchmark-local iteration count must be less than or equal to total iterations: {total_it} > {total_iterations}"
    )
    # warmup
    rets = [None] * total_it
    for i in range(warmup_it):
        rets[i] = code(i)
    torch.cuda.synchronize()

    # perf_counter is monotonic and has higher resolution than time.time
    start = time.perf_counter()
    for i in range(profiling_it):
        idx = warmup_it + i
        rets[idx] = code(idx)
    torch.cuda.synchronize()
    end = time.perf_counter()

    avg_time = (end - start) / profiling_it
    print(f"[{label:<30}] avg of {profiling_it} iterations: {avg_time:1.3e} seconds")
    return avg_time, rets
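
The routine above reports a single average that includes the final device synchronization. As a complementary view, the sketch below times each call individually on the host and summarizes the distribution, which makes outliers visible. It is a minimal illustration using only the standard library: `time_host_calls`, `summarize`, and the stand-in workload are hypothetical names that are not part of CUTLASS API. For asynchronous kernel launches, per-call timings like these would capture only the time spent on the host, not device execution.

```python
import statistics
import time


def time_host_calls(fn, iterations=100, warmup=10):
    """Call fn(i) repeatedly and return the host-side duration of each timed call."""
    # Untimed warmup calls
    for i in range(warmup):
        fn(i)

    durations = []
    for i in range(warmup, warmup + iterations):
        start = time.perf_counter()
        fn(i)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summarize per-call durations (in seconds)."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in workload (hypothetical); in this notebook, fn could wrap a call to kernel.run
samples = time_host_calls(lambda i: sum(range(100)))
print({name: f"{value:1.3e}" for name, value in summarize(samples).items()})
```

A median that sits well below the mean suggests that a few slow calls dominate the average.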

We now describe techniques for reducing host latency:
* Compile once, run many times
* Bypassing checks for argument-kernel compatibility
* Using [CUDA Graphs](https://developer.nvidia.com/blog/cuda-graphs/)
* Using [TVM FFI](https://tvm.apache.org/ffi/)

These techniques are complementary and should be combined wherever an
application allows.

### Compile once, run many times
The `kernel.run` method takes in an optional `compiled_artifact` argument of type
`cutlass_api.artifact.CompiledArtifact`. When this argument is set, the kernel
will directly use the precompiled function within `compiled_artifact`. When
it is not set, the call to `kernel.run` will JIT compile the kernel on each
invocation.

Precompiling the kernel is critical to achieving good performance.
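
An application that runs several problem configurations may want to compile once per configuration rather than once overall. The sketch below shows one way to do that with a small cache. It uses only the standard library; `ArtifactCache` and `fake_compile` are hypothetical names, and the choice of cache key is an assumption. The notebook does not state when a compiled artifact can be reused across different arguments, so keying by everything that could affect compilation (for example shapes and dtypes) is a conservative choice.

```python
class ArtifactCache:
    """Caches compiled artifacts so that compilation runs once per key."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._artifacts = {}

    def get(self, key, compile_arg):
        # Compile only on the first request for this key
        if key not in self._artifacts:
            self._artifacts[key] = self._compile_fn(compile_arg)
        return self._artifacts[key]


# Stand-in for an expensive compile step (hypothetical)
compile_calls = []


def fake_compile(shape):
    compile_calls.append(shape)
    return f"artifact-for-{shape}"


cache = ArtifactCache(fake_compile)
shape = (128, 128, 128)
for _ in range(5):
    artifact = cache.get(key=shape, compile_arg=shape)

print(f"compile calls: {len(compile_calls)}")
```

In this notebook, `compile_fn` could be `kernel.compile` and `compile_arg` a `GemmArguments` instance, with the returned artifact passed to `kernel.run` through `compiled_artifact`.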

In [None]:
stream = torch.cuda.current_stream()


def no_compiled_artifact(i: int):
    return kernel.run(args[i], stream=stream)


# Compile the kernel once, reuse it for each iteration
compiled_artifact = kernel.compile(args[0])


def with_compiled_artifact(i: int):
    return kernel.run(args[i], stream=stream, compiled_artifact=compiled_artifact)

In [None]:
time_no_artifact, _ = benchmark(
    "Without compiled artifact", no_compiled_artifact, warmup_it=2, profiling_it=5
)
time_w_artifact, _ = benchmark(
    "With compiled artifact", with_compiled_artifact, warmup_it=2, profiling_it=5
)

[Without compiled artifact     ] avg of 5 iterations: 1.376e+00 seconds
[With compiled artifact        ] avg of 5 iterations: 1.016e-05 seconds


### Bypassing checks for argument-kernel compatibility
By default, the call to `kernel.run` will check if the kernel supports the provided arguments.
Under the hood, this invokes `kernel.supports(args)`.

While these checks are helpful for catching incompatible arguments, they are performed
in Python, and thus can add to host overhead.

When confident that arguments will be compatible with a kernel, one should bypass
the `supports` check in `kernel.run` by setting the optional `assume_supported_args`
argument to `True`.
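
One way to gain confidence without paying for the check on every call is to validate the first call for a given problem signature and skip the check afterwards. The sketch below illustrates that pattern with stand-in functions. `ValidateOnceRunner`, `fake_supports`, and `fake_run` are hypothetical, and the sketch assumes that the result of the support check is truthy when the arguments are supported. The signature must capture everything the check depends on; otherwise an unsupported call could slip through.

```python
class ValidateOnceRunner:
    """Runs the support check once per signature, then skips it on later calls."""

    def __init__(self, supports_fn, run_fn):
        self._supports_fn = supports_fn
        self._run_fn = run_fn
        self._validated = set()

    def run(self, signature, args):
        if signature not in self._validated:
            if not self._supports_fn(args):
                raise ValueError(f"Arguments not supported for signature {signature}")
            self._validated.add(signature)
        # run_fn is expected to launch without repeating the check
        return self._run_fn(args)


# Stand-ins for the support check and the launch (hypothetical)
supports_calls = []


def fake_supports(args):
    supports_calls.append(args)
    return args["k"] % 8 == 0


def fake_run(args):
    return "launched"


runner = ValidateOnceRunner(fake_supports, fake_run)
signature = ("float16", 128, 128, 128)
results = [runner.run(signature, {"k": 128}) for _ in range(3)]
print(f"support checks performed: {len(supports_calls)}")
```

In this notebook, `supports_fn` could be `kernel.supports`, and `run_fn` could call `kernel.run` with `assume_supported_args=True`.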

In [None]:
def with_supports_check(i: int):
    return kernel.run(
        args[i],
        compiled_artifact=compiled_artifact,
        stream=stream,
        assume_supported_args=False,
    )


def without_supports_check(i: int):
    return kernel.run(
        args[i],
        compiled_artifact=compiled_artifact,
        stream=stream,
        assume_supported_args=True,
    )

In [None]:
time_w_supports, _ = benchmark("With supports check", with_supports_check)
time_wo_supports, _ = benchmark("Bypass supports check", without_supports_check)
print(f"Speedup with skip supports: {time_w_supports / time_wo_supports:.2f}x")

[With supports check           ] avg of 100 iterations: 1.463e-05 seconds
[Bypass supports check         ] avg of 100 iterations: 6.239e-06 seconds
Speedup with skip supports: 2.34x


### CUDA Graphs

[CUDA Graphs](https://developer.nvidia.com/blog/cuda-graphs/) allow a sequence of GPU operations to be defined as a dependency graph and then launched as a single unit, significantly reducing CPU launch overhead and enabling whole-graph optimizations.

CUTLASS API kernels can be captured in CUDA Graphs using the standard PyTorch graph APIs.

Kernel compilation must happen outside of graph capture. With a precompiled kernel in hand, we use the usual PyTorch idioms to capture several launches of the kernel on the graph's stream.

In [None]:
num_launches = 20

# Create a CUDA Graph that runs our compiled kernel `num_launches` times
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):

    # NOTE: Kernel compilation must happen outside of graph capture,
    # so do not call kernel.compile(...) here.

    # Capture `num_launches` launches of our compiled kernel on the current stream
    for i in range(num_launches):
        kernel.run(
            args[i],
            compiled_artifact=compiled_artifact,
            stream=torch.cuda.current_stream(),
            assume_supported_args=True,
        )

This records/captures all the kernel launches to the CUDA Stream associated with the graph `g`, without actually launching them.
Once captured, we can replay the graph.

Note that graph replay only replays the kernel launches placed on the graph's stream:
* During graph capture, we must be careful to capture to the correct stream (`torch.cuda.current_stream()` under the graph context).
* Any other preparatory work on the host and any arguments passed in from Python are cached during capture. Changing them requires re-capturing the graph.

In [10]:
# Replay captured graph and check first result
g.replay()

torch.testing.assert_close(outs[0], references[0])

Let's compare the timing:

In [None]:
def without_cuda_graph(x: int):
    for i in range(num_launches):
        kernel.run(
            args[i],
            compiled_artifact=compiled_artifact,
            stream=torch.cuda.current_stream(),
            assume_supported_args=True,
        )


def with_cuda_graph(x: int):
    g.replay()


time_wo_cuda_graph, _ = benchmark(
    f"{num_launches} launches without CUDA Graph",
    without_cuda_graph,
    warmup_it=0,
    profiling_it=1,
)
time_w_cuda_graph, _ = benchmark(
    f"{num_launches} launches with CUDA Graph",
    with_cuda_graph,
    warmup_it=0,
    profiling_it=1,
)

print(f"Speedup with CUDA Graph: {time_wo_cuda_graph / time_w_cuda_graph:.2f}x")

[20 launches without CUDA Graph] avg of 1 iterations: 4.699e-04 seconds
[20 launches with CUDA Graph   ] avg of 1 iterations: 9.084e-05 seconds
Speedup with CUDA Graph: 5.17x


### TVM FFI

[Apache TVM FFI](https://tvm.apache.org/ffi/) is an open ABI and FFI for machine learning systems.
When available, CUTLASS API uses Apache TVM-FFI under the hood as its interface for invoking compiled DSL kernels from Python.

TVM FFI is enabled by default in CUTLASS API, and is recommended for best performance.

`cutlass_api.config.GlobalOptions().use_tvm_ffi` controls whether or not TVM-FFI will be used by CUTLASS API.

In [12]:
print(cutlass_api.config.GlobalOptions().use_tvm_ffi)

True


If for some reason you do not wish to use TVM FFI, set this option to `False`. No other change is needed. The code below compares performance with and without TVM FFI.

In [None]:
original_use_tvm_ffi = cutlass_api.config.GlobalOptions().use_tvm_ffi

cutlass_api.config.GlobalOptions().use_tvm_ffi = True


def run_iteration(i):
    args = cutlass_api.arguments.GemmArguments(
        A=As[i], B=Bs[i], out=outs[i], accumulator_type=torch.float16
    )
    return kernel.run(
        args,
        compiled_artifact=compiled_artifact,
        stream=torch.cuda.current_stream(),
        assume_supported_args=True,
    )


def create_arguments(i: int):
    return cutlass_api.arguments.GemmArguments(
        A=As[i], B=Bs[i], out=outs[i], accumulator_type=torch.float16
    )


args_creation_on, args = benchmark("[TVM-FFI ON ] Create args", create_arguments)
compilation_on, compiled = benchmark(
    "[TVM-FFI ON ] Compile kernel",
    lambda i: kernel.compile(args[i]),
    warmup_it=2,
    profiling_it=5,
)
compiled_artifact = compiled[0]
run_on, _ = benchmark(
    "[TVM-FFI ON ] Run kernel",
    lambda i: kernel.run(
        args[i],
        compiled_artifact=compiled_artifact,
        assume_supported_args=True,
        stream=stream,
    ),
)

[[TVM-FFI ON ] Create args     ] avg of 100 iterations: 8.367e-05 seconds
[[TVM-FFI ON ] Compile kernel  ] avg of 5 iterations: 1.352e+00 seconds
[[TVM-FFI ON ] Run kernel      ] avg of 100 iterations: 6.509e-06 seconds


In [None]:
cutlass_api.config.GlobalOptions().use_tvm_ffi = False
args_creation_off, args = benchmark("[TVM-FFI OFF ] Create args", create_arguments)
compilation_off, compiled = benchmark(
    "[TVM-FFI OFF ] Compile kernel",
    lambda i: kernel.compile(args[i]),
    warmup_it=2,
    profiling_it=5,
)
compiled_artifact = compiled[0]
run_off, _ = benchmark(
    "[TVM-FFI OFF ] Run kernel",
    lambda i: kernel.run(
        args[i],
        compiled_artifact=compiled_artifact,
        assume_supported_args=True,
        stream=stream,
    ),
)

# Restore original setting
cutlass_api.config.GlobalOptions().use_tvm_ffi = original_use_tvm_ffi

[[TVM-FFI OFF ] Create args    ] avg of 100 iterations: 1.255e-04 seconds
[[TVM-FFI OFF ] Compile kernel ] avg of 5 iterations: 1.278e+00 seconds
[[TVM-FFI OFF ] Run kernel     ] avg of 100 iterations: 4.519e-05 seconds
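
The cells above save the original value of `use_tvm_ffi` and restore it by hand. If a benchmark raised an exception in between, the setting would stay modified. The sketch below wraps the same save-and-restore in a context manager so that the original value is restored even on error. `temporary_attr` is a hypothetical helper, not part of CUTLASS API, and a `SimpleNamespace` stands in for the options object. Using it with `cutlass_api.config.GlobalOptions()` assumes that each call returns the same shared options object, as the usage above suggests.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Temporarily set obj.<name> to value, restoring the original on exit."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and when the body raises
        setattr(obj, name, original)


# Stand-in for a global options object (hypothetical)
options = SimpleNamespace(use_tvm_ffi=True)

with temporary_attr(options, "use_tvm_ffi", False):
    inside = options.use_tvm_ffi

after = options.use_tvm_ffi
print(f"inside: {inside}, after: {after}")
```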


In [15]:
print("Speedups with TVM-FFI: ")
print(f"Arg creation: {args_creation_off / args_creation_on:.2f}x")
print(f"Compilation: {compilation_off / compilation_on:.2f}x")
print(f"Run: {run_off / run_on:.2f}x")

Speedups with TVM-FFI: 
Arg creation: 1.50x
Compilation: 0.95x
Run: 6.94x
