What is your question?
I am currently using AOT-compiled CuTeDSL kernels through TVM-FFI in C++.
Is there a way to ensure that the compiled CUfunctions are loaded before the first kernel invocation?
I call InvokeExternC for every launch, and it appears to call cuKernelGetFunction internally on each call.
By contrast, when working with precompiled CUBINs, we can explicitly load all required kernels up front by calling cuModuleGetFunction.
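To make the CUBIN-side pattern concrete, here is a rough sketch of the eager-resolution idea I have in mind. It is not taken from TVM-FFI or CuTeDSL: `KernelCache`, `Handle`, and `Resolver` are hypothetical names, and `Handle` stands in for `CUfunction` so that the sketch builds without the CUDA toolkit. In a real program the resolver would presumably wrap `cuModuleGetFunction(&fn, module, name)` on a module obtained from `cuModuleLoadData`.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: resolves every kernel name to a handle once, at
// construction time, and caches the result.
// Handle is a placeholder for CUfunction; Resolver is a placeholder for a
// thin wrapper around cuModuleGetFunction. Both are assumptions made so the
// sketch compiles without CUDA.
template <typename Handle>
class KernelCache {
 public:
  using Resolver = std::function<bool(const std::string&, Handle*)>;

  // Resolve all names up front; throw if any kernel cannot be found.
  KernelCache(const std::vector<std::string>& names, const Resolver& resolve) {
    for (const auto& name : names) {
      Handle handle{};
      if (!resolve(name, &handle)) {
        throw std::runtime_error("failed to resolve kernel: " + name);
      }
      table_.emplace(name, handle);
    }
  }

  // Launch-time lookup: a hash-map hit, with no call into the resolver.
  Handle Get(const std::string& name) const { return table_.at(name); }

  std::size_t size() const { return table_.size(); }

 private:
  std::unordered_map<std::string, Handle> table_;
};
```

With the driver API, I would construct something like `KernelCache<CUfunction>` right after loading the module, so that every later launch only does a map lookup and the first invocation pays no lazy-loading cost.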
Is there an equivalent mechanism for kernels compiled via CuTeDSL and accessed through TVM-FFI?