# Integrating needle with TVM

This is a demo for the final project of our Deep Learning Systems (10714) course. Our project extends our needle framework with the ability to generate TVMScript, which opens the door to many powerful machine learning model optimizations.

## Introduction to TVM

TVM is an open-source machine learning compiler stack designed to optimize and deploy deep learning models across various hardware platforms. It provides a unified framework for transforming high-level model descriptions into optimized tensor programs that can run efficiently on CPUs, GPUs, and specialized accelerators like TPUs. By integrating TVM into Needle, we unlock several powerful optimization techniques that dramatically improve model performance, including Graph-Level Optimizations, Tensor Program-Level Optimizations, and Cross-Hardware Compatibility. In the following sections, we will delve deeper into each step of the integration process.

We will conduct performance experiments on the following models to demonstrate the power of TVM optimizations:

-   MLP: A simple feed-forward model that serves as a baseline for optimization comparison.
-   ResNet9: A convolutional neural network model often used for image classification tasks.
-   Transformer: A model designed for sequence-to-sequence tasks, such as machine translation or text summarization.


## Approach Overview:

### Graph Transpiler: from `ndl.Tensor` to `relax.IRModule`

Our translation logic can be found in `./dlsys/python/needle_tvm/to_tvm.py`.

Our main task is to build a graph transpiler that converts needle's computation graph to TVM's `IRModule`. We took advantage of `ndl.Tensor`'s inherent graphical structure to design our translation algorithm. Once a needle model's forward pass is run, we will be able to traverse the tensor graph through a simple topological sort, starting from the output Tensor.


In [None]:
def topological_sort(output):
  visited = set()
  topo_order = []

  def dfs(node):
      if node in visited:
          return
      visited.add(node)
      for inp in node.inputs:
          dfs(inp)
      topo_order.append(node)

  dfs(output)
  return topo_order  # post-order DFS: each node appears after all of its inputs
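
To see the ordering property in isolation, here is a minimal sketch that runs the same post-order traversal on a toy graph. The `Node` class is a hypothetical stand-in for `ndl.Tensor` (only a name and an `inputs` list); it is not part of needle.

```python
class Node:
    """Hypothetical stand-in for ndl.Tensor: a name plus the nodes it was computed from."""
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)


def topological_sort(output):
    visited = set()
    topo_order = []

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        for inp in node.inputs:
            dfs(inp)
        topo_order.append(node)  # appended only after all of its inputs

    dfs(output)
    return topo_order


# Toy graph for z = relu((x @ w) + b)
x, w, b = Node("x"), Node("w"), Node("b")
xw = Node("matmul", [x, w])
y = Node("add", [xw, b])
z = Node("relu", [y])

order = [n.name for n in topological_sort(z)]
print(order)  # ['x', 'w', 'matmul', 'b', 'add', 'relu']
```

Because every node is appended only after its inputs, the list can be consumed front to back without ever referencing a value that has not been translated yet.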

While we traverse the tensor graph, we incrementally build the final `tvm.IRModule` through the `tvm.relax.block_builder` API.


In [None]:
def to_tvm_tensor(mod: ndl.nn.Module, *args, **kwargs) -> tvm.relax.IRModule:
  # set user input Tensor placeholder=True
  for t in args:
    if isinstance(t, Tensor):
      t.placeholder = True

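  # run the forward pass once to build the ndl.Tensor computation graph
  output_tensor = mod(*args, **kwargs)

  # topologically sort ndl.Tensor computation graph
  topo_order = topological_sort(output_tensor)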

  # initialize block builder, module inputs/outputs, value to relax.Var map
  bb = block_builder.BlockBuilder()
  fn_inputs = []
  fn_output = None
  value_to_var : Dict[ndl.Tensor, relax.Var] = {}


  # Create the "main" function in emitted IRModule
  with bb.function("main"):
      with bb.dataflow():
          for i, node in enumerate(topo_order):
              # Leaf nodes (inputs or constants)
              if node.is_leaf():
                  if node.placeholder:
                      tvm_var = (relax.Var("X", relax.TensorStructInfo(node.shape, "float32")))
                      value_to_var.setdefault(node, tvm_var)
                      fn_inputs.append(tvm_var)
                      continue
                  else:
                      tvm_var = relax.const(node.numpy(), "float32")  # second argument is a dtype string
                      value_to_var.setdefault(node, tvm_var)
                      continue

              # Map the operation to TVM
              tvm_var = node.op.emit_te(bb, value_to_var, node)
              value_to_var[node] = tvm_var

          fn_output = bb.emit_output(value_to_var[topo_order[-1]])

      bb.emit_func_output(fn_output, fn_inputs)
  return bb.get()

We use a dictionary that maps `ndl.Tensor` to `relax.Var` to look up intermediate translation results. A `relax.Var` represents the result of a computation in the `IRModule`. It can be one of three things:

1. A placeholder, i.e. model inputs
2. A constant, e.g. model parameters
3. Result of a tensor operator

Not coincidentally, a needle `Tensor` also falls into one of these three categories. The input `Tensor`s to the user's model become placeholders (parameters) of the `main` Relax function; model weights and biases become constants (`relax.const`); and since every non-leaf `Tensor` is coupled with a `TensorOp` (its `Tensor.op`), we can translate the operator into a TensorIR function and insert a call to that function in the `main` function of the final `IRModule`.
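
The three-way split can be sketched without TVM. The example below is only an illustration under stated assumptions: `Node`, `classify`, and `translate` are hypothetical names, and the emitted "program" is a list of strings rather than a real `IRModule`. It mirrors the shape of the translation loop: placeholders become function parameters, constants are inlined, and operator nodes emit one call each.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so nodes can be dict keys
class Node:
    name: str
    op: Optional[str] = None  # None for leaf nodes
    inputs: List["Node"] = field(default_factory=list)
    placeholder: bool = False

    def is_leaf(self) -> bool:
        return self.op is None


def classify(node: Node) -> str:
    """Return which of the three categories a node belongs to."""
    if node.is_leaf():
        return "placeholder" if node.placeholder else "constant"
    return "operator"


def translate(topo_order: List[Node]):
    """Walk nodes in topological order and emit a toy program."""
    node_to_var: Dict[Node, str] = {}
    params: List[str] = []
    program: List[str] = []
    for node in topo_order:
        kind = classify(node)
        if kind == "placeholder":
            node_to_var[node] = node.name
            params.append(node.name)
        elif kind == "constant":
            node_to_var[node] = f"const({node.name})"
        else:
            args = ", ".join(node_to_var[i] for i in node.inputs)
            var = f"lv{len(program)}"
            program.append(f"{var} = {node.op}({args})")
            node_to_var[node] = var
    return params, program


x = Node("x", placeholder=True)
w, b = Node("w"), Node("b")
xw = Node("xw", op="matmul", inputs=[x, w])
y = Node("y", op="add", inputs=[xw, b])

params, program = translate([x, w, xw, b, y])
print(params)   # ['x']
print(program)  # ['lv0 = matmul(x, const(w))', 'lv1 = add(lv0, const(b))']
```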

For the first kind, we add a boolean attribute `placeholder` to the `Value` class (the parent of `Tensor`) to indicate whether a `Tensor` is an input to the user's model. Once detected, our translation algorithm generates a corresponding `relax.Var`. This also explains why our `to_tvm_tensor` function begins with the following:


In [None]:
# set user input Tensor placeholder=True
for t in args:
  if isinstance(t, Tensor):
    t.placeholder = True

For the last kind of `Tensor`, we generate the corresponding `TensorIR` function in the `IRModule` using `tvm.topi` operators. In `./dlsys/python/needle/ops/ops_mathematics.py`, we extend every `TensorOp` with an `emit_te` function, e.g. `EWiseAdd.emit_te`:


In [None]:
class EWiseAdd(TensorOp):
    def emit_te(self, bb: relax.BlockBuilder, node_map: Dict[Tensor, relax.Var], node: Tensor) -> relax.Var:
        A = node_map[node.inputs[0]]
        B = node_map[node.inputs[1]]

        def te_ewise_add(A, B):
            return topi.add(A, B)

        return bb.emit_te(te_ewise_add, A, B)

`relax.BlockBuilder.emit_te` will generate the following TIR function in the final `IRModule`:


In [None]:
@T.prim_func(private=True)
def te_ewise_add(lv: T.Buffer((T.int64(32), T.int64(512)), "float32"), lv2: T.Buffer((T.int64(32), T.int64(512)), "float32"), T_add: T.Buffer((T.int64(32), T.int64(512)), "float32")):
    T.func_attr({"tir.noalias": T.bool(True)})
    # with T.block("root"):
    for ax0, ax1 in T.grid(T.int64(32), T.int64(512)):
        with T.block("T_add"):
            v_ax0, v_ax1 = T.axis.remap("SS", [ax0, ax1])
            T.reads(lv[v_ax0, v_ax1], lv2[v_ax0, v_ax1])
            T.writes(T_add[v_ax0, v_ax1])
            T_add[v_ax0, v_ax1] = lv[v_ax0, v_ax1] + lv2[v_ax0, v_ax1]
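
For readers less familiar with TVMScript, the loop nest above can be read as the plain-Python sketch below. It is an illustration only (nested lists instead of buffers, small shapes instead of 32x512), but it keeps the same destination-passing convention: the output buffer `T_add` is passed in as the last argument and written in place.

```python
def te_ewise_add(lv, lv2, T_add):
    """Plain-Python reading of the generated loop nest: T_add = lv + lv2, element by element."""
    rows, cols = len(lv), len(lv[0])
    for ax0 in range(rows):
        for ax1 in range(cols):
            T_add[ax0][ax1] = lv[ax0][ax1] + lv2[ax0][ax1]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[10.0, 20.0], [30.0, 40.0]]
out = [[0.0, 0.0], [0.0, 0.0]]  # caller allocates the output buffer
te_ewise_add(a, b, out)
print(out)  # [[11.0, 22.0], [33.0, 44.0]]
```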

### Build and Save `IRModule` as Executable

Our model compilation and evaluation logic can be found in the `ModelEval` class in `./dlsys/apps/models/model_eval.py`.

The following code builds and runs the `IRModule` transpiled from our `ndl.nn.Module` using TVM `Relax` frontend:


**_Note_**:
For an `nn.Module` that behaves differently during inference and training (e.g. `BatchNorm1d`), it is essential to run `model.eval()` before calling `to_tvm_tensor`, so that the transpiled TVM module is derived from the model's inference path.


In [None]:
# ensure model is inference mode
model.eval()

ir_module = to_tvm_tensor(model, ndl.Tensor(x, device=self.ndl_device))
module_ex = relax.build(ir_module, target="llvm")
module_vm = relax.VirtualMachine(module_ex, self.tvm_device)
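
The reason `model.eval()` matters is that the transpiler records whichever branch the forward pass actually executed. The sketch below uses a hypothetical `ToyBatchNorm` (not needle's `BatchNorm1d`, and with the running-statistics update omitted for brevity) to show how the two modes compute different results from the same input.

```python
class ToyBatchNorm:
    """Hypothetical 1-D batch norm with fixed running statistics, for illustration only."""

    def __init__(self, running_mean=0.0, running_var=1.0, eps=1e-5):
        self.training = True
        self.running_mean = running_mean
        self.running_var = running_var
        self.eps = eps

    def eval(self):
        self.training = False

    def __call__(self, xs):
        if self.training:
            # training path: normalize with statistics of the current batch
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
        else:
            # inference path: normalize with stored running statistics
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in xs]


bn = ToyBatchNorm()
batch = [1.0, 2.0, 3.0]
train_out = bn(batch)  # centered on the batch mean
bn.eval()
eval_out = bn(batch)   # centered on the running mean

print(train_out != eval_out)  # True
```

A graph traced in training mode would bake the batch-statistics computation into the compiled module, which is not what inference should run.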

The following code saves the executable as a shared library (`.so`) so that it can be reloaded later. For our project, we save model executables in `./dlsys/apps/models/module_lib/`.


In [None]:
module_ex.export_library(module_save_path)
...
module_ex = tvm.runtime.load_module(module_save_path)

Finally, we run the compiled TVM module. We check correctness by running the needle model side by side and comparing the values of the final activation layer:


In [None]:
input_ndl = ndl.Tensor(X, device=self.ndl_device, requires_grad=False, placeholder=True)
input_tvm = tvm.nd.array(X)

ndl_out = self.model(input_ndl)
tvm_out = self.module_vm["main"](input_tvm)

try:
  assert np.allclose(tvm_out.asnumpy(),ndl_out.numpy(), atol=1e-4) # tweak tolerance if fails
except AssertionError:
  # Report the norm of the element-wise difference between the two outputs
  diff_norm = np.linalg.norm(tvm_out.asnumpy() - ndl_out.numpy())
  print(f"TVM-NDL diff norm: {diff_norm}")
  raise ValueError
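
When tweaking the tolerance, it helps to recall what `np.allclose(a, b)` checks: every element must satisfy `|a - b| <= atol + rtol * |b|`, with `rtol` defaulting to `1e-5`. The pure-Python sketch below restates that criterion for flat lists; the function name `allclose_1d` is ours, for illustration.

```python
def allclose_1d(actual, expected, rtol=1e-5, atol=1e-8):
    """Element-wise |a - e| <= atol + rtol * |e|, the criterion np.allclose applies."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


print(allclose_1d([1.00005], [1.0], atol=1e-4))  # True: 5e-5 is within 1e-4 + 1e-5
print(allclose_1d([1.001], [1.0], atol=1e-4))    # False: 1e-3 exceeds the bound
print(allclose_1d([1.00005], [1.0]))             # False under the default atol of 1e-8
```

Because `atol` is an absolute bound, a value such as `1e-4` is lenient for outputs near zero and strict for large activations; `rtol` is the knob to adjust when outputs have large magnitude.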

### Tensor Program and Computational Graph Optimization

TVM provides ways to optimize an `IRModule` at two levels:

-   Tensor program level: loop parallelization, tiling, vectorization, etc.
-   Computational graph level: operator fusion, layout transformation, memory management.

We perform operator fusion on the transpiled `IRModule`. Below is our optimization pipeline:


In [None]:
# ir_module derived from to_tvm_tensor
ir_module = tvm.ir.transform.Sequential([
  tvm.relax.transform.LegalizeOps(),
  tvm.relax.transform.AnnotateTIROpPattern(),
  tvm.relax.transform.FuseOps(),
  tvm.relax.transform.FuseTIR(),
])(ir_module)
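
As a rough intuition for what operator fusion buys, consider an element-wise add followed by a ReLU. The sketch below is plain Python rather than TVM output: the unfused version materializes an intermediate buffer between two passes, while the fused version computes the same result in a single pass.

```python
def unfused(x, bias):
    tmp = [xi + bias for xi in x]      # "kernel" 1: writes an intermediate buffer
    return [max(t, 0.0) for t in tmp]  # "kernel" 2: reads the buffer back


def fused(x, bias):
    # one pass over the data, no intermediate buffer
    return [max(xi + bias, 0.0) for xi in x]


data = [-2.0, -0.5, 0.0, 1.5]
print(unfused(data, 1.0))  # [0.0, 0.5, 1.0, 2.5]
print(fused(data, 1.0))    # [0.0, 0.5, 1.0, 2.5]
```

On real hardware the saving comes from avoiding the extra memory traffic and kernel launch for the intermediate result.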

As for tensor program optimizations, we utilize TVM's `meta_schedule` feature to automatically discover optimizations within each TIR function. It enables workload tuning through either custom-defined search spaces or the system's built-in, automatically generated search spaces. In this project, we utilize the autotuning capabilities of meta_schedule to explore and maximize potential performance gains.


In [None]:
# detect number of cores for loop parallelization
target = "llvm" + f" -num-cores={os.cpu_count()}"

# Iterate over all functions in the IRModule
funcs = 0
for func_name in ir_module.get_global_vars():
    funcs += 1
    if max_funcs is not None and funcs > max_funcs: break
    try:
        func_name_str = func_name.name_hint
        print(f"tuning: {func_name_str}")
        # Create a tuning database for each function
        mod_func = tvm.IRModule.from_expr(ir_module[func_name].with_attr("global_symbol", func_name_str))

        # Tune the TIR function
        database = meta_schedule.tune_tir(
            mod=mod_func,                 # Input module
            target=target,                # Target platform (e.g., "llvm", "cuda")
            max_trials_global=5,          # Total tuning trials
            num_trials_per_iter=5,        # Trials per tuning iteration
            work_dir=f"{work_dir}/{func_name_str}",  # Separate logs for each function
        )

        # Compile the tuned TIR function into a new IRModule
        sch = meta_schedule.tir_integration.compile_tir(
            database=database,           # The tuning database
            mod=mod_func,                # Input module to compile
            target=target                # Target platform
        )

        # Update the module with the tuned function
        updated_mod = sch.mod["main"].with_attr("global_symbol", func_name_str)
        gv = ir_module.get_global_var(func_name_str)
        ir_module.update_func(gv, updated_mod)

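    except Exception as err:
        # functions that cannot be tuned as TIR (e.g. the Relax "main") are left as-is
        print(f"skipping {func_name.name_hint}: {err}")
        continue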
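
Conceptually, autotuning is a measure-and-pick-the-best loop over candidate schedules. The sketch below illustrates only that idea in plain Python; it is not how `meta_schedule` is implemented, whose search is far more sophisticated. The workload (`chunked_copy`), the candidate list, and all function names are assumptions made for this example.

```python
import time

N = 1 << 14
DATA = list(range(N))


def chunked_copy(src, chunk):
    """Copy src in blocks of `chunk` elements (a stand-in for a tiled loop nest)."""
    dst = [0] * len(src)
    for start in range(0, len(src), chunk):
        dst[start:start + chunk] = src[start:start + chunk]
    return dst


def measure(chunk, repeats=3):
    """Best-of-`repeats` wall-clock time for one candidate."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        chunked_copy(DATA, chunk)
        best = min(best, time.perf_counter() - t0)
    return best


def tune(candidates):
    """Measure every candidate and return the fastest one with all timings."""
    timings = {c: measure(c) for c in candidates}
    return min(timings, key=timings.get), timings


best_chunk, timings = tune([1, 8, 64, 512])
print(f"best chunk size: {best_chunk}")
```

Every candidate produces the same result; only the running time differs, which is exactly the property a schedule search relies on.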

Although compute-intensive, `meta_schedule` excels at uncovering a wide range of optimizations beyond just tiling. What impressed us most was its remarkable generalizability in identifying optimizations for loop nests of varying shapes and sizes. Unsurprisingly, the optimized IRModule significantly outperforms our needle model, which uses register-tiling exclusively for matrix multiplication kernels.

Below is an example of the effect of `meta_schedule` on the TIR function for `reshape`. We can see that `meta_schedule` discovered loop parallelization, vectorization, and tiling within the loop nest.


In [None]:
@T.prim_func
def te_reshape(A: T.Buffer((T.int64(1), T.int64(2048)), "float32"), T_reshape: T.Buffer((T.int64(1), T.int64(2048)), "float32")):
    T.func_attr({"op_pattern": 2, "tir.noalias": T.bool(True)})
    # with T.block("root"):
    for ax0_ax1_fused_0 in T.parallel(T.int64(32)):
        for ax0_ax1_fused_1 in T.vectorized(T.int64(64)):
            with T.block("T_reshape"):
                v_ax0 = T.axis.spatial(T.int64(1), T.int64(0))
                v_ax1 = T.axis.spatial(T.int64(2048), ax0_ax1_fused_0 * T.int64(64) + ax0_ax1_fused_1)
                T.reads(A[T.int64(0), v_ax1 % T.int64(2048)])
                T.writes(T_reshape[v_ax0, v_ax1])
                T_reshape[v_ax0, v_ax1] = A[T.int64(0), v_ax1 % T.int64(2048)]
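
The schedule above splits the 2048-element loop into 32 outer iterations (parallel) of 64 inner iterations (vectorized). A plain-Python sketch of that index arithmetic, with sequential loops standing in for the parallel and vectorized ones, shows that the split still visits every element exactly once:

```python
N, INNER = 2048, 64
OUTER = N // INNER  # 32


def reshape_split(A):
    """Sequential reading of the tuned loop nest: outer x inner covers all N elements."""
    out = [0.0] * N
    for outer in range(OUTER):        # T.parallel(32) in the tuned schedule
        for inner in range(INNER):    # T.vectorized(64) in the tuned schedule
            v_ax1 = outer * INNER + inner
            out[v_ax1] = A[v_ax1 % N]
    return out


A = [float(i) for i in range(N)]
visited = [outer * INNER + inner for outer in range(OUTER) for inner in range(INNER)]

print(reshape_split(A) == A)           # True
print(visited == list(range(N)))       # True
```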

# Code Demo: CPU Device


In [13]:
# from google.colab import drive
# drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p 10714
%cd 10714
!rm -rf 10714-project
!git clone https://github.com/Theorem411/10714-project
%cd /content/drive/MyDrive/10714/10714-project/

!pip3 install pybind11
# Install tvm
!python -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly-cu122
!python -c "from tvm import relax"

/content/drive/MyDrive
/content/drive/MyDrive/10714
Cloning into '10714-project'...
remote: Enumerating objects: 1810, done.
remote: Counting objects: 100% (324/324), done.
remote: Compressing objects: 100% (202/202), done.
remote: Total 1810 (delta 201), reused 225 (delta 120), pack-reused 1486 (from 1)
Receiving objects: 100% (1810/1810), 3.82 MiB | 13.39 MiB/s, done.
Resolving deltas: 100% (1137/1137), done.
/content/drive/MyDrive/10714/10714-project
Looking in links: https://mlc.ai/wheels


In [14]:
%set_env PYTHONPATH ./dlsys/python
%set_env NEEDLE_BACKEND nd

%cd /content/drive/MyDrive/10714/10714-project/dlsys
!make clean && make

env: PYTHONPATH=./dlsys/python
env: NEEDLE_BACKEND=nd
env: APP_DIR=/content/drive/MyDrive/10714/10714-project/dlsys/apps/
/content/drive/MyDrive/10714/10714-project/dlsys
rm -rf build python/needle/backend_ndarray/ndarray_backend*.so
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python: /usr/local/bin/python (found version "3.10.12") found components: Development Interpreter Development.Module Development.Embed
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Found pybind11: /usr/local/lib/python3.10/dist-p

### MLP Performance

**_Note:_** Meta scheduling for the `Transformer` and `ResNet9` models might take roughly **_10-15 minutes_** to finish on the first run. However, since we reload the compiled module executable, subsequent runs are significantly faster because they bypass the meta scheduler.

If you want to recompile the model, add `-r` flag when running `tvm_eval.py`.

In [18]:
%cd /content/drive/MyDrive/10714/10714-project/dlsys/apps/
!python tvm_eval.py -m='mlp' -d='cpu'

/content/drive/MyDrive/10714/10714-project/dlsys/apps
Using needle backend
===== original module=====
# from tvm.script import ir as I
# from tvm.script import tir as T
# from tvm.script import relax as R

@I.ir_module
class Module:
    @T.prim_func(private=True)
    def te_broadcast_to(lv1: T.Buffer((T.int64(1), T.int64(512)), "float32"), T_broadcast_to: T.Buffer((T.int64(8), T.int64(512)), "float32")):
        T.func_attr({"tir.noalias": T.bool(True)})
        # with T.block("root"):

### Transformer Performance


In [19]:
%cd /content/drive/MyDrive/10714/10714-project/dlsys/apps/
!python tvm_eval.py -m='transformer' -d='cpu'

/content/drive/MyDrive/10714/10714-project/dlsys/apps
Using needle backend
===== original module=====
# from tvm.script import ir as I
# from tvm.script import tir as T
# from tvm.script import relax as R

@I.ir_module
class Module:
    @T.prim_func(private=True)
    def te_add_scalar(lv20: T.Buffer((T.int64(80), T.int64(1)), "float32"), T_add: T.Buffer((T.int64(80), T.int64(1)), "float32")):
        T.func_attr({"tir.noalias": T.bool(True)})
        # with T.block("root"):
        for

### ResNet9 Performance


In [20]:
%cd /content/drive/MyDrive/10714/10714-project/dlsys/apps/
!python tvm_eval.py -m='conv' -d='cpu'

/content/drive/MyDrive/10714/10714-project/dlsys/apps
Using needle backend
===== original module=====
# from tvm.script import ir as I
# from tvm.script import tir as T
# from tvm.script import relax as R

@I.ir_module
class Module:
    @T.prim_func(private=True)
    def te_broadcast_to(lv: T.Buffer((T.int64(1), T.int64(128)), "float32"), T_broadcast_to: T.Buffer((T.int64(8), T.int64(128)), "float32")):
        T.func_attr({"tir.noalias": T.bool(True)})
        # with T.block("root"):

# Code Demo: GPU Device

To check whether TVM has `USE_CUDA` turned on, run the following command and search for `USE_CUDA` in the output.

In [21]:
!python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"

USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
TVM_DEBUG_WITH_ABI_CHANGE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU: 
CUDA_VERSION: 12.2
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_OPENCL_EXTN_QCOM: NOT-FOUND
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_THRUST: ON
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM: 
USE_OPENCL_GTEST: /path/to/opencl/gtest
TVM_LOG_BEFORE_THROW: OFF
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_MSCCL: OFF
USE_NNAPI_RUNTIME: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: llvm-config --ignore-libllvm --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: ON
USE_ROCBLAS: OFF
GI

### MLP Performance

**_Note:_** Meta scheduling for the `Transformer` and `ResNet9` models might take roughly **_10-15 minutes_** to finish on the first run. However, since we reload the compiled module executable, subsequent runs are significantly faster because they bypass the meta scheduler.

If you want to recompile the model, add `-r` flag when running `tvm_eval.py`.


In [None]:
%cd /content/drive/MyDrive/10714/10714-project/dlsys/apps/
!python tvm_eval.py -m='mlp' -d='cuda'

/content/drive/MyDrive/10714/10714-project/dlsys/apps
Using needle backend


### Transformer Performance

In [None]:
%cd /content/drive/MyDrive/10714/10714-project/dlsys/apps/
!python tvm_eval.py -m='transformer' -d='cuda'

### ResNet9 Performance

In [None]:
%cd /content/drive/MyDrive/10714/10714-project/dlsys/apps/
!python tvm_eval.py -m='conv' -d='cuda'

# Result Analysis
#### CPU

The results showcase the performance improvements achieved on an Intel(R) Xeon(R) CPU (Google Colab environment) through successive levels of TVM optimization on three models: MLP, Transformer, and Conv ResNet9. The table reports runtime normalized to the Needle baseline (Needle = 1), so lower is better. The key takeaways from the analysis are as follows:

| Model        | Needle (Baseline) | TVM (no opt) | TVM (Fusion) | TVM (Fusion + Autotune) |
| ------------ | ----------------- | ------------ | ------------ | ----------------------- |
| MLP          | 1                 | 1.044256574  | 0.9039618473 | 0.08090549938           |
| Transformer  | 1                 | 1.139139027  | 1.124182424  | 0.1171342219            |
| Conv ResNet9 | 1                 | 0.3264763352 | 0.3361647851 | 0.1040680634            |

<div style="text-align: center;">
  <img src="./images/cpu_benchmark.png" alt="CPU Benchmark Results" width="70%">
</div>

## Explanation

**TVM without any optimizations** shows no execution-time gains for the MLP and Transformer models: MLP is slightly slower (1.04x baseline runtime) and Transformer somewhat slower (1.14x baseline runtime). However, Conv ResNet9 benefits considerably, achieving a substantial improvement (0.33x baseline runtime). This discrepancy suggests that while TVM's default handling of tensor operations in Relax may implicitly optimize convolution-heavy models like Conv ResNet9, it does not provide similar benefits for fully connected or attention-based models, which rely on different operation patterns.

**Operator fusion in TVM** yields modest and mixed results across models. For MLP, runtime drops to 0.90x baseline, a slight improvement over both the baseline and TVM without optimizations, but the gain remains limited and far from optimal. For the Transformer, the result (1.12x baseline) is marginally better than TVM with no optimizations (1.14x) but still slower than the baseline. Conv ResNet9 performs about the same as TVM without optimization (0.34x versus 0.33x). These results suggest that while operator fusion has potential, its current implementation has limitations, and we aim to revisit and refine the operator fusion design as part of future work.

**Combining fusion with autotuning in TVM** results in substantial performance improvements for all models. MLP runs at just 0.08x Needle runtime, which translates to over 12x faster execution and demonstrates the effectiveness of fine-grained operator tuning. The Transformer reaches 0.12x Needle runtime, approximately 8.5x faster, highlighting the ability of autotuning to optimize complex operations like matrix multiplications and attention mechanisms. Conv ResNet9 reaches 0.10x Needle runtime, roughly a 9.6x improvement, showcasing the impact of autotuning on convolution-heavy workloads. This combination unlocks the full potential of TVM for diverse workloads.

Here are the highlights of the quantitative result:

1. MLP benefits the most from TVM's optimizations, particularly autotuning, with a speedup of over 12x compared to the baseline
2. Transformer, while more complex, sees notable improvements, especially with autotuning, achieving a speedup of 8.5x
3. Conv ResNet9 demonstrates the importance of autotuning for convolution-heavy models, achieving a speedup of roughly 9.6x over the baseline
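
The speedup figures follow directly from the normalized runtimes in the CPU table: speedup is the reciprocal of the normalized runtime. The short check below uses the values from the "TVM (Fusion + Autotune)" column; the variable names are ours.

```python
# Normalized runtimes (Needle = 1) from the "TVM (Fusion + Autotune)" column
normalized_runtime = {
    "MLP": 0.08090549938,
    "Transformer": 0.1171342219,
    "Conv ResNet9": 0.1040680634,
}

# speedup over Needle = baseline runtime / TVM runtime = 1 / normalized runtime
speedup = {model: 1.0 / r for model, r in normalized_runtime.items()}

for model, s in speedup.items():
    print(f"{model}: {s:.1f}x")
# MLP: 12.4x
# Transformer: 8.5x
# Conv ResNet9: 9.6x
```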

#### GPU

The results showcase the performance improvements achieved on an NVIDIA T4 GPU through TVM GPU optimizations on three models: MLP, Transformer, and Conv ResNet9. Runtimes are scaled so that Needle = 1; lower is better. The key takeaways from the analysis are as follows:

| Model          | Needle Runtime (scaled) | TVM Runtime (scaled) |
|----------------|--------------------------|-----------------------|
| MLP            | 1                        | 0.1361687547         |
| Transformer    | 1                        | 0.04388941894        |
| Conv ResNet9   | 1                        | 0.0576553462         |

<div style="text-align: center;">
  <img src="./images/gpu_benchmark.png" alt="GPU Benchmark Results" width="70%">
</div>

## Explanation
The optimization pipeline applies GPU-specific schedules targeting common computational patterns, such as matrix multiplication (Matmul), general matrix-vector multiplication (GEMV), and reduction operations. The scaled runtimes indicate that the pipeline is effective: TVM runs roughly 7.3x faster than Needle on MLP, 22.8x faster on Transformer, and 17.3x faster on Conv ResNet9. The varying relative execution times across models suggest that operations with higher computational intensity, such as those in Transformers and Conv ResNet9, benefit more from the pipeline's optimizations.


# Conclusion

This project demonstrates the seamless integration of the Needle framework with TVM, unlocking advanced optimization capabilities to significantly enhance model performance. By translating Needle's computational graph into TVM's IRModule, we leverage graph-level and tensor program-level optimizations to achieve efficient execution across diverse hardware platforms.

Our evaluation demonstrates the performance impact of integrating TVM with the Needle framework across three models: MLP, Transformer, and a convolutional ResNet9. On CPU, without any optimizations, TVM is slightly slower than the baseline for MLP and Transformer, although it is already about 3x faster for ResNet9. Operator fusion alone changes these numbers only marginally. The true potential of TVM is unlocked through autotuning, which delivers roughly an 8.5–12x speedup over the baseline. These results highlight the power of TVM's advanced optimization capabilities and the value of integrating it into lightweight deep learning frameworks like Needle.


