# Define a model in Hidet
In addition to running PyTorch and ONNX models through their respective frontends, Hidet lets you define and run your own model directly. This tutorial walks through a simple example of defining, running, and optimizing a small model with Hidet.

In [1]:
import hidet

# set the cache dir 
hidet.option.cache_dir('./hidet-cache')
# clear the operator cache 
hidet.utils.hidet_clear_op_cache()

Clearing operator cache: /home/yaoyao/repos/asplos23-tutorial/hidet_tutorials_ipynb/hidet-cache/ops


### Define the model
Defining a Hidet model is similar to defining a PyTorch model. Instead of the PyTorch operators, we will use Hidet operators.

In [2]:
def conv_relu(x, out_channels, kernel, stride, padding):
    w = hidet.randn(shape=[out_channels, x.shape[1], kernel, kernel]).cuda()
    b = hidet.randn(shape=[1, out_channels, 1, 1]).cuda()
    x = hidet.ops.conv_pad(x, pads=padding)
    x = hidet.ops.conv2d(x, w, stride=stride) + b
    x = hidet.ops.relu(x)
    return x

We define a simple model consisting of a convolution layer (with a bias) followed by a ReLU layer. Inside the model, we first randomly generate the weight `w` and bias `b` as constants. We call `.cuda()` to move them to the GPU because we will later run the model on the GPU. We then compute the result by padding the input `x` and applying the `conv2d` and `relu` operators. For a list of other operators, please refer to the [Hidet Python API](https://docs.hidet.org/stable/python_api/ops/index.html).

### Imperatively run the model

The model can be directly run by feeding in an input tensor and the parameters.

In [3]:
img = hidet.randn([1, 3, 224, 224]).cuda()
out = conv_relu(img, 64, 7, 2, 3)

Compiling cuda task pad(data=float32[1, 3, 224, 224])...
Compiling cuda task conv2d(x=float32[1, 3, 230, 230], w=float32[64, 3, 7, 7])...
Compiling cuda task add(x=float32[1, 64, 112, 112], y=float32[1, 64, 1, 1])...
Compiling cuda task relu(x=float32[1, 64, 112, 112])...


We randomly generate an input `img`. The output tensor of the model is `out`.
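
The shapes in the compilation log (`230` after padding, `112` after the convolution) follow from the standard convolution output-size formula. The sketch below reproduces them in plain Python; the helper name `conv2d_output_size` is hypothetical and is not part of Hidet, and it assumes symmetric padding and a dilation of 1.

```python
def conv2d_output_size(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    padded = size + 2 * padding              # symmetric zero padding on both sides
    return (padded - kernel) // stride + 1   # number of valid window positions

# Values from the tutorial call conv_relu(img, 64, 7, 2, 3) on a 224x224 image.
padded_size = 224 + 2 * 3
output_size = conv2d_output_size(224, kernel=7, stride=2, padding=3)
print(padded_size, output_size)  # 230 112
```

These match the `float32[1, 3, 230, 230]` input of `conv2d` and the `float32[1, 64, 112, 112]` input of `relu` reported above.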

### Trace the model and run
A more efficient way to run the model is to first trace the execution and get the static computation graph of the deep learning model.

In [4]:
img_sym = hidet.symbol([1, 3, 224, 224]).cuda()
out_sym = conv_relu(img_sym, 64, 7, 2, 3)
graph = hidet.trace_from(out_sym, inputs=[img_sym])
print(graph)

Graph(x: float32[1, 3, 224, 224]){
  c = Constant(float32[64, 3, 7, 7])
  c_1 = Constant(float32[1, 64, 1, 1])
  x_1: float32[1, 3, 230, 230] = Pad(x, pads=[0, 0, 3, 3, 0, 0, 3, 3], mode="constant", value=0.0)  
  x_2: float32[1, 64, 112, 112] = Conv2d(x_1, c, stride=[2, 2], groups=1, dilations=(1, 1))  
  x_3: float32[1, 64, 112, 112] = Add(x_2, c_1)  
  x_4: float32[1, 64, 112, 112] = Relu(x_3)  
  return x_4
}


We define the input `img_sym` as a symbol tensor. We get the symbol output by running the `conv_relu()` model with the symbol input. `out_sym` is a symbol tensor that contains all the information of how it is derived. Then we can use `hidet.trace_from()` to create the static computation graph from the symbol output. 

`graph` is an instance of `hidet.graph.FlowGraph`, which is Hidet's representation of a computation graph and also the basic unit of graph-level optimizations. You can print `graph` to get its textual definition.
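
To make the idea of a symbol tensor that "contains all the information of how it is derived" more concrete, the toy sketch below records operations on placeholder objects and then walks backward from the output to recover the operator sequence. It is plain Python and does not use Hidet; the `Sym` class and the `record` and `trace` functions are hypothetical names invented for illustration, and Hidet's actual implementation may differ.

```python
class Sym:
    """A placeholder value that remembers which op and inputs produced it."""

    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op                  # None marks a graph input or constant
        self.inputs = tuple(inputs)


def record(op, *inputs):
    """Record an operation instead of computing it."""
    return Sym(name=op + "_out", op=op, inputs=inputs)


def trace(output):
    """Return op names in execution order (post-order walk from the output)."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for inp in node.inputs:       # producers must run before consumers
            visit(inp)
        if node.op is not None:
            order.append(node.op)

    visit(output)
    return order


# Mirror the structure of conv_relu() with symbolic placeholders.
x = Sym("x")
y = record("relu", record("add", record("conv2d", record("pad", x), Sym("w")), Sym("b")))
ops_in_order = trace(y)
print(ops_in_order)  # ['pad', 'conv2d', 'add', 'relu']
```

The recovered order matches the `Pad`, `Conv2d`, `Add`, `Relu` sequence in the printed graph.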

In [5]:
from hidet.utils import benchmark_func
cuda_graph = graph.cuda_graph()
img = hidet.randn([1, 3, 224, 224]).cuda()
(out,) = cuda_graph.run([img])
print('Hidet: {:.3f} ms'.format(benchmark_func(lambda: cuda_graph.run())))

Hidet: 0.304 ms


We can use the `cuda_graph()` method of a `FlowGraph` to create a `CudaGraph`. A CUDA Graph is a more efficient way to submit workloads to NVIDIA GPUs because it eliminates framework-side overhead.

We randomly generate an input tensor `img` and get the output tensor `out` by calling `run()` on the CUDA graph with the input. We also measure the latency of the model.
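
For readers curious what a latency measurement of this kind typically involves, here is a minimal, generic sketch: a few untimed warm-up runs followed by timed repetitions, reporting the median. The helper `measure_latency_ms` is a hypothetical name and is not how `benchmark_func` is necessarily implemented. It also assumes the timed callable blocks until its work has finished, which for GPU workloads usually requires an explicit synchronization.

```python
import statistics
import time


def measure_latency_ms(func, warmup=3, repeat=10):
    """Median wall-clock latency of func() in milliseconds."""
    for _ in range(warmup):           # warm-up runs are executed but not timed
        func()
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in CPU workload; in the tutorial this would be the graph's run call.
latency = measure_latency_ms(lambda: sum(range(10_000)))
print('{:.3f} ms'.format(latency))
```

The median is used here because it is less sensitive to occasional slow runs than the mean.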

### Optimize the model

In [6]:
hidet.option.search_space(0)
with hidet.graph.PassContext() as ctx:
    ctx.save_graph_instrument('./outs/graphs')
    graph_opt = hidet.graph.optimize(graph)
    cuda_graph_opt = graph_opt.cuda_graph()
    print('Hidet optimized: {:.3f} ms'.format(benchmark_func(lambda: cuda_graph_opt.run())))


Compiling cuda task rearrange(x=float32[1, 64, 3, 7, 7])...
Compiling cuda task batch_matmul(a=float32[1, 12544, 147], b=float32[1, 147, 64], batch_size=1, m_size=12544, n_size=64, k_size=147, mma=simt) (6 fused)...


Hidet optimized: 0.106 ms


To optimize the model, we set the level of the operator schedule search space with `hidet.option.search_space()`. The level can be 0, 1, or 2. The default is 0, which means no kernel tuning; higher levels generally give better performance at the cost of longer compilation time. The cell above uses level 0, so it compiles quickly; at level 2 it may take a few minutes to run, depending on the machine.

We also perform graph-level optimizations with `hidet.graph.optimize()` and save the intermediate graphs with `save_graph_instrument()`. Running the benchmark again shows that the optimized model is faster than the unoptimized version (0.106 ms versus 0.304 ms in this run).
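
The compilation log above shows the optimized graph running a `batch_matmul` with `m_size=12544`, `n_size=64`, and `k_size=147` where the unoptimized graph ran `conv2d`. The sketch below reproduces those numbers from the convolution's shapes, assuming the usual im2col-style mapping of a convolution onto a matrix multiplication. The helper name `conv2d_as_gemm_dims` is hypothetical, and this is arithmetic only, not Hidet's code.

```python
def conv2d_as_gemm_dims(batch, in_channels, out_h, out_w, out_channels, kernel):
    """Matrix-multiply sizes (m, n, k) for a square-kernel conv2d viewed as a GEMM."""
    m = batch * out_h * out_w            # one row per output pixel
    n = out_channels                     # one column per filter
    k = in_channels * kernel * kernel    # one reduction element per filter weight
    return m, n, k


# Shapes from the tutorial: 3 input channels, 64 filters of 7x7, 112x112 output.
gemm_dims = conv2d_as_gemm_dims(1, 3, 112, 112, 64, 7)
print(gemm_dims)  # (12544, 64, 147)
```

The result lines up with the `a=float32[1, 12544, 147]` and `b=float32[1, 147, 64]` operands in the log.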

### Inspect the generated code
The source code and compiled kernels generated by Hidet are cached under Hidet's cache directory. You can get the path to the cache directory with:

In [7]:
hidet.option.get_cache_dir()

'/home/yaoyao/repos/asplos23-tutorial/hidet_tutorials_ipynb/hidet-cache'

The structure of this cache directory is described in the documentation. For example, it may look like:
```text
cache_root
|-- onnx                          (automatically downloaded ONNX models)
|   |-- bert.onnx
|   `-- resnet50.onnx
`-- ops
    |-- cpu_space_0
    |-- cuda_space_0              (<target>_space_<space>)
    |   |-- add
    |   |   `-- 41920731adb3acf4  (task string hash)
    |   |       |-- lib.so        (compiled kernel)
    |   |       |-- nvcc_log.txt  (compilation command and nvcc output)
    |   |       |-- source.cu     (kernel source code)
    |   |       `-- task.txt      (task string)
    |   `-- matmul
    |       `-- 92dfdc1734b3854d
    |           |-- lib.so
    |           |-- nvcc_log.txt
    |           |-- source.cu
    |           `-- task.txt
```
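
Given a layout like the one above, the cached kernel sources can be located with a short directory walk. The sketch below uses only the Python standard library; the function name `list_cached_kernels` is hypothetical, and it assumes the `ops/<target>_space_<space>/<operator>/<task hash>/source.cu` nesting shown in the tree, which may differ between Hidet versions.

```python
from pathlib import Path


def list_cached_kernels(cache_root):
    """Yield (space, operator, task_hash, source_path) for each cached source.cu."""
    ops_dir = Path(cache_root) / 'ops'
    # Three directory levels sit between ops/ and source.cu in the assumed layout.
    for source in sorted(ops_dir.glob('*/*/*/source.cu')):
        task_dir = source.parent
        operator_dir = task_dir.parent
        yield operator_dir.parent.name, operator_dir.name, task_dir.name, source


# Prints nothing if the directory does not exist or holds no cached kernels.
for space, operator, task_hash, source in list_cached_kernels('./hidet-cache'):
    print(space, operator, task_hash, source)
```

Each yielded path points at a `source.cu` file that can be opened in an editor next to its `task.txt` and `nvcc_log.txt`.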

### Example of generated kernel


In [9]:
# print the source code for conv2d without optimization
print(graph.nodes[1].task_func.source(color=True))

#include <stdint.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <hidet/runtime/cuda_context.h>
#include <hidet/runtime/cpu_context.h>
typedef float tfloat32_t;
#define __float_to_tf32(x) (x)
/*
Task(
  name: conv2d
  parameters: 
    x: tensor(float32, [1, 3, 230, 230])
    w: tensor(float32, [64, 3, 7, 7])
    out: tensor(float32, [1, 64, 112, 112])
  inputs: [x, w]

In [10]:
# print out the source code of conv2d with implicit gemm algorithm
print(graph_opt.nodes[-1].task_func.source(color=True))

#include <stdint.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <hidet/runtime/cuda_context.h>
#include <hidet/runtime/cpu_context.h>
typedef float tfloat32_t;
#define __float_to_tf32(x) (x)
/*
Task(
  name: batch_matmul
  parameters: 
    b: tensor(float32, [1, 147, 64])
    y: tensor(float32, [1, 64, 1, 1])
    data: tensor(float32, [1, 3, 224, 224])
    y_1: tensor(float32, [1, 64, 1