# Torch2Needle: Automated Model Transformation and Optimization for AMD GPUs

This demonstration notebook walks through our project implementation. Torch2Needle is an automated model transformation tool that converts a torch-based model directly into a Needle model and deploys it to an AMD MI300X GPU. With Torch2Needle, users can turn a torch-defined model into a model that runs on AMD GPUs through the Needle framework. Torch2Needle also applies operator fusion automatically to speed up model inference on AMD GPUs.

We have profiled Torch2Needle on four example models: `ResNet18`, `ResNet50`, `ResNet101`, and `UNet`. The profiling reports, named `Performance_Summary_model_name.txt` in the main directory, compare inference time before and after operator fusion. Each report includes the total inference speedup and the speedup of each fused operator.

### Build environment

You can run this notebook in any Python environment (`conda`, `uv`, or another) that has the `torch`, `numpy`, and `torchvision` modules installed. However, we strongly recommend building the Python environment with `uv` and making sure your system can access an NVIDIA GPU or an AMD GPU. If you are running the notebook in Colab, you're good to go!
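
A quick way to confirm the required modules are present before running anything is sketched below. It uses only the standard library, and the helper name `missing_modules` is a hypothetical name chosen for illustration, not part of the project.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Modules the notebook lists as requirements.
    required = ["torch", "numpy", "torchvision"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only checks that the modules can be located; it does not verify that a GPU is visible to them.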

### Pipeline tests

You are also free to use torch2needle yourself! We provide two examples: `ResNet` and `UNet`. The detailed output of the complete model conversion pipeline is shown below.

The torch2needle conversion tool itself is device-independent: you can use it to convert a torch-like model to a needle-like model on either CPU or GPU. This notebook uses `UNet` as an example for your reference.

Run the commands below to convert the torch-like UNet defined in the `apps/UNet/unet` directory to a needle-like UNet and perform operator fusion. Note that in our project operator fusion is hardware-specific: the model only gets an actual speedup if you have an AMD GPU with the ROCm library.

Before running the following code, mount your own drive space and clone our project:

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p 10714f25
%cd /content/drive/MyDrive/10714f25
!git clone -b shuaiweh --single-branch https://github.com/BaoBao0926/CMU-10714-DL-System-Course-Project.git project
%cd /content/drive/MyDrive/10714f25/project
!pip3 install -r requirements.txt

In [None]:
!make

In [None]:
!python3 apps/Unet/run_unet.py --backend nd --n_classes 2 --device cuda > output_unet.txt

You can check the output in `output_unet.txt` in the main directory. The rest of this notebook walks through that output step by step and explains what each part means:

```
【Step 1】PyTorch model Prepare
PyTorch model input shape: torch.Size([1, 3, 224, 224])
PyTorch model output shape: torch.Size([1, 2, 224, 224])
```

Step 1 simply shows the input shape and output shape shared by the original torch-like model and the converted needle-like model.

```
【Step 2】Transfer to needle model
Needle model type: FXGraphExecutor
Needle model structure:
FXGraphExecutor(
  (layer_0): <needle.nn.nn_conv.Conv object at 0x7ea35f0fc6e0>
  (layer_1): <needle.nn.nn_basic.BatchNorm2d object at 0x7ea34b461af0>
  (layer_10): <needle.nn.nn_conv.Conv object at 0x7ea34b479070>
  (layer_11): <needle.nn.nn_basic.BatchNorm2d object at 0x7ea34b4792b0>
  (layer_12): <needle.nn.nn_basic.ReLU object at 0x7ea34b4792e0>
  (layer_13): <needle.nn.nn_conv.MaxPool2d object at 0x7ea34b479460>
  (layer_14): <needle.nn.nn_conv.Conv object at 0x7ea34b4793a0>
  (layer_15): <needle.nn.nn_basic.BatchNorm2d object at 0x7ea34b479730>
  (layer_16): <needle.nn.nn_basic.ReLU object at 0x7ea34b479760>
  (layer_17): <needle.nn.nn_conv.Conv object at 0x7ea34b4798b0>
  (layer_18): <needle.nn.nn_basic.BatchNorm2d object at 0x7ea34b479b50>
  (layer_19): <needle.nn.nn_basic.ReLU object at 0x7ea34b479b80>
  ....
```

Reaching Step 2 means the torch model has already been converted to a needle model. Every converted needle model is wrapped in a class called `FXGraphExecutor`, defined in `torch2needle/torch2needle_converter.py`, which holds the converted needle layers.
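
The idea of a wrapper that holds named layers and runs them as a graph can be sketched in a few lines. This is not the project's `FXGraphExecutor`. `Node` and `TinyGraphExecutor` are hypothetical names, and the layers are stand-in scalar functions so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Node:
    name: str                 # name under which the result is stored
    op: Callable[..., float]  # stand-in for a converted layer
    inputs: Sequence[str]     # names of earlier results this node consumes


class TinyGraphExecutor:
    """Hypothetical stand-in: runs nodes in recorded order, keeping results by name."""

    def __init__(self, nodes: List[Node], output: str):
        self.nodes = nodes
        self.output = output

    def __call__(self, x: float) -> float:
        env: Dict[str, float] = {"x": x}  # "x" is the graph input
        for node in self.nodes:
            args = [env[name] for name in node.inputs]
            env[node.name] = node.op(*args)
        return env[self.output]


# Toy graph in which the last node reuses the original input: out = relu(2 * x) + x
executor = TinyGraphExecutor(
    nodes=[
        Node("layer_0", lambda v: 2.0 * v, ["x"]),
        Node("layer_1", lambda v: max(v, 0.0), ["layer_0"]),
        Node("add", lambda a, b: a + b, ["layer_1", "x"]),
    ],
    output="add",
)
print(executor(3.0))
```

Storing intermediate results by name is what lets a later node consume an earlier output, which a plain sequential container cannot express.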

```
【Step 3】Load weight to needle model
[✔] Copied Conv2d(3, 64, kernel_size=(3, 3))
[✔] Copied BatchNorm2d(64)
[✔] ReLU (no weights)
[✔] Copied Conv2d(64, 64, kernel_size=(3, 3))
[✔] Copied BatchNorm2d(64)
[✔] ReLU (no weights)
[✔] MaxPool2d (no weights)
[✔] Copied Conv2d(64, 128, kernel_size=(3, 3))
[✔] Copied BatchNorm2d(128)
[✔] ReLU (no weights)
[✔] Copied Conv2d(128, 128, kernel_size=(3, 3))
[✔] Copied BatchNorm2d(128)
[✔] ReLU (no weights)
[✔] MaxPool2d (no weights)
[✔] Copied Conv2d(128, 256, kernel_size=(3, 3))
[✔] Copied BatchNorm2d(256)
[✔] ReLU (no weights)
[✔] Copied Conv2d(256, 256, kernel_size=(3, 3))
[✔] Copied BatchNorm2d(256)
[✔] ReLU (no weights)
[✔] MaxPool2d (no weights)
[✔] Copied Conv2d(256, 512, kernel_size=(3, 3))
[✔] Copied BatchNorm2d(512)
...
```

Step 3 copies the weight and bias (if `bias=True`) from each original torch layer to the corresponding needle layer, layer by layer.
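
A minimal sketch of this layer-by-layer copy is shown below, under simplifying assumptions: each layer is modelled as a plain dict of flat parameter lists, and `copy_parameters` is a hypothetical helper. The real converter works on torch and needle tensors and may also need to rearrange weight layouts, which this sketch ignores.

```python
from typing import Dict, List, Optional

# A layer is modelled as a dict of parameter name -> flat list of values.
# Layers without parameters (e.g. ReLU, MaxPool2d) are empty dicts.
Layer = Dict[str, Optional[List[float]]]


def copy_parameters(src_layers: List[Layer], dst_layers: List[Layer]) -> List[bool]:
    """Copy weight and bias from each source layer to the matching target layer.

    Returns one flag per layer: True if anything was copied, False otherwise.
    """
    if len(src_layers) != len(dst_layers):
        raise ValueError("source and target must have the same number of layers")
    copied = []
    for src, dst in zip(src_layers, dst_layers):
        did_copy = False
        for key in ("weight", "bias"):
            value = src.get(key)
            if value is not None:       # bias is None when bias=False
                dst[key] = list(value)  # copy, so the two models do not share storage
                did_copy = True
        copied.append(did_copy)
    return copied


torch_like = [{"weight": [0.1, 0.2], "bias": None}, {}, {"weight": [1.0], "bias": [0.5]}]
needle_like = [{"weight": [0.0, 0.0], "bias": None}, {}, {"weight": [0.0], "bias": [0.0]}]
flags = copy_parameters(torch_like, needle_like)
print(flags)
```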

```
【Step 4】Verify converted model
Max difference after conversion: 6.71e-08
✅ Conversion is correct!
```

Step 4 shows the maximum difference between the outputs of the converted model and the original torch model. If you see `Conversion is correct`, congratulations! You have successfully converted a torch-like model to a needle-like model, and the outputs of the two models are nearly identical.
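
The check itself amounts to an element-wise comparison of the two outputs. The sketch below works on flattened Python lists. The function names and the tolerance `tol` are assumptions made for illustration, not the project's actual values.

```python
from typing import Sequence


def max_abs_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flattened outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(a: Sequence[float], b: Sequence[float], tol: float = 1e-5) -> bool:
    # tol is an assumed threshold for illustration, not the project's value.
    return max_abs_difference(a, b) <= tol


torch_out = [0.25, -1.5, 3.0]
needle_out = [0.25, -1.5, 3.0 + 6.0e-8]
print(f"Max difference: {max_abs_difference(torch_out, needle_out):.2e}")
print(outputs_match(torch_out, needle_out))
```

Using the maximum rather than the mean makes the check sensitive to a single badly converted element.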

```
【Step 5】Perform operator fusion

Fuse report:

============================================================
Operator Fusion Report
============================================================
Total fusions: 18
------------------------------------------------------------
Position Fusion Pattern            Original Operators  
------------------------------------------------------------
0        ConvBatchNorm2dReLU       Conv -> BatchNorm2d -> ReLU
3        ConvBatchNorm2dReLU       Conv -> BatchNorm2d -> ReLU
7        ConvBatchNorm2dReLU       Conv -> BatchNorm2d -> ReLU
10       ConvBatchNorm2dReLU       Conv -> BatchNorm2d -> ReLU
14       ConvBatchNorm2dReLU       Conv -> BatchNorm2d -> ReLU
17       ConvBatchNorm2dReLU       Conv -> BatchNorm2d -> ReLU
21       ConvBatchNorm2dReLU       Conv -> BatchNorm2d -> ReLU
...
============================================================
```

Step 5 tries to fuse layers in the original model architecture that match a `fusion_pattern` defined in `operator_fusion/`. In UNet, all fusible layers follow the `conv -> bn -> relu` pattern, and torch2needle finds 18 fusions in total.
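
The matching step can be pictured on a flat list of layer type names. This is only a sketch under that assumption. The actual patterns live in `operator_fusion/` and the real model is a graph, so `fuse_sequence`, `PATTERN`, and `FUSED_NAME` are hypothetical names used for illustration.

```python
from typing import List, Tuple

PATTERN = ("Conv", "BatchNorm2d", "ReLU")
FUSED_NAME = "ConvBatchNorm2dReLU"


def fuse_sequence(layers: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each Conv -> BatchNorm2d -> ReLU run with one fused entry.

    Returns the fused layer list and the start position of every match
    in the original list.
    """
    fused: List[str] = []
    positions: List[int] = []
    i = 0
    while i < len(layers):
        if tuple(layers[i:i + len(PATTERN)]) == PATTERN:
            fused.append(FUSED_NAME)
            positions.append(i)
            i += len(PATTERN)  # skip past the three matched layers
        else:
            fused.append(layers[i])
            i += 1
    return fused, positions


block = ["Conv", "BatchNorm2d", "ReLU"]
layers = block * 2 + ["MaxPool2d"] + block * 2
fused, positions = fuse_sequence(layers)
print(positions)
print(fused)
```

For this layer list the matches start at positions 0, 3, 7, and 10, which lines up with the first rows of the fusion report above.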

```
【Step 6】Verify conversion of fused model with torch model
Max difference between fused and torch model: 6.71e-08
✅ Fusion correct!
```

If you reach Step 6 and see `Fusion correct!`, congratulations! Your fused model produces the same output as the torch-like model, and the fusion succeeded.

```
【Step 7】Compared fused model and non-fused model
Max difference before and after fused: 0.00e+00
✅ Fusion produces no difference
```

Step 7 measures the difference between the fused model and the non-fused model. In this case the difference is zero because we are not running on the AMD GPU backend. Our implementation performs fusion in the backend, which may produce a difference smaller than 1e-6.
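
To see why a fused path can differ slightly from the unfused one, consider one common inference-time approach: folding the batch norm statistics into the convolution's weight and bias. The source does not say whether Torch2Needle's backend does this, so treat the following as an assumed illustration. It uses a single-channel scalar "conv" and hypothetical function names, and shows that the two paths agree up to floating-point rounding.

```python
import math


def conv_bn_relu(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused reference: scalar 'conv' (w * x + b), then batch norm, then ReLU."""
    y = w * x + b
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta
    return max(y, 0.0)


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time batch norm statistics into the conv weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


def fused_conv_bn_relu(x, w_folded, b_folded):
    """Fused path: a single multiply-add followed by ReLU."""
    return max(w_folded * x + b_folded, 0.0)


params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=2.0)
w_f, b_f = fold_bn(**params)
xs = [-2.0, -0.5, 0.0, 1.0, 3.5]
diffs = [abs(conv_bn_relu(x, **params) - fused_conv_bn_relu(x, w_f, b_f)) for x in xs]
print(f"Max difference: {max(diffs):.2e}")
```

The two paths compute the same expression with the operations in a different order, so any difference comes from rounding alone.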

If you have an AMD GPU, you can run the following command, which converts a torch-like ResNet model (`resnet50` in this command) to a needle-like model and runs it on the AMD GPU. Make sure the `--backend` argument is set to `hip`, so that the operators defined in `needle/ops/ops_hip.py` are used.

In [None]:
!python3 apps/resnet/run_resnet.py --backend hip --model resnet50 --device hip > output_resnet.txt