<a href="https://colab.research.google.com/github/AEW2015/altera_examples/blob/main/py_AI/colab/yolov8_fpga.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YOLOv8 example for Altera FPGA AI Suite

In [1]:
import torch

# Check PyTorch version
pt_version = torch.__version__
print(f"[INFO] Current PyTorch version: {pt_version} (should be 2.x+)")

# Install PyTorch 2.0 if necessary
if pt_version.split(".")[0] == "1": # Check if PyTorch version begins with 1
    !pip3 install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    print("[INFO] PyTorch 2.x installed. If you're on Google Colab, you may need to restart your runtime. "
          "As of April 2023, Google Colab comes with PyTorch 2.0 pre-installed.")
    import torch
    pt_version = torch.__version__
    print(f"[INFO] Current PyTorch version: {pt_version} (should be 2.x+)")
else:
    print("[INFO] PyTorch 2.x installed, you'll be able to use the new features.")

[INFO] Current PyTorch version: 2.5.1+cu121 (should be 2.x+)
[INFO] PyTorch 2.x installed, you'll be able to use the new features.
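
The cell above decides whether to upgrade by comparing the first field of the version string with `"1"`. A slightly more general check can be sketched as follows, assuming version strings have the `major.minor.patch+local` shape shown in the output; `torch_major_version` and `needs_upgrade` are hypothetical helper names, not part of PyTorch.

```python
def torch_major_version(version: str) -> int:
    """Return the major version from a string such as '2.5.1+cu121'."""
    # Drop any local build suffix ('+cu121'), then take the first dotted field.
    public_part = version.split("+", 1)[0]
    return int(public_part.split(".", 1)[0])


def needs_upgrade(version: str, minimum_major: int = 2) -> bool:
    """True when the installed major version is below the required one."""
    return torch_major_version(version) < minimum_major


print(needs_upgrade("2.5.1+cu121"))  # False
print(needs_upgrade("1.13.1"))       # True
```

Comparing integers rather than strings keeps the check valid for any later major version.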


In [2]:
# Make sure we're using a NVIDIA GPU
if torch.cuda.is_available():
  gpu_info = !nvidia-smi
  gpu_info = '\n'.join(gpu_info)
  if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

  # Get GPU name
  gpu_name = !nvidia-smi --query-gpu=gpu_name --format=csv
  gpu_name = gpu_name[1]
  GPU_NAME = gpu_name.replace(" ", "_") # replace spaces with underscores for easier saving
  print(f'GPU name: {GPU_NAME}')

  # Get GPU capability score
  GPU_SCORE = torch.cuda.get_device_capability()
  print(f"GPU capability score: {GPU_SCORE}")
  if GPU_SCORE >= (8, 0):
    print(f"GPU score higher than or equal to (8, 0), PyTorch 2.x speedup features available.")
  else:
    print(f"GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).")

  # Print GPU info
  print(f"GPU information:\n{gpu_info}")

else:
  print("PyTorch couldn't find a GPU, to leverage the best of PyTorch 2.0, you should connect to a GPU.")

GPU name: Tesla_T4
GPU capability score: (7, 5)
GPU score lower than (8, 0), PyTorch 2.x speedup features will be limited (PyTorch 2.x speedups happen most on newer GPUs).
GPU information:
Wed Nov 20 03:23:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8              11W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                

In [3]:
import torch
print(torch.cuda.is_available())

True


## Download YOLO model

In [5]:
pip install ultralytics -qq

In [6]:
from ultralytics import YOLO

# Load the YOLOv8n classification model
model = YOLO("yolov8n-cls.pt")

Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolov8n-cls.pt to 'yolov8n-cls.pt'...


100%|██████████| 5.31M/5.31M [00:00<00:00, 253MB/s]


In [19]:
results = model("https://ultralytics.com/images/bus.jpg")  # predict on an image


Downloading https://ultralytics.com/images/bus.jpg to 'bus.jpg'...


100%|██████████| 134k/134k [00:00<00:00, 37.1MB/s]

image 1/1 /content/bus.jpg: 224x224 minibus 0.50, police_van 0.29, trolleybus 0.05, golfcart 0.02, jinrikisha 0.02, 3.4ms
Speed: 35.2ms preprocess, 3.4ms inference, 0.1ms postprocess per image at shape (1, 3, 224, 224)
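
The prediction line above lists the five most probable classes for the image. As a minimal sketch of how such a ranked list is derived from class scores, the snippet below sorts a name-to-probability mapping. The `scores` values are copied from the output above, and `top_k` is a hypothetical helper, not part of the Ultralytics API.

```python
def top_k(scores: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Return the k highest-scoring (class_name, probability) pairs."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Probabilities copied from the bus.jpg prediction output above.
scores = {
    "trolleybus": 0.05,
    "minibus": 0.50,
    "golfcart": 0.02,
    "police_van": 0.29,
    "jinrikisha": 0.02,
}

for name, prob in top_k(scores, k=2):
    print(f"{name}: {prob:.2f}")
```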





## Export as ONNX

In [7]:
# Export the model to ONNX format
model.export(format="onnx",batch=1,opset=11)

Ultralytics 8.3.34 🚀 Python-3.10.12 torch-2.5.1+cu121 CPU (Intel Xeon 2.00GHz)
YOLOv8n-cls summary (fused): 73 layers, 2,715,880 parameters, 0 gradients, 4.3 GFLOPs

[34m[1mPyTorch:[0m starting from 'yolov8n-cls.pt' with input shape (1, 3, 224, 224) BCHW and output shape(s) (1, 1000) (5.3 MB)
[31m[1mrequirements:[0m Ultralytics requirements ['onnx>=1.12.0', 'onnxslim', 'onnxruntime-gpu'] not found, attempting AutoUpdate...
Collecting onnx>=1.12.0
  Downloading onnx-1.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxslim
  Downloading onnxslim-0.1.39-py3-none-any.whl.metadata (2.9 kB)
Collecting onnxruntime-gpu
  Downloading onnxruntime_gpu-1.20.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting coloredlogs (from onnxruntime-gpu)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime-gpu)
  Downloading humanfriendly-10.

'yolov8n-cls.onnx'

Convert the ONNX model to OpenVINO IR with the Model Optimizer (`mo`):

`mo --input_model yolov8n-cls.onnx  --output_dir ./yolov8  --reverse_input_channels`
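
The `--reverse_input_channels` option makes the converted model swap the order of its three input color channels (RGB to BGR, or the reverse). The effect on the data can be sketched in plain Python. `reverse_channels` is a hypothetical helper, and the nested-list CHW layout is an assumption chosen for illustration.

```python
def reverse_channels(image_chw):
    """Reverse the channel axis of a CHW image stored as nested lists."""
    # The first axis holds the channels, so reversing it turns RGB into BGR
    # (and BGR back into RGB) without touching the pixel positions.
    return image_chw[::-1]


# A 3-channel, 1x2 image: R plane, G plane, B plane.
rgb = [
    [[255, 10]],  # R
    [[128, 20]],  # G
    [[0, 30]],    # B
]

bgr = reverse_channels(rgb)
print(bgr[0])  # the former B plane now comes first
```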

Check out the DOT kernel

Build the benchmark image with `dla_compiler`:

```
dla_compiler --march $COREDLA_ROOT/example_architectures/AGX7_Performance.arch --network-file ./yolov8n-cls.xml --foutput-format=open_vino_hetero --o ./yolov8_bench.bin --batch-size=1 --fanalyze-performance
```

Run on the FPGA:

```
uio-devices restart
export compiled_model=~/resnet-50-tf/yolov8_bench.bin
export imgdir=~/resnet-50-tf/sample_images
export archfile=~/resnet-50-tf/AGX7_Performance.arch
cd ~/app
export COREDLA_ROOT=/home/root/app
./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=5 -plugins_xml_file ./plugins.xml -arch_file $archfile -api=async -perf_est  -nireq=4 -bgr
```

## Performance Testing

In [20]:
# Example: Dummy dataset with random data
from PIL import Image
import numpy as np

# Generate random images to simulate a dataset
def generate_dummy_images(num_images, size=(224, 224)):
    images = []
    for _ in range(num_images):
        array = np.random.randint(0, 256, size + (3,), dtype=np.uint8)  # upper bound is exclusive
        images.append(Image.fromarray(array))
    return images

images = generate_dummy_images(2000)

In [21]:
from time import time
# Ensure GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Convert images to tensors and move to the appropriate device
processed_images = [torch.tensor(np.array(img).transpose(2, 0, 1), dtype=torch.float32) / 255.0 for img in images]
processed_images = [img.unsqueeze(0) for img in processed_images]  # Add batch dimension

# Batch the images
batch_size = 2000  # Adjust batch size for performance vs memory tradeoff
batches = [torch.cat(processed_images[i:i + batch_size], dim=0).to(device) for i in range(0, len(processed_images), batch_size)]

# Run inference and measure speed
times = []
for batch in batches:
    start_time = time()
    results = model.predict(batch)  # Input is already a normalized BCHW tensor
    elapsed_time = time() - start_time
    times.append(elapsed_time)



Using device: cuda


errors for large sources or long-running streams and videos. See https://docs.ultralytics.com/modes/predict/ for help.

Example:
    results = model(source=..., stream=True)  # generator of Results objects
    for r in results:
        boxes = r.boxes  # Boxes object for bbox outputs
        masks = r.masks  # Masks object for segment masks outputs
        probs = r.probs  # Class probabilities for classification outputs

0: 224x224 paper_towel 0.08, tennis_ball 0.04, velvet 0.04, vulture 0.03, bubble 0.03, 0.2ms
1: 224x224 paper_towel 0.09, velvet 0.07, dishrag 0.06, tennis_ball 0.04, tick 0.03, 0.2ms
2: 224x224 velvet 0.07, paper_towel 0.06, tennis_ball 0.05, dishrag 0.05, vulture 0.03, 0.2ms
3: 224x224 tennis_ball 0.06, paper_towel 0.05, velvet 0.04, vulture 0.04, tick 0.03, 0.2ms
4: 224x224 paper_towel 0.08, dishrag 0.06, velvet 0.05, tennis_ball 0.05, tick 0.03, 0.2ms
5: 224x224 paper_towel 0.08, dishrag 0.06, tennis_ball 0.04, velvet 0.04, tick 0.03, 0.2ms
6:

In [22]:
# Display average speed
print(f"Average inference time per image: {sum(times) / len(images):.4f} seconds")
print(f"Average speed: {len(images) / sum(times):.2f} images/second")

Average inference time per image: 0.0006 seconds
Average speed: 1592.55 images/second
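
The two figures above follow directly from the recorded batch times. As a sketch, the same arithmetic can be wrapped in a small helper. `summarize_timings` is a hypothetical name, and the sample values below are made up for illustration, not measured.

```python
def summarize_timings(batch_times, num_images):
    """Return (seconds per image, images per second) for timed batches."""
    total = sum(batch_times)
    if total <= 0 or num_images <= 0:
        raise ValueError("need positive total time and image count")
    return total / num_images, num_images / total


# Illustrative values only: two batches of 1000 images each.
latency, throughput = summarize_timings([0.5, 0.75], num_images=2000)
print(f"Average inference time per image: {latency:.6f} seconds")
print(f"Average speed: {throughput:.2f} images/second")
```

Because the notebook times a single 2000-image batch, any one-time setup cost in that `predict` call is included in the reported figure.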


## Add a Divide-by-255 Scale Layer to the ONNX Model

In [16]:
import numpy as np
import onnx
from onnx import helper, numpy_helper

def insert_divide_by_255_layer(model_path, output_path):
  # Load the ONNX model
  model = onnx.load(model_path)
  graph = model.graph
  # Identify the graph input (assumes the model has a single image input)
  input_name = graph.input[0].name
  # Create a constant node holding the value 255
  const_value = numpy_helper.from_array(np.array([255.0], dtype=np.float32), name="const_255")
  const_node = helper.make_node(
              'Constant',
              inputs=[],
              outputs=['const_255'],
              value=const_value
              )
  # Create a divide node: normalized_input = input / 255
  div_node = helper.make_node(
              'Div',
              inputs=[input_name, 'const_255'],
              outputs=['normalized_input']
              )
  # Redirect every node that consumes the original input to the divided tensor
  for node in graph.node:
    for i, name in enumerate(node.input):
      if name == input_name:
        node.input[i] = 'normalized_input'

  # Add the new nodes at the front so the graph stays topologically sorted
  graph.node.insert(0, const_node)
  graph.node.insert(1, div_node)
  # Save the modified model
  onnx.save(model, output_path)


In [18]:
model_path = 'yolov8n-cls.onnx'
output_path = 'model_div255.onnx'
insert_divide_by_255_layer(model_path, output_path)
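
The rewiring step inside `insert_divide_by_255_layer` can be illustrated without ONNX at all. The toy sketch below uses plain dictionaries in place of ONNX protobuf nodes. The node structure, the tensor names (`images`, `w0`, `c0`, `r0`), and the helper `insert_scale_node` are all assumptions made for illustration, not the `onnx` API.

```python
def insert_scale_node(nodes, graph_input, scaled_name="normalized_input"):
    """Redirect every consumer of graph_input to a new Div node's output."""
    # Rewire existing consumers first, so the new Div node keeps reading
    # the raw graph input instead of its own output.
    for node in nodes:
        node["inputs"] = [
            scaled_name if name == graph_input else name
            for name in node["inputs"]
        ]
    const = {"op": "Constant", "inputs": [], "outputs": ["const_255"]}
    div = {"op": "Div", "inputs": [graph_input, "const_255"], "outputs": [scaled_name]}
    # Producers go first so the node list stays topologically ordered.
    return [const, div] + nodes


toy_graph = [
    {"op": "Conv", "inputs": ["images", "w0"], "outputs": ["c0"]},
    {"op": "Relu", "inputs": ["c0"], "outputs": ["r0"]},
]

for node in insert_scale_node(toy_graph, "images"):
    print(node["op"], node["inputs"], "->", node["outputs"])
```

The order matters: if the `Div` node were added before the rewiring loop, its own input would also be renamed and it would end up consuming its own output.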

Convert the modified ONNX model to OpenVINO IR, then compile it:

`mo --input_model model_div255.onnx  --output_dir ./yolov8  --reverse_input_channels`

```
dla_compiler --march $COREDLA_ROOT/example_architectures/AGX7_Performance.arch --network-file ./model_div255.xml --foutput-format=open_vino_hetero --o ./yolov8_bench.bin --batch-size=1 --fanalyze-performance
```

Alternatively, apply the divide-by-255 scale with the OpenVINO `mo` tool (`-s 255`) instead of editing the ONNX graph:

`mo --input_model yolov8n-cls.onnx -s 255 --output_dir ./yolov8_s --reverse_input_channels`

```
dla_compiler --march $COREDLA_ROOT/example_architectures/AGX7_Performance.arch --network-file ./yolov8n-cls.xml --foutput-format=open_vino_hetero --o ./yolov8_bench.bin --batch-size=1 --fanalyze-performance
```
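
Both approaches in this section apply the same arithmetic to the model input: every raw 8-bit pixel value is divided by 255, so the network sees values in the range [0, 1]. A minimal sketch of that arithmetic follows; `scale_pixels` is a hypothetical helper used only for illustration.

```python
def scale_pixels(pixels, scale=255.0):
    """Divide raw pixel values by `scale`, as a Div node or `mo -s` would."""
    return [value / scale for value in pixels]


raw = [0, 51, 255]
scaled = scale_pixels(raw)
print(scaled)  # [0.0, 0.2, 1.0]
```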