# Real-Time 3D Object Detection with PointPillars and Hailo Integration

This notebook demonstrates how to build a real-time inference pipeline for 3D object detection using the PointPillars network from the OpenPCDet repository, accelerated by a Hailo device. By offloading the heavy 2D convolutional computations to the Hailo hardware, we can significantly improve inference speed, making the pipeline suitable for real-time applications.
This notebook is based on a script from Hailo's application code examples: [Hailo Application Code Examples](https://github.com/hailo-ai/Hailo-Application-Code-Examples/tree/main).



---

# Table of Contents

1. [Introduction](#introduction)
2. [Setup](#setup)
   - [Install Dependencies](#install-dependencies)
   - [Import Libraries and Set Paths](#import-libraries-and-set-paths)
3. [Load and Prepare the Model](#load-and-prepare-the-model)
   - [Load Configuration and Build the Model](#load-configuration-and-build-the-model)
   - [Run a Sanity Test](#run-a-sanity-test)
4. [Integrate with Hailo Hardware](#integrate-with-hailo-hardware)
   - [Export the 2D Backbone and Detection Head to ONNX](#export-the-2d-backbone-and-detection-head-to-onnx)
   - [Translate ONNX to Hailo Format](#translate-onnx-to-hailo-format)
   - [Verify Inference Equivalence with Hailo Emulation](#verify-inference-equivalence-with-hailo-emulation)
5. [Optimize and Compile for Hailo Hardware](#optimize-and-compile-for-hailo-hardware)
   - [Create Calibration Dataset](#create-calibration-dataset)
   - [Run Model Optimization (Quantization)](#run-model-optimization-quantization)
   - [Compile the Model for Hailo Hardware](#compile-the-model-for-hailo-hardware)
6. [Run Inference with Hailo Offload](#run-inference-with-hailo-offload)
7. [Conclusion](#conclusion)

---

## Introduction

We will guide you through setting up the environment, preparing the data, integrating the PointPillars model with Hailo hardware, and running real-time inference on point cloud data.

## Setup

### Install Dependencies
1. Install CUDA and PyTorch

Ensure you have CUDA installed.   
The following PyTorch versions have been tested:

In [None]:
# For CUDA 11.3
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

# For CUDA 10.2
pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu102

2. Clone and Install OpenPCDet


In [None]:
git clone https://github.com/open-mmlab/OpenPCDet.git
cd OpenPCDet
pip install -r requirements.txt
pip install spconv kornia
python setup.py develop

3. Clone this repository, if you have not done so already.
  
Clone this repository into your environment if you plan to use any of its files (e.g. model weights, the adjusted YAML configuration file for the PointPillars model, or the custom dataset configuration for Innoviz point clouds). 
 
**NOTE**: even if you use your own weights and model/dataset configurations, you still need the ``openpcdet2hailo_utils.py`` file, because this pipeline depends on it. 
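
Before running the cells below, it can help to confirm that the modules this notebook imports can actually be located. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name introduced here for illustration and is not part of OpenPCDet or the Hailo SDK.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Module names taken from the imports used later in this notebook.
required = ["torch", "numpy", "pcdet", "openpcdet2hailo_utils"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Note that `openpcdet2hailo_utils` is only found if the folder containing it is on `sys.path` (for example, the notebook's working directory).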

### Import Libraries and Set Paths


In [None]:
import os
import sys
import torch
import numpy as np
from pathlib import Path

# Replace with your OpenPCDet clone directory
openpcdet_clonedir = '/path/to/OpenPCDet'
sys.path.append(openpcdet_clonedir + '/tools/')

from pcdet.config import cfg, cfg_from_yaml_file
from pcdet.utils import common_utils
from pcdet.models import build_network, load_data_to_gpu
from pcdet.datasets import DatasetTemplate

# Import custom utilities for OpenPCDet and Hailo integration
import openpcdet2hailo_utils as ohu

## Load and Prepare the Model
### Load Configuration and Build the Model
Specify the paths to the model configuration file, pretrained weights, and point cloud data.

In [None]:
# Paths to model configuration and weights
yaml_name = '/path/to/yaml/cfg/file'
pth_name = '/path/to/.pth/file'

# Path to point cloud data
sample_pointclouds = '/path/to/pc_samples/testing/innoviz'
demo_pointcloud = '/path/to/pc_samples/testing/innoviz/00001.npy'

# File extension of point cloud files
pc_file_extention = '.npy'

Load the configuration and build the model:

In [None]:
logger = common_utils.create_logger()
cfg_from_yaml_file(yaml_name, cfg)

demo_dataset = ohu.DemoDataset(
    dataset_cfg=cfg.DATA_CONFIG,
    class_names=cfg.CLASS_NAMES,
    training=False,
    root_path=Path(sample_pointclouds),
    ext=pc_file_extention,
    logger=logger
)

logger.info(f'Total number of samples: \t{len(demo_dataset)}')

# Build the model and load pretrained weights
model = build_network(
    model_cfg=cfg.MODEL,
    num_class=len(cfg.CLASS_NAMES),
    dataset=demo_dataset
)
model.load_params_from_file(filename=pth_name, logger=logger, to_cpu=True)
model.eval()

In [6]:
def get_model(cfg, pth_name, demo_dataset):    
    model = build_network(model_cfg=cfg.MODEL, num_class=len(cfg.CLASS_NAMES), dataset=demo_dataset)
    model.load_params_from_file(filename=pth_name, logger=logger, to_cpu=True)
    model.eval()
    return model

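def cfg_from_yaml_file_wrap(yaml_name, cfg):
    # Parse the config from within OpenPCDet's tools/ folder, then restore the working directory
    cwd = os.getcwd()
    os.chdir(openpcdet_clonedir+'/tools/')
    try:
        cfg_from_yaml_file(yaml_name, cfg)
    finally:
        # Restore the original directory even if parsing fails
        os.chdir(cwd)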
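
The wrapper above temporarily switches into OpenPCDet's `tools/` folder, presumably because the configuration files refer to other files by paths relative to that folder (an assumption based on how the wrapper is written). The same pattern can be sketched as a reusable context manager; `working_directory` is a hypothetical name, and a temporary folder stands in for `tools/` so the example runs on its own.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the notebook's cwd is not left changed.
        os.chdir(previous)

# Demonstration with a throwaway directory standing in for OpenPCDet's tools/ folder.
start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    try:
        with working_directory(tmp):
            raise RuntimeError("simulated config parsing failure")
    except RuntimeError:
        pass
    restored = os.getcwd() == start
print("cwd restored:", restored)
```

On Python 3.11 and later, `contextlib.chdir` from the standard library offers the same behavior.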

### Run a Sanity Test
Process a point cloud to ensure the model works correctly.

In [None]:
cfg_from_yaml_file_wrap(yaml_name, cfg)
# Uncomment to inspect the dataset configuration:
# display(cfg.DATA_CONFIG)

demo_dataset = ohu.DemoDataset(
    dataset_cfg=cfg.DATA_CONFIG, class_names=cfg.CLASS_NAMES, training=False,
    root_path=Path(demo_pointcloud), ext=pc_file_extention, logger=logger
)
logger.info(f'Total number of samples: \t{len(demo_dataset)}')

model = get_model(cfg, pth_name, demo_dataset)
model_cpu = ohu.PointPillarsCPU(model)

with torch.no_grad():
    for idx, data_dict in enumerate(demo_dataset):        
        data_dict = demo_dataset.collate_batch([data_dict])
        # pred_dicts, _ = model_cpu(data_dict)
        pred_dicts = model_cpu(data_dict)        
        break

print(pred_dicts)

## Integrate with Hailo Hardware
We will offload the 2D backbone and detection head computations to the Hailo device.

### Export the 2D Backbone and Detection Head to ONNX
Extract the 2D convolutional parts and export them to ONNX format.

In [None]:
# Define the module that includes backbone_2d and dense_head
bev_w_head = ohu.Bev_w_Head(model.backbone_2d, model.dense_head)

# Export to ONNX
torch.onnx.export(
    bev_w_head,
    args=(data_dict['spatial_features'],),
    f="pp_bev_w_head.onnx",
    verbose=False
)


Simplify the ONNX model:

In [None]:
!onnxsim pp_bev_w_head.onnx pp_bev_w_head_simple.onnx

In [None]:
import onnxruntime
from hailo_sdk_client import ClientRunner
from hailo_sdk_common.targets.inference_targets import SdkNative
from hailo_sdk_client import InferenceContext
import tensorflow as tf
import hailo_sdk_client, hailo_sdk_common
print(hailo_sdk_client.__version__)

### Translate ONNX to Hailo Format
Use the Hailo SDK to translate the ONNX model.

In [None]:
from hailo_sdk_client import ClientRunner

runner = ClientRunner(hw_arch='hailo8')

onnx_path = "pp_bev_w_head_simple.onnx"
hn, npz = runner.translate_onnx_model(onnx_path)

# Save the translated model
har_name = 'pp_bev_w_head.har'
runner.save_har(har_name)

### Verify Inference Equivalence with Hailo Emulation
Ensure the Hailo-emulated model produces similar results.

In [19]:
class Bev_W_Head_Hailo(torch.nn.Module):
    """ Drop-in replacement for the sequence of the original "backbone_2d" and "dense_head" modules.
        Accepts and returns a dictionary, while under the hood it uses the Hailo [emulator]
        implementation for the 2D CNN part, which takes and returns tensors.
    """
    def __init__(self, runner, emulate_quantized=False, use_hw=False, generate_predicted_boxes=None):
        super().__init__()
        self._runner = runner
        self.generate_predicted_boxes = generate_predicted_boxes
        
        if use_hw:
            context_type = InferenceContext.SDK_HAILO_HW
        elif emulate_quantized:
            context_type = InferenceContext.SDK_QUANTIZED 
        else:
            context_type = InferenceContext.SDK_FP_OPTIMIZED
            
        with runner.infer_context(context_type) as ctx:
            self._hailo_model = runner.get_keras_model(ctx)   
            
    def forward(self, data_dict):        
        spatial_features = data_dict['spatial_features']
        
        spatial_features_hailoinp = np.transpose(spatial_features.cpu().detach().numpy(), (0,2,3,1))
        
        # ============ Hailo-emulation of the Hailo-mapped part ==========
        spatial_features_2d, cls_preds, box_preds, dir_cls_preds = \
                            self._hailo_model(spatial_features_hailoinp)
        # ================================================================
        
        print(cls_preds.shape, type(cls_preds), box_preds.shape)
        cls_preds = torch.Tensor(cls_preds.numpy()) # .permute(0, 2, 3, 1).contiguous()          # [N, H, W, C]
        box_preds = torch.Tensor(box_preds.numpy()) # .permute(0, 2, 3, 1).contiguous()          # [N, H, W, C]
        dir_cls_preds = torch.Tensor(dir_cls_preds.numpy()) # .permute(0, 2, 3, 1).contiguous()
                
        data_dict['spatial_features_2d'] = torch.Tensor(spatial_features_2d.numpy())
        
        batch_cls_preds, batch_box_preds = self.generate_predicted_boxes(
            batch_size=data_dict['batch_size'],
            cls_preds=cls_preds, box_preds=box_preds, dir_cls_preds=dir_cls_preds
        )
        data_dict['batch_cls_preds'] = batch_cls_preds
        data_dict['batch_box_preds'] = batch_box_preds
        data_dict['cls_preds_normalized'] = False

        return data_dict
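
The class above feeds the Hailo model a channels-last tensor by transposing with axes `(0, 2, 3, 1)`, and it consumes the outputs without permuting them back, since the detection head's box generation already expects `[N, H, W, C]`. The sketch below only illustrates how that axes tuple rearranges a shape and how to derive the inverse tuple. It is pure Python, the helper names are hypothetical, and the example shape is inferred from the calibration array used later in this notebook.

```python
def permute_shape(shape, axes):
    """Shape that np.transpose(x, axes) would produce for an array of `shape`."""
    return tuple(shape[a] for a in axes)

def inverse_axes(axes):
    """Axes tuple that undoes a transpose performed with `axes`."""
    inverse = [0] * len(axes)
    for new_pos, old_pos in enumerate(axes):
        inverse[old_pos] = new_pos
    return tuple(inverse)

NCHW_TO_NHWC = (0, 2, 3, 1)
nchw = (1, 64, 496, 432)  # batch, channels, height, width
nhwc = permute_shape(nchw, NCHW_TO_NHWC)
NHWC_TO_NCHW = inverse_axes(NCHW_TO_NHWC)
print(nhwc)          # (1, 496, 432, 64)
print(NHWC_TO_NCHW)  # (0, 3, 1, 2)
```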


In [None]:
def quick_test(runner, hailoize=True, emulate_quantized=False, use_hw=False, fname='00001.npy', verbose=False):
    """ Encapsulates a minimalistic test of the complete network with/without hailo offload emulation 
    """
    demo_dataset = ohu.DemoDataset(
        dataset_cfg=cfg.DATA_CONFIG, class_names=cfg.CLASS_NAMES, training=False,
        root_path=Path(fname), ext=pc_file_extention, logger=logger
    )
    logger.info(f'Total number of samples: \t{len(demo_dataset)}')
    
    model_h = build_network(model_cfg=cfg.MODEL, num_class=len(cfg.CLASS_NAMES), dataset=demo_dataset)
    model_h.load_params_from_file(filename=pth_name, logger=logger, to_cpu=True)
    model_h.eval()

    #bb2d_hailo1 = BB2d_Hailo(runner, pppost_onnx='./pp_tmp_post.onnx', emulate_quantized=emulate_quantized, use_hw=use_hw)
    bev_w_head_hailo = Bev_W_Head_Hailo(runner, generate_predicted_boxes=model_h.dense_head.generate_predicted_boxes,
                                        emulate_quantized=emulate_quantized, use_hw=use_hw)    
    
    if hailoize:        
        # ==== Hook a call into Hailo by replacing parts of sequence by our rigged submodule ====
        model_h.module_list = model_h.module_list[:2] + [bev_w_head_hailo]
        # =======================================================================================                                          
    
    model_cpu = ohu.PointPillarsCPU(model_h) 

    for idx, data_dict in enumerate(demo_dataset):
        logger.info(f'Processing sample index: \t{idx + 1}')
        data_dict = demo_dataset.collate_batch([data_dict])            
        # pred_dicts, _ = model_cpu.forward(data_dict)
        pred_dicts = model_cpu.forward(data_dict)
        if verbose:
            print(pred_dicts)
        else:
            print(pred_dicts[0][0]['pred_scores'][:7])
            
quick_test(runner, hailoize=False, emulate_quantized=False)     
quick_test(runner, hailoize=True, emulate_quantized=False)     
# Both runs should give exactly the same result, since we are not yet emulating the HW datapath
# with its "lossy-compression" (e.g., 8-bit) features.
# That becomes possible after calibration and quantization of the model, which also enables compilation for physical HW.
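
Comparing the two printouts by eye works for a handful of scores. A small helper can make the check explicit, assuming the score lists from each run have been collected (`quick_test` as written only prints them). This is a minimal sketch: `max_abs_diff` is a hypothetical name, and the score values are made up for illustration rather than measured.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences differ in length")
    return max((abs(float(r) - float(c)) for r, c in zip(reference, candidate)), default=0.0)

# Made-up scores, for illustration only (not measured results).
native_scores = [0.91, 0.84, 0.40]
emulated_scores = [0.91, 0.84, 0.40]
print(max_abs_diff(native_scores, emulated_scores))  # 0.0
```

The same helper, compared against a small tolerance instead of zero, could be reused once the quantized emulation is available.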


## Optimize and Compile for Hailo Hardware
### Create Calibration Dataset
Prepare a dataset for quantizing the model.  
Make sure you provide a path to at least 8 point cloud samples.
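
A quick way to check the sample count before building the calibration set is sketched below. `count_samples` is a hypothetical helper, and the demonstration uses a temporary folder with empty placeholder files so that it runs on its own.

```python
from pathlib import Path
import tempfile

def count_samples(folder, ext=".npy"):
    """Count files directly inside `folder` whose suffix equals `ext`."""
    return sum(1 for p in Path(folder).iterdir() if p.is_file() and p.suffix == ext)

# Demonstration on a temporary folder holding eight empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(8):
        (Path(tmp) / f"{i:05d}.npy").touch()
    (Path(tmp) / "notes.txt").touch()
    n_found = count_samples(tmp)
print(n_found)  # 8
```

In this notebook the call would be `count_samples(sample_pointclouds, pc_file_extention)`, and the result should be at least 8.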

In [None]:
cfg_from_yaml_file_wrap(yaml_name, cfg)
demo_dataset = ohu.DemoDataset(
    dataset_cfg=cfg.DATA_CONFIG, class_names=cfg.CLASS_NAMES, training=False,
    root_path=Path(sample_pointclouds), ext=pc_file_extention, logger=logger
)
logger.info(f'Total number of samples: \t{len(demo_dataset)}')
model = build_network(model_cfg=cfg.MODEL, num_class=len(cfg.CLASS_NAMES), dataset=demo_dataset)
model.load_params_from_file(filename=pth_name, logger=logger, to_cpu=True)
model.cuda()
model.eval()
cs_size = 8
calib_set = np.zeros((cs_size, 496, 432, 64))
np.set_printoptions(precision=2)

perc4stats = [50, 90, 98.6, 99.7, 99.9]
with torch.no_grad():
    for idx, _data_dict in enumerate(demo_dataset):        
        if idx >= cs_size:
            break
        print(f"cloud #{idx}")
        _data_dict = demo_dataset.collate_batch([_data_dict])
        load_data_to_gpu(_data_dict)
        # pred_dicts, goo = model.forward(_data_dict) 
        pred_dicts = model.forward(_data_dict)       
        calib_set[idx] = np.transpose(_data_dict['spatial_features'].cpu().numpy(), (0,2,3,1))
        # Basic stats just to verify there's some data diversity (just in top percentile apparently...)
        print(f'basic stats - percentile {perc4stats} of data (@ 2d-net input)', \
              np.percentile((np.abs(calib_set[idx])), perc4stats))
    

In [31]:
np.save('calib.npy', calib_set)

### Run Model Optimization (Quantization)
Optimize and quantize the model using the Hailo SDK.

In [None]:
# Load calibration dataset
calib_set = np.load('calib.npy')

# Optimize the model
runner.optimize(calib_set)

# Save the quantized model
q_har_name = 'pp_bev_w_head.q.har'
runner.save_har(q_har_name)
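
To give an intuition for why the quantized model's results differ slightly from the full-precision ones, the sketch below round-trips a few values through a generic uniform 8-bit quantizer. This is an illustration of the general idea only, under the assumption of a simple min/max affine mapping. It is not the algorithm the Hailo SDK applies, and the input values are made up.

```python
def quantize_uint8(values):
    """Map floats onto integer codes 0..255 with a uniform (affine) grid."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero step when all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]

activations = [-1.0, -0.25, 0.0, 0.3, 2.0]  # made-up values for illustration
codes, scale, lo = quantize_uint8(activations)
restored = dequantize(codes, scale, lo)
worst_error = max(abs(a - r) for a, r in zip(activations, restored))
print(codes)
print(worst_error <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

The calibration set matters because the value range seen during calibration determines how coarse that grid is.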


### Compile the Model for Hailo Hardware
Compile the quantized model to generate a HEF file.

In [None]:
# Load quantized model
runner = ClientRunner(har_path=q_har_name)

# Compile the model
compiled_model = runner.compile()

# Save the compiled model
hef_name = 'pp_bev_w_head.hef'
with open(hef_name, 'wb') as f:
    f.write(compiled_model)


In [None]:
quick_test(runner, hailoize=False)

In [None]:
quick_test(runner, hailoize=True, emulate_quantized=True)

# The results here should differ only slightly from the native run above

## Run Inference with Hailo Offload
Set up Hailo Runtime to run inference with the Hailo device.


In [None]:
%%time

do_compile = True
if do_compile:
    # Write a one-line model script (.alls) and load it before compiling
    alls_line1 = 'shortcut_concat1_conv20 = shortcut(concat1, conv20)\n'
    with open('helper.alls', 'w') as f:
        f.write(alls_line1)
    runner.load_model_script('./helper.alls') 

    compiled_model = runner.compile()
    with open(hef_name, 'wb') as f:
        f.write(compiled_model)

# Expected results:
# [info] | Cluster   | Control Utilization | Compute Utilization | Memory Utilization |
# [info] +-----------+---------------------+---------------------+--------------------+
#                                        ...
# [info] +-----------+---------------------+---------------------+--------------------+
# [info] | Total     | 62.5%               | 63.7%               | 37.2%              |

In [None]:
import time
from multiprocessing import Process, Queue
from hailo_platform import (HEF, PcieDevice, VDevice, HailoStreamInterface, ConfigureParams,
 InputVStreamParams, OutputVStreamParams, InputVStreams, OutputVStreams, FormatType)

def send_from_queue(configured_network, read_q, num_images, start_time):
    """ Bridging a queue into Hailo platform FEED. To be run as a separate process. 
        Reads (preprocessed) images from a given queue, and sends them serially to Hailo platform.        
    """    
    configured_network.wait_for_activation(1000)
    vstreams_params = InputVStreamParams.make(configured_network, quantized=False, format_type=FormatType.FLOAT32)
    print('Starting sending input images to HW inference...\n')
    with InputVStreams(configured_network, vstreams_params) as vstreams:
        vstream_to_buffer = {vstream: np.ndarray([1] + list(vstream.shape), dtype=vstream.dtype) for vstream in vstreams}
        for i in range(num_images):
            hailo_inp = read_q.get()
            for vstream, _ in vstream_to_buffer.items():                                
                vstream.send(hailo_inp)
            print(f'sent img #{i}')
    print(F'Finished send after {(time.time()-start_time) :.1f}')
    return 0

def recv_to_queue(configured_network, write_q, num_images, start_time):
    """ Bridging Hailo platform OUTPUT into a queue. To be run as a separate process. 
        Reads output data from Hailo platform and sends them serially to a given queue.
    """
    configured_network.wait_for_activation(1000)
    vstreams_params = OutputVStreamParams.make_from_network_group(configured_network, quantized=False, format_type=FormatType.FLOAT32)
    print('Starting receiving HW inference output...\n')
    with OutputVStreams(configured_network, vstreams_params) as vstreams:
        # print('vstreams_params', vstreams_params)
        for i in range(num_images):            
            hailo_out = {vstream.name: np.expand_dims(vstream.recv(), 0) for vstream in vstreams}    
            
            print("hailo_out keys:", hailo_out.keys())
                      
            write_q.put(hailo_out)
            print(f'received img #{i}')
    print(F'Finished recv after {time.time()-start_time :.1f}')
    return 0

def generate_data_dicts(demo_dataset, num_images, pp_pre_bev_w_head):
    for idx, data_dict in enumerate(demo_dataset):
        if idx >= num_images:
            break
        data_dict = demo_dataset.collate_batch([data_dict])
        ohu.load_data_to_CPU(data_dict)
        # Add sample_name to data_dict with only the file name
        data_dict['sample_name'] = os.path.basename(demo_dataset.sample_file_list[idx])
        # ------ (!) Applying torch PRE-processing -------
        data_dict = pp_pre_bev_w_head.forward(data_dict)
        # ------------------------------------------------
        logger.info(f'preprocessed sample #{idx}')
        yield data_dict

def generate_hailo_inputs(demo_dataset, num_images, pp_pre_bev_w_head):
    """ generator-style encapsulation for preprocessing inputs for Hailo HW feed
    """
    for data_dict in generate_data_dicts(demo_dataset, num_images, pp_pre_bev_w_head):
        spatial_features = data_dict['spatial_features']
        spatial_features_hailoinp = np.transpose(spatial_features.cpu().detach().numpy(), (0, 2, 3, 1))
        yield data_dict, spatial_features_hailoinp

def post_proc_from_queue(recv_queue, num_images, pp_post_bev_w_head,
                         output_layers_order=('model/concat1', 'model/conv19', 'model/conv18', 'model/conv20')):
    results = []
    for i in range(num_images):
        t_ = time.time()
        while(recv_queue.empty() and time.time()-t_ < 3):
            time.sleep(0.01)
        if recv_queue.empty():
            print("RECEIVE TIMEOUT!")
            break
        hailo_out = recv_queue.get(0)
        bev_out = (hailo_out[lname] for lname in output_layers_order)
        
        # ------ (!) Applying torch POST-processing -------
        pred_dicts, _ = pp_post_bev_w_head(bev_out)
        # pred_dicts = pp_post_bev_w_head(bev_out)
        # ------------------------------------------------
        # Add sample_name to each prediction dictionary
        sample_name = recv_queue.sample_names[i]

        # Add 'sample_name' to the dictionary
        pred_dicts['sample_name'] = sample_name

        # Append the dictionary to results
        results.append(pred_dicts)
    
    return results
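
The functions above form a three-stage pipeline: a feeder pushes inputs into one queue, the device moves data from that queue to a second one, and a collector post-processes what arrives. The following self-contained sketch shows the same pattern under simplifying assumptions: threads replace processes, a hypothetical `fake_device` function stands in for the Hailo hardware, and the inputs are made up. It also shows two possible variations: a blocking `get(timeout=...)` instead of a polling loop, and passing the sample name through the queues together with the payload.

```python
import queue
import threading

def fake_device(send_q, recv_q, num_items):
    """Stand-in for the accelerator: doubles every value it receives."""
    for _ in range(num_items):
        name, payload = send_q.get()
        recv_q.put((name, [2 * x for x in payload]))

def collect(recv_q, num_items, timeout_s=3.0):
    """Read results in order, stopping early if none arrives within the timeout."""
    results = []
    for _ in range(num_items):
        try:
            name, output = recv_q.get(timeout=timeout_s)
        except queue.Empty:
            print("RECEIVE TIMEOUT!")
            break
        results.append({"sample_name": name, "output": output})
    return results

samples = [("00001.npy", [1, 2]), ("00002.npy", [3, 4])]  # made-up inputs
send_q, recv_q = queue.Queue(), queue.Queue()
worker = threading.Thread(target=fake_device, args=(send_q, recv_q, len(samples)))
worker.start()
for item in samples:
    send_q.put(item)
results = collect(recv_q, len(samples))
worker.join()
print(results)
```

`multiprocessing.Queue.get` accepts the same `timeout` argument and raises `queue.Empty` as well, so the collector logic carries over to a process-based setup.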

Set ``num_images`` to the number of samples you are going to process.

In [None]:
import time, onnxruntime

data_source = demo_pointcloud  # replace by a folder for a more serious test
num_images = 1

cfg_from_yaml_file_wrap(yaml_name, cfg)
logger = common_utils.create_logger()
demo_dataset = ohu.DemoDataset(
    dataset_cfg=cfg.DATA_CONFIG, class_names=cfg.CLASS_NAMES, training=False,
    root_path=Path(data_source), ext=pc_file_extention, logger=logger
)
model = get_model(cfg, pth_name, demo_dataset)

# Library creates the anchors in cuda by default (applying .cuda() in internal implementation)
model.dense_head.anchors = [anc.cpu() for anc in model.dense_head.anchors]

# (!) Slice off from the torch model everything that happens before and after Hailo

pp_pre_bev_w_head = ohu.PP_Pre_Bev_w_Head(model)
pp_post_bev_w_head = ohu.PP_Post_Bev_w_Head(model)
    
with VDevice() as target:
    hef = HEF(hef_name)
    configure_params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = target.configure(hef, configure_params)[0]
    network_group_params = network_group.create_params()
    recv_queue = Queue()
    send_queue = Queue()
    start_time = time.time()
    results = []
    hw_send_process = Process(target=send_from_queue, args=(network_group, send_queue, num_images, start_time))
    hw_recv_process = Process(target=recv_to_queue, args=(network_group, recv_queue, num_images, start_time))

    sample_names = []

    with network_group.activate(network_group_params):
        hw_recv_process.start()
        hw_send_process.start()

        tik = time.time()

        for data_dict, hailo_inp in generate_hailo_inputs(demo_dataset, num_images, pp_pre_bev_w_head):
            send_queue.put(hailo_inp)
            sample_names.append(data_dict['sample_name'])
        
        recv_queue.sample_names = sample_names

        results = post_proc_from_queue(recv_queue, num_images, pp_post_bev_w_head)

        # Stop timing after processing
        tok = time.time()

        elapsed_time = tok - tik
        average_time_per_image = elapsed_time / num_images
        inference_rate_hz = num_images / elapsed_time

        print(f"Total elapsed time: {elapsed_time:.4f} seconds")
        print(f"Average time per image: {average_time_per_image:.4f} seconds")
        print(f"Inference rate: {inference_rate_hz:.2f} Hz")
                             
    hw_recv_process.join(10)
    hw_send_process.join(10)
    
    pred_dicts = results[-1]
    print(pred_dicts['pred_scores'])
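
The timing in the cell above spans host-side preprocessing, the transfer to and from the device, and post-processing, so the reported rate describes the whole pipeline rather than the Hailo device alone. With `num_images = 1` it mostly reflects single-sample latency. The sketch below isolates the arithmetic in a small helper with a hypothetical name, and uses `time.perf_counter`, a monotonic clock intended for measuring intervals. The workload is a placeholder.

```python
import time

def throughput(elapsed_s, num_items):
    """Return (average seconds per item, items per second)."""
    if elapsed_s <= 0 or num_items <= 0:
        raise ValueError("elapsed time and item count must be positive")
    return elapsed_s / num_items, num_items / elapsed_s

start = time.perf_counter()
total = sum(range(100_000))  # placeholder workload standing in for the pipeline
elapsed = time.perf_counter() - start
avg_s, rate_hz = throughput(max(elapsed, 1e-9), 1)
print(f"{avg_s:.6f} s per item, {rate_hz:.1f} Hz")
```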

## Conclusion
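In this notebook, we demonstrated how to set up a real-time inference pipeline for 3D object detection using the PointPillars network with acceleration from a Hailo device. The 2D backbone and detection head, which hold the heavy convolutional computations, run on the Hailo hardware, while pre- and post-processing remain in PyTorch on the host. The final cell reports the resulting inference rate, which you can use to judge whether the pipeline meets your real-time requirements.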

---

Note: Ensure all custom utilities (openpcdet2hailo_utils.py and any other required modules) are properly imported and available in your environment. Adjust file paths and configurations according to your setup. Some sections, especially the Hailo integration parts, may require additional implementation details based on your specific hardware and software environment.