
SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.



# Getting Started with TensorRT: Accelerate Your Deep Learning Inference

Welcome to your first TensorRT tutorial! In this notebook, you'll learn how to:
1. Load a pre-trained EfficientNet model in ONNX format
2. Convert it to a TensorRT engine for faster inference
3. Run inference and see the speedup firsthand
4. Make predictions on real images

## Understanding ONNX: The Universal Model Format

ONNX (Open Neural Network Exchange) is a standard format for representing deep learning models. Think of it as a universal language that different deep learning frameworks can understand. Here's why it's important:

- **Framework Independence**: Models trained in PyTorch, TensorFlow, or other frameworks can be exported to ONNX
- **Interoperability**: ONNX models can be imported into various inference engines and frameworks
- **Production Ready**: ONNX is widely used in production environments for model deployment

### The ONNX to TensorRT Workflow

TensorRT is NVIDIA's deep learning inference optimizer that can import models from ONNX. This makes it a powerful tool in your deployment pipeline:

```
Your Framework (PyTorch/TF/etc.) → ONNX → TensorRT → Optimized Inference
```

This workflow is particularly powerful because:
1. You can train your model in any framework you prefer
2. Export it to ONNX (a one-time conversion)
3. Use TensorRT to optimize it for NVIDIA GPUs
4. Get significant speedup in production

## Prerequisites

Before we start, make sure you have:
- NVIDIA GPU with CUDA support
- Python 3.8+ installed
- Basic understanding of deep learning and inference
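
A quick way to see which of the Python packages used in this notebook still need installing is to probe for them without importing them. The sketch below is a minimal helper under stated assumptions: the function name `missing_modules` is hypothetical, and the list holds the import names (not the pip package names) of the libraries this notebook relies on.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages used in this notebook
required = ["tensorrt", "cuda", "PIL", "numpy", "onnxruntime"]
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print("Missing modules:", missing_modules(required) or "none")
```

This only checks that the modules can be located; it does not verify that a CUDA-capable GPU or a matching driver is present.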

Let's begin by installing and importing the required packages:

In [None]:
%pip install tensorrt cuda-python pillow onnxruntime-gpu==1.17.0
import tensorrt as trt
from cuda.bindings import runtime as cudart
from PIL import Image
import numpy as np
from pathlib import Path
import time
from typing import Optional, Union, Tuple

In [None]:
root = Path.cwd()

In [None]:
# define a function to download files

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def download_file(url: str, output_path: Union[str, Path]):
    """Download a file with retry mechanism."""
    output_path = Path(output_path)  # accept both str and Path
    session = requests.Session()
    retry = Retry(total=10, backoff_factor=1)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    # TLS certificate verification stays on (the requests default)
    response = session.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of saving an HTTP error page
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, 'wb') as f:
        f.write(response.content)

## Step 1: Download a Pre-trained Model

We'll use EfficientNet-B0, a popular and efficient image classification model, as the example in this tutorial.

### Understanding ONNX Model Structure

Like most model formats, an ONNX model contains:
- Model architecture (layers, connections)
- Weights and biases
- Input/output specifications
- Metadata about the model

This standardized format makes it easy to move models between different frameworks and inference engines.

In [None]:
download_file("https://github.com/onnx/models/raw/refs/heads/main/Computer_Vision/efficientnet_b0_Opset17_timm/efficientnet_b0_Opset17.onnx", root / "efficientnet-b0.onnx")
assert (root / "efficientnet-b0.onnx").exists(), "Model file not found. Please check if the download was successful."

## Step 2: Convert ONNX to TensorRT Engine

This is where the magic happens! We'll convert our ONNX model into a TensorRT engine. The engine is optimized for your specific GPU and will run much faster than the original model.

### The Conversion Process

1. **Load ONNX Model**: TensorRT reads the ONNX file and understands the model structure
2. **Optimize**: TensorRT performs several optimizations:
   - Layer fusion
   - Memory optimization
   - Precision calibration
3. **Generate Engine**: Creates a highly optimized inference engine

Because the engine is tuned for the GPU it was built on, build it on the same kind of GPU you plan to run it on.

In [None]:
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()

# Bind the TensorRT network to the parser so that the parser can update the network later accordingly
parser = trt.OnnxParser(network, logger)

onnx_path = root / "efficientnet-b0.onnx"
print(f'Parsing ONNX model at {onnx_path}...')
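with open(onnx_path, "rb") as model:
    if not parser.parse(model.read()):
        # Surface the parser diagnostics instead of continuing with a broken network
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError(f'Failed to parse ONNX model at {onnx_path}')
print('Parsing ONNX model... done')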

Now that we have the TensorRT `INetworkDefinition`, we can start building the engine.

In [None]:
config = builder.create_builder_config()

# TensorRT needs memory for layer operations and intermediate activations during inference
# Setting a memory limit helps control resource usage and prevents out-of-memory errors
config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE, 1 << 30
) # 1GB

print('Starting to build engine. This might take several minutes depending on the hardware...')
engine = builder.build_serialized_network(network, config)
assert engine is not None, 'Engine build failed'

engine_path = root / "efficientnet-b0.plan"
with open(engine_path, 'wb') as f:
    f.write(engine)

print("TensorRT engine created successfully!")

## Optional: Using Editable Timing Cache

TensorRT engines may vary between builds because kernel selection is based on runtime performance measurements. The hardware state (GPU utilization, temperature, system load) affects which kernels are chosen since kernels might outperform each other under different scenarios. 

To ensure consistent builds, TensorRT provides an editable timing cache that:
- Stores intermediate optimization results
- Enables deterministic engine builds
- Speeds up subsequent builds since they don't need to measure kernel execution time for each op again
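
A timing cache only helps across sessions if it is saved somewhere. The sketch below shows one way to persist it to disk and reload it before a build. The helper names `load_timing_cache` and `save_timing_cache` are hypothetical, and `config` is assumed to be a `trt.IBuilderConfig` such as the one returned by `builder.create_builder_config()`.

```python
from pathlib import Path


def load_timing_cache(config, cache_path):
    """Attach a timing cache to `config`, seeded from disk when a file exists.

    `config` is assumed to be a trt.IBuilderConfig.
    """
    cache_path = Path(cache_path)
    # An empty byte string asks TensorRT to start a fresh cache
    blob = cache_path.read_bytes() if cache_path.is_file() else b""
    cache = config.create_timing_cache(blob)
    # ignore_mismatch=False: do not accept a cache recorded with
    # different CUDA device properties
    config.set_timing_cache(cache, False)
    return cache


def save_timing_cache(cache, cache_path):
    """Write the (possibly updated) timing cache back to disk."""
    cache_path = Path(cache_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "wb") as f:
        # serialize() returns a buffer object; memoryview exposes its bytes
        f.write(memoryview(cache.serialize()))
```

A typical flow would call `load_timing_cache(config, "timing.cache")` before `build_serialized_network` and `save_timing_cache(...)` afterwards, so that later builds on the same GPU can reuse the recorded measurements.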

In [None]:
def build_engine_with_cache(onnx_path: Union[str, Path], timing_cache: Optional[trt.ITimingCache]):
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, 'rb') as model:
        parser.parse(model.read())
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    
    # Enable editable timing cache
    config.set_flag(trt.BuilderFlag.EDITABLE_TIMING_CACHE)

    # Create timing cache if not provided
    if timing_cache is None:
        timing_cache = config.create_timing_cache(bytes())
    config.set_timing_cache(timing_cache, True)
    
    # Build engine
    print('Start building engine...')
    tik = time.time()
    engine = builder.build_serialized_network(network, config)
    tok = time.time()
    
    # time.time() returns seconds, so report the duration in seconds
    print(f'Engine build cost {tok - tik:.2f} s')
    return engine, timing_cache

# First build (creates cache)
engine1, timing_cache = build_engine_with_cache(onnx_path, None)
print("First build completed with cache creation")

# Second build (uses cache)
engine2, timing_cache = build_engine_with_cache(onnx_path, timing_cache)
print("Second build completed using the existing cache")

is_identical = np.array_equal(
    np.frombuffer(engine1, dtype=np.uint8),
    np.frombuffer(engine2, dtype=np.uint8))
print(f'Is engine identical: {is_identical}')

## Step 3: Run Inference and Compare Performance

Now let's see the real power of TensorRT! We'll:
1. Run inference with both ONNX and TensorRT
2. Compare their performance
3. See the speedup TensorRT provides

### Understanding the Performance Difference

The speedup comes from several optimizations:
- Layer fusion: Combining multiple operations into one
- Memory optimization: Better memory access patterns
- Precision optimization: Using optimal precision for each layer
- CUDA optimization: Direct GPU execution without framework overhead
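
To make the idea of layer fusion concrete, the sketch below folds a per-channel scale and shift (the form an inference-time batch normalization takes) into the preceding linear layer. It illustrates the general principle with NumPy on made-up random data; it is not a description of TensorRT's internal fusion passes, and all variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4 inputs with 8 features
w = rng.standard_normal((8, 3))   # linear layer weights
b = rng.standard_normal(3)        # linear layer bias
scale = rng.standard_normal(3)    # per-channel scale
shift = rng.standard_normal(3)    # per-channel shift

# Unfused: two separate operations with an intermediate tensor in between
unfused = (x @ w + b) * scale + shift

# Fused: fold the scale and shift into the weights and bias ahead of time,
# so a single matrix multiply produces the same result
w_fused = w * scale
b_fused = b * scale + shift
fused = x @ w_fused + b_fused

print(np.allclose(unfused, fused))
```

The fused version reads and writes the activations once instead of twice, which is where the saving comes from on memory-bound layers.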

In [None]:
def load_and_preprocess_image(image_path: Union[str, Path], input_size: Tuple[int, int] = (224, 224)):
    img = Image.open(image_path).convert("RGB")  # guarantee exactly 3 channels
    img = img.resize(input_size)
    img = np.array(img).astype(np.float32)
    img = img / 255.0  # Normalize from [0, 255] to [0, 1]
    img = np.transpose(img, (2, 0, 1))  # HWC to CHW
    img = np.expand_dims(img, axis=0)  # Add batch dimension
    return img
    
def check_cuda_error(error):
    if isinstance(error, tuple):
        error = error[0]
    if error != cudart.cudaError_t.cudaSuccess:
        error_name = cudart.cudaGetErrorName(error)[1]
        error_string = cudart.cudaGetErrorString(error)[1]
        raise RuntimeError(f"CUDA Error: {error_name} ({error_string})")

def run_inference_trt(engine: trt.ICudaEngine, input_data: np.ndarray):
    # Create execution context - this stores the device memory allocations
    # and bindings needed for inference
    context = engine.create_execution_context()

    # Initialize lists to store input/output information and GPU memory allocations
    inputs = []
    outputs = []
    allocations = []
    
    # Iterate through all input/output tensors to set up memory and bindings
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        # Check if this tensor is an input or output
        is_input = engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT
        # Get tensor datatype and shape information
        dtype = engine.get_tensor_dtype(name)
        shape = engine.get_tensor_shape(name)
        
        # Calculate required memory size for this tensor
        size = np.dtype(trt.nptype(dtype)).itemsize
        for s in shape:
            size *= s
            
        # Allocate GPU memory for this tensor
        err, allocation = cudart.cudaMalloc(size)
        check_cuda_error(err)
        
        # Store tensor information in a dictionary for easy access
        binding = {
            "index": i,
            "name": name,
            "dtype": np.dtype(trt.nptype(dtype)),
            "shape": list(shape),
            "allocation": allocation,
            "size": size,
        }
        
        # Keep track of all allocations and sort tensors into inputs/outputs
        allocations.append(allocation)
        if is_input:
            inputs.append(binding)
        else:
            outputs.append(binding)

    # Ensure input data is contiguous in memory for efficient GPU transfer
    input_data = np.ascontiguousarray(input_data)
    
    # Copy input data from host (CPU) to device (GPU)
    err = cudart.cudaMemcpy(
        inputs[0]["allocation"],
        input_data.ctypes.data,
        inputs[0]["size"],
        cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
    )
    check_cuda_error(err)

    # Set tensor addresses for all tensors
    for i in range(engine.num_io_tensors):
        context.set_tensor_address(engine.get_tensor_name(i), allocations[i])

    # Create a CUDA stream for asynchronous execution
    err, stream = cudart.cudaStreamCreate()
    check_cuda_error(err)

    # Run inference using the TensorRT engine
    context.execute_async_v3(stream_handle=stream)
    err = cudart.cudaStreamSynchronize(stream)
    check_cuda_error(err)

    # Prepare numpy array for output and copy results from GPU to CPU
    output_shape = outputs[0]["shape"]
    output = np.empty(output_shape, dtype=outputs[0]["dtype"])

    err = cudart.cudaMemcpy(
        output.ctypes.data,
        outputs[0]["allocation"],
        outputs[0]["size"],
        cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,
    )
    check_cuda_error(err)

    # Free all GPU memory allocations
    for allocation in allocations:
        err = cudart.cudaFree(allocation)
        check_cuda_error(err)

    # Destroy the CUDA stream
    err = cudart.cudaStreamDestroy(stream)
    check_cuda_error(err)

    return output

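import onnxruntime as ort
def run_inference_onnx(session, input_data: np.ndarray):
    # Look up the input name from the model instead of hard-coding it
    input_name = session.get_inputs()[0].name
    output = session.run(None, {input_name: input_data})[0]
    return output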

## Let's Compare Performance!

We'll run both models multiple times to get an accurate comparison of their performance. This will show you the baseline speedup that TensorRT provides. 

Refer to https://docs.nvidia.com/deeplearning/tensorrt/latest/index.html for more information about how to further optimize your engine.
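
Latency numbers are easier to trust when one-time startup costs are excluded and more than the mean is reported. The helper below is a minimal sketch of such a measurement loop; the name `benchmark` and its parameters are hypothetical, and it assumes the callable blocks until its result is ready.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 10, iters: int = 100) -> Dict[str, float]:
    """Time `fn` and return latency statistics in milliseconds."""
    # Warm-up runs absorb one-time costs (lazy initialization, caches, clock ramp-up)
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
    }
```

For example, `benchmark(lambda: run_inference_onnx(session, sample_input))` would time the ONNX Runtime path. The median is less sensitive than the mean to occasional slow iterations caused by other activity on the machine.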

In [None]:
# Create a sample input
sample_input = np.random.randn(1, 3, 224, 224).astype(np.float32)

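# Benchmark ONNX Runtime (request the CUDA provider explicitly, with CPU as a fallback)
session = ort.InferenceSession(onnx_path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])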
onnx_times = []
for _ in range(100):
    start_time = time.time()
    _ = run_inference_onnx(session, sample_input)
    onnx_times.append(time.time() - start_time)

# Benchmark TensorRT
with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
trt_times = []
for _ in range(100):
    start_time = time.time()
    _ = run_inference_trt(engine, sample_input)
    trt_times.append(time.time() - start_time)

print(f"ONNX Runtime Average Time: {np.mean(onnx_times)*1000:.2f} ms")
print(f"TensorRT Average Time: {np.mean(trt_times)*1000:.2f} ms")
print(f"Speedup: {np.mean(onnx_times)/np.mean(trt_times):.2f}x")

## Step 4: Run Inference on a Real Image

Now let's try our optimized model on a real image! We'll:
1. Download a sample image
2. Load the ImageNet class labels
3. Make predictions and show the results

This will demonstrate how the optimized TensorRT engine performs in a real-world scenario.

In [None]:
# Download a sample image
download_file("https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg", root / "test_image.jpg")

from PIL import Image
from IPython.display import display

# Open and display the image
img = Image.open(root/"test_image.jpg")
display(img)

In [None]:
def load_imagenet_labels():
    # Download ImageNet labels if not exists
    if not (root / "imagenet_classes.txt").is_file():
        download_file("https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt", root / "imagenet_classes.txt")
    # Read the labels
    with open(root / "imagenet_classes.txt") as f:
        categories = [s.strip() for s in f.readlines()]
    return categories

# Load ImageNet labels
categories = load_imagenet_labels()

In [None]:
# Load and preprocess a test image
test_image_path = root / "test_image.jpg"
input_data = load_and_preprocess_image(test_image_path)

# Run inference
output = run_inference_trt(engine, input_data)

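# Convert raw scores to probabilities with a numerically stable softmax
# (assumes the model outputs logits, as classifiers exported from timm typically do)
logits = output[0].astype(np.float64)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Get top 5 predictions
top5_idx = np.argsort(probs)[-5:][::-1]
print("Top 5 predictions:")
for idx in top5_idx:
    print(f"{categories[idx]}: {probs[idx] * 100:.2f}%")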
assert categories[top5_idx[0]] == "Samoyed", 'Incorrect prediction'
print('Correctly recognized!')
print('Notebook executed successfully')

## Congratulations! 🎉

### You've successfully:
1. Loaded a pre-trained EfficientNet model in ONNX format
2. Converted it to a TensorRT engine
3. Achieved significant speedup in inference
4. Made predictions on real images
5. Learned how to use timing cache to speed up engine building and ensure engine build determinism. 

### What's Next?

Now that you understand the ONNX to TensorRT workflow, you can:
- Export your own models from PyTorch/TensorFlow to ONNX
- Try different optimization settings in TensorRT
- Apply this workflow to your production models and get an instant performance boost on NVIDIA GPUs!
