SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# 2. Constructing a Network with TensorRT Layer APIs

In this notebook, you'll learn how to move beyond pre-built model formats and directly construct a neural network using TensorRT's versatile Layer APIs. This approach offers fine-grained control over your network architecture and optimizations.

Specifically, we will cover:

1.  **Building a Recurrent Network (LSTM) from Scratch:** Understand how to define each layer of a Long Short-Term Memory (LSTM) cell and then use these components to construct an entire recurrent LSTM layer. This involves using various Layer API functionalities like `add_constant`, `add_matrix_multiply`, `add_elementwise`, `add_slice`, and `add_activation`.
2.  **Implementing Loops for Recurrence:** Utilize TensorRT's `add_loop` functionality to efficiently handle the recurrent nature of the LSTM, processing an input sequence step-by-step.
3.  **Monitoring Build Progress:** Implement an `IProgressMonitor` to track the engine creation process in real-time, providing visibility into potentially long build times.
4.  **Creating Version-Compatible Engines:** Learn to build and save TensorRT engines with the `BuilderFlag.VERSION_COMPATIBLE` flag so that they can be loaded by later TensorRT versions, not only the one that built them.

This sample uses a small, single-layer LSTM to keep the focus on these core TensorRT API features. We'll also verify its output against an equivalent NumPy implementation to ensure correctness.

## Introduction

While importing models via ONNX offers convenience, constructing networks directly with TensorRT APIs provides fine-grained control over the network definition. The **[TensorRT Layer API](https://docs.nvidia.com/deeplearning/tensorrt/latest/python_api/infer/Graph/Layers.html)** enables users to define each layer explicitly, offering flexibility and optimization opportunities.

To facilitate understanding and verification, this demonstration employs small tensors, allowing for direct comparison with an equivalent NumPy implementation.

> **Note: This sample assumes familiarity with the basic concepts of Long Short-Term Memory (LSTM) networks. If you're new to LSTMs, you might find it helpful to review their structure and operation before proceeding.**

## Step 0: Prerequisites

In [None]:
%pip install numpy tensorrt polygraphy --extra-index-url https://pypi.ngc.nvidia.com
import tensorrt as trt
import numpy as np
from typing import Tuple

To simplify, network parameters and weight initializations use small, illustrative values and dimensions.

In [None]:
# === Network Parameters & Weights Initialization ===
batch_size = 1
seq_len = 5      # Length of the sequence
input_size = 1   # Dimension of input vector at each time step
hidden_size = 2  # Dimension of hidden state and cell state
num_units = 1

# --- Create Fixed Dummy Weights and Biases (NumPy arrays with dummy values) ---
# These will be used by both the TensorRT build and the NumPy verification
w_val, u_val, b_val = 0.01, 0.05, 0.3
initial_h_val = 0.1
initial_c_val = 0.2

# Define shapes
w_shape = (input_size, 4 * hidden_size)        # e.g., [1, 8]
u_shape = (hidden_size, 4 * hidden_size)       # e.g., [2, 8]
b_shape = (4 * hidden_size,)                   # e.g., [8]
initial_h_shape = (batch_size, hidden_size)
initial_c_shape = (batch_size, hidden_size)

# Create NumPy arrays
np_weight_W = np.full(w_shape, w_val, dtype=np.float32)
np_weight_U = np.full(u_shape, u_val, dtype=np.float32)
np_bias = np.full(b_shape, b_val, dtype=np.float32)
np_initial_h = np.full(initial_h_shape, initial_h_val, dtype=np.float32)
np_initial_c = np.full(initial_c_shape, initial_c_val, dtype=np.float32)

# Create inputs for the network
np_inputs = np.ones((seq_len, batch_size, input_size), dtype=np.float32)

print("NumPy Weights Initialized:")
print(f"  W shape : {np_weight_W.shape}")
print(f"  U shape : {np_weight_U.shape}")
print(f"  Bias shape : {np_bias.shape}")
print(f"  Initial H shape : {np_initial_h.shape}")
print(f"  Initial C shape : {np_initial_c.shape}")
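
The factor of 4 in the weight shapes comes from fusing the four gate projections into one matrix, so a single matrix multiply produces all gate pre-activations at once. The sketch below illustrates this in plain NumPy. It assumes the column-block order input, forget, candidate, output that the slicing code later in this notebook uses; `per_gate_W`, `fused_W`, and the random values are illustrative and not part of the sample.

```python
import numpy as np

hidden_size = 2
input_size = 1
rng = np.random.default_rng(0)

# Hypothetical per-gate weights: one [input_size, hidden_size] block per gate.
gate_names = ["input", "forget", "candidate", "output"]
per_gate_W = {
    name: rng.standard_normal((input_size, hidden_size)).astype(np.float32)
    for name in gate_names
}

# Fuse the blocks side by side: [input_size, 4 * hidden_size].
fused_W = np.concatenate([per_gate_W[name] for name in gate_names], axis=1)

x_t = np.ones((1, input_size), dtype=np.float32)  # [batch_size, input_size]
fused_out = x_t @ fused_W                          # [batch_size, 4 * hidden_size]

# Column block k of the fused product equals the product with gate k's own weights.
for k, name in enumerate(gate_names):
    block = fused_out[:, k * hidden_size:(k + 1) * hidden_size]
    assert np.allclose(block, x_t @ per_gate_W[name])

print(fused_W.shape, fused_out.shape)  # (1, 8) (1, 8)
```

The same block indexing, `k * hidden_size` to `(k + 1) * hidden_size`, is what the TensorRT slice layers use to recover each gate.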


## Step 1: Defining LSTM Operations with the Layer API

This step involves defining the LSTM operations by adding layers to the TensorRT `INetworkDefinition`.
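
For reference, the computation that the functions in this step build can be written as equations. This is the standard LSTM formulation, arranged to match the code in this notebook: row-vector inputs, fused weights `W` and `U`, and gate blocks in the order input, forget, candidate, output. The symbol `z_t` for the fused pre-activation and `\sigma` for the logistic sigmoid are notation introduced here.

```latex
\begin{aligned}
z_t &= x_t W + h_{t-1} U + b, \qquad
z_t = \bigl[\, z_t^{(i)} \;\; z_t^{(f)} \;\; z_t^{(g)} \;\; z_t^{(o)} \,\bigr] \\
i_t &= \sigma\bigl(z_t^{(i)}\bigr), \quad
f_t = \sigma\bigl(z_t^{(f)}\bigr), \quad
g_t = \tanh\bigl(z_t^{(g)}\bigr), \quad
o_t = \sigma\bigl(z_t^{(o)}\bigr) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```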

### Typical Usage Pattern for TensorRT Layer APIs

When adding layers to a TensorRT network using the Layer API, the common pattern is:

1.  **Add the layer:** Use a `network.add_*` method (e.g., `network.add_matrix_multiply`) to add the desired layer. This method takes input tensors and layer-specific parameters, returning an `ILayer` object representing the newly added layer.
2.  **Configure the layer:** Use the returned `ILayer` object to configure its properties. This step is optional, but naming the layer and its output tensors makes debugging easier and logs more helpful.

```python
# Example: Adding and configuring a generic layer

# 1. Add the layer (replace with a specific layer like add_matrix_multiply)
layer = network.add_some_layer(input_tensor, ...)

# 2. Configure the layer (optional)
output_tensor = layer.get_output(0)
output_tensor.name = 'my_layer_output'  # Name the output
# ... other configurations ...
```

For a comprehensive list of available layer types and their specific methods and properties, consult the official [TensorRT Layer API documentation](https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/python-api/infer/Graph/Layers.html).

In [None]:
TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def add_lstm_unit(network: trt.INetworkDefinition,
                  input_x: trt.ITensor,      # Shape: [batch_size, input_size]
                  prev_h: trt.ITensor,       # Shape: [batch_size, hidden_size]
                  prev_c: trt.ITensor,       # Shape: [batch_size, hidden_size]
                  W: np.ndarray,            # Shape: [input_size, 4 * hidden_size]
                  U: np.ndarray,            # Shape: [hidden_size, 4 * hidden_size]
                  bias: np.ndarray,         # Shape: [4 * hidden_size]
                  hidden_size: int,
                  input_size: int
                  ) -> Tuple[trt.ITensor, trt.ITensor]:
    """
    Adds the computations for a single LSTM time step.
    Assumes input tensors have a leading batch dimension.
    """
    batch_size = input_x.shape[0] # Get batch size from input

    # Create constant layers for weights and biases
    W_layer = network.add_constant(W.shape, trt.Weights(W))
    W_layer.get_output(0).name = "W_const"
    U_layer = network.add_constant(U.shape, U)
    U_layer.get_output(0).name = "U_const"
    # Reshape bias for broadcasting: [4*hidden] -> [1, 4*hidden]
    bias_reshaped_np = np.expand_dims(bias.copy(), axis=0)
    bias_layer = network.add_constant(bias_reshaped_np.shape, bias_reshaped_np)
    bias_layer.get_output(0).name = "Bias_const"


    # Linear transformations: Wx = input_x * W ; Uh = prev_h * U
    # Wx = [batch, input] * [input, 4*hidden] = [batch, 4*hidden]
    mm_wx = network.add_matrix_multiply(input_x, trt.MatrixOperation.NONE,
                                        W_layer.get_output(0), trt.MatrixOperation.NONE)
    mm_wx.get_output(0).name = "Wx"

    # Uh = [batch, hidden] * [hidden, 4*hidden] = [batch, 4*hidden]
    mm_uh = network.add_matrix_multiply(prev_h, trt.MatrixOperation.NONE,
                                        U_layer.get_output(0), trt.MatrixOperation.NONE)
    mm_uh.get_output(0).name = "Uh"


    # Combined gates = Wx + Uh + Bias
    gates_wx_uh = network.add_elementwise(mm_wx.get_output(0), mm_uh.get_output(0),
                                         trt.ElementWiseOperation.SUM)
    gates_wx_uh.get_output(0).name = "Wx_plus_Uh"

    gates = network.add_elementwise(gates_wx_uh.get_output(0), bias_layer.get_output(0),
                                    trt.ElementWiseOperation.SUM)

    gates_output = gates.get_output(0) # Shape [batch, 4*hidden]
    gates_output.name = "Gates_Combined"

    # Split the combined gates tensor [batch, 4*hidden] -> four [batch, hidden] gate tensors (Input, Forget, Candidate, Output)
    def add_gate_slice(index):
        gate_slice_layer = network.add_slice(input=gates_output,
                                       start=(0, index * hidden_size), # Start [batch_idx=0, col_idx]
                                       shape=(batch_size, hidden_size), # Slice shape
                                       stride=(1, 1))                   # Stride
        return gate_slice_layer.get_output(0)

    slice_i = add_gate_slice(0)
    slice_i.name = "Slice_I"
    slice_f = add_gate_slice(1)
    slice_f.name = "Slice_F"
    slice_c = add_gate_slice(2)
    slice_c.name = "Slice_C_candidate" # Cell candidate
    slice_o = add_gate_slice(3)
    slice_o.name = "Slice_O"

    # Apply activations
    act_i_layer = network.add_activation(slice_i, trt.ActivationType.SIGMOID)
    act_i = act_i_layer.get_output(0)
    act_i.name = "Gate_I"
    act_f_layer = network.add_activation(slice_f, trt.ActivationType.SIGMOID)
    act_f = act_f_layer.get_output(0)
    act_f.name = "Gate_F"
    act_c_layer = network.add_activation(slice_c, trt.ActivationType.TANH)
    act_c = act_c_layer.get_output(0)
    act_c.name = "Gate_C_candidate"
    act_o_layer = network.add_activation(slice_o, trt.ActivationType.SIGMOID)
    act_o = act_o_layer.get_output(0)
    act_o.name = "Gate_O"

    # Cell state update: c_t = f_t * c_{t-1} + i_t * g_t
    term1_c = network.add_elementwise(act_f, prev_c, trt.ElementWiseOperation.PROD)
    term2_c = network.add_elementwise(act_i, act_c, trt.ElementWiseOperation.PROD)
    next_c_layer = network.add_elementwise(term1_c.get_output(0), term2_c.get_output(0), trt.ElementWiseOperation.SUM)

    next_c = next_c_layer.get_output(0)
    next_c.name = "next_c" # Shape [batch, hidden]

    # Hidden state update: h_t = o_t * tanh(c_t)
    tanh_c_layer = network.add_activation(next_c, trt.ActivationType.TANH)
    tanh_c = tanh_c_layer.get_output(0)
    next_h_layer = network.add_elementwise(act_o, tanh_c, trt.ElementWiseOperation.PROD)

    next_h = next_h_layer.get_output(0)
    next_h.name = "next_h" # Shape [batch, hidden]

    return next_h, next_c


def add_lstm_layer(network: trt.INetworkDefinition,
                   input_sequence: trt.ITensor, # Shape: [seq_len, batch_size, input_size]
                   hidden_size: int,
                   seq_len: int,
                   weight_W: np.ndarray, # [input_size, 4*hidden]
                   weight_U: np.ndarray, # [hidden, 4*hidden]
                   bias: np.ndarray    # [4*hidden]
                   ) -> trt.ITensor:
    """
    Adds an LSTM layer to the network: one LSTM unit is defined once and executed seq_len times inside a TensorRT loop.
    """
    # Infer input_size from the input tensor shape
    assert len(input_sequence.shape) == 3, f"Input sequence tensor must have 3 dimensions [seq, batch, input]. Got shape {input_sequence.shape}"
    input_size = input_sequence.shape[2]

    # Shape: [batch_size, hidden_size]
    initial_h = network.add_constant(np_initial_h.shape, np_initial_h).get_output(0)
    initial_h.name = "Initial_H"
    initial_c = network.add_constant(np_initial_c.shape, np_initial_c).get_output(0)
    initial_c.name = "Initial_C"

    loop = network.add_loop()
    loop.name = "Time_Loop_Layer"

    # add_trip_limit sets the loop's stopping condition. Here the loop runs exactly seq_len times (TripLimit.COUNT).
    trip_limit = network.add_constant((), np.array([seq_len], dtype=np.int32)).get_output(0)
    loop.add_trip_limit(trip_limit, trt.TripLimit.COUNT)

    # Recurrences for hidden and cell states
    h_recurrence = loop.add_recurrence(initial_h)
    c_recurrence = loop.add_recurrence(initial_c)
    prev_h_tensor = h_recurrence.get_output(0)
    prev_h_tensor.name = "Prev_H"
    prev_c_tensor = c_recurrence.get_output(0)
    prev_c_tensor.name = "Prev_C"

    # add_iterator iterates through slices of the input sequence along the specified axis, providing one slice per iteration.
    x_t_iterator = loop.add_iterator(input_sequence, axis=0)
    x_t = x_t_iterator.get_output(0)
    x_t.name = "x_t"


    # Call the LSTM unit function
    next_h, next_c = add_lstm_unit(network=network,
                                    input_x=x_t,
                                    prev_h=prev_h_tensor,
                                    prev_c=prev_c_tensor,
                                    W=weight_W,
                                    U=weight_U,
                                    bias=bias,
                                    hidden_size=hidden_size,
                                    input_size=input_size)

    # Feed the computed states back into the recurrence inputs
    h_recurrence.set_input(1, next_h)
    c_recurrence.set_input(1, next_c)

    # add_loop_output() collects the values in the loop and outputs them. For this example, we concatenate the values along the first axis.
    loop_output_h = loop.add_loop_output(next_h, trt.LoopOutput.CONCATENATE, axis=0)

    # With CONCATENATE, the second input gives the length of the concatenated dimension; here it equals the trip limit.
    loop_output_h.set_input(1, trip_limit)
    loop_output_h.get_output(0).name = "Hidden_Sequence"

    # --- End of time step loop definition ---

    layer_output_sequence = loop_output_h.get_output(0)

    # Guard against a missing loop output before renaming it
    if layer_output_sequence is None:
        raise RuntimeError("LSTM layer output was not generated")
    layer_output_sequence.name = "Final_LSTM_Output_Sequence"
    return layer_output_sequence
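
As a mental model only, the loop constructs used above map onto an ordinary Python loop. The sketch below is plain NumPy, not TensorRT code: `run_loop_model` and its parameters are hypothetical names, and the loop body is a simple running sum rather than an LSTM step. It mirrors the roles of the trip limit, the iterator (one slice along axis 0 per iteration), the recurrence (a value carried between iterations), and the two loop-output kinds `LAST_VALUE` and `CONCATENATE`.

```python
import numpy as np

def run_loop_model(sequence: np.ndarray, initial_state: np.ndarray):
    """Plain-Python analogue of a TensorRT loop with a single recurrence."""
    trip_count = sequence.shape[0]      # role of the COUNT trip limit
    state = initial_state               # role of add_recurrence(initial_value)
    per_step = []
    for t in range(trip_count):
        x_t = sequence[t]               # role of add_iterator(..., axis=0)
        next_state = state + x_t        # loop body (an LSTM step in the notebook)
        per_step.append(next_state)     # value gathered by a CONCATENATE output
        state = next_state              # role of recurrence.set_input(1, next_state)
    last_value = state                          # analogue of a LAST_VALUE output
    concatenated = np.stack(per_step, axis=0)   # analogue of CONCATENATE along axis 0
    return last_value, concatenated

seq = np.ones((5, 1, 1), dtype=np.float32)      # [seq_len, batch_size, input_size]
init = np.zeros((1, 1), dtype=np.float32)
last, stacked = run_loop_model(seq, init)
print(last.ravel())      # [5.]
print(stacked.ravel())   # [1. 2. 3. 4. 5.]
print(stacked.shape)     # (5, 1, 1)
```

The LSTM layer in this notebook uses only the concatenated form, which is why its output has a leading `seq_len` dimension.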

## Step 2: Build the Network

Now that we have the LSTM layer implementation (`add_lstm_layer`), let's proceed to build the TensorRT `INetworkDefinition`.
This involves defining the network structure by:
1. Adding the input tensor using `network.add_input`.
2. Adding the LSTM layer using our custom `add_lstm_layer` function.
3. Marking the LSTM layer's output tensor as the network's final output.

In [None]:
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# === Network Definition ===
# Shape: [seq_len, batch_size, input_size] -> e.g., [5, 1, 1]
input_tensor = network.add_input(name='input', dtype=trt.float32, shape=(seq_len, batch_size, input_size))

# --- Add SINGLE LSTM Layer ---
lstm_output = add_lstm_layer(network=network,
                                input_sequence=input_tensor,
                                hidden_size=hidden_size,
                                seq_len=seq_len,
                                weight_W=np_weight_W, 
                                weight_U=np_weight_U, 
                                bias=np_bias) 
# lstm_output shape: [seq_len, batch_size, hidden_size] -> e.g., [5, 1, 2]

# --- Mark Output ---
lstm_output.name = 'hidden_state_sequence'
network.mark_output(lstm_output)

## Step 3: Build the Engine

Now that we have defined the network (`INetworkDefinition`), the next step is to build the optimized TensorRT engine. This process involves using the `trt.Builder` along with a `trt.IBuilderConfig` object to specify how the engine should be built.

The `IBuilderConfig` allows you to control various aspects of the build process, such as:
*   Setting memory constraints (e.g., workspace size using `set_memory_pool_limit`).
*   Setting builder flags to control optimization strategies and compatibility.

Once the network and configuration are ready, the `builder.build_serialized_network(network, config)` method is called to produce the serialized engine, which can then be saved to a file or used directly.


## (Optional) Defining a Progress Monitor
Building a TensorRT engine can sometimes take a while, especially for complex models. Don't worry if the build seems long! TensorRT offers a helpful tool called `IProgressMonitor`. This interface lets you track the build process step-by-step, making it easier to monitor progress and even debug if needed. 

### Implementing `IProgressMonitor`

To use the progress monitor, inherit from `trt.IProgressMonitor` and override its key methods:

*   `phase_start(self, phase_name, parent_phase, num_steps)`: TensorRT calls this method when it begins a phase of the build process. The phase names are supplied by TensorRT.
    *   `phase_name`: Name of the phase starting.
    *   `parent_phase`: Name of the parent phase, if this is a sub-phase (can be `None`).
    *   `num_steps`: The total number of steps expected for this phase.
*   `step_complete(self, phase_name, step)`: Called after each incremental step within a phase is completed.
    *   `phase_name`: Name of the current phase.
    *   `step`: The index of the step that just finished (0-based).
    *   *Your implementation* usually updates the corresponding progress indicator.
    *   **Crucially, this method must return `True` to allow the build to continue.** Returning `False` or `None` will signal TensorRT to cancel the build.
*   `phase_finish(self, phase_name)`: Called when a phase (and all its steps) is completed.
    *   `phase_name`: Name of the phase that finished.
    *   *Your implementation* typically finalizes and removes the progress indicator for this phase.

After that, attach the monitor to the `IBuilderConfig` by setting `config.progress_monitor = MyProgressMonitor()`.

In [None]:
class SimpleProgressMonitor(trt.IProgressMonitor):
    def __init__(self):
        trt.IProgressMonitor.__init__(self)
        self._active_phases = 0

    def phase_start(self, phase_name, parent_phase, num_steps):
        print(f"[ProgressMonitor] Phase Start: {phase_name} ({num_steps} steps)")
        self._active_phases += 1

    def phase_finish(self, phase_name):
        print(f"[ProgressMonitor] Phase Finish: {phase_name}")
        self._active_phases -= 1

    def step_complete(self, phase_name, step):
        print(f"[ProgressMonitor] Step Complete: {phase_name}, Step {step}")
        return True

    @property
    def active_phases(self):
        return self._active_phases
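
The callback contract can be exercised without running a build. The sketch below is a hypothetical stand-in: `RecordingMonitor` is a plain Python class with the same three method names (it does not inherit from `trt.IProgressMonitor`), and `drive_phase` imitates, in simplified form, how a caller might invoke them. It is meant to show the call order and the effect of the `step_complete` return value, not TensorRT's actual scheduling.

```python
class RecordingMonitor:
    """Stand-in with the IProgressMonitor method names; records every call."""

    def __init__(self, cancel_after_step=None):
        self.events = []
        self.cancel_after_step = cancel_after_step

    def phase_start(self, phase_name, parent_phase, num_steps):
        self.events.append(("start", phase_name, num_steps))

    def step_complete(self, phase_name, step):
        self.events.append(("step", phase_name, step))
        # Returning False requests cancellation; True lets the work continue.
        return self.cancel_after_step is None or step < self.cancel_after_step

    def phase_finish(self, phase_name):
        self.events.append(("finish", phase_name))


def drive_phase(monitor, phase_name, num_steps):
    """Hypothetical driver: returns True if every step ran, False if cancelled.

    Assumption of this sketch: the phase is still closed after a cancellation.
    """
    monitor.phase_start(phase_name, None, num_steps)
    completed = True
    for step in range(num_steps):
        if not monitor.step_complete(phase_name, step):
            completed = False
            break
    monitor.phase_finish(phase_name)
    return completed


ok = drive_phase(RecordingMonitor(), "demo_phase", 3)
cancelled = RecordingMonitor(cancel_after_step=1)
ok_cancelled = drive_phase(cancelled, "demo_phase", 3)
print(ok, ok_cancelled)        # True False
print(len(cancelled.events))   # 4
```

A monitor whose `step_complete` forgets its `return True` behaves like the cancelling case, because Python then returns `None`.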

## (Optional) Version Compatible Engine
TensorRT engines are typically optimized for the specific GPU and TensorRT version they are built on. This maximizes performance but can cause incompatibility if the deployment environment differs.

The `trt.BuilderFlag.VERSION_COMPATIBLE` flag relaxes the version constraint: an engine built with it can be loaded by later TensorRT runtimes, not only the exact version that built it, potentially at the cost of some performance compared with an engine built without the flag. This reduces the need to rebuild the engine for every TensorRT update. The flag does not by itself make an engine portable across GPU architectures; TensorRT treats hardware compatibility as a separate build setting. Version compatibility is supported from TensorRT 8.6 onwards: both the TensorRT version that builds the plan and the runtime that loads it must be 8.6 or higher.

### Use Cases
*   Deploying to fleets whose machines run different TensorRT versions (8.6 or later).
*   Distributing applications where end-user system configurations vary.
*   Simplifying maintenance by avoiding frequent rebuilds for minor updates.

### How it Works
Enabling `trt.BuilderFlag.VERSION_COMPATIBLE` instructs TensorRT to use more generic optimizations. By default, this flag also causes a copy of a "lean runtime" (a version-specific, stripped-down runtime component) to be packaged within the engine plan file. When you deserialize this engine plan on a compatible system, TensorRT recognizes the embedded lean runtime, loads it, and uses this runtime to deserialize and execute the rest of the plan. 

Because this process involves loading and executing code (the lean runtime) directly from the engine plan file, you must explicitly indicate that you trust the origin and integrity of the plan. This is done by setting `runtime.engine_host_code_allowed = True` on your `trt.Runtime` instance before attempting to deserialize the engine.
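
A minimal sketch of that deserialization step, using the TensorRT Python runtime directly. The function name `load_version_compatible_engine` and its `plan_path` argument are illustrative; `tensorrt` is imported inside the function only so that the definition can be imported on machines without TensorRT. Enable host code only for plan files whose origin you trust.

```python
def load_version_compatible_engine(plan_path: str):
    """Deserialize a version-compatible plan that embeds the lean runtime."""
    import tensorrt as trt  # requires a TensorRT 8.6+ installation

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)

    # Opt in to running the lean runtime code embedded in the plan file.
    runtime.engine_host_code_allowed = True

    with open(plan_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    if engine is None:
        raise RuntimeError(f"Failed to deserialize engine from {plan_path}")

    # Return the runtime as well so the caller can keep it referenced
    # for as long as the engine is in use.
    return runtime, engine
```

For example, `runtime, engine = load_version_compatible_engine('./lstm_network.plan')` would load the plan that this notebook writes during the engine build.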

> **Considerations for Multiple Version-Compatible Engines:**
> If deploying many version-compatible engines, the embedded lean runtime in each plan can lead to large overall application sizes. An alternative is to exclude the runtime from the engine plan (using `trt.BuilderFlag.EXCLUDE_LEAN_RUNTIME`) and load it manually. This approach can significantly reduce the total deployment footprint. For detailed instructions, refer to the NVIDIA TensorRT documentation on [Manually Loading the Runtime](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/advanced.html#manually-loading-the-runtime).

In [None]:
ENGINE_FILE_PATH = './lstm_network.plan'

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 28) # 256 MiB
config.progress_monitor = SimpleProgressMonitor()
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)

print("Building engine...")
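serialized_engine = builder.build_serialized_network(network, config)
# build_serialized_network returns None on failure, so check before writing the file
if serialized_engine is None:
    raise RuntimeError("Engine build failed; check the TensorRT log output above.")

print("Engine build completed.")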
with open(ENGINE_FILE_PATH, 'wb') as f:
    f.write(serialized_engine)
print(f"Engine saved to {ENGINE_FILE_PATH}")

## Inference

Once the TensorRT engine is built, the next step is typically to run inference to verify its functionality and performance. The standard process involves creating a runtime and execution context, allocating GPU memory for inputs and outputs, transferring data between host and device, and executing the engine. This gives fine-grained control but requires a fair amount of boilerplate code; it is demonstrated in detail in the `1_run_onnx_with_tensorrt` sample.

In this sample, we'll simplify the inference process by using **[Polygraphy](https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy)**, a toolkit maintained in the TensorRT open-source repository that automates many underlying details, such as:
*   Context creation
*   Buffer management
*   Data transfers

> **Important Note:** While Polygraphy is excellent for debugging and testing due to its ease of use, it may introduce overhead.
> For optimal performance in deployment scenarios, consider handcrafting the inference code as demonstrated in the `1_run_onnx_with_tensorrt` sample.

For more examples, refer to the [Polygraphy examples](https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples).

In [None]:
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

def run_inference_with_polygraphy(h_input: np.ndarray) -> np.ndarray:
    input_name = 'input'
    output_name = 'hidden_state_sequence'

    # Prepare the feed dictionary required by Polygraphy
    # Ensure input is contiguous C-style array, which Polygraphy prefers.
    h_input_contiguous = np.ascontiguousarray(h_input)
    feed_dict = {input_name: h_input_contiguous}

    print(f"Loading engine from: {ENGINE_FILE_PATH}")
    outputs = None
    load_engine = EngineFromBytes(BytesFromPath(ENGINE_FILE_PATH))
    with TrtRunner(load_engine) as runner:
        outputs = runner.infer(feed_dict=feed_dict)
        # Polygraphy automatically synchronizes, so no explicit stream sync needed here

    output_sequence = outputs[output_name]
    print(f"Output '{output_name}' shape: {output_sequence.shape}, dtype: {output_sequence.dtype}")
    return output_sequence

output_sequence = run_inference_with_polygraphy(np_inputs) 

if output_sequence is not None:
    print(f"\nInput Sequence (shape {np_inputs.shape}):\n{np_inputs}")
    print(f"\nOutput Hidden State Sequence (shape {output_sequence.shape}):\n{output_sequence}")
else:
    print("Inference failed.")
    

## Verifying the Output (Comparison with Equivalent Operations in NumPy)

To ensure our TensorRT LSTM implementation is correct, we'll compare its output with a reference implementation in NumPy. This is a common practice to validate custom layer logic.

The NumPy version will mimic the same LSTM cell computations and unroll the loop over the time sequence.

In [None]:
def sigmoid_np(x):
    x_clipped = np.clip(x, -500, 500)  # avoid overflow
    return 1.0 / (1.0 + np.exp(-x_clipped))


def tanh_np(x):
    x_clipped = np.clip(x, -100, 100)  # avoid overflow
    return np.tanh(x_clipped)


def lstm_step_numpy(x_t, prev_h, prev_c, W, U, bias):
    # W: shape [input_size, 4*hidden_size]
    # U: shape [hidden_size, 4*hidden_size]
    # bias: shape [4*hidden_size]
    # x_t: shape [batch_size, input_size]
    # prev_h, prev_c: shape [batch_size, hidden_size]

    hidden_size_ = prev_h.shape[1]

    Wx = x_t @ W  # Shape [batch_size, 4*hidden_size]
    Uh = prev_h @ U  # Shape [batch_size, 4*hidden_size]
    gates = Wx + Uh + bias

    # Split gates
    i = gates[:, 0 * hidden_size_ : 1 * hidden_size_]
    f = gates[:, 1 * hidden_size_ : 2 * hidden_size_]
    c = gates[:, 2 * hidden_size_ : 3 * hidden_size_]  # Cell candidate
    o = gates[:, 3 * hidden_size_ : 4 * hidden_size_]

    i_act = sigmoid_np(i)
    f_act = sigmoid_np(f)
    c_act = tanh_np(c)
    o_act = sigmoid_np(o)

    next_c = f_act * prev_c + i_act * c_act
    next_h = o_act * tanh_np(next_c)

    return next_h, next_c


def lstm_layer_numpy(input_sequence_np, np_W, np_U, np_bias):
    seq_len_ = input_sequence_np.shape[0]
    h = np_initial_h.copy()
    c = np_initial_c.copy()

    layer_output_sequence_list = []

    for t in range(seq_len_):
        # Slice the input sequence for this time step: [batch_size, input_size]
        x_t = input_sequence_np[t, :, :]

        h, c = lstm_step_numpy(x_t, h, c, np_W, np_U, np_bias)
        layer_output_sequence_list.append(h)

    # Stack the per-step hidden states once: [seq_len, batch_size, hidden_size]
    return np.stack(layer_output_sequence_list, axis=0)


numpy_output_sequence = lstm_layer_numpy(np_inputs, np_weight_W, np_weight_U, np_bias)
print("\n--- NumPy LSTM Calculation Results ---")
print(f"Input Sequence (all ones, shape {np_inputs.shape}):\n{np_inputs}")
print(f"\nNumPy Output Hidden State Sequence (shape {numpy_output_sequence.shape}):\n{numpy_output_sequence}")
print("\n--- Comparison ---")
diff = np.abs(output_sequence - numpy_output_sequence)
max_diff = np.max(diff) if diff.size > 0 else 0.0
print(f"Max absolute difference: {max_diff}")
assert np.allclose(
    output_sequence, numpy_output_sequence, atol=1e-5
), f"Output sequence mismatch between TensorRT and NumPy, max diff: {max_diff}"
print("Notebook executed successfully")
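
As an independent sanity check on both implementations, the first time step can be worked out by hand. Because every entry of `W`, `U`, and the bias shares one value, every gate pre-activation at step 0 is the same scalar. The sketch below recomputes it with the standard library only; the constants are copied from the parameter cell, and the variable names are illustrative.

```python
import math

# Values copied from the parameter cell of this notebook.
x, w_val, u_val, b_val = 1.0, 0.01, 0.05, 0.3
h0, c0, hidden_size = 0.1, 0.2, 2

def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

# Each gate column sees x * w once and h0 * u once per hidden unit, plus the bias.
pre_activation = x * w_val + hidden_size * (h0 * u_val) + b_val  # 0.32 up to rounding

i_gate = f_gate = o_gate = sigmoid(pre_activation)
g_cand = math.tanh(pre_activation)

c1 = f_gate * c0 + i_gate * g_cand   # cell state after the first step
h1 = o_gate * math.tanh(c1)          # hidden state after the first step
print(f"pre-activation = {pre_activation:.2f}, c1 = {c1:.6f}, h1 = {h1:.6f}")
```

Both entries of the first row of `output_sequence` and `numpy_output_sequence` should agree with `h1` to within the float32 tolerance used in the comparison above.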

## Conclusion and Next Steps

Congratulations! You have successfully:
- Defined an LSTM cell and layer using TensorRT's Layer APIs.
- Implemented a recurrent loop with `add_loop`.
- Monitored the engine build process using `IProgressMonitor`.
- Built a version-compatible TensorRT engine.
- Performed inference using the built engine via Polygraphy.
- Verified the results against a NumPy implementation.

This sample demonstrates the fundamental building blocks for creating custom network architectures in TensorRT. From here, you can explore:
- More complex network structures.
- Different types of layers available in the TensorRT API.
- Advanced loop constructs and conditional logic.
- Further optimization techniques if performance is critical (though for this sample, we focused on API usage).

With the Layer API, you can define and optimize networks for inference on NVIDIA GPUs without depending on an intermediate model format.