# Running Inference

This notebook demonstrates how to run inference with both PyTorch and TensorRT models, benchmark their performance, and verify that both models produce similar outputs. It includes:

- Loading PyTorch and TensorRT models
- Performance benchmarking (latency comparison)
- Output verification (comparing translations and numerical outputs)
- Encoder output accuracy verification

In [7]:
import torch
from config import get_config, get_weights_file_path
from train import get_model, get_ds, run_validation

## GPU Specifications

In [None]:
def get_gpu_info():
    if not torch.cuda.is_available():
        print("No GPU detected by PyTorch.")
        return
    
    num_devices = torch.cuda.device_count()
    print(f"GPUs detected: {num_devices}\n")

    for device_idx in range(num_devices):
        device_name = torch.cuda.get_device_name(device_idx)
        props = torch.cuda.get_device_properties(device_idx)
        print("---------------------------\n")
        print(f"All Properties: {props}")
        print("---------------------------\n")

        print(f"--- GPU Index: {device_idx} ---")
        print(f"Name: {device_name}")
        print(f"Processor Count: {props.multi_processor_count}")
        print(f"Total Memory: {props.total_memory / (1024 ** 3):.2f} GB")
        #print(f"Max Threads per Block: {props.max_threads_per_block}")
        print(f"Warp Size: {props.warp_size}")
        #print(f"Clock Rate: {props.clock_rate / 1e6:.2f} GHz")
        print(f"Memory Clock Rate: {getattr(props, 'memory_clock_rate', 'N/A')} Hz")
        print(f"Memory Bus Width: {getattr(props, 'memory_bus_width', 'N/A')} bits")
        print(f"Gcn Arch: {getattr(props, 'gcnArchName', 'N/A')}")
        print(f"Compute Capability (arch): "
              f"{getattr(props, 'major', 'N/A')}.{getattr(props, 'minor', 'N/A')}")
        print(f"Is Integrated: {props.is_integrated}")
        print("---------------------------\n")

get_gpu_info()

GPUs detected: 1

---------------------------

All Properties: _CudaDeviceProperties(name='Orin', major=8, minor=7, total_memory=7619MB, multi_processor_count=8, uuid=6c7f4d62-8e83-589a-a8af-556d15b4a582, pci_bus_id=0, pci_device_id=0, pci_domain_id=0, L2_cache_size=2MB)
---------------------------

--- GPU Index: 0 ---
Name: Orin
Processor Count: 8
Total Memory: 7.44 GB
Warp Size: 32
Memory Clock Rate: N/A Hz
Memory Bus Width: N/A bits
Gcn Arch: N/A
Compute Capability (arch): 8.7
Is Integrated: 1
---------------------------



## Setup: Device and Data Loading

Initialize the device (CUDA if available), load configuration, and prepare the data loaders and tokenizers for both training and validation sets.

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device {device}')
config = get_config()
train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)


Using device cuda
The dataset en-de is not available.
Checking availability of de-en...
Loaded dataset with config: de-en
Max length of source sentence: 466
Max length of target sentence: 479


## Load PyTorch Model

Load the trained PyTorch transformer model from the checkpoint. The model is moved to the specified device (CUDA/CPU) and weights are loaded from the checkpoint file.

In [3]:
EPOCH = "29"
model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)

# Load the pretrained weights
model_filename = get_weights_file_path(config, EPOCH)
state = torch.load(model_filename, map_location=device)
model.load_state_dict(state['model_state_dict'])
del state


## Prepare for Inference

Set the model to evaluation mode and disable gradient computation for faster inference. Import libraries needed for benchmarking (time, numpy, pandas).


In [4]:
import time
import numpy as np
import pandas as pd

model.eval()
torch.set_grad_enabled(False)


torch.autograd.grad_mode.set_grad_enabled(mode=False)

## Load TensorRT Model

Load the split TensorRT engines (encoder, decoder, and projection layer) from the `tensorrt_split/` directory. PyTorch's caching allocator is capped at 60% of GPU memory so that TensorRT has room for its own allocations.
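
As a rough worked example of what the 60% cap means on this board, the sketch below uses the `total_memory=7619MB` figure reported in the GPU properties output earlier in this notebook. The helper name `split_memory_budget` is hypothetical. Because the Orin's memory is shared with the CPU, the remainder is an upper bound on what TensorRT can use, not a guarantee.

```python
def split_memory_budget(total_mb: float, fraction: float) -> tuple[float, float]:
    """Return (PyTorch allocator cap, remainder) in MB for a given memory fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    cap = total_mb * fraction
    return cap, total_mb - cap


# 7619 MB is the total_memory value printed for the Orin device.
pytorch_cap_mb, remainder_mb = split_memory_budget(7619, 0.6)
print(f"PyTorch cap: {pytorch_cap_mb:.1f} MB")
print(f"Left for TensorRT and everything else: {remainder_mb:.1f} MB")
```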

In [5]:
# IMPORTANT: this import should NOT trigger inference because the file uses if __name__ == "__main__"
from run_trt_split import TRTTransformer, greedy_decode as greedy_decode_trt

# (optional) same memory fraction trick in notebook
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.6)

trt_model = TRTTransformer(
    enc_path="tensorrt_split/tmodel_" + EPOCH + "_encoder_fp32.engine",
    dec_path="tensorrt_split/tmodel_" + EPOCH + "_decoder_fp32.engine",
    proj_path="tensorrt_split/tmodel_" + EPOCH + "_projection_fp32.engine"
)

trt_model  # sanity check: shows the object; "Engines loaded." is printed by the constructor above


Loading TRT Engines...
Engines loaded.


<run_trt_split.TRTTransformer at 0xfffeb81822f0>

## Import Decoding Functions

Import the greedy decoding function from the training module for PyTorch inference. The TensorRT version was already imported above.


In [6]:
from train import greedy_decode as greedy_decode_pt


## Benchmark Function

The `benchmark_decode` function measures inference latency for a given model and decoding function. It:

1. **Warmup Phase**: Runs a few inference passes to stabilize GPU performance and optionally prints sample translations
2. **Timed Phase**: Measures actual inference time for multiple batches
3. **Statistics**: Computes mean, median (p50), p90, and p99 percentiles of latency

This function is used to compare PyTorch vs TensorRT performance.
   

In [7]:
def benchmark_decode(
    decode_fn,
    model_obj,
    dataloader,
    tokenizer_src,
    tokenizer_tgt,
    max_len,
    device,
    n_batches=50,
    warmup_batches=5,
    label="",
    print_warmup_samples=0 
):
    times_ms = []

    # ---------------- Warmup (optional prints here) ----------------
    for i, batch in enumerate(dataloader):
        if i >= warmup_batches:
            break

        src = batch["encoder_input"].to(device)
        src_mask = batch["encoder_mask"].to(device)

        out_ids = decode_fn(model_obj, src, src_mask, tokenizer_src, tokenizer_tgt, max_len, device)

        if i < print_warmup_samples:
            out_text = tokenizer_tgt.decode(out_ids.detach().cpu().numpy())
            print("-" * 80)
            print(f"{label} WARMUP SAMPLE {i+1}")
            print(f"SOURCE:    {batch['src_text'][0]}")
            print(f"TARGET:    {batch['tgt_text'][0]}")
            print(f"PRED:      {out_text}")

    if device.type == "cuda":
        torch.cuda.synchronize()

    # ---------------- Timed runs ----------------
    for i, batch in enumerate(dataloader):
        if i >= n_batches:
            break

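        src = batch["encoder_input"].to(device)
        src_mask = batch["encoder_mask"].to(device)

        # Drain pending GPU work (e.g. host-to-device copies) before starting the clock
        if device.type == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        _ = decode_fn(model_obj, src, src_mask, tokenizer_src, tokenizer_tgt, max_len, device)
        # CUDA kernels run asynchronously, so wait for them before stopping the clock
        if device.type == "cuda":
            torch.cuda.synchronize()
        t1 = time.perf_counter()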

        times_ms.append((t1 - t0) * 1000)

    arr = np.array(times_ms, dtype=np.float32)
    print(f"\n{label} latency over {len(arr)} batches:")
    print(f"  mean: {arr.mean():.2f} ms")
    print(f"  p50 : {np.percentile(arr, 50):.2f} ms")
    print(f"  p90 : {np.percentile(arr, 90):.2f} ms")
    print(f"  p99 : {np.percentile(arr, 99):.2f} ms")

    return times_ms
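
The same warmup, timed phase, and percentile pattern can be tried without a GPU. The sketch below is a standard-library-only illustration, not the notebook's implementation: `time_callable` and `percentile` are hypothetical helpers, and `percentile` uses linear interpolation between sorted samples, which is NumPy's default method.

```python
import time


def percentile(samples, q):
    """q-th percentile (0-100) with linear interpolation between sorted samples."""
    xs = sorted(samples)
    if not xs:
        raise ValueError("need at least one sample")
    rank = (len(xs) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)


def time_callable(fn, n_runs=50, warmup_runs=5):
    """Call fn() warmup_runs times untimed, then return n_runs latencies in ms."""
    for _ in range(warmup_runs):
        fn()
    times_ms = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1000)
    return times_ms


# Stand-in workload; in the notebook this would be a decode call plus a CUDA sync.
latencies = time_callable(lambda: sum(range(10_000)))
print(f"p50: {percentile(latencies, 50):.3f} ms")
print(f"p99: {percentile(latencies, 99):.3f} ms")
```

With 50 samples the p99 rank is 0.99 × 49 = 48.51, so p99 is interpolated between the two slowest batches and is sensitive to a single outlier.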


## Run Performance Benchmarks

Benchmark both PyTorch and TensorRT models on the validation dataset. This will:

- Run 5 warmup batches (printing sample translations for the first 4)
- Measure latency for 50 batches
- Print latency statistics (mean, p50, p90, p99)
- Calculate the speedup factor (PyTorch latency / TensorRT latency)

The results show how much faster TensorRT is compared to PyTorch for inference.


In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_len = config["seq_len"]

pt_times = benchmark_decode(
    greedy_decode_pt, model, val_dataloader,
    tokenizer_src, tokenizer_tgt,
    max_len=max_len,
    device=device,
    n_batches=50,
    warmup_batches=5,
    label="PyTorch",
    print_warmup_samples=4
)

trt_times = benchmark_decode(
    greedy_decode_trt, trt_model, val_dataloader,
    tokenizer_src, tokenizer_tgt,
    max_len=max_len,
    device=device,
    n_batches=50,
    warmup_batches=5,
    label="TensorRT",
    print_warmup_samples=4
)

speedup = np.mean(pt_times) / np.mean(trt_times)
print(f"\nSpeedup (PyTorch / TRT): {speedup:.2f}x")


--------------------------------------------------------------------------------
PyTorch WARMUP SAMPLE 1
SOURCE:    * The city of Cambrai is well dressed. Marafin plundered it.
TARGET:    Was für schöne Kleider hat Cambrai, die gute Stadt: Marasin hat sie geplündert.
PRED:      Was für schöne Kleider hat , die gute Stadt : hat sie der guten Stadt bekommen .
--------------------------------------------------------------------------------
PyTorch WARMUP SAMPLE 2
SOURCE:    Upon which the bishop had been constrained to recite to him the ordinance of Legate Odo, which excepts certain great dames, ~aliquoe magnates mulieres, quoe sine scandalo vitari non possunt~.
TARGET:    Daraufhin hatte ihm der Bischof die Verordnung des Legaten Odo citiren müssen, welche gewisse Damen von Stande »aliquae magnates mulieres, quae sine scandalo evitari non possunt« ausnimmt.
PRED:      Bei dem aber , welcher der Bischof die Verordnung des Bischofs hatte , wie ihn die » « jenen Damen « nannte : » , quae si

| GPU        | Runtime Screenshot | GPU Details Screenshot |
|------------|------------------|----------------------|
| **MI300X** | ![MI300X Runtime](./images/rocm_runtime_pytorch.png) | ![MI300X Details](./images/amd_gpu_details.png) |
| **L4**     | ![L4 Runtime](./images/l4_runtime.png) | ![L4 Details](./images/l4_details.png) |
| **A100**   | ![A100 Runtime](./images/a100_runtime.png) | ![A100 Details](./images/a100_details.png) |
| **Jetson Orin Nano Super**   | ![Jetson Runtime](./images/jetson_runtime.png) | ![Jetson Details](./images/jetson_details.png) |


## Translation Comparison

The `verbose_compare_same_sample` function compares translations from both models on the same input sample. It:

- Takes the same batch from the dataloader
- Runs inference with both PyTorch and TensorRT models
- Prints side-by-side comparison of source, target, and predictions
- Returns the token IDs from both models

This helps verify that both models produce similar (or identical) translations.
 

In [9]:

def verbose_compare_same_sample(
    pt_model,
    trt_model,
    dataloader,
    tokenizer_src,
    tokenizer_tgt,
    max_len,
    device,
    batch_index=0,   # pick which batch to compare
):
    pt_model.eval()
    torch.set_grad_enabled(False)

    # ---- get ONE specific batch ----
    it = iter(dataloader)
    batch = None
    for i in range(batch_index + 1):
        batch = next(it)

    encoder_input = batch["encoder_input"].to(device)
    encoder_mask  = batch["encoder_mask"].to(device)
    source_text   = batch["src_text"][0]
    target_text   = batch["tgt_text"][0]

    # ---- PyTorch decode on SAME tensors ----
    out_pt_ids = greedy_decode_pt(
        pt_model, encoder_input, encoder_mask,
        tokenizer_src, tokenizer_tgt,
        max_len, device
    )
    out_pt_text = tokenizer_tgt.decode(out_pt_ids.detach().cpu().numpy())

    # ---- TensorRT decode on SAME tensors ----
    out_trt_ids = greedy_decode_trt(
        trt_model, encoder_input, encoder_mask,
        tokenizer_src, tokenizer_tgt,
        max_len, device
    )
    out_trt_text = tokenizer_tgt.decode(out_trt_ids.detach().cpu().numpy())

    # ---- print SAME sample ----
    print("-" * 80)
    print(f"SOURCE:     {source_text}")
    print(f"TARGET:     {target_text}")
    print(f"PT  PRED:   {out_pt_text}")
    print(f"TRT PRED:   {out_trt_text}")
    print("-" * 80)

    return out_pt_ids, out_trt_ids


# run on batch index 1 (both models decode the same sample)
_ = verbose_compare_same_sample(
    model, trt_model, val_dataloader,
    tokenizer_src, tokenizer_tgt,
    config["seq_len"], device,
    batch_index=1
)


--------------------------------------------------------------------------------
SOURCE:     When the Prince flared up she kept silent, feeling shame for her mother and tenderness toward her father because of his immediate return to kindliness; but when her father left the room she was ready for the chief thing needful, which was to go to Kitty and comfort her.
TARGET:     Während dann der Fürst seinem Ingrimm Luft machte, hatte sie geschwiegen und sich für ihre Mutter geschämt; mit zärtlicher Bewunderung hatte sie auf ihren Vater geblickt, als dieser so bald wieder gut und freundlich wurde. Aber als nun der Vater hinausgegangen war, da schickte sie sich an, das Wichtigste zu tun, was jetzt nötig war: zu Kitty zu gehen und sie zu beruhigen.
PT  PRED:   Beim Anblick des hatte sie sich gekränkt ; aber als sie das Schweigen der Mutter und mit einer Neigung für den Vater dieses Namens dieser Neigung zur Ruhe des Vaters zurückkehren ; aber als sie das Zimmer verlassen hatte , da hatte sie f
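
Because `verbose_compare_same_sample` returns both token-ID sequences, agreement can also be checked programmatically instead of by eye. A minimal sketch, assuming the IDs have been converted to plain Python lists (for example with `.tolist()`); `first_mismatch` is a hypothetical helper and the IDs below are made up for illustration.

```python
def first_mismatch(ids_a, ids_b):
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


pt_ids = [2, 17, 45, 9, 3]     # illustrative IDs, not real model output
trt_ids = [2, 17, 45, 11, 3]
print(first_mismatch(pt_ids, trt_ids))  # 3
print(first_mismatch(pt_ids, pt_ids))   # None
```

In greedy decoding each step conditions on the tokens already produced, so once one token differs the rest of the two outputs may diverge as well; the index of the first mismatch is therefore the most informative number.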

## Logits Comparison (First Decoding Step)

The `compare_first_step_logits` function verifies numerical accuracy by comparing the logits from the first decoding step. It:

- Encodes the source sequence with both models
- Runs one decoder step (with [SOS] token) for both models
- Compares the output logits (one score per entry in the target vocabulary)
- Reports maximum and mean absolute differences

Small differences (< 0.01) indicate that TensorRT is producing numerically similar outputs to PyTorch, which is expected due to floating-point precision differences.

In [10]:
from dataset import causal_mask
def compare_first_step_logits(pt_model, trt_model, batch):
    device = torch.device("cuda")
    pt_model.eval()
    torch.set_grad_enabled(False)

    src = batch["encoder_input"].to(device)
    src_mask = batch["encoder_mask"].to(device)

    # PT: encoder + one decoder step (SOS)
    sos_idx = tokenizer_tgt.token_to_id("[SOS]")
    dec_in = torch.tensor([[sos_idx]], device=device, dtype=src.dtype)

    enc_pt = pt_model.encode(src, src_mask)
    dec_mask = causal_mask(1).type_as(src_mask).to(device)  # (1,1,1)
    out_pt = pt_model.decode(enc_pt, src_mask, dec_in, dec_mask)
    logits_pt = pt_model.project(out_pt[:, -1])  # (B, vocab)

    # TRT: same
    enc_trt = trt_model.encode(src, src_mask)
    dec_mask_trt = causal_mask(1).type_as(src_mask).to(device).unsqueeze(1)  # (1,1,1,1)
    out_trt = trt_model.decode(enc_trt, src_mask, dec_in, dec_mask_trt)
    logits_trt = trt_model.project(out_trt[:, -1])

    lp = logits_pt.detach().cpu().float()
    lt = logits_trt.detach().cpu().float()

    max_abs = (lp - lt).abs().max().item()
    mean_abs = (lp - lt).abs().mean().item()
    print("First-step logits diff:")
    print("  max abs:", max_abs)
    print("  mean abs:", mean_abs)

batch0 = next(iter(val_dataloader))
compare_first_step_logits(model, trt_model, batch0)


First-step logits diff:
  max abs: 0.007963895797729492
  mean abs: 0.0017605420434847474
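
Greedy decoding only uses the argmax of the logits, so a small numerical difference changes the chosen token only when the top two candidates are nearly tied. The sketch below illustrates that reasoning with made-up logits; `argmax_is_safe` is a hypothetical helper. If every logit can move by at most `eps`, the argmax is guaranteed to stay the same whenever the gap between the best and second-best logit exceeds `2 * eps`.

```python
def argmax_is_safe(logits, eps):
    """True if no perturbation of at most eps per logit can change the argmax."""
    if len(logits) < 2:
        return True
    top, second = sorted(logits, reverse=True)[:2]
    return (top - second) > 2 * eps


eps = 0.008  # roughly the max abs difference reported above
print(argmax_is_safe([4.10, 2.30, -1.00], eps))  # True: gap 1.8 is far above 0.016
print(argmax_is_safe([4.10, 4.09, -1.00], eps))  # False: gap 0.01 is below 0.016
```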


## Encoder Output Comparison

This cell defines two helpers for checking that the encoder produces similar outputs in both models:

- **`pt_encode_trt_style`**: Adapts PyTorch encoder to use the same mask format as TensorRT (square mask shape)
- **`compare_encoder_outputs`**: Compares encoder outputs from both models and reports maximum and mean absolute differences

This ensures that the encoder component is working correctly in the TensorRT version. Differences should be very small (< 0.001) due to floating-point precision.

In [11]:
def pt_encode_trt_style(pt_model, src, src_mask):
    # Make PT use the same square mask shape TRT encoder expects
    S = src.shape[1]
    if src_mask.dim() == 3:
        src_mask = src_mask.unsqueeze(1)   # (B,1,S)->(B,1,1,S)
    if src_mask.dim() == 4 and src_mask.shape[2] == 1:
        src_mask = src_mask.repeat(1,1,S,1)  # -> (B,1,S,S)
    src_mask = src_mask.float()  # binary float
    return pt_model.encode(src, src_mask)

def compare_encoder_outputs(pt_model, trt_model, batch):
    device = torch.device("cuda")
    pt_model.eval()
    torch.set_grad_enabled(False)

    src = batch["encoder_input"].to(device)
    src_mask = batch["encoder_mask"].to(device)

    enc_pt  = pt_encode_trt_style(pt_model, src, src_mask).detach().cpu().float()
    enc_trt = trt_model.encode(src, src_mask).detach().cpu().float()

    diff = (enc_pt - enc_trt).abs()
    print("ENCODER diff: max =", diff.max().item(), "mean =", diff.mean().item())

batch0 = next(iter(val_dataloader))
compare_encoder_outputs(model, trt_model, batch0)


ENCODER diff: max = 0.000666201114654541 mean = 5.592039451585151e-05


## Position-wise Encoder Analysis

This cell performs a detailed analysis of encoder output differences:

- Computes per-position mean differences across the hidden dimension
- Separates differences for **padded positions** (where mask = 0) vs **unpadded positions** (where mask = 1)
- Reports mean differences for each category and the overall maximum

This helps understand if differences are concentrated in padded regions (which are typically ignored) or in actual content positions.

In [12]:
device = torch.device("cuda")
batch0 = next(iter(val_dataloader))
src = batch0["encoder_input"].to(device)
src_mask = batch0["encoder_mask"].to(device)

enc_pt  = pt_encode_trt_style(model, src, src_mask).detach().cpu().float()
enc_trt = trt_model.encode(src, src_mask).detach().cpu().float()

# per-position mean diff across hidden dim
pos_diff = (enc_pt - enc_trt).abs().mean(-1).squeeze(0)   # (S,)

# padded vs unpadded positions (mask is (1,1,1,S))
key_mask = src_mask.squeeze().cpu()  # (S,) with 0/1
pad_pos = key_mask == 0
unpad_pos = key_mask == 1

print("mean diff padded   :", pos_diff[pad_pos].mean().item())
print("mean diff unpadded :", pos_diff[unpad_pos].mean().item())
print("max diff overall   :", pos_diff.max().item())


mean diff padded   : 4.1217113903257996e-05
mean diff unpadded : 1.9660077668959275e-05
max diff overall   : 0.000134581932798028
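
The mask-based split used above can be illustrated on plain Python lists. This is a toy sketch with made-up values; `mean_by_mask` is a hypothetical helper, with mask value 1 marking real tokens and 0 marking padding, as in the notebook.

```python
def _mean(xs):
    """Arithmetic mean, or None for an empty list."""
    return sum(xs) / len(xs) if xs else None


def mean_by_mask(pos_diff, mask):
    """Return (mean over mask == 1, mean over mask == 0)."""
    if len(pos_diff) != len(mask):
        raise ValueError("pos_diff and mask must have the same length")
    real = [d for d, m in zip(pos_diff, mask) if m == 1]
    pad = [d for d, m in zip(pos_diff, mask) if m == 0]
    return _mean(real), _mean(pad)


pos_diff = [0.25, 0.75, 1.0, 2.0]  # illustrative per-position differences
mask = [1, 1, 0, 0]                # two real tokens followed by two padding slots
unpadded_mean, padded_mean = mean_by_mask(pos_diff, mask)
print(unpadded_mean, padded_mean)  # 0.5 1.5
```

Returning `None` for an empty group is a design choice of this sketch: it makes a batch with no padding explicit instead of producing a NaN.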


## Export Benchmark Results

Save the benchmark timing data to a CSV file for further analysis. The DataFrame contains:

- `pytorch_ms`: Latency for each PyTorch inference (milliseconds)
- `tensorrt_ms`: Latency for each TensorRT inference (milliseconds)
- `speedup_x`: Per-sample speedup ratio (PyTorch / TensorRT)

The `describe()` method provides summary statistics (mean, std, min, max, percentiles) for all columns.


In [None]:
m = min(len(pt_times), len(trt_times))
df = pd.DataFrame({
    "pytorch_ms": pt_times[:m],
    "tensorrt_ms": trt_times[:m],
})
df["speedup_x"] = df["pytorch_ms"] / df["tensorrt_ms"]

df.to_csv("benchmark_times.csv", index=False)
df.describe()
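
Note that the mean of the per-row `speedup_x` column is generally not the same number as the ratio of mean latencies printed by the benchmark cell. A small illustration with made-up timings (the values are not measurements):

```python
from statistics import mean

pytorch_ms = [100.0, 300.0]   # illustrative latencies, not measurements
tensorrt_ms = [50.0, 100.0]

ratio_of_means = mean(pytorch_ms) / mean(tensorrt_ms)                  # 200 / 75
mean_of_ratios = mean(p / t for p, t in zip(pytorch_ms, tensorrt_ms))  # (2 + 3) / 2

print(f"ratio of means: {ratio_of_means:.2f}x")  # 2.67x
print(f"mean of ratios: {mean_of_ratios:.2f}x")  # 2.50x
```

The ratio of means reflects the total time saved over the whole run, while the mean of ratios weights every row equally regardless of how long it took, so it is worth stating which one a reported speedup refers to.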


## Optional: Full Validation

Run the cell below to validate the PyTorch model. With `num_examples=10` it prints up to 10 translation examples, each with source, target, and predicted text, which is useful for a qualitative assessment of translation quality.

In [13]:
run_validation(model, val_dataloader, tokenizer_src, tokenizer_tgt, config['seq_len'], device, lambda msg: print(msg), 0, None, num_examples=10)


stty: 'standard input': Inappropriate ioctl for device


--------------------------------------------------------------------------------
    SOURCE: Benserade prepared us for it by some very gallant verses."
    TARGET: Benserade bereitete uns in recht niedlichen Versen darauf vor.«
 PREDICTED: bereitete uns in recht niedlichen Versen darauf vor .«
--------------------------------------------------------------------------------
    SOURCE: She renounced everything. But facts and time have shown that her situation is tormenting and impossible.'
    TARGET: Aber das nüchterne Leben und die Zeit haben ihr bewiesen, daß ihre Lage qualvoll und unerträglich ist.«
 PREDICTED: Aber das Volk hat alles geordnet , alles ist die Zeit , die sie hat , eine Qual und unmöglich .«
--------------------------------------------------------------------------------
    SOURCE: She came up to him and said, 'My darling!'
    TARGET: »Mein lieber Junge!« sagte sie.
 PREDICTED: Sie trat zu ihm heran und sagte : › Meine Liebe !‹
--------------------------------------

## Checking the TensorBoard Files

In [14]:
import tensorboard as tb
import os

In [23]:
# If you are connected over SSH, set up port forwarding first:
#   ssh -L 6006:localhost:6006 user@jetson_ip
# Then open http://localhost:6006 in a browser on your host machine.

logdir = "./runs/tmodel/"
os.system(f"tensorboard --logdir {logdir} --port 6006 --host 0.0.0.0")



TensorFlow installation not found - running with reduced feature set.
TensorBoard 2.20.0 at http://0.0.0.0:6006/ (Press CTRL+C to quit)


2
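
`os.system` blocks the notebook cell until TensorBoard is interrupted. One alternative, sketched below, is to start it as a background process with `subprocess.Popen`. The helper names `tensorboard_command` and `launch_tensorboard` are hypothetical; the command-line flags are the same ones used in the cell above.

```python
import subprocess


def tensorboard_command(logdir, port=6006, host="0.0.0.0"):
    """Build the argument list for the same TensorBoard invocation as above."""
    return ["tensorboard", "--logdir", logdir, "--port", str(port), "--host", host]


def launch_tensorboard(logdir, port=6006, host="0.0.0.0"):
    """Start TensorBoard in the background and return the process handle."""
    return subprocess.Popen(tensorboard_command(logdir, port, host))


cmd = tensorboard_command("./runs/tmodel/")
print(" ".join(cmd))

# To launch without blocking the cell (requires tensorboard on PATH):
# proc = launch_tensorboard("./runs/tmodel/")
# ...and later, to stop it:
# proc.terminate()
```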