# Running Inference

This notebook demonstrates how to run inference with both PyTorch and TensorRT models, benchmark their performance, and verify that both models produce similar outputs. It includes:

- Loading PyTorch and TensorRT models
- Performance benchmarking (latency comparison)
- Output verification (comparing translations and numerical outputs)
- Encoder output accuracy verification

In [1]:
import torch
from config import get_config, get_weights_file_path
from train import get_model, get_ds, run_validation



## Setup: Device and Data Loading

Initialize the device (CUDA if available), load configuration, and prepare the data loaders and tokenizers for both training and validation sets.

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device {device}')
config = get_config()
train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)


Using device cuda
Max length of source sentence: 466
Max length of target sentence: 479


## Load PyTorch Model

Load the trained PyTorch transformer model from the checkpoint. The model is moved to the specified device (CUDA/CPU) and weights are loaded from the checkpoint file.

In [4]:
model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)

# Load the pretrained weights
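model_filename = get_weights_file_path(config, "13")
# map_location keeps loading robust if the checkpoint was saved on a different device
state = torch.load(model_filename, map_location=device)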
model.load_state_dict(state['model_state_dict'])


<All keys matched successfully>

## Prepare for Inference

Set the model to evaluation mode and disable gradient tracking globally, which reduces memory use and speeds up inference. Import the libraries needed for benchmarking (`time`, `numpy`, `pandas`).


In [5]:
import time
import numpy as np
import pandas as pd

model.eval()
torch.set_grad_enabled(False)


torch.autograd.grad_mode.set_grad_enabled(mode=False)

## Load TensorRT Model

Load the split TensorRT engines (encoder, decoder, and projection layer) from the `tensorrt_split/` directory. PyTorch's per-process GPU memory is capped at 60% so that the TensorRT engines have room to allocate their own buffers.

In [3]:
# IMPORTANT: this import should NOT trigger inference because the file uses if __name__ == "__main__"
from run_trt_split import TRTTransformer, greedy_decode as greedy_decode_trt

# (optional) cap PyTorch's share of GPU memory, as done in run_trt_split
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.6)

trt_model = TRTTransformer(
    enc_path="tensorrt_split/tmodel_13_encoder_fp32.engine",
    dec_path="tensorrt_split/tmodel_13_decoder_fp32.engine",
    proj_path="tensorrt_split/tmodel_13_projection_fp32.engine"
)

trt_model  # sanity check: the constructor above should have printed "Engines loaded."


Loading TRT Engines...
Engines loaded.


<run_trt_split.TRTTransformer at 0xffff922e2800>
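
Before constructing `TRTTransformer`, it can help to confirm that the three engine files exist, since a missing file may otherwise surface as a less obvious error inside the loader. The sketch below is a minimal check using only the standard library. `missing_engine_files` is a hypothetical helper, not part of `run_trt_split`; the paths are the ones used in the cell above.

```python
from pathlib import Path

# Engine paths copied from the cell above.
ENGINE_PATHS = {
    "encoder": "tensorrt_split/tmodel_13_encoder_fp32.engine",
    "decoder": "tensorrt_split/tmodel_13_decoder_fp32.engine",
    "projection": "tensorrt_split/tmodel_13_projection_fp32.engine",
}


def missing_engine_files(paths, base_dir="."):
    """Return the names of engines whose files are absent under base_dir.

    Hypothetical helper for illustration; not part of the repository.
    """
    base = Path(base_dir)
    return [name for name, rel in paths.items() if not (base / rel).is_file()]


missing = missing_engine_files(ENGINE_PATHS)
if missing:
    print("Missing engine files for:", ", ".join(missing))
else:
    print("All engine files found.")
```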

## Import Decoding Functions

Import the greedy decoding function from the training module for PyTorch inference. The TensorRT version was already imported above.


In [6]:
from train import greedy_decode as greedy_decode_pt


## Benchmark Function

The `benchmark_decode` function measures inference latency for a given model and decoding function. It:

1. **Warmup Phase**: Runs a few inference passes to stabilize GPU performance and optionally prints sample translations
2. **Timed Phase**: Measures actual inference time for multiple batches
3. **Statistics**: Computes mean, median (p50), p90, and p99 percentiles of latency

This function is used to compare PyTorch vs TensorRT performance.
   

In [7]:
def benchmark_decode(
    decode_fn,
    model_obj,
    dataloader,
    tokenizer_src,
    tokenizer_tgt,
    max_len,
    device,
    n_batches=50,
    warmup_batches=5,
    label="",
    print_warmup_samples=0,
):
    times_ms = []

    # ---------------- Warmup (optional prints here) ----------------
    for i, batch in enumerate(dataloader):
        if i >= warmup_batches:
            break

        src = batch["encoder_input"].to(device)
        src_mask = batch["encoder_mask"].to(device)

        out_ids = decode_fn(model_obj, src, src_mask, tokenizer_src, tokenizer_tgt, max_len, device)

        if i < print_warmup_samples:
            out_text = tokenizer_tgt.decode(out_ids.detach().cpu().numpy())
            print("-" * 80)
            print(f"{label} WARMUP SAMPLE {i+1}")
            print(f"SOURCE:    {batch['src_text'][0]}")
            print(f"TARGET:    {batch['tgt_text'][0]}")
            print(f"PRED:      {out_text}")

    if device.type == "cuda":
        torch.cuda.synchronize()

    # ---------------- Timed runs ----------------
    for i, batch in enumerate(dataloader):
        if i >= n_batches:
            break

        src = batch["encoder_input"].to(device)
        src_mask = batch["encoder_mask"].to(device)

        t0 = time.perf_counter()
        _ = decode_fn(model_obj, src, src_mask, tokenizer_src, tokenizer_tgt, max_len, device)
        if device.type == "cuda":
            torch.cuda.synchronize()
        t1 = time.perf_counter()

        times_ms.append((t1 - t0) * 1000)

    arr = np.array(times_ms, dtype=np.float32)
    print(f"\n{label} latency over {len(arr)} batches:")
    print(f"  mean: {arr.mean():.2f} ms")
    print(f"  p50 : {np.percentile(arr, 50):.2f} ms")
    print(f"  p90 : {np.percentile(arr, 90):.2f} ms")
    print(f"  p99 : {np.percentile(arr, 99):.2f} ms")

    return times_ms
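
The warmup-then-time pattern above does not depend on PyTorch. As a minimal sketch using only the standard library, the hypothetical helpers `time_calls` and `summarize_latencies` below time an arbitrary callable and report the same statistics. The `percentile` function interpolates linearly between ranks, which is the default method of `np.percentile`.

```python
import math
import time


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between closest ranks."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    s = sorted(values)
    rank = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


def time_calls(fn, n_runs=50, warmup_runs=5):
    """Call fn() warmup_runs times untimed, then n_runs times timed.

    Returns the per-call latencies in milliseconds.
    """
    for _ in range(warmup_runs):
        fn()
    times_ms = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1000)
    return times_ms


def summarize_latencies(times_ms):
    """Mean and p50/p90/p99 of a list of latencies."""
    return {
        "mean": sum(times_ms) / len(times_ms),
        "p50": percentile(times_ms, 50),
        "p90": percentile(times_ms, 90),
        "p99": percentile(times_ms, 99),
    }


# Toy CPU workload standing in for a decode call.
stats = summarize_latencies(
    time_calls(lambda: sum(range(10_000)), n_runs=20, warmup_runs=3)
)
print({name: round(value, 3) for name, value in stats.items()})
```

For GPU work the timed region also needs a `torch.cuda.synchronize()` before the clock is stopped, as `benchmark_decode` does; otherwise only the kernel launch time is measured.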


## Run Performance Benchmarks

Benchmark both PyTorch and TensorRT models on the validation dataset. This will:

- Run 5 warmup batches (with 4 sample translations printed)
- Measure latency for 50 batches
- Print latency statistics (mean, p50, p90, p99)
- Calculate the speedup factor (PyTorch latency / TensorRT latency)

The speedup is the ratio of the two mean latencies, so a value above 1 means TensorRT is faster than PyTorch on average.


In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_len = config["seq_len"]

pt_times = benchmark_decode(
    greedy_decode_pt, model, val_dataloader,
    tokenizer_src, tokenizer_tgt,
    max_len=max_len,
    device=device,
    n_batches=50,
    warmup_batches=5,
    label="PyTorch",
    print_warmup_samples=4
)

trt_times = benchmark_decode(
    greedy_decode_trt, trt_model, val_dataloader,
    tokenizer_src, tokenizer_tgt,
    max_len=max_len,
    device=device,
    n_batches=50,
    warmup_batches=5,
    label="TensorRT",
    print_warmup_samples=4
)

speedup = np.mean(pt_times) / np.mean(trt_times)
print(f"\nSpeedup (PyTorch / TRT): {speedup:.2f}x")


--------------------------------------------------------------------------------
PyTorch WARMUP SAMPLE 1
SOURCE:    He reproached himself with forgetting Emma, as if, all his thoughts belonging to this woman, it was robbing her of something not to be constantly thinking of her.
TARGET:    Er redete sich ein, er vernachlässige seine Frau, wenn er ihr nicht all sein Dichten und Trachten widme. Er wollte an nichts andres denken, selbst wenn ihr dadurch kein Abbruch geschähe.
PRED:      Er hatte sich über Emma häufig vergessen , als sei es ihm , daß sein Weib , die , die sie sich immer noch mehr an sich selbst gewöhnt gehabt hatten .
--------------------------------------------------------------------------------
PyTorch WARMUP SAMPLE 2
SOURCE:    Gringoire stooped quickly to pick it up; when he straightened up, the young girl and the goat had disappeared.
TARGET:    Gringoire bückte sich schnell, um es aufzuheben; als er sich wieder erhob, waren das junge Mädchen und die Ziege verschwende

## Translation Comparison

The `verbose_compare_same_sample` function compares translations from both models on the same input sample. It:

- Takes the same batch from the dataloader
- Runs inference with both PyTorch and TensorRT models
- Prints side-by-side comparison of source, target, and predictions
- Returns the token IDs from both models

This helps verify that both models produce similar (or identical) translations.
 

In [9]:

def verbose_compare_same_sample(
    pt_model,
    trt_model,
    dataloader,
    tokenizer_src,
    tokenizer_tgt,
    max_len,
    device,
    batch_index=0,   # pick which batch to compare
):
    pt_model.eval()
    torch.set_grad_enabled(False)
    # Use the `device` argument as passed in; re-deriving it here would silently override it.

    # ---- get ONE specific batch ----
    it = iter(dataloader)
    batch = None
    for i in range(batch_index + 1):
        batch = next(it)

    encoder_input = batch["encoder_input"].to(device)
    encoder_mask  = batch["encoder_mask"].to(device)
    source_text   = batch["src_text"][0]
    target_text   = batch["tgt_text"][0]

    # ---- PyTorch decode on SAME tensors ----
    out_pt_ids = greedy_decode_pt(
        pt_model, encoder_input, encoder_mask,
        tokenizer_src, tokenizer_tgt,
        max_len, device
    )
    out_pt_text = tokenizer_tgt.decode(out_pt_ids.detach().cpu().numpy())

    # ---- TensorRT decode on SAME tensors ----
    out_trt_ids = greedy_decode_trt(
        trt_model, encoder_input, encoder_mask,
        tokenizer_src, tokenizer_tgt,
        max_len, device
    )
    out_trt_text = tokenizer_tgt.decode(out_trt_ids.detach().cpu().numpy())

    # ---- print SAME sample ----
    print("-" * 80)
    print(f"SOURCE:     {source_text}")
    print(f"TARGET:     {target_text}")
    print(f"PT  PRED:   {out_pt_text}")
    print(f"TRT PRED:   {out_trt_text}")
    print("-" * 80)

    return out_pt_ids, out_trt_ids


# run on batch 1 (both models decode the same sample)
_ = verbose_compare_same_sample(
    model, trt_model, val_dataloader,
    tokenizer_src, tokenizer_tgt,
    config["seq_len"], device,
    batch_index=1
)


NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0


--------------------------------------------------------------------------------
SOURCE:     Diana and Mary have left you, and Moor House is shut up, and you are so lonely.
TARGET:     Diana und Mary haben Sie verlassen; Moor-House ist verschlossen, und Sie sind einsam.
PT  PRED:   Mary und Mary haben dich verlassen , und Moor - House ist geschlossen . Und du bist so einsam .
TRT PRED:   Mary und Mary haben dich verlassen , und Moor - House ist geschlossen . Und du bist so einsam .
--------------------------------------------------------------------------------
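
Identical strings can be checked more strictly at the token level. The following is a minimal sketch, assuming the returned ID tensors are one-dimensional and have been converted to plain Python lists (for example with `.tolist()`). `first_mismatch` is a hypothetical helper, not part of the repository.

```python
def first_mismatch(ids_a, ids_b):
    """Return the index of the first differing token, or None if identical.

    If one sequence is a strict prefix of the other, the returned index is
    the length of the shorter sequence.
    """
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None


# Toy ID sequences for illustration only (not real tokenizer output).
print(first_mismatch([2, 15, 9, 3], [2, 15, 9, 3]))  # None
print(first_mismatch([2, 15, 9, 3], [2, 15, 7, 3]))  # 2
print(first_mismatch([2, 15, 9], [2, 15, 9, 3]))     # 3
```

In this notebook the call would look like `first_mismatch(out_pt_ids.tolist(), out_trt_ids.tolist())`, using the two tensors returned by `verbose_compare_same_sample`.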


## Logits Comparison (First Decoding Step)

The `compare_first_step_logits` function verifies numerical accuracy by comparing the logits from the first decoding step. It:

- Encodes the source sequence with both models
- Runs one decoder step (with [SOS] token) for both models
- Compares the output logits (one score per target-vocabulary token)
- Reports maximum and mean absolute differences

Small differences (below about 0.01) indicate that TensorRT reproduces the PyTorch outputs up to floating-point rounding. The threshold is a rule of thumb for this model, not a guarantee.

In [11]:
from dataset import causal_mask
def compare_first_step_logits(pt_model, trt_model, batch):
    device = torch.device("cuda")
    pt_model.eval()
    torch.set_grad_enabled(False)

    src = batch["encoder_input"].to(device)
    src_mask = batch["encoder_mask"].to(device)

    # PT: encoder + one decoder step (SOS)
    sos_idx = tokenizer_tgt.token_to_id("[SOS]")
    dec_in = torch.tensor([[sos_idx]], device=device, dtype=src.dtype)

    enc_pt = pt_model.encode(src, src_mask)
    dec_mask = causal_mask(1).type_as(src_mask).to(device)  # (1,1,1)
    out_pt = pt_model.decode(enc_pt, src_mask, dec_in, dec_mask)
    logits_pt = pt_model.project(out_pt[:, -1])  # (B, vocab)

    # TRT: same
    enc_trt = trt_model.encode(src, src_mask)
    dec_mask_trt = causal_mask(1).type_as(src_mask).to(device).unsqueeze(1)  # (1,1,1,1)
    out_trt = trt_model.decode(enc_trt, src_mask, dec_in, dec_mask_trt)
    logits_trt = trt_model.project(out_trt[:, -1])

    lp = logits_pt.detach().cpu().float()
    lt = logits_trt.detach().cpu().float()

    max_abs = (lp - lt).abs().max().item()
    mean_abs = (lp - lt).abs().mean().item()
    print("First-step logits diff:")
    print("  max abs:", max_abs)
    print("  mean abs:", mean_abs)

batch0 = next(iter(val_dataloader))
compare_first_step_logits(model, trt_model, batch0)


First-step logits diff:
  max abs: 0.007893800735473633
  mean abs: 0.0017762379720807076


## Encoder Output Comparison

This cell defines two functions that together check whether the encoder produces similar outputs in both models:

- **`pt_encode_trt_style`**: Adapts the PyTorch encoder call to use the same mask format as TensorRT (square mask shape)
- **`compare_encoder_outputs`**: Compares encoder outputs from both models and reports maximum and mean absolute differences

This confirms that the encoder component behaves correctly in the TensorRT version. Differences should be very small (< 0.001), attributable to floating-point precision.

In [12]:
def pt_encode_trt_style(pt_model, src, src_mask):
    # Make PT use the same square mask shape TRT encoder expects
    S = src.shape[1]
    if src_mask.dim() == 3:
        src_mask = src_mask.unsqueeze(1)   # (B,1,S)->(B,1,1,S)
    if src_mask.dim() == 4 and src_mask.shape[2] == 1:
        src_mask = src_mask.repeat(1,1,S,1)  # -> (B,1,S,S)
    src_mask = src_mask.float()  # binary float
    return pt_model.encode(src, src_mask)

def compare_encoder_outputs(pt_model, trt_model, batch):
    device = torch.device("cuda")
    pt_model.eval()
    torch.set_grad_enabled(False)

    src = batch["encoder_input"].to(device)
    src_mask = batch["encoder_mask"].to(device)

    enc_pt  = pt_encode_trt_style(pt_model, src, src_mask).detach().cpu().float()
    enc_trt = trt_model.encode(src, src_mask).detach().cpu().float()

    diff = (enc_pt - enc_trt).abs()
    print("ENCODER diff: max =", diff.max().item(), "mean =", diff.mean().item())

batch0 = next(iter(val_dataloader))
compare_encoder_outputs(model, trt_model, batch0)


ENCODER diff: max = 0.000983595848083496 mean = 7.902805373305455e-05


## Position-wise Encoder Analysis

This cell performs a detailed analysis of encoder output differences:

- Computes per-position mean differences across the hidden dimension
- Separates differences for **padded positions** (where mask = 0) vs **unpadded positions** (where mask = 1)
- Reports mean differences for each category and the overall maximum

This helps understand if differences are concentrated in padded regions (which are typically ignored) or in actual content positions.

In [13]:
device = torch.device("cuda")
batch0 = next(iter(val_dataloader))
src = batch0["encoder_input"].to(device)
src_mask = batch0["encoder_mask"].to(device)

enc_pt  = pt_encode_trt_style(model, src, src_mask).detach().cpu().float()
enc_trt = trt_model.encode(src, src_mask).detach().cpu().float()

# per-position mean diff across hidden dim
pos_diff = (enc_pt - enc_trt).abs().mean(-1).squeeze(0)   # (S,)

# padded vs unpadded positions (mask is (1,1,1,S), so this assumes batch size 1)
key_mask = src_mask.squeeze().cpu()  # (S,) with 0/1
pad_pos = key_mask == 0
unpad_pos = key_mask == 1

print("mean diff padded   :", pos_diff[pad_pos].mean().item())
print("mean diff unpadded :", pos_diff[unpad_pos].mean().item())
print("max diff overall   :", pos_diff.max().item())


mean diff padded   : 8.144229650497437e-05
mean diff unpadded : 3.956778527935967e-05
max diff overall   : 0.00023580221750307828


## Export Benchmark Results

Save the benchmark timing data to a CSV file for further analysis. The DataFrame contains:

- `pytorch_ms`: Latency for each PyTorch inference (milliseconds)
- `tensorrt_ms`: Latency for each TensorRT inference (milliseconds)
- `speedup_x`: Row-wise speedup ratio for each timed batch (PyTorch / TensorRT)

The `describe()` method provides summary statistics (mean, std, min, max, percentiles) for all columns.


In [14]:
m = min(len(pt_times), len(trt_times))
df = pd.DataFrame({
    "pytorch_ms": pt_times[:m],
    "tensorrt_ms": trt_times[:m],
})
df["speedup_x"] = df["pytorch_ms"] / df["tensorrt_ms"]

df.to_csv("benchmark_times.csv", index=False)
df.describe()


| statistic | pytorch_ms | tensorrt_ms | speedup_x |
|-----------|------------|-------------|-----------|
| count | 50.0 | 50.0 | 50.0 |
| mean | 932.016664 | 619.675165 | 4.612743 |
| std | 621.100976 | 1727.937622 | 5.182153 |
| min | 236.616705 | 84.566726 | 0.031 |
| 25% | 464.246127 | 155.562562 | 1.278591 |
| 50% | 699.716414 | 253.012468 | 2.540888 |
| 75% | 1252.439549 | 419.665658 | 6.157211 |
| max | 2660.253322 | 12050.38141 | 22.153174 |
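
The summary reports a mean `speedup_x` of about 4.61, while the ratio of the two mean latencies is 932.02 / 619.68 ≈ 1.50. These are different statistics: the first averages row-wise ratios, while the second is pulled down by a few very slow rows (the TensorRT column has a maximum of about 12,050 ms). The sketch below illustrates the difference on toy numbers, not on the measurements above. `speedup_summaries` is a hypothetical helper.

```python
from statistics import fmean, median


def speedup_summaries(baseline_ms, candidate_ms):
    """Three common ways to summarize a speedup from paired latencies."""
    if len(baseline_ms) != len(candidate_ms):
        raise ValueError("latency lists must have the same length")
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    return {
        "ratio_of_means": fmean(baseline_ms) / fmean(candidate_ms),
        "mean_of_ratios": fmean(ratios),
        "ratio_of_medians": median(baseline_ms) / median(candidate_ms),
    }


# Toy latencies; the last candidate value is a deliberate outlier.
pt_ms = [100.0, 200.0, 300.0]
trt_ms = [50.0, 100.0, 1200.0]
for name, value in speedup_summaries(pt_ms, trt_ms).items():
    print(f"{name}: {value:.3f}")
```

With a single outlier the three summaries disagree widely, which is why it can be worth reporting a median-based figure alongside the mean when the latency distribution has a long tail.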


## Optional: Full Validation

Uncomment the line below to run full validation on the PyTorch model. This will print multiple translation examples with source, target, and predicted outputs. This is useful for qualitative assessment of translation quality.

In [None]:
# run_validation(model, val_dataloader, tokenizer_src, tokenizer_tgt, config['seq_len'], device, print, 0, None, num_examples=10)


## Checking the TensorBoard Files

In [None]:
import tensorboard as tb
import os

In [None]:
# If you are connected over SSH, forward the port first:
#   ssh -L 6005:localhost:6005 user@jetson_ip
# Then open http://localhost:6005 in a browser on your host machine.

logdir = "./runs/tmodel/"
os.system(f"tensorboard --logdir {logdir} --port 6005")
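
`os.system` blocks the notebook cell until TensorBoard exits. As a hedged alternative, the sketch below builds the same command as an argument list so that it could be started in the background with `subprocess.Popen`. `tensorboard_command` is a hypothetical helper, and the launch lines are left commented out so the block has no side effects when run.

```python
import subprocess


def tensorboard_command(logdir, port=6005):
    """Build the TensorBoard CLI invocation as an argument list."""
    return ["tensorboard", "--logdir", str(logdir), "--port", str(port)]


cmd = tensorboard_command("./runs/tmodel/")
print(" ".join(cmd))

# To launch without blocking the notebook (requires TensorBoard on PATH):
# proc = subprocess.Popen(cmd)
# Later, stop the server with:
# proc.terminate()
```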

