# Inference 101: Latency, Throughput & UX

This notebook is a hands-on introduction to the key performance aspects of running inference with Large Language Models (LLMs).

You’ll learn:

✅ What **latency** and **throughput** mean in the context of LLMs  
✅ Why these metrics often trade off against each other  
✅ How different parameters (like batch size, prompt length, sampling strategy) affect performance  
✅ How to measure and visualize **p50 vs p99 latency**, **first-token vs total latency**, and find the "sweet spot"  
✅ What this means for real-world **user experience**

We'll use **TensorRT-LLM** with the **PyTorch** backend and a **Hugging Face**-hosted model (**Mistral 7B Instruct**) for all experiments.

By the end of this notebook, you'll not only be able to benchmark an LLM — you'll know what the numbers actually mean and how to tune them for real applications.

## Preliminaries

**Before you begin**, make sure you have:

- An NVIDIA GPU environment
- Access to the gated [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) model on Hugging Face
- Your Hugging Face access [token](https://huggingface.co/settings/tokens)

Let's check which GPUs are available on our system:

In [None]:
!nvidia-smi

### Authenticating with Hugging Face

To download the model from Hugging Face, you’ll need to enter your personal access token.

The cell below provides a simple interface to enter and save your token securely. It will be cached locally, so you only need to do this once per environment.

➡️ You can find or generate your token at: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

Once saved, the token will allow seamless access to gated models like `mistralai/Mistral-7B-Instruct-v0.3`.


In [None]:
# ⬇️ Run this cell once
from ipywidgets import Password, Button, HBox, Output
import os, pathlib
import sys

from huggingface_hub import HfFolder, whoami

# ---- UI widgets ----
token_box = Password(
    description="HF Token:",
    placeholder="paste your Hugging Face token here",
    layout={"width": "450px"},
)
save_btn = Button(description="Save", button_style="success")
out = Output()

# ---- Callback ----
def save_token(_):
    out.clear_output()
    token = token_box.value.strip()
    with out:
        if not token:
            print("❌ No token entered.")
            return
        # Persist token
        HfFolder.save_token(token)                 # writes to ~/.cache/huggingface/token
        os.environ["HF_TOKEN"] = token             # current kernel env (optional)
        # Sanity-check who we are
        try:
            user = whoami(token)["name"]
            print(f"✅ Token saved. Logged in as: {user}")
        except Exception as e:
            print("⚠️ Token saved, but user lookup failed:", e)

save_btn.on_click(save_token)

display(HBox([token_box, save_btn]), out)

### Downloading and optimizing the model

We'll be using **TensorRT-LLM** with a **PyTorch** backend to run inference.  

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is an open-source library from NVIDIA designed specifically to **optimize and accelerate LLM inference** on NVIDIA GPUs.

The PyTorch backend provides:

- A familiar Pythonic API for rapid prototyping and ease of integration
- Compatibility with the Hugging Face model ecosystem
- Seamless fallback to PyTorch ops when certain layers or patterns can't be fully optimized

This makes it an ideal choice for developers who want **high performance** without sacrificing **flexibility**.

Compared to native unoptimized inference, it offers significantly better performance (especially in terms of throughput and latency) by leveraging features like:

- quantization
- KV cache management
- kernel fusion
- efficient batching

In this step, we’ll load the `mistralai/Mistral-7B-Instruct-v0.3` model from Hugging Face.

📌 **Note:** The first time you run this, it may take a few minutes to:

- Download the model weights into your local Hugging Face cache
- Optimize the model for your GPU (this step is automatic)

Subsequent runs will be faster, as both the model and its compiled artifacts will be reused.

In [None]:
from tensorrt_llm._torch import LLM
from tensorrt_llm import SamplingParams

# Instantiate the model once, reuse for every experiment
model = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")

## Executing inference

Let’s see what our model can do!

We’ll run inference on a batch of prompts and observe the outputs. This is your first direct interaction with the model.

### What is inference?

**Inference** is the process of using a trained model to generate outputs for new inputs without any learning or weight updates.

For large language models (LLMs), this means:

- Receiving a **prompt** (a string of text)
- Computing the most likely **next token(s)**
- Repeating the process token-by-token to generate a full response

Inference is the core of every production LLM system  — whether you’re building a chatbot, writing assistant, summarizer, or anything else.
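The token-by-token loop described above can be sketched with a toy stand-in for the model. Everything here is an assumption made for illustration: `NEXT_TOKEN_PROBS` is a made-up lookup table, and `greedy_generate` is a hypothetical helper, not part of TensorRT-LLM. A real LLM conditions on the whole context rather than only the last token, but the control flow is the same.

```python
# Toy autoregressive decoding loop.
# NEXT_TOKEN_PROBS is a made-up table standing in for a model's forward pass.
NEXT_TOKEN_PROBS = {
    "<s>": {"The": 0.9, "A": 0.1},
    "The": {"capital": 0.6, "future": 0.4},
    "capital": {"of": 1.0},
    "of": {"Germany": 0.7, "France": 0.3},
    "Germany": {"is": 1.0},
    "is": {"Berlin": 0.8, "</s>": 0.2},
    "Berlin": {"</s>": 1.0},
}


def greedy_generate(prompt_tokens, max_tokens=16, eos="</s>"):
    """Repeatedly pick the most likely next token until EOS or the token cap."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if probs is None:                       # toy table has no entry: stop
            break
        next_token = max(probs, key=probs.get)  # greedy choice
        if next_token == eos:                   # end-of-sequence reached
            break
        tokens.append(next_token)
        generated.append(next_token)
    return generated


print(greedy_generate(["<s>", "The", "capital"]))
```

Each pass through the loop corresponds to one decode step, which is why generation time grows with the number of tokens produced.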

### Sampling parameters

When generating text, the model can either:

- **Always pick the highest-probability token** (greedy decoding — fast and deterministic)
- **Sample from the probability distribution** over possible next tokens — which adds variety and creativity

We use the following parameters to control that behavior:

- `temperature = 0.7`: Controls randomness. Lower values → more confident, deterministic outputs. Higher values → more diverse, sometimes erratic responses.
- `top_p = 0.9`: Enables **nucleus sampling** — the model samples only from the top tokens that together make up 90% of the probability mass. Balances diversity and coherence.
- `top_k = 50`: Further restricts sampling to the top 50 tokens at each step (can be used alone or with `top_p`).
- `max_tokens = 512`: The maximum number of tokens to generate per prompt.
- `stop = ["</s>"]`: Tells the model to stop when it sees the end-of-sequence token.

Together, these settings ensure the output is diverse but not chaotic — a good balance for most use cases.
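To make these knobs more concrete, here is a minimal NumPy sketch of how temperature, top-k, and top-p could be applied to a single vector of logits. It is an illustration under our own assumptions, not how TensorRT-LLM implements sampling: `sample_next_token` is a hypothetical helper, and real implementations may apply the filters in a different order or renormalize between steps.

```python
import numpy as np


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    """Illustrative temperature / top-k / top-p sampling over one logits vector."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0.0:                   # treat T=0 as greedy decoding
        return int(np.argmax(logits))

    scaled = logits / temperature            # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # token ids, most likely first
    if top_k is not None:
        order = order[:top_k]                # keep the k most likely tokens

    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # smallest prefix whose cumulative probability reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        order = order[:cutoff]

    kept = probs[order] / probs[order].sum() # renormalise the survivors
    return int(rng.choice(order, p=kept))


# Example with a made-up 4-token vocabulary
print(sample_next_token([2.0, 1.0, 0.1, -1.0], rng=np.random.default_rng(0)))
```

Note that the filtering happens on a vector the model has already computed, so it adds very little work per decode step.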

Try changing the prompts and sampling values below to see how the model’s behavior changes!

In [None]:
prompts = [
        "Summer is",
        "The president of France is",
        "The capital of Germany is",
        "The future of AI is",
    ]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_tokens=512,
    stop=["</s>"]
)

outputs = model.generate(prompts, sampling_params)

for i, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}\n")

## Performance experiments

Before we dive into benchmarking, we’ll configure a few things to ensure our measurements are accurate and repeatable.

What this cell does:

- **Imports** all required Python libraries for timing, statistics, and plotting
- **Sets a fixed random seed** for reproducibility  
  > This doesn't significantly affect latency, but helps keep generated outputs repeatable between runs, which is useful when debugging or comparing sampling runs.
- **Initializes the tokenizer** from Hugging Face, which we’ll use to measure how many tokens the model actually generates
- **Defines two helper functions**:
  - `count_generated_tokens()` → counts only the tokens *generated* by the model, excluding the prompt
  - `timed_generate()` → wraps the model's `.generate()` call and returns:
    - total **latency**
    - number of **generated tokens**
    - raw **outputs** (including prompts and completions)

We also apply `nest_asyncio` to avoid runtime warnings from Jupyter's existing event loop — it won’t affect performance but ensures clean execution in notebook environments.

In [None]:
import os, time, random, math, json, itertools, statistics, gc
import numpy as np
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer
import nest_asyncio; nest_asyncio.apply()
import pandas as pd

# Reproducibility: will NOT change latency much, but helps keep generated tokens repeatable
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.benchmark = False  # slightly hurts speed but stabilises variance

# Use the same model card so we don’t download twice
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    use_fast=True
)

def count_generated_tokens(output_obj, prompt_text):
    """
    Return only the *new* tokens produced by the model
    (prompt tokens aren't counted towards throughput).
    """
    gen = output_obj.outputs[0]

    # Fast path: some TRT-LLM builds expose raw token IDs
    if hasattr(gen, "token_ids") and gen.token_ids is not None:
        return len(gen.token_ids)

    # Fallback: tokenize the generated string itself
    # (prompt not included, so no subtraction needed)
    return len(tokenizer.encode(gen.text, add_special_tokens=False))


def timed_generate(prompts, sampling_params):
    """Return latency, generated-token count, and the raw outputs."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    outputs = model.generate(prompts, sampling_params)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    gen_tokens = sum(count_generated_tokens(o, p)
                     for o, p in zip(outputs, prompts))
    return elapsed, gen_tokens, outputs

### Baseline latency and throughput

Let’s establish a **baseline** for how our model performs with a single prompt under typical sampling settings.

This test measures:

- **Latency**: the total time it takes to generate a complete response
- **Throughput**: the number of tokens generated per second

What we’re doing here:

- Use a **single input prompt** (feel free to change it!)
- Generate up to `128` tokens using moderate sampling (`temperature=0.7`, `top_p=0.9`)
- Call `timed_generate()` to:
  - time the entire inference run
  - count how many tokens were actually generated
- Print the generated response

This gives us a **reference point** to compare against later experiments where we vary different factors.

> This is also the simplest “real-world” case: one user, one prompt, one answer.

In [None]:
prompts = ["The future of AI is"]  # single-prompt baseline
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128, stop=["</s>"])

lat, toks, outs = timed_generate(prompts, sampling_params)
print(f"Latency: {lat:.3f}s  |  Throughput: {toks/lat:.1f} tok/s")

for o in outs:
    print(o.outputs[0].text)

### Batch size sweep

Now we’re going to benchmark how inference performance changes with **different batch sizes** — that is, how many prompts we process in parallel.

This experiment helps us understand the **tradeoff between throughput and latency**, and find the batch size that balances performance and responsiveness.

For each batch size in:

```python
[1, 8, 32, 128, 256, 512]
```

we run 10 repetitions and record:

- **Time to first token (TTFT)** — how quickly the model starts responding
- **Total latency** — how long it takes to generate the full output
- **p50 and p99 latency** — to capture both typical and tail performance
- **Throughput** — total tokens generated per second (real tokens, not just assumed)

Why this matters:

- Small batches (1–8) respond quickly but underutilize the GPU
- Large batches (128–512) maximize throughput but increase individual latency
- Real-time systems typically operate at a batch size that balances tail latency (p99) and system throughput
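The tradeoff in the list above comes down to two different views of throughput: what the whole system produces versus what a single request experiences. The sketch below shows the arithmetic. `throughput_views` is a hypothetical helper, and the latencies in the loop are made-up numbers for illustration only, not measurements from this notebook.

```python
def throughput_views(batch_size, tokens_per_request, latency_s):
    """Return (aggregate tok/s across the batch, tok/s seen by one request)."""
    per_request = tokens_per_request / latency_s
    aggregate = batch_size * per_request
    return aggregate, per_request


# Made-up latencies purely for illustration; substitute your own measurements.
for bs, lat in [(1, 2.0), (32, 2.5), (256, 8.0)]:
    agg, per_req = throughput_views(bs, tokens_per_request=128, latency_s=lat)
    print(f"BS={bs:>3} | aggregate {agg:8.1f} tok/s | per request {per_req:5.1f} tok/s")
```

Aggregate throughput can keep climbing with batch size even while each individual request gets slower, which is exactly the tension the sweep is meant to expose.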

> These runs can take a few minutes, especially at large batch sizes — start executing the cell and grab a coffee!

We will be saving the results in the file `batch_benchmark.csv` so you can reuse or visualize the data later.

In [None]:
BATCH_SIZES     = [1, 8, 32, 128, 256, 512]
RUNS_PER_SIZE   = 10
MAX_TOKENS_FULL = 128        # length for total-latency test
SAMPLING_OPTS   = dict(temperature=0.7, top_p=0.9, stop=["</s>"])

records = []

for bs in BATCH_SIZES:
    first_latencies, total_latencies = [], []
    gen_token_total = 0
    for _ in range(RUNS_PER_SIZE):
        prompts = ["The future of AI is"] * bs

        # TTFT – first token
        t_first, _, _ = timed_generate(
            prompts,
            SamplingParams(**SAMPLING_OPTS, max_tokens=1)
        )
        first_latencies.append(t_first)

        # Total latency – full answer
        t_total, gen_tokens, _ = timed_generate(
            prompts,
            SamplingParams(**SAMPLING_OPTS, max_tokens=MAX_TOKENS_FULL)
        )
        total_latencies.append(t_total)
        gen_token_total += gen_tokens

        gc.collect(); torch.cuda.empty_cache()

    # derive stats once per batch size
    p50_first = statistics.quantiles(first_latencies, n=100)[49]
    p99_first = statistics.quantiles(first_latencies, n=100)[98]
    p50_total = statistics.quantiles(total_latencies, n=100)[49]
    p99_total = statistics.quantiles(total_latencies, n=100)[98]

    # throughput = tokens produced per second (generated tokens only)
    # total generated tokens divided by total wall-clock time across all runs
    thru = gen_token_total / sum(total_latencies)

    # We can also assume every run produced MAX_TOKENS_FULL tokens per request
    # thru = (bs * MAX_TOKENS_FULL) / p50_total

    records.append(dict(
        batch_size   = bs,
        p50_ttft     = p50_first,
        p99_ttft     = p99_first,
        p50_latency  = p50_total,
        p99_latency  = p99_total,
        throughput   = thru
    ))

    print(f"BS={bs:>3} | p50={p50_total:.3f}s | p99={p99_total:.3f}s | thru={thru:.1f} tok/s")

df = pd.DataFrame.from_records(records).sort_values("batch_size")
display(df)  # Jupyter pretty-prints

# Persist for later sessions
df.to_csv("batch_benchmark.csv", index=False)
print("✅ Benchmark finished – data stored in DataFrame `df` and batch_benchmark.csv")

### Interpreting our benchmarks

Now that we've run our performance sweep, let’s dive into what the results actually mean — and what insights we can extract from them.

#### First-token vs total latency, p50 vs p99

A critical aspect of LLM user experience is how quickly the model starts responding. This is often referred to as:

- **Time to first token (TTFT)**: the latency until the first token appears on screen.
- **Total latency**: the time it takes to generate the *entire* response.

In many real-world applications (e.g. chatbots, assistants), **TTFT is more important than total latency** — users start reading as soon as the model begins responding.
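In a streaming setup, both numbers can be read off a single response by timing when the first token arrives and when the stream ends. The sketch below illustrates the idea with a simulated stream. `fake_token_stream` is a hypothetical stand-in that just sleeps, not a TensorRT-LLM API, and its delays are arbitrary; `measure_stream` works on any Python iterator of tokens.

```python
import time


def fake_token_stream(n_tokens, prefill_s=0.05, per_token_s=0.01):
    """Hypothetical stand-in for a streaming LLM response (not a real API)."""
    time.sleep(prefill_s)                  # pretend prompt processing (prefill)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)        # pretend per-token decode step
        yield f"tok{i}"


def measure_stream(stream):
    """Return (ttft_seconds, total_seconds, token_count) for any token iterator."""
    t0 = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0   # first token has arrived
        count += 1
    total = time.perf_counter() - t0
    return ttft, total, count


ttft, total, n = measure_stream(fake_token_stream(20))
print(f"TTFT: {ttft:.3f}s | total: {total:.3f}s | tokens: {n}")
```

The benchmark in this notebook takes a simpler route and approximates TTFT with a separate `max_tokens=1` run, which avoids needing a streaming API.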

Latency can vary between runs due to scheduling, GPU memory pressure, or sampling variability. That’s why we track:

- **p50 latency**: the median latency — what a “typical” user experiences.
- **p99 latency**: the value that 99% of requests come in under, so it captures the slowest 1% that users hit during worst-case delays.
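One practical caveat when computing these percentiles from a handful of runs: the result depends on the interpolation method. The sketch below compares three ways of getting p99 from ten latencies. The numbers are made up for illustration. With Python's default `statistics.quantiles` method (`"exclusive"`), a p99 from so few samples is extrapolated and can exceed the slowest run actually observed.

```python
import statistics

import numpy as np

# Made-up latencies in seconds (ten runs, one slow outlier)
latencies = [1.02, 1.05, 1.01, 1.04, 1.03, 1.06, 1.02, 1.05, 1.04, 1.50]

p50 = statistics.median(latencies)
# Default method is "exclusive": extrapolates past the data for extreme quantiles
p99_exclusive = statistics.quantiles(latencies, n=100)[98]
# "inclusive" treats min and max as the 0th and 100th percentiles
p99_inclusive = statistics.quantiles(latencies, n=100, method="inclusive")[98]
# NumPy's default is linear interpolation between observed values
p99_numpy = float(np.percentile(latencies, 99))

print(f"max observed    : {max(latencies):.3f}s")
print(f"p50             : {p50:.3f}s")
print(f"p99 (exclusive) : {p99_exclusive:.3f}s")
print(f"p99 (inclusive) : {p99_inclusive:.3f}s")
print(f"p99 (numpy)     : {p99_numpy:.3f}s")
```

For a trustworthy p99 you generally want hundreds of samples; with 5 to 30 runs, treat it as a rough indicator of the slowest run.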

Why this matters:

- A low p50 is nice.
- A high p99 is a problem. It means some users are waiting 2–3× longer than expected.
- **Good systems optimize for p99, not just average.**

Let’s now visualize how **batch size** affects all of these:

- Time to first token (TTFT)
- Total latency
- Latency variability (p50 vs. p99)

In [None]:
bs = df["batch_size"]

plt.figure(figsize=(7,4))
plt.plot(bs, df["p50_ttft"],  'o-', label='TTFT p50')
plt.plot(bs, df["p99_ttft"],  'x-', label='TTFT p99')
plt.plot(bs, df["p50_latency"], 's--', label='Total p50')
plt.plot(bs, df["p99_latency"], '^--', label='Total p99')
plt.xlabel("Batch size"); plt.ylabel("Latency (s)")
plt.title("First-token vs Total latency across batch sizes")
plt.legend(); plt.grid(True); plt.tight_layout()
plt.show()

This plot shows that first-token latency (TTFT) remains low across all batch sizes, while total latency increases sharply, especially at higher batch sizes. The widening gap between p50 and p99 total latency indicates that tail latency becomes a problem as batch size grows, which can degrade user experience. For real-time systems, batch sizes above 128–256 may introduce noticeable delays, even if throughput improves. Because TTFT stays stable, batching remains viable for streaming use cases, where the first token appears quickly and the rest of the response streams in.

#### Throughput vs p50 latency

Now let’s visualize the tradeoff between throughput and p50 latency. Higher batch sizes typically improve throughput by maximizing GPU utilization, but also increase latency. Our goal is to identify the “sweet spot” — the batch size that delivers strong throughput without compromising responsiveness, striking the right balance for real-time inference.

In [None]:
lat_n  = df["p50_latency"]  / df["p50_latency"].max()          # 0-1
thr_inv= 1 - df["throughput"] / df["throughput"].max()         # 0-1, inverted
gap    = (lat_n - thr_inv).abs()
best_i = gap.argmin()

plt.figure(figsize=(7,4))
plt.plot(bs, lat_n,  'o-', label='p50 latency (norm)')
plt.plot(bs, thr_inv,'^--', label='throughput (inverted norm)')
plt.scatter(bs.iloc[best_i], lat_n.iloc[best_i], c='red', s=120,
            label=f'sweet spot ≈ BS {bs.iloc[best_i]}')
plt.xlabel("Batch size"); plt.ylabel("Normalised metric (0-1)")
plt.title("Latency vs Throughput – knee point")
plt.legend(); plt.grid(True); plt.tight_layout()
plt.show()

This plot shows the tradeoff between p50 latency and throughput as batch size increases, helping us identify the “sweet spot” for inference performance. As batch size grows, latency increases while throughput improves (the throughput curve is inverted so both can be compared on the same axis).

The two curves intersect around batch size 128, which represents the **knee point** — the smallest batch size that gets us close to maximum throughput without incurring excessive latency. Going beyond this point yields diminishing returns: throughput flattens while latency continues to rise sharply. For many real-time or near-real-time use cases, this intersection represents the best balance between speed and efficiency.

#### Finding the sweet spot based on UX budget

Choosing the optimal batch size isn’t just about throughput — it’s about respecting your users’ **latency expectations**.

This section helps you find the best-performing batch size that stays within a given **latency budget**, defined in seconds. You can adjust the value of `latency_budget` depending on your application’s needs (e.g. 2.5s for chat, 5s for summarization, etc.).

What this code does:

- Filters all benchmark results to include only configurations where latency is **below your target threshold**
- Then selects the **batch size with the highest throughput** from the remaining options
- Runs this selection twice:
  - Once using **p50 latency** (typical case)
  - Once using **p99 latency** (worst-case tail latency)

Try changing the `latency_budget` value and observe how the recommended batch size shifts depending on the metric used.

In [None]:
latency_budget = 2.5  # seconds

filtered_p50 = df[df["p50_latency"] <= latency_budget]
filtered_p99 = df[df["p99_latency"] <= latency_budget]

if not filtered_p50.empty:
    best_row_p50 = filtered_p50.loc[filtered_p50["throughput"].idxmax()]
    print(f"✅ Recommended: batch_size={int(best_row_p50.batch_size)}  "
          f"⇒  p50 latency = {best_row_p50.p50_latency:.2f}s, "
          f"throughput = {best_row_p50.throughput:.0f} tok/s")
else:
    print(f"❌ No configuration meets the latency budget of {latency_budget:.1f}s based on p50 latency")

if not filtered_p99.empty:
    best_row_p99 = filtered_p99.loc[filtered_p99["throughput"].idxmax()]
    print(f"✅ Recommended: batch_size={int(best_row_p99.batch_size)}  "
          f"⇒  p99 latency = {best_row_p99.p99_latency:.2f}s, "
          f"throughput = {best_row_p99.throughput:.0f} tok/s")
else:
    print(f"❌ No configuration meets the latency budget of {latency_budget:.1f}s based on p99 latency")

### Assessing impact of other parameters

So far, we’ve seen that **batch size** directly affects latency and throughput. But batch size isn’t the only factor that matters.

Let’s now explore how **other inference parameters**, starting with **output sequence length**, influence performance.

#### Sequence Length

The **sequence length** `max_tokens` parameter defines how many tokens the model is allowed to generate per prompt.

In practice:

- Short outputs (e.g. 32 tokens) return quickly
- Long outputs (e.g. 512 tokens) take significantly more time, especially at high batch sizes

This is because the **prefill cost is fixed for a given prompt**, while the **decode phase scales roughly linearly** with the number of tokens generated. The experiment below makes this visible.

What this experiment does:

- Uses a fixed batch size (`batch = 128`)
- Varies the `max_tokens` cap from 32 to 512
- Measures **mean** and **p99 latency** across 5 runs for each setting (with so few runs, p99 is an extrapolated estimate of the tail, not a true measurement)

> Note: This experiment can take a few minutes — longer sequence lengths at high batch size are compute-intensive.

Let’s visualize how increasing the output length impacts both average latency and tail latency.

In [None]:
seq_lengths = [32, 64, 128, 256, 512]
batch = 128
lat_means, lat_p99s = [], []
for mt in seq_lengths:
    sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=mt, stop=["</s>"])
    lats = [ timed_generate(["Summer is"]*batch, sampling_params)[0] for _ in range(5) ]
    lat_means.append(sum(lats)/len(lats))
    lat_p99s.append(statistics.quantiles(lats, n=100)[98])

plt.figure()
plt.plot(seq_lengths, lat_means, marker='o', label='Mean latency')
plt.plot(seq_lengths, lat_p99s, marker='x', label='p99 latency')
plt.xlabel('max_tokens'); plt.ylabel('Latency (s)'); plt.title('Latency vs Generated Length')
plt.legend(); plt.grid(True); plt.show()

Once the cell has finished, you can see that latency grows roughly linearly with the number of tokens generated. For chat UX, cap responses (e.g., 128–256 tokens) or stream early chunks.
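If the relationship really is close to linear, a straight-line fit gives two useful numbers: a fixed overhead (roughly the prefill and scheduling cost) and a per-token decode cost. The sketch below shows one way to do that with `np.polyfit`. The measurements are placeholders, NOT results from this notebook; swap in your own `seq_lengths` and `lat_means`. `predict_latency` is a hypothetical helper. Keep in mind that `max_tokens` is only a cap, so generations that stop early will pull measured latencies below the line.

```python
import numpy as np

# Placeholder measurements, NOT results from this notebook:
# replace with your own `seq_lengths` and `lat_means` from the sweep.
max_tokens = np.array([32, 64, 128, 256, 512], dtype=float)
latency_s = np.array([0.9, 1.7, 3.3, 6.5, 12.9])

# Fit latency ≈ fixed_overhead + per_token * max_tokens
per_token, fixed_overhead = np.polyfit(max_tokens, latency_s, deg=1)

print(f"fixed overhead : {fixed_overhead:.3f} s")
print(f"per-token cost : {per_token * 1000:.1f} ms/token")


def predict_latency(n_tokens):
    """Estimate latency for a given output length from the fitted line."""
    return fixed_overhead + per_token * n_tokens


print(f"estimated latency at 1024 tokens: {predict_latency(1024):.1f} s")
```

An estimate like this is handy for picking a `max_tokens` cap that fits a latency budget before running the full sweep again.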

#### Sampling knobs (temperature & top-p) and determinism

You might be wondering — do **sampling parameters** like `temperature` and `top_p` affect latency? Let's check!

We run multiple generations using different combinations of:

- `temperature = [0.0, 0.7, 1.3]`
- `top_p     = [0.8, 0.9, 1.0]`

For each pair, we measure:

- **Mean latency**
- **p99 latency**

This helps us assess whether sampling diversity impacts performance.

Let’s visualize the results!

In [None]:
# Benchmark settings
temps = [0.0, 0.7, 1.3]
tops  = [0.8, 0.9, 1.0]
grid  = list(itertools.product(temps, tops))

# Results containers
mean_lats = []
p99_lats  = []

# Benchmark loop
for t, p in grid:
    sp = SamplingParams(temperature=t, top_p=p, max_tokens=64, stop=["</s>"])
    lats = [timed_generate(["The president of France is"], sp)[0] for _ in range(30)]
    mean_lat = statistics.mean(lats)
    p99_lat = statistics.quantiles(lats, n=100)[98]

    mean_lats.append(mean_lat)
    p99_lats.append(p99_lat)

    print(f"T={t:<3} top_p={p:<3} | mean {mean_lat:.3f}s | p99 {p99_lat:.3f}s")

# Visualization
x = range(len(grid))
labels = [f"T={t}, p={p}" for (t, p) in grid]

plt.figure(figsize=(8, 4))
plt.plot(x, mean_lats, marker='o', label='Mean latency')
plt.plot(x, p99_lats, marker='x', label='p99 latency')
plt.xticks(x, labels, rotation=45, ha='right')
plt.xlabel('(temperature, top_p)')
plt.ylabel('Latency (s)')
plt.title('Latency vs Sampling Parameters')
plt.legend(); plt.grid(True); plt.tight_layout()
plt.show()


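You should see that the sampling strategy barely moves mean latency (sub-1% deltas in our runs), so focus on output quality, not performance, when tuning temperature and top-p.

The p99 values are noisier: with only 30 runs per setting, p99 is driven almost entirely by the slowest one or two runs, so its fluctuations are mostly measurement noise rather than an effect of the sampling parameters.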

## Note on reasoning models

While this notebook focuses on explaining LLM inference and its performance, it's important to recognize that not all language models are built the same.

Some are optimized for **throughput and deployment efficiency** (e.g. Mistral 7B, LLaMA 3 8B), while others are designed for **complex reasoning and instruction following** (e.g. GPT-4, Claude Opus, Gemini Pro, LLaMA 3 70B, NVIDIA Llama Nemotron).

When working with reasoning models, it's important to evaluate them not just on latency and throughput, but also on:

- Accuracy and depth of reasoning
- Consistency across temperature settings
- Ability to follow multi-step or long-form instructions

That said, **UX rules still apply**: a powerful model that takes 10 seconds to respond without streaming may still feel unusable, no matter how smart it is.

> In production, you'll often need to balance **speed** with **output quality** — especially for use cases involving multi-step reasoning or decision making.

👉 **Choose the model based on the application**: chat assistants, real-time agents, RAG pipelines, and batch summarization all have different performance vs. quality requirements.

The right tradeoff isn't about raw metrics — it's about delivering the right user experience for the task.

## Summary

In this notebook, we explored the key performance aspects of running inference with large language models (LLMs) — and how to make informed tradeoffs between **latency**, **throughput**, and **user experience (UX)**.

### Key takeaways

- **Latency vs. Throughput is a balancing act**
  - Larger batch sizes boost throughput but increase response time  
  - The “sweet spot” depends on your UX latency budget (e.g. < 2.5s for chat)
- **First-token latency (TTFT) matters for UX**
  - TTFT stays low even at high batch sizes, making streaming UIs responsive  
  - Total latency, however, grows roughly linearly with output length and rises with batch size
- **p50 vs. p99 latency tells you more about real-world performance than an average**
  - Optimizing for p50 gives nice averages  
  - Optimizing for p99 ensures consistent UX and avoids tail spikes
- **Sequence length directly affects decode time**
  - Longer outputs = higher latency  
  - Always set `max_tokens` intentionally
- **Sampling parameters (temperature, top_p) have little impact on latency**
  - You can safely tune them for quality without worrying about performance  
  - High-entropy sampling (e.g. temperature at or above 1.3) may slightly increase p99, though in our runs p99 was too noisy to confirm this

### Final thought

LLM inference isn't just about making the model run fast — it's about **making it feel fast** for the user. By understanding and measuring the right metrics, you can deliver models that are not only efficient, but actually usable in production.