# Latency Microbenchmark — TGI vs. Transformers pipeline (concept)
**TL;DR:** Capture baseline pipeline latency on CPU and scaffold fields for future Text Generation Inference runs.

**Models & Datasets:** [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) (Apache-2.0), [Text Generation Inference (pending)](https://huggingface.co/docs/text-generation-inference/index) (Apache-2.0), [UltraChat prompts (sample)](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (CC BY-SA 4.0)
**Run Profiles:** 🖥️ CPU | 🍎 Metal (Apple Silicon) | 🧪 Colab/T4 | ⚡ CUDA GPU
**Env (minimal):** python>=3.10, transformers, datasets, evaluate, accelerate (optional: peft, bitsandbytes, timm, diffusers)
**Colab:** [Open in Colab](https://colab.research.google.com/github/SSusantAchary/Hands-On-Huggingface-AI-Models/blob/main/notebooks/serving/tgi-vs-pipeline-latency_microbenchmark.ipynb)

**Switches (edit in one place):**
- `device` = {"cpu","mps","cuda"}
- `precision` = {"fp32","fp16","bf16","int8","4bit"}  (apply only if supported)
- `context_len` / `image_res` / `batch_size`
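
Because `precision` is marked "apply only if supported", a notebook may prefer to fall back rather than fail when a device cannot use the requested setting. The sketch below shows one way to do that. The `SUPPORTED` table and the `resolve_precision` name are hypothetical, and the table encodes conservative assumptions for illustration, not an authoritative list of what each backend supports.

```python
# Hypothetical support table (an assumption for this sketch, not a statement
# of what PyTorch or bitsandbytes actually support on each backend).
SUPPORTED = {
    "cpu": {"fp32", "bf16"},
    "mps": {"fp32", "fp16"},
    "cuda": {"fp32", "fp16", "bf16", "int8", "4bit"},
}


def resolve_precision(device: str, precision: str) -> str:
    """Return `precision` if the device family allows it, else fall back to fp32."""
    family = device.split(":")[0]  # "cuda:0" -> "cuda"
    if family not in SUPPORTED:
        raise ValueError(f"unknown device: {device!r}")
    return precision if precision in SUPPORTED[family] else "fp32"
```

Recording the resolved value, rather than the requested one, keeps the `precision` column of the benchmark row honest when a fallback happens.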

**Footprint & Speed (fill after run):**
- Peak RAM: TODO
- Peak VRAM: TODO (if GPU)
- TTFB: TODO, Throughput: TODO, Load time: TODO
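
The speed fields above can all be derived from a few wall-clock timestamps. The repo's `measure_memory_speed` helper is not shown in this notebook, so the following is an independent sketch of how such a recorder might work, not its implementation. `Recorder` and `measure_speed` are hypothetical names, and memory tracking is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Recorder:
    """Hypothetical stand-in for the recorder handed to a benchmark callback."""

    start: float = field(default_factory=time.perf_counter)
    first_token_at: Optional[float] = None
    items: int = 0

    def mark_first_token(self) -> None:
        # Only the first call is recorded; later calls are ignored.
        if self.first_token_at is None:
            self.first_token_at = time.perf_counter()

    def add_items(self, n: int) -> None:
        self.items += n


def measure_speed(fn: Callable[[Recorder], None]) -> dict:
    """Run fn(recorder) and derive TTFB and throughput from timestamps."""
    recorder = Recorder()
    fn(recorder)
    elapsed = time.perf_counter() - recorder.start
    ttfb = None
    if recorder.first_token_at is not None:
        ttfb = recorder.first_token_at - recorder.start
    throughput = recorder.items / elapsed if elapsed > 0 else None
    return {"ttfb_s": ttfb, "throughput_per_s": throughput, "elapsed_s": elapsed}
```

Under this sketch, throughput is total items divided by total elapsed time, so it includes any time spent before the first token.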

**Gotchas:** TGI requires a container and GPU setup; this is tracked in the Fixes entry ([Fixes & Tips](../fixes-and-tips/tgi-setup-todo.md))



## Setup
Measure pipeline latency now and leave TODO hooks for a future TGI deployment.


In [None]:

import json
import os
import subprocess
import time
from pathlib import Path

import torch
from datasets import load_dataset
from transformers import pipeline

from notebooks._templates.measure import append_benchmark_row, measure_memory_speed

DEVICE_PREFERENCE = os.environ.get("HF_DEVICE", "cpu")
PRECISION = os.environ.get("HF_PRECISION", "fp32")

def resolve_device(preference: str = "cpu") -> str:
    if preference == "cuda" and torch.cuda.is_available():
        return "cuda:0"
    if preference == "mps" and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

DEVICE = resolve_device(DEVICE_PREFERENCE)
print(f"Using device={DEVICE}")

PIPELINE_MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
OUTPUT_DIR = Path("outputs") / "tgi-vs-pipeline"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

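# ultrachat_200k exposes train_sft / test_sft / train_gen / test_gen splits; there is no plain "train".
prompts = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:4]")["prompt"]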


## Pipeline latency (baseline)


In [None]:

torch.manual_seed(42)

load_start = time.perf_counter()
generator = pipeline(
    "text-generation",
    model=PIPELINE_MODEL_ID,
    device=DEVICE,
    max_new_tokens=64,
    do_sample=False,
)
load_time = time.perf_counter() - load_start

pipeline_outputs = []
for prompt in prompts:
    pipeline_outputs.append(generator(prompt)[0]["generated_text"])

with open(OUTPUT_DIR / "pipeline_outputs.json", "w", encoding="utf-8") as fp:
    json.dump(pipeline_outputs, fp, indent=2)

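The pipeline call above returns only after all new tokens have been generated, so it cannot expose a true time to first token. With a streaming interface (for example, transformers' `TextIteratorStreamer`), the arrival time of each chunk could be recorded instead. The sketch below shows the timing logic on a stand-in stream. `timed_stream` and `fake_stream` are hypothetical names, and the fake stream only simulates latency with `time.sleep`.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_stream(chunks: Iterable[str]) -> Iterator[Tuple[float, str]]:
    """Yield (seconds_since_start, chunk) for every chunk of a stream."""
    start = time.perf_counter()  # clock starts when iteration begins
    for chunk in chunks:
        yield time.perf_counter() - start, chunk


def fake_stream() -> Iterator[str]:
    # Hypothetical stand-in for a token stream; sleeps simulate decode latency.
    for piece in ["Hello", " ", "world"]:
        time.sleep(0.01)
        yield piece


timings = list(timed_stream(fake_stream()))
ttfb_s = timings[0][0]    # arrival time of the first chunk
total_s = timings[-1][0]  # arrival time of the last chunk
```

In a real run the clock should start when the generation request is issued, which may be earlier than the first read from the stream.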

## Measurement


In [None]:

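def run_inference(recorder):
    for idx, prompt in enumerate(prompts):
        # return_full_text=False keeps the prompt out of the output so only new text is counted.
        result = generator(prompt, max_new_tokens=64, do_sample=False, return_full_text=False)
        if idx == 0:
            # The pipeline does not stream, so this marks the first complete response,
            # not the first generated token.
            recorder.mark_first_token()
        new_text = result[0]["generated_text"]
        # Count tokens (approximately, by re-tokenizing) rather than whitespace-split words.
        recorder.add_items(len(generator.tokenizer(new_text, add_special_tokens=False)["input_ids"]))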

metrics = measure_memory_speed(run_inference)

def fmt(value, digits=4):
    if value in (None, "", float("inf")):
        return ""
    return f"{value:.{digits}f}"

try:
    repo_commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
except Exception:  # noqa: BLE001
    repo_commit = ""

append_benchmark_row(
    task="tgi-pipeline-baseline",
    model_id=PIPELINE_MODEL_ID,
    dataset="ultrachat_200k",
    sequence_or_image_res="64-tokens",
    batch="1",
    peak_ram_mb=fmt(metrics.get("peak_ram_mb"), 2),
    peak_vram_mb=fmt(metrics.get("peak_vram_mb"), 2),
    load_time_s=fmt(load_time, 2),
    ttfb_s=fmt(metrics.get("ttfb_s"), 3),
    tokens_per_s_or_images_per_s=fmt(metrics.get("throughput_per_s"), 3),
    precision=PRECISION,
    notebook_path="notebooks/serving/tgi-vs-pipeline-latency_microbenchmark.ipynb",
    repo_commit=repo_commit,
)

TODO_TGI_NOTES = {
    "status": "pending",
    "notes": "Provision Text Generation Inference container and populate compare.csv",
}
with open(OUTPUT_DIR / "tgi_todo.json", "w", encoding="utf-8") as fp:
    json.dump({"metrics": metrics, "tgi": TODO_TGI_NOTES}, fp, indent=2)
metrics

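Once a TGI container is available, the pending comparison could time the same prompts against its `/generate` endpoint. The sketch below is a starting point under stated assumptions: the server is assumed to listen on `http://localhost:8080`, and `build_tgi_request` and `time_tgi_request` are hypothetical helper names. It has not been run against a live server in this notebook.

```python
import json
import time
import urllib.request
from typing import Tuple

# Assumption: a TGI container is reachable locally on port 8080.
TGI_URL = "http://localhost:8080/generate"


def build_tgi_request(prompt: str, max_new_tokens: int = 64) -> bytes:
    """Encode a request body that mirrors the pipeline settings (greedy, 64 new tokens)."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }
    return json.dumps(payload).encode("utf-8")


def time_tgi_request(prompt: str, url: str = TGI_URL, timeout: float = 60.0) -> Tuple[float, str]:
    """POST one prompt and return (latency_seconds, generated_text)."""
    request = urllib.request.Request(
        url,
        data=build_tgi_request(prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return time.perf_counter() - start, body["generated_text"]
```

This measures end-to-end request latency including HTTP overhead, so it is comparable to the pipeline's full-response time rather than to a streamed time to first token.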

## Results Summary
- Observations: TODO
- Metrics captured: see `benchmarks/matrix.csv`

## Next Steps
- TODOs: fill in after benchmarking

## Repro
- Seed: 42 (set in the pipeline baseline cell; decoding is greedy with `do_sample=False`, so the seed does not change the outputs)
- Libraries: captured via `detect_env()`
- Notebook path: `notebooks/serving/tgi-vs-pipeline-latency_microbenchmark.ipynb`
- Latest commit: populated automatically when appending benchmarks (if git is available)
