# Inference Pipeline and Format Selection (Course 2)

This notebook demonstrates model format selection and the inference pipeline (tokenization, batching, forward pass, decoding). Inspired by [Intro to Inference: How to Run AI Models on a GPU](https://developers.google.com/learn/pathways/ai-models-on-gpu-intro).

## Format selection by use case

In [None]:
import sys
from pathlib import Path

# Put the parent directory on sys.path so the `src` package can be imported from this notebook.
sys.path.insert(0, str(Path("..").resolve()))

from src.inference.format_selector import select_format, FORMAT_RATIONALE

for use_case in ["research", "sharing", "local", "production", "portable"]:
    fmt, rationale = select_format(use_case, hardware="gpu")
    print(f"{use_case}: {fmt}")
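    # Append an ellipsis only when the rationale was actually truncated.
    print(f"  {rationale[:80]}{'...' if len(rationale) > 80 else ''}")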
    print()
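
For readers without access to `src.inference.format_selector`, the idea behind format selection can be sketched as a simple lookup. This is a hypothetical illustration, not the implementation of `select_format`: the table `HYPOTHETICAL_FORMATS`, the helper `pick_format`, the chosen formats, and the rationale strings are all assumptions, and the real function may map use cases differently or take `hardware` into account.

```python
# Hypothetical mapping from use case to (format, rationale).
# These pairings are assumptions for illustration; the project's
# select_format may choose differently.
HYPOTHETICAL_FORMATS = {
    "research": ("safetensors", "Loads directly into training frameworks and avoids pickle-based code execution."),
    "sharing": ("safetensors", "Weights-only file that is safe to download from untrusted sources."),
    "local": ("gguf", "Single-file format used by llama.cpp, commonly paired with quantized weights."),
    "production": ("tensorrt", "Engine compiled ahead of time for a specific NVIDIA GPU."),
    "portable": ("onnx", "Open interchange format supported by several runtimes."),
}


def pick_format(use_case: str) -> tuple[str, str]:
    """Return (format, rationale) for a known use case, or raise ValueError."""
    try:
        return HYPOTHETICAL_FORMATS[use_case]
    except KeyError:
        known = ", ".join(sorted(HYPOTHETICAL_FORMATS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


for case in HYPOTHETICAL_FORMATS:
    fmt, why = pick_format(case)
    print(f"{case}: {fmt} ({why})")
```

A plain dictionary keeps the rules easy to audit; anything that depends on hardware or model size would need a richer rule set than this sketch shows.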

## Inference metrics (p50, p90, throughput)

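This cell summarizes four metrics: total latency (request start to last token), time to first token (TTFT), sustained throughput (generated tokens per unit time), and inter-token latency (the average gap between consecutive tokens after the first). Latencies are reported as percentiles such as p50 and p90.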

In [None]:
from src.inference.metrics import compute_metrics, timed_generate

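# Synthetic measurements for five requests (values assumed to be in seconds).
total_latencies = [1.2, 1.1, 1.3, 1.0, 1.2]
first_token_latencies = [0.1, 0.09, 0.11, 0.1, 0.1]
# Number of generated tokens per request.
token_counts = [64, 64, 64, 64, 64]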

metrics = compute_metrics(
    total_latencies=total_latencies,
    first_token_latencies=first_token_latencies,
    token_counts=token_counts,
)
for k, v in metrics.items():
    print(f"{k}: {v:.4f}")
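
As a cross-check, the same kind of summary can be computed by hand with the standard library only. This is a minimal sketch under stated assumptions: the helpers `percentile` and `summarize`, the dictionary keys, and the exact definitions (linear-interpolation percentiles, throughput as total tokens over total latency, inter-token latency as decode time divided by the number of gaps) are illustrative and may not match what `compute_metrics` does internally.

```python
def percentile(values, q):
    """Percentile q (0-100) of a non-empty list, using linear interpolation."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize(total_latencies, first_token_latencies, token_counts):
    """Hypothetical summary; key names are assumptions, not the project's API."""
    # Inter-token latency per request: time after the first token,
    # spread over the gaps between the remaining tokens.
    inter_token = [
        (total - first) / (count - 1)
        for total, first, count in zip(total_latencies, first_token_latencies, token_counts)
        if count > 1
    ]
    return {
        "latency_p50": percentile(total_latencies, 50),
        "latency_p90": percentile(total_latencies, 90),
        "ttft_p50": percentile(first_token_latencies, 50),
        "throughput_tokens_per_s": sum(token_counts) / sum(total_latencies),
        "inter_token_latency_mean": sum(inter_token) / len(inter_token),
    }


summary = summarize(
    total_latencies=[1.2, 1.1, 1.3, 1.0, 1.2],
    first_token_latencies=[0.1, 0.09, 0.11, 0.1, 0.1],
    token_counts=[64, 64, 64, 64, 64],
)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Reporting p90 alongside p50 matters because a few slow requests can leave the median unchanged while noticeably degrading tail latency.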

## Inference pipeline with telemetry query

To run the full pipeline with a real model, load a small LLM (e.g. Gemma 3 270M) and pass it to `InferencePipeline`. The pipeline covers four stages: preprocessing (tokenization), batching, the forward pass, and decoding.

In [None]:
# Example: pipeline usage (requires transformers, torch, and HF token)
# from src.inference.pipeline import InferencePipeline
# from transformers import AutoTokenizer, AutoModelForCausalLM
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
# model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it", device_map="cuda")
# pipe = InferencePipeline(model, tokenizer, device="cuda")
# out = pipe.generate(["What was the peak brake pressure in vehicle V001?"], max_new_tokens=64)
# print(out[0])
print("Pipeline usage: see comments above. Requires transformers, torch, a GPU, and an HF token.")
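
The four stages (tokenization, batching, forward pass, decoding) can also be traced without a GPU or any model download. The sketch below is a toy stand-in, not `InferencePipeline`: the whitespace tokenizer, the `toy_forward` rule (next id is the last id minus one, bottoming out at end-of-sequence), and every function name are assumptions chosen so the control flow is easy to follow. It expects at least one non-empty prompt.

```python
PAD, EOS = 0, 1  # reserved token ids in this toy vocabulary


def tokenize(prompts, vocab):
    """Stage 1: map whitespace-separated words to integer ids, growing the vocab."""
    return [[vocab.setdefault(word, len(vocab)) for word in p.split()] for p in prompts]


def left_pad(batch):
    """Stage 2: pad on the left so every row ends with its most recent token."""
    width = max(len(ids) for ids in batch)
    return [[PAD] * (width - len(ids)) + ids for ids in batch]


def toy_forward(batch):
    """Stage 3: stand-in for a model forward pass; returns one next-token id per row."""
    return [max(row[-1] - 1, EOS) for row in batch]


def generate(prompts, max_new_tokens=8):
    vocab = {"<pad>": PAD, "<eos>": EOS}
    batch = left_pad(tokenize(prompts, vocab))
    new_tokens = [[] for _ in batch]
    finished = [False] * len(batch)

    for _ in range(max_new_tokens):
        next_ids = toy_forward(batch)
        for i, next_id in enumerate(next_ids):
            if finished[i]:
                batch[i].append(PAD)  # keep the batch rectangular
                continue
            batch[i].append(next_id)
            if next_id == EOS:
                finished[i] = True
            else:
                new_tokens[i].append(next_id)
        if all(finished):
            break

    # Stage 4: decode generated ids back to text.
    id_to_word = {idx: word for word, idx in vocab.items()}
    return [" ".join(id_to_word[t] for t in row) for row in new_tokens]


outputs = generate(["a b c", "a b"])
print(outputs)
```

Left padding is used here because generation reads the last position of each row; with right padding, shorter prompts would end in padding tokens instead of real ones.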