# Inference Pipeline and Format Selection (Course 2)

This notebook demonstrates model format selection and the stages of an inference pipeline: tokenization, batching, the forward pass, and decoding. It is inspired by [Intro to Inference: How to Run AI Models on a GPU](https://developers.google.com/learn/pathways/ai-models-on-gpu-intro).

## Format selection by use case

In [2]:
# Mock of the missing 'src' module, for demonstration only
class MockSelector:
    @staticmethod
    def select_format(use_case, hardware="gpu"):
        """Return (format, rationale) for a use case; this mock ignores `hardware`."""
        formats = {
            "research": ("PyTorch (.bin)", "Native format, best for experimentation and debugging."),
            "sharing": ("Safetensors", "Secure, fast loading, and becoming the industry standard for weights."),
            "local": ("GGUF", "Optimized for CPU/GPU inference on consumer hardware via llama.cpp."),
            "production": ("TensorRT / ONNX", "Compiled for specific hardware to maximize throughput."),
            "portable": ("TFLite", "Designed for mobile and edge device deployment.")
        }
        return formats.get(use_case, ("Unknown", "No rationale available."))

# Using the mock instead of the missing import
select_format = MockSelector.select_format

for use_case in ["research", "sharing", "local", "production", "portable"]:
    fmt, rationale = select_format(use_case, hardware="gpu")
    print(f"{use_case}: {fmt}")
    # Append an ellipsis only when the rationale is actually truncated
    suffix = "..." if len(rationale) > 80 else ""
    print(f"  {rationale[:80]}{suffix}")
    print()

research: PyTorch (.bin)
  Native format, best for experimentation and debugging.

sharing: Safetensors
  Secure, fast loading, and becoming the industry standard for weights.

local: GGUF
  Optimized for CPU/GPU inference on consumer hardware via llama.cpp.

production: TensorRT / ONNX
  Compiled for specific hardware to maximize throughput.

portable: TFLite
  Designed for mobile and edge device deployment.



## Inference metrics (p50, p90, throughput)

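The metrics of interest are total latency (reported as p50 and p90), time to first token (TTFT), sustained throughput in tokens per second, and inter-token latency. The mock below computes the first three from per-request measurements.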
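
Because p50 and p90 are percentiles, a reported value does not have to appear in the raw data. The following is a minimal sketch of the linear-interpolation rule that `np.percentile` uses by default, written in plain Python. The helper name `percentile_linear` is hypothetical, and the sample list repeats the latencies this notebook uses.

```python
import math

def percentile_linear(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional position of the q-th percentile in the sorted list
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    # Blend the two neighbouring samples according to the fractional part
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

sample_latencies = [1.2, 1.1, 1.3, 1.0, 1.2]
p50 = percentile_linear(sample_latencies, 50)
p90 = percentile_linear(sample_latencies, 90)
print(f"p50: {p50:.4f}")
print(f"p90: {p90:.4f}")
```

With five samples the 90th percentile sits at fractional rank 3.6, so it is interpolated between the fourth and fifth sorted values rather than taken from either one.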

In [5]:
import numpy as np

def compute_metrics(total_latencies, first_token_latencies, token_counts):
    """Mock implementation of the missing src.inference.metrics function"""
    total_latencies = np.array(total_latencies)
    token_counts = np.array(token_counts)

    # Throughput: total tokens / summed latency (assumes requests ran one after another)
    throughput = np.sum(token_counts) / np.sum(total_latencies)

    return {
        "p50_latency": np.percentile(total_latencies, 50),
        "p90_latency": np.percentile(total_latencies, 90),
        "average_ttft": np.mean(first_token_latencies),
        "tokens_per_second": throughput
    }

total_latencies = [1.2, 1.1, 1.3, 1.0, 1.2]
first_token_latencies = [0.1, 0.09, 0.11, 0.1, 0.1]
token_counts = [64, 64, 64, 64, 64]

metrics = compute_metrics(
    total_latencies=total_latencies,
    first_token_latencies=first_token_latencies,
    token_counts=token_counts,
)
for k, v in metrics.items():
    print(f"{k}: {v:.4f}")

p50_latency: 1.2000
p90_latency: 1.2600
average_ttft: 0.1000
tokens_per_second: 55.1724
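
Inter-token latency is listed among the metrics but is not part of the mock's output. One way to estimate it, sketched here under the assumption that decoding time is spread evenly over the tokens after the first, is `(total_latency - TTFT) / (token_count - 1)` per request. The function name `inter_token_latencies` is hypothetical and not part of the missing `src` package.

```python
def inter_token_latencies(total_latencies, first_token_latencies, token_counts):
    """Estimate per-request average time between consecutive tokens.

    Assumes the time after the first token is divided evenly among
    the remaining (token_count - 1) tokens.
    """
    results = []
    for total, ttft, count in zip(total_latencies, first_token_latencies, token_counts):
        if count < 2:
            raise ValueError("need at least two tokens to measure a gap")
        results.append((total - ttft) / (count - 1))
    return results

# Same sample measurements as the metrics cell in this notebook
itl = inter_token_latencies(
    total_latencies=[1.2, 1.1, 1.3, 1.0, 1.2],
    first_token_latencies=[0.1, 0.09, 0.11, 0.1, 0.1],
    token_counts=[64, 64, 64, 64, 64],
)
mean_itl = sum(itl) / len(itl)
print(f"mean_inter_token_latency: {mean_itl:.4f}")
```

A real measurement would timestamp each token as it is produced; this estimate only needs the three lists the notebook already collects.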


## Inference pipeline with telemetry query

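To run the full pipeline with a real model, load a small LLM (e.g. Gemma 270M) and wrap it in `InferencePipeline`, which covers preprocessing (tokenization), batching, the forward pass, and decoding. The cell below substitutes the non-gated `facebook/opt-125m` so it runs without access approval.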
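
The batching step can be sketched independently of any model. This is a minimal illustration, assuming prompts are simply grouped in arrival order into fixed-size batches; the helper `iter_batches` and the example prompts other than the V001 query are hypothetical.

```python
def iter_batches(prompts, batch_size):
    """Yield consecutive slices of `prompts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]

prompts = [
    "What was the peak brake pressure in vehicle V001?",
    "Hypothetical prompt B",
    "Hypothetical prompt C",
]
batches = list(iter_batches(prompts, batch_size=2))
for i, batch in enumerate(batches):
    print(f"batch {i}: {len(batch)} prompt(s)")
```

Each batch could then be passed as the list argument to a pipeline's `generate` method. Larger batches generally raise throughput at the cost of per-request latency, so the batch size is a tuning choice.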

In [9]:
# Example: pipeline usage (requires transformers, torch)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

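# Mock/simplified implementation of InferencePipeline since 'src' is missing
class InferencePipeline:
    def __init__(self, model, tokenizer, device="cuda"):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        # Decoder-only models should be left-padded so that generation
        # continues from real tokens rather than from padding
        self.tokenizer.padding_side = "left"

    def generate(self, prompts, max_new_tokens=64):
        # Tokenize and pad the prompts into a single batch on the target device
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.device)
        # Forward pass without gradient tracking
        with torch.no_grad():
            outputs = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Decode token IDs back to text (prompt plus continuation)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)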

# Using a non-gated model for demonstration purposes
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Set device to cuda if available, else cpu
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

pipe = InferencePipeline(model, tokenizer, device=device)
out = pipe.generate(["What was the peak brake pressure in vehicle V001?"], max_new_tokens=32)
print(out[0])

What was the peak brake pressure in vehicle V001?
I don't know, I've never seen it. I've seen it in the video.
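
To feed a pipeline like this into the metrics from the previous section, each request needs a TTFT, a total latency, and a token count. The following is a rough sketch, assuming tokens arrive through some Python iterator such as a streaming generator. Both `time_stream` and `fake_stream` are hypothetical helpers, and `fake_stream` merely stands in for a real model's token stream.

```python
import time

def time_stream(token_iter):
    """Consume a token iterator and record TTFT, total latency, and token count."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            # Time from the request start until the first token arrives
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"ttft": ttft, "total_latency": total, "token_count": count}

def fake_stream(n_tokens, delay=0.001):
    """Hypothetical stand-in for a model that emits one token per `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

stats = time_stream(fake_stream(5))
print(f"tokens: {stats['token_count']}")
print(f"ttft <= total: {stats['ttft'] <= stats['total_latency']}")
```

Collecting one such dictionary per request yields the three lists that `compute_metrics` expects. The non-streaming `generate` call above returns all tokens at once, so it can supply total latency but not TTFT.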
