**Configure runtime first:** Runtime > Change runtime type > Hardware accelerator: **GPU** (T4) > Save.

# Inference Pipeline and Format Selection (Course 2)

This notebook demonstrates model format selection and the inference pipeline. Inspired by [Intro to Inference: How to Run AI Models on a GPU](https://developers.google.com/learn/pathways/ai-models-on-gpu-intro).

Run the setup cell first. The notebook also runs on CPU, but inference is faster with the GPU runtime enabled (**Runtime > Change runtime type > GPU** in Colab).

In [8]:
# Colab setup: clone repo and install dependencies (run this cell first)
try:
    import google.colab
    get_ipython().system("git clone -q https://github.com/KarthikSriramGit/Project-Insight.git")
    get_ipython().run_line_magic("cd", "Project-Insight")
    get_ipython().system("pip install -q -r requirements.txt")
except ImportError:
    # Not running in Colab: assume the repo is already cloned and dependencies are installed.
    pass

In [9]:
# Setup: Colab runs from repo root after clone
import sys
from pathlib import Path
ROOT = Path(".").resolve()
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))
print(f"ROOT={ROOT}")

ROOT=C:\Users\skart\Desktop\ROG SSD\Git Repos\Project-Insight


## 1. Format selection by use case

In [10]:
from src.inference.format_selector import select_format

for use_case in ["research", "sharing", "local", "production", "portable"]:
    fmt, rationale = select_format(use_case, hardware="gpu")
    print(f"{use_case}: {fmt}")
    print(f"  {rationale[:85]}...")
    print()

research: safetensors
  Fast, secure weight serialization. Memory-mapped loading, no arbitrary code execution...

sharing: safetensors
  Fast, secure weight serialization. Memory-mapped loading, no arbitrary code execution...

local: gguf
  Compact, quantized format for local inference. Powers llama.cpp and run-on-laptop wor...

production: tensorrt
  Compiled engine for NVIDIA GPUs. Pre-optimized kernels, lowest latency and highest th...

portable: onnx
  Graph-level interchange format. Framework-agnostic, runs on ONNX Runtime, OpenVINO, T...
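
The mapping printed above can be pictured as a simple lookup table. The following is a minimal sketch, not the repository's `select_format`: the names `FORMAT_BY_USE_CASE` and `pick_format` are hypothetical, and the CPU fallback for TensorRT is an assumption added for illustration.

```python
# Hypothetical lookup table mirroring the output above.
# The real select_format in src/inference/format_selector.py may work differently.
FORMAT_BY_USE_CASE = {
    "research": "safetensors",
    "sharing": "safetensors",
    "local": "gguf",
    "production": "tensorrt",
    "portable": "onnx",
}


def pick_format(use_case: str, hardware: str = "gpu") -> str:
    """Return a model format name for a use case (illustrative only)."""
    key = use_case.strip().lower()
    if key not in FORMAT_BY_USE_CASE:
        raise ValueError(f"Unknown use case: {use_case!r}")
    fmt = FORMAT_BY_USE_CASE[key]
    # Assumption: TensorRT engines need an NVIDIA GPU, so fall back to ONNX otherwise.
    if fmt == "tensorrt" and hardware != "gpu":
        return "onnx"
    return fmt


for use_case in FORMAT_BY_USE_CASE:
    print(f"{use_case}: {pick_format(use_case)}")
```

A table keeps the policy easy to audit and extend; the hardware check shows one place where a rule might override the default.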



## 2. Inference metrics (p50, p90, throughput)

In [11]:
from src.inference.metrics import compute_metrics

total_latencies = [1.2, 1.1, 1.3, 1.0, 1.2]
first_token_latencies = [0.1, 0.09, 0.11, 0.1, 0.1]
token_counts = [64, 64, 64, 64, 64]

metrics = compute_metrics(
    total_latencies=total_latencies,
    first_token_latencies=first_token_latencies,
    token_counts=token_counts,
)
for k, v in metrics.items():
    print(f"{k}: {v:.4f}")

p50_latency_s: 1.2000
p90_latency_s: 1.2600
p50_ttft_s: 0.1000
p90_ttft_s: 0.1060
throughput_sustained_tok_s: 55.1724
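
The values above are consistent with linearly interpolated percentiles and with throughput defined as total tokens divided by total latency (320 tokens / 5.8 s). A minimal sketch under that assumption follows; `sketch_metrics` is a hypothetical stand-in, and the repository's `compute_metrics` may be implemented differently.

```python
import numpy as np


def sketch_metrics(total_latencies, first_token_latencies, token_counts):
    """Compute latency percentiles and sustained throughput (illustrative only)."""
    total = np.asarray(total_latencies, dtype=float)
    ttft = np.asarray(first_token_latencies, dtype=float)
    return {
        # np.percentile uses linear interpolation between samples by default.
        "p50_latency_s": float(np.percentile(total, 50)),
        "p90_latency_s": float(np.percentile(total, 90)),
        "p50_ttft_s": float(np.percentile(ttft, 50)),
        "p90_ttft_s": float(np.percentile(ttft, 90)),
        # Assumption: sustained throughput = all generated tokens / summed request time.
        "throughput_sustained_tok_s": float(sum(token_counts) / total.sum()),
    }


sketch = sketch_metrics(
    total_latencies=[1.2, 1.1, 1.3, 1.0, 1.2],
    first_token_latencies=[0.1, 0.09, 0.11, 0.1, 0.1],
    token_counts=[64, 64, 64, 64, 64],
)
for name, value in sketch.items():
    print(f"{name}: {value:.4f}")
```

With only five samples, p90 falls between the fourth and fifth sorted values, which is why it reads 1.26 rather than matching any single measurement.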


## 3. Inference pipeline with TinyLlama (no Hugging Face login)

This section uses TinyLlama 1.1B Chat, a public model that downloads without authentication. Gemma is a gated model, so to use it instead, run `from huggingface_hub import login; login()` first.

In [None]:
# Dependencies installed by setup cell; enable GPU in Runtime > Change runtime type


In [14]:
# Inference pipeline with TinyLlama (requires GPU runtime for best performance)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.inference.pipeline import InferencePipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
model = model.to(device)

pipe = InferencePipeline(model, tokenizer, device=device, max_new_tokens=64)
out = pipe.generate(["What was the peak brake pressure in vehicle V001?"], max_new_tokens=32)
print(out[0])

Device: cpu


Loading weights: 100%|██████████| 201/201 [00:03<00:00, 51.98it/s, Materializing param=model.norm.weight]


What was the peak brake pressure in vehicle V001? 
<|assistant|>
The peak brake pressure in vehicle V001 was 120 psi (8.2
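
To feed real measurements into the metrics from section 2, each request needs a total latency, a time to first token, and a token count. The following is a minimal sketch of collecting them, assuming generation is exposed as an iterator of tokens. Both `time_generation` and `fake_stream` are hypothetical helpers, not part of `InferencePipeline`; the fake stream only stands in for a model.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_generation(
    stream_fn: Callable[[str], Iterable[str]], prompt: str
) -> Tuple[float, Optional[float], int]:
    """Return (total_latency_s, first_token_latency_s, token_count) for one request."""
    start = time.perf_counter()
    first_token_s = None
    n_tokens = 0
    for _ in stream_fn(prompt):
        if first_token_s is None:
            # Time to first token: elapsed time when the first token arrives.
            first_token_s = time.perf_counter() - start
        n_tokens += 1
    total_s = time.perf_counter() - start
    return total_s, first_token_s, n_tokens


def fake_stream(prompt: str) -> Iterable[str]:
    """Stand-in for a streaming model: yields one word at a time with a small delay."""
    for word in prompt.split():
        time.sleep(0.001)
        yield word


total_s, ttft_s, n = time_generation(fake_stream, "peak brake pressure in V001")
print(f"total={total_s:.4f}s ttft={ttft_s:.4f}s tokens={n}")
```

Repeating this over several prompts yields the three lists that the metrics cell takes as input.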
