# Inference Pipeline and Format Selection

This notebook demonstrates model format selection and a basic inference pipeline. It is inspired by [Intro to Inference: How to Run AI Models on a GPU](https://developers.google.com/learn/pathways/ai-models-on-gpu-intro).

Run the setup cell first. For faster inference in Colab, configure a GPU runtime beforehand: **Runtime > Change runtime type > Hardware accelerator: GPU (T4) > Save**.

In [1]:
# Colab setup: clone repo and install dependencies (run this cell first)
try:
    import google.colab
    get_ipython().system("git clone -q https://github.com/KarthikSriramGit/Project-Insight.git")
    get_ipython().run_line_magic("cd", "Project-Insight")
    get_ipython().system("pip install -q -r requirements.txt")
except ImportError:
    # Not running in Colab: assume the repo is already cloned and dependencies are installed
    pass

/content/Project-Insight


In [2]:
# Setup: Colab runs from repo root after clone
import sys
from pathlib import Path
ROOT = Path(".").resolve()
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))
print(f"ROOT={ROOT}")

ROOT=/content/Project-Insight


## 1. Format selection by use case

In [3]:
from src.inference.format_selector import select_format

for use_case in ["research", "sharing", "local", "production", "portable"]:
    fmt, rationale = select_format(use_case, hardware="gpu")
    print(f"{use_case}: {fmt}")
    print(f"  {rationale[:85]}...")
    print()

research: safetensors
  Fast, secure weight serialization. Memory-mapped loading, no arbitrary code execution...

sharing: safetensors
  Fast, secure weight serialization. Memory-mapped loading, no arbitrary code execution...

local: gguf
  Compact, quantized format for local inference. Powers llama.cpp and run-on-laptop wor...

production: tensorrt
  Compiled engine for NVIDIA GPUs. Pre-optimized kernels, lowest latency and highest th...

portable: onnx
  Graph-level interchange format. Framework-agnostic, runs on ONNX Runtime, OpenVINO, T...
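
The mapping printed above can be pictured as a plain lookup table. The sketch below is a hypothetical stand-in, not the repository's `select_format`: the names `FORMAT_TABLE` and `sketch_select_format` and the shortened rationale strings are assumptions for illustration, and the `hardware` argument is left out because this notebook does not show how it affects the result.

```python
# Hypothetical lookup table mirroring the use case -> format output printed above.
FORMAT_TABLE = {
    "research": ("safetensors", "Fast, secure weight serialization."),
    "sharing": ("safetensors", "Fast, secure weight serialization."),
    "local": ("gguf", "Compact, quantized format for local inference."),
    "production": ("tensorrt", "Compiled engine for NVIDIA GPUs."),
    "portable": ("onnx", "Graph-level interchange format."),
}


def sketch_select_format(use_case):
    """Return (format, rationale) for a known use case; raise ValueError otherwise."""
    try:
        return FORMAT_TABLE[use_case.lower()]
    except KeyError:
        known = ", ".join(sorted(FORMAT_TABLE))
        raise ValueError(
            f"Unknown use case {use_case!r}; expected one of: {known}"
        ) from None


for use_case in FORMAT_TABLE:
    fmt, rationale = sketch_select_format(use_case)
    print(f"{use_case}: {fmt} ({rationale})")
```

Raising on an unknown use case, rather than silently returning a default, is one possible design choice; the real selector may behave differently.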



## 2. Inference metrics (p50, p90, throughput)

In [4]:
from src.inference.metrics import compute_metrics

total_latencies = [1.2, 1.1, 1.3, 1.0, 1.2]
first_token_latencies = [0.1, 0.09, 0.11, 0.1, 0.1]
token_counts = [64, 64, 64, 64, 64]

metrics = compute_metrics(
    total_latencies=total_latencies,
    first_token_latencies=first_token_latencies,
    token_counts=token_counts,
)
for k, v in metrics.items():
    print(f"{k}: {v:.4f}")

p50_latency_s: 1.2000
p90_latency_s: 1.2600
p50_ttft_s: 0.1000
p90_ttft_s: 0.1060
throughput_sustained_tok_s: 55.1724
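
The numbers above are consistent with linearly interpolated percentiles and a throughput defined as total generated tokens divided by total latency. The sketch below reproduces them under that assumption; `percentile` and `sketch_metrics` are hypothetical helpers written for this illustration, and the repository's `compute_metrics` may be implemented differently.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between the two closest ranks (q in [0, 100])."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (q / 100) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


def sketch_metrics(total_latencies, first_token_latencies, token_counts):
    """Assumed definitions: p50/p90 of each latency list, tokens per second overall."""
    return {
        "p50_latency_s": percentile(total_latencies, 50),
        "p90_latency_s": percentile(total_latencies, 90),
        "p50_ttft_s": percentile(first_token_latencies, 50),
        "p90_ttft_s": percentile(first_token_latencies, 90),
        "throughput_sustained_tok_s": sum(token_counts) / sum(total_latencies),
    }


sample = sketch_metrics(
    total_latencies=[1.2, 1.1, 1.3, 1.0, 1.2],
    first_token_latencies=[0.1, 0.09, 0.11, 0.1, 0.1],
    token_counts=[64, 64, 64, 64, 64],
)
for key, value in sample.items():
    print(f"{key}: {value:.4f}")
```

For example, the p90 latency falls at rank 0.9 × 4 = 3.6 in the sorted list, which interpolates between 1.2 and 1.3 to give 1.26, and the throughput is 320 tokens / 5.8 s ≈ 55.17 tokens per second.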


## 3. Inference pipeline with Phi-3 Mini (no Hugging Face login)

This section uses **Microsoft Phi-3 Mini 4K Instruct** (3.8B parameters), a free, public model that fits on a T4 GPU in float16. It is considerably more capable than TinyLlama 1.1B while remaining lightweight, and it requires no Hugging Face login.
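
As a rough check on the memory claim, weight storage can be estimated as parameter count times bytes per parameter. The sketch below uses the 3.8B figure quoted above; `weight_memory_gb` is a hypothetical helper, and the 16 GB figure for the T4 is the card's nominal capacity (an assumption here, since Colab typically exposes somewhat less). The estimate covers weights only, not activations or the KV cache.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed for model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


PHI3_MINI_PARAMS = 3.8e9  # parameter count quoted in the text above
T4_MEMORY_GB = 16         # assumption: nominal T4 capacity

for dtype_name, bytes_per_param in [("float32", 4), ("float16", 2)]:
    need = weight_memory_gb(PHI3_MINI_PARAMS, bytes_per_param)
    headroom = T4_MEMORY_GB - need
    print(f"{dtype_name}: ~{need:.1f} GB of weights, ~{headroom:.1f} GB headroom")
```

In float16 the weights take roughly 7.6 GB, leaving about half of the card for activations and the KV cache, whereas float32 weights (about 15.2 GB) would leave almost nothing.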

In [6]:
# Inference pipeline with Phi-3 Mini (a GPU runtime is recommended for speed)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.inference.pipeline import InferencePipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
print(f"Model: {model_id}")

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    trust_remote_code=True,
)
model = model.to(device)

pipe = InferencePipeline(model, tokenizer, device=device, max_new_tokens=200)

def clean_response(raw_text, prompt_text=""):
    """Extract only the generated response, remove chat tags, trim to complete sentence."""
    text = raw_text
    if prompt_text and text.startswith(prompt_text):
        text = text[len(prompt_text):]
    for tag in ["<|assistant|>", "<|end|>", "</s>", "<|user|>"]:
        text = text.replace(tag, "")
    text = text.strip()
    if text and text[-1] not in ".!?":
        last_end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
        if last_end > 0:
            text = text[:last_end + 1]
    return text.strip()

queries = [
    "What was the peak brake pressure in vehicle V001?",
    "Summarize the fleet telemetry health for 10 autonomous vehicles.",
    "What model format should I use for production inference on NVIDIA GPUs?",
]

for q in queries:
    chat_prompt = f"<|user|>\n{q}<|end|>\n<|assistant|>\n"
    out = pipe.generate([chat_prompt], max_new_tokens=200)
    response = clean_response(out[0], chat_prompt)
    print(f"Q: {q}")
    print(f"A: {response}")
    print("-" * 60)

Device: cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



`torch_dtype` is deprecated! Use `dtype` instead!


What was the peak brake pressure in vehicle V001? 
<|assistant|>
The peak brake pressure in vehicle V001 was 120 psi (8.2
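
The answer above stops mid-number, at `(8.2`. The `clean_response` helper in the cell above looks for the last `.`, `!`, or `?` anywhere in the text, so on this kind of output it would cut at the decimal point and keep `(8.`. The sketch below is one possible stricter trim, not the notebook's implementation: `trim_to_sentence` is a hypothetical helper that assumes a sentence ends only where the punctuation is followed by whitespace.

```python
import re


def trim_to_sentence(text):
    """Drop a trailing partial sentence, ignoring periods inside numbers such as 8.2."""
    text = text.strip()
    if not text or text[-1] in ".!?":
        return text  # empty, or already ends on a complete sentence
    # Assumption: sentence-ending punctuation is followed by whitespace.
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s)", text)]
    # With no complete sentence available, keep the text rather than cut mid-number.
    return text[: ends[-1]] if ends else text


print(trim_to_sentence("Pressure peaked early. The value was 120 psi (8.2"))
print(trim_to_sentence("The value was 120 psi (8.2"))
```

The first call keeps only `Pressure peaked early.`; the second returns its input unchanged because no complete sentence precedes the cut-off.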
