<a href="https://colab.research.google.com/github/AliNoorian/LLMOps_Series_Model_Selection/blob/main/LLMOps_Model_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# LLMOps Series — Model Selection (Colab/Jupyter Notebook)

> **Use this notebook to choose, test, and cost out LLMs for your use case.**  
> It includes setup cells, side‑by‑side comparisons (proprietary vs open‑source), latency/throughput tests, context‑window experiments, quantization notes, and lightweight benchmarking utilities.

**Contents**
1. [Environment Check & Setup](#env)
2. [Your Use Case Checklist](#checklist)
3. [Proprietary vs Open-Source: Decision Guide](#decision)
4. [Open-Source: Try a Small Model (Transformers)](#transformers)
5. [Open-Source: Try a GGUF Model (llama.cpp / llama-cpp-python)](#gguf)
6. [Latency & Throughput Testing](#perf)
7. [Context Window Experiments](#ctx)
8. [Prompt Engineering vs Fine‑Tuning (Overview + Demo)](#tuning)
9. [Cost Estimation — API & Self-Hosting Calculators](#cost)
10. [Minimal RAG Harness (Optional)](#rag)
11. [Production Inference (vLLM/TGI) — Optional Installs](#prod)
12. [Quick Benchmarking Utilities](#bench)
13. [References & Next Steps](#refs)

---

**Two Main Model Types**  
**Proprietary** (e.g., GPT-5, Claude, Gemini) → _Plug-and-play_, top-tier performance, pay‑per‑use, limited data control.  
**Open-Source** (e.g., LLaMA, Mistral, Falcon, Zephyr) → _Full control_, lower long‑term cost, infra/DevOps required.

> **General rule:** Start fast with proprietary APIs, then migrate to open‑source for cost/privacy control.



---
<a id="env"></a>

## 1) Environment Check & Setup

This section verifies Python version, GPU availability, and installs commonly used libraries.  
Run each cell once per runtime.


In [None]:

import sys, platform, os, subprocess, json, textwrap, math, time, random
from datetime import datetime

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
!nvidia-smi || echo "No NVIDIA GPU detected."


Python: 3.12.11
Platform: Linux-6.6.97+-x86_64-with-glibc2.35
CUDA_VISIBLE_DEVICES: None
Wed Sep 24 08:35:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   46C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+--------

In [None]:

# Core libraries used across the notebook
!pip -q install transformers accelerate sentencepiece bitsandbytes --upgrade
!pip -q install llama-cpp-python --upgrade
# Optional helpers
!pip -q install einops datasets evaluate tiktoken --upgrade




---
<a id="checklist"></a>

## 2) Your Use Case Checklist

Before choosing a model, clarify:

- **Use case**: chatbot, summarizer, code assistant, RAG, agent, etc.
- **Privacy**: healthcare/finance/government constraints?
- **Budget**: API pay‑per‑use vs GPU hosting?
- **Scale**: daily active users (DAU), peak RPS, latency targets?
- **Fit**: prompt‑only vs fine‑tune; domain‑specific data?

Run the next cell to record your choices. You can re-run and modify anytime.


In [None]:

from dataclasses import dataclass, asdict

@dataclass
class UseCaseConfig:
    name: str = "My Assistant"
    use_case: str = "chatbot"
    privacy_level: str = "standard"  # options: standard, high, extreme
    budget_mode: str = "api"         # options: api, self-host, hybrid
    target_latency_ms: int = 800
    target_rps: float = 2.0
    need_finetune: bool = False
    context_window_tokens: int = 8000
    notes: str = "add any constraints here"

cfg = UseCaseConfig()
print(cfg)


UseCaseConfig(name='My Assistant', use_case='chatbot', privacy_level='standard', budget_mode='api', target_latency_ms=800, target_rps=2.0, need_finetune=False, context_window_tokens=8000, notes='add any constraints here')
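
Because the config is a plain dataclass, it can be checked and saved next to your benchmark results. The following is a minimal sketch, not part of the original notebook: the `ALLOWED` table and the `validate` helper are hypothetical names, the option lists are taken from the comments in the cell above, and the dataclass is repeated only so the block runs on its own.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class UseCaseConfig:
    name: str = "My Assistant"
    use_case: str = "chatbot"
    privacy_level: str = "standard"  # options: standard, high, extreme
    budget_mode: str = "api"         # options: api, self-host, hybrid
    target_latency_ms: int = 800
    target_rps: float = 2.0
    need_finetune: bool = False
    context_window_tokens: int = 8000
    notes: str = "add any constraints here"

# Allowed values, copied from the comments on the dataclass fields.
ALLOWED = {
    "privacy_level": {"standard", "high", "extreme"},
    "budget_mode": {"api", "self-host", "hybrid"},
}

def validate(config):
    """Return a list of human-readable problems (empty if the config looks sane)."""
    problems = []
    for field, options in ALLOWED.items():
        value = getattr(config, field)
        if value not in options:
            problems.append(f"{field}={value!r} not in {sorted(options)}")
    if config.target_latency_ms <= 0:
        problems.append("target_latency_ms must be positive")
    if config.target_rps <= 0:
        problems.append("target_rps must be positive")
    return problems

demo_cfg = UseCaseConfig(privacy_level="high", budget_mode="hybrid")
print("Problems:", validate(demo_cfg))
print(json.dumps(asdict(demo_cfg), indent=2))  # serializable record of your choices
```

Saving the JSON alongside latency and cost numbers makes it easier to see later which constraints a given measurement was meant to satisfy.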



---
<a id="decision"></a>

## 3) Proprietary vs Open‑Source — Decision Guide

| Factor | Proprietary (GPT/Claude/Gemini) | Open-Source (LLaMA/Mistral/Falcon/Zephyr) |
|---|---|---|
| **Speed to MVP** | ◎ Fast | ○ Medium |
| **Peak Quality** | ◎ Very high | ○ High (varies by model/size) |
| **Cost at Scale** | △ Increases with usage | ◎ Can be cheaper long‑term |
| **Data Control** | △ Limited | ◎ Full |
| **Customization** | ○ Prompting & fine‑tune (sometimes) | ◎ Full (fine‑tune/quantize) |
| **Ops Overhead** | ◎ Low | △ Requires MLOps/DevOps |

**Rule of thumb:** Prototype on proprietary APIs → baseline quality/perf → evaluate open‑source (quantized) for cost/privacy.



---
<a id="transformers"></a>

## 4) Open‑Source: Try a Small Model (Transformers)

Below we load a small model to keep downloads quick in Colab. You can swap to any compatible causal LM on Hugging Face.


In [None]:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch, os

# Choose a lightweight model for demo
model_id = os.environ.get("DEMO_MODEL_ID", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")

print("Loading:", model_id)
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

pipe = pipeline("text-generation", model=model, tokenizer=tok, device_map="auto")
res = pipe("You are a helpful assistant. Q: What's a good first step for LLM model selection?A:", max_new_tokens=120, do_sample=False)
print(res[0]['generated_text'])


Loading: TinyLlama/TinyLlama-1.1B-Chat-v1.0


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



`torch_dtype` is deprecated! Use `dtype` instead!



Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


You are a helpful assistant. Q: What's a good first step for LLM model selection?A: Start with a simple model and then add more layers to it. Q: How can I make sure my LLM model is not overfitting?A: Use regularization techniques like dropout, weight decay, and batch normalization. Q: How can I improve the performance of my LLM model on a specific task?A: Fine-tune the model on a specific task using a smaller dataset. Q: How can I evaluate the performance of my LLM model on a new task?A: Use a validation set and compare the performance to the performance on the original task. Q
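
The output above keeps inventing new Q/A pairs, which is typical when a chat-tuned model receives a raw string instead of its chat format. The sketch below builds a message list and renders it in the Zephyr-style markup that TinyLlama's chat model is understood to use; that markup is an assumption here, and `build_messages` and `render_zephyr_style` are hypothetical helpers. In the notebook itself, the safer route is to let the tokenizer render the messages with `tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.

```python
def build_messages(user_text, system_text="You are a helpful assistant."):
    """Chat history in the role/content format used by chat templates."""
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ]

def render_zephyr_style(messages, eos="</s>"):
    """Hand-rolled rendering (assumed format); prefer the tokenizer's own template."""
    parts = [f"<|{m['role']}|>\n{m['content']}{eos}\n" for m in messages]
    parts.append("<|assistant|>\n")  # cue the model to answer as the assistant
    return "".join(parts)

messages = build_messages("What's a good first step for LLM model selection?")
chat_prompt = render_zephyr_style(messages)
print(chat_prompt)
```

Passing the rendered prompt to `pipe(...)` instead of the bare `Q: ... A:` string should make the model stop after one answer more reliably, though a 1.1B model will still make mistakes.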



---
<a id="gguf"></a>

## 5) Open‑Source: Try a GGUF Model (llama.cpp via `llama-cpp-python`)

**GGUF** enables running quantized models on CPU/GPU with low memory. Below is a small demo using a tiny GGUF model.  
Swap `gguf_url` to another model if desired (check model license/terms).
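
To get a feel for how much memory quantization saves, the weights-only footprint can be estimated as parameters × bits per weight ÷ 8. This is a rough sketch: it assumes the Q4_0 format costs about 4.5 bits per weight (4-bit values plus a 16-bit scale shared by each block of 32 weights), and it ignores the KV cache and runtime overhead. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 1.1e9  # TinyLlama-1.1B, nominal parameter count

for label, bits in [("fp16", 16), ("8-bit", 8), ("Q4_0 (~4.5 bpw, assumed)", 4.5)]:
    print(f"{label:>26}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
```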


In [None]:
import os, urllib.request, pathlib, shutil
from llama_cpp import Llama

base_dir = pathlib.Path("/content") if pathlib.Path("/content").exists() else pathlib.Path(".")
gguf_dir = base_dir / "gguf_models"
gguf_dir.mkdir(parents=True, exist_ok=True)

# A tiny GGUF for quick demo. Replace with another GGUF URL if you prefer.
# Example sources: TheBloke/*-GGUF on Hugging Face (respect licenses).
gguf_url = os.environ.get("DEMO_GGUF_URL",
    "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf" # tiny GGUF model
)
gguf_path = gguf_dir / "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"

if not gguf_path.exists():
    print("Downloading tiny GGUF...")
    urllib.request.urlretrieve(gguf_url, gguf_path)
else:
    print("GGUF already present:", gguf_path)

llm = Llama(model_path=str(gguf_path), n_ctx=2048, n_threads=os.cpu_count())
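# Stop when the model starts a new question; an empty-string stop sequence can match immediately and return no text.
out = llm("Q: Give me one sentence about why GGUF can be useful.A:", max_tokens=64, stop=["Q:"])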
print(out["choices"][0]["text"])

Downloading tiny GGUF...


llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from /content/gguf_models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:  





---
<a id="perf"></a>

## 6) Latency & Throughput Testing

**Latency:** time to first token / full response.  
**Throughput:** requests per second (RPS) or tokens/sec under load.

Below: simple utilities to measure both on the current pipeline.


In [None]:

import time, statistics, asyncio
from concurrent.futures import ThreadPoolExecutor

def time_single_inference(prompt, max_new_tokens=64):
    t0 = time.perf_counter()
    _ = pipe(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    t1 = time.perf_counter()
    return (t1 - t0) * 1000  # ms

# Warmup
_ = pipe("Warmup.", max_new_tokens=8, do_sample=False)

prompts = [f"Prompt {i}: Summarize LLMOps model selection in 1 sentence." for i in range(5)]
latencies = [time_single_inference(p, 64) for p in prompts]
print("Latency (ms) per request:", [round(x,1) for x in latencies])
print("Avg:", round(statistics.mean(latencies),1), "ms | p95:", round(statistics.quantiles(latencies, n=20)[-1],1), "ms")

# Simple concurrent throughput test
def run_one(p):
    return pipe(p, max_new_tokens=32, do_sample=False)

async def concurrent_test(n=5):
    loop = asyncio.get_running_loop()  # preferred inside a coroutine
    with ThreadPoolExecutor(max_workers=n) as ex:
        t0 = time.perf_counter()
        futs = [loop.run_in_executor(ex, run_one, f"Concurrent {i}: Say 'ok'.") for i in range(n)]
        res = await asyncio.gather(*futs)
        t1 = time.perf_counter()
    total_time = t1 - t0
    print(f"Completed {n} requests in {total_time:.2f}s → {n/total_time:.2f} RPS")

await concurrent_test(4)


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Latency (ms) per request: [4546.9, 1633.2, 114.3, 56.6, 62.9]
Avg: 1282.8 ms | p95: 6586.5 ms
Completed 4 requests in 6.63s → 0.60 RPS
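
With only five samples, `statistics.quantiles(..., n=20)[-1]` extrapolates past the largest observation, which is why the reported p95 (6586.5 ms) exceeds the slowest request (4546.9 ms). A nearest-rank percentile always returns a value that was actually measured. The sketch below applies it to the latencies printed above; `nearest_rank` is a hypothetical helper, and treating the first two calls as warm-up is an assumption based on how much slower they are.

```python
import math
import statistics

def nearest_rank(samples, pct):
    """Smallest observed value such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

recorded_ms = [4546.9, 1633.2, 114.3, 56.6, 62.9]  # latencies printed above

print("p50:", nearest_rank(recorded_ms, 50), "ms")
print("p95:", nearest_rank(recorded_ms, 95), "ms")
# The first two calls look like warm-up (assumption); compare with the steady state.
print("median of remaining calls:", statistics.median(recorded_ms[2:]), "ms")
```

For percentiles you intend to report, collect many more than five samples and discard the warm-up calls first.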



---
<a id="ctx"></a>

## 7) Context Window Experiments

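Test how latency changes as input length grows, and use the results to pick the right **context window** for your use case. Keep inputs within the model's maximum sequence length (2048 tokens for TinyLlama): beyond it, the tokenizer warns about indexing errors and the timings are no longer meaningful.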


In [None]:

def synth_context(n_words=1000):
    # Generate a synthetic passage ~n_words
    words = ["llmops","scaling","latency","throughput","quantization","context","window","benchmark","tokens","inference"]
    return " ".join(random.choice(words) for _ in range(n_words))

for words in [200, 1000, 3000]:
    ctx = synth_context(words)
    t0 = time.perf_counter()
    _ = pipe(f"Read this and answer in 1 sentence: {ctx}\nQuestion: What are two performance levers?", max_new_tokens=64, do_sample=False)
    t1 = time.perf_counter()
    print(f"Input ~{words} words → time {t1 - t0:.2f}s")


Input ~200 words → time 4.24s


Token indices sequence length is longer than the specified maximum sequence length for this model (4492 > 2048). Running this sequence through the model will result in indexing errors


Input ~1000 words → time 3.55s


This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


Input ~3000 words → time 8.29s
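
The warnings above show that the larger inputs exceeded the model's 2048-token limit. One way to keep the experiment valid is to trim the context to a token budget before generating. The sketch below uses a whitespace word count as a stand-in token counter so it runs without a model; `fit_to_budget`, `whitespace_count`, and the 96-token reserve are assumptions for illustration. In the notebook you could pass `lambda s: len(tok(s)["input_ids"])` as `count_tokens` instead.

```python
def fit_to_budget(text, count_tokens, max_tokens):
    """Keep the longest prefix of whole words whose token count fits the budget."""
    words = text.split()
    lo, hi = 0, len(words)
    # Binary search: token count grows with the number of words kept.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_tokens(" ".join(words[:mid])) <= max_tokens:
            lo = mid
        else:
            hi = mid - 1
    return " ".join(words[:lo])

def whitespace_count(s):
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(s.split())

MODEL_MAX = 2048     # maximum sequence length reported in the warnings above
RESERVED = 64 + 32   # new tokens plus instruction/question overhead (assumed)

passage = " ".join(["latency"] * 3000)
trimmed = fit_to_budget(passage, whitespace_count, MODEL_MAX - RESERVED)
print("Words kept:", whitespace_count(trimmed))
```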



---
<a id="tuning"></a>

## 8) Prompt Engineering vs Fine‑Tuning

- **Prompting**: Fast to iterate, zero training cost.
- **Fine‑tuning**: Best for domain/format adherence & compliance. For small tasks, use **LoRA/QLoRA** to reduce cost.

Below is a *minimal* LoRA fine‑tune sketch on a toy two‑example dataset that you can adapt. For real training, use far more data, tune the number of steps or epochs, and run on a GPU.
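
To see why LoRA is cheap, note that each adapted linear layer of shape `d_in × d_out` gains only `r · (d_in + d_out)` trainable weights. The sketch below counts them for rank 8 on `q_proj` and `v_proj`. The layer dimensions are assumptions about TinyLlama-1.1B (hidden size 2048, 22 layers, 256-wide value projection); check `base.config` for the real values before relying on the result.

```python
def lora_params(r, d_in, d_out):
    """Trainable weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

HIDDEN, KV_DIM, LAYERS, RANK = 2048, 256, 22, 8  # assumed TinyLlama-1.1B dimensions

per_layer = lora_params(RANK, HIDDEN, HIDDEN) + lora_params(RANK, HIDDEN, KV_DIM)  # q_proj + v_proj
total = per_layer * LAYERS
print(f"LoRA trainable parameters: {total:,}")
print(f"Share of a 1.1B-parameter base model: {total / 1.1e9:.3%}")
```

After `get_peft_model(...)`, calling `peft_model.print_trainable_parameters()` reports the actual count for comparison.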


In [None]:

# Minimal LoRA sketch using PEFT (optional)
!pip -q install peft datasets --upgrade

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling, pipeline
from peft import LoraConfig, get_peft_model

# Tiny toy dataset (replace with your domain data)
train_texts = [
    "### Instruction: In one sentence, define LLMOps.\n### Response: LLMOps is the practice of operating, monitoring, and optimizing large language model systems in production.",
    "### Instruction: List two ways to reduce latency.\n### Response: Use quantization and faster inference backends like vLLM or TGI."
]
dataset = Dataset.from_dict({"text": train_texts})

tok2 = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
dc = DataCollatorForLanguageModeling(tok2, mlm=False)

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj","v_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
peft_model = get_peft_model(base, lora_cfg)

def tok_fn(batch):
    return tok2(batch["text"], truncation=True, max_length=512)

tok_ds = dataset.map(tok_fn, batched=True)
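args = TrainingArguments(
    output_dir="./lora-out",
    per_device_train_batch_size=1,
    num_train_epochs=1,       # ignored here: max_steps takes precedence when set
    learning_rate=2e-4,
    logging_steps=1,
    save_steps=5,
    max_steps=10,
    report_to="none",         # skip the interactive Weights & Biases login prompt
)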
trainer = Trainer(model=peft_model, args=args, data_collator=dc, train_dataset=tok_ds)
trainer.train()

# Inference with adapted model
pipe_lora = pipeline("text-generation", model=peft_model, tokenizer=tok2, device_map="auto")
print(pipe_lora("### Instruction: In one sentence, define LLMOps.\n### Response:", max_new_tokens=60, do_sample=False)[0]['generated_text'])




| Step | Training Loss |
|---|---|
| 1 | 3.4803 |
| 2 | 4.8002 |
| 3 | 3.2849 |
| 4 | 4.6109 |
| 5 | 3.137 |
| 6 | 4.4764 |
| 7 | 4.4249 |
| 8 | 2.9741 |
| 9 | 4.3271 |
| 10 | 2.9286 |


Device set to use cuda:0


### Instruction: In one sentence, define LLMOps.
### Response: LLMOps is a set of tools and techniques that enable the efficient and effective management of large-scale, distributed, and heterogeneous data centers. It includes tools for monitoring, automation, and optimization of data center infrastructure, as well as tools for managing and analyzing data center



---
<a id="cost"></a>

## 9) Cost Estimation — API & Self‑Hosting Calculators

Use these helpers to compare **API per‑token pricing** vs **GPU hosting**. Adjust numbers below for your scenario.


In [None]:

from math import ceil

def estimate_api_cost(req_per_day=10000, in_tokens=600, out_tokens=300, price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    daily_tokens_in = req_per_day * in_tokens
    daily_tokens_out = req_per_day * out_tokens
    cost_in = (daily_tokens_in/1000) * price_in_per_1k
    cost_out = (daily_tokens_out/1000) * price_out_per_1k
    return {"daily_usd": cost_in + cost_out, "monthly_usd": 30*(cost_in+cost_out)}

def estimate_gpu_hosting(num_gpus=1, hourly_gpu_cost=1.2, monthly_fixed=300):
    # hourly_gpu_cost: e.g., on-demand A10/A100 instance cost; adjust for your cloud
    gpu_month = 24*30*hourly_gpu_cost*num_gpus
    return {"monthly_usd": gpu_month + monthly_fixed}

api = estimate_api_cost()
gpu = estimate_gpu_hosting(num_gpus=2, hourly_gpu_cost=1.8, monthly_fixed=200)

print("API cost (example):", api)
print("GPU hosting (example):", gpu)

def break_even(api_monthly, gpu_monthly):
    if gpu_monthly <= 0: return "n/a"
    return api_monthly / gpu_monthly

print("Break‑even (API_monthly / GPU_monthly):", break_even(api["monthly_usd"], gpu["monthly_usd"]))


API cost (example): {'daily_usd': 7.5, 'monthly_usd': 225.0}
GPU hosting (example): {'monthly_usd': 2792.0}
Break‑even (API_monthly / GPU_monthly): 0.08058739255014327
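
The ratio printed above (about 0.08) says the example API bill is roughly 8% of the example GPU bill, but it does not say at what traffic level the two meet. A more direct break-even figure is the daily request volume at which API spend equals the fixed hosting cost. The sketch below reuses the example prices and the $2,792 monthly GPU figure from the output above; `cost_per_request` and `break_even_requests_per_day` are hypothetical helpers.

```python
def cost_per_request(in_tokens=600, out_tokens=300,
                     price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    """API cost in USD for a single request with the given token counts."""
    return in_tokens / 1000 * price_in_per_1k + out_tokens / 1000 * price_out_per_1k

def break_even_requests_per_day(gpu_monthly_usd, per_request_usd, days=30):
    """Daily request volume at which API spend equals the fixed GPU bill."""
    if per_request_usd <= 0:
        raise ValueError("per_request_usd must be positive")
    return gpu_monthly_usd / (per_request_usd * days)

per_req = cost_per_request()
volume = break_even_requests_per_day(2792.0, per_req)
print(f"API cost per request: ${per_req:.5f}")
print(f"Break-even volume: {volume:,.0f} requests/day")
```

With these example numbers, self-hosting only pays off above roughly 124,000 requests per day, and only if the two GPUs can actually serve that load at your latency target.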



---
<a id="rag"></a>

## 10) Minimal RAG Harness (Optional)

A tiny example using `tiktoken` for chunking and naive retrieval. For production, consider tools like LlamaIndex or LangChain.


In [None]:

import re, math
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, tokens_per_chunk=300):
    toks = encoder.encode(text)
    chunks = []
    for i in range(0, len(toks), tokens_per_chunk):
        sub = encoder.decode(toks[i:i+tokens_per_chunk])
        chunks.append(sub)
    return chunks

# Stand-in "embedding": pseudo-random vectors seeded by a hash of the text (demo only; similarities carry no semantic meaning)
def embed(text):
    random.seed(hash(text) % (2**32))
    return [random.random() for _ in range(64)]

def cosine(a,b):
    num = sum(x*y for x,y in zip(a,b))
    da = math.sqrt(sum(x*x for x in a))
    db = math.sqrt(sum(x*x for x in b))
    return num/(da*db + 1e-9)

# Build a toy index
docs = [
    "LLMOps involves monitoring, cost control, and performance optimization.",
    "Quantization reduces model size and improves latency at some accuracy cost.",
    "vLLM and TGI are high‑throughput inference backends."
]
chunks = [c for d in docs for c in chunk_text(d, 80)]
vecs = [embed(c) for c in chunks]

def retrieve(query, k=2):
    qv = embed(query)
    sims = [(cosine(qv, v), i) for i,v in enumerate(vecs)]
    sims.sort(reverse=True)
    return [chunks[i] for _, i in sims[:k]]

q = "How to reduce LLM latency?"
ctx = "\n\n".join(retrieve(q, k=3))
print("Retrieved context:\n", ctx)

print("\nAnswer:")
print(pipe(f"Answer using context only.\nContext:\n{ctx}\n\nQ: {q}\nA:", max_new_tokens=120, do_sample=False)[0]['generated_text'])


Retrieved context:
 LLMOps involves monitoring, cost control, and performance optimization.

Quantization reduces model size and improves latency at some accuracy cost.

vLLM and TGI are high‑throughput inference backends.

Answer:
Answer using context only.
Context:
LLMOps involves monitoring, cost control, and performance optimization.

Quantization reduces model size and improves latency at some accuracy cost.

vLLM and TGI are high‑throughput inference backends.

Q: How to reduce LLM latency?
A: Use quantization to reduce model size and improve latency at some accuracy cost.

Q: What are LLMOps and how do they involve monitoring, cost control, and performance optimization?
A: LLMOps involves monitoring, cost control, and performance optimization.

Q: What is LLM and what is its role in LLMOps?
A: LLM is a language model that is used for language modeling tasks. It is used in LLMOps to reduce latency at some accuracy cost.

Q: What are TGI and how do they
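
Because the stand-in embedding above is pseudo-random, the ranking it produces is arbitrary; the example only looks right because `k=3` returns all three documents. A still-naive but meaningful alternative is a bag-of-words vector, where cosine similarity reflects shared terms. This is a sketch with hypothetical helper names (`bow_embed`, `bow_cosine`, `bow_retrieve`), not a substitute for a real embedding model.

```python
import math
import re
from collections import Counter

def bow_embed(text):
    """Sparse term-count vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def bow_cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def bow_retrieve(query, docs, k=1):
    """Return the k documents that share the most (normalized) vocabulary with the query."""
    qv = bow_embed(query)
    ranked = sorted(docs, key=lambda d: bow_cosine(qv, bow_embed(d)), reverse=True)
    return ranked[:k]

demo_docs = [
    "LLMOps involves monitoring, cost control, and performance optimization.",
    "Quantization reduces model size and improves latency at some accuracy cost.",
    "vLLM and TGI are high-throughput inference backends.",
]
print(bow_retrieve("How to reduce LLM latency?", demo_docs, k=1))
```

Only the quantization sentence shares a term ("latency") with the query, so it ranks first. Exact-term matching misses paraphrases, which is the gap real embedding models fill.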



---
<a id="prod"></a>

## 11) Production Inference (vLLM/TGI) — Optional

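These backends raise throughput through techniques such as continuous batching and more efficient KV‑cache management. Installs can take several minutes and require a GPU with enough memory for the chosen model.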

**vLLM (example):**
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
# Then query the OpenAI-compatible endpoint, e.g. POST /v1/completions (port 8000 by default)
```

**Text Generation Inference (TGI):**
```bash
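# TGI is usually run from its container image; `pip install text-generation` installs only the Python client.
# Gated models such as Llama 3 also need a Hugging Face access token passed to the container.
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct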
```

> You can try these in Colab, but for production use managed endpoints or your own cloud GPU VMs.
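
Once a server exposing an OpenAI-compatible API is running, a completion request is a plain JSON POST. The sketch below only builds the request so it can run without a server; the host, port, and the helper name `build_completion_request` are assumptions, and the commented lines show how it might be sent.

```python
import json
import urllib.request

def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build (but do not send) a POST to an OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

demo_req = build_completion_request(
    "http://localhost:8000",  # assumed host and port
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "Summarize LLMOps model selection in one sentence.",
)
print(demo_req.get_method(), demo_req.full_url)

# To send it once a server is running:
# with urllib.request.urlopen(demo_req, timeout=60) as resp:
#     print(json.loads(resp.read())["choices"][0]["text"])
```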



---
<a id="bench"></a>

## 12) Quick Benchmarking Utilities

Micro‑benchmarks for comparing prompts, decoding parameters, or small model swaps. For comprehensive evaluations, use EleutherAI's `lm-evaluation-harness`.


In [None]:

tests = [
    ("Closed‑book QA", "Q: What is LLMOps in one sentence? A:"),
    ("Instruction Following", "Follow exactly: Reply with 'YES'."),
    ("Reasoning (Toy)", "I have 3 apples and buy 2 more, then eat 1. How many left?"),
]
for name, prompt in tests:
    t0 = time.perf_counter()
    out = pipe(prompt, max_new_tokens=64, do_sample=False)[0]['generated_text']
    dt = time.perf_counter() - t0
    print(f"=== {name} ===")
    print(out.strip())
    print(f"Time: {dt:.2f}s\n")


=== Closed‑book QA ===
Q: What is LLMOps in one sentence? A: LLMOps is a library for implementing LLVM optimizations.
Time: 0.42s

=== Instruction Following ===
Follow exactly: Reply with 'YES'.

2. "I'm not sure if I want to do this. Can you give me more information?" Follow exactly: Reply with 'YES'.

3. "I'm not sure if I want to do this. Can you give me more information on the benefits?" Follow exactly: Rep
Time: 1.86s

=== Reasoning (Toy) ===
I have 3 apples and buy 2 more, then eat 1. How many left?

- I have 3 apples and buy 2 more, then eat 1. How many left?

- I have 3 apples and buy 2 more, then eat 1. How many left?

- I have 3 apples and buy 2 more, then
Time: 1.85s
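
Reading raw generations by eye does not scale, so it can help to attach a pass/fail check to each test. The sketch below strips the echoed prompt and applies a per-test check. The checks, the helper names, and the two stub generators are hypothetical; in the notebook you could pass `lambda p: pipe(p, max_new_tokens=64, do_sample=False)[0]["generated_text"]` as `generate`.

```python
def strip_echo(prompt, generated):
    """Text-generation pipelines return prompt + completion; keep only the completion."""
    return generated[len(prompt):] if generated.startswith(prompt) else generated

def run_checks(cases, generate):
    """Map each test name to True/False according to its check function."""
    return {name: check(strip_echo(prompt, generate(prompt))) for name, prompt, check in cases}

cases = [
    ("Instruction Following", "Follow exactly: Reply with 'YES'.",
     lambda c: c.strip().strip("'\".").upper() == "YES"),   # exact match only
    ("Reasoning (Toy)", "I have 3 apples and buy 2 more, then eat 1. How many left?",
     lambda c: "4" in c),                                    # 3 + 2 - 1 = 4
]

def echo_model(prompt):
    # Stub that repeats the prompt, mimicking the looping output above.
    return prompt + "\n\n- " + prompt

ideal_answers = {cases[0][1]: " YES", cases[1][1]: " You have 4 apples left."}

def ideal_model(prompt):
    # Stub that answers correctly, to show what a pass looks like.
    return prompt + ideal_answers[prompt]

print("echo :", run_checks(cases, echo_model))
print("ideal:", run_checks(cases, ideal_model))
```

The exact-match check for the instruction test is deliberate: a looser "contains YES" check would pass a model that merely repeats the prompt.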

