# Zero-shot Retrieval — CLIP text-image matching
**TL;DR:** Embed images and text prompts with CLIP and rank image-text matches by cosine similarity. Runs on CPU by default, with optional Metal (MPS) or CUDA acceleration.

**Models & Datasets:** [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) (MIT), [Horses or Humans (tiny sample)](https://huggingface.co/datasets/nateraw/horses_or_humans) (Apache-2.0)
**Run Profiles:** 🖥️ CPU | 🍎 Metal (Apple Silicon) | 🧪 Colab/T4 | ⚡ CUDA GPU
**Env (minimal):** python>=3.10, transformers, datasets, evaluate, accelerate (optional: peft, bitsandbytes, timm, diffusers)
**Colab:** [Open in Colab](https://colab.research.google.com/github/SSusantAchary/Hands-On-Huggingface-AI-Models/blob/main/notebooks/vision/zero-shot-clip-retrieval_cpu-first.ipynb)

**Switches (edit in one place):**
- `device` = {"cpu","mps","cuda"}
- `precision` = {"fp32","fp16","bf16","int8","4bit"}  (apply only if supported)
- `context_len` / `image_res` / `batch_size`

**Footprint & Speed (fill after run):**
- Peak RAM: TODO
- Peak VRAM: TODO (if GPU)
- TTFB: TODO, Throughput: TODO, Load time: TODO

**Gotchas:** If `mps` is requested but unavailable, `resolve_device` falls back to CPU, so the notebook still runs without Metal ([Fixes & Tips](../fixes-and-tips/metal-backend-fallback.md))



## Setup
Fetch a tiny set of images and prepare CLIP for retrieval.


In [None]:

import json
import os
import subprocess
import time
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

from notebooks._templates.measure import append_benchmark_row, measure_memory_speed

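# Device and precision are read from environment variables so they can be set in one place.
DEVICE_PREFERENCE = os.environ.get("HF_DEVICE", "cpu")
# Note: PRECISION is only recorded in the benchmark row; the model below is loaded in its default dtype.
PRECISION = os.environ.get("HF_PRECISION", "fp32")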

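def resolve_device(preference: str = "cpu") -> torch.device:
    """Return the requested device, falling back to CPU if it is unavailable."""
    if preference == "cuda" and torch.cuda.is_available():
        return torch.device("cuda")
    if preference == "mps" and torch.backends.mps.is_available():
        return torch.device("mps")
    # Unknown preference, or the requested accelerator is missing: use CPU.
    return torch.device("cpu")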

DEVICE = resolve_device(DEVICE_PREFERENCE)
print(f"Using device={DEVICE}")

MODEL_ID = "openai/clip-vit-base-patch32"
OUTPUT_DIR = Path("outputs") / "clip-retrieval"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

dataset = load_dataset("nateraw/horses_or_humans", split="train[:4]")
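# The image column may already decode to PIL images; only convert raw arrays.
raw_images = [item["image"] for item in dataset]
images = [img if isinstance(img, Image.Image) else Image.fromarray(np.asarray(img)) for img in raw_images]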
text_queries = [
    "A person performing yoga outdoors",
    "A horse jumping over a barrier",
    "A close-up portrait of a human face",
]
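
The setup cell reads `HF_PRECISION` into `PRECISION`, but the model is loaded without an explicit dtype, so the switch has no effect on inference as written. One way to wire it in is sketched below. The helper name `pick_dtype_name` and its fallback rules are assumptions made for illustration, not part of this notebook or of any library; `int8` and `4bit` are deliberately left unhandled because they need a quantization backend.

```python
# Hypothetical helper: maps the notebook's precision switch to a torch dtype name.
SUPPORTED = {"fp32": "float32", "fp16": "float16", "bf16": "bfloat16"}

def pick_dtype_name(precision: str, device: str) -> str:
    """Return a torch dtype name, falling back to float32 when unsure."""
    name = SUPPORTED.get(precision)
    if name is None:
        # int8 / 4bit (or a typo): quantization is not handled in this sketch.
        return "float32"
    if device == "cpu" and name == "float16":
        # Assumption: half precision on CPU is often slow or unsupported for some ops.
        return "float32"
    return name

print(pick_dtype_name("fp16", "cuda"))  # float16
print(pick_dtype_name("fp16", "cpu"))   # float32
print(pick_dtype_name("4bit", "cuda"))  # float32
```

In the notebook, the result could then be passed to the loader, for example `CLIPModel.from_pretrained(MODEL_ID, torch_dtype=getattr(torch, pick_dtype_name(PRECISION, DEVICE.type)))`.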


## Embed & score


In [None]:

torch.manual_seed(42)

load_start = time.perf_counter()
processor = CLIPProcessor.from_pretrained(MODEL_ID)
model = CLIPModel.from_pretrained(MODEL_ID).to(DEVICE)
load_time = time.perf_counter() - load_start

def embed_images(batch):
    """Return L2-normalized CLIP image embeddings for a list of PIL images."""
    inputs = processor(images=batch, return_tensors="pt").to(DEVICE)
    with torch.inference_mode():
        features = model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

def embed_text(batch):
    """Return L2-normalized CLIP text embeddings for a list of strings."""
    inputs = processor(text=batch, return_tensors="pt", padding=True).to(DEVICE)
    with torch.inference_mode():
        features = model.get_text_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

image_embeddings = embed_images(images)
text_embeddings = embed_text(text_queries)

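# Both sides are unit-length, so the dot product is cosine similarity.
# Rows are text queries, columns are images.
sims = (text_embeddings @ image_embeddings.T).cpu().numpy()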
df = pd.DataFrame(
    sims,
    columns=[f"image_{idx}" for idx in range(len(images))],
    index=text_queries,
)
display(df.style.format("{:.3f}"))

rankings = {
    query: df.loc[query].nlargest(3).index.tolist()
    for query in text_queries
}
print(json.dumps(rankings, indent=2))
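
The scoring step above reduces to three operations: L2-normalize both sets of embeddings, take a matrix product, and sort each row. The sketch below reproduces that with NumPy on made-up 4-dimensional vectors standing in for CLIP features, so it runs without downloading the model. The softmax at the end is an optional extra, not something the notebook does, and the `logit_scale` of 100 is an assumption; for a real checkpoint, read it from `model.logit_scale.exp()`.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for CLIP features: 2 text queries, 3 images, 4 dimensions.
text = l2_normalize(np.array([[1.0, 0.0, 0.0, 0.0],
                              [0.0, 1.0, 1.0, 0.0]]))
imgs = l2_normalize(np.array([[2.0, 0.0, 0.0, 0.0],
                              [0.0, 3.0, 3.0, 0.0],
                              [1.0, 1.0, 0.0, 0.0]]))

sims = text @ imgs.T                 # cosine similarities, shape (2, 3)
top = np.argsort(-sims, axis=1)      # image indices per query, best first
print(top.tolist())                  # [[0, 2, 1], [1, 2, 0]]

# Optional: turn each row into a probability distribution over images.
logit_scale = 100.0                  # assumed value, see lead-in
logits = logit_scale * sims
probs = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
probs /= probs.sum(axis=1, keepdims=True)
print(probs.round(3))
```

Ranking depends only on the order of the similarities, so the softmax changes how confident the scores look, not which image wins.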


## Measurement


In [None]:

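def run_inference(recorder):
    img_feat = embed_images(images)
    txt_feat = embed_text(text_queries)
    # No tokens are generated here; "first token" marks the end of both embedding passes.
    recorder.mark_first_token()
    # Items are query-image pairs, not individual images.
    recorder.add_items(len(text_queries) * len(images))
    return img_feat, txt_feat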

metrics = measure_memory_speed(run_inference)

def fmt(value, digits=4):
    if value in (None, "", float("inf")):
        return ""
    return f"{value:.{digits}f}"

try:
    repo_commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
except Exception:  # noqa: BLE001
    repo_commit = ""

append_benchmark_row(
    task="clip-retrieval",
    model_id=MODEL_ID,
    dataset="nateraw/horses_or_humans",
    sequence_or_image_res="224x224",
    batch=str(len(images)),
    peak_ram_mb=fmt(metrics.get("peak_ram_mb"), 2),
    peak_vram_mb=fmt(metrics.get("peak_vram_mb"), 2),
    load_time_s=fmt(load_time, 2),
    ttfb_s=fmt(metrics.get("ttfb_s"), 3),
    tokens_per_s_or_images_per_s=fmt(metrics.get("throughput_per_s"), 3),
    precision=PRECISION,
    notebook_path="notebooks/vision/zero-shot-clip-retrieval_cpu-first.ipynb",
    repo_commit=repo_commit,
)

with open(OUTPUT_DIR / "metrics.json", "w", encoding="utf-8") as fp:
    json.dump(metrics, fp, indent=2)
metrics
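
The measurement cell relies on `measure_memory_speed` from the repository's templates, whose source is not shown here. To make the recorder protocol concrete, the following is a stdlib-only stand-in built on assumptions: the `Recorder` class, the `measure` function, and the `peak_py_heap_mb` key are hypothetical, and only the `ttfb_s` and `throughput_per_s` keys are taken from how the cell reads the result. `tracemalloc` tracks Python heap allocations only, so it will not reflect tensor memory or process RSS.

```python
import time
import tracemalloc

class Recorder:
    """Hypothetical stand-in for the recorder passed to run_inference."""

    def __init__(self):
        self.start = time.perf_counter()
        self.first = None   # timestamp of the first result
        self.items = 0      # number of processed items

    def mark_first_token(self):
        if self.first is None:
            self.first = time.perf_counter()

    def add_items(self, n):
        self.items += n

def measure(fn):
    """Run fn(recorder) once and return timing and Python-heap metrics."""
    tracemalloc.start()
    rec = Recorder()
    fn(rec)
    elapsed = time.perf_counter() - rec.start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "ttfb_s": (rec.first - rec.start) if rec.first is not None else None,
        "throughput_per_s": rec.items / elapsed if elapsed > 0 else float("inf"),
        "peak_py_heap_mb": peak / 1e6,
    }

def fake_workload(rec):
    data = [i * i for i in range(100_000)]  # placeholder for the embedding passes
    rec.mark_first_token()
    rec.add_items(12)                       # e.g. 3 queries x 4 images
    return data

metrics_demo = measure(fake_workload)
print(metrics_demo)
```

A single run like this includes warm-up effects; repeating the call and reporting the median would give steadier numbers.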


## Results Summary
- Observations: TODO
- Metrics captured: see `benchmarks/matrix.csv`

## Next Steps
- TODOs: fill in after benchmarking

## Repro
- Seed: 42 (set in the Embed & score cell)
- Libraries: captured via `detect_env()`
- Notebook path: `notebooks/vision/zero-shot-clip-retrieval_cpu-first.ipynb`
- Latest commit: populated automatically when appending benchmarks (if git available)
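
The Repro list mentions `detect_env()`, which is not defined or imported in this notebook. A rough stand-in for recording library versions, using only the standard library, could look like the sketch below. The function name `detect_env_sketch`, its default package list, and the shape of the returned dictionary are assumptions for illustration and may differ from the repository's helper.

```python
import platform
import sys
from importlib import metadata

def detect_env_sketch(packages=("torch", "transformers", "datasets", "numpy", "pandas", "Pillow")):
    """Collect Python, platform, and installed package versions (None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

env = detect_env_sketch()
print(env)
```

Writing this dictionary next to `metrics.json` keeps the environment and the numbers it produced together.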
