<h1><center>Vehicle Dirt Classification Test Using Local CLIP + Qwen2-VL-2B-Instruct</center></h1>

This cell defines the system and user prompts that guide the vision-language model during inference. The system prompt sets strict rules for deciding whether a vehicle shows visible dirt: only real dirt (dust, mud, stains) counts, reflections and shadows are ignored, and ambiguous cases default to False. The user prompt refers to the query image and the retrieved support patches, which are attached as images at inference time, and instructs the model to return only a structured output containing a boolean dirt flag and a short explanation.

In [1]:
SYSTEM_PROMPT = """
You are an expert in visual dirt assessment for vehicles.
Rules:
- Mark VisibleDirtyFlag = True ONLY if there is real dirt (dust, mud, stains).
- Ignore reflections, shadows, car paint color, camera artifacts.
- If dirt is minimal or ambiguous, output VisibleDirtyFlag = False.
"""

USER_PROMPT = """
Here is:
1. The new vehicle image.
2. Visually retrieved support examples (dirty and clean patches).

Use these examples to reason.

RETURN ONLY:
VisibleDirtyFlag: True/False
Explanation: <short text>
"""

This script implements a vision-based Retrieval-Augmented Generation (RAG) pipeline using locally stored models. It loads CLIP to extract patch-level embeddings from both the support images and the query image, and builds an in-memory vector store in which every support patch is labeled “dirty” or “clean”. Retrieval compares each query patch against all stored patches, sums the top-k cosine similarities per label, and keeps the most similar support patches as visual evidence for downstream reasoning.
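
The retrieval step described above can be sketched on toy data. This is a minimal illustration rather than the notebook's implementation: `retrieve_scores` and the 2-D vectors are hypothetical, and the sketch assumes every vector is L2-normalized so that dot products equal cosine similarities.

```python
import numpy as np

def retrieve_scores(query_vecs, support_vecs, labels, k=2):
    """Sum the top-k cosine similarities of every query patch per class label.

    Hypothetical helper: assumes the rows of both matrices are L2-normalized.
    """
    sims = query_vecs @ support_vecs.T          # shape (n_query, n_support)
    top = np.argsort(sims, axis=1)[:, -k:]      # k best support indices per query patch
    scores = {label: 0.0 for label in set(labels)}
    for q_idx, row in enumerate(top):
        for s_idx in row:
            scores[labels[s_idx]] += float(sims[q_idx, s_idx])
    return scores

# Toy store: two "dirty" and two "clean" support patches (made-up 2-D vectors).
support = np.array(
    [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32
)
labels = ["dirty", "dirty", "clean", "clean"]

# Two query patches, also unit length.
query = np.array([[1.0, 0.0], [0.6, 0.8]], dtype=np.float32)

scores = retrieve_scores(query, support, labels, k=2)
print(scores)
```

On this toy data the "dirty" total comes out near 2.76 and the "clean" total near 0.8, because three of the four retained matches are dirty patches. Computing the full similarity matrix in one product gives the same totals as a per-patch loop.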

Using these retrieved visual examples, the script then applies the Qwen2-VL-2B-Instruct model, also loaded fully offline, to generate a structured assessment of whether the vehicle appears dirty. The query image and the retrieved evidence patches are passed to the model through a multimodal chat template, so the model can reason about dirt presence with the examples in view. The final output includes the raw VLM response and an automatically parsed VisibleDirtyFlag value.
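
Parsing the structured reply could look like the sketch below. It is an illustration under stated assumptions, not the notebook's parser: `parse_assessment` is a hypothetical helper, and it assumes the decoded text may echo the prompt and that the model's reply follows the last line reading `assistant`.

```python
import re

FLAG_RE = re.compile(r"VisibleDirtyFlag:\s*(True|False)")
EXPL_RE = re.compile(r"Explanation:\s*(.+)")

def parse_assessment(decoded_text):
    """Parse the assistant's reply into a dict, or return None if no flag is found.

    Hypothetical helper. Only the text after the last "assistant" line is
    searched, so the "VisibleDirtyFlag: True/False" template in an echoed
    prompt cannot be mistaken for an answer.
    """
    answer = decoded_text.rsplit("\nassistant\n", 1)[-1]
    flag = FLAG_RE.search(answer)
    if flag is None:
        return None
    expl = EXPL_RE.search(answer)
    return {
        "visible_dirty": flag.group(1) == "True",
        "explanation": expl.group(1).strip() if expl else "",
    }

# Made-up decoded text that echoes part of the prompt before the reply.
echoed = (
    "user\nRETURN ONLY:\nVisibleDirtyFlag: True/False\nExplanation: <short text>\n"
    "assistant\nVisibleDirtyFlag: False\nExplanation: Only reflections are visible."
)
result = parse_assessment(echoed)
print(result)

# A reply without a flag yields None instead of a value taken from the prompt.
no_flag = parse_assessment("user\nVisibleDirtyFlag: True/False\nassistant\nI cannot tell.")
```

Returning a real boolean plus the explanation, and `None` when the flag is missing, lets downstream code tell "clean" apart from "the model did not follow the format".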

In [2]:
from transformers import (
    CLIPModel,
    CLIPProcessor,
    AutoProcessor,
    AutoModelForVision2Seq,
)
from PIL import Image, UnidentifiedImageError
import numpy as np
import torch
import math
import re
import os

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

CLIP_LOCAL_DIR = os.path.join("..", "artifacts", "clip_vit_base_patch32")
QWEN_LOCAL_DIR = os.path.join("..", "artifacts", "qwen2_vl_2b_instruct")

SUPPORT_ROOT = os.path.join("..", "data", "support")
DIRTY_DIR = os.path.join(SUPPORT_ROOT, "dirty")
CLEAN_DIR = os.path.join(SUPPORT_ROOT, "clean")

QUERY_IMAGE = os.path.join("..", "..", "assets", "sample_dirty.jpg")

PATCH_SIZE = 32
RES = 224
VALID_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}

print("Loading CLIP from local directory:", CLIP_LOCAL_DIR)

clip_model = CLIPModel.from_pretrained(CLIP_LOCAL_DIR).to(DEVICE).eval()
clip_processor = CLIPProcessor.from_pretrained(CLIP_LOCAL_DIR)

VISION_DIM = clip_model.vision_model.config.hidden_size
print("CLIP Vision Dimension:", VISION_DIM)

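def get_patch_embeddings(img):
    """Return an (s, s, VISION_DIM) grid of L2-normalized CLIP patch embeddings."""
    inputs = clip_processor(images=img, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        out = clip_model.vision_model(pixel_values=inputs["pixel_values"])
        # Drop the leading class token; the remaining tokens are one per patch.
        tokens = out.last_hidden_state[:, 1:, :]

    tokens = tokens.squeeze(0)
    # Unit-normalize so that dot products are cosine similarities.
    tokens = tokens / tokens.norm(dim=-1, keepdim=True)

    n = tokens.shape[0]
    s = math.isqrt(n)  # patches per side (7 for a 224x224 input with 32x32 patches)
    return tokens.reshape(s, s, VISION_DIM)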

def patch_coords(image_size=(RES, RES), patch_size=PATCH_SIZE):
    W, H = image_size
    coords = []
    for j in range(H // patch_size):
        for i in range(W // patch_size):
            x0, y0 = i * patch_size, j * patch_size
            coords.append((x0, y0, x0 + patch_size, y0 + patch_size))
    return coords

support_vectors = None
meta = []

def is_image_file(path):
    return os.path.splitext(path.lower())[1] in VALID_EXTS

def add_support_image_to_memory(path, label):
    global support_vectors, meta

    try:
        img = Image.open(path).convert("RGB")
    except (UnidentifiedImageError, OSError) as e:
        # Unreadable or corrupt files are skipped instead of aborting the load.
        print("Skipping invalid image:", path, "->", e)
        return

    img_r = img.resize((RES, RES), Image.BICUBIC)
    grid = get_patch_embeddings(img_r).detach().cpu().numpy()
    vecs = grid.reshape(-1, VISION_DIM).astype(np.float32)

    support_vectors = vecs if support_vectors is None else np.vstack([support_vectors, vecs])

    boxes = patch_coords()

    for p_idx, box in enumerate(boxes):
        meta.append({
            "label": label,
            "image_path": path,
            "patch_idx": p_idx,
            "box": box
        })

    print(f"Added support image: {path} ({label}) → {vecs.shape[0]} patches")

print("\nLoading support images...\n")

for fname in os.listdir(DIRTY_DIR):
    p = os.path.join(DIRTY_DIR, fname)
    if is_image_file(p):
        add_support_image_to_memory(p, "dirty")

for fname in os.listdir(CLEAN_DIR):
    p = os.path.join(CLEAN_DIR, fname)
    if is_image_file(p):
        add_support_image_to_memory(p, "clean")

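# Fail early with a clear message if neither folder contained a usable image.
if support_vectors is None:
    raise RuntimeError(f"No support images found under {SUPPORT_ROOT}")

print("\nVector store ready!")
print("Total vectors:", support_vectors.shape[0])
print("Vector dimension:", support_vectors.shape[1])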

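def retrieve_for_query_image(image_path, k=3):
    """Match every query patch to its k most similar support patches.

    Returns the per-label sum of similarities, the matches sorted by
    similarity (highest first), and the resized query image.
    """
    img = Image.open(image_path).convert("RGB").resize((RES, RES), Image.BICUBIC)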

    grid = get_patch_embeddings(img).detach().cpu().numpy()
    vecs = grid.reshape(-1, VISION_DIM).astype(np.float32)

    S = support_vectors.astype(np.float32)

    class_score = {"dirty": 0.0, "clean": 0.0}
    evidence = []

    for q_idx in range(vecs.shape[0]):
        sims = S @ vecs[q_idx]
        top = sims.argsort()[-k:][::-1]

        for idx in top:
            sim = float(sims[idx])
            ref = meta[idx]
            class_score[ref["label"]] += sim
            evidence.append({"sim": sim, "ref": ref})

    evidence.sort(key=lambda e: e["sim"], reverse=True)
    return class_score, evidence, img

def crop_patch(image_path, box):
    img = Image.open(image_path).convert("RGB").resize((RES, RES))
    return img.crop(box)

print("\nLoading VLM: Qwen2-VL-2B-Instruct from:", QWEN_LOCAL_DIR)

vlm_processor = AutoProcessor.from_pretrained(QWEN_LOCAL_DIR)
vlm_model = AutoModelForVision2Seq.from_pretrained(
    QWEN_LOCAL_DIR,
    torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
)
vlm_model.to(DEVICE).eval()

print(f"\nProcessing query image: {QUERY_IMAGE}")

class_score, evidence, q_img = retrieve_for_query_image(QUERY_IMAGE, k=4)

print("\nScores:", class_score)

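def top_unique(evidence, label, n=3):
    """Return the n most similar matches for a label, at most one per support patch."""
    seen, picked = set(), []
    for e in evidence:
        key = (e["ref"]["image_path"], e["ref"]["patch_idx"])
        if e["ref"]["label"] == label and key not in seen:
            seen.add(key)
            picked.append(e)
            if len(picked) == n:
                break
    return picked

# Several query patches can retrieve the same support patch, so deduplicate
# before choosing the evidence crops that are shown to the VLM.
dirty_patches = top_unique(evidence, "dirty")
clean_patches = top_unique(evidence, "clean")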

rag_images = [q_img] + [
    crop_patch(e["ref"]["image_path"], e["ref"]["box"])
    for e in (dirty_patches + clean_patches)
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT.strip()},
    {
        "role": "user",
        "content": [
            *[{"type": "image"} for _ in rag_images],
            {"type": "text", "text": USER_PROMPT.strip()}
        ],
    },
]

prompt = vlm_processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = vlm_processor(
    text=prompt,
    images=rag_images,
    return_tensors="pt"
)

inputs = {k: (v.to(DEVICE) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

with torch.no_grad():
    output = vlm_model.generate(**inputs, max_new_tokens=200)

resp = vlm_processor.batch_decode(output, skip_special_tokens=True)[0]

print("\n=== RAW MODEL RESPONSE ===")
print(resp)

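# The decoded text echoes the prompt, whose "VisibleDirtyFlag: True/False" line
# would itself match as True, so search only the text after the last assistant marker.
answer = resp.rsplit("\nassistant\n", 1)[-1]
match = re.findall(r"VisibleDirtyFlag:\s*(True|False)", answer)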

if match:
    print("\nParsed Flag:", match[-1])
else:
    print("\nCould not parse VisibleDirtyFlag.")

Loading CLIP from local directory: ../artifacts/clip_vit_base_patch32
CLIP Vision Dimension: 768

Loading support images...

Added support image: ../data/support/dirty/side-view-very-dirty-car-600nw-2296491121.jpg.jpg (dirty) → 49 patches
Added support image: ../data/support/dirty/79086697-off-road-vehicle-after-driving-in-the-rain-on-extremely-dirty-rural-road.jpg (dirty) → 49 patches
Added support image: ../data/support/dirty/16474284186573.jpg (dirty) → 49 patches
Added support image: ../data/support/clean/mg-zs-hybrid-front-view.jpg (clean) → 49 patches
Added support image: ../data/support/clean/1140-subaru-forester-sport-hero-esp.jpg (clean) → 49 patches
Added support image: ../data/support/clean/GAC-Eco-Amigable.jpg (clean) → 49 patches

Vector store ready!
Total vectors: 294
Vector dimension: 768

Loading VLM: Qwen2-VL-2B-Instruct from: ../artifacts/qwen2_vl_2b_instruct


Processing query image: ../../assets/sample_dirty.jpg

Scores: {'dirty': 94.14763367176056, 'clean': 48.16642606258392}

=== RAW MODEL RESPONSE ===
system
You are an expert in visual dirt assessment for vehicles.
Rules:
- Mark VisibleDirtyFlag = True ONLY if there is real dirt (dust, mud, stains).
- Ignore reflections, shadows, car paint color, camera artifacts.
- If dirt is minimal or ambiguous, output VisibleDirtyFlag = False.
user
Here is:
1. The new vehicle image.
2. Visually retrieved support examples (dirty and clean patches).

Use these examples to reason.

RETURN ONLY:
VisibleDirtyFlag: True/False
Explanation: <short text>
assistant
VisibleDirtyFlag: True
Explanation: The vehicle is covered in mud, indicating real dirt.

Parsed Flag: True
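
The counts in the log above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: `summarize_run` is a hypothetical helper, and `dirty_share` and `mean_similarity` are derived quantities that the notebook itself does not report.

```python
def summarize_run(res, patch_size, n_support_images, k, scores):
    """Recompute the counts printed in the log and two derived ratios.

    Hypothetical helper for checking the log; not part of the notebook.
    """
    patches_per_image = (res // patch_size) ** 2
    total_vectors = patches_per_image * n_support_images
    n_matches = patches_per_image * k  # every query patch keeps its k best matches
    total = scores["dirty"] + scores["clean"]
    return {
        "patches_per_image": patches_per_image,
        "total_vectors": total_vectors,
        "n_matches": n_matches,
        "dirty_share": scores["dirty"] / total,
        "mean_similarity": total / n_matches,
    }

# Values taken from the configuration and the printed scores above.
summary = summarize_run(
    res=224,
    patch_size=32,
    n_support_images=6,
    k=4,
    scores={"dirty": 94.14763367176056, "clean": 48.16642606258392},
)
print(summary)
```

With a 224-pixel input and 32-pixel patches this gives 49 patches per image and 294 stored vectors for six support images, matching the log. Roughly two thirds of the summed similarity falls on dirty patches, which points the same way as the model's True flag, although the notebook does not combine the two signals.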


---