<h1><center>CLIP Local Model Setup & Preparation</center></h1>

This script downloads the CLIP ViT-B/32 model and processor (openai/clip-vit-base-patch32) from Hugging Face and saves them to the project’s artifacts/clip_vit_base_patch32 directory.
It creates the folder if it does not exist, loads the pretrained weights, and saves both the model and the processor so that later inference steps can run offline.

In [8]:
from transformers import CLIPModel, CLIPProcessor
import os

LOCAL_MODEL_DIR = os.path.join("..", "artifacts", "clip_vit_base_patch32")

os.makedirs(LOCAL_MODEL_DIR, exist_ok=True)

MODEL_ID = "openai/clip-vit-base-patch32"

print("Downloading CLIP model to:", LOCAL_MODEL_DIR)

clip_model = CLIPModel.from_pretrained(MODEL_ID)
clip_model.save_pretrained(LOCAL_MODEL_DIR)

clip_processor = CLIPProcessor.from_pretrained(MODEL_ID)
clip_processor.save_pretrained(LOCAL_MODEL_DIR)

print("CLIP model successfully downloaded and stored locally.")

Downloading CLIP model to: ../artifacts/clip_vit_base_patch32
CLIP model successfully downloaded and stored locally.
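
Before relying on the saved copy offline, it can help to confirm that the directory is not empty or half-written. The sketch below is a minimal check, not part of the original notebook. It assumes that save_pretrained wrote config.json and preprocessor_config.json; the exact file names can differ between transformers versions, so EXPECTED_FILES and the helper missing_artifacts are hypothetical and should be adjusted to what actually appears on disk.

```python
import os

# Assumed file names; adjust to match what save_pretrained wrote in your setup.
EXPECTED_FILES = ("config.json", "preprocessor_config.json")


def missing_artifacts(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


model_dir = os.path.join("..", "artifacts", "clip_vit_base_patch32")
missing = missing_artifacts(model_dir)
if missing:
    print("Missing files:", missing)
else:
    print("All expected files found in:", model_dir)
```

An empty list means every file named in EXPECTED_FILES exists; it does not verify that the weights themselves are complete.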


This script loads a locally stored CLIP ViT-B/32 model and processor from the artifacts/clip_vit_base_patch32 directory.
It automatically selects GPU (cuda) if available, otherwise defaults to CPU.
The model is initialized in evaluation mode and ready for downstream inference tasks.
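The script also prints the hidden size of CLIP’s vision encoder (768 for ViT-B/32), which is the width of its per-token features. This is not the size of the projected image embedding used for image-text similarity, which is 512 for this checkpoint.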

In [9]:
from transformers import CLIPModel, CLIPProcessor
import torch
import os

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

LOCAL_MODEL_DIR = os.path.join("..", "artifacts", "clip_vit_base_patch32")

print("Loading CLIP model from local directory:", LOCAL_MODEL_DIR)

clip_model = CLIPModel.from_pretrained(LOCAL_MODEL_DIR).to(DEVICE).eval()
clip_processor = CLIPProcessor.from_pretrained(LOCAL_MODEL_DIR)

print("CLIP model successfully loaded from local storage.")
print("Vision feature dimension:", clip_model.vision_model.config.hidden_size)

Loading CLIP model from local directory: ../artifacts/clip_vit_base_patch32
CLIP model successfully loaded from local storage.
Vision feature dimension: 768


This script loads a sample image from the project’s assets/ folder and processes it using the locally loaded CLIP ViT-B/32 model.
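The image is converted to RGB, passed through the CLIP vision encoder, and the script prints the shape of the extracted patch-level embeddings (excluding the CLS token). With the processor's default 224x224 input and 32x32 patches, this gives a 7x7 grid of 49 patch tokens, each of width 768.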
This helps verify that the model and processor are working correctly and that embeddings can be generated for downstream tasks such as visual retrieval or RAG-vision pipelines.

In [None]:
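from PIL import Image
import os
import torch  # also imported in the previous cell; repeated so this cell reads on its own

# Note: clip_model, clip_processor and DEVICE come from the previous cell.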

TEST_IMAGE = os.path.join("..", "..", "assets", "sample_dirty.jpg")

print("Loading test image from:", TEST_IMAGE)

img = Image.open(TEST_IMAGE).convert("RGB")

inputs = clip_processor(images=img, return_tensors="pt").to(DEVICE)

with torch.no_grad():
    out = clip_model.vision_model(pixel_values=inputs["pixel_values"])
    # Token 0 is the CLS token; keep only the patch tokens.
    emb = out.last_hidden_state[:, 1:, :]
    print("Embeddings shape:", emb.shape)

Loading test image from: ../../assets/sample_dirty.jpg
Embeddings shape: torch.Size([1, 49, 768])
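
The 49 in the shape above follows from the ViT-B/32 geometry: a square input is cut into non-overlapping 32x32 patches, and each patch becomes one token. The arithmetic can be sketched as follows. The function name num_patches is hypothetical, and the 224-pixel input size is assumed to be the processor default for this checkpoint rather than something printed by the notebook.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


patches = num_patches(224, 32)
print("Patch tokens:", patches)
print("Tokens including CLS:", patches + 1)
```

The second count corresponds to the full sequence length of last_hidden_state before the CLS token is sliced off.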


---