Yes — CLIP is a multimodal embedding model that:

Takes text → creates an embedding

Takes image → creates an embedding

Learns to place matching image–text pairs close together in the embedding space

Pushes non-matching pairs far apart

This is exactly the principle of contrastive learning.

❗ But here’s what you’re missing (and must know to truly “get” CLIP)
🧠 1. CLIP doesn’t just “create embeddings” — it learns a shared semantic space

CLIP is trained so that:

“A dog playing in the park” (text)

An actual image of a dog playing in the park

…end up close together because they describe the same concept, not because of pixel similarity.

This is fundamentally different from:

a normal CNN (pure vision)

a normal transformer encoder (pure text)

CLIP aligns vision semantics and language semantics.

⚔️ 2. Training mechanism = Contrastive Loss (InfoNCE)

For each batch of N image–text pairs:

Correct pair: ✓ pulled together

For each image, the other N−1 texts in the batch are wrong matches (and likewise for each text): ✗ pushed apart

CLIP maximizes the similarity of the correct match while minimizing the similarity of incorrect matches.

This is why CLIP can work zero-shot without fine-tuning.
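The batch-level objective described above can be sketched in a few lines. This is a minimal illustration in plain Python, not CLIP's actual training code: the toy 2-D embeddings are made up, and the fixed `temperature=0.07` is an assumption (in CLIP the temperature is a learned parameter).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i is the positive, all other pairings in the batch are negatives."""
    n = len(image_embs)
    # N x N matrix of temperature-scaled similarities
    logits = [[cosine(img, txt) / temperature for txt in text_embs] for img in image_embs]
    # image -> text: every row should peak on its diagonal entry
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> image: every column should peak on its diagonal entry
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# toy embeddings (hypothetical values, for illustration only)
images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

loss_aligned = clip_style_loss(images, aligned_texts)
loss_shuffled = clip_style_loss(images, shuffled_texts)
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

The loss is small when each image sits next to its own text and large when the pairs are shuffled, which is the "pull together, push apart" behaviour in numeric form.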

🎯 3. CLIP embeddings are not random — they have “semantic directions”

Example:

“dog” → embeddings cluster around dog-like images

“red car” → embeddings cluster around red cars

“sad expression” → clusters around sad-looking images

CLIP embeddings behave like a knowledge-rich semantic space.

📌 4. CLIP uses two separate encoders

Image Encoder = ResNet or ViT

Text Encoder = Transformer (like GPT/BERT style)

These encoders are trained together to align their output spaces.

They do not share parameters — only the final embedding space is aligned.

🧪 5. Zero-shot classification works because of this alignment

Zero-shot image classification with CLIP:

Convert class names to text prompts

“a photo of a cat”

“a photo of a dog”

Convert an image into an embedding

Choose whichever text embedding is closest to the image embedding.

This is why CLIP became huge for GenAI and vision-language tasks.

🧵 Your statement rewritten correctly and completely

CLIP is a multimodal contrastive model that learns a shared embedding space for images and text. It encodes images and text separately, but trains both encoders so that embeddings of matching image–text pairs lie close together, and mismatched ones are pushed apart. This creates a powerful semantic vector space where similarity between images and text can be measured directly.

This embedding model is used for:
- Image classification
- Image captioning
- Image search


Yes — CLIP embeddings can absolutely be used for image classification, and in fact they are one of the strongest ways to do:

Zero-shot classification

Few-shot classification

Custom image categorization without retraining a CNN

But you need to know HOW and WHY it works.

Let me break it down sharply.

✅ 1. Zero-Shot Image Classification (Most Famous Use of CLIP)

CLIP can classify images without training on your dataset.

How it works

Convert your class labels into text prompts

“a photo of a cat”

“a photo of a dog”

Encode these text prompts → text embeddings

Encode your image → image embedding

Compute cosine similarity between image embedding and each text embedding

Highest similarity = predicted class

Example

If the image embedding is closest to

“a photo of a tiger” → model predicts tiger

This works because the image and text live in the same semantic vector space.
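The zero-shot steps above can be sketched with toy vectors. The 3-D embeddings below are hypothetical stand-ins; in practice both the prompts and the image would be embedded by a CLIP model (for example with the `model.encode` call used later in this notebook).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_emb, prompt_embs):
    """Return the prompt whose embedding is most similar to the image, plus all scores."""
    scores = {prompt: cosine(image_emb, emb) for prompt, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# hypothetical text embeddings, one per class prompt
prompt_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a tiger": [0.6, 0.0, 0.8],
}
# hypothetical image embedding
image_emb = [0.5, 0.1, 0.7]

predicted, scores = zero_shot_classify(image_emb, prompt_embs)
print(predicted)
for prompt, score in scores.items():
    print(f"{score:.3f}  {prompt}")
```

Adding a new class only means adding one more prompt to the dictionary; nothing is retrained.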

🟢 2. Few-Shot Classification

If you have only a little training data (like 10 images per class), you can:

Embed your training images

Average embeddings for each class

Classify a new image by finding the nearest class embedding

This usually works better than training a CNN from scratch on so little data.
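A minimal sketch of the averaging approach, assuming the image embeddings have already been computed. The 2-D vectors and class names are hypothetical; real CLIP embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def build_prototypes(labelled_embs):
    """One averaged embedding (prototype) per class."""
    return {label: mean_vector(embs) for label, embs in labelled_embs.items()}

def classify(emb, prototypes):
    """Pick the class whose prototype is closest in cosine similarity."""
    return max(prototypes, key=lambda label: cosine(emb, prototypes[label]))

# hypothetical embeddings of a few labelled training images
support = {
    "healthy leaf": [[0.9, 0.1], [0.8, 0.3], [1.0, 0.2]],
    "diseased leaf": [[0.1, 0.9], [0.3, 0.8], [0.2, 1.0]],
}
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.4], prototypes)
print(prediction)
```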

🟡 3. Custom/Domain-Specific Image Classification

You can classify any category you invent:

“healthy leaf” vs “diseased leaf”

“real product” vs “fake product”

“shirt with stripes” vs “shirt with polka dots”

“happy face” vs “sad face”

No retraining required — just prompt engineering.

🔵 4. Fine-Tuning for Classification (Optional)

If your domain is unusual (medical, satellite, microscopic), you can:

fine-tune

or use CLIP as a frozen feature extractor and train a classifier head.

But zero-shot is often already a strong baseline: in the original CLIP paper, zero-shot CLIP matched the ImageNet accuracy of a fully supervised ResNet-50.
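The "frozen feature extractor plus classifier head" option can be sketched as follows. This is an illustration only: the features are made-up 2-D vectors standing in for frozen CLIP embeddings, and the head is a hand-written logistic regression rather than any particular library's classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed features with plain gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        # only the head is updated; the features themselves never change
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(x, weights, bias):
    """Return 1 if the head assigns probability >= 0.5 to the positive class, else 0."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)

# hypothetical frozen embeddings and binary labels
frozen_features = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_linear_head(frozen_features, labels)
predictions = [predict(x, weights, bias) for x in frozen_features]
print(predictions)
```

Because the encoder stays frozen, the embeddings can be computed once and cached, so training the head is cheap even on a CPU.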

🚫 Important: CLIP is NOT optimal for some tasks

CLIP is bad for:

pixel-level tasks

segmentation

object detection

bounding boxes

depth estimation

Because its embeddings are global (whole image).

<h2> Limitations of using CLIP </h2>

The big problem is that the task I care about may be a type of task that CLIP was not trained on. In that case there is a chance it will not match the text and image embeddings well in the vector space.

So we fine-tune it for our specific use case.


In [4]:
"""
Fine-tuning CLIP on youtube data

step1: Gather positive(and negative)pairs

step2: Pre-process training data

step3: Define evals

step4: Fine-tune the model

step5: Evaluate the model"""


In [1]:
#step1: https://github.com/ShawhinT/YouTube-Blog/blob/main/multimodal-ai/4-ft-mm-embeddings/1-prepare_training_data.ipynb
# above link is the file where it has the code for building the dataset
from datasets import load_dataset

from PIL import Image
import requests

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.evaluation import TripletEvaluator, SentenceEvaluator

from typing import List, Dict
import torch

import model and dataset

In [2]:

model_name = "sentence-transformers/clip-ViT-L-14"
model = SentenceTransformer(model_name)



In [3]:
dataset = load_dataset("shawhin/yt-title-thumbnail-pairs")


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['video_id', 'title', 'thumbnail_url', 'title_neg'],
        num_rows: 53
    })
    valid: Dataset({
        features: ['video_id', 'title', 'thumbnail_url', 'title_neg'],
        num_rows: 11
    })
    test: Dataset({
        features: ['video_id', 'title', 'thumbnail_url', 'title_neg'],
        num_rows: 12
    })
})

In [8]:
dataset['train'][0]

{'video_id': 'Y7pNmocrmi8',
 'title': 'Multimodal RAG: A Beginner-friendly Guide (with Python Code)',
 'thumbnail_url': 'https://i.ytimg.com/vi/Y7pNmocrmi8/hqdefault.jpg',
 'title_neg': 'What Nature Can Teach Us About Business...'}


freeze model params

In [9]:
# pick specific layers to train (note: you can add more layers to this list)
trainable_layers_list = ['projection']

# Apply freezing configuration
for name, param in model.named_parameters():
    # freeze all params
    param.requires_grad = False

    # unfreeze layers in trainable_layers_list
    if any(layer in name for layer in trainable_layers_list):
        param.requires_grad = True

In [10]:
# Verify trainable parameters
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}")

Trainable: 0.model.visual_projection.weight
Trainable: 0.model.text_projection.weight


In [11]:
model

SentenceTransformer(
  (0): CLIPModel()
)

In [12]:
clip_model = model[0]
clip_model

CLIPModel()

In [13]:
for name, module in clip_model.named_modules():
    print(name)


model
model.text_model
model.text_model.embeddings
model.text_model.embeddings.token_embedding
model.text_model.embeddings.position_embedding
model.text_model.encoder
model.text_model.encoder.layers
model.text_model.encoder.layers.0
model.text_model.encoder.layers.0.self_attn
model.text_model.encoder.layers.0.self_attn.k_proj
model.text_model.encoder.layers.0.self_attn.v_proj
model.text_model.encoder.layers.0.self_attn.q_proj
model.text_model.encoder.layers.0.self_attn.out_proj
model.text_model.encoder.layers.0.layer_norm1
model.text_model.encoder.layers.0.mlp
model.text_model.encoder.layers.0.mlp.activation_fn
model.text_model.encoder.layers.0.mlp.fc1
model.text_model.encoder.layers.0.mlp.fc2
model.text_model.encoder.layers.0.layer_norm2
model.text_model.encoder.layers.1
model.text_model.encoder.layers.1.self_attn
model.text_model.encoder.layers.1.self_attn.k_proj
model.text_model.encoder.layers.1.self_attn.v_proj
model.text_model.encoder.layers.1.self_attn.q_proj
model.text_model.en

Only the last two layers of the model (the projection layers) are unfrozen:
- model.visual_projection
- model.text_projection

In [14]:
# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Percentage of trainable parameters: {100 * trainable_params / total_params:.2f}%")

Total parameters: 427,616,513
Trainable parameters: 1,376,256
Percentage of trainable parameters: 0.32%
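The trainable count above can be reproduced by hand. The sketch below assumes the usual CLIP ViT-L/14 dimensions (a 1024-wide image encoder, a 768-wide text encoder and a 768-dimensional shared space); these sizes are an assumption, since they are not printed in the notebook output. The earlier output lists only `.weight` tensors for both projections, so no bias terms are counted.

```python
# Assumed ViT-L/14 dimensions (not shown in the notebook output)
vision_hidden_size = 1024  # width of the image encoder output
text_hidden_size = 768     # width of the text encoder output
projection_dim = 768       # size of the shared embedding space

# each projection is a single weight matrix without a bias
visual_projection_params = vision_hidden_size * projection_dim
text_projection_params = text_hidden_size * projection_dim

trainable = visual_projection_params + text_projection_params
total = 427_616_513  # total parameter count reported above

print(f"Trainable parameters: {trainable:,}")          # 1,376,256
print(f"Percentage: {100 * trainable / total:.2f}%")   # 0.32%
```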


Preprocess the data

In [15]:
# process positive pairs
def preprocess(batch):
    """
        Download thumbnails and return columns under the standard names
        (anchor, positive, negative); no augmentations are applied
    """
    # get images from urls
    image_list = [Image.open(requests.get(url, stream=True).raw) for url in batch["thumbnail_url"]]

    # return columns with standard names
    return {
        "anchor": image_list,
        "positive": batch["title"],
        "negative": batch["title_neg"]
    }

In [16]:
# remove columns not relevant to training
columns_to_remove = [col for col in dataset['train'].column_names if col not in ['anchor', 'positive', 'negative']]
# apply transformations
dataset = dataset.map(preprocess, batched=True, remove_columns=columns_to_remove)


In [17]:

dataset['train'][0]

{'anchor': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=480x360>,
 'positive': 'Multimodal RAG: A Beginner-friendly Guide (with Python Code)',
 'negative': 'What Nature Can Teach Us About Business...'}

Evaluate the pretrained model

In [18]:
def create_triplet_evaluator(set_name):
    """
        Create triplet evaluator for "train", "valid", or "test" split
    """

    return TripletEvaluator(
        anchors=dataset[f"{set_name}"]["anchor"],
        positives=dataset[f"{set_name}"]["positive"],
        negatives=dataset[f"{set_name}"]["negative"],
        name=f"yt-title-thumbnail-{set_name}",
    )

In [19]:

evaluator_train = create_triplet_evaluator("train")
evaluator_valid = create_triplet_evaluator("valid")

print("Train:", evaluator_train(model))
print("Valid:", evaluator_valid(model))

Train: {'yt-title-thumbnail-train_cosine_accuracy': 0.9622641801834106}
Valid: {'yt-title-thumbnail-valid_cosine_accuracy': 1.0}
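As a rough picture of what the cosine accuracy above measures: a triplet counts as correct when the anchor is more similar to its positive than to its negative. The sketch below is a plain-Python illustration with made-up 2-D embeddings, not the `TripletEvaluator` source.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_cosine_accuracy(anchors, positives, negatives):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)

# hypothetical embeddings: thumbnails (anchors), true titles, unrelated titles
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positives = [[0.9, 0.2], [0.1, 0.8], [0.2, 1.0]]
negatives = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.9]]

accuracy = triplet_cosine_accuracy(anchors, positives, negatives)
print(f"{accuracy:.3f}")
```

Each anchor is compared against a single negative only, so this is a fairly easy test to pass.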


In [20]:
class ImageTextRetrievalEvaluator(SentenceEvaluator):
    def __init__(
        self,
        images: List,
        texts: List[str],
        name: str = '',
        k: int = 1,
        batch_size: int = 32,
        show_progress_bar: bool = False
    ):
        self.images = images
        self.texts = texts
        self.name = name
        self.k = k
        self.batch_size = batch_size
        self.show_progress_bar = show_progress_bar

    def __call__(self,
        model: SentenceTransformer,
        output_path: str = None,
        epoch: int = -1,
        steps: int = -1) -> Dict[str, float]:

        # Get embeddings for all images
        img_embeddings = model.encode(
            self.images,
            batch_size=self.batch_size,
            show_progress_bar=self.show_progress_bar,
            convert_to_tensor=True
        )

        # Get embeddings for all texts
        text_embeddings = model.encode(
            self.texts,
            batch_size=self.batch_size,
            show_progress_bar=self.show_progress_bar,
            convert_to_tensor=True
        )

        # Compute similarity matrix
        cos_scores = torch.nn.functional.cosine_similarity(
            img_embeddings.unsqueeze(1),
            text_embeddings.unsqueeze(0),
            dim=2
        )

        # Get indices of top k predictions for each image
        _, top_indices = torch.topk(cos_scores, k=self.k, dim=1)

        # Calculate Recall@k (correct if ground truth index is in top k predictions)
        correct = sum(i in top_indices[i].tolist() for i in range(len(self.images)))
        recall_at_k = correct / len(self.images)

        return {f'{self.name}_Recall@{self.k}': recall_at_k}

In [21]:
def create_recall_evaluator(set_name, k=1):
    """
        Create Recall@k retrieval evaluator for "train", "valid", or "test" split
    """

    return ImageTextRetrievalEvaluator(
        images=dataset[f"{set_name}"]["anchor"],
        texts=dataset[f"{set_name}"]["positive"],
        name=f"yt-title-thumbnail-{set_name}",
        k=k
    )

In [22]:
# Create new evaluator with Recall@k
evaluator_recall_train = create_recall_evaluator("train", k=1)
evaluator_recall_valid = create_recall_evaluator("valid", k=1)

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))

Train: {'yt-title-thumbnail-train_Recall@1': 0.6415094339622641}
Valid: {'yt-title-thumbnail-valid_Recall@1': 0.6363636363636364}
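Recall@1 is much lower than the triplet accuracy because it is a harder test: the correct title has to beat every other title in the split, not just one negative. A toy version of the metric, with made-up 2-D embeddings where row i of the texts is the true match for image i:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recall_at_k(image_embs, text_embs, k=1):
    """Fraction of images whose own text (same index) is among the k most similar texts."""
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [cosine(img, txt) for txt in text_embs]
        # indices of the k highest-scoring texts for this image
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(image_embs)

# hypothetical embeddings
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.2], [0.1, 0.8], [0.2, 1.0]]

recall_1 = recall_at_k(image_embs, text_embs, k=1)
recall_2 = recall_at_k(image_embs, text_embs, k=2)
print(f"Recall@1: {recall_1:.3f}, Recall@2: {recall_2:.3f}")
```

In this toy data the third image ranks the first text slightly above its own, so it misses at k=1 but is recovered at k=2.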


define training args


In [23]:
# define loss (note: loss expects columns to be ordered as anchor-positive-negative)
loss = MultipleNegativesRankingLoss(model)

# hyperparameters
num_epochs = 2
batch_size = 16
lr = 1e-4
finetuned_model_name = "clip-title-thumbnail-embeddings"

train_args = SentenceTransformerTrainingArguments(
    output_dir=f"models/{finetuned_model_name}",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=lr,
    # Evaluation settings
    eval_strategy="epoch",
    eval_steps=1,
    logging_steps=1,
)
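For intuition about the loss chosen above: a multiple-negatives ranking loss treats each anchor's own positive as the correct class and every other positive and negative in the batch as a wrong class, then applies cross-entropy to scaled cosine similarities. The sketch below is a plain-Python approximation with made-up embeddings; `scale=20.0` is assumed to mirror the library default, and this is not the `sentence_transformers` source.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]

def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Each anchor must rank its own positive above all other positives and all negatives."""
    candidates = positives + negatives  # positive for anchor i sits at index i
    losses = [
        cross_entropy([scale * cosine(anchor, c) for c in candidates], i)
        for i, anchor in enumerate(anchors)
    ]
    return sum(losses) / len(losses)

# hypothetical embeddings: thumbnails, matching titles, hard-negative titles
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[0.5, 0.5], [-1.0, 0.0]]

loss_matched = mnr_loss(anchors, positives, negatives)
loss_swapped = mnr_loss(anchors, positives[::-1], negatives)
print(f"matched: {loss_matched:.4f}, swapped: {loss_swapped:.4f}")
```

With a batch size of 16, each thumbnail is contrasted against many more titles than the single `title_neg` stored in its own row.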

fine-tune model

In [24]:
%%time
trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    loss=loss,
    evaluator=[evaluator_recall_train, evaluator_recall_valid],
)
trainer.train()



Epoch,Training Loss,Validation Loss,Yt-title-thumbnail-train Recall@1,Yt-title-thumbnail-valid Recall@1,Sequential Score
1,0.755,1.495506,0.830189,0.818182,0.818182
2,0.6268,1.490477,0.867925,0.818182,0.818182


CPU times: user 20.3 s, sys: 2.75 s, total: 23 s
Wall time: 2min 15s


TrainOutput(global_step=8, training_loss=1.3843472972512245, metrics={'train_runtime': 134.4526, 'train_samples_per_second': 0.788, 'train_steps_per_second': 0.06, 'total_flos': 0.0, 'train_loss': 1.3843472972512245, 'epoch': 2.0})
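The `global_step=8` and the throughput in the output above follow from the dataset size and the hyperparameters. A quick check, assuming a single device, no gradient accumulation, and that the last partial batch is kept:

```python
import math

train_examples = 53      # rows in the train split
batch_size = 16          # per_device_train_batch_size
num_epochs = 2
train_runtime = 134.4526 # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_examples / batch_size)  # 3 full batches + 1 partial
global_steps = steps_per_epoch * num_epochs
samples_per_second = train_examples * num_epochs / train_runtime

print(steps_per_epoch, global_steps)   # 4 8
print(round(samples_per_second, 3))    # 0.788
```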

evaluate fine-tuned model

In [25]:
evaluator_test = create_triplet_evaluator("test")

print("Train:", evaluator_train(model))
print("Valid:", evaluator_valid(model))
# score the held-out test split with its own evaluator
print("Test:", evaluator_test(model))

Train: {'yt-title-thumbnail-train_cosine_accuracy': 1.0}
Valid: {'yt-title-thumbnail-valid_cosine_accuracy': 1.0}


In [26]:
evaluator_recall_test = create_recall_evaluator("test")

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))
print("Test:", evaluator_recall_test(model))

Train: {'yt-title-thumbnail-train_Recall@1': 0.8679245283018868}
Valid: {'yt-title-thumbnail-valid_Recall@1': 0.8181818181818182}
Test: {'yt-title-thumbnail-test_Recall@1': 0.75}


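Fine-tuning makes this easy: I do not need a large dataset for my problem.
- With only 53 training pairs, we fine-tuned the model by freezing more than 99% of its parameters and training only the two projection layers.
- Recall@1 rose from 0.64 to 0.87 on the training split and from 0.64 to 0.82 on the validation split; the test split reached 0.75 after fine-tuning.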
