# CLIP-based Open-Vocabulary Image Classification Tutorial
[June Moh Goo](https://www.linkedin.com/in/jmgoo1118/) / PhD Student in Computer Vision for 3D Point Clouds

---


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1i8ua1BmlfFGUm51QkngPFx_yahD15Tmb#scrollTo=Z5wZpN2EOIMR) [![Paper](https://img.shields.io/badge/arXiv-2103.00020-b31b1b.svg)](https://arxiv.org/pdf/2103.00020) [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/OpenAI/CLIP)

In this tutorial, we will demonstrate how to use the CLIP model proposed by OpenAI to perform open-vocabulary image classification. CLIP (Contrastive Language-Image Pre-training) aligns images and text in the same embedding space, enabling classification without fixed class labels, simply by providing textual prompts.

## Key Idea

The key idea of OpenAI's CLIP is to learn a shared embedding space for images and text, so that an image and a sentence describing it map to nearby vectors. CLIP learns this space with contrastive learning on a large dataset of image-text pairs (about 400 million in the original paper).

Main Concepts:
- Image-Text Matching: CLIP learns to align images and their corresponding textual descriptions by maximizing their similarity in the shared embedding space.
- Multimodal Representation Learning: It uses two separate encoders (an image encoder, either a ResNet or a Vision Transformer, and a Transformer text encoder) to process each modality and project both into a common embedding space.
- Zero-Shot Learning: After training, CLIP can generalize to new tasks and datasets without additional fine-tuning, allowing text descriptions to classify images directly.
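
To make the contrastive objective behind these concepts concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of image-text pairs. It is an illustration under stated assumptions, not CLIP's training code: the embeddings are random stand-ins, the function names are hypothetical, and the temperature is fixed at 0.07, whereas CLIP learns it during training.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature is a fixed assumption here; CLIP learns it.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # shape [N, N]
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return (loss_i2t + loss_t2i) / 2

# Synthetic stand-ins for encoder outputs (4 pairs, 8 dimensions).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))  # well-matched pairs
shuffled_images = aligned_images[[1, 2, 3, 0]]               # deliberately mismatched pairs

loss_aligned = symmetric_contrastive_loss(aligned_images, text_emb)
loss_shuffled = symmetric_contrastive_loss(shuffled_images, text_emb)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are mismatched, which is the pressure that pulls matching images and texts together in the shared space.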

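CLIP's strength lies in its generalization capability: in the original paper, zero-shot CLIP matches the ImageNet accuracy of a fully supervised ResNet-50 without using any ImageNet training labels, although accuracy still varies considerably across datasets and tasks.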

For example, CLIP can identify a picture of a cat without being explicitly trained on a "cat classification" dataset. Instead, you provide text prompts like "a photo of a cat", "a photo of a dog", and "a photo of a car", and it will typically associate the cat image with the matching description based on its pre-trained image-text alignment. This flexibility allows CLIP to classify or describe images for tasks it wasn't specifically trained for, such as identifying objects, scenes, or even abstract concepts, simply by providing appropriate textual labels.


![CLIP figure](https://github.com/openai/CLIP/blob/main/CLIP.png?raw=true)

## What You Will Learn

- How to install and load the CLIP model in Google Colab
- How to load and preprocess an example image
- How to create image and text embeddings using CLIP
- How to compare image embeddings against multiple textual prompts to find the best match

## Requirements

- Google Colab environment (GPU recommended)
- `torch`, `clip`, `Pillow`, and `requests` libraries
- Internet connection (to download the model weights and the example image)

## References

- [OpenAI CLIP GitHub Repository](https://github.com/openai/CLIP)
- [Hugging Face Transformers (CLIP)](https://huggingface.co/docs/transformers/model_doc/clip)


In [None]:
# Check runtime environment (GPU recommended)
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
device


## Installation and Imports

Let's install CLIP and import the necessary libraries.


In [None]:
%pip install git+https://github.com/openai/CLIP.git --upgrade ftfy regex tqdm

In [None]:
import clip
from PIL import Image
import requests
from io import BytesIO
import torch

# Set the device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"

## Load an Example Image

We'll fetch an example image from the web. For instance, let's use a photo of London.


In [None]:
image_url = "https://a.travel-assets.com/findyours-php/viewfinder/images/res70/100000/100677-London.jpg"

response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail early if the download did not succeed
image = Image.open(BytesIO(response.content)).convert("RGB")
image


## Load the CLIP Model

Load the CLIP model and its matching image preprocessing transform. Text is tokenized separately with `clip.tokenize`.


In [None]:
model, preprocess = clip.load("ViT-B/32", device=device)


## Preprocessing and Extracting Image Embeddings

Use the provided `preprocess` function to convert the image into a suitable tensor and then pass it through the model to obtain the image embedding.
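
To give a rough idea of what `preprocess` does to the pixels, here is a NumPy-only sketch of the final steps: center crop, scaling to [0, 1], per-channel normalization, and reordering to channels-first. It assumes the image has already been resized so that its shorter side is 224 pixels (the real transform does this with bicubic interpolation), and the helper names are hypothetical. The mean and standard deviation are the constants used by the OpenAI CLIP repository.

```python
import numpy as np

# Per-channel statistics used by the OpenAI CLIP preprocessing transform.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def center_crop(pixels, size):
    """Cut a size x size window from the middle of an HxWx3 array."""
    height, width, _ = pixels.shape
    top = (height - size) // 2
    left = (width - size) // 2
    return pixels[top:top + size, left:left + size]

def to_model_input(pixels_uint8):
    """Turn an HxWx3 uint8 image into a 1x3xHxW float32 batch."""
    scaled = pixels_uint8.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    normalized = (scaled - CLIP_MEAN) / CLIP_STD      # broadcast over the channel axis
    channels_first = normalized.transpose(2, 0, 1)    # HWC -> CHW
    return channels_first[np.newaxis, ...]            # add the batch dimension

# Synthetic image standing in for a photo already resized to a shorter side of 224.
fake_image = np.random.default_rng(0).integers(0, 256, size=(224, 300, 3), dtype=np.uint8)
batch = to_model_input(center_crop(fake_image, 224))
print(batch.shape, batch.dtype)
```

The resulting shape matches what `preprocess(image).unsqueeze(0)` produces for the ViT-B/32 model: one image, three channels, 224 by 224 pixels.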


In [None]:
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)


## Define Text Prompts

We will define several text prompts and let CLIP determine which one best matches the image:

- "a photo of a bus"
- "a photo of a taxi"
- "a photo of a road"
- "a photo of a cat"
- "a photo of a dog"
- "a photo of a bird"
- "a photo of a computer"
- "a photo of a phone"


In [None]:
text_prompts = [
    "a photo of a bus",
    "a photo of a taxi",
    "a photo of a road",
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird",
    "a photo of a computer",
    "a photo of a phone"
]

text_inputs = clip.tokenize(text_prompts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)


## Computing Similarity and Results

We'll compute the cosine similarity between the image embedding and each text embedding. The text prompt with the highest similarity score should best describe the image.
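
Because both embeddings are L2-normalized earlier in the tutorial, a plain dot product is enough to obtain the cosine similarity. The small NumPy sketch below checks this on two toy 2-D vectors; the vectors and names are made up for illustration and are not real CLIP embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for an image embedding and a text embedding.
image_vec = np.array([3.0, 4.0])
text_vec = np.array([4.0, 3.0])

# Normalize once, as done for image_features and text_features.
image_unit = image_vec / np.linalg.norm(image_vec)
text_unit = text_vec / np.linalg.norm(text_vec)

direct = cosine_similarity(image_vec, text_vec)  # (3*4 + 4*3) / (5*5) = 0.96
via_dot = float(image_unit @ text_unit)          # dot product of unit vectors

print(direct, via_dot)
```

Both values agree, which is why the matrix product `image_features @ text_features.T` yields one cosine similarity per prompt.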


In [None]:
similarity = (image_features @ text_features.T).cpu().numpy().flatten()

# Print all similarity scores along with their corresponding prompt
for prompt, score in zip(text_prompts, similarity):
    print(f"Prompt: {prompt} | Similarity score: {score:.4f}")

# Identify and print the best match
best_match_idx = similarity.argmax()
best_prompt = text_prompts[best_match_idx]
print("\nThe most similar text prompt to the input image is:", best_prompt)
print("Similarity score:", similarity[best_match_idx])


## Interpretation

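Since the example image is a photo of London, we expect a scene-related prompt such as "a photo of a bus", "a photo of a taxi", or "a photo of a road" to have the highest similarity score, and unrelated prompts such as "a photo of a cat" to score lower. Raw cosine similarities often differ only slightly between prompts, so compare them relative to each other rather than reading them as absolute confidences.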
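
Raw similarity scores can be turned into a distribution over the prompts with a scaled softmax, which is how the example in the OpenAI CLIP README reports its results. The sketch below uses NumPy and made-up similarity values rather than real model output. The scale of 100 follows that README example, and the function name is hypothetical.

```python
import numpy as np

def similarities_to_probs(similarity, scale=100.0):
    """Scaled softmax over cosine similarities (scale=100 follows the CLIP README example)."""
    logits = scale * np.asarray(similarity, dtype=np.float64)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Made-up similarity scores for illustration only, not actual CLIP output.
example_prompts = ["a photo of a bus", "a photo of a taxi", "a photo of a cat"]
example_scores = [0.27, 0.24, 0.18]

probs = similarities_to_probs(example_scores)
for prompt, prob in zip(example_prompts, probs):
    print(f"{prompt}: {prob:.3f}")
```

These values are only relative to the prompts supplied: adding or removing a prompt changes every probability, so they should not be read as calibrated confidences.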

This completes our simple demonstration of using CLIP for open-vocabulary image classification.

## Next Steps

- Try different prompts. For instance, try describing specific features of the object in the image.
- Experiment with various images and see how CLIP performs.
- Explore advanced applications: zero-shot classification for more nuanced concepts, image retrieval, or even image captioning tasks.
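
As a starting point for the image retrieval idea, the sketch below ranks a small gallery of image embeddings against one text query embedding. It is a minimal illustration with random NumPy vectors standing in for the outputs of `model.encode_image` and `model.encode_text`; the function name `top_k_matches` is hypothetical.

```python
import numpy as np

def top_k_matches(query_emb, gallery_embs, k=3):
    """Return (index, cosine similarity) for the k gallery rows closest to the query."""
    query = query_emb / np.linalg.norm(query_emb)
    gallery = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = gallery @ query           # one cosine similarity per gallery item
    order = np.argsort(-scores)[:k]    # indices sorted from best to worst
    return [(int(i), float(scores[i])) for i in order]

# Random stand-ins: 5 "image" embeddings and a "text" query placed near image 2.
rng = np.random.default_rng(0)
gallery_embs = rng.normal(size=(5, 16))
query_emb = gallery_embs[2] + 0.05 * rng.normal(size=16)

matches = top_k_matches(query_emb, gallery_embs, k=2)
print(matches)
```

With real data, the gallery embeddings would be computed once and cached, and only the text query would need to be encoded for each search.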

This concludes the basic CLIP open-vocabulary classification tutorial.
