# Interacting with CLIP

This is a self-contained notebook that shows how to download and run CLIP models, calculate the similarity between arbitrary image and text inputs, and perform zero-shot image classification.

# Preparation for Colab

Make sure you're running a GPU runtime; if not, select "GPU" as the hardware accelerator in Runtime > Change Runtime Type in the menu. The next cells install the `clip` package and its dependencies, then print the installed PyTorch version so you can confirm that it is 1.7.1 or later.

In [1]:
! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 1.7 MB/s
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-k_672h_e
  Running command git clone -q https://github.com/openai/CLIP.git /tmp/pip-req-build-k_672h_e
Building wheels for collected packages: clip
  Building wheel for clip (setup.py) ... done
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369409 sha256=3e56c3a0da186524bde173845af8bae3796c82ca339f790ddc7a904acdd2c9ac
  Stored in directory: /tmp/pip-ephem-wheel-cache-ylgnxy1o/wheels/fd/b9/c3/5b4470e35ed76e174bff77c92f91da82098d5e35fd5bc8cdac
Successfully

In [2]:
import numpy as np
import torch

# Print the installed version; CLIP needs PyTorch 1.7.1 or later.
print("Torch version:", torch.__version__)


Torch version: 1.12.1+cu113
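
The cell above only prints the version, so the comparison against 1.7.1 is left to the reader. A minimal sketch of doing that check in code follows. The helper `version_tuple` and the constant `MIN_TORCH` are hypothetical names introduced here for illustration, and the version string is hard-coded from the output above so the snippet runs without PyTorch; in the notebook you would pass `torch.__version__` instead.

```python
import re

def version_tuple(version):
    """Parse the leading 'major.minor[.patch]' of a version string into ints.

    Hypothetical helper: local suffixes such as '+cu113' are ignored,
    and a missing patch number is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())

MIN_TORCH = (1, 7, 1)           # minimum version named in the text above
installed = "1.12.1+cu113"      # in the notebook, use torch.__version__ here

# Tuples compare element by element, so (1, 12, 1) >= (1, 7, 1) holds.
meets_minimum = version_tuple(installed) >= MIN_TORCH
print("PyTorch >= 1.7.1:", meets_minimum)
```

Comparing integer tuples avoids the pitfall of comparing version strings directly, where `"1.12.1"` would sort before `"1.7.1"`.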


# Loading the model

`clip.available_models()` will list the names of available CLIP models.

In [5]:
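import clip
from PIL import Image

# Names of the available models. The list is displayed only when this call
# is the last expression in a cell; wrap it in print() to see it here.
clip.available_models()

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Returns the model and the image preprocessing transform that goes with it.
model, preprocess = clip.load("ViT-B/32", device=device)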

100%|████████████████████████████████████████| 338M/338M [00:01<00:00, 182MiB/s]


In [9]:
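# Preprocess the image and add a batch dimension; tokenize the candidate labels.
image = preprocess(Image.open("OIP.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["cellphone", "phone", "telephone", "book"]).to(device)

with torch.no_grad():
    # Embeddings for the image and for each label (not used for the probabilities below).
    image_features = model.encode_image(image).float()
    text_features = model.encode_text(text).float()

    # Image-to-text similarity logits; softmax over the labels gives probabilities.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)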

Label probs: [[0.1744   0.8193   0.004436 0.00151 ]]
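
The probabilities above come from scaled cosine similarities between the image embedding and each label embedding, followed by a softmax over the labels. The sketch below reproduces that computation in NumPy to make the steps explicit. It rests on several assumptions: the function name `zero_shot_probs` is hypothetical, the random arrays are stand-ins for the outputs of `model.encode_image` and `model.encode_text` (so the resulting numbers are meaningless), and the scale of 100 is only an approximation of `model.logit_scale.exp()`, which you should read from the loaded model for real use.

```python
import numpy as np

def zero_shot_probs(image_features, text_features, logit_scale=100.0):
    """Scaled cosine similarity followed by a softmax over the text labels.

    Hypothetical helper. logit_scale=100.0 is an assumed value; read
    model.logit_scale.exp() from the loaded model for the actual one.
    """
    # L2-normalise each row so that a dot product equals cosine similarity.
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)

    # One row per image, one column per label.
    logits = logit_scale * image_features @ text_features.T

    # Numerically stable softmax: subtracting the row maximum leaves the result unchanged.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Random stand-ins for real embeddings (ViT-B/32 produces 512-dimensional vectors).
rng = np.random.default_rng(0)
image_features = rng.normal(size=(1, 512))
text_features = rng.normal(size=(4, 512))

labels = ["cellphone", "phone", "telephone", "book"]
probs = zero_shot_probs(image_features, text_features)
best = labels[int(probs[0].argmax())]
print("Predicted label:", best)
```

With real embeddings, the index of the largest probability picks the predicted label; for the output above that is index 1, "phone", at about 0.82.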
