This notebook explores CLIP (Contrastive Language-Image Pre-training), a multimodal model developed by OpenAI that learns joint representations of images and their textual descriptions.

## Setting up Environment

In [2]:
# Checking CUDA version.
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [3]:
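# Note: PyTorch 1.7.1 predates CUDA 11.8, so this pin does not match the toolkit reported above.
# conda is also unavailable in this environment (see the output below), so this line has no effect here.
!conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.8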
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git

/bin/bash: line 1: conda: command not found
Collecting ftfy
  Downloading ftfy-6.1.3-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.4/53.4 kB 1.5 MB/s eta 0:00:00
Installing collected packages: ftfy
Successfully installed ftfy-6.1.3
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-9j8ot9d5
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-9j8ot9d5
  Resolved https://github.com/openai/CLIP.git to commit a1d071733d7111c9c014f024669f959182114e33
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: clip
  Building wheel for clip (setup.py) ... done
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369497 sha256=21584c60cf3c308f954a5ca51aa915581d56ee7e22027df05caa550bd9b0d953
  Stored in directory: /tmp/pip-ephem-wheel-cache-d7saljw
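
Because the `conda` step above did not run, it can help to confirm which of the required packages are importable before continuing. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `required` list are illustrative assumptions, not part of CLIP.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules the rest of this notebook relies on (assumed list, adjust as needed).
required = ["torch", "torchvision", "ftfy", "regex", "tqdm", "clip"]
missing = missing_packages(required)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If `torch` shows up as missing, it has to be installed some other way (for example with `pip`), since the `conda` command is not available here.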

## Hello World
Before running the cell below, place an image named `CLIP.png` in the notebook's working directory.

In [12]:
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

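# Candidate captions to score against the image.
captions = ["a diagram", "a dog", "a cat"]

# Preprocess the image into a batch of one and tokenize the captions.
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # Standalone embeddings (not used below; shown for reference).
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Image-to-text similarity logits, converted to probabilities over the captions.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()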

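The logits returned by `model(image, text)` are, roughly, cosine similarities between the image and text embeddings multiplied by a learned scale, and the softmax turns them into probabilities over the captions. The NumPy sketch below illustrates that computation on made-up 4-dimensional vectors. The function name `clip_style_probs`, the toy vectors, and the scale of 100 are assumptions for illustration, not values read from the model.

```python
import numpy as np


def clip_style_probs(image_feats, text_feats, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one row per image."""
    # L2-normalise so that the dot product equals cosine similarity.
    image_feats = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)

    # Scaled similarity of every image against every caption.
    logits = logit_scale * image_feats @ text_feats.T

    # Numerically stable softmax along the caption axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


# Toy stand-ins for one image embedding and three caption embeddings.
toy_image = np.array([[0.9, 0.1, 0.0, 0.4]])
toy_text = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.5],
])

toy_probs = clip_style_probs(toy_image, toy_text)
print(toy_probs)
```

The large scale factor is why small differences in cosine similarity can produce a very peaked probability distribution.
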

In [14]:
print("Text Probs: ")
for caption, prob in zip(captions, probs[0]):
    print(caption)
    print(f"\t {prob}")

Text Probs: 
a diagram
	 0.0006062454776838422
a dog
	 0.0053796311840415
a cat
	 0.9940140843391418


### Remarks
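The model assigns a probability of about 0.994 to "a cat", the correct caption for this image, and less than 0.01 combined to the other two captions. Because the softmax is taken over the three candidate captions only, these probabilities are relative to this list rather than absolute confidences.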
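
To turn the probabilities into a single prediction, one option is to pick the caption with the highest score. The sketch below hard-codes the values printed above so that it runs on its own; in the notebook, `probs[0]` would take the place of the literal list.

```python
captions = ["a diagram", "a dog", "a cat"]
# Probabilities copied from the output above.
probs_row = [0.0006062454776838422, 0.0053796311840415, 0.9940140843391418]

# Index of the highest-probability caption.
best_idx = max(range(len(probs_row)), key=probs_row.__getitem__)
best_caption = captions[best_idx]

print(f"Predicted caption: {best_caption} ({probs_row[best_idx]:.3f})")
```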