Load the model using the example code from `https://huggingface.co/openai/clip-vit-base-patch32` (here the checkpoint is read from a local copy under `../models/`).

In [None]:
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

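# Load the model and processor from a local copy of the checkpoint
model = CLIPModel.from_pretrained("../models/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("../models/clip-vit-base-patch32")

# Download a sample image from the COCO val2017 set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the candidate captions and preprocess the image into tensors
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)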

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

In [6]:
inputs

{'input_ids': tensor([[49406,   320,  1125,   539,   320,  2368, 49407],
        [49406,   320,  1125,   539,   320,  1929, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]]), 'pixel_values': tensor([[[[ 1.9303,  1.9303,  1.9303,  ...,  1.9303,  1.9303,  1.9303],
          [ 1.9303,  1.9303,  1.9303,  ...,  1.9303,  1.9303,  1.9303],
          [ 1.9303,  1.9303,  1.9303,  ...,  1.9303,  1.9303,  1.9303],
          ...,
          [-1.7777, -1.7777, -1.7777,  ..., -1.7777, -1.7777, -1.7777],
          [-1.7777, -1.7777, -1.7777,  ..., -1.7777, -1.7777, -1.7777],
          [-1.7777, -1.7777, -1.7777,  ..., -1.7777, -1.7777, -1.7777]],

         [[ 2.0749,  2.0749,  2.0749,  ...,  2.0749,  2.0749,  2.0749],
          [ 2.0749,  2.0749,  2.0749,  ...,  2.0749,  2.0749,  2.0749],
          [ 2.0749,  2.0749,  2.0749,  ...,  2.0749,  2.0749,  2.0749],
          ...,
          [-1.7371, -1.7371, -1.7371,  ..., -1.7371, -1.7371, -1.7371],
          [-1.73

In [7]:
probs, logits_per_image

(tensor([[0.2435, 0.7565]], grad_fn=<SoftmaxBackward0>),
 tensor([[20.4315, 21.5653]], grad_fn=<TBackward0>))
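The relationship between the two tensors above can be checked by hand. The following is a minimal pure-Python sketch of softmax (not the PyTorch implementation) applied to the logits printed above; the helper name `softmax` is only illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the cell output above: [cat caption, dog caption]
logits = [20.4315, 21.5653]
probs = softmax(logits)
print([round(p, 4) for p in probs])
```

Only the difference between the logits matters here: a gap of about 1.13 is enough to give the second caption roughly three quarters of the probability mass.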

In [8]:
image

array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...,

       [[  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        ...,
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1]],

       [[  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        ...,
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1]],

       [[  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        ...,
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1]]

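The model loads and makes predictions successfully. The COCO image contains cats, so the logit for "a photo of a cat" should be the larger of the two; for this image the logits are `tensor([[24.5701, 19.3049]])`. (The `probs, logits_per_image` output shown above has different values and favors the dog caption, so it appears to come from a run on a different image.)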

Inspect the architecture of `model` and the contents of `processor`.

The CLIP processor contains an image processor and a tokenizer.
- **image processor:** resizes, crops, and normalizes the images and converts them to tensors.
- **tokenizer:** converts the text to tokens (numerical representations).

In [6]:
processor

CLIPProcessor:
- image_processor: CLIPImageProcessor {
  "crop_size": {
    "height": 224,
    "width": 224
  },
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "CLIPImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "shortest_edge": 224
  }
}

- tokenizer: CLIPTokenizerFast(name_or_path='../models/clip-vit-base-patch32', vocab_size=49408, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	49406: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single
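The rescale and normalize steps can be reproduced by hand from the `rescale_factor`, `image_mean`, and `image_std` values printed above. This is a minimal sketch for a single pixel value, not the `CLIPImageProcessor` code itself; `normalize_pixel` is a hypothetical helper, and the resize and center-crop steps are skipped, so it only matches regions of uniform color.

```python
# Values copied from the CLIPImageProcessor config printed above
image_mean = [0.48145466, 0.4578275, 0.40821073]
image_std = [0.26862954, 0.26130258, 0.27577711]
rescale_factor = 0.00392156862745098  # equals 1 / 255


def normalize_pixel(value, channel):
    """Rescale a 0-255 pixel value to 0-1, then standardize it per channel."""
    scaled = value * rescale_factor
    return (scaled - image_mean[channel]) / image_std[channel]


# A white-ish pixel (255) and a near-black pixel (1), channels in R, G, B order
for value in (255, 1):
    print(value, [round(normalize_pixel(value, c), 4) for c in range(3)])
```

For the red and green channels this gives the values seen in `pixel_values` earlier: about 1.9303 and 2.0749 for a pixel value of 255, and about -1.7777 and -1.7371 for a pixel value of 1.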

In [7]:
model

CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPSdpaAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e
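The `QuickGELUActivation` in the printout above is a sigmoid-based approximation of GELU, `x * sigmoid(1.702 * x)`. The sketch below compares it with the exact GELU in pure Python; the function names are illustrative and this is not the `transformers` implementation.

```python
import math


def quick_gelu(x):
    """QuickGELU: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:+.1f}  quick_gelu={quick_gelu(x):+.4f}  gelu={gelu(x):+.4f}")
```

The two curves stay close over this range, which is why the activation is often summarized simply as "GELU".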

The CLIP model consists of two parts:
- Text model: it seems to be an encoder-only Transformer with:
  - Embedding size: 512
  - Number of layers: 12
  - Vocab size: 49408
  - Intermediate size: 4*512 = 2048
  - Normalization: LayerNorm
  - Attention type: scaled dot-product attention (`CLIPSdpaAttention`)
  - Activation: QuickGELU
- Image model: a Vision Transformer (ViT) with:
  - Embedding size: 768
  - Number of layers: 12
  - Patch size / patches per side: 32 / 7 (224 / 32 = 7, so 7 x 7 = 49 patches per image)
  - Intermediate size: 4*768 = 3072
  - Normalization: LayerNorm
  - Attention type: scaled dot-product attention
  - Activation: QuickGELU
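The sizes in the list above follow from simple arithmetic on the input resolution and hidden sizes. The sketch below derives them; the variable names are illustrative, and the extra class token in the sequence length is an assumption based on the standard ViT design rather than something shown in the truncated printout.

```python
# Values taken from the notes above and the image processor config
image_size = 224   # crop_size height/width
patch_size = 32    # from the checkpoint name "patch32"
text_hidden = 512
vision_hidden = 768

patches_per_side = image_size // patch_size  # the "7" in "32/7"
num_patches = patches_per_side ** 2          # patches per image
vision_seq_len = num_patches + 1             # assumption: one class token is prepended

text_intermediate = 4 * text_hidden          # matches fc1 out_features=2048
vision_intermediate = 4 * vision_hidden

print("patches per side:", patches_per_side)
print("patches per image:", num_patches)
print("vision sequence length (with class token):", vision_seq_len)
print("text MLP size:", text_intermediate)
print("vision MLP size:", vision_intermediate)
```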

In [3]:
import cv2

# cv2.imread returns a NumPy array in BGR channel order, or None if the file cannot be read
img = cv2.imread("../dataset/Natural Human Face Images for Emotion Recognition/neutrality/images - 2020-11-06T190127.785_face.png")

In [4]:
img.shape

(224, 224, 3)