Author: [Nelson Lin](https://www.linkedin.com/in/nelson-l-842564164/)


# Inference PyTorch CLIP Model with ONNX Runtime on GPU


In this tutorial, you'll learn how to convert the HuggingFace [CLIP Model](https://huggingface.co/openai/clip-vit-base-patch32) to ONNX and run high-performance inference on it using ONNX Runtime on GPU. The comparison below, between native PyTorch and ONNX inference on GPU, shows that ONNX can accelerate inference substantially. It has become one of the best choices for deploying our LLMs, saving cost and improving efficiency.

Now let's convert CLIP to ONNX and see how much it accelerates inference!


<img src='https://miro.medium.com/v2/resize:fit:720/format:webp/1*4GREvqUWnFU9VXuNk2HEFQ.png'></img>

reference: https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333

## (1) Prerequisites

In [1]:
!pip install transformers==4.31.0
!pip install onnx
!pip install onnxruntime-gpu

Collecting transformers==4.31.0
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers==4.31.0)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.31.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers==4.31.0)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## (2) Load Pretrained CLIP model

In [2]:
import os
import time
import torch
import requests
from tqdm import tqdm
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [4]:
model_name = "openai/clip-vit-base-patch32"

# Load the Model
model = CLIPModel.from_pretrained(model_name)
model = model.eval()
model = model.to(device)


# Load Processor
processor = CLIPProcessor.from_pretrained(model_name)


## (3) Inference Using PyTorch

### (3.1) Load Image

In [5]:
url = "https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/02/14/12/duck-rabbit.png"
image = Image.open(requests.get(url, stream=True).raw)

<img src="https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/02/14/12/duck-rabbit.png" width=512> </img>

### (3.2) Design Prompt As Labels For Zero Shot Inference

In [6]:
prompt_labels = ["a photo of a duck", "a photo of a rabbit"]

### (3.3) Understand Inputs

In [7]:
# Preprocess the text and image into tensors, the numeric features that the model consumes
inputs = processor(text=prompt_labels,
                   images=[image], return_tensors="pt", padding=True)

In [8]:
for input_type in inputs:
    inputs[input_type] = inputs[input_type].to(device)

In [9]:
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'pixel_values'])


In [10]:
# input_ids: the vocabulary indices of the tokens in each prompt, one row per prompt
# attention_mask: prompts are padded to a common length; the mask marks real tokens with 1 and padding (to be ignored) with 0
# pixel_values: the preprocessed image tensor
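
To make the relationship between `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of right-padding. The token IDs and the padding ID below are made up for illustration; they are not real CLIP vocabulary indices, and the processor's actual padding logic may differ in detail.

```python
# Hypothetical token IDs for two prompts of different lengths
# (illustrative values only, not real CLIP vocabulary indices).
token_ids = [
    [101, 7, 8, 9, 102],
    [101, 7, 8, 102],
]
PAD_ID = 0  # assumed padding ID, for illustration only


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(token_ids, PAD_ID)
print(input_ids)
print(attention_mask)
```

Both lists end up with the same rectangular shape, which is why `input_ids` and `attention_mask` always share their dimensions.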

In [11]:
inputs['input_ids'].shape  # ['num_prompt_label','max_prompt_seq_len']

torch.Size([2, 7])

In [12]:
inputs['attention_mask'].shape  # ['num_prompt_label','max_prompt_seq_len']

torch.Size([2, 7])

In [13]:
inputs['pixel_values'].shape  # ['batch_size','channel','height','width']

torch.Size([1, 3, 224, 224])

### (3.4) Inference

In [14]:
with torch.no_grad():
    outputs = model(**inputs)

### (3.5) Understand Output

In [15]:
outputs.keys()

odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])

In [16]:
outputs['logits_per_image'].shape # ['batch_size','num_prompt_label']

torch.Size([1, 2])

In [17]:
outputs['logits_per_text'].shape # ['num_prompt_label','batch_size']

torch.Size([2, 1])

In [18]:
outputs['text_embeds'].shape # ['num_prompt_label','embedding_size']

torch.Size([2, 512])

In [19]:
outputs['image_embeds'].shape # ['batch_size','embedding_size']

torch.Size([1, 512])
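
The shapes above fit together as follows: CLIP scores each image against each prompt by the scaled cosine similarity of their embeddings, and `logits_per_text` is the transpose of `logits_per_image`. The sketch below reproduces this in pure Python with made-up 3-dimensional embeddings and an assumed scale of 100.0; the real embeddings have 512 dimensions and the real scale is a learned parameter of the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings: 1 image, 2 prompts (illustrative values only)
image_embeds = [l2_normalize([1.0, 2.0, 2.0])]
text_embeds = [l2_normalize([1.0, 0.0, 0.0]),
               l2_normalize([1.0, 2.0, 2.0])]
logit_scale = 100.0  # assumed value, for illustration only

# Shape [batch_size, num_prompt_label]
logits_per_image = [[logit_scale * dot(img, txt) for txt in text_embeds]
                    for img in image_embeds]
# Shape [num_prompt_label, batch_size]: the transpose
logits_per_text = [list(col) for col in zip(*logits_per_image)]

print(logits_per_image)
print(logits_per_text)
```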

In [20]:
outputs['text_model_output'].pooler_output.shape # ['num_prompt_label','text_model_hidden_size']

torch.Size([2, 512])

In [21]:
outputs['vision_model_output'].last_hidden_state.shape # ['batch_size','num_patches + 1','vision_model_hidden_size']

torch.Size([1, 50, 768])
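
The middle dimension of 50 can be checked with a little arithmetic: the vision transformer cuts the image into square patches and prepends one class token. The sketch below assumes a patch size of 32, taken from the `patch32` suffix of the model name.

```python
image_size = 224  # height and width, from pixel_values.shape
patch_size = 32   # assumed from the "patch32" suffix in the model name

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
seq_len = num_patches + 1  # one extra position for the class token

print(patches_per_side, num_patches, seq_len)
```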

In [22]:
# this is the image-text similarity score
logits_per_image = outputs.logits_per_image
# we can take the softmax to get the label probabilities from 0 - 1
probs = logits_per_image.softmax(dim=1)

In [25]:
for prob,label in zip(probs.cpu().numpy()[0],prompt_labels):
    print(f"Label::{label} -> Probability:{prob}")

Label::a photo of a duck -> Probability:0.007603303529322147
Label::a photo of a rabbit -> Probability:0.9923966526985168
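
To see what `softmax` does to the logits, here is a small pure-Python version. The logits are hypothetical values chosen for illustration, not the ones produced by the model above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [20.0, 25.0]  # hypothetical logits for two labels
example_probs = softmax(example_logits)
print(example_probs)
```

Only the difference between logits matters: a gap of 5 already pushes almost all of the probability mass onto the larger one.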


## (4) Inference Time Using PyTorch

In [26]:
inference_count = 100

In [27]:
sum_time_spent = 0
for idx in tqdm(range(inference_count)):
    with torch.no_grad():
        start = time.time()
        outputs = model(**inputs)
        end = time.time()
        time_spent = end-start
        sum_time_spent+=time_spent

pytorch_avg_time_spent = sum_time_spent/inference_count

100%|██████████| 100/100 [00:04<00:00, 23.37it/s]


In [28]:
print(f"Average Inference Time Using Pytorch: {pytorch_avg_time_spent*1000} ms")

Average Inference Time Using Pytorch: 41.29441976547241 ms
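
Timing GPU code needs some care, because CUDA kernels are launched asynchronously and the first few calls often include one-off setup cost. Below is a sketch of a reusable timing helper with warm-up runs and an optional synchronization hook. The `benchmark` function and its parameters are our own illustrative names, not part of PyTorch or ONNX Runtime.

```python
import time


def benchmark(fn, n_runs=100, n_warmup=10, sync=None):
    """Return the average wall-clock time of fn() in milliseconds.

    sync is an optional callable that blocks until queued device work is done.
    """
    for _ in range(n_warmup):
        fn()  # warm-up runs are not timed
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    if sync is not None:
        sync()  # wait for queued work before stopping the clock
    return (time.perf_counter() - start) / n_runs * 1000


# Stand-in workload so the sketch runs without a GPU
calls = []
avg_ms = benchmark(lambda: calls.append(sum(range(1000))), n_runs=50, n_warmup=5)
print(f"average: {avg_ms:.4f} ms over 50 runs")
```

For the model above, one could pass `lambda: model(**inputs)` as `fn` (inside `torch.no_grad()`) and `torch.cuda.synchronize` as `sync` when running on GPU.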


## (5) Convert To ONNX

In [29]:
export_model_path = '/tmp/clip-vit-base-patch32.onnx'

In [30]:
num_class = len(prompt_labels)
print(f"num_class:{num_class}")

num_class:2


### (5.1) Define Dynamic Axes
If we convert the model to ONNX without defining dynamic axes, the input and output shapes are fixed to exactly the shapes of the inputs used for conversion.
However, input shapes often change at inference time. For example, we may increase the batch size from 1 to 128, or add more label prompts. In such cases, we should convert the model with dynamic axes.

In [31]:
dynamic_axes = {
    'input_ids': {0: 'num_class', 1: 'max_prompt_seq_len'},
    'attention_mask': {0: 'num_class', 1: 'max_prompt_seq_len'},
    'pixel_values': {0: 'batch_size'},

    'logits_per_image': {0: "batch_size", 1: "num_class"},
    'logits_per_text': {0: "num_class", 1: "batch_size"},

    'text_embeds': {0: "num_class"},
    'image_embeds': {0: 'batch_size'},

    'text_model_output': {0: 'num_class'},
    'vision_model_output': {0: 'batch_size'},
}
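
The effect of declaring an axis as dynamic can be pictured with a toy shape check. This is a simplified model of the idea, written in pure Python; it is not how ONNX Runtime validates shapes internally, and `shape_is_accepted` is a hypothetical helper.

```python
def shape_is_accepted(export_shape, runtime_shape, dynamic_dims=()):
    """Toy rule: a dimension may differ from its export-time size only if it is dynamic."""
    if len(export_shape) != len(runtime_shape):
        return False
    return all(
        dim in dynamic_dims or exported == actual
        for dim, (exported, actual) in enumerate(zip(export_shape, runtime_shape))
    )


export_shape = (1, 3, 224, 224)  # pixel_values shape used for conversion

# Without dynamic axes, a batch of 128 does not match the exported shape
fixed_ok = shape_is_accepted(export_shape, (128, 3, 224, 224))
# With axis 0 marked dynamic, the batch size may vary
dynamic_ok = shape_is_accepted(export_shape, (128, 3, 224, 224), dynamic_dims={0})
# Axes that were not marked dynamic must still match
resized_ok = shape_is_accepted(export_shape, (128, 3, 256, 256), dynamic_dims={0})

print(fixed_ok, dynamic_ok, resized_ok)
```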

In [32]:
# how we know the correct order of input names: https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/models/clip/modeling_clip.py#L1085-L1087

input_names = ['input_ids', 'pixel_values', 'attention_mask']
args = tuple([inputs[name] for name in input_names])
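
Instead of reading the source, the positional order can also be recovered with `inspect.signature`. The sketch below uses a stand-in `forward` function whose first parameters mirror the linked `CLIPModel.forward` definition; the stand-in and the `positional_order` helper are illustrative, not part of transformers.

```python
import inspect


def forward(self, input_ids=None, pixel_values=None, attention_mask=None,
            position_ids=None, return_loss=None):
    """Stand-in with the same leading parameter order as the linked forward method."""


def positional_order(fn, wanted):
    """Return the names in `wanted`, sorted by their position in fn's signature."""
    params = [name for name in inspect.signature(fn).parameters if name != "self"]
    return [name for name in params if name in wanted]


ordered_names = positional_order(forward, {"input_ids", "attention_mask", "pixel_values"})
print(ordered_names)
```

With the real model loaded, passing `model.forward` in place of the stand-in should give the same ordering.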

In [33]:
output_names = list(outputs.keys())

### (5.2) Convert PyTorch to ONNX

In [34]:
with torch.no_grad():

    torch.onnx.export(
        # model being run
        model,
        # model input (or a tuple for multiple inputs)
        args=args,
        # where to save the model (can be a file or file-like object)
        f=export_model_path,
        # the ONNX opset version to export the model to
        opset_version=18,
        # whether to execute constant folding for optimization
        do_constant_folding=True,
        # the model's input names, in positional order
        input_names=input_names,
        # the model's output names
        output_names=output_names,
        # the axes whose sizes may vary at inference time
        dynamic_axes=dynamic_axes,
    )

    print("Model exported at ", export_model_path)

  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
  if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):


verbose: False, log level: Level.ERROR

Model exported at  /tmp/clip-vit-base-patch32.onnx


## (6) Inference Using ONNX Runtime

In [35]:
import psutil
import onnxruntime
import numpy

assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()

In [36]:
onnxruntime.get_available_providers()

['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']

In [40]:
sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)
session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['CUDAExecutionProvider'])
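
The session above requests only the CUDA provider. On machines where a GPU may be missing, one option is to build the provider list from what is actually available. The `pick_providers` helper below is a hypothetical sketch in pure Python; its result could be passed as the `providers` argument of `onnxruntime.InferenceSession`.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that are available, in order of preference."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"None of {list(preferred)} found in {list(available)}")
    return chosen


# The list printed by onnxruntime.get_available_providers() above
gpu_machine = ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
# Assumed list for a CPU-only machine
cpu_machine = ['CPUExecutionProvider']

print(pick_providers(gpu_machine))
print(pick_providers(cpu_machine))
```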

In [42]:
ort_inputs = {key: tensor.cpu().numpy() for key, tensor in inputs.items()}

In [47]:
sum_time_spent = 0
for idx in tqdm(range(inference_count)):

    start = time.time()
    ort_logits_per_image_output = session.run(
        output_names=['logits_per_image'], input_feed=ort_inputs)

    end = time.time()
    time_spent = end-start
    sum_time_spent+=time_spent

onnx_avg_time_spent = sum_time_spent/inference_count

100%|██████████| 100/100 [00:00<00:00, 107.22it/s]


In [48]:
print(f"Average Inference Time Using ONNX: {onnx_avg_time_spent*1000} ms")

Average Inference Time Using ONNX: 9.147627353668213 ms
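
A speed comparison is only meaningful if both runtimes return the same answer, so it is worth checking the ONNX outputs against the PyTorch ones. The sketch below compares two flattened lists of logits within a tolerance. The logit values and the tolerances are assumptions chosen for illustration; small numerical differences between runtimes are expected.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-4):
    """Check element-wise closeness of two flat lists of floats."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


# Hypothetical flattened logits_per_image from each runtime
pytorch_logits = [18.9042, 23.7756]
onnx_logits = [18.9041, 23.7758]

print(outputs_match(pytorch_logits, onnx_logits))
print(outputs_match(pytorch_logits, [18.9042, 30.0]))
```

With the objects from this notebook, the two lists could come from `outputs.logits_per_image.cpu().numpy().ravel()` and `ort_logits_per_image_output[0].ravel()`.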


In [53]:
# relative speed-up: (PyTorch time / ONNX time) - 1, so 0 would mean equal speed
faster_time = pytorch_avg_time_spent / onnx_avg_time_spent - 1

In [55]:
print("ONNX is {:.2f} × Faster than Pytorch".format(faster_time))

ONNX is 3.51 × Faster than Pytorch
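
The same two measurements can be summarized in more than one way, and it helps to keep them apart. The sketch below recomputes three common figures from the average times printed above.

```python
pytorch_ms = 41.29441976547241  # average PyTorch time printed above
onnx_ms = 9.147627353668213     # average ONNX time printed above

speedup_ratio = pytorch_ms / onnx_ms          # how many times as fast ONNX is
relative_gain = speedup_ratio - 1             # the quantity printed above
latency_reduction = 1 - onnx_ms / pytorch_ms  # fraction of time saved per call

print(f"ratio: {speedup_ratio:.2f}, gain: {relative_gain:.2f}, "
      f"latency reduction: {latency_reduction:.0%}")
```

The ratio is always exactly one more than the relative gain, so it is worth stating which of the two a reported number refers to.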


## (7) Acceleration Results

<img src='Speed-up.png'></img>