<a href="https://colab.research.google.com/github/Lednik7/CLIP-ONNX/blob/main/examples/RuCLIP_onnx_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Restart the Colab session after installation
Restart the session again if something doesn't work.

In [1]:
%%capture
!pip install git+https://github.com/Lednik7/CLIP-ONNX.git
!pip install ruclip==0.0.1rc7
!pip install onnxruntime-gpu

In [2]:
%%capture
!wget -c -O CLIP.png https://github.com/openai/CLIP/blob/main/CLIP.png?raw=true

In [3]:
import onnxruntime

# default device of the installed onnxruntime build (GPU if available)
print(onnxruntime.get_device())

GPU


## RuCLIP
WARNING: RuCLIP's forward call takes the text first, "model(text, image)", unlike the classic OpenAI CLIP call "model(image, text)".

In [1]:
import warnings

# disable warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
import ruclip

# load the model on CPU: the ONNX export does not work with CUDA
model, processor = ruclip.load("ruclip-vit-base-patch32-384", device="cpu")

In [3]:
from PIL import Image
import numpy as np

# simple input
pil_images = [Image.open("CLIP.png")]
labels = ['диаграмма', 'собака', 'кошка']
dummy_input = processor(text=labels, images=pil_images,
                        return_tensors='pt', padding=True)

# batch first
image = dummy_input["pixel_values"] # torch tensor [1, 3, 384, 384]
image_onnx = dummy_input["pixel_values"].cpu().detach().numpy().astype(np.float32)

# batch first
text = dummy_input["input_ids"] # torch tensor [3, 77]
# note: [::-1] reverses the row order of the batch before the cast to int64
text_onnx = dummy_input["input_ids"].cpu().detach().numpy()[::-1].astype(np.int64)
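
Before handing these arrays to ONNX Runtime, it can help to confirm that they have the dtypes and ranks used above. The check below is a minimal sketch: `check_onnx_inputs` is a hypothetical helper (not part of CLIP-ONNX), and the stand-in arrays only mimic the shapes noted in the comments.

```python
import numpy as np

def check_onnx_inputs(image_onnx, text_onnx):
    """Hypothetical helper: validate arrays before an ONNX Runtime call."""
    # images: float32, [batch, channels, height, width]
    assert image_onnx.dtype == np.float32, image_onnx.dtype
    assert image_onnx.ndim == 4, image_onnx.shape
    # token ids: int64, [num_texts, context_length]
    assert text_onnx.dtype == np.int64, text_onnx.dtype
    assert text_onnx.ndim == 2, text_onnx.shape
    return image_onnx.shape, text_onnx.shape

# stand-in arrays with the shapes noted above
fake_image = np.zeros((1, 3, 384, 384), dtype=np.float32)
fake_text = np.zeros((3, 77), dtype=np.int64)
print(check_onnx_inputs(fake_image, fake_text))
```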

In [4]:
# RuCLIP output
logits_per_image, logits_per_text = model(text, image)
probs = logits_per_image.softmax(dim=-1).detach().cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9885839  0.00894288 0.0024732 ]]

Label probs: [[0.9885839  0.00894288 0.0024732 ]]
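
For reference, CLIP-style models typically turn embeddings into label probabilities by L2-normalizing both sets of features, scaling their cosine similarities, and applying a softmax over the labels. The NumPy sketch below illustrates that computation on toy vectors. `label_probs` and the `logit_scale=100.0` default are assumptions for illustration, not values read from RuCLIP.

```python
import numpy as np

def label_probs(image_features, text_features, logit_scale=100.0):
    """Illustrative CLIP-style scoring; logit_scale is an assumed value."""
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits_per_image = logit_scale * image_features @ text_features.T
    # numerically stable softmax over the label axis
    shifted = logits_per_image - logits_per_image.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# toy embeddings: one image, three labels
toy_image = np.array([[1.0, 0.0]])
toy_text = np.array([[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_probs = label_probs(toy_image, toy_text)
print(toy_probs.shape, toy_probs.argmax(axis=-1))
```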


## [ONNX] CPU inference mode

In [5]:
from clip_onnx import clip_onnx

visual_path = "clip_visual.onnx"
textual_path = "clip_textual.onnx"

onnx_model = clip_onnx(model, visual_path=visual_path, textual_path=textual_path)
onnx_model.convert2onnx(image, text, verbose=True)

[CLIP ONNX] Start convert visual model
[CLIP ONNX] Start check visual model
[CLIP ONNX] Start convert textual model
[CLIP ONNX] Start check textual model
[CLIP ONNX] Models converts successfully


In [6]:
# possible providers: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
onnx_model.start_sessions(providers=["CPUExecutionProvider"]) # cpu mode
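
Instead of hard-coding the provider, the list can be built from what the installed onnxruntime build reports through `onnxruntime.get_available_providers()`. The following is a minimal sketch: `pick_providers` is a hypothetical helper, and the preference order is an assumption.

```python
def pick_providers(available,
                   preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Hypothetical helper: keep preferred providers that are available."""
    chosen = [p for p in preferred if p in available]
    # fall back to CPU if none of the preferred providers is reported
    return chosen or ["CPUExecutionProvider"]

try:
    import onnxruntime
    available = onnxruntime.get_available_providers()
except ImportError:
    # onnxruntime not installed: assume CPU only for this sketch
    available = ["CPUExecutionProvider"]

print(pick_providers(available))
```

The resulting list could then be passed as `providers=` to `onnx_model.start_sessions`.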

In [7]:
image_features = onnx_model.encode_image(image_onnx)
text_features = onnx_model.encode_text(text_onnx)

logits_per_image, logits_per_text = onnx_model(image_onnx, text_onnx)
probs = logits_per_image.softmax(dim=-1).detach().cpu().numpy()

print("Label probs:", probs)  # prints: Label probs: [[0.90831375 0.07174418 0.01994203]]

Label probs: [[0.90831375 0.07174418 0.01994203]]
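
The ONNX probabilities printed here differ from the RuCLIP output printed earlier, so it can be useful to quantify the gap. The sketch below compares the two printed vectors; `max_abs_diff` is a hypothetical helper and the `atol=1e-3` tolerance is an arbitrary choice for illustration.

```python
import numpy as np

def max_abs_diff(a, b):
    """Hypothetical helper: largest element-wise absolute difference."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))

# values copied from the two printed outputs
torch_probs = np.array([[0.9885839, 0.00894288, 0.0024732]])
onnx_probs = np.array([[0.90831375, 0.07174418, 0.01994203]])

print("max abs diff:", max_abs_diff(torch_probs, onnx_probs))
print("same top label:", torch_probs.argmax() == onnx_probs.argmax())
print("allclose (atol=1e-3):", np.allclose(torch_probs, onnx_probs, atol=1e-3))
```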


In [8]:
%timeit onnx_model.encode_text(text_onnx) # text representation

1 loop, best of 5: 246 ms per loop


In [9]:
%timeit onnx_model.encode_image(image_onnx) # image representation

1 loop, best of 5: 352 ms per loop
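
`%timeit` is an IPython magic, so it is only available inside a notebook. In a plain Python script, a similar best-of-N measurement can be sketched with the standard `timeit` module. `best_ms` is a hypothetical helper, and the workload below is a stand-in for the real `onnx_model.encode_text(text_onnx)` call.

```python
import timeit

def best_ms(fn, number=10, repeat=5):
    """Best average time per call, in milliseconds, over `repeat` runs."""
    times = timeit.repeat(fn, number=number, repeat=repeat)
    return min(times) / number * 1000.0

# stand-in workload; in the notebook this would be something like
# lambda: onnx_model.encode_text(text_onnx)
elapsed = best_ms(lambda: sum(range(1000)))
print(f"{elapsed:.3f} ms per call")
```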


## [ONNX] GPU inference mode

In [10]:
onnx_model.start_sessions(providers=["CUDAExecutionProvider"]) # cuda mode

In [11]:
%timeit onnx_model.encode_text(text_onnx) # text representation

The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 5: 6.86 ms per loop


In [12]:
%timeit onnx_model.encode_image(image_onnx) # image representation

The slowest run took 280.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 5: 18.1 ms per loop
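
Taking the best-of-5 timings printed above at face value, the GPU speedup is roughly 36x for text and 19x for images. The arithmetic is sketched below. The numbers come from a single run with high variance (see the `%timeit` warnings), so the ratios are only indicative.

```python
# best-of-5 timings in milliseconds, copied from the outputs above
cpu_ms = {"encode_text": 246.0, "encode_image": 352.0}
gpu_ms = {"encode_text": 6.86, "encode_image": 18.1}

# speedup = CPU time / GPU time for the same call
speedup = {name: cpu_ms[name] / gpu_ms[name] for name in cpu_ms}
for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x faster on GPU")
```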
