<a href="https://colab.research.google.com/github/Manikanta5112/embeddings/blob/main/Faster_Embeddings_with_Optimum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Check the full benchmark report on [Optimum Benchmark x MTEB](https://github.com/huggingface/optimum-benchmark/tree/main/examples/fast-mteb) 📊
CPU benchmarks are coming soon!

<p align="center">
  <img src="https://raw.githubusercontent.com/huggingface/optimum-benchmark/main/examples/fast-mteb/artifacts/forward_latency_plot.png" alt="Latency" width="45%"/>
  <img src="https://raw.githubusercontent.com/huggingface/optimum-benchmark/main/examples/fast-mteb/artifacts/forward_throughput_plot.png" alt="Throughput" width="45%"/>
</p>

This is why your RAG system is slow 🐌 and unscalable 😱!

All the RAG implementations I have seen so far embed text with vanilla PyTorch (via Sentence-Transformers) as the backend 🤦.

What are the consequences ❓
- 📄 With documents, building your vector database/index costs more compute than it should 💸!
- 🌐 With web search, your app is slower and limited in how many search results it can process 🔎!

How do we solve this ❓
- More throughput -> Better scalability!
- Less latency -> Better user experience!

Concretely, we use Optimum, which provides direct support for ONNX export and for inference with ONNX Runtime's advanced graph optimizations (no quality degradation 😎).
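
If you want to check both axes on your own setup, a small timing loop is enough. The sketch below is not the benchmark harness used for the report; it assumes that latency means the average wall-clock time of one call on a batch and that throughput means samples processed per second. The names `benchmark` and `fake_embed` are hypothetical, and `fake_embed` is only a stand-in that sleeps for a millisecond so the example runs without a model or a GPU.

```python
import time

def benchmark(fn, batch, n_warmup=3, n_runs=20):
    """Return (latency in ms per call, throughput in samples per second)."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(n_warmup):
        fn(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(batch)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / n_runs
    throughput = len(batch) * n_runs / elapsed
    return latency_ms, throughput

def fake_embed(batch):
    # Hypothetical stand-in for a real embedding call.
    time.sleep(0.001)
    return [[0.0] * 4 for _ in batch]

latency_ms, throughput = benchmark(fake_embed, ["a", "b", "c", "d"])
print(f"{latency_ms:.2f} ms per batch, {throughput:.0f} samples/s")
```

To time a real model, pass a function that tokenizes and embeds a batch of sentences instead of `fake_embed`, and make sure it only returns once the results are actually available.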

Let's take the Beijing Academy of Artificial Intelligence (BAAI)'s bge-base-en-v1.5, the number one base embedding model on the MTEB Leaderboard 🏆, for a ride on Optimum 🏎️.

With one CLI command I got 1 millisecond of latency and a throughput of 2000 samples per second. Compared to vanilla PyTorch, this is a 7x speedup on both axes 🤯!
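
To put those numbers in perspective, here is a back-of-the-envelope sketch. It uses only the figures quoted above (1 ms, 2000 samples/s, 7x); the corpus size of one million chunks is a made-up illustration, the helper name `embedding_time_seconds` is hypothetical, and it assumes throughput stays constant over the whole job.

```python
# Figures quoted in the text above.
optimized_latency_ms = 1.0
optimized_throughput = 2000.0  # samples per second
speedup = 7.0

# Vanilla PyTorch baseline implied by the 7x figure.
baseline_latency_ms = optimized_latency_ms * speedup
baseline_throughput = optimized_throughput / speedup

def embedding_time_seconds(num_samples, throughput):
    """Time to embed num_samples at a constant throughput (samples/s)."""
    return num_samples / throughput

corpus_size = 1_000_000  # hypothetical number of text chunks to embed
optimized_s = embedding_time_seconds(corpus_size, optimized_throughput)
baseline_s = embedding_time_seconds(corpus_size, baseline_throughput)

print(f"Implied baseline: {baseline_latency_ms:.0f} ms, {baseline_throughput:.0f} samples/s")
print(f"Optimized: {optimized_s / 60:.1f} min, baseline: {baseline_s / 60:.1f} min")
```

Under these assumptions, indexing the hypothetical corpus takes a bit over 8 minutes instead of close to an hour.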

📒 Notebook demonstrating how to use Optimum for faster sentence embeddings: https://lnkd.in/emt-2My5

📊 Full benchmark configurations and report for reproduction: https://github.com/huggingface/optimum-benchmark/tree/main/examples/fast-mteb

In [None]:
#@title We'll be using Optimum's OnnxRuntime support with `CUDAExecutionProvider` [because it's fast while also supporting dynamic shapes](https://github.com/huggingface/optimum-benchmark/tree/main/examples/fast-mteb#notes)

!pip install "optimum[onnxruntime-gpu]"

In [None]:
#@title [`optimum-cli`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization#optimizing-a-model-during-the-onnx-export) makes it extremely easy to export a model to ONNX and apply SOTA graph optimizations/fusions

!optimum-cli export onnx \
  --model BAAI/bge-base-en-v1.5 \
  --task feature-extraction \
  --optimize O4 \
  --device cuda \
  bge_auto_opt_O4 # output folder

In [None]:
#@title Based on the example given in [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5#using-huggingface-transformers)

import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

# Load the tokenizer and the optimized ONNX model from the local export folder
tokenizer = AutoTokenizer.from_pretrained('/content/bge_auto_opt_O4')
ort_model = ORTModelForFeatureExtraction.from_pretrained('/content/bge_auto_opt_O4', provider="CUDAExecutionProvider")

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Tokenize sentences and move the inputs to the GPU, where the model runs
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to('cuda')
# For s2p (short query to long passage) retrieval, prepend an instruction to each query (passages need no instruction)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt').to('cuda')

# Compute token embeddings
with torch.no_grad():
    model_output = ort_model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)

Sentence embeddings:
tensor([[ 0.0251,  0.0052,  0.0221,  ...,  0.0092, -0.0090, -0.0150],
        [-0.0125,  0.0129,  0.0137,  ...,  0.0215,  0.0258,  0.0107]])
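
The pooling and normalization steps in the cell above can be spelled out with plain Python lists. This is only an illustration on a made-up batch of two sentences, three tokens and four hidden dimensions; the helper names `cls_pool`, `l2_normalize` and `cosine_similarity` are hypothetical, and the real code keeps everything in torch tensors.

```python
import math

def cls_pool(token_embeddings):
    """Keep the first ([CLS]) token vector of each sentence, like model_output[0][:, 0]."""
    return [sentence[0] for sentence in token_embeddings]

def l2_normalize(vectors, eps=1e-12):
    """Scale each vector to unit length, like torch.nn.functional.normalize(p=2, dim=1)."""
    normalized = []
    for vec in vectors:
        norm = max(math.sqrt(sum(x * x for x in vec)), eps)
        normalized.append([x / norm for x in vec])
    return normalized

def cosine_similarity(a, b):
    """For unit-length vectors, the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Toy "last hidden state" with shape [batch=2][tokens=3][hidden=4] (made-up values).
token_embeddings = [
    [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]],
    [[0.0, 4.0, 3.0, 0.0], [2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0]],
]

embeddings = l2_normalize(cls_pool(token_embeddings))
score = cosine_similarity(embeddings[0], embeddings[1])
print(embeddings)
print(round(score, 2))
```

Because the embeddings are normalized, ranking documents against a query only needs dot products, which is what most vector indexes compute.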
