# Binary Quantization in Sentence Transformers
Quantizing an embedding with a dimensionality of 1024 to binary yields 1024 bits. Because bits are almost always stored as bytes in practice, the bits are packed into bytes using np.packbits when quantizing to binary embeddings.

As a result, quantizing a float32 embedding with a dimensionality of 1024 yields an int8 or uint8 embedding with a dimensionality of 128. Below are two ways to produce quantized embeddings with Sentence Transformers:

In [1]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
import warnings
warnings.filterwarnings("ignore")


# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [2]:
# 2a. Encode some text using "binary" quantization
binary_embeddings = model.encode(
    ["I am driving to the lake.", "It is a beautiful day."],
    precision="binary",
)
binary_embeddings

array([[ -25,  114, -112,  -20,   62,  -32, -101,  -14,   -7,  127,  -79,
          18,   -1,  -48,   63,    1,  -17,  -81,  -11,    9,  -41,  -89,
         119,  -77,   80,   19,  123,   74, -116,  -36,  124,  -53,   79,
         -56, -113,   24,   28,    9,  117,  -38,   84,   69, -111,  -15,
         -13,   27,   36,    9,  -58, -116,   91,    5,   68,   14,   26,
          96,  -83,   99,  -97,   46, -109,  114,   -3, -102,   51,  126,
         -96,  -36,  -54,   25,  -60,    8, -112, -111,   46,   60,  126,
         -74,  -63,  -27, -117, -101, -104, -128,  -65,   71,   13,  -19,
         -20,  -79,   20,   -6,   13, -128,  -17,  121,   -2,   48,    7,
          93,  -16,   50,  126,  106,   91,  -57,   31,  -29,    2,   79,
         -76, -111,  -51,   21,   38,   60,  -46,   21,   55,   83, -120,
          82,   86,    5,   57,  -37,   91,   58],
       [ -61,   60,  -44,  108,   58,   32,   81,  -42,   89,  -43, -109,
          59,   -1,  -55,  124,  -31,  -28,  -55,  116,   81,

In [3]:
print(binary_embeddings.shape)
print(binary_embeddings.nbytes)
print(binary_embeddings.dtype)

(2, 128)
256
int8


In [4]:
# 2b. or, encode some text without quantization & apply quantization afterwards
embeddings = model.encode(["I am driving to the lake.", 
                           "It is a beautiful day."])
print(embeddings)


binary_embeddings = quantize_embeddings(embeddings, precision="binary")
print(binary_embeddings)

[[-0.25717044  0.35520184  0.00353763 ... -0.6651358   0.17863286
  -0.17328042]
 [-0.7636851   0.5963278  -0.0395734  ... -0.06096186 -0.0360474
   0.15181805]]
[[ -25  114 -112  -20   62  -32 -101  -14   -7  127  -79   18   -1  -48
    63    1  -17  -81  -11    9  -41  -89  119  -77   80   19  123   74
  -116  -36  124  -53   79  -56 -113   24   28    9  117  -38   84   69
  -111  -15  -13   27   36    9  -58 -116   91    5   68   14   26   96
   -83   99  -97   46 -109  114   -3 -102   51  126  -96  -36  -54   25
   -60    8 -112 -111   46   60  126  -74  -63  -27 -117 -101 -104 -128
   -65   71   13  -19  -20  -79   20   -6   13 -128  -17  121   -2   48
     7   93  -16   50  126  106   91  -57   31  -29    2   79  -76 -111
   -51   21   38   60  -46   21   55   83 -120   82   86    5   57  -37
    91   58]
 [ -61   60  -44  108   58   32   81  -42   89  -43 -109   59   -1  -55
   124  -31  -28  -55  116   81  -40  -77  123 -126   80  -37  106   79
     9  126  102  109  102   81 -

Here you can see the differences between default float32 embeddings and binary embeddings in terms of shape, size, and numpy dtype:

In [5]:
print(embeddings.shape)
print(embeddings.nbytes)
print(embeddings.dtype)
print("---------------------")
print(binary_embeddings.shape)
print(binary_embeddings.nbytes)
print(binary_embeddings.dtype)

(2, 1024)
8192
float32
---------------------
(2, 128)
256
int8
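
The shape and dtype above follow from the packing step. As a rough NumPy-only sketch (not the library's own code), assume that each value is thresholded at 0 and that the signed variant shifts the packed bytes by 128. The helper name `binarize_and_pack` and the random input are hypothetical and only serve the illustration:

```python
import numpy as np


def binarize_and_pack(embeddings: np.ndarray, signed: bool = True) -> np.ndarray:
    """Threshold each value at 0 (assumption), then pack every 8 bits into one byte."""
    bits = embeddings > 0                    # boolean array of shape (n, dim)
    packed = np.packbits(bits, axis=-1)      # uint8 array of shape (n, dim // 8)
    if signed:
        # Assumed convention: shift 0..255 down to -128..127 so the result fits int8
        packed = (packed.astype(np.int16) - 128).astype(np.int8)
    return packed


# Hypothetical stand-in for model output: two random 1024-dimensional vectors
rng = np.random.default_rng(0)
float_embeddings = rng.standard_normal((2, 1024)).astype(np.float32)

packed = binarize_and_pack(float_embeddings)
print(packed.shape, packed.dtype, packed.nbytes)
```

With 1024 dimensions this prints a shape of (2, 128), the int8 dtype, and 256 bytes, matching the output above.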


# Scalar (int8) Quantization
To convert float32 embeddings into int8, we use a process called scalar quantization. It maps the continuous range of float32 values onto the discrete set of int8 values, which can represent 256 distinct levels (from -128 to 127). The mapping is derived from a large calibration dataset of embeddings: we compute the range of these embeddings, i.e. the min and max of each embedding dimension, and from that range we calculate the steps (buckets) into which each value is categorized.

To further boost retrieval performance, you can optionally apply the same rescoring step as for the binary embeddings. Note that the calibration dataset strongly influences performance, since it defines the buckets.
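
The bucket computation described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `scalar_quantize_int8`, the clipping of out-of-range values, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np


def scalar_quantize_int8(embeddings, calibration_embeddings):
    """Map float values to int8 using per-dimension min/max from a calibration set."""
    mins = calibration_embeddings.min(axis=0)
    maxs = calibration_embeddings.max(axis=0)
    # Avoid dividing by zero for dimensions that are constant in the calibration set
    widths = np.where(maxs > mins, maxs - mins, 1.0)
    # Position of each value within its dimension's range, scaled to 0..255
    scaled = (embeddings - mins) / widths * 255.0
    # Assumption: values outside the calibration range go to the first or last bucket
    buckets = np.clip(np.floor(scaled), 0, 255)
    # Shift 0..255 to the int8 range -128..127
    return (buckets - 128).astype(np.int8)


# Synthetic stand-ins for a calibration set and for new embeddings
rng = np.random.default_rng(0)
calibration = rng.standard_normal((1000, 1024)).astype(np.float32)
new_embeddings = rng.standard_normal((2, 1024)).astype(np.float32)

int8_sketch = scalar_quantize_int8(new_embeddings, calibration)
print(int8_sketch.shape, int8_sketch.dtype)
```

Unlike binary quantization, the dimensionality is unchanged here; only the storage per value shrinks from 4 bytes to 1.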



# Scalar Quantization in Sentence Transformers
Quantizing an embedding with a dimensionality of 1024 to int8 results in 1024 bytes. You can choose either uint8 or int8; the choice usually depends on what your vector library or database supports.

In practice, it is recommended to provide the scalar quantization with either:

- a large set of embeddings to quantize all at once, or
- min and max ranges for each of the embedding dimensions, or
- a large calibration dataset of embeddings from which the min and max ranges can be computed.

If none of these is provided, you will get a warning like this:

Computing int8 quantization buckets based on 2 embeddings. int8 quantization is more stable with 'ranges' calculated from more embeddings or a 'calibration_embeddings' that can be used to calculate the buckets.
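
The warning reflects how the ranges are derived. As a hedged illustration in plain NumPy (not the library's implementation; the helper `int8_from_ranges` and the synthetic data are hypothetical), taking the min and max from only the two embeddings being quantized pushes every value to one end of the int8 range, whereas ranges from a larger calibration set spread the values across many buckets:

```python
import numpy as np


def int8_from_ranges(embeddings, mins, maxs):
    """Bucket values into int8 given per-dimension min and max ranges."""
    widths = np.where(maxs > mins, maxs - mins, 1.0)
    buckets = np.clip(np.floor((embeddings - mins) / widths * 255.0), 0, 255)
    return (buckets - 128).astype(np.int8)


rng = np.random.default_rng(0)
two = rng.standard_normal((2, 8))          # the only embeddings available
calib = rng.standard_normal((1000, 8))     # a larger synthetic calibration set

# Ranges computed from just the two embeddings: in each dimension one value
# is the min and the other is the max, so every entry is -128 or 127.
tiny = int8_from_ranges(two, two.min(axis=0), two.max(axis=0))
print(np.unique(tiny))

# Ranges computed from the calibration set: values land in many different buckets.
stable = int8_from_ranges(two, calib.min(axis=0), calib.max(axis=0))
print(len(np.unique(stable)))
```

In the first case the int8 embeddings carry barely more information than a single bit per dimension, which is why more embeddings, explicit ranges, or a calibration set are recommended.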

In [6]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
from datasets import load_dataset

# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [7]:
# 2. Prepare an example calibration dataset
corpus = load_dataset("nq_open", split="train[:1000]")["question"]
print(corpus)

calibration_embeddings = model.encode(corpus)
print(calibration_embeddings)


['where did they film hot tub time machine', 'who has the right of way in international waters', 'who does annie work for attack on titan', 'when was the immigration reform and control act passed', 'when was puerto rico added to the usa', 'who has been chosen for best supporting actress in 64 national filmfare award', 'which side of the white house is the front', 'names of the metropolitan municipalities in south africa', "who's hosting the super bowl in 2019", 'in which year vivo launch its first phone in india', 'where does it talk about mary magdalene in the bible', 'who carries the nuclear football for the president', 'what is the origin of the name cynthia', 'who is the guy who voiced disney channel', "what's the legal marriage age in new york", 'when do the red hot chili peppers tour', 'who plays mavis in the movie hotel transylvania', 'what is the channel number for cartoon network on spectrum', 'when are the fa cup semi finals played', 'when did the ipod touch 6 gen came out', 

In [8]:
# 3. Encode some text without quantization & apply quantization afterwards
embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
print(embeddings)

# Choose a target precision for the corpus embeddings
# Valid options are: "float32", "uint8", "int8", "ubinary", and "binary"
int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    calibration_embeddings=calibration_embeddings,
)
print(int8_embeddings)

[[-0.25717044  0.35520184  0.00353763 ... -0.6651358   0.17863286
  -0.17328042]
 [-0.7636851   0.5963278  -0.0395734  ... -0.06096186 -0.0360474
   0.15181805]]
[[-26  10   4 ...  -7  32  -4]
 [-69  31   0 ...  39  11  26]]


In [9]:
print(embeddings.shape)
print(embeddings.nbytes)
print(embeddings.dtype)
print("---------------------")
print(int8_embeddings.shape)
print(int8_embeddings.nbytes)
print(int8_embeddings.dtype)

(2, 1024)
8192
float32
---------------------
(2, 1024)
2048
int8
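
The byte counts printed in this notebook follow directly from the dimensionality and the precision. As a small arithmetic sketch (the helper `embedding_nbytes` is hypothetical, and the binary case assumes the dimensionality is divisible by 8):

```python
# Storage cost per embedding dimension, in bytes, for each precision
BYTES_PER_DIMENSION = {
    "float32": 4.0,
    "int8": 1.0,
    "uint8": 1.0,
    "binary": 1 / 8,   # 8 dimensions packed into one byte
    "ubinary": 1 / 8,
}


def embedding_nbytes(num_embeddings: int, dim: int, precision: str) -> int:
    """Total bytes needed to store the embeddings at the given precision."""
    return int(num_embeddings * dim * BYTES_PER_DIMENSION[precision])


for precision in ("float32", "int8", "binary"):
    print(precision, embedding_nbytes(2, 1024, precision))
```

For two 1024-dimensional embeddings this gives 8192, 2048, and 256 bytes, the same values reported by nbytes in the cells above.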
