# Binary Quantization of OpenAI Embeddings: A Leap in Retrieval Efficiency

In large-scale data retrieval and processing, efficiency is crucial. As data volumes grow, the ability to retrieve information quickly and accurately has a direct effect on system performance. This blog post explores a technique known as binary quantization applied to OpenAI embeddings, and shows how it can reduce retrieval latency by a factor of 20 or more.

## What Are OpenAI Embeddings?
OpenAI embeddings are numerical representations of textual information. They transform text into a vector space where semantically similar texts are mapped close together. This mathematical representation enables computers to understand and process human language more effectively.
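
To make "mapped close together" concrete, here is a minimal sketch using cosine similarity. The three vectors and the `cosine_similarity` helper are made up for illustration and are not real OpenAI output; real embeddings in this post have 1536 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors standing in for real embeddings.
cat = np.array([0.8, 0.1, 0.5, 0.1])
kitten = np.array([0.7, 0.2, 0.6, 0.1])
invoice = np.array([0.1, 0.9, 0.0, 0.4])

# The two related texts score higher than the unrelated pair.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```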

## Binary Quantization
Binary quantization is a method which converts continuous numerical values into binary values (0 or 1). It simplifies the data structure, allowing faster computations. Here's a brief overview of the binary quantization process applied to OpenAI embeddings:

1. Load Embeddings: OpenAI embeddings are loaded from parquet files.
2. Binary Transformation: The continuous-valued vectors are converted into binary form. Values greater than 0 are set to 1, and all others are set to 0.
3. Comparison & Retrieval: Binary vectors are compared with logical XOR operations, which mark the positions where two vectors differ; fewer differing positions means a closer match.
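
The transformation and comparison steps can be sketched on a toy example. The 8-dimensional vectors below are invented for illustration; the thresholding and XOR logic mirror what the notebook does later on the full dataset.

```python
import numpy as np

# Toy 8-dimensional "embeddings" (made-up values, not real OpenAI output).
vectors = np.array([
    [ 0.12, -0.40,  0.33,  0.05, -0.21,  0.60, -0.02,  0.18],
    [ 0.10, -0.35,  0.30, -0.01, -0.25,  0.55, -0.07,  0.20],
    [-0.50,  0.22, -0.10,  0.40,  0.31, -0.45,  0.09, -0.30],
])

# Step 2: values greater than 0 become 1, everything else becomes 0.
binary = (vectors > 0).astype(np.int8)

# Step 3: XOR marks the positions where two binary vectors differ;
# the number of differing positions is the Hamming distance.
query = binary[0]
hamming = np.logical_xor(binary, query).sum(axis=1)

print(binary[0])  # [1 0 1 1 0 1 0 1]
print(hamming)    # [0 1 7]
```

The second vector, which is numerically close to the first, differs in only one bit, while the third differs in seven.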

## Setup: Install Dependencies, Imports & Download Embeddings

In [None]:
!pip install matplotlib tqdm pandas numpy --quiet

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

#### Download Embeddings

### Code Walkthrough
Here's an explanation of the code structure provided:

1. Loading Data: OpenAI embeddings are loaded from parquet files (up to 1M embeddings) and concatenated into one array.
2. Binary Conversion: A new array with the same shape is initialized with zeros, and the positions where the original vectors are positive are set to 1.
3. Accuracy Function: The accuracy function compares original vectors with binary vectors for a given index, limit, and oversampling rate. It ranks neighbors by dot product on the original vectors and by XOR-based bit agreement on the binary vectors, then measures how much the two result sets overlap.
4. Testing: The accuracy is tested for different oversampling rates (1, 2, 4, 8, 16) and limits (10, 100). In the run shown below, an oversampling rate of 4 gives an accuracy of 0.97 at a limit of 10.


### 💿 Loading Data

In [10]:
def get_openai_vectors(force_download: bool = False):
    res = []
    for i in tqdm(range(26)):
        if force_download:
            !wget https://huggingface.co/api/datasets/KShivendu/dbpedia-entities-openai-1M/parquet/KShivendu--dbpedia-entities-openai-1M/train/{i}.parquet
        df = pd.read_parquet(f"{i}.parquet", engine="pyarrow")
        res.append(np.stack(df.openai))
        del df

    openai_vectors = np.concatenate(res)
    del res
    return openai_vectors


openai_vectors = get_openai_vectors(force_download=False)
openai_vectors.shape

100%|██████████| 26/26 [00:12<00:00,  2.02it/s]


(1000000, 1536)

## Binary Conversion

Here, we will use 0 as the threshold for the binary conversion. All values greater than 0 will be set to 1, and others will remain 0. This is a simple and effective way to convert continuous values into binary values for OpenAI embeddings.

In [14]:
openai_bin = np.zeros_like(openai_vectors, dtype=np.int8)
openai_bin[openai_vectors > 0] = 1
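
The `int8` array above still spends one byte per dimension. As a sketch of how the storage could shrink further, the bits can be packed eight to a byte with `np.packbits`. This step is not part of the original notebook; the random stand-in data and the `popcount` lookup table are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 1,000 random vectors with the same width as the embeddings above.
vectors = rng.standard_normal((1000, 1536)).astype(np.float32)

bits = vectors > 0                    # boolean array, one byte per value
packed = np.packbits(bits, axis=1)    # uint8 array, eight values per byte

print(vectors.nbytes // len(vectors))  # bytes per float32 vector: 6144
print(packed.nbytes // len(packed))    # bytes per packed vector: 192

# Hamming distance on packed bytes: XOR, then count the set bits of each
# byte through a 256-entry lookup table.
popcount = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint8)
query = packed[0]
hamming = popcount[np.bitwise_xor(packed, query)].sum(axis=1)
```

Under these assumptions a packed vector takes 192 bytes instead of 6144 for `float32`, a 32x reduction, and the distances match the unpacked XOR computation exactly.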

## Accuracy Function

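The accuracy function measures how much of the exact top-`limit` result set, ranked by dot product on the original vectors, is recovered by the binary search. The binary search ranks vectors by the number of matching bits (computed with logical XOR) and returns `limit * oversampling` candidates. The score is the size of the intersection of the two result sets divided by `limit`.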

In [15]:
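def accuracy(idx, limit: int, oversampling: int):
    # Ground truth: top-`limit` neighbors by exact dot product on the original vectors.
    scores = np.dot(openai_vectors, openai_vectors[idx])
    dot_results = np.argsort(scores)[-limit:][::-1]

    # Binary score: number of matching bits (1536 dimensions minus the XOR mismatches).
    bin_scores = 1536 - np.logical_xor(openai_bin, openai_bin[idx]).sum(axis=1)
    # Keep `limit * oversampling` binary candidates, best first.
    bin_results = np.argsort(bin_scores)[-(limit * oversampling) :][::-1]

    # Fraction of the exact top-`limit` that appears among the binary candidates.
    return len(set(dot_results).intersection(set(bin_results))) / limit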
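
The accuracy function only checks whether the exact neighbors appear among the oversampled candidates. In a retrieval setting, one plausible way to use those candidates is to rescore them with the original vectors and keep the best `limit`. The sketch below illustrates that idea; `search_with_rescoring` is a hypothetical helper, not part of the original notebook, and it runs on random stand-in data.

```python
import numpy as np

def search_with_rescoring(vectors, binary, query_idx, limit, oversampling):
    """Hypothetical helper: binary candidate search followed by exact rescoring."""
    # Stage 1: cheap pass. Fewer differing bits means a closer candidate.
    hamming = np.logical_xor(binary, binary[query_idx]).sum(axis=1)
    n_candidates = min(limit * oversampling, len(vectors))
    candidates = np.argsort(hamming, kind="stable")[:n_candidates]

    # Stage 2: exact dot product, computed for the shortlist only.
    exact = vectors[candidates] @ vectors[query_idx]
    order = np.argsort(exact)[::-1][:limit]
    return candidates[order]

rng = np.random.default_rng(0)
# Random unit vectors as stand-in data (not real embeddings).
vectors = rng.standard_normal((5000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
binary = (vectors > 0).astype(np.int8)

top = search_with_rescoring(vectors, binary, query_idx=0, limit=10, oversampling=4)
print(top)
```

The expensive dot product is computed for only `limit * oversampling` vectors instead of the whole collection, which is where oversampling trades a little extra work for higher accuracy.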

## 📊 Results

In [21]:
number_of_samples = 10
limits = [10, 100]
sampling_rate = [1, 2, 4, 8, 16]
results = []

def mean_accuracy(number_of_samples, limit, sampling_rate):
    return np.mean([accuracy(i, limit=limit, oversampling=sampling_rate) for i in range(number_of_samples)])

for i in tqdm(sampling_rate):
    for j in tqdm(limits):
        result = {"sampling_rate": i, "limit": j, "accuracy": mean_accuracy(number_of_samples, j, i)}
        print(result)
        results.append(result)

{'sampling_rate': 1, 'limit': 10, 'accuracy': 0.8}
{'sampling_rate': 1, 'limit': 100, 'accuracy': 0.708}
{'sampling_rate': 2, 'limit': 10, 'accuracy': 0.95}
{'sampling_rate': 2, 'limit': 100, 'accuracy': 0.877}
{'sampling_rate': 4, 'limit': 10, 'accuracy': 0.97}
{'sampling_rate': 4, 'limit': 100, 'accuracy': 0.9560000000000001}
{'sampling_rate': 8, 'limit': 10, 'accuracy': 0.99}
{'sampling_rate': 8, 'limit': 100, 'accuracy': 0.99}
{'sampling_rate': 16, 'limit': 10, 'accuracy': 1.0}
{'sampling_rate': 16, 'limit': 100, 'accuracy': 0.998}

100%|██████████| 5/5 [03:41<00:00, 44.37s/it]

In [23]:
results = pd.DataFrame(results)
results

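|   | sampling_rate | limit | accuracy |
|---|---------------|-------|----------|
| 0 | 1             | 10    | 0.8      |
| 1 | 1             | 100   | 0.708    |
| 2 | 2             | 10    | 0.95     |
| 3 | 2             | 100   | 0.877    |
| 4 | 4             | 10    | 0.97     |
| 5 | 4             | 100   | 0.956    |
| 6 | 8             | 10    | 0.99     |
| 7 | 8             | 100   | 0.99     |
| 8 | 16            | 10    | 1.0      |
| 9 | 16            | 100   | 0.998    |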


![image.png](./Accuracy_vs_SamplingRate.png)
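
The notebook does not show how the figure was produced. A chart of accuracy against sampling rate could be drawn along these lines; the numbers are copied from the results above, while the log-scaled axis, labels, and styling are assumptions.

```python
import matplotlib.pyplot as plt

# Values copied from the results above.
sampling_rates = [1, 2, 4, 8, 16]
accuracy_by_limit = {
    10: [0.8, 0.95, 0.97, 0.99, 1.0],
    100: [0.708, 0.877, 0.956, 0.99, 0.998],
}

fig, ax = plt.subplots()
for limit, acc in accuracy_by_limit.items():
    ax.plot(sampling_rates, acc, marker="o", label=f"limit={limit}")

# Sampling rates double at each step, so a base-2 log axis spaces them evenly.
ax.set_xscale("log", base=2)
ax.set_xticks(sampling_rates)
ax.set_xticklabels([str(r) for r in sampling_rates])
ax.set_xlabel("Oversampling rate")
ax.set_ylabel("Accuracy")
ax.legend()
```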

## Observations
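As the sampling rate increases, the accuracy improves for both limits. The larger limit (100) scores at or below the smaller limit (10) at every sampling rate, but the gap narrows as oversampling grows: 0.708 versus 0.8 at a sampling rate of 1, and 0.998 versus 1.0 at a sampling rate of 16. The accuracy reaches 1.0 for a sampling rate of 16 and a limit of 10.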

## Conclusion
Binary quantization of OpenAI embeddings offers a highly efficient way to reduce latency in data retrieval. With an accuracy of about 97% at an oversampling rate of 4 and a limit of 10, rising to 99% or more at higher oversampling rates, and a 20x or more reduction in retrieval latency, this technique is a promising avenue for systems dealing with large-scale data.