# Overview

This notebook demonstrates the high retrieval quality achievable with small (128-byte) embedding vectors from the `snowflake-arctic-embed-m-v1.5` model.

We begin by loading the full 768-dimensional embeddings in float32 precision (these precomputed embeddings are available as a Hugging Face dataset). We then demonstrate proper truncation with renormalization to unit norm, followed by uniform scalar quantization to the int4 datatype. The final result is a reproduction of our 53.7 MTEB Retrieval score at 128 bytes per vector (256 dimensions at 4 bits each).
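As a rough sketch of the truncation step and the byte budget, assuming plain NumPy and random stand-in vectors rather than the real embeddings (the helper name `truncate_and_renormalize` is hypothetical, and 256 is the truncation dimension configured later in this notebook):

```python
import numpy as np


def truncate_and_renormalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` columns, then rescale each row to unit L2 norm."""
    truncated = emb[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms


rng = np.random.default_rng(0)
full = rng.standard_normal((4, 768)).astype(np.float32)  # stand-in for real embeddings
small = truncate_and_renormalize(full, 256)

# 256 dimensions at 4 bits each is 256 * 4 / 8 bytes per vector.
bytes_per_vector = small.shape[1] * 4 // 8
print(small.shape, bytes_per_vector)
```

Renormalizing after truncation matters because the dot product is only a cosine similarity when both vectors have unit norm.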

## Int4 Quantization

Given a floating point number $f$ and a symmetric clipping limit $b$ (e.g. clipping to the range $-0.18$ to $0.18$), uniform scalar quantization to $k$ levels (where $k = 16$ for 4-bit quantization) entails a spacing between levels of $c = 2b / (k - 1)$. If we call the integer result of this quantization $i$, and clip $f$ to $[-b, b]$ first, we have the following mappings between float and integer:

$$i = \text{round}\left(\frac{f + b}{c}\right)$$
$$f \approx c \cdot i - b$$
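The two mappings can be sketched in NumPy as follows. This is an illustrative version with hypothetical helper names (`quantize`, `dequantize`) and made-up input values, not the compiled implementation used later in the notebook:

```python
import numpy as np


def quantize(f: np.ndarray, b: float, k: int = 16) -> np.ndarray:
    """Clip to [-b, b], then map floats to integer levels 0..k-1."""
    c = 2 * b / (k - 1)
    return np.round((np.clip(f, -b, b) + b) / c).astype(np.uint8)


def dequantize(i: np.ndarray, b: float, k: int = 16) -> np.ndarray:
    """Approximate inverse mapping: f ~ c * i - b."""
    c = 2 * b / (k - 1)
    return c * i.astype(np.float32) - b


b = 0.18
f = np.array([-0.25, -0.18, 0.01, 0.05, 0.18, 0.30], dtype=np.float32)
i = quantize(f, b)
f_hat = dequantize(i, b)
print(i)
print(f_hat)
```

Values outside the clipping range saturate at level 0 or 15, and every in-range value is reconstructed to within half a level spacing, $c / 2$.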

Furthermore, we can perform floating point multiplication via integers for a faster implementation:

$$f_1 \cdot f_2 \approx (c i_1 - b)(c i_2 - b)$$
$$f_1 \cdot f_2 \approx c^2 i_1 i_2 - c b (i_1 + i_2) + b^2$$

When $c$ and $b$ are constant across all $f$'s, we can perform a vector dot product even more efficiently by factoring out the constants further (below, $k$ indexes the $K$ vector dimensions rather than counting quantization levels):

$$\langle \pmb{f}_1, \pmb{f}_2 \rangle = \sum_{k=1}^{K}f_1^{(k)} \cdot f_2^{(k)} \approx \sum_{k} \left( c^2 i_1^{(k)} i_2^{(k)} - c b (i_1^{(k)} + i_2^{(k)}) + b^2 \right)$$
$$\langle \pmb{f}_1, \pmb{f}_2 \rangle \approx c^2 \cdot \sum_{k} \left(  i_1^{(k)} i_2^{(k)} \right) - c b \cdot \sum_{k} \left( i_1^{(k)} + i_2^{(k)} \right) + \sum_{k} \left(b^2 \right)$$
$$\langle \pmb{f}_1, \pmb{f}_2 \rangle \approx c^2 \cdot \sum_{k} \left(  i_1^{(k)} i_2^{(k)} \right)  - c b \left( \sum_{k} \left( i_1^{(k)} \right) + \sum_{k} \left( i_2^{(k)} \right) \right) + K \cdot b^2 $$
$$\langle \pmb{f}_1, \pmb{f}_2 \rangle \approx c^2 \cdot \langle \pmb{i}_1, \pmb{i}_2 \rangle  - c b \left( \text{sum}(\pmb{i}_1) + \text{sum}(\pmb{i}_2) \right) + K \cdot b^2 $$
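A quick numerical check of the final identity, using random integer codes as stand-ins for quantized vectors (the variable names here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
K, b, levels = 256, 0.18, 16
c = 2 * b / (levels - 1)

# Random integer codes standing in for two quantized vectors.
i1 = rng.integers(0, levels, size=K)
i2 = rng.integers(0, levels, size=K)

# Left-hand side: dequantize first, then take a float dot product.
f1 = c * i1 - b
f2 = c * i2 - b
lhs = float(f1 @ f2)

# Right-hand side: integer dot product and sums, with the constants applied once.
rhs = c**2 * int(i1 @ i2) - c * b * (int(i1.sum()) + int(i2.sum())) + K * b**2

print(lhs, rhs)
```

The two sides agree up to floating point rounding: the identity is exact for the dequantized values, and the only approximation lies in the quantization step itself.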


## Integer quantization in practice

Using modern CPU hardware and [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) programming, it is possible to perform these integer multiplications very quickly, with well-executed implementations often bottlenecked by memory bandwidth (i.e. performing vector similarity calculations as fast as the vectors can be passed from RAM to CPU, which is very fast indeed). Since int4 quantization reduces the memory footprint of embedding vectors, it has the potential not just to reduce RAM requirements for storing large collections of vectors, but also to accelerate vector similarity calculations.
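To put the memory footprint in concrete terms, here is a back-of-the-envelope calculation. The document count is the MSMARCO row count reported in this notebook's loading output; the layout names are just labels for this sketch:

```python
# Bytes per vector under three storage layouts.
full_fp32 = 768 * 4             # 768 float32 values
truncated_fp32 = 256 * 4        # 256 float32 values
truncated_int4 = 256 * 4 // 8   # 256 values at 4 bits each

num_docs = 8_841_823  # MSMARCO document rows, as reported when loading embeddings

layouts = [
    ("768-dim float32", full_fp32),
    ("256-dim float32", truncated_fp32),
    ("256-dim int4", truncated_int4),
]
for name, nbytes in layouts:
    total_gb = num_docs * nbytes / 1e9
    print(f"{name}: {nbytes} B/vector, {total_gb:.2f} GB total")
```

When a similarity scan is memory-bound, the bytes moved per vector are a reasonable first-order proxy for the attainable speedup.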

Below, however, we are just doing our best with `numba` and Cython to implement a working, fast-enough-to-run-eventually evaluation. In our experiments, the highly optimized BLAS-based float32 vector similarity operation on uncompressed vectors actually runs substantially faster on most machines until the CPU count gets very high. However, our implementation is fairly readable and completes a full evaluation in under an hour, so you may find it useful despite its lack of low-level optimizations.
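For checking a compiled kernel on small inputs, a slow NumPy-only reference can help. The sketch below assumes the same packing convention as the code in this notebook (two 4-bit codes per byte, first value in the high nibble); the helper names `pack_int4`, `unpack_int4`, and `int4_dot_reference` are hypothetical:

```python
import numpy as np


def pack_int4(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (values 0..15) into bytes, high nibble first."""
    codes = codes.astype(np.uint8)
    return (codes[:, 0::2] << 4) | codes[:, 1::2]


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the 4-bit codes as int32."""
    out = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int32)
    out[:, 0::2] = packed >> 4
    out[:, 1::2] = packed & 0b1111
    return out


def int4_dot_reference(q_packed: np.ndarray, d_packed: np.ndarray, limit: float) -> np.ndarray:
    """All-pairs approximate dot products computed from packed int4 codes."""
    c = 2 * limit / 15
    q = unpack_int4(q_packed)
    d = unpack_int4(d_packed)
    dim = q.shape[1]
    prods = q @ d.T                                         # <i_1, i_2>
    sums = q.sum(axis=1)[:, None] + d.sum(axis=1)[None, :]  # sum(i_1) + sum(i_2)
    return c * c * prods - c * limit * sums + dim * limit * limit


rng = np.random.default_rng(0)
limit = 0.18
q_codes = rng.integers(0, 16, size=(2, 8))
d_codes = rng.integers(0, 16, size=(5, 8))
scores = int4_dot_reference(pack_int4(q_codes), pack_int4(d_codes), limit)

# Compare against dequantizing every code to c * i - limit and using a float matmul.
c = 2 * limit / 15
expected = (c * q_codes - limit) @ (c * d_codes - limit).T
print(np.allclose(scores, expected))
```

This version materializes the unpacked codes, so it is only suitable for small sanity checks rather than full-corpus retrieval.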

In [1]:
!pip install -q numpy pandas torch numba pytrec-eval pyarrow tqdm huggingface-hub beir Cython

In [2]:
%load_ext cython

In [3]:
import json
import logging
from multiprocessing.pool import ThreadPool
from multiprocessing import cpu_count
from pathlib import Path
from typing import cast

import huggingface_hub
import numba
import numpy as np
import pandas as pd
import pytrec_eval
import pyarrow as pa
import pyarrow.parquet as pq
import torch
import torch.nn.functional as F
from beir.retrieval.evaluation import EvaluateRetrieval
from numpy.typing import NDArray
from tqdm.auto import tqdm

# Alias this static evaluation function for standalone use.
beir_evaluate = EvaluateRetrieval.evaluate

In [4]:
# Global config.
EMBEDDINGS_DATASET_ID = "Snowflake/mteb-retrieval-snowflake-arctic-embed-m-v1.5"
COLUMN_DOC_ID = "DOC_ID"
COLUMN_QUERY_ID = "QUERY_ID"
COLUMN_VECTOR = "VECTOR_MAIN"
SCALAR_QUANTIZATION_LIMIT = 0.18
TRUNCATION_DIM = 256

In [5]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp

# A compiled parallelized int4 dotproduct operation designed to run pretty fast
# (though not nearly as optimized as true BLAS systems).

import cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def inner_query_doc_4bit_dotproduct_cython(
    cython.float[:, ::1] out,
    cython.uchar[:, ::1] query,
    cython.uchar[:, ::1] doc,
    cython.float limit,
    cython.long n,
    cython.long m,
    cython.long d_packed
):
    cdef cython.float bin_width = 2 * limit / 15
    cdef cython.int i, j, k
    # Unsigned bytes keep `>> 4` a logical shift on platforms where plain `char` is signed.
    cdef cython.uchar qv1, qv2, dv1, dv2
    cdef cython.uint sum_of_sums, sum_of_prods
    for j in prange(m, nogil=True):  # NOTE: We parallelize over docs to support parallel single-query runs.
        for i in range(n):
            sum_of_sums = 0
            sum_of_prods = 0
            for k in range(d_packed):
                # Unpack the values from this byte.
                qv1 = query[i, k] >> 4
                qv2 = query[i, k] & 0b1111
                dv1 = doc[j, k] >> 4
                dv2 = doc[j, k] & 0b1111

                # Accumulate running statistics.
                sum_of_sums = sum_of_sums + qv1 + dv1 + qv2 + dv2
                sum_of_prods = sum_of_prods + qv1 * dv1 + qv2 * dv2

            # Convert from integer statistics back to floating point.
            out[i, j] = (
                limit * limit * float(2 * d_packed)
                - limit * bin_width * float(sum_of_sums)
                + bin_width * bin_width * float(sum_of_prods)
            )

In [6]:
# Utility functions.

#### BEGIN LOADING EMBEDDINGS ####

def load_embeddings(
    pq_paths: list[Path],
    id_column_name: str,
    vector_column_name: str = COLUMN_VECTOR,
    truncate_dim: int | None = None,
    num_read_threads: int = 10,
) -> tuple[list[str], NDArray[np.float32]]:
    total_rows = _total_rows(pq_paths)
    vector_chunks = []
    ids = []
    with tqdm(total=total_rows, unit="row", desc="Loading embeddings from disk") as pbar, ThreadPool(num_read_threads) as pool:
        table_iter = pool.imap(pq.read_table, pq_paths)
        for table in table_iter:
            id_chunk = table[id_column_name].to_pylist()
            vector_chunk = _pa_vector_column_to_np_matrix(table[vector_column_name])
            if truncate_dim is not None:
                vector_chunk = truncate_embeddings(vector_chunk, truncate_dim)
            assert len(id_chunk) == vector_chunk.shape[0]
            ids.extend(id_chunk)
            vector_chunks.append(vector_chunk)
            pbar.update(len(id_chunk))
    return ids, np.vstack(vector_chunks)

def _pa_vector_column_to_np_matrix(pa_array: pa.ChunkedArray) -> NDArray[np.float32]:
    embed_dim = len(pa_array[0])
    res = pa_array.combine_chunks().flatten().to_numpy().reshape(-1, embed_dim)
    return cast(NDArray[np.float32], res)


def _normalize_embeddings(embeddings_matrix: NDArray[np.float32]) -> NDArray[np.float32]:
    """Normalize embeddings to unit norm along axis 1."""
    return cast(
        NDArray[np.float32], F.normalize(torch.tensor(embeddings_matrix), dim=1).numpy()
    )

def truncate_embeddings(embeddings_matrix: NDArray[np.float32], dim: int) -> NDArray[np.float32]:
    """Truncate and renormalize embeddings to lower dimensionality."""
    assert dim <= embeddings_matrix.shape[1]
    return _normalize_embeddings(embeddings_matrix[:, :dim])

def _total_rows(pq_paths: list[Path]) -> int:
    total = 0
    for p in pq_paths:
        with pq.ParquetFile(p) as pqf:
            total += pqf.metadata.num_rows
    return total


#### BEGIN 4BIT QUANTIZATION ####

@numba.njit(error_model="numpy", parallel=True)
def fast_4bit_uniform_scalar_quantize(
    emb_matrix: NDArray[np.float32], limit: float
) -> NDArray[np.uint8]:
    num_row, num_col = emb_matrix.shape
    assert num_col % 2 == 0
    assert limit > 0
    out = np.empty((num_row, num_col // 2), dtype=np.uint8)
    bin_width = 2 * limit / 15
    for i in numba.prange(num_row):
        row = emb_matrix[i, :]
        for out_j in range(num_col // 2):
            # Pull out two values at a time.
            in_j = out_j * 2
            value1 = row[in_j]
            value2 = row[in_j + 1]

            # 4-bit quantize the values.
            value1 = round(max(0, min(2 * limit, limit + value1)) / bin_width)
            value2 = round(max(0, min(2 * limit, limit + value2)) / bin_width)

            # Pack the values into a single uint8.
            value_packed = (value1 << 4) | value2
            out[i, out_j] = value_packed

    return out

def uint8_matmul(a: NDArray[np.uint8], b: NDArray[np.uint8], out=None) -> NDArray[np.int32]:
    """
    NOTE: A direct `a @ b` will cause integer overflow in datatype uint8.
    NOTE: `np.matmul(a, b, dtype=np.int32)` was ~4x slower than the `np.einsum` version on my machine.
    """
    n, d = a.shape
    d2, m = b.shape
    assert d2 == d
    # return np.matmul(a, b, dtype=np.int32)  # SLOW!
    return np.einsum("ik, kj -> ij", a, b, dtype=np.int32, out=out)


def fast_multi_query_4bit_dotproduct(
    query_emb_quant: NDArray[np.uint8], doc_emb_quant: NDArray[np.uint8], limit: float
) -> NDArray[np.float32]:
    num_query, dim_packed = query_emb_quant.shape
    num_doc, dim_packed2 = doc_emb_quant.shape
    assert dim_packed == dim_packed2
    assert limit > 0
    assert query_emb_quant.flags.c_contiguous
    assert doc_emb_quant.flags.c_contiguous
    out = np.zeros((num_query, num_doc), dtype=np.float32)
    inner_query_doc_4bit_dotproduct_cython(out, query_emb_quant, doc_emb_quant, limit, num_query, num_doc, dim_packed)
    return out



#### BEGIN RETRIEVAL AND IR EVALUATION ####

# NOTE: Without `numba` JIT compiling, this function can be quite slow and take a *ton* of RAM at large batch sizes.
@numba.njit(error_model="numpy", parallel=True)
def sorted_indices_and_scores(scores: NDArray[np.float32], depth: int) -> tuple[NDArray[np.int64], NDArray[np.float32]]:
    idx_argpartition = np.argpartition(scores, -depth, axis=1)
    topk_indices_slice = idx_argpartition[:, -depth:]
    topk_scores = np.take_along_axis(scores, topk_indices_slice, axis=1)
    idx_argsort = np.argsort(-topk_scores)
    topk_indices_sorted = np.take_along_axis(topk_indices_slice, idx_argsort, axis=1)
    topk_scores_sorted = np.take_along_axis(topk_scores, idx_argsort, axis=1)
    return topk_indices_sorted, topk_scores_sorted


def dense_retrieval(
    query_ids: list[str],
    doc_ids: list[str],
    query_embeddings: NDArray[np.float32],
    doc_embeddings: NDArray[np.float32],
    retrieval_depth: int,
    quantize_4bit_with_limit: float | None = None,
    batch_size: int = 64,
) -> dict[str, dict[str, float]]:
    """Perform dense retrieval with a set of ids and embeddings to get query results."""
    if quantize_4bit_with_limit is not None:
        query_embeddings = fast_4bit_uniform_scalar_quantize(query_embeddings, quantize_4bit_with_limit)
        doc_embeddings = fast_4bit_uniform_scalar_quantize(doc_embeddings, quantize_4bit_with_limit)
    
    query_results = {}
    num_queries, num_docs = query_embeddings.shape[0], doc_embeddings.shape[0]
    retrieval_depth = min(retrieval_depth, num_docs)

    batch_slices = [slice(start_i, start_i + batch_size) for start_i in range(0, num_queries, batch_size)]
    with tqdm(total=num_queries, desc="dense retrieval", unit="query") as pbar:
        for batch_slice in batch_slices:
            q_emb_slice = query_embeddings[batch_slice]
            if quantize_4bit_with_limit is None:
                scores_slice = q_emb_slice @ doc_embeddings.T
            else:
                scores_slice = fast_multi_query_4bit_dotproduct(q_emb_slice, doc_embeddings, quantize_4bit_with_limit)
                
            # Get indices and values of top-k scores.
            topk = torch.topk(torch.tensor(scores_slice), retrieval_depth)
            topk_indices_sorted = topk.indices.numpy()
            topk_scores_sorted = topk.values.numpy()
    
            # Convert each set of scores in the slice to a top-k dictionary.
            query_ids_slice = query_ids[batch_slice]
            for slice_offset in range(scores_slice.shape[0]):
                # Populate the results dictionary.
                query_id = query_ids_slice[slice_offset]
                sorted_doc_ids = [doc_ids[idx] for idx in topk_indices_sorted[slice_offset]]
                query_results[query_id] = dict(zip(sorted_doc_ids, topk_scores_sorted[slice_offset].tolist()))
            pbar.update(len(query_ids_slice))

    return query_results

In [7]:
# Download the precomputed embeddings for MTEB Retrieval.
# NOTE: The full dataset is around 100GB.

# # Example of downloading a subset of datasets.
# dataset_subset = ["NFCorpus", "FiQA2018"]
# embeddings_dataset_path_str = huggingface_hub.snapshot_download(
#     repo_id=EMBEDDINGS_DATASET_ID,
#     repo_type="dataset",
#     allow_patterns=["_qrels/*"] + [f"{x}/*" for x in dataset_subset],
# )

embeddings_dataset_path_str = huggingface_hub.snapshot_download(
    repo_id=EMBEDDINGS_DATASET_ID, repo_type="dataset"
)
embeddings_dataset_path = Path(embeddings_dataset_path_str)

# Demonstration of retrieval on a single dataset

In [8]:
example_dataset = "QuoraRetrieval"
emb_dir = embeddings_dataset_path / example_dataset / "embeddings"
doc_emb_file_paths = sorted(emb_dir.glob("documents*.parquet"))
query_emb_file_paths = sorted(emb_dir.glob("queries*.parquet"))
doc_ids, doc_emb = load_embeddings(doc_emb_file_paths, id_column_name=COLUMN_DOC_ID, truncate_dim=TRUNCATION_DIM)
query_ids, query_emb = load_embeddings(query_emb_file_paths, id_column_name=COLUMN_QUERY_ID, truncate_dim=TRUNCATION_DIM)

In [9]:
%%time
scores = dense_retrieval(query_ids, doc_ids, query_emb, doc_emb, retrieval_depth=10)

CPU times: user 3min 10s, sys: 16 s, total: 3min 26s
Wall time: 22.3 s


In [10]:
%%time
# NOTE: This code isn't super fast because even with our compiled Cython "fast" implementation above, our code for
# int4 matmuls is much less optimized than the standard float32 matmul code behind non-quantized dense retrieval.
scores_quant = dense_retrieval(
    query_ids,
    doc_ids,
    query_emb,
    doc_emb,
    retrieval_depth=10,
    quantize_4bit_with_limit=SCALAR_QUANTIZATION_LIMIT,
)

CPU times: user 2min 9s, sys: 19.7 s, total: 2min 28s
Wall time: 22.6 s


In [11]:
def load_mteb_qrels(task_name: str) -> dict:
    path = embeddings_dataset_path / "_qrels" / f"{task_name}.json"
    return json.loads(path.read_text())

qrel = load_mteb_qrels(example_dataset)

In [12]:
score_unquant = beir_evaluate(qrel, scores, k_values=[10])[0]["NDCG@10"]
score_quant = beir_evaluate(qrel, scores_quant, k_values=[10])[0]["NDCG@10"]
score_unquant, score_quant

(0.8717, 0.86837)

# Single-query speedup example

Our "casual" int4 dot product implementation doesn't keep up with the highly tuned BLAS routines that NumPy uses for float32 matrix multiplication in the batch retrieval above. However, on the easier-to-optimize vector-matrix multiplication used for single-query lookup (the most common case in live retrieval systems), our implementation does start to hint at the runtime improvements possible.

In [13]:
q_vec = query_emb[0]

In [14]:
%%timeit
_ = q_vec[None, :] @ doc_emb.T

6.22 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [15]:
q_vec_quant = fast_4bit_uniform_scalar_quantize(q_vec[None, :], SCALAR_QUANTIZATION_LIMIT)
doc_emb_quant = fast_4bit_uniform_scalar_quantize(doc_emb, SCALAR_QUANTIZATION_LIMIT)

In [16]:
%%timeit
_ = fast_multi_query_4bit_dotproduct(q_vec_quant, doc_emb_quant, SCALAR_QUANTIZATION_LIMIT)

2.27 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
unquant_scores = q_vec @ doc_emb.T
quant_scores = fast_multi_query_4bit_dotproduct(q_vec_quant, doc_emb_quant, SCALAR_QUANTIZATION_LIMIT)
relative_error = np.abs(unquant_scores - quant_scores) / unquant_scores
print(f"Relative error μ ± σ: {relative_error.mean():.2%} ± {relative_error.std():.2%}")

Relative error μ ± σ: 2.99% ± 2.59%
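The relative error above divides by the raw float32 score, which implicitly assumes every score is positive. A sketch of a variant that stays well-defined for negative or near-zero scores, using a hypothetical helper `relative_error` and made-up numbers rather than the real scores:

```python
import numpy as np


def relative_error(reference: np.ndarray, approx: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Elementwise |reference - approx| / |reference|, guarded against division by zero."""
    reference = np.asarray(reference, dtype=np.float64)
    approx = np.asarray(approx, dtype=np.float64)
    return np.abs(reference - approx) / np.maximum(np.abs(reference), eps)


reference = np.array([0.50, -0.20, 0.10])
approx = np.array([[0.49, -0.21, 0.11]])  # shape (1, n), like the kernel output
err = relative_error(reference, approx)
print(f"Relative error μ ± σ: {err.mean():.2%} ± {err.std():.2%}")
```

Broadcasting between the `(n,)` reference and the `(1, n)` approximation works the same way as in the cell above.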


# Score all the datasets

Below we provide a reproducible implementation of int4-compressed retrieval quality scoring to show how `snowflake-arctic-embed-m-v1.5` achieves a 53.7 MTEB Retrieval score in just 128 bytes per vector.

In [18]:
names = [p.parent.name for p in sorted(embeddings_dataset_path.glob("*/embeddings"))]
print(f"Scoring: {names}")
ndcg10_scores_unquantized = {}
ndcg10_scores_quantized = {}
for name in tqdm(names):
    print(name)
    emb_dir = embeddings_dataset_path / name / "embeddings"
    doc_emb_file_paths = sorted(emb_dir.glob("documents*.parquet"))
    query_emb_file_paths = sorted(emb_dir.glob("queries*.parquet"))
    doc_ids, doc_emb = load_embeddings(doc_emb_file_paths, id_column_name=COLUMN_DOC_ID, truncate_dim=TRUNCATION_DIM)
    query_ids, query_emb = load_embeddings(query_emb_file_paths, id_column_name=COLUMN_QUERY_ID, truncate_dim=TRUNCATION_DIM)
    qrel = load_mteb_qrels(name)
    scores = dense_retrieval(query_ids, doc_ids, query_emb, doc_emb, 10)
    scores_quant = dense_retrieval(query_ids, doc_ids, query_emb, doc_emb, 10, SCALAR_QUANTIZATION_LIMIT)
    ndcg10_scores_unquantized[name] = beir_evaluate(qrel, scores, k_values=[10])[0]["NDCG@10"]
    ndcg10_scores_quantized[name] = beir_evaluate(qrel, scores_quant, k_values=[10])[0]["NDCG@10"]

Scoring: ['ArguAna', 'CQADupstackAndroidRetrieval', 'CQADupstackEnglishRetrieval', 'CQADupstackGamingRetrieval', 'CQADupstackGisRetrieval', 'CQADupstackMathematicaRetrieval', 'CQADupstackPhysicsRetrieval', 'CQADupstackProgrammersRetrieval', 'CQADupstackStatsRetrieval', 'CQADupstackTexRetrieval', 'CQADupstackUnixRetrieval', 'CQADupstackWebmastersRetrieval', 'CQADupstackWordpressRetrieval', 'ClimateFEVER', 'DBPedia', 'FEVER', 'FiQA2018', 'HotpotQA', 'MSMARCO', 'NFCorpus', 'NQ', 'QuoraRetrieval', 'SCIDOCS', 'SciFact', 'TRECCOVID', 'Touche2020']


Rows loaded per dataset, as reported by the loading progress bars:

| dataset | document rows | query rows |
|---|---|---|
| ArguAna | 8674 | 1406 |
| CQADupstackAndroidRetrieval | 22998 | 699 |
| CQADupstackEnglishRetrieval | 40221 | 1570 |
| CQADupstackGamingRetrieval | 45301 | 1595 |
| CQADupstackGisRetrieval | 37637 | 885 |
| CQADupstackMathematicaRetrieval | 16705 | 804 |
| CQADupstackPhysicsRetrieval | 38316 | 1039 |
| CQADupstackProgrammersRetrieval | 32176 | 876 |
| CQADupstackStatsRetrieval | 42269 | 652 |
| CQADupstackTexRetrieval | 68184 | 2906 |
| CQADupstackUnixRetrieval | 47382 | 1072 |
| CQADupstackWebmastersRetrieval | 17405 | 506 |
| CQADupstackWordpressRetrieval | 48605 | 541 |
| ClimateFEVER | 5416593 | 1535 |
| DBPedia | 4635922 | 400 |
| FEVER | 5416568 | 6666 |
| FiQA2018 | 57638 | 648 |
| HotpotQA | 5233329 | 7405 |
| MSMARCO | 8841823 | 6980 |
| NFCorpus | 3633 | 323 |
| NQ | 2681468 | 3452 |
| QuoraRetrieval | 522931 | 10000 |
| SCIDOCS | 25657 | 1000 |
| SciFact | 5183 | 300 |
| TRECCOVID | 171332 | 50 |
| Touche2020 | 382545 | 49 |

In [19]:
df_ndcg10 = pd.DataFrame({"unquantized": ndcg10_scores_unquantized, "quantized": ndcg10_scores_quantized})

# Cache results to CSV.
df_ndcg10.to_csv("ndcgs_validation.csv")

# Roll up CQA Dupstack Retrieval.
is_cqa = df_ndcg10.index.to_series().str.startswith("CQA")
cqa_mean = df_ndcg10.loc[is_cqa].mean().to_frame().T
cqa_mean.index = ["CQADupstackRetrieval"]
df_ndcg10 = pd.concat([df_ndcg10.loc[~is_cqa], cqa_mean]).sort_index()

# Show scores across MTEB retrieval.
df_ndcg10

| dataset | unquantized | quantized |
|---|---|---|
| ArguAna | 0.58476 | 0.57953 |
| CQADupstackRetrieval | 0.442101 | 0.433432 |
| ClimateFEVER | 0.36229 | 0.36064 |
| DBPedia | 0.44826 | 0.43716 |
| FEVER | 0.87224 | 0.866 |
| FiQA2018 | 0.41671 | 0.41258 |
| HotpotQA | 0.69174 | 0.68013 |
| MSMARCO | 0.41249 | 0.40598 |
| NFCorpus | 0.35799 | 0.35728 |
| NQ | 0.61669 | 0.61018 |


In [20]:
# Print mean MTEB Retrieval scores.
df_ndcg10.mean()

unquantized    0.542337
quantized      0.537291
dtype: float64
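To express the gap between the two means as a retention figure, a small arithmetic sketch follows. The two values are copied from the output above; "unquantized" here means the 256-dimensional float32 vectors, and the variable names are illustrative:

```python
# Mean NDCG@10 values copied from the output above.
mean_unquantized = 0.542337  # 256-dim float32
mean_quantized = 0.537291    # 256-dim int4

absolute_drop = mean_unquantized - mean_quantized
retention = mean_quantized / mean_unquantized
print(f"absolute drop: {absolute_drop:.4f}, retained: {retention:.2%}")
```

Rounded to one decimal on a 0 to 100 scale, the quantized mean corresponds to the 53.7 score quoted at the top of this notebook.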