Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques
Naamán Huerga-Pérez, Rubén Álvarez, Rubén Ferrero-Guillén, Alberto Martínez-Gutiérrez, Javier Díez-González (2025)
[30/04/2025] Preprint available on arXiv
[22/05/2025] Code available on GitHub
Retrieval-Augmented Generation (RAG) enhances language models by retrieving relevant information from external knowledge bases, relying on high-dimensional vector embeddings typically stored at float32 precision. Storing these embeddings at scale, however, presents significant memory challenges.
This work investigates two complementary optimization strategies on the MTEB benchmark (short sketches of both follow below):
- Quantization: evaluating standard formats (float16, int8, binary) as well as low-bit floating-point types (float8)
- Dimensionality reduction: assessing methods including PCA, Kernel PCA, UMAP, random projections, and autoencoders
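To make the quantization formats concrete, here is a minimal NumPy sketch of the three standard schemes applied to a single embedding. It is purely illustrative, not the repository's implementation; float8 requires a dedicated dtype and appears in the next sketch.

```python
import numpy as np

# A toy 768-dimensional float32 embedding.
vec = np.random.default_rng(0).standard_normal(768).astype(np.float32)

# float16: a straight cast, 2x smaller and near-lossless for embeddings.
vec_fp16 = vec.astype(np.float16)

# int8: symmetric linear quantization around the max magnitude, 4x smaller.
scale = np.abs(vec).max() / 127.0
vec_int8 = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)

# binary: keep only the sign of each dimension, packed to 1 bit, 32x smaller.
vec_bin = np.packbits(vec > 0)

print(vec.nbytes, vec_fp16.nbytes, vec_int8.nbytes, vec_bin.nbytes)
# -> 3072 1536 768 96
```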
Key findings:
- float8 achieves a 4× storage reduction with minimal performance degradation (<0.3%), outperforming int8
- PCA emerges as the most effective dimensionality reduction technique
- Combining moderate PCA (keeping 50% of the dimensions) with float8 yields 8× total compression with less performance impact than int8 alone (see the sketch below)
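As a sketch of the combined pipeline behind that last finding, assuming scikit-learn for PCA and the ml_dtypes package for a NumPy-compatible float8 dtype (both are illustrative choices, not necessarily what the repository uses):

```python
import numpy as np
import ml_dtypes  # provides IEEE-style float8 dtypes for NumPy
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 768)).astype(np.float32)  # toy corpus

# Step 1: moderate PCA, keeping 50% of the original dimensions.
pca = PCA(n_components=embeddings.shape[1] // 2)
reduced = pca.fit_transform(embeddings).astype(np.float32)

# Step 2: cast the reduced vectors to float8 (E4M3 variant).
compressed = reduced.astype(ml_dtypes.float8_e4m3fn)

# float32 -> float8 gives 4x; halving the dimensions doubles it to 8x.
print(f"compression: {embeddings.nbytes / compressed.nbytes:.0f}x")  # -> 8x
```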
Python repository to reproduce the experiments:
- Quantization types: FLOAT32, FLOAT16, BFLOAT16, FLOAT8 (E4M3, E5M2), FLOAT4 (E2M1), INT8, and Binary
- Multiple dimensionality reduction techniques with configurable parameters
- Caching system using Qdrant vector database
- Comprehensive evaluation using MTEB benchmarks
- Automated experiment configuration through YAML files
Requirements:
- Python 3.12
- Docker and Docker Compose (for a persistent embeddings cache)
- CUDA-capable GPU (recommended)
- Configure your experiments in a YAML file (see the experiments folder for examples; a sketch of such a file follows below):
  - Select the model
  - Choose the benchmarks
  - Define the compression experiments
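As a rough illustration of the shape such a file might take, here is a hypothetical config parsed with PyYAML; every key below is invented for the example, so refer to the experiments folder for the actual schema:

```python
import yaml  # PyYAML

# Hypothetical experiment config; key names are illustrative only.
config_text = """
model: sentence-transformers/all-MiniLM-L6-v2
benchmarks:
  - SciFact
  - NFCorpus
experiments:
  - quantization: float8_e4m3
  - reduction:
      method: pca
      keep_dims: 0.5
    quantization: float8_e4m3
"""

config = yaml.safe_load(config_text)
print(config["model"], len(config["experiments"]))  # -> model name, 2
```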
- Run the experiments:
HF_DATASETS_TRUST_REMOTE_CODE=1 python run_experiments.py --config configs/experiment.yml
With the default configuration, the experiment uses an in-memory cache to avoid re-computing unnecessary embeddings. If you wish to use a persistent cache, spin up a Qdrant instance via Docker:
docker compose up -d
HF_DATASETS_TRUST_REMOTE_CODE=1 python run_experiments.py --config configs/experiment.yml --cache-location localhost:6333
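As a sketch of what a persistent embedding cache on Qdrant can look like: the collection name, ids, and payload below are hypothetical, not the repository's actual caching layer.

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)  # the instance started above

# One collection sized to the embedding dimensionality.
client.recreate_collection(
    collection_name="embedding_cache",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Cache an embedding under a stable id derived from its input text.
vector = np.random.default_rng(0).standard_normal(768).tolist()
client.upsert(
    collection_name="embedding_cache",
    points=[PointStruct(id=42, vector=vector, payload={"text": "example passage"})],
)

# On a later run, fetch it back instead of re-encoding the text.
cached = client.retrieve(collection_name="embedding_cache", ids=[42], with_vectors=True)
```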
You can use download_datasets.py to download the required datasets from Hugging Face in advance and avoid download times when running the evaluation script:
HF_DATASETS_TRUST_REMOTE_CODE=1 python download_datasets.py --config configs/experiment.yml
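For a single task, prefetching amounts to warming the local Hugging Face cache; a minimal sketch, with mteb/scifact as an example task (the script itself reads the task list from the YAML config):

```python
from datasets import load_dataset

# Downloading once here populates the local cache, so the evaluation
# script later loads from disk instead of the network.
corpus = load_dataset("mteb/scifact", "corpus", trust_remote_code=True)
queries = load_dataset("mteb/scifact", "queries", trust_remote_code=True)
print(corpus, queries)
```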
If you use this work, please cite:
@misc{huergapérez2025optimizationembeddingsstoragerag,
title={Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques},
author={Naamán Huerga-Pérez and Rubén Álvarez and Rubén Ferrero-Guillén and Alberto Martínez-Gutiérrez and Javier Díez-González},
year={2025},
eprint={2505.00105},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2505.00105},
}
License: MIT