This repository explores an implementation of the Stochastic Quasi-Gradient K-means (also referred to as the Stochastic Quantization algorithm), a robust and scalable alternative to existing K-means solvers, designed to handle large datasets and utilize memory more efficiently during computation. The implementation examines the application of the algorithm to high-dimensional unsupervised and semi-supervised learning tasks. The repository contains both a Python package for reproducing experimental results and a LaTeX manuscript documenting the theoretical and experimental outcomes of the algorithm. The Python package continues to evolve independently of the research documentation; therefore, to reproduce specific results presented in the paper, researchers should refer to the commit hash mentioned in the description.
The Python implementation of the algorithm exposes a scikit-learn-compatible API, so it can be used as a step in a Pipeline alongside scikit-learn's built-in data transformers:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import sqg
# Load the Iris dataset
X, _ = load_iris(return_X_y=True)
# Create an optimizer for SQG-clustering
optimizer = sqg.SGDOptimizer()
# Create and fit a pipeline with preprocessing and SQG-clustering
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),  # Scale features to have mean=0 and variance=1
        ("sqg", sqg.StochasticQuantization(optimizer, n_clusters=3)),
    ]
).fit(X)
# Get the cluster labels
labels = pipeline.predict(X)
by Anton Kozyriev¹, Vladimir Norkin¹,²
1. Igor Sikorsky Kyiv Polytechnic Institute, National Technical University of Ukraine, Kyiv, 03056, Ukraine
2. V.M.Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, 03178, Ukraine
Published in the International Scientific Technical Journal "Problems of Control and Informatics". This paper addresses the inherent limitations of traditional vector quantization (clustering) algorithms, particularly K-means and its variant K-means++, and investigates the stochastic quantization (SQ) algorithm as a scalable alternative methodology for high-dimensional unsupervised and semi-supervised learning problems.
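To make the idea of processing one sample at a time more concrete, the following is a minimal pure-Python sketch of a stochastic, single-sample center update in the spirit of stochastic quantization. It is not the code of the sqg package and not necessarily the exact update rule from the paper: the function names (nearest, stochastic_quantization), the parameters (n_steps, lr, seed), the initialization from random samples, and the decaying step size lr / sqrt(t) are all assumptions made for illustration.

```python
import random


def nearest(centers, x):
    """Return the index of the center closest to x (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(centers[k], x)),
    )


def stochastic_quantization(samples, n_clusters, n_steps=2000, lr=0.5, seed=0):
    """Toy single-sample update loop (illustrative only, hypothetical API)."""
    rng = random.Random(seed)
    # Assumption: centers start at distinct, randomly chosen samples.
    centers = [list(x) for x in rng.sample(samples, n_clusters)]
    for t in range(1, n_steps + 1):
        x = rng.choice(samples)  # only one sample is touched per step
        k = nearest(centers, x)  # index of the closest ("winning") center
        step = lr / t ** 0.5  # Assumption: step size decaying as 1/sqrt(t)
        # Step along the negative gradient of the squared distance
        # (the constant factor 2 is absorbed into the step size).
        centers[k] = [c - step * (c - v) for c, v in zip(centers[k], x)]
    return centers


# Usage on two well-separated synthetic 2-D blobs (made-up data).
data_rng = random.Random(42)
blob_a = [(data_rng.gauss(0, 0.1), data_rng.gauss(0, 0.1)) for _ in range(100)]
blob_b = [(data_rng.gauss(5, 0.1), data_rng.gauss(5, 0.1)) for _ in range(100)]
centers = stochastic_quantization(blob_a + blob_b, n_clusters=2)
print(sorted(centers))
```

Because each step reads a single sample, the same loop could in principle consume samples from a stream instead of an in-memory list, which is the memory advantage the paper emphasizes over solvers that load the whole dataset.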
Latest commit hash
ed22ae0b5507564d917b57d4cbdea952cc134d77
Citation
@article{Kozyriev_Norkin_2025,
title = {Robust clustering on high-dimensional data with stochastic quantization},
author = {Kozyriev, Anton and Norkin, Vladimir},
year = {2025},
month = {Feb.},
journal = {International Scientific Technical Journal "Problems of Control and Informatics"},
volume = {70},
number = {1},
pages = {32–48},
doi = {10.34229/1028-0979-2025-1-3},
url = {https://jais.net.ua/index.php/files/article/view/438},
abstractnote = {This paper addresses the limitations of traditional vector quantization (clustering) algorithms, particularly K-means and its variant K-means++, and explores the stochastic quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning problems. Some traditional clustering algorithms suffer from inefficient memory utilization during computation, necessitating the loading of all data samples into memory, which becomes impractical for large-scale datasets. While variants such as mini-batch K-means partially mitigate this issue by reducing memory usage, they lack robust theoretical convergence guarantees due to the non-convex nature of clustering problems. In contrast, SQ-algorithm provides strong theoretical convergence guarantees, making it a robust alternative for clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data. To address the challenge of high dimensionality, we trained Triplet Network to encode images into low-dimensional representations in a latent space, which serve as a basis for comparing the efficiency of both SQ-algorithm and traditional quantization algorithm.},
}
Before working with the source code, it is important to note that the Python package in the repository is intended SOLELY FOR EXPERIMENTAL PURPOSES and is not production-ready. To proceed with this project, follow the instructions below to configure your environment, install the necessary dependencies, and execute the code to reproduce the results presented in the paper.
The installation process requires the Conda package manager to manage third-party dependencies and virtual environments. A step-by-step guide to installing the CLI tool is available on the official website. The third-party dependencies are listed in the environment.yml file, with the corresponding licenses in the NOTICES file.
Clone the repository (alternatively, you can download the source code as a zip archive):
git clone https://github.com/kaydotdev/stochastic-quantization.git
cd stochastic-quantization
Then create a Conda virtual environment and activate it:
conda env create -f environment.yml
conda activate stochastic-quantization
Use the following command to install the core sq package with third-party dependencies, run the test suite, compile LaTeX files, and generate results:
make all
Produced figures and other artifacts (except compiled LaTeX files) will be stored in the results directory. Optionally, use the following command to perform the actions above without LaTeX file compilation:
make -C code all
To automatically remove all generated results and compiled LaTeX files produced by scripts, use the following command:
make clean
This repository contains both software (source code) and an academic manuscript. Different licensing terms apply to these components as follows:
- Source Code: All source code contained in this repository, unless otherwise specified, is licensed under the MIT License. The full text of the MIT License can be found in the file LICENSE.code.md in the code directory.
- Academic Manuscript: The academic manuscript, including all LaTeX source files and associated content (e.g., figures), is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). The full text of the CC BY-NC-ND 4.0 License can be found in the file LICENSE.manuscript.md in the manuscript directory.