├── data - contains small test data
├── experiments - contains experimentation tools
├── src
│   └── cpop
│       ├── cluster.cc/hh - clustering methods
│       ├── compute_kernel.cc/hh - code relevant to computing the kernel matrix
│       ├── gpu_kernels.cu/cuh - CUDA kernels
│       ├── utils.cc - other useful helpers
│       ├── CMakeLists.txt - build configuration
│       └── main.cc - main algorithm implementation
├── src_combblas - implementation with CombBLAS
├── tests - Python serial implementation
└── Makefile - start point for building and testing
This library has been tested with the following dependencies:
- CUDA Toolkit 12.2
- GCC 12.3
- SLATE 2024.10.29
- CombBLAS
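Before building, you can optionally confirm the toolchain versions. The use of the Cray CC wrapper below is an assumption based on the SLATE setup that follows, and the exact commands may differ on your system.
# Optional toolchain sanity check (assumes a Cray programming environment
# where CC wraps the GNU compiler; adjust for your site).
nvcc --version   # expect CUDA 12.2
CC --version     # expect GCC 12.3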
The following commands can be used to install SLATE. SLATE should be installed in the user's home directory ~/ (if not, the `export SLATE_INSTALL := …` line in the Makefile will need to be amended). This will take a while (>30 minutes).
cd ~/
export mpi=cray
export blas=libsci
export CXX=CC
git clone --recursive https://github.com/icl-utk-edu/slate.git
cd slate/blaspp
make -j`nproc`
cd ../lapackpp
make -j`nproc`
cd ..
make -j`nproc`
mkdir _install
make install prefix=_install
cd _install
export SLATE_INSTALL=$(pwd)
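As an optional check that SLATE landed where the Makefile expects, and to keep SLATE_INSTALL set in future shells, a sketch like the following can be used; the lib/ vs lib64/ layout is system-dependent and this is not a required step.
# Optional: verify the SLATE install prefix and persist it for later shells.
echo "$SLATE_INSTALL"
ls "$SLATE_INSTALL"/include "$SLATE_INSTALL"/lib*          # headers and libraries
echo "export SLATE_INSTALL=$SLATE_INSTALL" >> ~/.bashrc    # optional persistence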
The following commands can be used to install CombBLAS. CombBLAS should be installed in the user's home directory ~/ (if not, the `export COMBBLAS_INSTALL := …` line in the Makefile will need to be amended). This will take a few minutes.
cd ~/
wget https://zenodo.org/records/15208078/files/CombBLAS-combblas-gpu.zip
unzip CombBLAS-combblas-gpu.zip
mv CombBLAS-combblas-gpu CombBLAS
cd CombBLAS
mkdir _build && mkdir _install
cd _build
cmake -DCMAKE_INSTALL_PREFIX=../_install ..
cmake --build . --target install
cd ../_install
export COMBBLAS_INSTALL=$(pwd)
The zip for the CombBLAS repo can also be downloaded directly from here.
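As with SLATE, an optional sanity check for the CombBLAS install; the exact include/lib layout may vary, so this is only a sketch.
# Optional: verify the CombBLAS install prefix and persist it for later shells.
echo "$COMBBLAS_INSTALL"
ls "$COMBBLAS_INSTALL"/include "$COMBBLAS_INSTALL"/lib*
echo "export COMBBLAS_INSTALL=$COMBBLAS_INSTALL" >> ~/.bashrc   # optional persistence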
Build with `make build`. Relevant options are listed below; a typical sequence is sketched after the list.
- `make build`: build Kettlecorn in the `build` directory. To build without fine-grained timing (e.g. for benchmarking without a breakdown), run `make build BASIC=1` instead.
- `make blasbuild`: build the alternative CombBLAS implementation in the `blasbuild` directory.
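Putting the options together, a typical sequence looks like the sketch below, assuming SLATE and CombBLAS were installed under ~/ as described above (otherwise amend the install paths in the Makefile first).
# Minimal build sketch; pick one of the two `make build` variants.
make build              # Kettlecorn with fine-grained timing
# make build BASIC=1    # or: without fine-grained timing (e.g. for benchmarking)
make blasbuild          # optional: alternative CombBLAS implementation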
Before you proceed, replace ACCOUNT in the Makefile with the account ID for your project.
`make alloc` requests an interactive compute session on your cluster.
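For reference, on a SLURM-based system such as NERSC Perlmutter an interactive GPU allocation is conceptually similar to the sketch below; the node count, QOS, and time limit here are assumptions, and the authoritative command is the one behind `make alloc` in the Makefile.
# Illustrative only: a typical interactive GPU allocation on a
# SLURM/Perlmutter-style cluster. Flag values are assumptions.
salloc --nodes=1 --constraint=gpu --gpus=4 --qos=interactive \
       --time=00:30:00 --account=ACCOUNT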
Naive testing on smaller datasets can be done with:
- `make australian` (690 points, 14 features, 2 clusters, Sparse V)
- `make svmguide1` (3089 points, 4 features, 2 clusters, Sparse V)
- `make letter` (15k points, 5k features, 26 clusters, Sparse V)
- `make rand` (70k points, 64 features, 128 clusters, Sparse V)
- You may append `--convergence=1` at the end of the run command in the Makefile to run with convergence checking
All of these tests launch their own allocated interactive session. Svmguide1 and Letter request 16 GPUs while the others request 4 GPUs, where 1 node has 4 GPUs. Note that these commands may need to be modified according to the architecture of the cluster you run on.
More rigorous scaling testing should be done from within the experiments folder (for more, see the README there). Rigorous scaling testing must not use the interactive session or convergence checking.
- Use `--convergence=1` in the Makefile to run with convergence (ex: `build/device_wrapper build/main -i data/letter -m 690 -n 14 -k 2 --convergence=1`)
- Use `--sparse=0` in the Makefile to run in dense V mode (ex: `build/device_wrapper build/main -i data/letter -m 690 -n 14 -k 2 --sparse=0`)
- See `src/cpop/utils.cc` for other runtime arguments (a combined example of the two flags above is sketched below)
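A sketch combining both flags, assuming they are accepted together on one command line; it simply mirrors the example commands above.
# Sketch: convergence checking and dense V mode in one run (assumes the
# flags can be combined; mirrors the example commands above).
build/device_wrapper build/main -i data/letter -m 690 -n 14 -k 2 --convergence=1 --sparse=0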
`make profile` launches nsys profiling on the rand dataset on 1 GPU.
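For orientation, an Nsight Systems run has the general shape sketched below; the report name and the use of `nsys stats` here are illustrative, and `make profile` wraps the actual launch command.
# Illustrative only: general shape of an nsys profiling run.
nsys profile -o rand_profile <launch command>   # writes rand_profile.nsys-rep
nsys stats rand_profile.nsys-rep                # summarize kernel/API time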
If you find this repo helpful to your work, please cite our article:
@inproceedings{bellavita2026kkmeans,
title={Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs},
author={Bellavita, Julian and Rubino, Matthew and Iyer, Nakul and Chang, Andrew and Devarakonda, Aditya and Vella, Flavio and Guidi, Giulia},
booktitle={Proceedings of the 40th IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
year={2026},
organization={IEEE},
note={to appear},
}
Our algorithms are implemented in the open-source software Vivaldi (named after the composer) and are available in this repository. The sliding window algorithm used as a baseline in the paper is available here.
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, using NERSC award ASCR-ERCAP0030076. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0025528. The authors acknowledge financial support from ICSC – Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing, funded by the European Union – NextGenerationEU. The second through fourth authors were affiliated with Cornell University at the time this work was conducted. This work was carried out in collaboration with the Hicrest Laboratory at the University of Trento.