# FFCL: Flexible and (probably not the) fast(est) C++ clustering library
## TODO
- Performance tests comparing `ffcl` to the most popular libraries (done for KDTrees with PCL and FLANN, and for DBSCAN against sklearn).
- Some optimizations for the nearest neighbors computations of KMeans and KMedoids. I might enable KDTree acceleration, but it's not beneficial unless a large number of centroids/medoids is needed, so it's not a priority for now.
- Proper unit testing (update: all the generic code is now unit tested)
- OPTICS: I might do it if it can be done very quickly. Otherwise I won't, since HDBSCAN* is supposed to be better.
- HDBSCAN* (ongoing):
  - MST with Boruvka's algorithm: done.
  - Single linkage cluster tree: done. Dendrogram python plots are also done.
  - Condensed cluster tree: done.
  - HDBSCAN*: assembly of the aforementioned parts is done.
  - The performance of HDBSCAN* can be improved by using a dual tree traversal instead of the single tree traversal currently used to build the MST.
  - A few options to add, such as `cluster_selection_epsilon`, `max_cluster_size` and `alpha`, to conform with the python hdbscan library's features.
  - Dual tree traversal: done for dual set closest edge. The single tree method in Boruvka's algorithm still needs to be replaced with the dual tree one.
## Latest features
- HDBSCAN:
  - Dual tree traversal for dual set closest edge (not added to Boruvka's algorithm yet).
## Current features

- HDBSCAN:
  - CondensedClusterTree
  - SingleLinkageTree
  - BoruvkasAlgorithm
- KMedoids:
  - FasterMSC paper | author's repo
  - FasterPAM paper | author's repo
- KMeans:
  - Lloyd
  - Hamerly paper | authors' repo
- DBSCAN:
  - `KDTree` is currently used for `DBSCAN`.
  - Options:
    - `bucket_size(ssize_t)`: the maximum number of samples that a leaf node can contain.
    - `max_depth(ssize_t)`: the maximum recursion depth of the tree. `max_depth` takes priority over `bucket_size`, meaning that if the maximum recursion depth cannot be satisfied without violating the bucket size condition, the latter may not be fulfilled (see the sketch after this list).
    - `axis_selection_policy`: `HighestVarianceBuild` (default), `MaximumSpreadBuild` or `CycleThroughAxesBuild`. Custom policies can be implemented and used.
      - `feature_mask({feature_index_0, feature_index_1, ..., feature_index_n})`: dynamic feature mask used to select the features of interest for the axis selection procedure during the build phase. The feature indices can be specified in any order, as long as `0 <= feature_index_j < n_features`. Duplicate feature indices should still produce valid results, but they would bias the axis selection procedure and thus defeat its purpose.
    - `splitting_rule_policy`: `QuickselectMedianRange` (default), a quickselect median selection strategy that partitions the children leaves around the pivot median value of the binary tree.
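To make the `max_depth` / `bucket_size` interaction concrete, here is a minimal sketch (not from the library's documentation; the iterator aliases and numeric values are arbitrary assumptions, and the ffcl KDTree header is assumed to be included):

```cpp
#include <cstddef>
#include <vector>

// Sketch only: the aliases mirror the KDTree usage example further down.
using IndicesIterator = std::vector<std::size_t>::iterator;
using SamplesIterator = std::vector<float>::iterator;
using OptionsType     = ffcl::datastruct::KDTree<IndicesIterator, SamplesIterator>::Options;

// With max_depth(3), splitting stops after 3 levels even if some leaves still
// hold more than bucket_size(40) samples: max_depth wins over bucket_size.
const auto options = OptionsType().bucket_size(40).max_depth(3);
```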
- Options: `LowerTriangleMatrixPrecomputedPairwiseDistances`, currently used for the pairwise distance values that can be buffered for `KMedoids` (see the indexing sketch below).
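The class name suggests the usual packed lower-triangle layout; as an illustration only (not FFCL's actual API), the strict lower triangle of a symmetric pairwise distance matrix can be flattened like this:

```cpp
#include <cstddef>

// Illustration only, not FFCL's API: index of distance(i, j), with i > j,
// in a flat buffer that stores the strict lower triangle row by row.
constexpr std::size_t packed_index(std::size_t i, std::size_t j) {
    return i * (i - 1) / 2 + j;
}

// A dataset of n samples needs n * (n - 1) / 2 buffered distance values.
constexpr std::size_t packed_size(std::size_t n) {
    return n * (n - 1) / 2;
}
```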
- Distances (see the Euclidean sketch below):
  - euclidean
  - manhattan
  - cosine similarity
  - levenshtein
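As a reference for what these metrics operate on, here is a generic sketch of the Euclidean distance over two flattened feature ranges (not FFCL's actual implementation):

```cpp
#include <cmath>
#include <iterator>

// Generic sketch, not FFCL's implementation: Euclidean distance between two
// equal-length feature ranges [first1, last1) and [first2, ...).
template <typename Iterator>
auto euclidean_distance(Iterator first1, Iterator last1, Iterator first2) {
    typename std::iterator_traits<Iterator>::value_type sum = 0;
    for (; first1 != last1; ++first1, ++first2) {
        const auto diff = *first1 - *first2;
        sum += diff * diff;
    }
    return std::sqrt(sum);
}
```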
- Centroid/medoid initialization:
  - random uniform
  - spatial uniform
  - pam::build
  - kmeans++
- Selecting the number of clusters:
  - silhouette method
## Benchmarks

- CPU: Intel® Core™ i5-9600KF @ 3.70GHz × 6
| KDTree library | build time (s) | 10K queries time (s) | build + queries (s) |
|---|---|---|---|
| KDTree (FFCL) | 0.000694 | 0.097632 | 0.098638 |
| KDTree (PCL) | 0.001181 | 0.119764 | 0.121376 |
| KDTree (FLANN) | 0.000558 | 0.125653 | 0.126366 |
| KDTree library | build speedup (%) | 10K queries speedup (%) | build + queries speedup (%) |
|---|---|---|---|
| FFCL over PCL | 41.235 | 18.480 | 18.733 |
| FFCL over FLANN | -24.326 | 22.300 | 21.943 |
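For reference, the speedup percentages are consistent with `100 * (t_baseline - t_ffcl) / t_baseline`: for instance, the build speedup over PCL is `100 * (0.001181 - 0.000694) / 0.001181 ≈ 41.2`.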
- Dataset: MNIST
- Initialization: random uniform

The following table summarizes the results of single runs made with different parameters. It provides a rough estimate of what one can expect from the present library. For a more in-depth analysis of the performance of the algorithms, refer to articles (1) and (2).
| algorithm | type | n_samples | n_features | n_centroids | n_iter (converged) | n_threads | computation time (s) |
|---|---|---|---|---|---|---|---|
| FasterPAM | float | 15,000 | 784 | 10 | 2 | 1 | 1.7 |
| FasterPAM | float | 15,000 | 784 | 100 | 4 | 1 | 4.5 |
| FasterPAM | float | 30,000 | 784 | 10 | 3 | 1 | 11.1 |
| FasterPAM | float | 30,000 | 784 | 100 | 4 | 1 | 17 |
| FasterMSC | float | 15,000 | 784 | 10 | 2 | 1 | 3.3 |
| FasterMSC | float | 15,000 | 784 | 100 | 2 | 1 | 6.4 |
| FasterMSC | float | 30,000 | 784 | 10 | 2 | 1 | 12.8 |
| FasterMSC | float | 30,000 | 784 | 100 | 3 | 1 | 27.1 |
| PairwiseDistanceMatrix | float | 15,000 | 784 | | | 1 | 40 |
| PairwiseDistanceMatrix | float | 30,000 | 784 | | | 1 | 162 |
| PairwiseDistanceMatrix | float | 15,000 | 784 | | | 6 | 6.27 |
| PairwiseDistanceMatrix | float | 30,000 | 784 | | | 6 | 25.6 |
## Installation

```sh
cd <your_repo>
git clone <this_repo>
mkdir build
cd build
cmake ..
make
```

This will create a `libcpp_clustering.a` file in the `build/` directory.
Generate the datasets automatically:

```sh
python3 ../bin/MakeClusteringDatasets.py
```

This will create the following files in `../bin/clustering/`:

```
.
├── inputs
│   ├── aniso.txt
│   ├── blobs.txt
│   ├── iris.txt
│   ├── mnist.txt
│   ├── noisy_circles.txt
│   ├── noisy_moons.txt
│   ├── no_structure.txt
│   ├── unbalanced_blobs.txt
│   └── varied.txt
└── targets
    ├── aniso.txt
    ├── blobs.txt
    ├── iris.txt
    ├── mnist.txt
    ├── noisy_circles.txt
    ├── noisy_moons.txt
    ├── no_structure.txt
    ├── unbalanced_blobs.txt
    └── varied.txt
```
The average default number of samples for the datasets is ~2000. This can be changed in the `MakeClusteringDatasets.py` script. MNIST can have up to 70000 samples and 784 features. Setting `n_samples_mnist=None` results in loading MNIST with the maximum number of samples.
Compile the benchmarks (this can also be done before generating the datasets, but the datasets are required to actually run them):

```sh
cmake .. -DMODE="benchmark"
make
```

This will create an `ffcl` executable in the `build` folder. To run the benchmarks, simply:

```sh
./ffcl
```
The results will be written in `../bin/clustering/` and the folder structure might now look like this:

```
├── centroids
│   ├── aniso.txt
<...>
│   └── varied.txt
├── inputs
│   ├── aniso.txt
<...>
│   └── varied.txt
├── predictions
│   ├── aniso.txt
<...>
│   └── varied.txt
└── targets
    ├── aniso.txt
<...>
    └── varied.txt
```
The clusters can be visualised:

```sh
python3 ../bin/plot.py
```
Compile the unit tests:

```sh
cmake .. -DMODE="gtest"
make
```

Run the gtests:

```sh
ctest -V
```
-DMODE="gtest"
-DMODE="benchmark"
-DVERBOSE=true
(*)-DTHREADS_ENABLED=true
(*)
```cpp
fs::path filename = "unbalanced_blobs.txt";

// flattened vector: std::vector<dType>
std::vector<dType> data = load_data<dType>(inputs_folder_ / filename, ' ');
// the number of features is needed to use the flattened vector
const std::size_t n_features = get_num_features_in_file(inputs_folder_ / filename);
const std::size_t n_samples  = common::get_n_samples(data.begin(), data.end(), n_features);

// the vector of indices that will be rearranged by the indexer: {0, 1, ..., n_samples-1}
auto indices = generate_indices(n_samples);

/*
Unfortunately these aliases are needed if you want to explicitly specify axis selection or splitting rule algorithms as options.
The default ones are currently based on a quickselect median selection strategy for the split and the highest variance with a sampling rate of 0.1 for the axis selection.
You can also implement your own custom policies.
    axis_selection_policy: HighestVarianceBuild (default), MaximumSpreadBuild, CycleThroughAxesBuild
    splitting_rule_policy: QuickselectMedianRange (default)
*/
using IndicesIterator         = decltype(indices)::iterator;
using SamplesIterator         = decltype(data)::iterator;
using OptionsType             = ffcl::datastruct::KDTree<IndicesIterator, SamplesIterator>::Options;
using AxisSelectionPolicyType = kdtree::policy::MaximumSpreadBuild<IndicesIterator, SamplesIterator>;
using SplittingRulePolicyType = kdtree::policy::QuickselectMedianRange<IndicesIterator, SamplesIterator>;

// KDTree will only rearrange the 'indices' container. The 'data' container remains unchanged.
auto kdtree = ffcl::datastruct::KDTree(indices.begin(),
                                       indices.end(),
                                       data.begin(),
                                       data.end(),
                                       n_features,
                                       OptionsType()
                                           .bucket_size(std::sqrt(n_samples))
                                           .max_depth(std::log2(n_samples))
                                           .axis_selection_policy(AxisSelectionPolicyType{}.feature_mask({0, 1}))
                                           .splitting_rule_policy(SplittingRulePolicyType{}));
```
```cpp
auto dbscan = ffcl::DBSCAN<dType>();

// radius: 'epsilon' in the DBSCAN paper
// min_samples_in_radius: 'minPts' in the DBSCAN paper
dbscan.set_options(ffcl::DBSCAN<dType>::Options().radius(2).min_samples_in_radius(5));

// returns a vector of labels, one per sample of the data underlying the kdtree:
// label[sample_index] <-> std::vector(data.begin() + sample_index * n_features,
//                                     data.begin() + sample_index * n_features + n_features)
const auto predictions = dbscan.predict(kdtree);
```
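As a concrete illustration of that mapping, one sample's features can be read back alongside its predicted label (a sketch, not part of the library's API; `sample_index` is an arbitrary in-range value):

```cpp
// Sketch: reading back one sample's features together with its predicted label.
// 'data', 'n_features' and 'predictions' come from the example above.
const std::size_t sample_index = 0;
const auto        label        = predictions[sample_index];
const std::vector<dType> sample(data.begin() + sample_index * n_features,
                                data.begin() + (sample_index + 1) * n_features);
```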
```cpp
using KMeans = ffcl::KMeans<float>;

const std::size_t n_features = 3;

// input_data.size() / n_features -> n_samples (4 in this case)
std::vector<float> input_data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// must be less than n_samples
const std::size_t n_centroids = 2;

const auto centroids_init =
    ffcl::kmeansplusplus::make_centroids(input_data.begin(), input_data.end(), n_features, n_centroids);

// initializing the centroids manually is optional
auto kmeans = KMeans(n_centroids, n_features, centroids_init);

const auto centroids = kmeans.fit(input_data.begin(), input_data.end());
```
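Mirroring the KMedoids workflow below, the fitted model can then map each sample to its closest centroid (the `predict` call on KMeans also appears in the silhouette example further down):

```cpp
// map each data sample to its closest centroid
const auto predictions = kmeans.predict(input_data.begin(), input_data.end());
```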
```cpp
using KMedoids = ffcl::KMedoids<float>;

const std::size_t n_features = 3;

// input_data.size() / n_features -> n_samples (4 in this case)
std::vector<float> input_data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// must be less than n_samples
const std::size_t n_medoids = 2;

auto kmedoids = KMedoids(n_medoids, n_features);

// swaps medoids and returns the centroids
const auto centroids = kmedoids.fit(input_data.begin(), input_data.end());
```
```cpp
#include "ffcl/kmedoids/KMedoids.hpp"

// SomeDataType: type of the training samples
// true: PrecomputePairwiseDistanceMatrix (set to true by default)
using KMedoids = ffcl::KMedoids<SomeDataType, true>;

const std::size_t n_features = 3;

// input_data.size() / n_features -> n_samples (4 in this case)
std::vector<float> input_data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// each element of the labels vector maps to a sample from input_data
std::vector<std::size_t> labels = {0, 0, 2, 1};
// 3 for classes 0, 1 and 2 from the labels vector
const auto n_classes = 3;

// must be less than n_samples
const std::size_t n_medoids = 2;

auto kmedoids = KMedoids(n_medoids, n_features);
// set the options. The ones presented in this example are the same as the defaults and change nothing
kmedoids.set_options(KMedoids::Options().max_iter(100).early_stopping(true).patience(0).n_init(1));

// fit the data (inputs_first, inputs_last) with a PAM algorithm. Default: ffcl::FasterPAM
const auto centroids = kmedoids.fit<ffcl::FasterMSC>(input_data.begin(), input_data.end());

// map each data sample to its medoid
const std::vector<std::size_t> predictions = kmedoids.predict(input_data.begin(), input_data.end());
```
```cpp
std::size_t k_min = 2;
std::size_t k_max = 10;

std::vector<float> scores(k_max - k_min);

// range of n_centroids/n_medoids in [2, 10[
for (std::size_t k = k_min; k < k_max; ++k) {
    // use any clustering algorithm that better suits your use case
    common::clustering::KMeans<dType> kmeans(k, n_features);
    // fit the centroids (or medoids if it was KMedoids)
    kmeans.fit(data.begin(), data.end());
    // map the samples to their closest centroid/medoid
    const auto predictions = kmeans.predict(data.begin(), data.end());
    // compute the silhouette score for each sample
    const auto samples_silhouette_values =
        math::heuristics::silhouette(data.begin(), data.end(), n_features, predictions.begin(), predictions.end());
    // get the average score
    const auto mean_silhouette_coefficient =
        math::heuristics::get_average_silhouette(samples_silhouette_values.begin(), samples_silhouette_values.end());
    // accumulate the current score
    scores[k - k_min] = mean_silhouette_coefficient;
}
// find the number of centroids/medoids k with the best average silhouette score
const auto best_k = k_min + math::statistics::argmax(scores.begin(), scores.end());
```