# FFCL: Flexible and (probably not the) fast(est) C++ clustering library
## TODO
- Performance tests comparing `ffcl` to the most popular libraries (done for KDTrees with PCL and FLANN, and for DBSCAN against sklearn).
- Some optimizations for the nearest neighbors computations of KMeans and KMedoids. I might enable KDTree acceleration, but it's not beneficial unless a large number of centroids/medoids is needed, so it's not a priority for now.
- Proper unit testing (update: all the generic code is now unit tested)
- OPTICS: I might do it if it can be done very quickly. Otherwise I won't, since HDBSCAN* is supposed to be better.
- HDBSCAN* (ongoing):
  - MST with Boruvka's algorithm: done.
  - Single linkage cluster tree: done. Dendrogram python plots are also done.
  - Condensed cluster tree: done.
  - HDBSCAN*: assembly of the aforementioned parts is done.
  - The performance of HDBSCAN* can be improved by using a dual tree traversal instead of the single tree traversal currently used to build the MST.
  - A few options to add, such as `cluster_selection_epsilon`, `max_cluster_size` and `alpha`, to conform with the python hdbscan library's features.
  - Dual tree traversal: done for dual set closest edge. The single tree method in Boruvka's algorithm still needs to be replaced with the dual tree one.
## Latest features
- HDBSCAN:
  - Dual tree traversal for dual set closest edge (not added to Boruvka's algorithm yet).
## Current features

- HDBSCAN:
  - CondensedClusterTree
  - SingleLinkageTree
  - BoruvkasAlgorithm
- KMedoids:
  - FasterMSC paper | author's repo
  - FasterPAM paper | author's repo
- KMeans:
  - Lloyd
  - Hamerly paper | authors' repo
- DBSCAN:
  - `KDTree` is currently used for `DBSCAN`.
  - Options:
    - `bucket_size(ssize_t)`: the maximum number of samples that a leaf node can contain.
    - `max_depth(ssize_t)`: the maximum recursion depth of the tree. `max_depth` takes priority over `bucket_size`, meaning that if the maximum recursion depth cannot be satisfied without violating the bucket size condition, the latter may not be fulfilled (see the sketch after this list).
    - `axis_selection_policy`: `HighestVarianceBuild` (default), `MaximumSpreadBuild` or `CycleThroughAxesBuild`. Custom policies can be implemented and used.
      - `feature_mask({feature_index_0, feature_index_1, ..., feature_index_n})`: dynamic feature mask used to select the features of interest for the axis selection procedure during the build phase. The feature indices can be specified in any order, as long as `0 <= feature_index_j < n_features`. Duplicate feature indices should still produce valid results, but they would bias the axis selection procedure and thus defeat its purpose.
    - `splitting_rule_policy`: `QuickselectMedianRange` (default), a quickselect median selection strategy that partitions the children leaves around the pivot median value of the binary tree.
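To make the `max_depth` / `bucket_size` interaction concrete, here is a minimal sketch (not from the library's documentation; the iterator aliases and numeric values are arbitrary assumptions, and the ffcl KDTree header is assumed to be included):

```cpp
#include <cstddef>
#include <vector>

// Sketch only: the aliases mirror the KDTree usage example further down.
using IndicesIterator = std::vector<std::size_t>::iterator;
using SamplesIterator = std::vector<float>::iterator;
using OptionsType     = ffcl::datastruct::KDTree<IndicesIterator, SamplesIterator>::Options;

// With max_depth(3), splitting stops after 3 levels even if some leaves still
// hold more than bucket_size(40) samples: max_depth wins over bucket_size.
const auto options = OptionsType().bucket_size(40).max_depth(3);
```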
- Options: `LowerTriangleMatrixPrecomputedPairwiseDistances`, currently used for the pairwise distance values that can be buffered for `KMedoids` (see the indexing sketch below).
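The class name suggests the usual packed lower-triangle layout; as an illustration only (not FFCL's actual API), the strict lower triangle of a symmetric pairwise distance matrix can be flattened like this:

```cpp
#include <cstddef>

// Illustration only, not FFCL's API: index of distance(i, j), with i > j,
// in a flat buffer that stores the strict lower triangle row by row.
constexpr std::size_t packed_index(std::size_t i, std::size_t j) {
    return i * (i - 1) / 2 + j;
}

// A dataset of n samples needs n * (n - 1) / 2 buffered distance values.
constexpr std::size_t packed_size(std::size_t n) {
    return n * (n - 1) / 2;
}
```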
- Distances (see the Euclidean sketch below):
  - euclidean
  - manhattan
  - cosine similarity
  - levenshtein
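As a reference for what these metrics operate on, here is a generic sketch of the Euclidean distance over two flattened feature ranges (not FFCL's actual implementation):

```cpp
#include <cmath>
#include <iterator>

// Generic sketch, not FFCL's implementation: Euclidean distance between two
// equal-length feature ranges [first1, last1) and [first2, ...).
template <typename Iterator>
auto euclidean_distance(Iterator first1, Iterator last1, Iterator first2) {
    typename std::iterator_traits<Iterator>::value_type sum = 0;
    for (; first1 != last1; ++first1, ++first2) {
        const auto diff = *first1 - *first2;
        sum += diff * diff;
    }
    return std::sqrt(sum);
}
```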
- Centroid/medoid initialization:
  - random uniform
  - spatial uniform
  - pam::build
  - kmeans++
- Selecting the number of clusters:
  - silhouette method
## Benchmarks

- CPU: Intel® Core™ i5-9600KF @ 3.70GHz × 6
| KDTree library | build time (s) | 10K queries time (s) | build + queries (s) |
|---|---|---|---|
| KDTree (FFCL) | 0.000694 | 0.097632 | 0.098638 |
| KDTree (PCL) | 0.001181 | 0.119764 | 0.121376 |
| KDTree (FLANN) | 0.000558 | 0.125653 | 0.126366 |
| KDTree library | build speedup (%) | 10K queries speedup (%) | build + queries speedup (%) |
|---|---|---|---|
| FFCL over PCL | 41.235 | 18.480 | 18.733 |
| FFCL over FLANN | -24.326 | 22.300 | 21.943 |
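For reference, the speedup percentages are consistent with `100 * (t_baseline - t_ffcl) / t_baseline`: for instance, the build speedup over PCL is `100 * (0.001181 - 0.000694) / 0.001181 ≈ 41.2`.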
- Dataset: MNIST
- Initialization: random uniform

The following table summarizes the results of single runs made with different parameters. It provides a rough estimate of what one can expect from the present library. For a more in-depth analysis of the performance of the algorithms, refer to articles (1) and (2).
| algorithm | type | n_samples | n_features | n_centroids | n_iter (converged) | n_threads | computation time (s) |
|---|---|---|---|---|---|---|---|
| FasterPAM | float | 15,000 | 784 | 10 | 2 | 1 | 1.7 |
| FasterPAM | float | 15,000 | 784 | 100 | 4 | 1 | 4.5 |
| FasterPAM | float | 30,000 | 784 | 10 | 3 | 1 | 11.1 |
| FasterPAM | float | 30,000 | 784 | 100 | 4 | 1 | 17 |
| FasterMSC | float | 15,000 | 784 | 10 | 2 | 1 | 3.3 |
| FasterMSC | float | 15,000 | 784 | 100 | 2 | 1 | 6.4 |
| FasterMSC | float | 30,000 | 784 | 10 | 2 | 1 | 12.8 |
| FasterMSC | float | 30,000 | 784 | 100 | 3 | 1 | 27.1 |
| PairwiseDistanceMatrix | float | 15,000 | 784 | | | 1 | 40 |
| PairwiseDistanceMatrix | float | 30,000 | 784 | | | 1 | 162 |
| PairwiseDistanceMatrix | float | 15,000 | 784 | | | 6 | 6.27 |
| PairwiseDistanceMatrix | float | 30,000 | 784 | | | 6 | 25.6 |
## Installation

```sh
cd <your_repo>
git clone <this_repo>
mkdir build
cd build
cmake ..
make
```

This will create a `libcpp_clustering.a` file in the `build/` directory.
Generate the datasets automatically:

```sh
python3 ../bin/MakeClusteringDatasets.py
```

This will create the following files in `../bin/clustering/`:

```
.
├── inputs
│   ├── aniso.txt
│   ├── blobs.txt
│   ├── iris.txt
│   ├── mnist.txt
│   ├── noisy_circles.txt
│   ├── noisy_moons.txt
│   ├── no_structure.txt
│   ├── unbalanced_blobs.txt
│   └── varied.txt
└── targets
    ├── aniso.txt
    ├── blobs.txt
    ├── iris.txt
    ├── mnist.txt
    ├── noisy_circles.txt
    ├── noisy_moons.txt
    ├── no_structure.txt
    ├── unbalanced_blobs.txt
    └── varied.txt
```
The average default number of samples for the datasets is ~2000. This can be changed in the `MakeClusteringDatasets.py` script. MNIST can have up to 70000 samples and 784 features. Setting `n_samples_mnist=None` results in loading MNIST with the maximum number of samples.
Compile the benchmarks (this can also be done before generating the datasets, but the datasets are required to actually run them):

```sh
cmake .. -DMODE="benchmark"
make
```

This will create an `ffcl` executable in the `build` folder. To run the benchmarks, simply:

```sh
./ffcl
```
The results will be written in `../bin/clustering/` and the folder structure might now look like this:

```
├── centroids
│   ├── aniso.txt
<...>
│   └── varied.txt
├── inputs
│   ├── aniso.txt
<...>
│   └── varied.txt
├── predictions
│   ├── aniso.txt
<...>
│   └── varied.txt
└── targets
    ├── aniso.txt
<...>
    └── varied.txt
```
The clusters can be visualised:

```sh
python3 ../bin/plot.py
```
Compile the unit tests:

```sh
cmake .. -DMODE="gtest"
make
```

Run the gtests:

```sh
ctest -V
```
-DMODE="gtest"
-DMODE="benchmark"
-DVERBOSE=true
(*)-DTHREADS_ENABLED=true
(*)
```cpp
fs::path filename = "unbalanced_blobs.txt";

// flattened vector: std::vector<dType>
std::vector<dType> data = load_data<dType>(inputs_folder_ / filename, ' ');
// the number of features is needed to use the flattened vector
const std::size_t n_features = get_num_features_in_file(inputs_folder_ / filename);
const std::size_t n_samples  = common::get_n_samples(data.begin(), data.end(), n_features);

// the vector of indices that will be rearranged by the indexer: {0, 1, ..., n_samples-1}
auto indices = generate_indices(n_samples);

/*
Unfortunately these aliases are needed if you want to explicitly specify axis selection or splitting rule algorithms as options.
The default ones are currently based on a quickselect median selection strategy for the split and the highest variance with a sampling rate of 0.1 for the axis selection.
You can also implement your own custom policies.
    axis_selection_policy: HighestVarianceBuild (default), MaximumSpreadBuild, CycleThroughAxesBuild
    splitting_rule_policy: QuickselectMedianRange (default)
*/
using IndicesIterator         = decltype(indices)::iterator;
using SamplesIterator         = decltype(data)::iterator;
using OptionsType             = ffcl::datastruct::KDTree<IndicesIterator, SamplesIterator>::Options;
using AxisSelectionPolicyType = kdtree::policy::MaximumSpreadBuild<IndicesIterator, SamplesIterator>;
using SplittingRulePolicyType = kdtree::policy::QuickselectMedianRange<IndicesIterator, SamplesIterator>;

// KDTree will only rearrange the 'indices' container. The 'data' container remains unchanged.
auto kdtree = ffcl::datastruct::KDTree(indices.begin(),
                                       indices.end(),
                                       data.begin(),
                                       data.end(),
                                       n_features,
                                       OptionsType()
                                           .bucket_size(std::sqrt(n_samples))
                                           .max_depth(std::log2(n_samples))
                                           .axis_selection_policy(AxisSelectionPolicyType{}.feature_mask({0, 1}))
                                           .splitting_rule_policy(SplittingRulePolicyType{}));
```
```cpp
auto dbscan = ffcl::DBSCAN<dType>();

// radius: 'epsilon' in the DBSCAN paper
// min_samples_in_radius: 'minPts' in the DBSCAN paper
dbscan.set_options(ffcl::DBSCAN<dType>::Options().radius(2).min_samples_in_radius(5));

// returns a vector of labels, one per sample of the data underlying the kdtree:
// label[sample_index] <-> std::vector(data.begin() + sample_index * n_features,
//                                     data.begin() + sample_index * n_features + n_features)
const auto predictions = dbscan.predict(kdtree);
```
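As a concrete illustration of that mapping, one sample's features can be read back alongside its predicted label (a sketch, not part of the library's API; `sample_index` is an arbitrary in-range value):

```cpp
// Sketch: reading back one sample's features together with its predicted label.
// 'data', 'n_features' and 'predictions' come from the example above.
const std::size_t sample_index = 0;
const auto        label        = predictions[sample_index];
const std::vector<dType> sample(data.begin() + sample_index * n_features,
                                data.begin() + (sample_index + 1) * n_features);
```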
```cpp
using KMeans = ffcl::KMeans<float>;

const std::size_t n_features = 3;

// input_data.size() / n_features -> n_samples (4 in this case)
std::vector<float> input_data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// must be less than n_samples
const std::size_t n_centroids = 2;

const auto centroids_init =
    ffcl::kmeansplusplus::make_centroids(input_data.begin(), input_data.end(), n_features, n_centroids);

// initializing the centroids manually is optional
auto kmeans = KMeans(n_centroids, n_features, centroids_init);

const auto centroids = kmeans.fit(input_data.begin(), input_data.end());
```
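Mirroring the KMedoids workflow below, the fitted model can then map each sample to its closest centroid (the `predict` call on KMeans also appears in the silhouette example further down):

```cpp
// map each data sample to its closest centroid
const auto predictions = kmeans.predict(input_data.begin(), input_data.end());
```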
```cpp
using KMedoids = ffcl::KMedoids<float>;

const std::size_t n_features = 3;

// input_data.size() / n_features -> n_samples (4 in this case)
std::vector<float> input_data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// must be less than n_samples
const std::size_t n_medoids = 2;

auto kmedoids = KMedoids(n_medoids, n_features);

// swaps medoids and returns the centroids
const auto centroids = kmedoids.fit(input_data.begin(), input_data.end());
```
```cpp
#include "ffcl/kmedoids/KMedoids.hpp"

// SomeDataType: type of the training samples
// true: PrecomputePairwiseDistanceMatrix (set to true by default)
using KMedoids = ffcl::KMedoids<SomeDataType, true>;

const std::size_t n_features = 3;

// input_data.size() / n_features -> n_samples (4 in this case)
std::vector<float> input_data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// each element of the labels vector maps to a sample from input_data
std::vector<std::size_t> labels = {0, 0, 2, 1};
// 3 for classes 0, 1 and 2 from the labels vector
const auto n_classes = 3;

// must be less than n_samples
const std::size_t n_medoids = 2;

auto kmedoids = KMedoids(n_medoids, n_features);
// set the options. The ones presented in this example are the same as the defaults and change nothing
kmedoids.set_options(KMedoids::Options().max_iter(100).early_stopping(true).patience(0).n_init(1));

// fit the data (inputs_first, inputs_last) with a PAM algorithm. Default: ffcl::FasterPAM
const auto centroids = kmedoids.fit<ffcl::FasterMSC>(input_data.begin(), input_data.end());

// map each data sample to its medoid
const std::vector<std::size_t> predictions = kmedoids.predict(input_data.begin(), input_data.end());
```
```cpp
std::size_t k_min = 2;
std::size_t k_max = 10;

std::vector<float> scores(k_max - k_min);

// range of n_centroids/n_medoids in [2, 10[
for (std::size_t k = k_min; k < k_max; ++k) {
    // use any clustering algorithm that better suits your use case
    common::clustering::KMeans<dType> kmeans(k, n_features);
    // fit the centroids (or medoids if it was KMedoids)
    kmeans.fit(data.begin(), data.end());
    // map the samples to their closest centroid/medoid
    const auto predictions = kmeans.predict(data.begin(), data.end());
    // compute the silhouette score for each sample
    const auto samples_silhouette_values =
        math::heuristics::silhouette(data.begin(), data.end(), n_features, predictions.begin(), predictions.end());
    // get the average score
    const auto mean_silhouette_coefficient =
        math::heuristics::get_average_silhouette(samples_silhouette_values.begin(), samples_silhouette_values.end());
    // accumulate the current score
    scores[k - k_min] = mean_silhouette_coefficient;
}
// find the number of centroids/medoids k with the best average silhouette score
const auto best_k = k_min + math::statistics::argmax(scores.begin(), scores.end());
```