The evaluation of Community Discovery algorithms is not an easy task.
cdlib implements two families of evaluation strategies:

- Internal evaluation through fitness scores;
- External evaluation through partition comparison.

Moreover, cdlib integrates both standard synthetic network benchmarks and real networks with annotated ground truths, so that identified communities can be tested against them.
Finally, cdlib also provides a way to rank the clustering results that different algorithms produce over a given input graph.
.. note::

    The following lists are aligned to CD evaluation methods available in the GitHub main branch of cdlib.

Fitness functions summarize the characteristics of a computed set of communities. cdlib implements the following quality scores:

.. automodule:: cdlib.evaluation
.. autosummary::
    :toctree: generated/

    avg_distance
    avg_embeddedness
    average_internal_degree
    avg_transitivity
    conductance
    cut_ratio
    edges_inside
    expansion
    fraction_over_median_degree
    hub_dominance
    internal_edge_density
    normalized_cut
    max_odf
    avg_odf
    flake_odf
    scaled_density
    significance
    size
    surprise
    triangle_participation_ratio
    purity

Among the fitness functions, a well-defined family of measures is the modularity-based one:

.. autosummary::
    :toctree: generated/

    erdos_renyi_modularity
    link_modularity
    modularity_density
    modularity_overlap
    newman_girvan_modularity
    z_modularity

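As a minimal illustration of what a modularity-based score measures, the sketch below computes Newman-Girvan modularity in plain Python for a non-overlapping partition of an undirected, unweighted graph. It is not cdlib's implementation: the helper name ``modularity`` and the toy edge list are assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Newman-Girvan modularity of a non-overlapping partition.

    edges: iterable of (u, v) pairs of an undirected, unweighted graph
           without self-loops.
    communities: list of node sets covering every node exactly once.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")

    membership = {node: cid for cid, nodes in enumerate(communities) for node in nodes}
    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # sum of node degrees per community

    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1

    # Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2
    return sum(
        internal[c] / m - (degree[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_good = modularity(edges, [{0, 1, 2}, {3, 4, 5}])    # one community per triangle
q_single = modularity(edges, [{0, 1, 2, 3, 4, 5}])    # everything in one community
print(q_good, q_single)
```

Splitting the graph along the bridge scores higher than the trivial single-community partition, which always has modularity zero under this definition.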
Some measures return an instance of FitnessResult, which groups the min/max/mean/std values of the computed index.

.. autosummary::
    :toctree: generated/

    FitnessResult

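To make the idea of a per-community score summarized by min/max/mean/std concrete, here is a small plain-Python sketch. It computes the internal edge density of each community and aggregates the values. The ``Summary`` tuple, its field names, the helper functions, and the use of the population standard deviation are all assumptions made for illustration; they are not cdlib's own ``FitnessResult`` or its code.

```python
from collections import namedtuple
from statistics import mean, pstdev

# Hypothetical container mirroring the min/max/mean/std idea described above.
Summary = namedtuple("Summary", ["min", "max", "score", "std"])


def internal_edge_density(edges, community):
    """Fraction of possible intra-community edges that actually exist."""
    n = len(community)
    if n < 2:
        return 0.0
    inside = sum(1 for u, v in edges if u in community and v in community)
    return inside / (n * (n - 1) / 2)


def summarize(values):
    """Aggregate per-community values into min, max, mean and (population) std."""
    values = list(values)
    return Summary(min(values), max(values), mean(values), pstdev(values))


# Hypothetical toy graph: a triangle attached to a 4-node path.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
communities = [{0, 1, 2}, {3, 4, 5, 6}]

densities = [internal_edge_density(edges, c) for c in communities]
summary = summarize(densities)
print(summary)
```

Keeping the spread alongside the mean helps distinguish a partition with uniformly dense communities from one where a single dense community hides several sparse ones.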
It is often useful to compare different graph partitions to assess their resemblance.
cdlib implements the following partition comparison scores:

.. autosummary::
    :toctree: generated/

    adjusted_mutual_information
    mi
    rmi
    normalized_mutual_information
    overlapping_normalized_mutual_information_LFK
    overlapping_normalized_mutual_information_MGH
    variation_of_information
    rand_index
    adjusted_rand_index
    omega
    f1
    nf1
    southwood_index
    rogers_tanimoto_index
    sorensen_index
    dice_index
    czekanowski_index
    fowlkes_mallows_index
    jaccard_index
    sample_expected_sim
    overlap_quality
    geometric_accuracy
    classification_error
    ecs

Some measures return an instance of MatchingResult, which groups the mean and standard deviation values of the computed index.

.. autosummary::
    :toctree: generated/

    MatchingResult

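As a rough sketch of what a partition comparison score does, the plain-Python example below computes the (unadjusted) Rand index between two non-overlapping partitions of the same node set: the fraction of node pairs on which the two partitions agree. The function shown here is a standalone illustration with hypothetical toy partitions, not the library's ``rand_index`` implementation.

```python
from itertools import combinations


def rand_index(first, second):
    """Fraction of node pairs that both partitions treat the same way.

    first, second: lists of node sets, each covering the same nodes
    exactly once (non-overlapping partitions).
    """
    label_a = {node: cid for cid, nodes in enumerate(first) for node in nodes}
    label_b = {node: cid for cid, nodes in enumerate(second) for node in nodes}
    if set(label_a) != set(label_b):
        raise ValueError("both partitions must cover the same nodes")
    if len(label_a) < 2:
        raise ValueError("at least two nodes are required")

    agree = total = 0
    for u, v in combinations(list(label_a), 2):
        same_a = label_a[u] == label_a[v]
        same_b = label_b[u] == label_b[v]
        # A pair agrees when it is grouped together in both partitions
        # or separated in both partitions.
        agree += same_a == same_b
        total += 1
    return agree / total


# Hypothetical partitions of six nodes.
first = [{0, 1, 2}, {3, 4, 5}]
second = [{0, 1}, {2, 3, 4, 5}]
score = rand_index(first, second)
print(score)
```

The pairwise loop is quadratic in the number of nodes, which is fine for a toy example but is one reason practical implementations work from a contingency table instead.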
External evaluation scores can be fruitfully used to compare alternative clusterings of the same network and to assess to what extent an identified node clustering matches a known ground truth partition.
To facilitate such a standard evaluation task, cdlib exposes a set of standard synthetic network generators providing topological community ground truth annotations.
In particular, cdlib makes available benchmarks for:

- static community discovery;
- dynamic community discovery;
- feature-rich (i.e., node-attributed) community discovery.

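To show what a synthetic benchmark with a topological ground truth looks like in the static case, here is a minimal planted-partition generator in plain Python. It is a sketch under simple assumptions (equal-sized communities, independent edges, hypothetical function and parameter names) and is not one of the generators the library exposes.

```python
import random
from itertools import combinations


def planted_partition(n_communities, community_size, p_in, p_out, seed=None):
    """Return (edges, ground_truth) for a planted-partition graph.

    Nodes in the same community are linked with probability p_in,
    nodes in different communities with probability p_out.
    """
    rng = random.Random(seed)
    ground_truth = [
        set(range(c * community_size, (c + 1) * community_size))
        for c in range(n_communities)
    ]
    membership = {node: cid for cid, nodes in enumerate(ground_truth) for node in nodes}

    edges = []
    for u, v in combinations(range(n_communities * community_size), 2):
        p = p_in if membership[u] == membership[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, ground_truth


# Hypothetical parameters: three communities of five nodes each.
edges, ground_truth = planted_partition(3, 5, p_in=0.8, p_out=0.05, seed=42)
print(len(edges), ground_truth)
```

The returned ``ground_truth`` can then be compared against the output of any community discovery algorithm with an external evaluation score.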
All details can be found on the dedicated page.
.. toctree::
    :maxdepth: 1

    benchmark.rst

Although evaluating a topological partition against an annotated "semantic" one is not among the safest paths to follow [Peel17], cdlib natively integrates well-known medium-size network datasets with ground-truth communities.
Due to the non-negligible sizes of such datasets, we designed a simple API to gather them transparently from a dedicated remote repository.
All details on remote datasets can be found on the dedicated page.
.. toctree::
    :maxdepth: 1

    datasets.rst

[Peel17] Peel, Leto, Daniel B. Larremore, and Aaron Clauset. "The ground truth about metadata and community detection in networks." Science Advances 3.5 (2017): e1602548.