The evaluation of Community Discovery algorithms is not an easy task. cdlib
implements two families of evaluation strategies:
- Internal evaluation through fitness scores;
- External evaluation through partitions comparison.
Moreover, cdlib
integrates both standard synthetic network benchmarks and real networks with annotated ground truths, thus allowing for testing identified communities against ground-truths.
Finally, cdlib
also provides a way to rank clustering results generated by a set of algorithms over a given input graph.
Note
The following lists are aligned to CD evaluation methods available in the GitHub main branch of cdlib.
Fitness functions allows to summarize the characteristics of a computed set of communities. cdlib
implements the following quality scores:
cdlib.evaluation
avg_distance avg_embeddedness average_internal_degree avg_transitivity conductance cut_ratio edges_inside expansion fraction_over_median_degree hub_dominance internal_edge_density normalized_cut max_odf avg_odf flake_odf scaled_density significance size surprise triangle_participation_ratio purity
Among the fitness function a well-defined family of measures is the Modularity-based one:
erdos_renyi_modularity link_modularity modularity_density modularity_overlap newman_girvan_modularity z_modularity
Some measures will return an instance of FitnessResult
that takes together min/max/mean/std values of the computed index.
FitnessResult
It is often useful to compare different graph partition to assess their resemblance. cdlib
implements the following partition comparisons scores:
adjusted_mutual_information adjusted_rand_index f1 nf1 normalized_mutual_information omega overlapping_normalized_mutual_information_LFK overlapping_normalized_mutual_information_MGH variation_of_information
Some measures will return an instance of MatchingResult
that takes together mean and standard deviation values of the computed index.
MatchingResult
External evaluation scores can be fruitfully used to compare alternative clusterings of the same network, but also to asses to what extent an identified node clustering matches a known ground truth partition.
To facilitate such standard evaluation task, cdlib
exposes a set of standard synthetic network generators providing topological community ground truth annotations.
In particular, cdlib
make available benchmarks for:
- static community discovery;
- dynamic community discovery;
- feature-rich (i.e., node-attributed) community discovery.
All details can be found in the dedicated page.
benchmark.rst
Although evaluating a topological partition against an annotated "semantic" one is not among the safest path to follow [Peel17], cdlib
natively integrates well-known medium-size network datasets with ground-truth communities.
Due to the non-negligible sizes of such datasets, we designed a simple API to gather them from a dedicated remote repository transparently.
All details on remote datasets can be found on the dedicated page.
datasets.rst
Once a set of alternative clusterings have been extracted from a given network, is there a way to select the best one given a set of target fitness functions?
cdlib
exposes a few standard techniques to address such an issue: all details can be found in the dedicated documentation page.
validation.rst
- Peel17
Peel, Leto, Daniel B. Larremore, and Aaron Clauset. "The ground truth about metadata and community detection in networks." Science advances 3.5 (2017): e1602548.