Set::Similarity - similarity measures for sets
use Set::Similarity::Dice;
# object method
my $dice = Set::Similarity::Dice->new;
my $similarity = $dice->similarity('Photographer','Fotograf');
# class method
my $dice = 'Set::Similarity::Dice';
my $similarity = $dice->similarity('Photographer','Fotograf');
# from 2-grams
my $width = 2;
my $similarity = $dice->similarity('Photographer','Fotograf',$width);
# from arrayref of tokens
my $similarity = $dice->similarity(['a','b'],['b']);
# from hashref of features
my $bird = {
wings => 1,
eyes => 1,
feathers => 1,
hairs => 0,
legs => 1,
arms => 0,
};
my $mammal = {
wings => 0,
eyes => 1,
feathers => 0,
hairs => 1,
legs => 1,
arms => 1,
};
my $similarity = $dice->similarity($bird,$mammal);
# from arrayref sets
my $bird = [qw(
wings
eyes
feathers
legs
)];
my $mammal = [qw(
eyes
hairs
legs
arms
)];
my $similarity = $dice->from_sets($bird,$mammal);
This is the base class including mainly helper and convenience methods.
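The overlap coefficient is the size of the intersection divided by the size of the smaller set, i.e.
( A intersect B ) / min(A,B)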
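As a rough illustration of the overlap formula above, the following stand-alone sketch computes it for two arrayrefs. The function name overlap_coefficient is hypothetical and not part of this distribution, and returning 0 for an empty set is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the module's own code.
sub overlap_coefficient {
    my ($set1, $set2) = @_;
    # Deduplicate both inputs so that sizes are set sizes.
    my %in_first  = map { $_ => 1 } @$set1;
    my %in_second = map { $_ => 1 } @$set2;
    # grep in scalar context counts the shared elements.
    my $common = grep { $in_second{$_} } keys %in_first;
    my $n1 = scalar keys %in_first;
    my $n2 = scalar keys %in_second;
    my $smaller = $n1 < $n2 ? $n1 : $n2;
    # Assumption: an empty set yields 0 instead of dividing by zero.
    return $smaller ? $common / $smaller : 0;
}

print overlap_coefficient([qw(a b c)], [qw(b c)]), "\n";   # 2 / min(3,2) = 1
```

A set that is fully contained in the other therefore scores 1, regardless of how much larger the other set is.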
The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets, i.e.
( A intersect B ) / (A union B)
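A minimal sketch of the Jaccard formula, assuming plain arrayrefs as input. The helper name jaccard is hypothetical, and the result for two empty sets (0 here) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the module's own code.
sub jaccard {
    my ($set1, $set2) = @_;
    my %seen;
    $seen{$_} |= 1 for @$set1;   # bit 1: element occurs in the first set
    $seen{$_} |= 2 for @$set2;   # bit 2: element occurs in the second set
    my $union  = scalar keys %seen;
    my $common = grep { $_ == 3 } values %seen;   # both bits set
    # Assumption: two empty sets yield 0.
    return $union ? $common / $union : 0;
}

# Shared: eyes, legs (2). Union: 6 distinct features.
printf "%.3f\n", jaccard([qw(wings eyes feathers legs)], [qw(eyes hairs legs arms)]);   # 0.333
```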
The Tanimoto coefficient is the ratio of the number of features common to both sets to the total number of features, i.e.
( A intersect B ) / ( A + B - ( A intersect B ) ) # the same as Jaccard
The range is 0 to 1 inclusive.
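The equivalence with Jaccard follows from the size of the union being A + B minus the size of the intersection. The short check below evaluates both forms on the same data; it assumes the two lists contain no duplicates, and all variable names are illustrative.

```perl
use strict;
use warnings;

my @bird   = qw(wings eyes feathers legs);
my @mammal = qw(eyes hairs legs arms);

my %in_bird = map { $_ => 1 } @bird;
my $common  = grep { $in_bird{$_} } @mammal;            # 2 shared features
my %union   = map { $_ => 1 } (@bird, @mammal);         # 6 distinct features

my $jaccard  = $common / scalar(keys %union);           # 2 / 6
my $tanimoto = $common / (@bird + @mammal - $common);   # 2 / (4 + 4 - 2)

print $jaccard == $tanimoto ? "same\n" : "different\n"; # same
```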
The Dice coefficient is the number of features common to both sets relative to the average size of the two sets, i.e.
( A intersect B ) / ( 0.5 ( A + B ) ) # the same as Sørensen
The weighting factor comes from the 0.5 in the denominator. The range is 0 to 1.
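Dividing by 0.5 ( A + B ) is the same as multiplying the intersection size by 2 and dividing by A + B. A stand-alone sketch of that calculation follows; the helper name dice is hypothetical, and returning 0 for two empty sets is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the module's own code.
sub dice {
    my ($set1, $set2) = @_;
    my %in_first  = map { $_ => 1 } @$set1;
    my %in_second = map { $_ => 1 } @$set2;
    my $common = grep { $in_second{$_} } keys %in_first;
    my $total  = scalar(keys %in_first) + scalar(keys %in_second);
    # Assumption: two empty sets yield 0.
    return $total ? 2 * $common / $total : 0;
}

printf "%.3f\n", dice(['a', 'b'], ['b']);   # 2 * 1 / (2 + 1) = 0.667
```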
All methods can be used as class or object methods.
$object = Set::Similarity->new();
my $similarity = $object->similarity($any1,$any2,$width);
$any can be an arrayref, a hashref or a string. Strings are tokenized into n-grams of width $width.
$width must be an integer, or defaults to 1.
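To make the n-gram tokenization concrete, here is a small stand-alone sketch of splitting a string into character n-grams. The function char_ngrams is hypothetical, and its handling of edge cases (an invalid width falls back to 1, a string shorter than the width is returned as a single token) is an assumption that may differ from the module's behaviour.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper; edge cases are assumptions.
sub char_ngrams {
    my ($word, $width) = @_;
    # Fall back to width 1 unless $width is a positive integer.
    $width = 1 unless defined $width && $width =~ /^[1-9][0-9]*$/;
    # Assumption: a string shorter than $width is returned as one token.
    return ($word) if length($word) < $width;
    # Slide a window of $width characters over the string.
    return map { substr($word, $_, $width) } 0 .. length($word) - $width;
}

print join(',', char_ngrams('abc')), "\n";      # a,b,c
print join(',', char_ngrams('abc', 2)), "\n";   # ab,bc
```

Larger widths make the comparison more sensitive to character order, since each token then carries some context from its neighbours.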
my $similarity = $object->from_tokens(['a','b'],['b']);
my $similarity = $object->from_sets(['a'],['b']);
Croaks if called directly. This method should be implemented in a child module.
my $intersection_size = $object->intersection(['a'],['b']);
my @uniq = $object->uniq(['a','b']);
Transforms an arrayref of strings into an array of unique elements.
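One common way to do such a transformation in Perl is a hash of seen elements. The sketch below is illustrative only: uniq_tokens is a hypothetical name, and keeping the order of first occurrence is a choice made here, not a documented property of uniq.

```perl
use strict;
use warnings;

# Hypothetical helper; keeps the first occurrence of each element.
sub uniq_tokens {
    my ($tokens) = @_;
    my %seen;
    # The post-increment is false only the first time an element is seen.
    return grep { !$seen{$_}++ } @$tokens;
}

my @uniq = uniq_tokens([qw(a b a c b)]);
print "@uniq\n";   # a b c
```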
my $set_size_sum = $object->combined_length(['a'],['b']);
my $min_set_size = $object->min(['a'],['b']);
my @monograms = $object->ngrams('abc');
my @bigrams = $object->ngrams('abc',2);
my $arrayref = $object->_any($any,$width);
Bag::Similarity does the same for bags or multisets.
Text::Levenshtein for distance measures of strings, and an overview of similar modules.
http://en.wikipedia.org/wiki/String_metric for an overview of similarity measures.
Cluster::Similarity for clusters.
http://github.com/wollmers/Set-Similarity
Helmut Wollmersdorfer, helmut@wollmersdorfer.at
Copyright (C) 2013-2020 by Helmut Wollmersdorfer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.