Gregory W. Schwartz
Find the clumpiness of labels in a Haskell or JSON formatted tree.
Please cite this paper if you are using this program
stack install find-clumpiness
Say we have a dendrogram where the leaves of the tree are labeled with species. We then write the tree in JSON format (here, in a file named test.JSON), following the rule:
[{ "nodeID": "ID", "nodeLabels": [ "LABEL1", "LABEL2", etc. ] }, [RECURSION]]
where RECURSION is a list of { "nodeID": "ID", "nodeLabels": [ "LABEL1", "LABEL2", etc. ] }, [RECURSION] entries, and ID is a unique node ID for each node in the tree. By default, predefined IDs are ignored and automatically reset to the unique IDs 0, 1, and so on (to use the predefined IDs, use -p). Then we can find the clumpiness of each label with every other label with
cat test.JSON | find-clumpiness --format "JSON"
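The format above can also be generated programmatically. The following is a minimal Python sketch, assuming that each child entry is itself a two-element array of the same shape as the top level (the description above does not spell this out). The helper name `node`, the node IDs, and the species labels are hypothetical and only serve as an illustration.

```python
import json


def node(node_id, labels, children=()):
    """Build one tree entry: [ {nodeID, nodeLabels}, [child entries] ]."""
    return [{"nodeID": node_id, "nodeLabels": list(labels)}, list(children)]


# A small dendrogram: unlabeled inner nodes, species labels on the leaves.
tree = node("0", [], [
    node("1", ["human"]),
    node("2", [], [
        node("3", ["mouse"]),
        node("4", ["mouse"]),
    ]),
])

# Serialize; this text could be saved as test.JSON.
text = json.dumps(tree)
print(text)
```

Writing `text` to test.JSON gives a file that can be piped into the command above.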
Note that when nodes have multiple labels, we must treat the metric differently. The options are Exclusive, AllExclusive, and Majority. Exclusive ignores all nodes with more than one label. AllExclusive looks at all nodes, treating a node with multiple labels as having all of those labels. Majority converts nodes with multiple labels to one label by using the most frequent label, so a node with
"nodeLabels": [ "A", "A", "A", "B", "C" ]
would be converted to
"nodeLabels": [ "A" ]
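To make the three options concrete, here is a rough Python sketch of how a single node's label list might be reduced under each one. It is an illustration only, not the program's implementation. The function name `resolve_labels` is hypothetical, and several details are assumptions: that "more than one label" counts list entries, that an ignored node is represented by an empty list, that AllExclusive keeps each distinct label once, and that ties under Majority go to the label seen first.

```python
from collections import Counter


def resolve_labels(labels, mode):
    """Reduce one node's label list according to the chosen option."""
    if mode == "Exclusive":
        # Assumption: a node with more than one entry is ignored (no labels).
        return list(labels) if len(labels) <= 1 else []
    if mode == "AllExclusive":
        # Assumption: keep every distinct label once, in order of appearance.
        return list(dict.fromkeys(labels))
    if mode == "Majority":
        if not labels:
            return []
        # Most frequent label; ties go to the label encountered first.
        return [Counter(labels).most_common(1)[0][0]]
    raise ValueError("unknown mode: " + mode)


labels = ["A", "A", "A", "B", "C"]
print(resolve_labels(labels, "Majority"))
print(resolve_labels(labels, "AllExclusive"))
print(resolve_labels(labels, "Exclusive"))
```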
This algorithm converts inner nodes (any non-leaf node) that have labels into leaves, unless -E is specified to ignore inner nodes. To do so, it introduces a dummy node for the labeled node to attach to, so that the clumpiness algorithm, which looks at leaves, can take those labels into account.
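One possible reading of this conversion is sketched below in Python, using the JSON-style tree shape described earlier. This is an interpretation rather than the program's actual code: it assumes the labeled inner node becomes a leaf carrying its labels, and that an unlabeled dummy node takes its old place as the parent of that leaf and of the original children. The function name `lift_inner_labels` and the "dummy-" ID prefix are hypothetical.

```python
def lift_inner_labels(tree):
    """Turn labeled inner nodes into leaves hanging off unlabeled dummy nodes.

    A tree is [ {"nodeID": ..., "nodeLabels": [...]}, [child trees] ].
    """
    info, children = tree
    new_children = [lift_inner_labels(child) for child in children]
    if children and info["nodeLabels"]:
        # The labeled inner node becomes a leaf that keeps its labels.
        leaf = [{"nodeID": info["nodeID"],
                 "nodeLabels": list(info["nodeLabels"])}, []]
        # A dummy node (hypothetical ID scheme) replaces it as the parent.
        dummy = {"nodeID": "dummy-" + str(info["nodeID"]), "nodeLabels": []}
        return [dummy, [leaf] + new_children]
    # Leaves and unlabeled inner nodes are copied unchanged.
    return [dict(info), new_children]


example = [{"nodeID": "0", "nodeLabels": ["A"]},
           [[{"nodeID": "1", "nodeLabels": ["B"]}, []]]]
converted = lift_inner_labels(example)
print(converted)
```

After the conversion every label sits on a leaf, which is the property a leaf-based metric needs.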
As an example of how to analyze clumpiness from hclust in R, let's look at the clumpiness of the USArrests data, where the labels are the first letter of each state:
library(dendextend)
library(data.tree)
library(jsonlite)
# Get hclust tree.
hc = hclust(dist(USArrests), "ave")
# Get dendrogram.
dend = as.dendrogram(hc)
# Get first letters of states.
labels(dend) = substring(labels(dend), 1, 1)
# Get nicely formatted tree from dendrogram.
tree = as.Node(dend)
# Convert to JSON
toJSON(as.list(tree, mode = "explicit", unname = TRUE))
This JSON string can be passed directly as input to find-clumpiness with
find-clumpiness --format RJSON -E
All of the above options apply to the Haskell and Newick formats as well, but Haskell input should already be of the form Tree NodeLabel from this library.