GregorySchwartz/find-clumpiness

find-clumpiness

Gregory W. Schwartz

Find the clumpiness of labels in a Haskell or JSON formatted tree.

Citation

Please cite this paper if you are using this program:

Using a novel clumpiness measure to unite data with metadata: finding common sequence patterns in immune receptor germline V genes

Installation

stack install find-clumpiness

Usage

JSON

Say we have a dendrogram where the leaves of the tree are labeled with species. We first write the tree in JSON, following this format:

test.JSON:

[{ "nodeID": "ID", "nodeLabels": [ "LABEL1", "LABEL2", etc. ] }, [RECURSION]]

where RECURSION is a list of { "nodeID": "ID", "nodeLabels": [ "LABEL1", "LABEL2", etc. ] }, [RECURSION] entries, and ID is a unique node ID for each node in the tree. By default, predefined IDs are ignored and automatically reset to the unique IDs 0, 1, and so on (to use the predefined IDs, pass -p). We can then find the clumpiness of each label with every other label with

cat test.JSON | find-clumpiness --format "JSON"
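As an illustration of the format above, here is a minimal Python sketch that builds a small tree and prints it as JSON. It assumes that each child in RECURSION is itself a two-element [node, children] array; the helper make_node, the node IDs, and the species labels are hypothetical and are not part of find-clumpiness.

```python
import json


def make_node(node_id, labels, children=()):
    """Return a tree as [node_object, [child_trees]] (assumed layout)."""
    node = {"nodeID": str(node_id), "nodeLabels": list(labels)}
    return [node, list(children)]


# Hypothetical dendrogram: unlabeled inner nodes, species labels on leaves.
tree = make_node("root", [], [
    make_node("a", [], [
        make_node("a1", ["mouse"]),
        make_node("a2", ["mouse"]),
    ]),
    make_node("b", [], [
        make_node("b1", ["human"]),
        make_node("b2", ["mouse"]),
    ]),
])

encoded = json.dumps(tree)
print(encoded)
```

The printed string could then be saved as test.JSON and piped to the command above.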

Note that when nodes carry multiple labels, the metric must treat them differently. The options are Exclusive, AllExclusive, and Majority. Exclusive ignores all nodes with more than one label. AllExclusive looks at all nodes, treating a node with multiple labels as having all of those labels. Majority converts a node with multiple labels to a single label by using the most frequent label, so a node with

"nodeLabels": [ "A", "A", "A", "B", "C" ]

would be converted to

"nodeLabels": [ "A" ]
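The Majority conversion can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the program's actual implementation; the function name majority is hypothetical, and how the tool breaks ties between equally frequent labels is not stated here, so the sketch simply takes whichever label Counter reports first.

```python
from collections import Counter


def majority(labels):
    """Collapse a node's labels to its single most frequent label."""
    if not labels:
        return []
    # most_common(1) returns [(label, count)] for the top label.
    # Tie-breaking here follows Counter's ordering, which is an assumption.
    top_label, _count = Counter(labels).most_common(1)[0]
    return [top_label]


print(majority(["A", "A", "A", "B", "C"]))
```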

This algorithm converts inner nodes (any non-leaf node) that have labels into leaves (unless -E is specified to ignore inner nodes), introducing a dummy node for the new leaf to attach to, so that the clumpiness algorithm, which looks only at leaves, can take those labels into account.

R JSON

As an example of how to analyze clumpiness from hclust in R, let's look at the clumpiness of the USArrests data, where the labels are the first letter of each state:

library(dendextend)
library(data.tree)
library(jsonlite)

# Get hclust tree.
hc = hclust(dist(USArrests), "ave")
# Get dendrogram.
dend = as.dendrogram(hc)
# Get first letters of states.
labels(dend) = substring(labels(dend), 1, 1)
# Get nicely formatted tree from dendrogram.
tree = as.Node(dend)
# Convert to JSON
toJSON(as.list(tree, mode = "explicit", unname = TRUE))

This JSON string can be passed directly to find-clumpiness with find-clumpiness --format RJSON -E.

The rest

All of the above options apply to the Haskell and Newick formats as well, but Haskell input should already be of the form Tree NodeLabel from this library.
