find-clumpiness

Gregory W. Schwartz

Find the clumpiness of labels in a Haskell or JSON formatted tree.

Citation

Please cite this paper if you are using this program

Using a novel clumpiness measure to unite data with metadata: finding common sequence patterns in immune receptor germline V genes

Installation

stack install find-clumpiness

Usage

JSON

Say we have a dendrogram where the leaves of the tree are labeled with species. Then we put our tree in JSON format, where we follow the rules of:

test.JSON [{ "nodeID": "ID", "nodeLabels": [ "LABEL1", "LABEL2", etc. ] },
[RECURSION]]

where RECURSION is a list of { "nodeID": "ID", "nodeLabels": [ "LABEL1", "LABEL2", etc. ] }, [RECURSION] entries, and ID is a unique node ID for each node in the tree. By default, predefined IDs are ignored and are automatically reset to be 0,1,.. etc. unique IDs (to use the predefined IDs, use -p). Then we can find the clumpiness of each label with every other label with

cat test.JSON | find-clumpiness --format "JSON"

Note that with multiple labels, we must treat the metric differently. The options are Exclusive, AllExclusive, and Majority. Exclusive ignores all nodes with more than one label, AllExclusive looks at all nodes, treating a node with multiple labels as having all of those labels, and Majority converts nodes with multiple labels to one label by using the most frequent label, so a node with

"nodeLabels": [ "A", "A", "A", "B", "C" ]

would be converted to

"nodeLabels": [ "A" ]

This algorithm converts inner nodes (any non-leaf node) that have labels into leaves (unless -E is specified to ignore inner nodes), introducing a dummy node to attach itself to so the clumpiness algorithm, which looks at leaves, can do its thing.

R JSON

As an example for how to analyze clumpiness from hclust in R, let’s look at the clumpiness of the USArrests data, where the labels are the first letter of each state:

library(dendextend)
library(data.tree)
library(jsonlite)

# Get hclust tree.
hc = hclust(dist(USArrests), "ave")
# Get dendrogram.
dend = as.dendrogram(hc)
# Get first letters of states.
labels(dend) = substring(labels(dend), 1, 1)
# Get nicely formatted tree from dendrogram.
tree = as.Node(dend)
# Convert to JSON
toJSON(as.list(tree, mode = "explicit", unname = TRUE))

This JSON string can be directly inputted into find-clumpiness with find-clumpiness --format RJSON -E.

The rest

All of the above options apply to Haskell and Newick as well, but Haskell should already be of the form Tree NodeLabel from this library.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
app		app
src		src
LICENSE		LICENSE
README.org		README.org
Setup.hs		Setup.hs
find-clumpiness.cabal		find-clumpiness.cabal
stack.yaml		stack.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

app

app

src

src

LICENSE

LICENSE

README.org

README.org

Setup.hs

Setup.hs

find-clumpiness.cabal

find-clumpiness.cabal

stack.yaml

stack.yaml

Repository files navigation

find-clumpiness

Citation

Installation

Usage

JSON

R JSON

The rest

About

Releases

Packages

Languages

License

GregorySchwartz/find-clumpiness

Folders and files

Latest commit

History

Repository files navigation

find-clumpiness

Citation

Installation

Usage

JSON

R JSON

The rest

About

Resources

License

Stars

Watchers

Forks

Languages