Skip to content

GregorySchwartz/hierarchical-spectral-clustering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

hierarchical-spectral-clustering

See https://github.com/GregorySchwartz/hierarchical-spectral-clustering for latest version.

Description

hierarchical-spectral-clustering is a program (cluster-tree) and library for hierarchical spectral clustering of sparse and dense matrices. Outputted JSON trees can be used with birch-beer for pretty trees!

Installation

Install stack

See https://docs.haskellstack.org/en/stable/README/ for more details.

curl -sSL https://get.haskellstack.org/ | sh
stack setup

Install hierarchical-spectral-clustering

Source

Probably easiest if you don’t want to mess with dependencies:

git clone https://github.com/GregorySchwartz/hierarchical-spectral-clustering.git
cd hierarchical-spectral-clustering
stack install

Online

We only require stack (or cabal), you do not need to download any source code (but you might need the stack.yaml dependency versions), just run the following command to place cluster-tree in your ~/.local/bin/:

stack install hierarchical-spectral-clustering

If you run into errors like Error: While constructing the build plan, the following exceptions were encountered:, then follow it’s advice. Usually you just need to follow the suggestion and add the dependencies to the specified file. For a quick yaml configuration, refer to https://github.com/GregorySchwartz/hierarchical-spectral-clustering/blob/master/stack.yaml. Relies on eigen-3.3.4.1 right now.

Example usage

We first need our data in the format of “vertex,vertex,value”, such that a vertex “x” connects with vertex “y” with an edge weight of value. This edge list can be used as input into the program. Make sure there is a newline at the end, otherwise, the program may expect more from stdin.

a,b,1
a,c,2
a,d,4.5
b,d,2.1
c,d,1.1
e,f,1

Then we can give this file to the program:

cat input.csv | cluster-tree --clustering-type Dense
| item | cluster |
| a    |       1 |
| b    |       1 |
| c    |       1 |
| d    |       1 |
| e    |       2 |
| f    |       2 |

As a more advanced example, we can get:

  • Output clustering
  • Output tree structure in JSON format.
  • First isolate each connected component before clustering and generating their own trees.
  • Use k-means instead of sign on the eigenvectors to determine cluster assignment.
  • Use k-means on two eigenvectors rather than one.
cat advanced_input.csv | cluster-tree --output-tree "output_trees/tree.json" --clustering-type Dense --eigen-group KMeansGroup --num-eigen 2 --separate-components > "clusters.csv"

Documentation

cluster-tree -h
cluster-tree, Gregory W. Schwartz. Hierarchical spectral clustering of data
Computes real symmetric part of matrix, so ensure the input is real and
symmetric. Diagonal should be 0s for adjacency matrix. Format is
row,column,value with no header. Must end with a newline.

Usage: cluster-tree [-c|--clustering-type STRING] [-d|--delimiter CHAR]
                    [-S|--min-size INT] [-n|--num-eigen INT]
                    [-m|--min-modularity DOUBLE] [-e|--eigen-group STRING]
                    [-s|--separate-components] [-o|--output-tree STRING]

Available options:
  -h,--help                Show this help text
  -c,--clustering-type STRING
                           ([Sparse] | Dense) Method for clustering data.
  -d,--delimiter CHAR      ([,] | CHAR) The delimiter of the CSV file. Format is
                           row,column,value with no header.
  -S,--min-size INT        ([Nothing] | INT) Minimum size of a cluster.
  -n,--num-eigen INT       ([1] | INT) Number of eigenvectors to use while
                           clustering with kmeans. Takes from the first
                           eigenvector. Recommended to start at 2 and work up
                           from there if needed.
  -m,--min-modularity DOUBLE
                           ([0] | DOUBLE) Minimum modularity to be over to
                           continue recursion.
  -e,--eigen-group STRING  ([SignGroup] | KMeansGroup) Whether to group the
                           eigenvector using the sign or kmeans while
                           clustering. While the default is sign, kmeans may be
                           more accurate (but starting points are arbitrary).
  -s,--separate-components Whether to first separate connected components of the
                           graph first. Will output a dendrogram for each
                           component with the name of the tree and the number of
                           nodes within the tree, along with the base set by
                           --output-tree.
  -o,--output-tree STRING  ([Nothing] | FILE) The name of the file to output the
                           tree in JSON format.

About

Hierarchical spectral clustering of a graph.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published