Clustering algorithms for Clojure
Two clustering algorithms for Clojure: k-means and hierarchical.
Usage
(ns my-namespace (:use cluster))
k-means
Use the k-means algorithm like so:
;; kcluster --
;; :vectors - a sequence of vectors which you want clustered
;; :count - number of clusters to find
;; :range-start - lower limit for the randomized cluster nodes
;; :range-end - upper limit for the randomized cluster nodes
(kcluster [[1 2 3] [3 4 5] [5 6 7]] 2 0 7)
range-start and range-end may need a bit of clarification. A k-means algorithm works by randomly
placing a number of cluster nodes among the vectors you want clustered, then moving those nodes until
each one sits at the center of a cluster. Those random nodes need upper and lower limits. Usually these
are just the highest and lowest possible values for numbers in the vectors you’re clustering.
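The procedure described above can be sketched in a few functions. This is a minimal illustration, not the code in cluster.clj: the names distance, nearest, mean, step, converge and random-centroids are made up for this example, and it measures closeness with Euclidean distance, whereas the library appears to use Pearson (see Known Bugs).

```clojure
;; Illustrative sketch only; none of these functions are part of cluster.clj.

(defn distance
  "Euclidean distance between two equal-length numeric vectors."
  [a b]
  (Math/sqrt (double (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))))

(defn nearest
  "Index of the centroid closest to vector v."
  [centroids v]
  (first (apply min-key second
                (map-indexed (fn [i c] [i (distance c v)]) centroids))))

(defn mean
  "Component-wise mean of a non-empty sequence of vectors."
  [vs]
  (mapv #(/ % (count vs)) (apply map + vs)))

(defn step
  "One iteration: assign each vector to its nearest centroid, then move
  every centroid to the mean of its members. A centroid with no members
  stays where it is."
  [vectors centroids]
  (let [groups (group-by #(nearest centroids (nth vectors %))
                         (range (count vectors)))]
    (vec (map-indexed
          (fn [i c]
            (if-let [members (groups i)]
              (mean (map #(nth vectors %) members))
              c))
          centroids))))

(defn converge
  "Repeat step until the centroids stop moving or max-iters is reached."
  [vectors centroids max-iters]
  (loop [cs centroids, i 0]
    (let [next-cs (step vectors cs)]
      (if (or (= next-cs cs) (>= i max-iters))
        next-cs
        (recur next-cs (inc i))))))

(defn random-centroids
  "k random points with every coordinate in [range-start, range-end)."
  [k dims range-start range-end]
  (vec (repeatedly k
                   (fn []
                     (vec (repeatedly dims
                                      #(+ range-start
                                          (rand (- range-end range-start)))))))))
```

With these definitions, (converge [[1 2 3] [3 4 5] [5 6 7]] (random-centroids 2 3 0 7) 100) plays roughly the role of the kcluster call above, with 0 and 7 bounding the random starting points.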
The return value of kcluster is a tuple. The first value is a sequence of vectors, one per cluster, each
containing the indices of the input vectors assigned to that cluster. So if you passed in five vectors the
first return value might look like: [[0 3 4] [1 2]]. The second value contains the final vectors for the
cluster nodes.
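Since the first value holds indices rather than the vectors themselves, a small helper can map them back. The sketch below does not call kcluster: sample-vectors and sample-result are hand-written data shaped like the description above, and indices->vectors is a hypothetical helper, not part of the library.

```clojure
(defn indices->vectors
  "Replace each index in clusters with the input vector it refers to."
  [vectors clusters]
  (mapv (fn [idxs] (mapv #(nth vectors %) idxs)) clusters))

;; Hand-written stand-ins for five input vectors and a kcluster-shaped result.
(def sample-vectors [[1 1] [9 9] [8 9] [1 2] [2 1]])
(def sample-result  [[[0 3 4] [1 2]]          ; indices per cluster
                     [[4/3 4/3] [17/2 9]]])   ; final cluster nodes

;; Destructure the tuple, then look up the clustered vectors.
(let [[clusters centroids] sample-result]
  (indices->vectors sample-vectors clusters))
;; => [[[1 1] [1 2] [2 1]] [[9 9] [8 9]]]
```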
Hierarchical
;; hcluster --
;; :nodes - a sequence of maps in the form: { :vec [1 2 3] }
(hcluster [{:vec [1 2 3]} {:vec [3 4 5]} {:vec [7 9 9]}])
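The return value of hcluster is a tree of maps. Each leaf is one of the input maps; each branch has :left and :right children plus a :vec of its own, which in this example is the average of its children's vectors. It might look something like this, for the above input: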
{:vec (9/2 6 13/2),
 :right {:vec [7 9 9]},
 :left {:right {:vec [3 4 5]},
        :left {:vec [1 2 3]},
        :vec (2 3 4)}}
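A tree in this shape is easy to walk with plain recursion. The sketch below assumes every branch has both :left and :right, as in the example output; leaf?, leaves and average are hypothetical helpers for illustration, not functions exported by the library. The merged :vec values are written as vectors here, which compare equal to the printed lists.

```clojure
(defn leaf?
  "True when a node has no children, i.e. it is one of the input maps."
  [node]
  (not (or (:left node) (:right node))))

(defn leaves
  "Vectors stored at the leaves of a cluster tree, left to right."
  [node]
  (if (leaf? node)
    [(:vec node)]
    (into (leaves (:left node)) (leaves (:right node)))))

(defn average
  "Component-wise average of two vectors."
  [a b]
  (map #(/ (+ %1 %2) 2) a b))

;; The example tree from above.
(def tree
  {:vec   [9/2 6 13/2]
   :right {:vec [7 9 9]}
   :left  {:right {:vec [3 4 5]}
           :left  {:vec [1 2 3]}
           :vec   [2 3 4]}})

(leaves tree)                ;; => [[1 2 3] [3 4 5] [7 9 9]]
(average [1 2 3] [3 4 5])    ;; => (2 3 4)
(average [2 3 4] [7 9 9])    ;; => (9/2 6 13/2)
```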
Known Bugs
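Passing a vector whose elements are all the same number (for example [2 2 2]) to either clustering function
will cause a division-by-zero error due to my sucking at implementing Pearson correctly. Such a vector has
zero variance, so the denominator of the Pearson coefficient is zero.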
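For reference, here is one way a Pearson function could guard against that case. It is a sketch under assumptions, not the implementation in this repository: the function name and the choice to return 0.0 for a constant vector (where the coefficient is mathematically undefined) are both illustrative.

```clojure
(defn pearson
  "Pearson correlation of two equal-length numeric sequences.
  Returns 0.0 when either input has zero variance, instead of
  dividing by zero."
  [xs ys]
  (let [n      (count xs)
        sum-x  (reduce + xs)
        sum-y  (reduce + ys)
        sum-xx (reduce + (map * xs xs))
        sum-yy (reduce + (map * ys ys))
        sum-xy (reduce + (map * xs ys))
        ;; Covariance and variance terms, each scaled by n.
        num    (- sum-xy (/ (* sum-x sum-y) n))
        den-sq (* (- sum-xx (/ (* sum-x sum-x) n))
                  (- sum-yy (/ (* sum-y sum-y) n)))]
    (if (pos? den-sq)
      (/ num (Math/sqrt (double den-sq)))
      0.0)))

(pearson [1 2 3] [2 4 6])   ;; => 1.0
(pearson [1 2 3] [3 2 1])   ;; => -1.0
(pearson [2 2 2] [1 2 3])   ;; => 0.0, the guarded case
```

Whether 0.0 is the right fallback depends on how the result is used as a distance; throwing a descriptive exception would be another reasonable choice.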
To Do
- Fix Pearson
- Add more similarity functions and allow the user to choose which to use







