deepfryed/flock

Flock

Ruby bindings to Cluster 3.0

Description

Provides bindings to clustering methods in Cluster 3.0.

* K-Means
* Kohonen Self-Organizing Maps
* Tree Cluster or Hierarchical Clustering

Synopsis

Specify vectors explicitly

require 'pp'
require 'flock'

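# 13 vectors (rows), each with 4 values (columns)
data     = Array.new(13) {[]}
# mask has the same shape as data: 1 = value present, 0 = value missing
mask     = Array.new(13) {[]}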
weights  = Array.new(13) {1.0}

data[ 0][ 0]=0.1; data[ 0][ 1]=0.0; data[ 0][ 2]=9.6; data[ 0][ 3] = 5.6;
data[ 1][ 0]=1.4; data[ 1][ 1]=1.3; data[ 1][ 2]=0.0; data[ 1][ 3] = 3.8;
data[ 2][ 0]=1.2; data[ 2][ 1]=2.5; data[ 2][ 2]=0.0; data[ 2][ 3] = 4.8;
data[ 3][ 0]=2.3; data[ 3][ 1]=1.5; data[ 3][ 2]=9.2; data[ 3][ 3] = 4.3;
data[ 4][ 0]=1.7; data[ 4][ 1]=0.7; data[ 4][ 2]=9.6; data[ 4][ 3] = 3.4;
data[ 5][ 0]=0.0; data[ 5][ 1]=3.9; data[ 5][ 2]=9.8; data[ 5][ 3] = 5.1;
data[ 6][ 0]=6.7; data[ 6][ 1]=3.9; data[ 6][ 2]=5.5; data[ 6][ 3] = 4.8;
data[ 7][ 0]=0.0; data[ 7][ 1]=6.3; data[ 7][ 2]=5.7; data[ 7][ 3] = 4.3;
data[ 8][ 0]=5.7; data[ 8][ 1]=6.9; data[ 8][ 2]=5.6; data[ 8][ 3] = 4.3;
data[ 9][ 0]=0.0; data[ 9][ 1]=2.2; data[ 9][ 2]=5.4; data[ 9][ 3] = 0.0;
data[10][ 0]=3.8; data[10][ 1]=3.5; data[10][ 2]=5.5; data[10][ 3] = 9.6;
data[11][ 0]=0.0; data[11][ 1]=2.3; data[11][ 2]=3.6; data[11][ 3] = 8.5;
data[12][ 0]=4.1; data[12][ 1]=4.5; data[12][ 2]=5.8; data[12][ 3] = 7.6;

mask[ 0][ 0]=1; mask[ 0][ 1]=1; mask[ 0][ 2]=1; mask[ 0][ 3] = 1;
mask[ 1][ 0]=1; mask[ 1][ 1]=1; mask[ 1][ 2]=0; mask[ 1][ 3] = 1;
mask[ 2][ 0]=1; mask[ 2][ 1]=1; mask[ 2][ 2]=0; mask[ 2][ 3] = 1;
mask[ 3][ 0]=1; mask[ 3][ 1]=1; mask[ 3][ 2]=1; mask[ 3][ 3] = 1;
mask[ 4][ 0]=1; mask[ 4][ 1]=1; mask[ 4][ 2]=1; mask[ 4][ 3] = 1;
mask[ 5][ 0]=0; mask[ 5][ 1]=1; mask[ 5][ 2]=1; mask[ 5][ 3] = 1;
mask[ 6][ 0]=1; mask[ 6][ 1]=1; mask[ 6][ 2]=1; mask[ 6][ 3] = 1;
mask[ 7][ 0]=0; mask[ 7][ 1]=1; mask[ 7][ 2]=1; mask[ 7][ 3] = 1;
mask[ 8][ 0]=1; mask[ 8][ 1]=1; mask[ 8][ 2]=1; mask[ 8][ 3] = 1;
mask[ 9][ 0]=1; mask[ 9][ 1]=1; mask[ 9][ 2]=1; mask[ 9][ 3] = 0;
mask[10][ 0]=1; mask[10][ 1]=1; mask[10][ 2]=1; mask[10][ 3] = 1;
mask[11][ 0]=0; mask[11][ 1]=1; mask[11][ 2]=1; mask[11][ 3] = 1;
mask[12][ 0]=1; mask[12][ 1]=1; mask[12][ 2]=1; mask[12][ 3] = 1;

pp Flock.kcluster(6, data, mask: mask)

# method: (kcluster)
#    - Flock::METHOD_AVERAGE (kmeans, this is the default)
#    - Flock::METHOD_MEDIAN  (kmedians)
# method: (treecluster)
#    - Flock::METHOD_AVERAGE_LINKAGE (default)
#    - Flock::METHOD_SINGLE_LINKAGE
#    - Flock::METHOD_MAXIMUM_LINKAGE
#    - Flock::METHOD_CENTROID_LINKAGE
# metric:
#    - Flock::METRIC_EUCLIDIAN (default)
#    - Flock::METRIC_CITY_BLOCK
#    - Flock::METRIC_CORRELATION
#    - Flock::METRIC_ABSOLUTE_CORRELATION
#    - Flock::METRIC_UNCENTERED_CORRELATION
#    - Flock::METRIC_ABSOLUTE_UNCENTERED_CORRELATION
#    - Flock::METRIC_SPEARMAN
#    - Flock::METRIC_KENDALL
# seed: (initial cluster assignment)
#    - Flock::SEED_RANDOM            (uniform random, this is the default)
#    - Flock::SEED_KMEANS_PLUSPLUS   (kmeans++ - initial cluster centers chosen weighted by distance from closest center)
#    - Flock::SEED_SPREADOUT         (similar to kmeans++ but deterministic, spreads out cluster centers)

pp Flock.kcluster(
  6,
  data,
  mask:      mask,
  method:    Flock::METHOD_AVERAGE,
  metric:    Flock::METRIC_EUCLIDIAN,
  transpose: 0,
  weights:   Array.new(13) {1.0},
  seed:      Flock::SEED_RANDOM
)

# treecluster takes one of the linkage methods listed above
pp Flock.treecluster(
  6,
  data,
  mask:      mask,
  method:    Flock::METHOD_AVERAGE_LINKAGE,
  metric:    Flock::METRIC_EUCLIDIAN,
  transpose: 0,
  weights:   Array.new(13) {1.0}
)
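
In Cluster 3.0, a mask value of 0 marks the corresponding data value as missing, so it is left out of distance calculations. The sketch below illustrates that idea in plain Ruby without calling Flock. `masked_sq_distance` is a hypothetical helper, and the normalisation by total weight is an assumption about how such a distance can be defined, not a statement of what the library does internally.

```ruby
# Hypothetical helper, not part of Flock: weighted squared Euclidean distance
# between two rows, using only the columns present (mask == 1) in both rows.
def masked_sq_distance(row_a, row_b, mask_a, mask_b, weights = nil)
  weights ||= Array.new(row_a.size, 1.0)
  sum = 0.0
  total_weight = 0.0

  row_a.each_index do |i|
    next if mask_a[i].zero? || mask_b[i].zero? # skip missing values

    diff = row_a[i] - row_b[i]
    sum += weights[i] * diff * diff
    total_weight += weights[i]
  end

  # No column is present in both rows: report a distance of zero.
  return 0.0 if total_weight.zero?

  sum / total_weight
end

# Rows 0 and 1 from the example above; column 2 of row 1 is masked out.
row0 = [0.1, 0.0, 9.6, 5.6]
row1 = [1.4, 1.3, 0.0, 3.8]
mask0 = [1, 1, 1, 1]
mask1 = [1, 1, 0, 1]

puts masked_sq_distance(row0, row1, mask0, mask1)
```

Without the mask, the 0.0 placeholder in column 2 of row 1 would be compared against 9.6 and dominate the result.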

Sparse data and clustering string labels

require 'pp'
require 'flock'

data = []

# keys don't need to be numeric
data << { 1 => 0.5, 2 => 0.5 }
data << { 3 => 1, 4 => 1 }
data << { 4 => 1, 5 => 0.3 }
data << { 2 => 0.75 }
data << { 1 => 0.60 }

pp Flock.kcluster(2, data, sparse: true)

data = []

# a much simpler way to cluster text labels.
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(orange)
data << %w(apple)

# additional options, such as metric or iterations, can be passed in the options hash.
pp Flock.kcluster(2, data, sparse: true)
pp Flock.treecluster(2, data, sparse: true)
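
The TODO list in this README notes that sparse data is currently converted into dense matrices. A minimal sketch of what such a conversion can look like, in plain Ruby without Flock, follows. `densify` is a hypothetical helper; filling absent keys with 0.0 and giving each bare label a value of 1.0 are assumptions made for illustration, not a description of Flock's internals.

```ruby
# Hypothetical helper, not part of Flock: turn sparse rows into a dense matrix.
# Rows may be hashes (key => value) or arrays of labels (value assumed to be 1.0).
def densify(rows)
  hashes = rows.map do |row|
    row.is_a?(Hash) ? row : row.to_h { |label| [label, 1.0] }
  end

  # One column per distinct key, in order of first appearance.
  keys = hashes.flat_map(&:keys).uniq

  # Absent keys are filled with 0.0 (an assumption).
  data = hashes.map { |h| keys.map { |k| h.fetch(k, 0.0).to_f } }

  [keys, data]
end

keys, dense = densify([%w(apple orange), %w(black white), %w(white cyan), %w(orange), %w(apple)])

puts keys.inspect
dense.each { |row| puts row.inspect }
```

Each distinct label becomes one column, so the dense matrix grows with the vocabulary even when individual rows are short.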

Self-Organizing Map

A Self-Organizing Map (SOM) requires a 2D grid onto which the data points are mapped. Some grid points may end up empty while others have clusters mapped to them, so there is no need to specify the number of clusters in advance; the grid size only sets an upper bound.

require 'pp'
require 'flock'

data = []

# a much simpler way to cluster text
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(orange)
data << %w(apple)

# nxgrid, nygrid and data are required.
# additional options, such as metric or iterations, can be passed in the options hash.

# cluster up to 4 groups in a 2x2 grid.
pp Flock.self_organizing_map(2, 2, data, sparse: true)

Note: SOM clustering returns a 2D grid coordinate for each vector, whereas kcluster and treecluster return an integer cluster value for each vector.
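
Because each vector is assigned a grid coordinate, vectors that share a coordinate can be treated as one cluster. The sketch below groups vectors by cell in plain Ruby. The `coordinates` array is made-up illustrative data, not real Flock output, and the exact shape of the value returned by self_organizing_map is not shown in this README, so treat the one-`[x, y]`-pair-per-vector layout as an assumption.

```ruby
labels = [%w(apple orange), %w(black white), %w(white cyan), %w(orange), %w(apple)]

# Illustrative coordinates only (not real Flock output): one [x, y] grid cell
# per input vector, in the same order as labels.
coordinates = [[0, 0], [1, 1], [1, 1], [0, 0], [0, 1]]

# Collect the vectors that landed on each grid cell. Cells that attracted no
# vectors never get an entry.
clusters = Hash.new { |hash, cell| hash[cell] = [] }
coordinates.each_with_index { |cell, index| clusters[cell] << labels[index] }

clusters.each { |cell, members| puts "#{cell.inspect}: #{members.inspect}" }
```

In this made-up 2x2 example, three of the four cells are occupied and cell [1, 0] stays empty.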

Changes from 0.4.x to 0.5.0

Deprecated methods

  • kmeans: use kcluster instead

  • sparse_kmeans: use kcluster with option sparse: true

  • sparse_treecluster: use treecluster with option sparse: true

  • sparse_self_organizing_map: use self_organizing_map with option sparse: true

Method signature

  • kmeans, treecluster and self_organizing_map no longer take mask as a parameter.

  • mask needs to be passed along with other optional parameters in the options hash.

TODO

  • K-Tree clustering

  • Use Sparse Matrix instead of converting sparse data into dense matrices.

  • BIRCH hierarchical clustering.

  • EM clustering.

  • kcluster auto-suggest cluster size.

License

Creative Commons Attribution - CC BY
