C4E, a Scala and Spark library for Big Data clustering.

Clustering 4 Ever

Welcome to the Big Data Clustering Library, which gathers clustering algorithms and quality indexes. Don't hesitate to check our Wiki, or to ask questions and make recommendations in our Gitter. You will find additional content about clustering algorithms here.

API documentation

Include it in your project

Add the following dependency to the libraryDependencies in your build.sbt:

  • "org.clustering4ever" % "clustering4ever_2.11" % "0.7.2"

If needed, also add this resolver:

  • resolvers += Resolver.bintrayRepo("clustering4ever", "C4E")
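Put together, the two settings above could look like the following build.sbt fragment. This is a minimal sketch: the coordinates and the resolver are quoted from this README, while the `scalaVersion` line is an assumption inferred from the `_2.11` suffix of the artifact name.

```scala
// build.sbt (sketch)

// Assumption: a Scala 2.11.x compiler, matching the `_2.11` suffix of the artifact.
scalaVersion := "2.11.12"

// Resolver quoted from this README; only needed if the artifact is not resolved otherwise.
resolvers += Resolver.bintrayRepo("clustering4ever", "C4E")

// Dependency coordinates quoted from this README.
libraryDependencies += "org.clustering4ever" % "clustering4ever_2.11" % "0.7.2"
```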

You can also depend on specific modules only:

  • Core
  • Scala Clustering
  • Spark Clustering

Citation

If you publish material based on information obtained from this repository, please acknowledge this community work. Doing so helps others find the same information and replicate your experiments: having results is good, but being able to compare them with others' is better. Citation: @misc{C4E, url = "https://github.com/Clustering4Ever/Clustering4Ever", institution = "Paris 13 University, LIPN UMR CNRS 7030"}

Available algorithms

  • Emphasized algorithms are implemented in Scala.
  • Bold algorithms are implemented in Spark.
  • An algorithm can be available in both versions.

Clustering algorithms

Quality indexes

  • Mutual Information
  • Normalized Mutual Information
  • Davies Bouldin
  • Silhouette
  • Ball Hall

C4E-Notebook examples

Basic usage of the implemented algorithms is demonstrated with SparkNotebooks in the Spark-Clustering-Notebook organization.

Miscellaneous

Implicit conversions

If you have a classic real or binary matrix, you can bypass the transformation from GenSeq/RDD[Vector] to GenSeq/RDD[Clusterizable] by importing the implicit conversion functions:

  • org.clustering4ever.util.ScalaImplicits._
  • org.clustering4ever.util.SparkImplicits._
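The mechanism behind such imports can be illustrated with a self-contained sketch. Everything below is hypothetical and only mimics the idea: `SimpleClusterizable`, `countPoints` and `rawToClusterizable` are stand-ins invented for this example, not the classes or functions provided by C4E.

```scala
import scala.language.implicitConversions

object ImplicitConversionSketch {
  // Hypothetical stand-in for a wrapper type such as Clusterizable; NOT the real C4E class.
  final case class SimpleClusterizable(id: Long, vector: Seq[Double])

  // Hypothetical entry point that only accepts wrapped points.
  def countPoints(data: Seq[SimpleClusterizable]): Int = data.size

  // Implicit conversion: wraps each raw vector and uses its position as identifier.
  implicit def rawToClusterizable(raw: Seq[Seq[Double]]): Seq[SimpleClusterizable] =
    raw.zipWithIndex.map { case (v, i) => SimpleClusterizable(i.toLong, v) }
}
```

With `import ImplicitConversionSketch._` in scope, a `Seq[Seq[Double]]` value can be passed directly to `countPoints`, and the compiler inserts the call to `rawToClusterizable` automatically.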

References

Coming soon

  • Improved Spark Mean Shift
  • New scalable clustering algorithms
  • Gaussian Mixture Models
  • Metaheuristics
  • Rough set feature selection

Which data structures are recommended for best performance

  • ArrayBuffer as the vector type is a good start
  • ArrayBuffer or ParArray as vector containers are also recommended
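As an illustration of that layout, the sketch below stores each point as an `ArrayBuffer[Double]` inside an outer `ArrayBuffer` and computes their component-wise mean. The object and method names are hypothetical and the code does not use the C4E API; it only shows the recommended containers in plain Scala.

```scala
import scala.collection.mutable.ArrayBuffer

object ArrayBufferDataSketch {
  /** Component-wise mean of points stored as ArrayBuffer vectors of equal dimension. */
  def mean(points: ArrayBuffer[ArrayBuffer[Double]]): ArrayBuffer[Double] = {
    require(points.nonEmpty, "at least one point is required")
    val dim = points.head.size
    val acc = ArrayBuffer.fill(dim)(0.0)
    points.foreach { p =>
      require(p.size == dim, "all points must share the same dimension")
      var i = 0
      // Index-based loop: ArrayBuffer offers constant-time random access.
      while (i < dim) {
        acc(i) += p(i)
        i += 1
      }
    }
    acc.map(_ / points.size)
  }
}
```

On the Scala versions where parallel collections ship with the standard library (2.11 and 2.12), calling `.par` on the outer container gives a parallel collection that can be used in the same way.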