
TSunny007/Clustering

Implementations of various common clustering algorithms.


You can click on any of the method names to preview the Jupyter Notebooks online:

Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Each linkage criterion measures the distance between clusters differently, and at every step the two closest (least distant) clusters are combined.

There are many variants of hierarchical clustering; here we explore three. The key difference between them is how the distance between two clusters S1 and S2 is measured (a sketch of all three follows the list below).

Single-Link: inter-cluster distance is measured by the shortest link (the distance between the closest pair of points, one from cluster S1 and one from S2).

Complete-Link: inter-cluster distance is measured by the longest link (the distance between the furthest pair of points, one from cluster S1 and one from S2).

Mean-Link: inter-cluster distance is measured by the mean link (the distance between the centers of mass of clusters S1 and S2).
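For concreteness, here is a minimal Python sketch of agglomerative clustering with the three linkage criteria above. It assumes observations are NumPy vectors; the function names and the naive pairwise merge search are illustrative choices, not necessarily what the notebooks implement.

```python
# Illustrative sketch: agglomerative clustering with pluggable linkage.
import numpy as np
from itertools import combinations

def single_link(c1, c2):
    # shortest distance between any pair of points across the two clusters
    return min(np.linalg.norm(p - q) for p in c1 for q in c2)

def complete_link(c1, c2):
    # longest distance between any pair of points across the two clusters
    return max(np.linalg.norm(p - q) for p in c1 for q in c2)

def mean_link(c1, c2):
    # distance between the centers of mass of the two clusters
    return np.linalg.norm(np.mean(c1, axis=0) - np.mean(c2, axis=0))

def agglomerative(points, k, linkage=single_link):
    # start with each observation in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # merge the pair of clusters with the smallest inter-cluster distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters
```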


Gonzalez Algorithm: A greedy approximation algorithm for finding k clusters that minimize the maximum cluster diameter. It repeatedly picks the next cluster center as the point furthest from all previously chosen centers. Chokes on outliers, since a single distant point can capture a center of its own.
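A hedged sketch of the Gonzalez heuristic, assuming `points` is an (n, d) NumPy array; seeding with the first point is an arbitrary illustrative choice.

```python
# Illustrative sketch: greedy k-center (Gonzalez) heuristic.
import numpy as np

def gonzalez(points, k):
    # arbitrarily seed with the first point as the first center
    centers = [points[0]]
    for _ in range(k - 1):
        # distance from each point to its nearest chosen center
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        # greedily pick the point furthest from all current centers;
        # this is exactly why an outlier can claim a center of its own
        centers.append(points[np.argmax(dists)])
    return np.array(centers)
```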


Centroid-based: Finds k cluster centers and assigns each object to the nearest center, such that the squared distances from the objects to their cluster centers are minimized. Chokes on poor initial centroid placement.

Lloyd's Algorithm (K-means) consists of two important steps:

  1. Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance; intuitively, this is the "nearest" mean. (This partitions the observations into a Voronoi diagram with the means as the "post offices".)
  2. Update step: Recalculate each mean as the centroid of the observations in its new cluster.

The algorithm finishes when the means stop changing between iterations.
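A minimal sketch of the two steps above, assuming `points` is an (n, d) NumPy array and `centers` holds k initial centroids; the names and the iteration cap are illustrative.

```python
# Illustrative sketch: Lloyd's algorithm (K-means).
import numpy as np

def lloyds(points, centers, max_iter=100):
    for _ in range(max_iter):
        # assignment step: each point joins the cluster of its nearest mean
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # update step: each center moves to the centroid of its cluster
        # (an empty cluster keeps its old center to avoid a NaN mean)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))])
        # stop once the means no longer change between iterations
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```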

K-means++ is meant to overcome K-means' initialization pitfalls: it picks each initial centroid at random using a weighted probability distribution, where the probability of picking a point x is proportional to D(x)², the squared distance between x and the nearest already-chosen centroid. After k centers have been chosen, the standard K-means algorithm proceeds as usual.
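A sketch of the K-means++ seeding step under the same assumptions as above; the resulting centers would then be handed to Lloyd's algorithm.

```python
# Illustrative sketch: K-means++ initialization.
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # first center is chosen uniformly at random
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center
        d2 = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        # sample the next center with probability proportional to D(x)^2
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)
```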
