## K-means clustering 



1. Choose the number of clusters (k). This is a critical step, as the number of clusters determines the granularity of the clustering results.

2. Randomly initialize the cluster centroids. This can be done by randomly selecting k data points from the data set.

3. Assign each data point to the cluster whose centroid is closest to it. This can be done using a distance metric, such as the Euclidean distance or the Manhattan distance.

4. Recalculate the centroid of each cluster. The centroid of a cluster is the mean of the data points in that cluster.

5. Repeat steps 3 and 4 until the centroids no longer change or until a maximum number of iterations is reached.
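To make steps 3 and 4 concrete, here is a minimal sketch of a single assignment and update pass on a toy data set. The helper names (`euclidean`, `manhattan`, `assign`, `update`) and the sample points are illustrative assumptions, not part of the implementation further down.

```python
def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def assign(points, centroids, metric):
    """Step 3: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: metric(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Step 4: mean of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids


# Hypothetical toy data: two obvious groups and two starting centroids.
points = [[1, 1], [2, 1], [8, 9], [9, 8]]
centroids = [[0, 0], [10, 10]]

labels = assign(points, centroids, euclidean)
centroids = update(points, labels, centroids)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[1.5, 1.0], [8.5, 8.5]]

# The metric can change the assignment: the same point picks a different centroid.
print(assign([[0, 0]], [[3, 3], [5, 0]], euclidean))  # [0]
print(assign([[0, 0]], [[3, 3], [5, 0]], manhattan))  # [1]
```

Note that the mean update in step 4 is the natural partner of the (squared) Euclidean distance. When the Manhattan distance is used for assignment, the per-coordinate median is the matching update, which gives the k-medians variant.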


Some of the advantages of k-means clustering:

- It is relatively simple to implement and understand.
- It works on data with any number of dimensions.
- It is relatively efficient, especially for large data sets: each iteration costs on the order of n × k × d distance operations for n points, k clusters, and d dimensions.

Some of the disadvantages of k-means clustering:

- The number of clusters (k) must be chosen in advance, and the choice can be arbitrary.
- The result can be sensitive to the initial choice of the cluster centroids.
- The algorithm can converge to a local minimum of the within-cluster sum of squares instead of the global minimum.
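The sensitivity to initialization can be shown on a very small example. The sketch below is an illustration under stated assumptions: `kmeans_1d` and `wcss` are hypothetical helpers written for 1-D data, and the starting centroids are chosen by hand to provoke the problem.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by absolute difference.
        labels = [min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # Update step: mean of each cluster (empty clusters keep their centroid).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [v for v, label in zip(values, labels) if label == j]
            new_centroids.append(sum(members) / len(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def wcss(values, labels, centroids):
    """Within-cluster sum of squares for a 1-D clustering."""
    return sum((v - centroids[label]) ** 2 for v, label in zip(values, labels))


values = [0, 1, 10, 11, 20, 21]  # three obvious pairs

good_labels, good_centroids = kmeans_1d(values, [0, 10, 20])
bad_labels, bad_centroids = kmeans_1d(values, [0, 1, 10])

print(good_labels, wcss(values, good_labels, good_centroids))  # [0, 0, 1, 1, 2, 2] 1.5
print(bad_labels, wcss(values, bad_labels, bad_centroids))     # [0, 1, 2, 2, 2, 2] 101.0
```

Both runs converge, but the second one stops in a local minimum that splits the first pair and merges the other two. Common mitigations are to run the algorithm from several random starts and keep the result with the lowest sum of squares, or to use a spread-out initialization such as k-means++.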

Common performance measures used for clustering:

1. Within-cluster sum of squares (WCSS): This is the sum of the squared distances between each data point and the centroid of its cluster. A lower WCSS indicates that the data points within each cluster are more tightly grouped. Because WCSS tends to decrease as k grows, it is most meaningful when comparing clusterings with the same k.

2. Between-cluster sum of squares (BCSS): This is the sum of the squared distances between each cluster centroid and the overall mean of the data set, usually weighted by the number of points in each cluster. A higher BCSS indicates that the cluster centroids are better separated from each other.

3. Homogeneity: This measures whether each cluster contains only data points from a single ground-truth class, so it requires known class labels. A score of 1 indicates that every cluster is pure. When no labels are available, intra-cluster variance is the loosely related label-free notion.

4. Completeness: This measures whether all data points of a given ground-truth class are assigned to the same cluster, so it also requires known class labels. A score of 1 indicates that no class is split across clusters. When no labels are available, inter-cluster variance is the loosely related label-free notion.

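5. Silhouette coefficient: This measures how well each data point fits its assigned cluster. For a point, let a be the mean distance to the other points in its own cluster and b the mean distance to the points in the nearest other cluster; the coefficient is (b - a) / max(a, b). It ranges from -1 to 1, and values near 1 indicate that the point is well clustered. The mean over all points scores the clustering as a whole.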
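The measures that need no ground-truth labels (WCSS, BCSS, and the silhouette coefficient) can be sketched in plain Python as follows. The function names and the four sample points are assumptions made for illustration; BCSS is weighted by cluster size here, which is a common convention rather than the only one.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]


def clusters_of(points, labels):
    """Group points by their cluster label."""
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    return groups


def wcss(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(sq_dist(p, mean_point(members))
               for members in clusters_of(points, labels).values()
               for p in members)


def bcss(points, labels):
    """Size-weighted squared distances from cluster means to the overall mean."""
    overall = mean_point(points)
    return sum(len(members) * sq_dist(mean_point(members), overall)
               for members in clusters_of(points, labels).values())


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    groups = clusters_of(points, labels)
    scores = []
    for p, label in zip(points, labels):
        own = groups[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other points in the same cluster
        # (p itself contributes 0 to the sum, hence the len(own) - 1).
        a = sum(sq_dist(p, q) ** 0.5 for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(sq_dist(p, q) ** 0.5 for q in other) / len(other)
                for other_label, other in groups.items() if other_label != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical example: two tight, well-separated pairs.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]

print(wcss(points, labels))                  # 4.0
print(bcss(points, labels))                  # 100.0
print(round(silhouette(points, labels), 3))  # 0.802
```

With the size-weighted definition, WCSS and BCSS add up to the total sum of squared distances from every point to the overall mean (104 for this data), so lowering one raises the other for a fixed data set.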

In [34]:
import random

def kmeans(data, k, max_iter):
    random.seed(9001)  # fixed seed so that the run is reproducible
    # Step 2: random initial centroids (assumes 2-D data with coordinates in [0, 100]).
    centroids = []
    for _ in range(k):
        centroids.append([random.randint(0, 100), random.randint(0, 100)])

    labels = []
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid.
        labels = []
        for point in data:
            distances = [distance(point, c) for c in centroids]
            labels.append(distances.index(min(distances)))

        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(data, labels) if label == i]
            if members:
                new_centroids.append([sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)])
            else:
                # Empty cluster: keep the previous centroid to avoid dividing by zero.
                new_centroids.append(centroids[i])

        # Step 5: stop as soon as the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return labels

def distance(p1, p2):
    # Euclidean distance between two 2-D points.
    return ((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)**0.5




In [75]:
data = [[10, 10], [50, 50], [30, 30], [70, 70], [90, 90]]

labels = kmeans(data, 4, 100)

print(labels)

[1, 2, 1, 3, 3]
