In [4]:
"""
One of the most popular clustering algorithms is k-means. Given n data points and a chosen number of clusters k, the algorithm works as follows:

Step 1: initialization - pick k random points as the initial cluster centers, called centroids
Step 2: cluster assignment - assign each data point to its nearest centroid; this forms k clusters
Step 3: centroid updating - for each cluster, recompute its centroid as the mean of all the points assigned to it
Step 4: repeat steps 2 and 3 until no cluster assignment changes or the maximum number of iterations is reached

The k-means algorithm is implemented in the sklearn.cluster module. To access it:
"""

from sklearn.cluster import KMeans

"""
The algorithm has gained great popularity because it is easy to implement and scales well to large datasets.
However, the number of clusters k must be chosen in advance, which is often difficult,
it can get stuck in local optima,
and it can perform poorly when the clusters are of varying sizes and densities.
"""

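The four steps of k-means can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name `kmeans`, its parameters, the choice of initial centroids from the data points themselves, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch (hypothetical helper); returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids (an assumption)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every point to every centroid,
        # then assign each point to the nearest centroid
        diffs = points[:, None, :] - centroids[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no assignment changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data: two well-separated groups of three points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping is meaningful, not the label values.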

In [5]:
"""
How do we calculate the distance in the k-means algorithm? One way is the Euclidean distance, the length of the straight line between two data points.
"""


![image.png](attachment:image.png)

For example, the Euclidean distance between points x1 = (0, 1) and x2 = (2, 0) is given by:

![image-2.png](attachment:image-2.png)

In [6]:
"""
In NumPy, we can calculate the same distance as follows:
"""

import numpy as np
x1 = np.array([0, 1])
x2 = np.array([2, 0])

print(np.sqrt(((x1-x2)**2).sum()))
# 2.23606797749979

print(np.sqrt(5))
# 2.23606797749979

2.23606797749979
2.23606797749979


This extends to higher dimensions. Consider two points p and q in n-dimensional space:

![image.png](attachment:image.png)

Then the Euclidean distance from p to q is given by the Pythagorean formula:

![image-2.png](attachment:image-2.png)
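As a rough illustration of the n-dimensional case, the same sum-of-squared-differences computation works for vectors of any length. The helper name `euclidean_distance` and the 4-dimensional sample points are assumptions made for this sketch; `np.linalg.norm` is used only as a cross-check.

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two points of equal dimension (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square root of the sum of squared coordinate differences
    return np.sqrt(((p - q) ** 2).sum())

# Two sample points in 4-dimensional space
p = [1, 2, 3, 4]
q = [3, 2, 1, 4]

d = euclidean_distance(p, q)
print(d)  # sqrt(4 + 0 + 4 + 0) = sqrt(8), about 2.83

# Cross-check against NumPy's built-in vector norm
print(np.isclose(d, np.linalg.norm(np.array(p) - np.array(q))))
```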

There are other distance metrics, such as Manhattan distance, cosine distance, etc. The choice of the distance metric depends on the data.
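For comparison, the Manhattan and cosine distances mentioned above can be computed in NumPy along the same lines. This is only a sketch: the variable names are illustrative, and cosine distance is taken here to mean one minus the cosine similarity, which is a common convention but an assumption about what the text intends.

```python
import numpy as np

x1 = np.array([1.0, -1.0])
x2 = np.array([4.0, 3.0])

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.abs(x1 - x2).sum()  # |1 - 4| + |-1 - 3| = 7

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
# Cosine distance, assumed here to be 1 - cosine similarity
cosine_distance = 1.0 - cosine_similarity

print(manhattan)
print(cosine_distance)
```

Unlike the Euclidean and Manhattan distances, the cosine distance depends only on the angle between the two vectors, not on their lengths.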

In [7]:
x1 = np.array([1, -1])
x2 = np.array([4, 3])

print(np.sqrt(((x1-x2)**2).sum()))

5.0
