## K-means

Cluster analysis is a classic unsupervised learning task. Given a set of samples, it divides the data into several categories (clusters) according to a similarity or distance measure defined on the features. Commonly used cluster analysis methods include hierarchical clustering, k-means clustering, fuzzy clustering, and density-based clustering.

The similarity or distance measure is the key to cluster analysis. The following are some frequently used distance and similarity measures.

• Minkowski Distance

Given a set $X$ of $m$-dimensional sample vectors, with $x_{i},x_{j} \in X, \enspace x_{i}=(x_{1i}, x_{2i}, \cdots, x_{mi})^{\top}, \enspace x_{j}=(x_{1j}, x_{2j}, \cdots, x_{mj})^{\top}$, the Minkowski Distance between $x_{i}$ and $x_{j}$ is defined as:
$$
d_{ij} = (\sum^{m}_{k=1}|x_{ki}-x_{kj}|^{p})^{\frac{1}{p}}, \enspace p \ge 1
$$

When $p=2$, the Minkowski Distance is the Euclidean Distance:
$$
d_{ij} = (\sum^{m}_{k=1}|x_{ki}-x_{kj}|^{2})^{\frac{1}{2}}
$$

When $p=1$, the Minkowski Distance is the Manhattan Distance:
$$
d_{ij} = \sum^{m}_{k=1}|x_{ki}-x_{kj}|
$$

When $p=\infty$ (more precisely, in the limit $p \to \infty$), the Minkowski Distance becomes the Chebyshev Distance, the largest coordinate-wise difference:
$$
d_{ij} = \max_{k}|x_{ki}-x_{kj}|
$$
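
The three special cases above can be checked numerically. The following is a minimal sketch, not part of the original notebook: the helper name `minkowski_distance` and the two test vectors are assumptions chosen for illustration.

```python
import numpy as np

def minkowski_distance(x1, x2, p=2):
    """Minkowski distance between two 1-D vectors (hypothetical helper).

    p=1 gives Manhattan, p=2 gives Euclidean, p=np.inf gives Chebyshev.
    """
    diff = np.abs(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    if np.isinf(p):
        # limit p -> infinity: the largest coordinate-wise difference
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski_distance(a, b, p=1))       # Manhattan: 3 + 4
print(minkowski_distance(a, b, p=2))       # Euclidean: sqrt(9 + 16)
print(minkowski_distance(a, b, p=np.inf))  # Chebyshev: max(3, 4)
```

For these vectors the three distances are 7, 5, and 4, which shows how larger values of $p$ put more weight on the largest coordinate difference.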

• Mahalanobis Distance

The Mahalanobis Distance is a distance measure that takes the correlation between individual features into account. Given a set of samples $X=(x_{ij})_{m \times n}$ and its covariance matrix $S$, the Mahalanobis Distance between $x_{i}$ and $x_{j}$ is defined as:
$$
d_{ij} = [(x_{i}-x_{j})^{T}S^{-1} (x_{i}-x_{j})]^{\frac{1}{2}}
$$

When $S$ is the identity matrix, that is, when the features are uncorrelated and each has variance 1, the Mahalanobis Distance is the same as the Euclidean Distance.
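
As a rough illustration of the formula, the sketch below estimates $S$ with `np.cov` and evaluates the distance with NumPy. The helper name `mahalanobis_distance` and the toy data are assumptions. The data are stored with one sample per row, which is the transpose of the $m \times n$ layout used in the text.

```python
import numpy as np

def mahalanobis_distance(x1, x2, S_inv):
    """Mahalanobis distance given the inverse covariance matrix (hypothetical helper)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.sqrt(diff @ S_inv @ diff)

# Toy data (made up for illustration): 5 samples as rows, 2 features as columns.
data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]])

# rowvar=False tells np.cov that columns are features and rows are samples.
S = np.cov(data, rowvar=False)
S_inv = np.linalg.inv(S)

print(mahalanobis_distance(data[0], data[1], S_inv))

# With the identity matrix in place of S, the result is the Euclidean distance.
print(mahalanobis_distance(data[0], data[1], np.eye(2)))
print(np.linalg.norm(data[0] - data[1]))
```

The last two printed values agree, matching the statement that the Mahalanobis Distance reduces to the Euclidean Distance when $S$ is the identity matrix.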


• Correlation Coefficient

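Correlation Coefficient is one of the most common ways to measure similarity. The closer the absolute value of the correlation coefficient is to 1, the more similar the two samples are, and the closer it is to 0, the less similar they are. With $\bar{x}_{i}=\frac{1}{m}\sum_{k=1}^{m}x_{ki}$ and $\bar{x}_{j}=\frac{1}{m}\sum_{k=1}^{m}x_{kj}$ denoting the means of the components of each sample, the Correlation Coefficient between $x_{i}$ and $x_{j}$ is defined as:
$$
r_{i j}=\frac{\sum_{k=1}^{m}\left(x_{k i}-\bar{x}_{i}\right)\left(x_{k j}-\bar{x}_{j}\right)}{\left[\sum_{k=1}^{m}\left(x_{k i}-\bar{x}_{i}\right)^{2} \sum_{k=1}^{m}\left(x_{k j}-\bar{x}_{j}\right)^{2}\right]^{\frac{1}{2}}}
$$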

• Cosine Similarity

Cosine Similarity is also one of the ways to measure the similarity of two samples. The closer the cosine is to 1, the more similar the two samples are, and the closer the cosine is to 0, the less similar the two samples are. Cosine Similarity between $x_{i}$ and $x_{j}$ can be defined as:
$$
s_{i j} =\cos (\theta)=\frac{\sum_{k=1}^{m} x_{k i} x_{k j}}{\left[\sum_{k=1}^{m} x_{k i}^{2} \sum_{k=1}^{m} x_{k j}^{2}\right]^{\frac{1}{2}}}
$$
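
The correlation coefficient is the cosine similarity of the mean-centered vectors, which the following sketch makes explicit. The function names and the two example vectors are assumptions chosen for illustration.

```python
import numpy as np

def cosine_similarity(x1, x2):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return (x1 @ x2) / np.sqrt((x1 @ x1) * (x2 @ x2))

def correlation_coefficient(x1, x2):
    """Pearson correlation: cosine similarity after subtracting each mean."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return cosine_similarity(x1 - x1.mean(), x2 - x2.mean())

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 3.0, 4.0])  # v is u shifted by a constant

print(cosine_similarity(u, v))        # below 1: the shift changes the angle
print(correlation_coefficient(u, v))  # 1 up to rounding: centering removes the shift
print(np.corrcoef(u, v)[0, 1])        # NumPy's built-in Pearson correlation, for comparison
```

Adding a constant to every component changes the cosine similarity but not the correlation coefficient, which is the practical difference between the two measures.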

Given a set of $n$ samples $X=\{x_{1},x_{2},\cdots,x_{n}\}$, each an $m$-dimensional vector (so the data form an $m \times n$ matrix), k-means clustering divides the $n$ samples into $k$ different categories (in general $k < n$). K-means clustering can therefore be viewed as a partition of the sample set, and its learning strategy is to select the optimal partition by minimizing a loss function.

Using the squared Euclidean Distance to measure the distance between samples, the distance $d(x_{i},x_{j})$ is:
$$
d_{i j}=\sum_{k=1}^{m}\left(x_{k i}-x_{k j}\right)^{2}=\left\|x_{i}-x_{j}\right\|^{2}
$$

In [1]:
import numpy as np

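# Euclidean distance between two samples; this is the square root of the
# squared distance above, so the nearest center is the same under either
def euclidean_distance(x1, x2):
    distance = 0
    for i in range(len(x1)):
        distance += (x1[i] - x2[i]) ** 2
    return np.sqrt(distance)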

Define the sum of the squared distances between the samples and their class centers as the loss function:
$$
W(C)=\sum_{l=1}^{k} \sum_{C(i)=l}\left\|x_{i}-\bar{x}_{l}\right\|^{2}
$$

$\bar{x}_{l}=\left(\bar{x}_{1 l}, \bar{x}_{2 l}, \ldots, \bar{x}_{m l}\right)^{T}$ is the center of the $l$-th class, that is, the mean of its members: $\bar{x}_{l}=\frac{1}{n_{l}}\sum_{C(i)=l}x_{i}$. Here $n_{l}=\sum^{n}_{i=1}I(C(i)=l)$ is the number of samples in class $l$, and $I(C(i)=l)$ is the indicator function, taking the value 1 or 0. The function $W(C)$ measures how similar the samples in the same class are: the smaller $W(C)$, the tighter the classes. Hence, k-means clustering can be posed as an optimization problem:
$$
\begin{aligned}
C^{*}&=\arg \min _{C} W(C) \\
&=\arg \min _{C} \sum_{l=1}^{k} \sum_{C(i)=l}\left\|x_{i}-\bar{x}_{l}\right\|^{2}
\end{aligned}
$$
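
To make the loss concrete, the sketch below evaluates $W(C)$ for two candidate partitions of the five sample points used in the example at the end of this section. The function name `within_cluster_loss` and the second, deliberately worse, labeling are assumptions for illustration.

```python
import numpy as np

def within_cluster_loss(X, labels, k):
    """Sum of squared distances from each sample to its class mean (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    loss = 0.0
    for l in range(k):
        members = X[labels == l]
        if len(members) == 0:
            continue  # an empty class contributes nothing to the sum
        center = members.mean(axis=0)
        loss += ((members - center) ** 2).sum()
    return loss

# Rows are samples, columns are features.
X_demo = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]], dtype=float)

print(within_cluster_loss(X_demo, [0, 0, 0, 1, 1], k=2))  # left three vs. right two
print(within_cluster_loss(X_demo, [0, 0, 1, 1, 1], k=2))  # a worse split, for comparison
```

The first partition gives the smaller loss, so it is the one the optimization prefers among these two candidates.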

The main steps of the k-means clustering algorithm are as follows:

• Initialize centers. That is, randomly select $k$ sample points as the initial cluster center points at the $0$-th iteration: $m^{(0)}=(m_{1}^{(0)}, \ldots, m_{l}^{(0)}, \ldots, m_{k}^{(0)})$.

In [2]:
# initialize the center points
def centroids_init(k, X):
    n_samples = X.shape[0]
    # draw k distinct sample indices so that no two initial centers coincide
    # (sampling with replacement could pick the same point twice and leave a class empty)
    indices = np.random.choice(n_samples, size=k, replace=False)
    # copy the chosen samples as floats so the centers can later hold means
    return X[indices].astype(float)

• Assign the samples to clusters according to their distance from the centers. For the fixed center points $m^{(t)}=(m_{1}^{(t)}, \ldots, m_{l}^{(t)}, \ldots, m_{k}^{(t)})$, calculate the distance from each sample to every class center, assign each sample to the class of its nearest center point, and form the preliminary clustering result $C^{(t)}$.

In [3]:
# return the index of the class center nearest to the given sample
def closest_centroid(sample, centroids):
    closest_i = 0
    closest_dist = float('inf')
    for i, centroid in enumerate(centroids):
        # keep the class whose center point is closest so far
        distance = euclidean_distance(sample, centroid)
        if distance < closest_dist:
            closest_i = i
            closest_dist = distance
    return closest_i

In [4]:
# build the classes: clusters[i] holds the indices of the samples assigned to center i
def create_clusters(centroids, k, X):
    clusters = [[] for _ in range(k)]
    for sample_i, sample in enumerate(X):
        # assign the sample to the closest class
        centroid_i = closest_centroid(sample, centroids)
        clusters[centroid_i].append(sample_i)
    return clusters
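
The assignment step can also be written without explicit Python loops by using NumPy broadcasting. The following is a sketch of that alternative, not a replacement for the functions above. The name `assign_labels` and the two fixed centers are assumptions for illustration.

```python
import numpy as np

def assign_labels(X, centroids):
    """Index of the nearest center for every sample (hypothetical vectorized helper)."""
    # X[:, None, :] has shape (n, 1, m) and centroids[None, :, :] has shape (1, k, m);
    # broadcasting gives all pairwise differences with shape (n, k, m).
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # The squared distance is enough: the square root does not change the argmin.
    return sq_dists.argmin(axis=1)

X_demo = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]], dtype=float)
centroids_demo = np.array([[0.0, 0.0], [5.0, 0.0]])  # two centers picked by hand

print(assign_labels(X_demo, centroids_demo))  # [0 0 0 1 1]
```

This returns a label per sample directly rather than a list of index lists, and it trades the per-sample loop for an intermediate array of size $n \times k \times m$.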

• Compute the new cluster centers from the clustering result of the previous step. For the clustering result $C^{(t)}$, calculate the mean of the samples currently in each class and use it as the new class center $m^{(t+1)}=(m_{1}^{(t+1)}, \ldots, m_{l}^{(t+1)}, \ldots, m_{k}^{(t+1)})$.

In [5]:
# recalculate the mean center point of each category based on the clustering results of the previous step
def calculate_centroids(clusters, k, X):
    n_features = np.shape(X)[1]
    centroids = np.zeros((k, n_features))
    # use the mean of all samples as the new center point
    for i, cluster in enumerate(clusters):
        centroid = np.mean(X[cluster], axis=0)
        centroids[i] = centroid
    return centroids

• If the iteration converges or meets the stopping condition, output the final clustering result $C^{*}=C^{(t)}$; otherwise set $t=t+1$ and return to the second step to continue the iteration.

In [6]:
# obtain the class label of each sample
def get_cluster_labels(clusters, X):
    y_pred = np.zeros(np.shape(X)[0])
    for cluster_i, cluster in enumerate(clusters):
        for sample_i in cluster:
            y_pred[sample_i] = cluster_i
    return y_pred

In [7]:
def kmeans(X, k, max_iterations):
    # step 1: initialize the centers
    centroids = centroids_init(k, X)
    for _ in range(max_iterations):
        # step 2: assign each sample to its nearest center
        clusters = create_clusters(centroids, k, X)
        prev_centroids = centroids
        # step 3: recompute each center as the mean of its class
        centroids = calculate_centroids(clusters, k, X)
        # step 4: stop once the centers no longer change
        diff = centroids - prev_centroids
        if not diff.any():
            break

    return get_cluster_labels(clusters, X)

In [8]:
X = np.array([[0,2],[0,0],[1,0],[5,0],[5,2]])
labels = kmeans(X, 2, 10)
print(labels)

[0. 0. 0. 1. 1.]


In [9]:
# kmeans in sklearn
from sklearn.cluster import KMeans
# use a separate name so the kmeans function defined above is not overwritten
sk_kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(sk_kmeans.labels_)

[1 1 1 0 0]
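
The two printed label vectors differ only in which cluster is called 0 and which is called 1. Cluster indices are arbitrary, so the two results describe the same partition. A minimal way to check this, assuming a hypothetical helper `same_partition`, is to compare which pairs of samples share a label:

```python
import numpy as np

def same_partition(labels_a, labels_b):
    """True if two labelings group the samples identically, ignoring label names."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # Entry (i, j) of each matrix says whether samples i and j share a cluster.
    return bool(np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :]))

ours = [0, 0, 0, 1, 1]         # labels printed by the hand-written kmeans above
from_sklearn = [1, 1, 1, 0, 0]  # labels printed by sklearn above

print(same_partition(ours, from_sklearn))  # True
```

Because the hand-written version starts from randomly chosen centers, the labels may come out swapped from run to run. A check like this one is unaffected by such a swap.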
