#### Exercise 2: K-Means Clustering

In [19]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.cluster import KMeans

from keras.datasets import mnist

In [20]:
# Load the dataset and process the data
(train_X, train_y), (test_X, test_y) = mnist.load_data()

# Combining dataset with labels
train = np.column_stack((train_X.reshape(-1, 28*28), train_y))
test = np.column_stack((test_X.reshape(-1, 28*28), test_y))

# Randomly trimming dataset for better performance
np.random.shuffle(train)
train_y = np.array(train[:100][:, -1])
train = np.array(train[:100][:, :-1].tolist())

train.shape

(100, 784)

In [21]:
# Scikit-learn K-Means (for comparison)
kmeans = KMeans(n_clusters=10, init='random', max_iter=10, random_state=0, n_init='auto')
kmeans.fit(train[0:785])
labels = kmeans.labels_
centers = kmeans.cluster_centers_
centers.shape, labels.shape

((10, 784), (100,))

#### Part 1: Training K-Means Classifier

In [22]:
# Training K-Means classifier
def K_Means_Classifier(data, n_clusters, max_iter):

    # Initializing the clusters evenly
    # centers = data[[j * int(np.floor(data.shape[0] / n_clusters)) for j in range(n_clusters)]]
    
    # Random initialization
    centers = data[np.random.choice(data.shape[0], size=n_clusters, replace=False)]

    # Compute Euclidean distances to each cluster center and return the minimum value's index
    def assign_cluster(vector):
        distances = []
        for cluster in range(n_clusters):
            distances.append(sum([(vector[i] - centers[cluster][i])**2 for i in vector]))
        return np.argmin(np.array(distances))

    labels = np.array([assign_cluster(i) for i in data])

    # Re-assign clusters
    for iteration in range(max_iter):
        centers = np.array([np.mean(np.array(data[np.where(labels == cluster)]), axis=0) for cluster in range(n_clusters)])

        new_labels = np.array([assign_cluster(i) for i in data])

        # Check for convergence by comparing labels with previous iteration
        if np.array_equal(labels, new_labels):
            print("Convergence for k =", n_clusters, " at iteration #", iteration)
            return centers, labels
        else:
            labels = new_labels

    return centers, labels

#### Parts 2, 3: Applying K-Means to MNIST and Accuracy Calculations

In [23]:
# Run the algorithm
max_iter = 5

for k in [5, 10, 20, 40]:
    centers, labels = K_Means_Classifier(train, k, max_iter)
    accuracies = []
    clusters = [train_y[np.where(labels == cluster)] for cluster in range(k)]

    # Compute accuracies
    for cluster in range(k):
        _, counts = np.unique(np.array(clusters[cluster]), return_counts=True)
        # Sometimes the algorithm may converge such that all datapoints are forced into a single cluster
        # In this case, assume that cluster's accuracy is 0
        accuracies.append(0 if counts.shape[0] < 1 else max(counts)/len(clusters[cluster]))

    # Output the average accuracy
    print("Accuracy for k=", k, ": ", sum(accuracies) / k)

Accuracy for k= 5 :  0.27829573934837093
Accuracy for k= 10 :  0.3636510711510712
Convergence for k = 20  at iteration # 4
Accuracy for k= 20 :  0.4832142857142857
Convergence for k = 40  at iteration # 4
Accuracy for k= 40 :  0.705


In [24]:
print("Cluster accuracies for K=40")
np.array(accuracies)

Cluster accuracies for K=40


array([0.5       , 0.33333333, 0.66666667, 0.25      , 1.        ,
       1.        , 0.25      , 1.        , 1.        , 1.        ,
       0.66666667, 0.5       , 1.        , 0.25      , 1.        ,
       1.        , 1.        , 1.        , 0.83333333, 0.5       ,
       0.66666667, 0.33333333, 0.5       , 0.5       , 1.        ,
       0.75      , 1.        , 0.5       , 1.        , 1.        ,
       1.        , 0.66666667, 1.        , 0.2       , 1.        ,
       0.5       , 0.5       , 0.5       , 0.33333333, 0.5       ])

#### Part 4: Analysis

**Which k value produces the best results?**

The higher values of $k$ produce better results according to cluster consistency, so $K=40$ resulted in the highest 'accuracy' at $73%$. This is because cluster consistency is naturally biased toward smaller cluster sizes (ie. larger $K$ values). Mathematically the $N_i$ term will be lower so it is easier to score a higher accuracy. Observe that many of the individual cluster accuracies were 100% in the output above. In this sense, cluster consistency is a misleading accuracy measure. Graphically it would also visibly not lead to good results if done on 2D data; for example 10 clusters on 10 datapoints would result in 1 datapoint per cluster, or 100% consistency but this defeats the purpose of clustering as a means of grouping like items

Smaller K values result in larger cluster sizes which statistically results in larger entropies (ie. greater diversity within clusters). This is especially true for trying to fit data containing 10 unique labels into clusters of $K=5$; there are more labels than clusters so the maximum accuracy would only be 50% in this case

However, intuitively $K=10$ should score the highest because there are 10 unique values in the dataset (representing numbers 0-9) but the algorithm converged to a local optima 