# K-Means

In this lab we will implement an application of the K-Means clustering algorithm, the first in our line of unsupervised methods. 

As a reminder of the core concept: given a set of data points $X$ to be clustered into $k$ groups, the algorithm aims to determine $k$ cluster centers. Each point is then assigned to the cluster whose center is closest to it.

Let $c_j \in \mathbf{R}^n$ denote the center of cluster $j$, for $j = \overline{1,k}$.

Each data point $x_i \in X$ is assigned to cluster $\arg\min_j d(x_i, c_j)$, where $d$ is the chosen distance function.

When determining the cluster centers, we start by placing $k$ cluster centers at random within our data distribution. Subsequently, a series of two steps is performed repeatedly until convergence:

1. Expectation - Based on the current centroids, each point is assigned to a cluster.
2. Maximization - Based on the current cluster assignment, each centroid is moved to the mean of the points assigned to its cluster.

The following figure illustrates this process:

![kmeans](./kmeans.png)

The expectation and maximization steps are repeated for a set number of iterations or until the cluster centers have stabilized and no longer update via this procedure.
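To make the two steps concrete, here is a minimal NumPy sketch of the loop described above. It is only an illustration, not the sklearn implementation: the function name `kmeans_sketch`, the squared Euclidean distance, and the choice of random data points as initial centers are all assumptions made for this example.

```python
import numpy as np

def assign(X, centers):
    # Expectation: index of the closest center (squared Euclidean distance).
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Hypothetical helper: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        labels = assign(X, centers)
        # Maximization: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return assign(X, centers), centers

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

Note that the result depends on the initial centers: the procedure only guarantees a stable configuration, which is not necessarily the best possible clustering.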

## Implementation

For this lab we will use the sklearn implementation of the K-Means algorithm. The following is a short usage example.

In [2]:
from sklearn.cluster import KMeans
import numpy as np

# loading the data
X = np.array([
    [1, 2], [1, 4], [1, 0],
    [10, 2], [10, 4], [10, 0]
])

# instantiating the model and fitting the data, similar to previous sklearn models
kmeans = KMeans(n_clusters = 2, random_state = 0).fit(X)
print(kmeans.labels_)
# array([1, 1, 1, 0, 0, 0], dtype=int32)

# using the trained model to make predictions for new data
kmeans.predict([[0, 0], [12, 3]])
# array([1, 0], dtype=int32)

print(kmeans.cluster_centers_)
# array([[10.,  2.], [ 1.,  2.]])

[1 1 1 0 0 0]
[[10.  2.]
 [ 1.  2.]]


## Application

In this lab we will start implementing the pipeline of a large-scale unsupervised classification procedure. This method was proposed by Caron et al. in the following ECCV 2018 paper: https://openaccess.thecvf.com/content_ECCV_2018/papers/Mathilde_Caron_Deep_Clustering_for_ECCV_2018_paper.pdf

As described in the paper, the goal is to train a neural network to classify samples without the need for human annotations. In order to do this, the authors propose an adaptation of the expectation-maximization procedure.

They start with a randomly initialized neural network and alternate between the following two steps until convergence:

1. Expectation - (i) the data is passed through the neural network in order to extract features, (ii) the samples are clustered using the k-means algorithm based on the extracted features
2. Maximization - the network is trained to match the labels assigned by the k-means algorithm

We will divide this procedure into steps and implement those steps sequentially.

First we will load the data. For this task we will work with the MNIST dataset. The data has been prepared in the mnist folder, which contains 3 subfolders: train, val and test. Within each subfolder, samples have a filename which follows the pattern {label}\_{sample\_number}.{extension}. That is, in order to determine the label of a given sample, one can simply split the filename on the '\_' character and convert the first token to int.
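As a small illustration of this naming convention, the label can be recovered as sketched below. The helper name `label_from_filename` and the example path are hypothetical and only serve to show the split.

```python
import os

def label_from_filename(filename):
    """Hypothetical helper: return the integer label before the first underscore."""
    basename = os.path.basename(filename)  # drop any leading folders
    return int(basename.split('_')[0])

# The path below is made up; it only follows the {label}_{sample_number}.{extension} pattern.
print(label_from_filename('mnist/train/7_00042.png'))  # 7
```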

As a first task, write the code that reads the data. Complete the 'read_dataset' function, which, given a use case (train, val, test) reads the data from the appropriate folder and returns a numpy array of images and a numpy array of labels for those images. As a preprocessing step, we will normalize the images via division by 255.

In [None]:
def read_dataset(use): # use can be 'train', 'val' or 'test'
    
    # your code here
    
    return images, labels
    
train_images, train_labels = read_dataset('train')
val_images, val_labels = read_dataset('val')
test_images, test_labels = read_dataset('test')

Next, we will define our neural network.

Define a convolutional neural network with the following architecture:

- a convolutional layer with 64 filters, kernel size of 5, and relu activation
- a max pooling layer with a size of 2
- a convolutional layer with 64 filters, kernel size of 5, and relu activation
- a max pooling layer with a size of 2
- a flattening layer
- a dense layer with a size of 512 and relu activation with the name 'features'. We name this layer (by specifying name = 'features' in the layer parameters) in order to be able to access its features. We will use those features to perform clustering
- a dense layer with 10 neurons and softmax activation function for classification

Compile the model using the Adam optimizer, a sparse categorical cross-entropy loss and accuracy as an evaluation metric.

In [None]:
model = # your code here

You can test your implementation of the model by training it for an epoch on the MNIST dataset and checking its accuracy on the test set (which should be above 95%).

Next, we will define a K-Means model with 10 clusters.

In [None]:
kmeans = # your code here

Now that we have defined our prerequisites, we will start the actual procedure. For the expectation step, we will pass the data through the model, taking the output of the second to last layer, and cluster the samples based on those features.

We can get the output of the 'features' layer in tensorflow by defining a new "partial model" which takes inputs the same way the normal model does and outputs the result of the 'features' layer.

In [None]:
import tensorflow as tf

partial_model = tf.keras.models.Model(
    inputs = model.input,
    outputs = model.get_layer('features').output
)

Now, we should pass all the training data through the partial model in order to get the features. Iterate through the data, passing batches (of 32 or 64 samples) through the model and collecting the representations for the entire dataset.
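The batching pattern itself can be sketched independently of the network. In the example below, `extract_in_batches` is a hypothetical helper and the lambda is only a stand-in for the feature extractor; in the lab, a callable such as `partial_model.predict_on_batch` could plausibly take its place.

```python
import numpy as np

def extract_in_batches(predict_fn, data, batch_size=64):
    """Hypothetical helper: apply predict_fn to consecutive slices and stack the results."""
    outputs = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]  # the last batch may be smaller
        outputs.append(np.asarray(predict_fn(batch)))
    return np.concatenate(outputs, axis=0)

# Stand-in for a feature extractor: flattens each 28x28 image to 784 values.
fake_images = np.zeros((150, 28, 28), dtype=np.float32)
features = extract_in_batches(lambda b: b.reshape(len(b), -1), fake_images)
print(features.shape)  # (150, 784)
```

Collecting the per-batch outputs in a list and concatenating once at the end avoids repeatedly reallocating a growing array.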

In [None]:
kmeans_features = # your code here

Finally, fit the K-Means model on the collected features and get their labels.

In [None]:
kmeans_labels = # your code here

For the maximization step, train the neural network for an epoch on the training images, using kmeans_labels as the targets. In the interest of time, for this exercise you can shorten the training by specifying steps_per_epoch = 100.

In [None]:
model.fit(
    # your code here
)

After implementing the expectation and maximization procedures, it is time to formally evaluate our model. When doing this, we should take into account that the clustering labels are most likely not aligned with the supervised labels. Therefore, when computing the performance on the test set, we must match the labels predicted by the network with the test labels.

Firstly, iterate through the test dataset in order to get the network predictions for each sample. Due to memory constraints, it is recommended to do this in batches.

In [None]:
predictions = # your code here

In order to compute the accuracy properly, we have to match the clustering labels with the real labels. We will implement a very straightforward way of doing this: after computing the confusion matrix, we assign each cluster to the label with which it has the most predictions in common. That is, a given cluster (predicted label) $i$ is replaced by the position of the highest value in column $i$ of the confusion matrix (assuming rows index the true labels and columns the predicted ones).
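The matching rule can be illustrated on a toy example with 3 classes. This is a sketch under the assumption that rows of the confusion matrix index the true labels and columns the predicted clusters (the same convention as `sklearn.metrics.confusion_matrix`); the helper name `map_clusters_to_labels` and the toy arrays are made up for the illustration.

```python
import numpy as np

def map_clusters_to_labels(true_labels, cluster_ids, n_classes=10):
    """Hypothetical helper: relabel each cluster with its most frequent true label."""
    # Confusion matrix with rows = true labels, columns = predicted clusters.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (true_labels, cluster_ids), 1)
    # For each cluster (column), pick the true label (row) with the most samples.
    cluster_to_label = cm.argmax(axis=0)
    return cm, cluster_to_label[cluster_ids]

true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_ids = np.array([2, 2, 0, 0, 1, 0])
cm, adjusted = map_clusters_to_labels(true_labels, cluster_ids, n_classes=3)
print(adjusted)                          # [0 0 1 1 2 1]
print((adjusted == true_labels).mean())  # 5 of the 6 samples match
```

This mapping is not necessarily one-to-one: several clusters may end up assigned to the same label, and a cluster that receives no samples defaults to label 0.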

In [None]:
confusion_matrix = # your code here
adjusted_predictions = # your code here

Finally, evaluate the model by computing the accuracy, comparing the adjusted predictions with the test labels.

In [None]:
acc = # your code here
print(acc)

As a final task, you can continue the training by looping over the expectation and maximization steps.