# Flat Clustering

In this notebook, you will get familiar with two flat clustering methods: k-means and EM clustering.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Data: iris flowers

We start by clustering a small dataset: the iris flowers dataset. This dataset consists of four attributes which describe the properties of a set of iris flowers. Each flower is labeled with its type, or *species*. There are three different species in the dataset: Iris setosa, Iris versicolor and Iris virginica.

In [None]:
data = pd.read_csv('../datasets/iris.csv')
data.head()

Note that, as opposed to classification, clustering is an unsupervised task. Therefore, we remove the class label from the data, so that it is unknown to the clustering methods.

In [None]:
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
X.head()

## Implementing k-means clustering

**Exercise 1:** Implement the k-means algorithm as you have seen it in the lectures, using the Euclidean distance. Your implementation should take two parameters: the number of clusters and the maximal number of iterations. Initialize the cluster centroids by randomly sampling points from the given dataset.

In [None]:
def k_means(X, n_clusters=3, max_iter=1000):
    """
    Cluster the data `X` using k-means.
    
    Parameters
    ----------
    X : pd.DataFrame
        Input data.
    n_clusters : int
        The number of clusters. Default: 3
    max_iter : int
        Maximal number of iterations. Default: 1000
    """

    # Your implementation here
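
For reference, one possible shape of such an implementation is sketched below. It is only a sketch under a few assumptions that are not part of the exercise statement: the function is named `k_means_sketch` to keep it apart from your own `k_means`, it takes an extra `seed` parameter for reproducibility, it stops early once the assignments no longer change, and it leaves a centroid where it is if its cluster becomes empty.

```python
import numpy as np

def k_means_sketch(X, n_clusters=3, max_iter=1000, seed=0):
    """Return (labels, centroids) for the data X using plain k-means."""
    points = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centroids by sampling distinct data points at random.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[start]
    labels = None
    for _ in range(max_iter):
        # Euclidean distance from every point to every centroid: shape (n, k).
        diffs = points[:, None, :] - centroids[None, :, :]
        distances = np.linalg.norm(diffs, axis=2)
        new_labels = distances.argmin(axis=1)
        # Assumption: stop as soon as the assignments are stable.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of its assigned points.
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid for an empty cluster
                centroids[k] = members.mean(axis=0)
    return labels, centroids
```

Calling `k_means_sketch(X, n_clusters=3)` on the iris attributes would return one integer cluster id per flower together with the final centroids.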

**Exercise 2:** Run your implementation of k-means clustering on this dataset and choose an appropriate number of clusters by visualizing the clusters.

Hint: you can use the <a href="https://seaborn.pydata.org/generated/seaborn.pairplot.html">`pairplot`</a> function of `seaborn` to plot a grid of all attribute pairs.

Even though class labels are typically not given in clustering problems, it would be interesting for this dataset to check whether the clusters that we found using k-means actually correspond to the species of the flowers. However, we cannot compare the assigned clusters with the class labels directly by computing e.g. the accuracy score as we did for evaluating classifiers.

**Question:** Why can we not simply compute the accuracy score?

**Exercise 3:** Look for a <a href="https://scikit-learn.org/stable/modules/classes.html#clustering-metrics">metric</a> that tells you how well the clusters correspond with the class labels. Compare the clusters to the given actual labels.
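
The toy example below may help to see what such a metric has to cope with. It compares a naive element-wise accuracy with the adjusted Rand index, which is one of several metrics in the linked list; choosing it here is an assumption, not a prescribed answer. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Exactly the same grouping as true_labels, but with different cluster ids.
cluster_ids = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

# Element-wise comparison treats the ids as if they were class names.
naive_accuracy = np.mean(true_labels == cluster_ids)
# The adjusted Rand index only looks at which points are grouped together.
ari = adjusted_rand_score(true_labels, cluster_ids)

print(naive_accuracy, ari)
```

Here the grouping is perfect, yet the naive accuracy is 0 while the adjusted Rand index is 1.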

## EM clustering

Now, we will try to improve this score by using a different clustering algorithm. By looking at the plots that visualize the k-means clusters, we can see that two of the clusters are close to each other. The data points close to this boundary may therefore be assigned to the wrong cluster. Let's instead try a different clustering method based on the expectation-maximization algorithm (look for `GaussianMixture` in the `scikit-learn` documentation).

**Exercise 4:** Cluster the Iris flowers using the algorithm described above. Visualize the clusters again and compare the correspondence to the class labels with the k-means algorithm. Does this method result in a better clustering?
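
A minimal sketch of how `GaussianMixture` can be called is given below. It runs on synthetic blobs rather than the iris data, and the variable names, the number of components and the `random_state` value are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in data: three Gaussian blobs in two dimensions.
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
    rng.normal(10.0, 0.5, size=(50, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0)
# Fit the mixture with EM and return the most likely component per point.
gmm_labels = gmm.fit_predict(X_demo)
# Soft assignments: one probability per component for every point.
gmm_probs = gmm.predict_proba(X_demo)
```

Unlike k-means, the mixture model also gives a probability for each cluster, which can be informative for points near the boundary between two clusters.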

## Clustering images

Let's proceed with a more interesting dataset. Instead of iris flowers, we will now cluster images of famous football players. For this, you will need some additional packages. Make sure you've installed the packages listed below:

In [None]:
!pip install cmake
!pip install dlib
!pip install opencv-python
!pip install face_recognition
!pip install imutils

If you use Anaconda on Windows, you might need to install some packages using Anaconda instead of pip: open Anaconda Navigator, go to Environments, click on the arrow next to the base (root) environment, click "Open Terminal" and run the following command: `conda install -y -c conda-forge package_name`. You can also install pre-downloaded packages.

Unfortunately, we cannot give in-depth support for installing these packages on Windows.

If you cannot install `face_recognition` or `dlib`, skip the imports and the feature extraction cell below, and load the already processed features instead, as described right after the feature extraction cell.

In [None]:
import cv2
import face_recognition
from imutils import paths

### Feature extraction

Instead of using the raw pixels of the images directly, we first cut out the faces from the images and extract a feature vector representation from each face. As this is not the goal of this exercise, the code for doing this step is given:

In [None]:
# Use a name that does not shadow the built-in `dir`
image_dir = '../datasets/images'
data = []
for path in paths.list_images(image_dir):
    print(path)
    # Load the images
    rgb = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    # Find the bounding boxes of the faces
    boxes = face_recognition.face_locations(rgb, model='cnn')
    # Encode the faces
    encodings = face_recognition.face_encodings(rgb, boxes)
    # Add to the dataset
    data.extend([
        {'path': path, 'rgb': rgb, 'box': box, 'encoding': encoding}
        for (box, encoding) in zip(boxes, encodings)
    ])
    
X = pd.DataFrame([d['encoding'] for d in data])
data = pd.DataFrame(data)

If the cell above runs too slowly, you can also load the precomputed results instead (this requires installing the `tables` package):

In [None]:
data = pd.read_hdf('images.hdf', key='images')
X = pd.DataFrame([data.loc[i, 'encoding'] for i in data.index])

### Visualization

Whereas we only had four attributes in the Iris dataset, we now have a much larger number of features, which makes it hard to visualize the data points. Instead, we can project the points onto a two-dimensional plane using a dimensionality reduction method. A commonly used reduction method is the <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">*t-distributed stochastic neighbor embedding*</a>, or t-SNE for short, which is also implemented in `scikit-learn`.

**Exercise 5:** Visualize a two-dimensional <a href="https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html">t-SNE</a> representation of the encodings. 
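
The call pattern for `TSNE` could look roughly like the sketch below. It uses random 128-dimensional vectors as a stand-in for the face encodings, and the `perplexity` and `random_state` values are arbitrary assumptions that you will probably want to tune.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Random stand-in for the face encodings: 60 vectors of length 128.
encodings_demo = rng.normal(size=(60, 128))

# The perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
# One two-dimensional point per input vector.
embedding = tsne.fit_transform(encodings_demo)
```

The two columns of `embedding` can then be drawn as a scatter plot, for example with `plt.scatter(embedding[:, 0], embedding[:, 1])`.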

### Clustering

**Exercise 6:** Use your k-means implementation to cluster the images, using the t-SNE representation as input data.

**Exercise 7:** Do the clusters actually make sense? Plot the images of the corresponding faces for each cluster.
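
To plot only the faces rather than the full images, the bounding boxes stored in the `box` column can be used to crop the `rgb` arrays. The helper below is a small sketch; the name `crop_face` is hypothetical, and the blank test image only serves to show the indexing.

```python
import numpy as np

def crop_face(rgb, box):
    """Cut the face region out of an RGB image array."""
    # face_recognition returns boxes ordered as (top, right, bottom, left).
    top, right, bottom, left = box
    return rgb[top:bottom, left:right]

# Blank 100x120 image and a made-up bounding box, for illustration only.
image = np.zeros((100, 120, 3), dtype=np.uint8)
face = crop_face(image, (10, 90, 60, 40))
print(face.shape)  # (50, 50, 3)
```

For each cluster, you could then loop over its rows in `data` and show `crop_face(row['rgb'], row['box'])` with `plt.imshow`.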