# Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import random

from sklearn import datasets


Suppose the data you are looking at form $\textit{clusters}$ in a $d$-dimensional input space. Our goal is to identify these clusters and to find a representative mean value for each of them.

In the simple k-means algorithm, the number of clusters **K** must be chosen in advance. The data points are then partitioned into K sets by minimizing the total sum of squared distances from each point to the mean of its assigned cluster.
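
The quantity being minimized can be written down in a few lines. The sketch below is only an illustration on toy data; the function name `sum_of_squared_distances` and the arrays `X_toy`, `centers_toy` and `labels_toy` are assumptions made for this example and are not part of the template.

```python
import numpy as np

def sum_of_squared_distances(X, centers, labels):
    """Total squared Euclidean distance from each point to its assigned center."""
    diffs = X - centers[labels]  # (n_samples, n_dim): offset of each point from its center
    return float(np.sum(diffs ** 2))

# Toy data (assumed for illustration): two obvious groups.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers_toy = np.array([[0.0, 1.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

J = sum_of_squared_distances(X_toy, centers_toy, labels_toy)
print(J)  # 4.0, every point lies at distance 1 from its center
```

A worse assignment of the same points, for example `labels = [0, 1, 0, 1]`, gives a larger value, which is what the algorithm tries to avoid.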

In [None]:
def get_data(n_samples=2000, n_clusters=4, cluster_std=.4, rs=0):
    return datasets.make_blobs(n_samples=n_samples, centers=n_clusters, cluster_std=cluster_std, random_state=rs)[0]

X = get_data()
plt.figure(figsize=(8, 8))
plt.scatter(X[:,0], X[:,1], alpha = 0.5, s=20)
plt.show()

> **Based on the template below, implement the k-means algorithm, which consists of the following steps:**
>
> - Assign each point to the mean to which it is closest.
> - If any point's assignment has changed, recompute the mean of each cluster.
> - If no point's assignment has changed, stop and keep the clusters.
> - Calculate the decision boundaries.
> - Plot the data, the cluster means, and (optionally) the decision boundaries.
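
The first three steps can be sketched as standalone functions before they are fitted into the class template. This is one possible arrangement, not the reference solution: the names `assign`, `update` and `lloyd` are hypothetical, the starting centers are passed in by hand, and keeping the old center for an empty cluster is an assumption of this sketch.

```python
import numpy as np

def assign(X, centers):
    """Index of the closest center for every sample (squared Euclidean distance)."""
    # Broadcasting: (n_samples, 1, n_dim) - (1, n_clusters, n_dim)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def update(X, labels, centers):
    """Mean of each cluster; an empty cluster keeps its previous center (assumption)."""
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return new_centers

def lloyd(X, centers, max_iteration=20):
    """Alternate assignment and mean update until no assignment changes."""
    centers = np.asarray(centers, dtype=float)
    labels = None
    for _ in range(max_iteration):
        new_labels = assign(X, centers)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed: stop and keep the clusters
        labels = new_labels
        centers = update(X, labels, centers)
    return centers, labels

# Toy data (assumed for illustration): two groups, hand-picked starting centers.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers, labels = lloyd(X_toy, centers=[[1.0, 0.0], [9.0, 0.0]])
print(centers)  # rows (0, 1) and (10, 1)
print(labels)   # [0 0 1 1]
```

On this toy input the loop stops in its second pass, because the assignments no longer change after the first mean update.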

In [None]:
class KMeans:

    def __init__(self, n_clusters=3, max_iteration=20):
        """Init k-means algorithm.

        Args:
            n_clusters (int): Assumed number of clusters.
            max_iteration (int): Maximum number of iterations.
        """
        self.n_clusters = n_clusters
        self.max_iteration = max_iteration

    def _get_initial_center(self, X):
        """Initialize the centers.

        Args:
            X (ndarray): Input samples of shape (n_samples, n_dim).

        Returns:
            ndarray: A (n_clusters, n_dim) shaped array which represents the initial centers.

        """
####################
# Your Code Here   #
####################

    def _get_initial_center_problem(self, X):
        return np.array([[0.01, 5], [0.0, 5.2], [0, 4.9]])

    def _dist(self, X):
        """Compute a distance matrix between centers and datapoints based on the
             squared euclidean distance.

        Args:
            X (ndarray): Input samples.

        Returns:
            ndarray: A (n_samples, n_clusters) shaped array representing the distance
                        of a sample to a corresponding center.

        """
####################
# Your Code Here   #
####################

    def fit(self, X):
        """Perform k-means.

        Args:
            X (ndarray): Input samples.

        """
        self.X_ = np.array(X)
        # Init centers
        self.centers = self._get_initial_center(self.X_)

        for _ in range(self.max_iteration):

            # E-Step according to Equation 9.2 in [Bishop2006].
####################
# Your Code Here   #
####################

            # M-step according to Equation 9.4 in [Bishop2006].
####################
# Your Code Here   #
####################
        return self

    def predict(self, X):
        """Calculate closest cluster center.

        Args:
            X (ndarray): Input data.

        Returns:
            ndarray: An (n_samples,) shaped array containing cluster indices.

        """
####################
# Your Code Here   #
####################


    def plot(self, X):
        """Plot the clustered data with corresponding centers.

        Args:
            X (ndarray): Input samples.

        """
####################
# Your Code Here   #
####################

In [None]:
clusterer = KMeans(max_iteration=10)
clusterer.fit(X)
clusterer.plot(X)
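
For the optional decision boundaries, one common approach is to label every point of a dense grid with its closest center and to draw the resulting regions. The sketch below assumes two-dimensional data; `grid_labels` and `plot_boundaries` are hypothetical helper names, and the grid resolution and padding are arbitrary choices.

```python
import numpy as np

def grid_labels(centers, x_range, y_range, steps=200):
    """Closest-center index for every node of a regular 2-D grid."""
    xs = np.linspace(*x_range, steps)
    ys = np.linspace(*y_range, steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    d2 = ((grid[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return xx, yy, d2.argmin(axis=1).reshape(xx.shape)

def plot_boundaries(X, centers, steps=200, pad=0.5):
    """Shade the region of each center and overlay the data and the centers."""
    # Imported here so that grid_labels can be used without matplotlib.
    import matplotlib.pyplot as plt
    x_range = (X[:, 0].min() - pad, X[:, 0].max() + pad)
    y_range = (X[:, 1].min() - pad, X[:, 1].max() + pad)
    xx, yy, zz = grid_labels(centers, x_range, y_range, steps)
    plt.figure(figsize=(8, 8))
    # One filled band per cluster index: levels at -0.5, 0.5, ..., K - 0.5
    plt.contourf(xx, yy, zz, levels=np.arange(len(centers) + 1) - 0.5, alpha=0.2)
    plt.scatter(X[:, 0], X[:, 1], alpha=0.5, s=20)
    plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=100)
    plt.show()

# Toy centers (assumed for illustration): the boundary is the vertical line x = 5.
centers_toy = np.array([[0.0, 0.0], [10.0, 0.0]])
xx, yy, zz = grid_labels(centers_toy, x_range=(-1, 11), y_range=(-1, 1), steps=5)
print(zz[0])  # labels along one grid row, from x = -1 to x = 11
```

With a fitted model, a call such as `plot_boundaries(X, clusterer.centers)` would draw the regions, assuming `clusterer.centers` holds a `(n_clusters, 2)` array as in the template.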

In the previous example the choice of $K$ was given by factors outside our control. In general this won't be the case, and a reasonable way to choose its value is to plot the sum of squared errors between each point and the **mean** of its **cluster** as a function of $K$.

> Implement a function that plots the sum of squared errors as a function of the parameter $K$.
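
As a rough picture of what such a curve involves, the sketch below uses scikit-learn's own `KMeans` (imported as `SkKMeans` to avoid clashing with the class above) as a stand-in for the class from this notebook. Its `inertia_` attribute is the sum of squared distances of the samples to their closest center. The helper names `sse_per_k` and `plot_elbow`, the demo data and the range of K are assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans as SkKMeans
from sklearn.datasets import make_blobs

def sse_per_k(X, k_values, seed=0):
    """Sum of squared errors (scikit-learn's inertia_) for each value of K."""
    return [
        SkKMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]

def plot_elbow(k_values, sse):
    """Plot the sum of squared errors against K."""
    # Imported here so that sse_per_k can be used without matplotlib.
    import matplotlib.pyplot as plt
    plt.figure(figsize=(8, 5))
    plt.plot(k_values, sse, marker="o")
    plt.xlabel("K")
    plt.ylabel("Sum of squared errors")
    plt.show()

# Demo data (assumed): four blobs, similar to get_data() but smaller.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.4, random_state=0)
k_values = list(range(1, 9))
sse = sse_per_k(X_demo, k_values)
print(np.round(sse, 1))
```

Calling `plot_elbow(k_values, sse)` draws the curve. The error typically drops sharply up to the true number of clusters and flattens afterwards, and that bend is often used as a heuristic for choosing $K$.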

In [None]:
def mse_cluster(X, k):
    """Compute the squared distance of each sample to its assigned center and return the mean.

    Args:
        X (ndarray): Input data.
        k (int): Number of centers.

    """
    ####################
    # Your Code Here   #
    ####################

In [None]:
####################
# Your Code Here   #
####################

## k-Means for Color Compression

One interesting application of clustering is color compression in images. For example, consider the following image, which is stored as a three-dimensional array of shape (height, width, channel) containing red/green/blue contributions as integers from 0 to 255.

In [None]:
china = datasets.load_sample_image("china.jpg")
plt.figure(figsize = (10, 10))
plt.axes(xticks=[], yticks=[])
plt.imshow(china)
plt.show()
print(china.shape)

One way we can see this set of pixels is as a cloud of points in a three-dimensional color space.

> Reshape the data to `(n_samples, n_features)` and rescale the colors so that they lie between 0 and 1.
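
The reshaping step can be tried on a small array first. The sketch below uses a random `uint8` array as a stand-in for the image (an assumption, so that it runs without loading `china.jpg`); the result is named `data` because the later cells refer to the reshaped pixels by that name.

```python
import numpy as np

# Stand-in image (assumed): 4 x 6 pixels, 3 channels, integers from 0 to 255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

height, width, channels = image.shape
# One row per pixel, one column per channel, scaled to the interval [0, 1].
data = image.reshape(height * width, channels) / 255.0

print(data.shape)  # (24, 3)
```

Dividing by 255.0 also converts the integers to floating point, which is what `plot_pixels` expects for its `xlim = (0, 1)` axes.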

In [None]:
####################
# Your Code Here   #
####################

> Use the following function to visualize the pixels in color space.

In [None]:
def plot_pixels(data, title, colors = None, N = 10000):
    if colors is None:
        colors = data

    # Choose a random subset of at most N pixels
    rng = np.random.RandomState(0)
    i = rng.permutation(data.shape[0])[:N]
    colors = colors[i]
    R, G, B = data[i].T

    fig, ax = plt.subplots(1, 2, figsize = (10, 5))
    ax[0].scatter(R, G, color = colors, marker = '.')
    ax[0].set(xlabel = 'Red', ylabel = 'Green', xlim = (0, 1), ylim = (0,1))

    ax[1].scatter(R, B, color = colors, marker = '.')
    ax[1].set(xlabel = 'Red', ylabel = 'Blue', xlim = (0, 1), ylim = (0,1))

    fig.suptitle(title, size = 15)

In [None]:
plot_pixels(data, title = 'Input color space: 16 million possible colors')

> Now reduce these 16 million colors to just 16, using k-means clustering across the pixel space.
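
One way to picture the reduction: fit 16 centers in color space and replace every pixel by the center it belongs to. The sketch below again uses scikit-learn's `KMeans` (imported as `SkKMeans`) in place of the class from this notebook, and random points as a stand-in for the reshaped pixels; both are assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans as SkKMeans

# Stand-in for the reshaped pixels (assumed): 500 RGB triples in [0, 1].
rng = np.random.default_rng(0)
data = rng.random((500, 3))

n_colors = 16
model = SkKMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(data)

# Replace each pixel by the center of the cluster it is assigned to.
new_colors = model.cluster_centers_[model.predict(data)]

print(new_colors.shape)                     # same shape as data
print(len(np.unique(new_colors, axis=0)))   # number of distinct colors left
```

On the full image, fitting on a random subset of the pixels or using `sklearn.cluster.MiniBatchKMeans` may be noticeably faster. Passing `colors=new_colors` to `plot_pixels` would show the reduced color space.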

In [None]:
####################
# Your Code Here   #
####################

> Show the resulting image data with the new colors in image space.
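
Going back to image space is the inverse of the earlier reshape. In the sketch below, `palette` and `labels` are random stand-ins (assumptions) for the 16 cluster centers and the per-pixel cluster indices that a fitted model would provide.

```python
import numpy as np

height, width = 4, 6  # stand-in image size (assumed)
rng = np.random.default_rng(0)
palette = rng.random((16, 3))                       # stand-in cluster centers in [0, 1]
labels = rng.integers(0, 16, size=height * width)   # stand-in cluster index per pixel

# Look up the color of every pixel, then restore the (height, width, channel) layout.
new_colors = palette[labels]
recolored = new_colors.reshape(height, width, 3)

print(recolored.shape)  # (4, 6, 3)
```

Because the values are floats between 0 and 1, `plt.imshow(recolored)` can display the array directly, in the same way the original `china` image was shown.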

In [None]:
####################
# Your Code Here   #
####################