# Clustering: K-means

Let's start clustering our data.

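K-means is one of the most widely used clustering algorithms, largely because it is fast. Given the number of clusters *k* and an initial position for each cluster centre, it alternates between assigning every point to its nearest centre and moving each centre to the mean of its points, until the centres stop moving.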
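
To make that description concrete, here is a minimal NumPy sketch of the assign-then-update loop (often called Lloyd's algorithm). It is an illustration only, not scikit-learn's implementation; the function name `lloyd_kmeans`, the toy blobs, and the starting centres are all made up for this example.

```python
import numpy as np

def nearest_centre(points, centres):
    """Index of the closest centre for every point."""
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, init_centres, n_iter=100, tol=1e-4):
    """Illustrative k-means loop (hypothetical helper, not sklearn's code)."""
    centres = np.asarray(init_centres, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centre
        labels = nearest_centre(points, centres)
        # Update step: move each centre to the mean of its points
        new_centres = centres.copy()
        for j in range(len(centres)):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centre
                new_centres[j] = members.mean(axis=0)
        # Stop once the centres have (almost) stopped moving
        shift = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if shift < tol:
            break
    return centres, nearest_centre(points, centres)

# Two well-separated toy blobs (made-up data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centres, labels = lloyd_kmeans(points, init_centres=[[1.0, 1.0], [4.0, 4.0]])
print(centres)
```

The result depends on the starting centres, which is why the initialisation strategy matters later in this notebook.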

References:<br>
K-means Clustering: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html <br>
Color Quantization using K-Means: https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html <br>

## Installation

In [None]:
%pip install numpy
%pip install -U matplotlib
%pip install scikit-learn
%pip install scipy

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.utils import shuffle

## Clustering the Iris dataset

Once again, let's use the Iris dataset.

### Viewing the Iris dataset

Let's view the dataset one more time.

In [None]:
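# Load the data as a pandas DataFrame/Series for easy viewing
iris_x, iris_y = datasets.load_iris(return_X_y=True, as_frame=True)
# Load the same data as NumPy arrays for plotting the ground truth later
X, y = datasets.load_iris(return_X_y=True)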

In [None]:
iris_x.head()

In [None]:
iris_y.head()

### Choosing our inputs

A scatter plot can show at most three dimensions, so let's pick three of the four features: petal width, sepal length, and petal length.

In [None]:
# Select the petal width, petal length, and sepal length columns, then convert them to NumPy arrays


<details><summary>Click to cheat</summary>

```python
# Select the petal width, petal length, and sepal length columns, then convert them to NumPy arrays
iris_petal_w = iris_x["petal width (cm)"].to_numpy()
iris_petal_l = iris_x["petal length (cm)"].to_numpy()
iris_sepal_l = iris_x["sepal length (cm)"].to_numpy()
```
</details>

### Creating our K-means clusterer

In [None]:
# Based on the number of types of flowers, what should k be?
k =

# Create the models
models = [
    # Create a K-means clusterer with random initialisation
    
    # Create a K-means clusterer with k-means++ initialisation
    
    # Create a K-means clusterer with random initialisation and high tol
    
]

# Cluster the data
models2 = [model.fit(iris_x.to_numpy()) for model in models]

<details><summary>Click to cheat</summary>

```python
# Based on the number of types of flowers, what should k be?
k = 3

# Create the models
models = [
    # Create a K-means clusterer with random initialisation
    KMeans(n_clusters=k, init='random'),
    # Create a K-means clusterer with k-means++ initialisation
    KMeans(n_clusters=k, init='k-means++'),
    # Create a K-means clusterer with random initialisation and high tol
    KMeans(n_clusters=k, init='random', tol=1e-2)
]

# Cluster the data
models2 = [model.fit(iris_x.to_numpy()) for model in models]
```
</details>
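
The three models differ only in how the centres are initialised and how early the iteration is allowed to stop. One way to see the effect is to compare each fitted model's `inertia_` (the sum of squared distances to the nearest centre) and `n_iter_` (iterations run). The sketch below is self-contained; `n_init=1` and `random_state=0` are choices made for this example so that a single run per setting is compared reproducibly.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X_iris, _ = datasets.load_iris(return_X_y=True)

settings = {
    "Random init": dict(init="random"),
    "K-means++": dict(init="k-means++"),
    "High tolerance": dict(init="random", tol=1e-2),
}

results = {}
for name, kwargs in settings.items():
    # n_init=1 and random_state=0 are assumptions made for this sketch
    model = KMeans(n_clusters=3, n_init=1, random_state=0, **kwargs).fit(X_iris)
    results[name] = (model.inertia_, model.n_iter_)
    print(f"{name:15s} SSE = {model.inertia_:7.2f}   iterations = {model.n_iter_}")
```

A larger `tol` lets the algorithm stop while the centres are still moving slightly, so it can trade a little accuracy for fewer iterations.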

### Plotting the clustered data

In [None]:
titles = ["Random init", "K-means++", "High tolerance"]
for idx, (title, est) in enumerate(zip(titles, models2)):
    fig = plt.figure(idx + 1, figsize=(4, 3))
    ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)
    # The models were already fitted above, so we only need their labels
    labels = est.labels_

    ax.scatter(iris_petal_w, iris_sepal_l, iris_petal_l, c=labels.astype(float), edgecolor="k")

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel("Petal width")
    ax.set_ylabel("Sepal length")
    ax.set_zlabel("Petal length")
    ax.set_title(title)

# Plot the ground truth
fig = plt.figure(len(models2) + 1, figsize=(4, 3))
ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)

for name, label in [("Setosa", 0), ("Versicolour", 1), ("Virginica", 2)]:
    ax.text3D(
        iris_petal_w[y == label].mean(),
        iris_sepal_l[y == label].mean(),
        iris_petal_l[y == label].mean() + 2,
        name,
        horizontalalignment="center",
        bbox=dict(alpha=0.2, edgecolor="w", facecolor="w"),
    )
# Reorder the labels to have colors matching the cluster results
# (stored in a new variable so that y itself is left unchanged)
y_colours = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y_colours, edgecolor="k")

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel("Petal width")
ax.set_ylabel("Sepal length")
ax.set_zlabel("Petal length")
ax.set_title("Ground Truth")

plt.show()
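
Comparing the plots by eye only goes so far, partly because the cluster numbers k-means assigns are arbitrary (which is why the ground-truth labels are reordered before colouring). As a rough numerical check, the sketch below uses scikit-learn's `adjusted_rand_score`, which ignores how the clusters are numbered, together with a confusion matrix. The `n_init=10` and `random_state=0` settings are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, confusion_matrix

X_iris, y_true = datasets.load_iris(return_X_y=True)

# n_init=10 and random_state=0 are choices made for this sketch
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_iris)

# 1.0 means the clusters match the species exactly; values near 0 mean a chance-level match
ari = adjusted_rand_score(y_true, cluster_labels)
print(f"Adjusted Rand index: {ari:.2f}")

# Rows are the true species, columns are the (arbitrarily numbered) clusters
print(confusion_matrix(y_true, cluster_labels))
```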

## Colour Quantization Using K-Means

We can also use K-means to compress an image down to its core colours.

### Loading the China image

First things first, we need to load the image.

In [None]:
# Load the raw image


# Convert from 8-bit ints to floats from 0-1


<details><summary>Click to cheat</summary>

```python
# Load the raw image
china = datasets.load_sample_image("china.jpg")

# Convert from 8-bit ints to floats from 0-1
china = np.array(china, dtype=np.float64) / 255.0
```
</details>

Let's see what the original image looks like.

In [None]:
plt.figure(1)
plt.clf()
plt.axis("off")
plt.title("Original image (96,615 colors)")
plt.imshow(china)

### Create the model

In [None]:
# Convert to an array where each row is a pixel and each column is an RGB channel


# Take a random sample of 1,000 pixels to fit the model on


# Create the K-means clusterer



# Fit the model to the sampled pixels


<details><summary>Click to cheat</summary>

```python
# Convert to an array where each row is a pixel and each column is an RGB channel
w, h, d = china.shape
china2 = np.reshape(china, (w * h, d))

# Take a random sample of 1,000 pixels to fit the model on
img_rand = shuffle(china2, random_state=0, n_samples=1000)

# Create the K-means clusterer
n_colours = 64
kmeans = KMeans(n_clusters=n_colours, random_state=0)

# Fit the model to the sampled pixels
kmeans.fit(img_rand)
```
</details>

### Test the model

Let's see how our model has compressed our image.

In [None]:
# Get the labels from our model
labels = kmeans.predict(china2)

def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    return codebook[labels].reshape(w, h, -1)

plt.figure(2)
plt.clf()
plt.axis("off")
plt.title(f"Quantized image ({n_colours} colors, K-Means)")
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))
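
Beyond looking at the picture, it can help to put numbers on the trade-off: how far the quantized pixels are from the originals, and roughly how much storage a palette-based encoding would save. The sketch below is a rough illustration under stated assumptions. A small random array stands in for the China image so that the block runs on its own, and the storage figures assume 24 bits per original pixel and a simple "palette index per pixel" scheme; real image formats compress further.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small random stand-in image (hypothetical data, not a real photo)
rng = np.random.default_rng(0)
image = rng.random((40, 60, 3))
rows, cols, channels = image.shape
pixels = image.reshape(rows * cols, channels)

n_palette = 16
km = KMeans(n_clusters=n_palette, n_init=1, random_state=0).fit(pixels)
pixel_labels = km.predict(pixels)

# Replace every pixel with the centre of its cluster
quantized = km.cluster_centers_[pixel_labels].reshape(rows, cols, channels)

# Mean squared error between the original and the quantized image
mse = np.mean((image - quantized) ** 2)

# Rough storage estimate: raw RGB vs. one palette index per pixel plus the palette
bits_per_index = int(np.ceil(np.log2(n_palette)))
original_bits = rows * cols * 24
quantized_bits = rows * cols * bits_per_index + n_palette * 24

print(f"MSE: {mse:.4f}")
print(f"Approximate size ratio: {quantized_bits / original_bits:.2f}")
```

Random noise is close to a worst case for quantization; a photograph with large areas of similar colour should give a much lower error for the same palette size.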

## Finding the optimal number of clusters

Now we can tune our compression by finding the elbow point: the number of clusters beyond which adding more colours only slightly reduces the error.

### Loading the China image

In [None]:
# Load the raw image
china = datasets.load_sample_image("china.jpg")

# Convert from 8-bit ints to floats from 0-1
china = np.array(china, dtype=np.float64) / 255.0

# Reshape the image so that each row is a pixel and each column is an RGB channel
w, h, d = china.shape
china2 = np.reshape(china, (w * h, d))

### Calculate the Sum of Squared Errors and Silhouette Scores

In [None]:
# There are a lot of colours in the original image, so
# let's try various k values across a large range with a large step


# Create arrays to store the SSE values and the silhouette scores


# Iterate through the different K-Means models, calculating the SSE
# and the silhouette score
for i, k in enumerate(K):





<details><summary>Click to cheat</summary>

```python
from sklearn.metrics import silhouette_score

# There are a lot of colours in the original image, so
# let's try various k values across a large range with a large step
K = np.arange(10, 71, 10)

# Create arrays to store the SSE values and the silhouette scores
sse = np.zeros(len(K))
ss = np.zeros(len(K))

# Iterate through the different K-Means models, calculating the SSE
# and the silhouette score
for i, k in enumerate(K):
    kmeans = KMeans(n_clusters=k).fit(china2)
    labels = kmeans.predict(china2)
    sse[i] = kmeans.inertia_
    # The silhouette score compares pairs of points, so estimate it from
    # a random sample of pixels (the sample size is an arbitrary choice)
    ss[i] = silhouette_score(china2, labels, sample_size=1000, random_state=0)

```
</details>

### Plot the graph

In [None]:
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.plot(K, sse, label="Sum of squared errors")
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.xticks(K)
plt.legend()

plt.subplot(122)
plt.plot(K, ss, label="Silhouette score")
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette score")
plt.xticks(K)
plt.legend()

plt.tight_layout()
plt.show()
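
Reading the elbow off a plot is subjective. One common heuristic, sketched below, picks the point on the SSE curve that lies farthest from the straight line joining its first and last points. This is only one of several possible rules, the helper name `elbow_index` is made up for this example, and the SSE values are invented numbers used purely to show the mechanics.

```python
import numpy as np

def elbow_index(x, y):
    """Index of the point farthest from the line joining the first and last points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Rescale both axes to [0, 1] so the result does not depend on units
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    # Direction of the chord from the first point to the last point
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    # Perpendicular distance of every point from that chord
    dist = np.abs(dy * (xn - xn[0]) - dx * (yn - yn[0])) / np.hypot(dx, dy)
    return int(dist.argmax())

# Made-up SSE values for illustration; substitute the real `sse` array
k_values = np.arange(10, 71, 10)
sse_example = np.array([900.0, 500.0, 330.0, 260.0, 225.0, 205.0, 190.0])

best_k = k_values[elbow_index(k_values, sse_example)]
print(f"Suggested number of clusters: {best_k}")
```

Treat the suggestion as a starting point and check it against the plot and the silhouette scores before settling on a value.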

### Generate the compressed image

In [None]:
# Create the model with the k chosen from the plots above
k = 35
kmeans = KMeans(n_clusters=k)

# Fit the model to the data
kmeans.fit(china2)

# Get the labels from our model
labels = kmeans.predict(china2)

def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    return codebook[labels].reshape(w, h, -1)

plt.figure(2)
plt.clf()
plt.axis("off")
plt.title(f"Optimal quantized image ({k} colors, K-Means)")
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))