# Clustering: K-means

Let's start clustering our data.

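K-means is *the* go-to clustering algorithm, known for its speed. Given the number of clusters and an initial guess for their centers, it repeatedly assigns each point to its nearest center and moves each center to the mean of its assigned points, until the centers stop moving.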
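
To make the idea concrete, here is a minimal NumPy sketch of the assign-then-update loop (often called Lloyd's algorithm). It is an illustration under simplifying assumptions, not scikit-learn's implementation: the function name `lloyd_kmeans`, its parameters, and the toy data are all made up for this example.

```python
import numpy as np


def lloyd_kmeans(points, initial_centers, max_iter=100, tol=1e-4):
    """Illustrative k-means loop; a sketch, not scikit-learn's implementation."""
    centers = np.asarray(initial_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                new_centers[j] = members.mean(axis=0)
        # Stop once the centers have (almost) stopped moving
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, labels


# Two well-separated toy blobs (made-up data for illustration)
rng = np.random.default_rng(0)
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
# Start from one point in each blob as the initial centers
toy_centers, toy_labels = lloyd_kmeans(toy_points, initial_centers=toy_points[[0, 50]])
print(toy_centers)
```

The result depends on the initial centers, which is why the choice of initialisation matters for the `KMeans` estimator used in the rest of this notebook.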

References:
K-means Clustering: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html <br>
Vector Quantization Example: https://scikit-learn.org/stable/auto_examples/cluster/plot_face_compress.html <br>
Color Quantization using K-Means: https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html <br>

## Installation

In [None]:
%pip install numpy
%pip install -U matplotlib
%pip install scikit-learn
%pip install scipy
%pip install pandas

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets, metrics, neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import pairwise_distances_argmin
from sklearn.utils import shuffle
import scipy as sp

## Clustering the Iris dataset

Once again, let's use the Iris dataset.

### Viewing the Iris dataset

Let's view the dataset one more time.

In [None]:
iris_x, iris_y = datasets.load_iris(return_X_y=True, as_frame=True)

In [None]:
iris_x.head()

In [None]:
iris_y.head()

### Choosing our inputs

We can plot up to 3D, so let's pick petal width, sepal length, and petal length.

In [None]:
# Select the petal width, petal length and sepal length columns, then convert to numpy arrays
iris_petal_w = iris_x["petal width (cm)"].to_numpy()
iris_petal_l = iris_x["petal length (cm)"].to_numpy()
iris_sepal_l = iris_x["sepal length (cm)"].to_numpy()

<details><summary>Click to cheat</summary>

```python
# Select the petal width, petal length and sepal length columns, then convert to numpy arrays
iris_petal_w = iris_x["petal width (cm)"].to_numpy()
iris_petal_l = iris_x["petal length (cm)"].to_numpy()
iris_sepal_l = iris_x["sepal length (cm)"].to_numpy()
```
</details>

### Creating our K-means clusterer

In [None]:
# There are three types of iris flowers in the dataset, so let's set k = 3
k = 3

# Create the models
models = [
    # Create a K-means clusterer with random initialisation
    KMeans(n_clusters=k, init='random'),
    # Create a K-means clusterer with k-means++ initialisation
    KMeans(n_clusters=k, init='k-means++'),
    # Create a K-means clusterer with the default (k-means++) initialisation and a high tol
    KMeans(n_clusters=k, tol=1e-2)
]

# Cluster the data
models2 = [model.fit(iris_x.to_numpy()) for model in models]

<details><summary>Click to cheat</summary>

```python
# There are three types of iris flowers in the dataset, so let's set k = 3
k = 3

# Create the models
models = [
    # Create a K-means clusterer with random initialisation
    KMeans(n_clusters=k, init='random'),
    # Create a K-means clusterer with k-means++ initialisation
    KMeans(n_clusters=k, init='k-means++'),
    # Create a K-means clusterer with the default (k-means++) initialisation and a high tol
    KMeans(n_clusters=k, tol=1e-2)
]

# Cluster the data
models2 = [model.fit(iris_x.to_numpy()) for model in models]
```
</details>
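
To get a feel for how the initialisation and tolerance settings affect the fit, we can compare the final inertia (the within-cluster sum of squared distances) and the number of iterations each configuration needed. This is a sketch that reloads Iris on its own so that it runs standalone. The `n_init=10` and `random_state=0` arguments and the names `X_demo`, `configs` and `summary` are choices made for this example, not settings used elsewhere in the notebook.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

# Reload Iris as plain numpy arrays so this example is self-contained
X_demo, _ = datasets.load_iris(return_X_y=True)

# n_init and random_state are set explicitly here for reproducibility (an assumption)
configs = {
    "Random init": KMeans(n_clusters=3, init="random", n_init=10, random_state=0),
    "K-means++": KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0),
    "High tolerance": KMeans(n_clusters=3, tol=1e-2, n_init=10, random_state=0),
}

summary = {}
for name, est in configs.items():
    est.fit(X_demo)
    # inertia_ is the sum of squared distances to the closest center;
    # n_iter_ is the number of iterations of the best run
    summary[name] = (est.inertia_, est.n_iter_)
    print(f"{name:>15}: inertia = {est.inertia_:.2f}, iterations = {est.n_iter_}")
```

A lower inertia means tighter clusters. A higher `tol` lets the algorithm stop earlier, possibly at a slightly worse solution.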

### Plotting the clustered data

In [None]:
titles = ["Random init", "K-means++", "High tolerance"]
for idx, (title, est) in enumerate(zip(titles, models2)):
    fig = plt.figure(idx + 1, figsize=(4, 3))
    # Add a 3D axes to the figure
    ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)
    # The models were already fitted above, so just read off the cluster labels
    labels = est.labels_

    ax.scatter(iris_petal_w, iris_sepal_l, iris_petal_l, c=labels.astype(float), edgecolor="k")

    # Hide the tick labels to keep the small plots readable
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel("Petal width")
    ax.set_ylabel("Sepal length")
    ax.set_zlabel("Petal length")
    ax.set_title(title)

# Plot the ground truth
# Columns of X: 0 = sepal length, 1 = sepal width, 2 = petal length, 3 = petal width
X = iris_x.to_numpy()
y = iris_y.to_numpy()

fig = plt.figure(len(titles) + 1, figsize=(4, 3))
ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)

# Label each species near the mean position of its samples
for name, label in [("Setosa", 0), ("Versicolour", 1), ("Virginica", 2)]:
    ax.text3D(
        X[y == label, 3].mean(),
        X[y == label, 0].mean(),
        X[y == label, 2].mean() + 2,
        name,
        horizontalalignment="center",
        bbox=dict(alpha=0.2, edgecolor="w", facecolor="w"),
    )
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor="k")

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel("Petal width")
ax.set_ylabel("Sepal length")
ax.set_zlabel("Petal length")
ax.set_title("Ground Truth")

plt.show()
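
Comparing the plots by eye only goes so far. Cluster ids are arbitrary (cluster 0 need not be Setosa), so one way to put a number on the agreement is a score that ignores how the clusters are numbered, such as the adjusted Rand index. The sketch below is an optional extra and reloads Iris so that it runs standalone. The names `X_demo`, `y_demo` and `demo_labels`, and the `n_init` and `random_state` settings, are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Reload Iris as plain numpy arrays so this example is self-contained
X_demo, y_demo = datasets.load_iris(return_X_y=True)

# Cluster all four features into three groups
demo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# 1.0 means identical groupings; values near 0 mean no better than chance
ari = adjusted_rand_score(y_demo, demo_labels)
print(f"Adjusted Rand index: {ari:.2f}")

# Rows are the true species, columns are the (arbitrarily numbered) cluster ids
print(confusion_matrix(y_demo, demo_labels))
```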

## Digits dataset

For comparison with the unsupervised clustering above, we can also try a supervised k-nearest neighbours (KNN) classifier on the digits dataset.

### Loading the data

First things first, we need to load the data. Let's also view the first few samples while we're at it.

In [None]:
# Load the digits as a bunch object
# We do this to get the target names and images for plotting

# Also load the digits X as a pandas Dataframe and the y as a Series


# Plot the first few examples
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Digit {label}")

<details><summary>Click to cheat</summary>

```python
# Load the digits as a bunch object
# We do this to get the target names and images for plotting
digits = datasets.load_digits()
# Also load the digits X as a pandas Dataframe and the y as a Series
digits_X, digits_y = datasets.load_digits(return_X_y=True, as_frame=True)

# Plot the first few examples
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Digit {label}")
```
</details>

Now let's split our labelled data into training and testing sets with a 70/30 ratio.

<details><summary>Click to cheat</summary>

```python
X_train, X_test, y_train, y_test = train_test_split(
    digits_X.to_numpy(), digits_y.to_numpy(), train_size=0.7
)
```
</details>
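
If you want the same split every time you run the notebook, one option is to fix the random seed and stratify on the labels so that each digit keeps roughly the same share in both sets. This is a sketch of that variant, not a required step. The `random_state=0` and `stratify` arguments and the `demo_`/`Xd_`/`yd_` names are choices made for this example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the digits as plain numpy arrays so this example is self-contained
demo_X, demo_y = datasets.load_digits(return_X_y=True)

# 70/30 split with a fixed seed, keeping class proportions similar in both sets
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_X, demo_y, train_size=0.7, random_state=0, stratify=demo_y
)
print(Xd_train.shape, Xd_test.shape)
```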

### Create the model

In [None]:
# Create the untrained model
# Choose whatever k you want

# Train the model

# get the predictions


<details><summary>Click to cheat</summary>

```python
# Create the untrained model
# Choose whatever k you want
k = 5

model = neighbors.KNeighborsClassifier(k)

# Train the model
model.fit(X_train, y_train)

# get the predictions
y_pred = model.predict(X_test)
```
</details>

### Test the model

Let's see a few examples of our predictions.

In [None]:
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, prediction in zip(axes, X_test, y_pred):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Prediction: {prediction}")

Let's also view our confusion matrix for good measure.

In [None]:
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.figure_.suptitle("Confusion Matrix")

plt.show()

Finally, we'll look at our measures of performance.

In [None]:
from sklearn.metrics import classification_report

target_names = [str(name) for name in digits.target_names]

print(classification_report(y_test, y_pred, target_names=target_names))
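
The precision and recall columns in the report can be read straight off a confusion matrix: for each class, precision is the diagonal entry divided by its column sum, and recall is the diagonal entry divided by its row sum. Here is a small sketch with made-up labels (`toy_true` and `toy_pred` are invented for illustration) that checks this against scikit-learn's own scores.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for a three-class problem
toy_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(toy_true, toy_pred)
true_positives = np.diag(cm)

# Precision: of everything predicted as class c, how much really was class c
precision = true_positives / cm.sum(axis=0)
# Recall: of everything that really was class c, how much was predicted as class c
recall = true_positives / cm.sum(axis=1)

print("precision:", precision)
print("recall:   ", recall)
```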