# BU.330.775 Machine Learning: Design and Deployment
# Lab 6: Image Clustering using K-Means

**Student:** Jinge Zhou  
**Learning Goal:** Practice using an unsupervised machine learning model to cluster image data

---

## Step a: Import Required Packages

In [None]:
from keras.datasets import mnist
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np

print("All packages imported successfully!")

All packages imported successfully!


## Step b: Load MNIST Dataset and Check Size

In [None]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)
print(x_test.shape)
print(x_train.min())
print(x_train.max())

(60000, 28, 28)
(10000, 28, 28)
0
255


### Results - Step b

| Metric | Value |
|--------|-------|
| Number of training images | 60,000 |
| Number of testing images | 10,000 |
| Image size | 28 × 28 pixels |
| Pixel value range | 0 to 255 |

## Step c: Plot 9 Sample Images

In [None]:
plt.figure(figsize=(10, 9))  # Adjusting figure size
plt.gray()  # B/W images; called after plt.figure so no extra empty figure is created

# Displaying a grid of 3x3 images
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.imshow(x_train[i])

plt.show()

## Step d: Normalize Data

In [None]:
# Conversion to float
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Normalization
x_train = x_train / 255.0
x_test = x_test / 255.0

# Checking the minimum and maximum values of x_train
print(x_train.min())
print(x_train.max())

0.0
1.0


### Results - Step d

After normalization:
- Minimum value: **0.0**
- Maximum value: **1.0**
- Data type: float32
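
As a small illustration of why the cast comes before the division, the sketch below uses a made-up 2×2 `uint8` array (an assumption for illustration, not MNIST data). Dividing `uint8` values by `255.0` directly promotes the result to `float64`, while casting to `float32` first keeps the smaller type.

```python
import numpy as np

# Hypothetical 2x2 "image" with raw 8-bit pixel intensities (not MNIST data)
pixels = np.array([[0, 128], [255, 64]], dtype=np.uint8)

# Dividing uint8 directly promotes the result to float64
direct = pixels / 255.0

# Casting first keeps the result in float32, halving memory use
scaled = pixels.astype('float32') / 255.0

print(direct.dtype, scaled.dtype)
print(scaled.min(), scaled.max())
```

Both variants map the range 0 to 255 onto 0.0 to 1.0; only the storage type differs.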

## Step e: Reshape Data for K-Means

In [None]:
# Reshaping input data
X_train = x_train.reshape(len(x_train), -1)
X_test = x_test.reshape(len(x_test), -1)

# Checking the shape
print(X_train.shape)
print(X_test.shape)

(60000, 784)
(10000, 784)


### Results - Step e

| Dataset | Original Shape | Reshaped |
|---------|---------------|----------|
| Training | (60000, 28, 28) | **(60000, 784)** |
| Testing | (10000, 28, 28) | **(10000, 784)** |

Each 28×28 image is now a 784-dimensional feature vector.
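
To make the flattening concrete, here is a minimal sketch on a made-up stack of two images (the array `images` is hypothetical). It checks where a given pixel ends up in the flat vector and confirms that no information is lost.

```python
import numpy as np

# Hypothetical stack of two 28x28 images filled with distinct values
images = np.arange(2 * 28 * 28, dtype=np.float32).reshape(2, 28, 28)

# Same call pattern as in Step e: keep the sample axis, flatten the rest
flat = images.reshape(len(images), -1)

# With NumPy's default row-major order, pixel (row, col) lands at row * 28 + col
row, col = 3, 5
index = row * 28 + col
same_pixel = flat[1, index] == images[1, row, col]

# The flattening is lossless: reshaping back recovers the original images
restored = flat.reshape(-1, 28, 28)

print(flat.shape, bool(same_pixel), np.array_equal(restored, images))
```

Note that the flat vector discards the notion of neighboring pixels: K-Means treats the 784 values as independent coordinates.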

## Step f: Apply K-Means with 10 Clusters

In [None]:
def retrieve_info(cluster_labels, y_train):
    """Map each cluster ID to the most frequent true digit among its members."""
    reference_labels = {}
    # Loop over the cluster IDs present in cluster_labels
    # (uses the argument, not the global 'kmeans' object)
    for i in np.unique(cluster_labels):
        members = cluster_labels == i                 # boolean mask for cluster i
        num = np.bincount(y_train[members]).argmax()  # majority digit in cluster i
        reference_labels[int(i)] = num
    return reference_labels

total_clusters = len(np.unique(y_train))

# Initialize the K-Means model
kmeans = MiniBatchKMeans(n_clusters=total_clusters)

# Fitting the model to training set
kmeans.fit(X_train)

## Step g: Retrieve Labels and Compare Predictions

In [None]:
reference_labels = retrieve_info(kmeans.labels_, y_train)

number_labels = np.zeros(len(kmeans.labels_), dtype=int)  # placeholder, filled in below
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]

# Comparing Predicted values and Actual values
print(number_labels[:20].astype('int'))
print(y_train[:20])

[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]


### Results - Step g

**First 20 Predicted Labels:** [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]  
**First 20 Actual Labels:**    [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]

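In this output, the first 20 predicted labels match the actual labels exactly. A 20-sample slice is too small to judge overall performance, so the accuracy over the full training set is computed in Step h.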
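
The cluster-to-digit mapping can also be applied without a Python loop over all samples. The following is a minimal sketch on made-up toy arrays (`cluster_ids` and `true_digits` are hypothetical stand-ins for `kmeans.labels_` and `y_train`), assuming every cluster ID from 0 to the maximum has at least one member.

```python
import numpy as np

# Hypothetical toy data: 8 samples, 3 clusters, true digits in {0, 1, 7}
cluster_ids = np.array([0, 0, 1, 1, 1, 2, 2, 0])
true_digits = np.array([7, 7, 1, 1, 7, 0, 0, 1])

# Assumes cluster IDs are 0..max and that no cluster is empty
n_clusters = cluster_ids.max() + 1

# lookup[c] = most frequent true digit among samples assigned to cluster c
lookup = np.array([
    np.bincount(true_digits[cluster_ids == c]).argmax()
    for c in range(n_clusters)
])

# Fancy indexing maps every sample's cluster ID to a digit in one step
predicted_digits = lookup[cluster_ids]

print(lookup)
print(predicted_digits)
```

The result is an integer array, so no placeholder array and no later cast to `int` are needed.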

## Step h: Calculate Accuracy Score (10 Clusters)

In [None]:
# Calculating accuracy score on the training set
# (scikit-learn's argument order is y_true first, then y_pred)
print(accuracy_score(y_train, number_labels))

0.5843


### Results - Step h

**Accuracy with 10 clusters (measured on the training set): 0.5843 (58.43%)**
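
For reference, the accuracy score is simply the fraction of samples whose predicted digit equals the true digit. The sketch below recomputes it by hand on made-up toy labels (not the lab's output) and also breaks it down per digit, which can show which digits are driving the errors.

```python
import numpy as np

# Hypothetical toy labels (not the lab's output)
y_true = np.array([7, 7, 1, 1, 7, 0, 0, 1])
y_pred = np.array([7, 7, 1, 1, 1, 0, 0, 7])

# Accuracy is the fraction of positions where prediction and truth agree
accuracy = np.mean(y_pred == y_true)

# Per-digit accuracy: restrict the comparison to samples of one true digit
per_digit = {
    int(d): float(np.mean(y_pred[y_true == d] == d))
    for d in np.unique(y_true)
}

print(accuracy)
print(per_digit)
```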

## Step i: Increase to 50 Clusters

In [None]:
# Increase to 50 clusters, and fit the model
kmeans = MiniBatchKMeans(n_clusters=50)
kmeans.fit(X_train)

# Calculating the reference_labels
reference_labels = retrieve_info(kmeans.labels_, y_train)

# 'number_labels' is an array holding the predicted digit for each image
number_labels = np.zeros(len(kmeans.labels_), dtype=int)  # placeholder, filled in below
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]

# Training-set accuracy (y_true first, then y_pred)
print('Accuracy score : {}'.format(accuracy_score(y_train, number_labels)))
print('\n')

Accuracy score : 0.8062




### Results - Step i

| Model | Accuracy |
|-------|----------|
| K-Means (10 clusters) | **58.43%** |
| K-Means (50 clusters) | **80.62%** |
| **Improvement** | **+22.19 percentage points** |

## Step j: Visualize Cluster Centers

In [None]:
# Cluster centroids are stored in 'centroids'
centroids = kmeans.cluster_centers_

# Reshape each 784-dimensional centroid back to a 28x28 image and rescale to 0-255
centroids = centroids.reshape(50, 28, 28)
centroids = centroids * 255

plt.figure(figsize=(10, 10))
for i in range(50):
    plt.subplot(5, 10, i+1)
    plt.title('Num:{}'.format(reference_labels[i]), fontsize=10)
    plt.imshow(centroids[i], cmap='gray')

# Pass 'bottom' by keyword; a single positional value would set 'left' instead
plt.subplots_adjust(bottom=0.35)
plt.show()

## Homework Question 1 (1pt)

**Compare the accuracy of 10 clusters vs that of 50 clusters, which one is better?**

---

The 50-cluster model performs significantly better than the 10-cluster model.

| Model | Accuracy |
|-------|----------|
| K-Means (10 clusters) | 58.43% |
| K-Means (50 clusters) | 80.62% |
| **Improvement** | **+22.19 percentage points** |

The 50-cluster model achieves approximately **80.62%** accuracy compared to **58.43%** for the 10-cluster model. This represents a substantial improvement of about **22 percentage points**, demonstrating that increasing the number of clusters from 10 to 50 has a strong positive impact on classification performance for the MNIST dataset.
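
The difference between the two accuracies is an absolute gain in percentage points, which is not the same as a relative gain in percent. A short arithmetic sketch using the two accuracies reported in Steps h and i:

```python
# Accuracies reported in Steps h and i
acc_10 = 58.43  # percent, 10 clusters
acc_50 = 80.62  # percent, 50 clusters

# Absolute gain, in percentage points
gain_pp = acc_50 - acc_10

# Relative gain, as a percentage of the 10-cluster accuracy
gain_rel = 100 * gain_pp / acc_10

print(f"{gain_pp:.2f} percentage points, {gain_rel:.1f}% relative")
```

The absolute gain is 22.19 percentage points, while the relative gain over the 10-cluster baseline is roughly 38%.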

## Homework Question 2 (2pt)

**Inspect the centroids in step j, discuss why increasing the number of clusters in this case has a positive/negative impact on the model performance.**

---

Increasing the number of clusters has a positive impact on model performance, for the following reasons:

**1. Capturing Intra-Class Variation:**
When examining the 50 cluster centroids, we can observe that multiple clusters are assigned to the same digit. This is crucial because handwritten digits exhibit significant variation in writing styles. For example:
- The digit "7" can be written with or without a horizontal stroke
- The digit "4" can be open or closed at the top
- The digit "1" can be written as a simple vertical line or with serifs

With only 10 clusters, there is on average at most one centroid per digit, so a single centroid has to stand in for every writing style of that digit and these variations cannot be captured.
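
One way to check this observation numerically is to count how many clusters were named after each digit. The sketch below uses a made-up mapping in the shape returned by `retrieve_info`; the values are hypothetical and are not the lab's actual result.

```python
from collections import Counter

# Hypothetical cluster -> digit mapping in the shape returned by retrieve_info;
# these values are made up, not the lab's actual result
reference_labels = {0: 7, 1: 1, 2: 7, 3: 4, 4: 9, 5: 4, 6: 7}

# Count how many clusters were named after each digit
clusters_per_digit = Counter(reference_labels.values())

# Digits that received no cluster at all would never be predicted
missing_digits = sorted(set(range(10)) - set(clusters_per_digit))

for digit, count in sorted(clusters_per_digit.items()):
    print(f"digit {digit}: {count} cluster(s)")
print("digits without a cluster:", missing_digits)
```

Running the same count on the real `reference_labels` for 10 and for 50 clusters would show whether any digit is left without a cluster in the smaller model.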

**2. Visual Evidence from Centroids:**
Looking at the centroid visualization, we can see that:
- Some digits like "1" have relatively consistent appearances and may need fewer clusters
- Digits like "4", "7", and "9" show more variation and benefit from multiple clusters capturing different writing styles
- The centroids appear sharper and more digit-like with 50 clusters, indicating better representation of actual digit patterns

**3. Reduced Cluster Impurity:**
With 10 clusters, a single centroid has to cover many writing styles and often absorbs samples of several similar-looking digits, which leads to "blurry" centroids that average out important distinguishing features. With 50 clusters, each cluster can specialize in a specific variant of a digit, resulting in purer clusters with more homogeneous samples.
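
Cluster purity can be quantified as the share of a cluster taken up by its most common digit. A minimal sketch on made-up toy arrays (hypothetical stand-ins for `kmeans.labels_` and `y_train`):

```python
import numpy as np

# Hypothetical toy data: cluster 0 is pure, cluster 1 mixes digits 4 and 9
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 1, 1])
true_digits = np.array([1, 1, 1, 4, 4, 9, 9, 9])

purity = {}
majority_total = 0
for c in np.unique(cluster_ids):
    counts = np.bincount(true_digits[cluster_ids == c])
    # Purity: share of the cluster taken by its most common digit
    purity[int(c)] = counts.max() / counts.sum()
    majority_total += counts.max()

# Size-weighted purity equals the accuracy of majority-vote labeling
overall = majority_total / len(cluster_ids)

print(purity)
print(overall)
```

Because the overall value coincides with the majority-vote accuracy used in this lab, purer clusters translate directly into a higher accuracy score.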

**4. Better Discrimination:**
The additional clusters allow the algorithm to create more decision boundaries in the feature space, enabling finer discrimination between similar-looking digits (such as 3 vs 8, or 4 vs 9).

## Homework Question 3 (2pt)

**Comment on the performance of K-means in MNIST image clustering. What insight(s) can we draw?**

---
**Performance Assessment:**

K-Means reaches approximately **80% accuracy** (80.62%, measured on the training set) on MNIST digit clustering with 50 clusters. This is well below supervised learning methods (which typically achieve 97-99%+ accuracy), but it is a reasonable result for an unsupervised algorithm that has no access to the labels while fitting; the labels are used only afterwards, to name each cluster by majority vote.

**Key Insights:**

**1. Unsupervised Learning Has Inherent Limitations:**
K-Means clusters data based purely on pixel-level similarity (Euclidean distance in 784-dimensional space). It cannot leverage label information to learn discriminative features, which limits its classification accuracy compared to supervised methods.
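
The assignment rule behind this can be sketched in a few lines: each sample goes to the centroid with the smallest Euclidean distance. The helper `assign_to_nearest` and the 2-D toy data below are hypothetical, standing in for the 784-dimensional pixel vectors.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid (Euclidean)."""
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d) pairwise differences
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)   # squared distances, shape (n, k)
    return np.argmin(sq_dists, axis=1)

# Hypothetical 2-D stand-ins for the 784-dimensional pixel vectors
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
points = np.array([[0.1, 0.2], [0.9, 0.8], [0.4, 0.3]])

labels = assign_to_nearest(points, centroids)
print(labels)
```

On the full training set the intermediate `(n, k, d)` array would be very large, so a real implementation would process the data in batches.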

**2. K-Means Captures Natural Data Structure:**
Despite being unsupervised, K-Means successfully identifies meaningful patterns in the data. The fact that cluster centroids visually resemble recognizable digits demonstrates that handwritten digits do form natural clusters in the pixel space.

**3. The Choice of K is Critical:**
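The jump from 58.43% to 80.62% when going from 10 to 50 clusters shows that the best-performing number of clusters can exceed the number of actual classes. Real-world data rarely forms exactly K perfectly separable clusters; there is significant within-class variation that requires additional clusters to capture. Note, however, that accuracy under majority-vote labeling tends to rise as K grows simply because clusters become smaller and purer (with one cluster per sample it would reach 100%), so accuracy alone does not identify an optimal K.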

**4. K-Means Depends on Euclidean Distance:**
K-Means uses Euclidean distance, which may not be the ideal metric for image comparison. Small shifts or rotations of a digit can produce large Euclidean distances even though the digits are semantically identical. This likely explains some of the classification errors.
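
A toy example makes the effect visible. In the sketch below (synthetic images, not MNIST samples), a thin vertical stroke shifted by two pixels ends up farther from the original, in Euclidean terms, than a visibly different thicker stroke in the same position.

```python
import numpy as np

# Hypothetical 28x28 "digit": a one-pixel-wide vertical stroke
stroke = np.zeros((28, 28), dtype=np.float32)
stroke[:, 13] = 1.0

# Same stroke moved two pixels to the right
shifted = np.roll(stroke, 2, axis=1)

# A different shape: the stroke made two pixels wide
thick = stroke.copy()
thick[:, 14] = 1.0

def euclidean(a, b):
    """Euclidean distance between two images, treated as flat vectors."""
    return float(np.linalg.norm(a.ravel() - b.ravel()))

d_shifted = euclidean(stroke, shifted)  # same shape, different position
d_thick = euclidean(stroke, thick)      # different shape, same position

print(d_shifted, d_thick)
```

The shifted copy differs in 56 pixels and the thicker stroke in only 28, so pixel-wise distance ranks the shifted copy as less similar.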

**5. Practical Applications:**
While K-Means alone may not achieve state-of-the-art classification accuracy, it can be valuable for:
- Exploratory data analysis to understand data structure
- Data preprocessing or feature extraction for downstream tasks
- Semi-supervised learning scenarios where labeled data is scarce
- Quick baseline modeling before investing in more complex approaches

**6. Scalability Advantage:**
MiniBatch K-Means is computationally efficient and scales well to large datasets, making it practical for real-world applications where training time is a constraint.
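
The efficiency comes from updating the centers with small random batches rather than the full dataset. The following is a simplified sketch of the mini-batch update with a per-center learning rate of 1/count; it is an illustration under assumptions (hypothetical function name, fixed iteration count, hand-picked initial centers, synthetic 2-D data) and not scikit-learn's actual `MiniBatchKMeans` implementation.

```python
import numpy as np

def minibatch_kmeans(X, init_centers, batch_size=50, n_iter=40, seed=0):
    """Mini-batch k-means with a per-center learning rate of 1 / count."""
    rng = np.random.default_rng(seed)
    centers = init_centers.astype(float)   # astype returns a copy
    counts = np.zeros(len(centers))
    for _ in range(n_iter):
        # Draw a small random batch instead of scanning the whole dataset
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch sample to its nearest current center
        dists = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Move each chosen center toward its sample with a shrinking step
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers

# Hypothetical data: two well-separated 2-D blobs
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(500, 2))
X = np.vstack([blob_a, blob_b])

# Start from one point of each blob so the toy run is easy to follow
centers = minibatch_kmeans(X, init_centers=X[[0, 500]])
print(centers)
```

Each iteration touches only `batch_size` samples, so the cost per update does not grow with the size of the dataset.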

---

# Final Results Summary

---

| Metric | Value |
|--------|-------|
| Training samples | 60,000 |
| Testing samples | 10,000 |
| Image size | 28×28 pixels (784 features) |
| Classes | 10 digits (0-9) |
| **K-Means (10 clusters)** | **58.43%** |
| **K-Means (50 clusters)** | **80.62%** |
| **Improvement** | **+22.19 percentage points** |