# CS 484 :: Data Mining :: George Mason University :: Spring 2025


# Homework 3: Clustering

- **100 points [6% of your final grade]**
- **Due Wednesday, April 09 by 11:59pm**

- *Goals of this homework:* (1) implement your own K-means model; (2) apply K-means to PCA-reduced data.

- *Submission instructions:* for this homework, you only need to submit to Blackboard. Please name your submission **FirstName_Lastname_hw3.ipynb**, so for example, my submission would be something like **Ziwei_Zhu_hw3.ipynb**. Your notebook should be fully executed so that we can see all outputs. 

## Part 1: K-Means (70 points)

In this part, you will implement your own K-means algorithm to cluster handwritten digit images. We will again use the handwritten digit image dataset from the previous homework. However, since clustering is unsupervised learning, we do not use the label information to build the model. So, here, we will only use the testing data stored in the "test.txt" file.

First, let's load the data by executing the following code:

In [1]:
import numpy as np

test = np.loadtxt("test.txt", delimiter=',')
test_features = test[:, 1:]
test_labels = test[:, 0]
print('array of testing feature matrix: shape ' + str(np.shape(test_features)))
print('array of testing label matrix: shape ' + str(np.shape(test_labels)))

array of testing feature matrix: shape (10000, 784)
array of testing label matrix: shape (10000,)


Now, it's time for you to implement your own K-means algorithm. First, please write your code to build your K-means model using the image data with **K = 10**, and **Euclidean distance**.

**Note: You should implement the algorithm yourself. You are NOT allowed to use machine learning libraries such as sklearn for the K-means implementation.**

**Note: you need to decide when to stop the iteration.**

In [None]:
import numpy as np
import sys

def k_means_clustering(test_features, k=10):
    # Initialize centroids by randomly selecting k distinct points
    # (sampling without replacement avoids duplicate starting centroids)
    init_idx = np.random.choice(len(test_features), size=k, replace=False)
    kcentroids = test_features[init_idx].astype(float)

    flag = False
    iteration = 0

    while not flag:
        # Reset clusters
        clusters = [[] for _ in range(k)]

        # Assign points to the nearest centroid
        for i, row in enumerate(test_features):
            closest_dist = sys.float_info.max
            closest_idx = 0
            for idx, centroid in enumerate(kcentroids):
                dist = np.linalg.norm(row - centroid)
                if dist < closest_dist:
                    closest_dist = dist
                    closest_idx = idx
            clusters[closest_idx].append(i)

        # Update centroids
        new_centroids = np.zeros_like(kcentroids)
        for idx, cluster in enumerate(clusters):
            if len(cluster) > 0:
                points = test_features[cluster]  # Fetch by index
                new_centroids[idx] = np.mean(points, axis=0)
            else:
                # Reinitialize empty cluster
                new_centroids[idx] = test_features[np.random.randint(0, len(test_features))]

        # Check convergence
        if np.allclose(kcentroids, new_centroids):
            flag = True
        else:
            kcentroids = new_centroids

        iteration += 1

    return kcentroids, clusters, iteration

# Example usage
kcentroids, clusters, iteration = k_means_clustering(test_features, k=10)
print(f"K-means converged in {iteration} iterations.")

K-means converged in 57 iterations.
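
The assignment step above computes one distance at a time in nested Python loops. As a hedged sketch of an alternative, the same nearest-centroid assignment can be written with NumPy broadcasting. The helper name `assign_to_nearest` and the toy arrays are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def assign_to_nearest(features, centroids):
    """Return, for each row of features, the index of its nearest centroid."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed for all pairs at once
    sq_dists = (
        np.sum(features ** 2, axis=1)[:, None]
        - 2.0 * features @ centroids.T
        + np.sum(centroids ** 2, axis=1)[None, :]
    )
    # The square root is monotonic, so the argmin over squared distances
    # selects the same centroid as the argmin over Euclidean distances
    return np.argmin(sq_dists, axis=1)

# Tiny synthetic check (hypothetical data, not the digit images)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 10.0]])
centers = np.array([[0.0, 0.5], [9.5, 10.0]])
labels = assign_to_nearest(points, centers)
print(labels)
```

This returns one cluster index per point rather than a list of index lists, so it would need a small adaptation before replacing the loop in `k_means_clustering`.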


Next, you need to calculate and print the **square root** of the **S**um of **S**quared **E**rror (SSE) of each cluster generated by your K-means algorithm. Then, print out the average square root of SSE over all clusters.

**Note: The value of "square root of SSE" can vary significantly. The expected value range is 10000~100000.**
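
For reference, one way to write the requested quantities. The symbols are assumptions introduced here: $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is its centroid, and $K = 10$ is the number of clusters.

```latex
\mathrm{SSE}_j = \sum_{x \in C_j} \lVert x - \mu_j \rVert_2^2,
\qquad
\sqrt{\mathrm{SSE}_j},
\qquad
\overline{\sqrt{\mathrm{SSE}}} = \frac{1}{K} \sum_{j=1}^{K} \sqrt{\mathrm{SSE}_j}
```

Note that the last expression averages the per-cluster square roots, which is generally not the same as taking the square root of the average SSE.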

In [51]:
# Write your code here
def calculate_sqrt_sse(test_features, clusters, kcentroids):
    sqrt_sse_list = []

    for idx, cluster_indices in enumerate(clusters):
        if len(cluster_indices) > 0:
            points = test_features[cluster_indices]  # Fetch rows by index
            diffs = points - kcentroids[idx]         # Broadcast: subtract the centroid from every point
            sse = np.sum(np.square(diffs))
            sqrt_sse = np.sqrt(sse)
            sqrt_sse_list.append(sqrt_sse)
            print(f"Cluster {idx} - sqrt(SSE): {sqrt_sse:.4f}")
        else:
            print(f"Cluster {idx} is empty.")

    # Average of the square roots of SSEs
    if sqrt_sse_list:
        avg_sqrt_sse = np.mean(sqrt_sse_list)
        print(f"\nAverage sqrt(SSE) across all clusters: {avg_sqrt_sse:.4f}")
    else:
        print("No non-empty clusters to calculate SSE.")
    
    return sqrt_sse_list  # Return list if needed for further use

# Example usage
sqrt_sse_list = calculate_sqrt_sse(test_features, clusters, kcentroids)

Cluster 0 - sqrt(SSE): 35172.3752
Cluster 1 - sqrt(SSE): 52431.6118
Cluster 2 - sqrt(SSE): 62310.8753
Cluster 3 - sqrt(SSE): 35564.7404
Cluster 4 - sqrt(SSE): 49338.2069
Cluster 5 - sqrt(SSE): 49138.4388
Cluster 6 - sqrt(SSE): 50235.8613
Cluster 7 - sqrt(SSE): 57684.1139
Cluster 8 - sqrt(SSE): 44813.5720
Cluster 9 - sqrt(SSE): 59007.6027

Average sqrt(SSE) across all clusters: 49569.7398


Then, please take a look at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_completeness_v_measure.html#sklearn.metrics.homogeneity_completeness_v_measure, and use this function to print out the homogeneity, completeness, and v-measure of your K-means model.

**Note: The values of homogeneity, completeness, and v-measure are expected to be >0.48**
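
As a small illustration of how these three scores behave, the sketch below calls `homogeneity_completeness_v_measure` on made-up toy labels (not the digit data). It is meant to show that the scores depend only on how points are grouped, not on which integer id each cluster happens to receive, which is why raw K-means cluster indices can be passed in directly.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Toy ground truth: three classes with two points each (hypothetical data)
true_labels = [0, 0, 1, 1, 2, 2]
# Same grouping as true_labels, but with the cluster ids permuted
permuted = [2, 2, 0, 0, 1, 1]
# Degenerate clustering: every point in a single cluster
single = [0, 0, 0, 0, 0, 0]

h1, c1, v1 = homogeneity_completeness_v_measure(true_labels, permuted)
h2, c2, v2 = homogeneity_completeness_v_measure(true_labels, single)

print(f"permuted ids:   h={h1:.4f}, c={c1:.4f}, v={v1:.4f}")
print(f"single cluster: h={h2:.4f}, c={c2:.4f}, v={v2:.4f}")
```

The permuted labelling should score perfectly on all three measures, while the single-cluster labelling is trivially complete but not homogeneous, so its v-measure collapses.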

In [59]:
# Write your code here
from sklearn.metrics import homogeneity_completeness_v_measure

def evaluate_clustering_performance1(test_labels, clusters):
    # Allocate an uninitialized label array; every entry is overwritten below
    # because each point index appears in exactly one cluster
    predicted_labels = np.empty(len(test_labels), dtype=int)

    # Assign the predicted labels based on cluster indices
    for cluster_idx, point_indices in enumerate(clusters):
        for i in point_indices:
            predicted_labels[i] = cluster_idx

    return predicted_labels
# Example usage
pred = evaluate_clustering_performance1(test_labels, clusters)
h, c, v = homogeneity_completeness_v_measure(test_labels, pred)

# Print results
print(f"\nHomogeneity: {h:.4f}")
print(f"Completeness: {c:.4f}")
print(f"V-measure: {v:.4f}")


Homogeneity: 0.4988
Completeness: 0.5022
V-measure: 0.5005


## Part 2: Dimension Reduction by PCA (30 points)

In this last part, please use PCA to reduce the feature dimension. Then, apply your K-Means code to the reduced data and report homogeneity, completeness, and v-measure.

**Note: You need to consider the reduced dimension m of PCA as a hyper-parameter to tune. That is, you need to try different m and measure the corresponding clustering performance. At the end, you need to report the best m and clustering performance.**

**Note: Everything else is the same as Part1, i.e., K=10, and you need to use Euclidean distance.**

**Note: You can reuse your own code of PCA from your HW1, or you can directly use the PCA model from sklearn -- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html**
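
A minimal sketch of the tuning loop described in the notes above, under several stated assumptions: `simple_kmeans` is a compact stand-in for a hand-written K-means, the data are synthetic blobs rather than the digit images, the candidate values of m are arbitrary, and the "best" m is chosen by V-measure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import homogeneity_completeness_v_measure

def simple_kmeans(X, k, rng, max_iter=100):
    """Compact stand-in for a hand-written K-means; returns one label per row."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels

# Synthetic stand-in data: 3 separated blobs in 20 dimensions
rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(3), 50)
X = rng.normal(size=(150, 20)) + 8.0 * true_labels[:, None]

results = {}
for m in (2, 5, 10):  # candidate values of m, chosen arbitrarily here
    reduced = PCA(n_components=m).fit_transform(X)
    pred = simple_kmeans(reduced, k=3, rng=rng)
    results[m] = homogeneity_completeness_v_measure(true_labels, pred)

best_m = max(results, key=lambda m: results[m][2])  # select by V-measure
for m, (h, c, v) in results.items():
    print(f"m={m}: homogeneity={h:.4f}, completeness={c:.4f}, v-measure={v:.4f}")
print(f"best m by V-measure: {best_m}")
```

Because K-means starts from random centroids, scores for the same m can differ between runs; averaging several runs per m may give a steadier comparison.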

In [61]:
# Write your code here
from sklearn.decomposition import PCA

def evaluate_clustering_performance2(test_features, k, pcaDim):
    # Cluster the data after reducing it to pcaDim principal components

    print(f"\nEvaluating for {pcaDim} PCA components...")
    
    # Apply PCA to reduce the feature space to pcaDim dimensions
    pca = PCA(n_components=pcaDim)
    reduced_features = pca.fit_transform(test_features)
    
    # Apply custom K-means clustering on the reduced data
    # (n_iter avoids shadowing the built-in iter)
    centroids, clusters, n_iter = k_means_clustering(reduced_features, k)
    return clusters

# Example usage
clusters = evaluate_clustering_performance2(test_features,10,50)
pred = evaluate_clustering_performance1(test_labels, clusters)



Evaluating for 50 PCA components...


In the next cell, use sklearn.metrics.homogeneity_completeness_v_measure() to print out the homogeneity, completeness, and v-measure of your K-means model on the PCA-reduced data.

**Note: The values of homogeneity, completeness, and v-measure are expected to be >0.48**

In [62]:
# Write your code here
h, c, v = homogeneity_completeness_v_measure(test_labels, pred)
print(f"Best Homogeneity: {h:.4f}")
print(f"Best Completeness: {c:.4f}")
print(f"Best V-measure: {v:.4f}")

Best Homogeneity: 0.4912
Best Completeness: 0.5040
Best V-measure: 0.4976
