# Homework09

Exercises to practice unsupervised learning with clustering

## Goals

- Get more practice with the ML flow: encode -> normalize -> train -> evaluate
- Understand the tradeoffs of modeling parameters
- Develop intuition for different clustering models and when to use them

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/image_utils.py

In [None]:
import json
import matplotlib.pyplot as plt
import pandas as pd

from os import listdir
from PIL import Image as PImage

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

from data_utils import balance_score, distance_score, silhouette_score, display_silhouette_plots
from data_utils import object_from_json_url

from image_utils import get_pixels, make_image

## Helmet Sizing

### Load Dataset

Let's load up the full [ANSUR](https://www.openlab.psu.edu/ansur2/) dataset that we looked at briefly in [Week 02](https://github.com/PSAM-5020-2025F-A/WK02) and then again in [Homework07](https://github.com/PSAM-5020-2025F-A/Homework07).

This is the dataset that has anthropometric information about U.S. Army personnel.

#### WARNING

Like we mentioned in class, this dataset is being used for these exercises due to the level of detail in the dataset and the rigorous process that was used in collecting the data.

This is a very specific dataset and should not be used to draw general conclusions about people, bodies, or anything else that is not related to the distribution of physical features of U.S. Army personnel.

In [None]:
# Load Dataset
ANSUR_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/main/datasets/json/ansur.json"
ansur_data = object_from_json_url(ANSUR_FILE)

# Look at first 2 records
ansur_data[:2]

Let's load it into a `DataFrame`, like last week.

In [None]:
# Read into DataFrame
ansur_df = pd.json_normalize(ansur_data)
ansur_df.head()

### Unsupervised Learning

Let's pretend we are designing next-generation helmets with embedded over-the-ear headphones and we want to have a few options for sizes.

We could use clustering to see if there is a number of clusters that we can divide our population into, so each size covers a similar portion of the population.
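One way to make "each size covers a similar portion of the population" concrete is to tally the share of records that falls into each cluster. The sketch below is only an illustration: `cluster_shares()` is a hypothetical helper (it is not the `balance_score()` function from `data_utils`, whose definition isn't shown here), and the labels are made up.

```python
from collections import Counter

def cluster_shares(labels):
    """Return {cluster_id: fraction of records assigned to that cluster}."""
    counts = Counter(labels)
    total = len(labels)
    return {cid: n / total for cid, n in sorted(counts.items())}

# Made-up labels for 10 people assigned to 3 helmet sizes
toy_labels = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0]
shares = cluster_shares(toy_labels)
print(shares)
```

Shares that are close to each other suggest that each size would serve a similar number of people.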

We can follow similar steps to regression to create a clustering model that uses features about head and ear sizes:

1. Load dataset (done! 🎉)
2. Encode label features as numbers
3. Normalize the data
4. Separate the feature variables we want to consider (done below)
5. Pick a clustering algorithm
6. Determine number of clusters
7. Cluster data
8. Interpret results
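Steps 2 and 3 can be sketched without any libraries to show roughly what happens to a single column. This is a simplified illustration, assuming default settings: `ordinal_encode()` and `z_score()` are hypothetical helpers that mimic, for one column, what `OrdinalEncoder` (sorted categories mapped to `0, 1, 2, ...`) and `StandardScaler` (subtract the mean, divide by the population standard deviation) are understood to do. The data is made up.

```python
from statistics import mean, pstdev

def ordinal_encode(values):
    """Map each distinct label to its position in sorted order."""
    categories = sorted(set(values))
    lookup = {cat: float(i) for i, cat in enumerate(categories)}
    return [lookup[v] for v in values]

def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

toy_sizes = ["M", "L", "S", "M"]             # made-up label column
toy_heights = [160.0, 170.0, 180.0, 190.0]   # made-up numeric column

encoded = ordinal_encode(toy_sizes)
scaled = z_score(toy_heights)
print(encoded)
print(scaled)
```

Note that the encoded order here is alphabetical (`L`, `M`, `S`), not small-to-large, so the numbers produced by this kind of encoding don't necessarily carry a meaningful ordering.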

For step $5$, it's fine to just pick an algorithm ahead of time to see what happens, but feel free to experiment and plot results for multiple clustering methods.

In [None]:
## Encode non-numerical features
encoder = OrdinalEncoder()
ansur_encoded_df = ansur_df.copy()

# Encode all object (non-numeric) columns
for col in ansur_encoded_df.select_dtypes(include="object").columns:
    ansur_encoded_df[col] = encoder.fit_transform(ansur_encoded_df[[col]])


## Normalize the data

scaler = StandardScaler()
ansur_scaled_df = ansur_encoded_df.copy()
ansur_scaled_df[ansur_scaled_df.columns] = scaler.fit_transform(ansur_encoded_df)


In [None]:
## Separate the features we want to consider
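# .copy() so that adding a "cluster" column later doesn't write into a slice of ansur_scaled_df
ansur_features = ansur_scaled_df[["head.height", "head.circumference", "ear.length", "ear.breadth", "ear.protrusion"]].copy()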

In [None]:
## Create Clustering model
kmeans = KMeans(n_clusters=3, random_state=42)
## Run the model(s) on the data
kmeans_labels = kmeans.fit_predict(ansur_features)
ansur_features["cluster"] = kmeans_labels

## Check errors
print("Silhouette Score:", silhouette_score(ansur_features.drop("cluster", axis=1), kmeans_labels))
print("Balance Score:", balance_score(kmeans_labels))
print("Distance Score:", distance_score(ansur_features.drop("cluster", axis=1), kmeans_labels))


## Plot clusters as function of 2 or 3 variables
plt.figure(figsize=(8,6))
plt.scatter(ansur_features["head.circumference"], ansur_features["head.height"], c=ansur_features["cluster"], cmap="viridis")
plt.xlabel("Head Circumference")
plt.ylabel("Head Height")
plt.title("Helmet Size Clusters")
plt.show()


### Interpretation

<span style="color:hotpink;">
Which clustering algorithm did you choose?<br>
Did you try a different one?<br>
Do the clusters make sense ? Do they look balanced ?
</span>

<span style="color:hotpink;">
I used the K-Means algorithm to group the data into three clusters. I also tried using Gaussian Mixture, which gave a similar result but with smoother boundaries between groups.

The clusters look quite reasonable: they seem to group people by overall head size, from smaller to larger heads. The distribution looks fairly balanced, which makes sense for designing helmet sizes like small, medium, and large. There's some overlap between the clusters, but that is expected, since real human measurements usually blend gradually rather than forming clear separations.
</span>

## Figure out how many clusters

Experiment with the number of clusters to see if the initial choice makes sense.

The [WK09](https://github.com/PSAM-5020-2025F-A/WK09) notebook had a for loop that can be used to plot errors versus number of clusters.
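To build intuition for why an error-versus-clusters plot tends to drop quickly and then flatten, here is a small sketch on made-up 1-D data. It assumes a distance-style error defined as the mean distance from each point to its cluster's centroid; `mean_centroid_distance()` is a hypothetical helper, and `distance_score()` from `data_utils` may be defined differently. The labels are assigned by hand, not by a model.

```python
from statistics import mean

def mean_centroid_distance(points, labels):
    """Average absolute distance from each 1-D point to its cluster mean."""
    clusters = {}
    for p, cid in zip(points, labels):
        clusters.setdefault(cid, []).append(p)
    centroids = {cid: mean(ps) for cid, ps in clusters.items()}
    return mean(abs(p - centroids[cid]) for p, cid in zip(points, labels))

# Made-up 1-D measurements forming three loose groups
toy_points = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]

# Hand-assigned labels for k = 1, 2, 3, 4
labelings = {
    1: [0, 0, 0, 0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

errors = {k: mean_centroid_distance(toy_points, lab) for k, lab in labelings.items()}
for k, err in errors.items():
    print(k, round(err, 3))
```

The error falls sharply up to `k = 3`, which matches the three groups in the toy data, and barely improves at `k = 4`. That bend is the kind of pattern to look for in the real plot.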

In [None]:
## Plot errors and pick how many clusters
sil_scores = []
dist_scores = []
cluster_range = range(2, 10)

for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(ansur_features.drop("cluster", axis=1, errors="ignore"))
    sil = silhouette_score(ansur_features.drop("cluster", axis=1, errors="ignore"), labels)
    dist = distance_score(ansur_features.drop("cluster", axis=1, errors="ignore"), labels)
    
    sil_scores.append(sil)
    dist_scores.append(dist)

plt.figure(figsize=(8,5))
plt.plot(cluster_range, sil_scores, marker='o', label="Silhouette Score")
plt.plot(cluster_range, dist_scores, marker='x', label="Distance Score")
plt.xlabel("Number of Clusters")
plt.ylabel("Score")
plt.title("Choosing the Best Number of Clusters")
plt.legend()
plt.show()


### Interpretation

<span style="color:hotpink;">
Based on the graphs of errors versus number of clusters, does it look like we should change the initial number of clusters ?<br>
How many clusters should we use ? Why ?
</span>

<span style="color:hotpink;">Looking at the graphs, the scores get better until around 3 clusters, and after that they don't change much. This means using more clusters doesn't really improve the results. So, 3 clusters seems like the best choice because it keeps things simple and still separates the data well.
</span>

### Revise Number of Clusters

Re-run with the new number of clusters and plot the data in $2D$ or $3D$.

This can be the same graph as above.

In [None]:
# Re-run clustering with final number of clusters
FINAL_K = 3
kmeans_final = KMeans(n_clusters=FINAL_K, random_state=42)

# Run the model on the training data
final_labels = kmeans_final.fit_predict(ansur_features.drop("cluster", axis=1, errors="ignore"))
ansur_features["cluster"] = final_labels

# Check errors
print("Silhouette Score:", silhouette_score(ansur_features.drop("cluster", axis=1), final_labels))
print("Balance Score:", balance_score(final_labels))
print("Distance Score:", distance_score(ansur_features.drop("cluster", axis=1), final_labels))

# Plot in 3D
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(
    ansur_features["head.circumference"], 
    ansur_features["head.height"], 
    ansur_features["ear.length"], 
    c=ansur_features["cluster"], cmap="viridis"
)
ax.set_xlabel("Head Circumference")
ax.set_ylabel("Head Height")
ax.set_zlabel("Ear Length")
ax.set_title("3D Cluster Visualization (K=3)")
plt.show()


### Interpretation

<span style="color:hotpink;">
Do these look better than the original number of clusters?
</span>

<span style="color:hotpink;">Yes, these clusters look better than before. With three clusters, the data is grouped more clearly, and the points form distinct regions in the 3D plot. The clusters also seem more balanced in size and make more sense for dividing helmet sizes into small, medium, and large. Adding more clusters didn't really improve the separation, so three works well for this dataset.
</span>

## Image Organization

We have a dataset of about $600$ flower images that we might want to classify by species... eventually.

What we want to do first is take a look at all of the images and see what kind of images we have, what kind of colors our flowers have and see if there's any other visual information that could help us classify these images later.

We'll see how to use clustering and distances to organize our images by color to create a visualization that we can use to get to know our dataset.

### Load Dataset

The following cell downloads the dataset:

In [None]:
!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/flowers.tar.gz | tar xz

Then, we can take a look at a few of the images:

In [None]:
IMG_DIR = "./data/image/flowers"

In [None]:
display(PImage.open(f"{IMG_DIR}/00_001.png"))
display(PImage.open(f"{IMG_DIR}/15_001.png"))

### Find Representative Colors

The overall process for organizing our images by color will be something like this:

1. Iterate over all files in the `data/image/flowers` directory, open each image file and treat it as a dataset
   1. Load image into a `DataFrame` where each pixel is a row and R,G,B values are columns/features
   2. Cluster into $2$ - $16$ colors
   3. Pick $3$ or $4$ representative colors
   4. Store image filenames and their representative colors in a Python object
2. Once all images have been processed we can order our dataset by different color characteristics: white to black, red to blue, hue value, brightness

### One Image

Let's step through the process of getting representative colors for one image, and then we can repeat this in a loop to process all of the flower images.

#### Open Image

The `PIL` library does all the work here:

In [None]:
# Open image
fname = "00_001.png"
pimg = PImage.open(f"{IMG_DIR}/{fname}").convert("RGB")

#### Put into `DataFrame`

We get the pixels and make a dataset/`DataFrame` out of them:

In [None]:
# Load into DataFrame
pxs = get_pixels(pimg)
pxs_df = pd.DataFrame(pxs, columns=["R", "G", "B"])

#### Cluster colors

Create a clustering object, cluster colors into $8$ clusters with `fit_predict()` and take a look at our color palette (`cluster_centers_`):

In [None]:
# TODO: Create Clustering object
kmeans_colors = KMeans(n_clusters=5, random_state=42)

# TODO: Cluster by color
kmeans_colors.fit(pxs_df)

# TODO: Take a look at the color palette (cluster_centers_)
print(kmeans_colors.cluster_centers_)
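To get a feel for what clustering pixels into a palette means, here is a minimal pure-Python sketch of the basic k-means loop (assign each pixel to its nearest center, then move each center to the mean of its pixels). It is a simplification, not how `sklearn`'s `KMeans` is implemented: the helper names are hypothetical, the starting centers are passed in by hand, and the pixels are made up.

```python
import math

def assign(pixels, centers):
    """Label each pixel with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in pixels]

def update(pixels, labels, centers):
    """Move each center to the mean of the pixels assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, lab in zip(pixels, labels) if lab == i]
        if members:
            new_centers.append([sum(ch) / len(members) for ch in zip(*members)])
        else:
            new_centers.append(list(old))  # keep an empty cluster's center in place
    return new_centers

def tiny_kmeans(pixels, init_centers, max_iter=20):
    """Alternate assign/update until the labels stop changing."""
    centers = [list(c) for c in init_centers]
    labels = assign(pixels, centers)
    for _ in range(max_iter):
        centers = update(pixels, labels, centers)
        new_labels = assign(pixels, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

toy_pixels = [
    [250, 10, 10], [240, 20, 30], [230, 0, 20],   # reddish
    [10, 200, 30], [20, 220, 10], [0, 210, 20],   # greenish
]
labels, palette = tiny_kmeans(toy_pixels, init_centers=[toy_pixels[0], toy_pixels[3]])
print(labels)
print(palette)
```

The resulting `palette` plays the same role as `cluster_centers_`: one averaged color per cluster, stored as `float` values.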


#### Checkpoint

<span style="color:hotpink;">
Does anything stand out about the colors?
</span>

<span style="color:hotpink;">The colors match the main tones of the flower image, showing a good mix of bright and dark shades that represent the petals and background.</span>

#### Reconstruct Image

Since we're only doing one image for now, let's take a look at the clustering result.

This is like in the lecture notebook. We'll start with an empty pixel array and as we iterate through the `DataFrame` of cluster ids we append the corresponding colors to it.

In [None]:
# TODO: create empty pixel array
pxs_post = []

# TODO: iterate through resulting list of cluster ids
for cid in kmeans_colors.labels_:
    # TODO: append corresponding color value, converting to int
    pxs_post.append([int(c) for c in kmeans_colors.cluster_centers_[cid]])



Now we can look at the image. If this next cell gives errors about using `float` values in images, just make sure the pixel values that are being appended above are all whole number `int` values.

In [None]:
display(make_image(pxs_post, width=pimg.size[0]))

#### Checkpoint

<span style="color:hotpink;">
How does changing the number of clusters affect the resulting image?<br>Try some lower values like <code>2</code> and <code>4</code>, and also some higher ones like <code>12</code> and <code>16</code>. Take a look at a different image.
</span>

<span style="color:hotpink;">Fewer clusters (like 2-4) make the image very simplified with only a few colors, while more clusters (like 12-16) capture more detail and look closer to the original. Moderate numbers (around 5-8) give a good balance.
</span>

#### Pick Colors

Ok, we have some representative colors for our images. We should keep more than one color, but maybe we don't have to keep $12$.

If we put our predictions in a `DataFrame` we can use the `value_counts()` function of our `DataFrame` to see how many pixels are represented by each of our cluster colors, and get the result ordered from the most frequent to the least frequent cluster id:

In [None]:
# cluster ids and pixel counts, ordered by descending counts
px_clusters_df = pd.DataFrame(kmeans_colors.labels_, columns=["clusters"])
ccounts = px_clusters_df["clusters"].value_counts()
display(ccounts)


Since what we are really trying to do here is get some information about the colors of the flowers present in our images, and given the type of images we have, we can start by assuming that the flower colors will be in the top-$4$ clusters returned by `value_counts()`.

We can revisit this assumption later. We might also want to add some filters here to ignore sky and vegetation colors (blues and greens) and only keep flower colors.

For now, let's just grab the top $4$ colors from `value_counts()`, remembering we want to keep their rounded `int` values and not the default `float` values in `cluster_centers_`.
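A small aside on "rounded `int` values": calling `int()` on a `float` drops the fractional part, while `round()` goes to the nearest whole number. For colors the difference is at most one unit per channel, so either is probably fine here, but it's worth knowing which one is being used. The values below are made up.

```python
center = [199.9, 64.6, 0.4]  # made-up float values like those in cluster_centers_

truncated = [int(c) for c in center]        # drops the fractional part
rounded = [int(round(c)) for c in center]   # rounds to the nearest whole number first

print(truncated)
print(rounded)
```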

In [None]:
# Object to keep colors for each file
file_info = {
  "filename": fname,
  "colors": []
}

# TODO: go through ccounts.index and get the corresponding color for each cluster

# TODO: add top-4 colors to the "colors" key of the file_info object
for cluster_id in ccounts.index[:4]:  # top 4 clusters
    # Get the cluster color and convert to int
    color = [int(c) for c in kmeans_colors.cluster_centers_[cluster_id]]
    file_info["colors"].append(color)

In [None]:
display(file_info)

#### Checkpoint

<span style="color:hotpink;">
Why might we want to cluster into <code>8</code> or even <code>12</code> colors when in the end we're only keeping <code>4</code>?
</span>

<span style="color:hotpink;">Clustering into more colors (like 8 or 12) helps capture subtle variations in the image. Even if we only keep the top 4 most frequent colors, having more clusters ensures that the dominant colors we choose are more accurate and not averaged with minor shades.
</span>

### Iterate and Cluster

We've processed one image, now let's process $600$... for-loops FTW!

We'll need to loop through all of the images in our directory and repeat the process above for each one of them.

We can create a function that takes a filename as input and returns the top-$4$ colors for that image, or... we can just put all of the clustering logic in the body of a for loop. Whichever is easiest.
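If we go the function route, the "pick the most common colors" part could look something like the sketch below. `top_colors()` is a hypothetical helper, not part of the course utilities: it assumes it is handed the cluster ids and cluster centers that a clustering step produced (for example `kmeans_colors.labels_` and `kmeans_colors.cluster_centers_`), and it rounds each channel to the nearest whole number. The inputs here are made up.

```python
from collections import Counter

def top_colors(labels, centers, n=4):
    """Return the n most common cluster colors as lists of whole-number ints."""
    ranked = [cid for cid, _ in Counter(labels).most_common(n)]
    return [[int(round(ch)) for ch in centers[cid]] for cid in ranked]

toy_labels = [2, 0, 2, 1, 2, 0]   # made-up cluster id per pixel
toy_centers = [[10.2, 20.7, 30.1], [200.0, 100.0, 50.0], [0.4, 0.6, 254.9]]

result = top_colors(toy_labels, toy_centers, n=2)
print(result)
```

Keeping this step in its own function makes it easy to test on a tiny input before running it on hundreds of images.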

Let's get started.

In [None]:
# list of all files in the flowers directory
flower_files = sorted([f for f in listdir(IMG_DIR) if f.endswith(".png")])

Here's the loop. In the end we want our `file_colors` list to have objects that have a filename and $4$ colors associated with each filename. Something like:

```py
[
  {
    "filename": "00_001.png",
    "colors": [[12,44,12], [112,144,62],  [12,84,112], [212,144,102]]
  },
  {
    "filename": "00_002.png",
    "colors": [[22,24,28], [112,114,122], [128,200,2], [250,240,230]]
  },
  ...
]
```

This can take a while to run (up to a minute for $600$ images). We can use slicing to test our logic on a subset of `flower_files` before processing all $600$ images.

In [None]:
# List to keep colors for each file
file_colors = []

# TODO: get colors for each image
for fname in flower_files:
    # TODO: add logic here
    # Open image and convert to RGB
    pimg = PImage.open(f"{IMG_DIR}/{fname}").convert("RGB")
    
    # Get pixels as a DataFrame
    pxs = get_pixels(pimg)
    pxs_df = pd.DataFrame(pxs, columns=["R", "G", "B"])
    
    # Cluster pixels into 8 colors (can adjust this number)
    kmeans_colors = KMeans(n_clusters=8, random_state=42)
    kmeans_colors.fit(pxs_df)
    
    # Count pixels per cluster
    px_clusters_df = pd.DataFrame(kmeans_colors.labels_, columns=["clusters"])
    ccounts = px_clusters_df["clusters"].value_counts()
    
    # TODO: add filename+colors object to list of objects
    # Pick top 4 colors
    top_colors = []
    for cluster_id in ccounts.index[:4]:
        color = [int(c) for c in kmeans_colors.cluster_centers_[cluster_id]]
        top_colors.append(color)
    
    file_colors.append({
        "filename": fname,
        "colors": top_colors
    })

# Display first few results to check
file_colors[:3]


#### Order Images (almost)

We have a list with objects that keep track of filenames and representative colors. We could create a `DataFrame` or csv dataset with these, but let's go ahead and just use this directly in this format.

What we want to do is re-order our list of objects, but using a `key` function that takes each object's colors into consideration.

We'll look into how to do this dynamically later, but for now let's order our images by something like _brightness_. It's _like_ brightness because what we'll do is measure how close each image is to the white color `(255,255,255)`.

We'll need some helper functions first:

- `color_distance()`: takes $2$ colors and returns the distance between them
- `min_color_distance()`: given a reference color and a list of colors, returns the distance between the reference color and the closest color in the list

In [None]:
# TODO: implement function that returns distance between two colors
import math

def color_distance(c0, c1):
    # Euclidean distance between two RGB colors
    return math.sqrt((c0[0]-c1[0])**2 + (c0[1]-c1[1])**2 + (c0[2]-c1[2])**2)


Some tests for the `color_distance()` function:

In [None]:
# Some tests for the color_distance() function
print(color_distance([0,0,0], [255,255,255]), "should be", 255*3**.5)
print(color_distance([0,100,0], [100,100,0]), "should be", 100)
print(color_distance([55,222,120], [91,51,192]), "should be", 189)
print(color_distance([147,207,246], [87,57,50]), "should be", 254)
print(color_distance([12,250,126], [112,10,195]), "should be", 269)
print(color_distance([106,71,61], [105,136,100]), "should be", 75.81)
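As a cross-check, the same Euclidean distance is available in the standard library as `math.dist()` (Python 3.8 and later). The function name `color_distance_alt()` below is just for illustration.

```python
import math

def color_distance_alt(c0, c1):
    """Euclidean distance between two RGB colors via the standard library."""
    return math.dist(c0, c1)

d_simple = color_distance_alt([0, 100, 0], [100, 100, 0])
d_mixed = color_distance_alt([55, 222, 120], [91, 51, 192])
print(d_simple, d_mixed)
```

Both values should agree with the expected results in the tests above (`100` and `189`).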

In [None]:
# TODO: implement function that returns minimum distance between a reference color and colors from a list
def min_color_distance(ref_color, color_list):
  # TODO: add logic here
  distances = [color_distance(ref_color, c) for c in color_list]
  return min(distances)


Three tests for the `min_color_distance()` function:

In [None]:
# Some tests for the min_color_distance() function
print(min_color_distance([0,0,0], [[255,255,255],[0,100,0],[100,100,0],[58,58,58]]), "should be", 100)
print(min_color_distance([0,0,0], [[255,255,255],[0,100,0],[100,100,0],[58,57,58]]), "should be", 99.88)
print(min_color_distance([91,51,192], [[147,207,246],[87,57,50],[12,250,126],[112,10,195]]), "should be", 46.16)

#### Order Images (for real now)

Alright. We have a function that can be used to order our images by their distance to a given color.

Let's order our images by how close they are to the brightest color `(255,255,255)`. We'll define a `key` function that, given an object from our `file_colors` list, returns how close that image is to the color `(255,255,255)`.

In [None]:
# TODO: implement function that returns how close our image is to the color white
def by_bright_dist(A):
  # TODO: add logic here
  colors = A["colors"]
  return min_color_distance([255, 255, 255], colors)

Order the list and write out a `JSON` file with the image order.

In [None]:
file_colors_sorted = sorted(file_colors, key=by_bright_dist)

In [None]:
files_sorted = [A["filename"] for A in file_colors_sorted]

with open("./data/flower_order.json", "w") as ofp:
  json.dump(files_sorted, ofp)

### Viewing Results

We can check the results by running a webserver and looking at a simple web page that orders the images according to the resulting `JSON` file from above.

We'll make use of the [`Live Server`](https://marketplace.visualstudio.com/items?itemName=ritwickdey.LiveServer) VSCode extension.

We can start the server by clicking on the "_Go Live_" button towards the right hand side of the bar at the very bottom of our text editor:

<img src="./imgs/go_live.jpg" width="600px">

Clicking the "_Go Live_" button in Codespace should open up a new tab with a plain html navigation view of our repository. Clicking on the `html/` directory should open up a web page with all of the flower images. If not, you can use your Codespace url to try to find the web server address.

If your Codespace url is something like:<br>`https://mango-special-giggle-v6v7asd322f7p6.github.dev/`

Then, the webserver should be running at:<br>`https://mango-special-giggle-v6v7asd322f7p6-5500.app.github.dev/`

### Review, Contemplate, Experiment

Yes, images with white parts are towards the beginning. The images towards the end, however, aren't necessarily the ones with dark flowers: they are the ones whose representative colors are all farthest from white `(255,255,255)`, which includes very saturated colors/images.

A couple of interesting experiments here could be:
- Decrease the number of clusters or the number of colors kept after clustering.
- Use different colors as the reference for the distance functions. For example, create `by_gold_dist()` or `by_purple_dist()` functions to use as the `key` for sorting.
- Order the list of cluster colors by [hue](https://stackoverflow.com/questions/23090019/fastest-formula-to-get-hue-from-rgb). This can be a bit tricky to get right because some colors, like white, black and gray, don't have a unique value for hue, but depend on other aspects of the color, like saturation and lightness, to be well-defined.
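As a starting point for the last two experiments, here is a hedged sketch. `make_distance_key()` and `hue_of()` are hypothetical helpers: the first builds a `key` function for any reference color (so we don't need a separate `by_..._dist()` function per color), and the second uses the standard library's `colorsys` module to get a hue value. The file entries are made up and only follow the same shape as `file_colors`.

```python
import colorsys
import math

def make_distance_key(ref_color):
    """Build a sort key: distance from ref_color to an image's closest color."""
    def key(info):
        return min(math.dist(ref_color, c) for c in info["colors"])
    return key

def hue_of(color):
    """Hue in [0, 1) for an RGB color given as 0-255 values."""
    r, g, b = (ch / 255 for ch in color)
    hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return hue

toy_files = [
    {"filename": "a.png", "colors": [[20, 20, 200], [0, 90, 0]]},
    {"filename": "b.png", "colors": [[250, 210, 10], [30, 30, 30]]},
    {"filename": "c.png", "colors": [[255, 255, 255], [200, 40, 40]]},
]

by_gold = make_distance_key([255, 215, 0])
order = [f["filename"] for f in sorted(toy_files, key=by_gold)]
print(order)
print(hue_of([255, 0, 0]), hue_of([0, 255, 0]), hue_of([128, 128, 128]))
```

Note that gray comes back with hue `0.0`, the same as pure red. This is the caveat mentioned above: for whites, blacks and grays the hue isn't meaningful on its own, so a hue-based ordering would probably also need to look at saturation or lightness.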

In [None]:
# TODO: experiment with number of clusters, number of colors, reference colors or hue distances
# Re-cluster all images with 12 clusters and keep top 4 colors
file_colors_all = []

for fname in flower_files:  # processes all images; slice (e.g. flower_files[:5]) for a quicker test
    pimg = PImage.open(f"{IMG_DIR}/{fname}").convert("RGB")
    pxs_df = pd.DataFrame(get_pixels(pimg), columns=["R", "G", "B"])
    
    # Cluster colors and keep top 4
    kmeans_colors = KMeans(n_clusters=12, random_state=42)
    kmeans_colors.fit(pxs_df)
    px_clusters_df = pd.DataFrame(kmeans_colors.labels_, columns=["clusters"])
    ccounts = px_clusters_df["clusters"].value_counts()
    top_colors = [[int(c) for c in kmeans_colors.cluster_centers_[i]] for i in ccounts.index[:4]]
    
    file_colors_all.append({"filename": fname, "colors": top_colors})

# Sort all images by gold color
def by_gold_dist(A):
    return min_color_distance([255, 215, 0], A["colors"])

file_colors_sorted = sorted(file_colors_all, key=by_gold_dist)

# Check the sorted filenames
files_sorted = [f["filename"] for f in file_colors_sorted]
files_sorted


### Interpretation

<span style="color:hotpink;">
What did you try ? What happened ?
</span>

<span style="color:hotpink;">
I re-clustered all the images using 12 clusters and kept the top 4 colors. Then I sorted them by how close their colors are to gold.

* The images with more gold tones came first, and those with less gold came later.

* Changing the reference color changes the order of the images.

* Using more clusters helped capture subtle colors, so the sorting looks more accurate.</span>

### Conclusion

It's challenging to define a set of functions that will perfectly order our flowers by color without first having to define very specific color values for filtering and corner-cases. At a high level, we can imagine that this is because color is a $3$-dimensional value, and we're using it to organize our images into a single-dimensional order.

The beginning of our ordering is usually pretty good, since there's only one way for a color to be _close_ to our reference color, but the ordering gets less consistent towards the end because there are many different ways for a color to be _far_ from the reference color.
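A tiny worked example of "many different ways to be far", using made-up reference colors: pure red and pure blue look nothing alike, yet they sit at the same distance from white, so a distance-to-white ordering can't tell them apart.

```python
import math

WHITE = [255, 255, 255]
toy_colors = {
    "pure red": [255, 0, 0],
    "pure blue": [0, 0, 255],
    "black": [0, 0, 0],
}

dists = {name: math.dist(WHITE, c) for name, c in toy_colors.items()}
for name, d in dists.items():
    print(f"{name}: {d:.1f}")
```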

Next week we'll see a very powerful technique that, amongst other things, will help us get around this kind of "_dimensionality mismatch_".