## Introduction to Unsupervised Learning

In this section we introduce unsupervised learning, a process in which data is explored without predefined labels. Unsupervised learning is applied in customer segmentation, content recommendation, and many other fields. The examples in this lab follow the attached lab plan.

In [None]:
print('Welcome to the Unsupervised Learning Lab!')

The code above prints a welcome message to introduce the lab. 

Challenge: Modify the printed message to include additional details about unsupervised learning.

## Exploring the K-Means Clustering Algorithm

K-Means clustering partitions data into k groups by iterating two steps: assign each data point to its closest centroid, measured by squared Euclidean distance, and then recompute each centroid as the mean of the points assigned to it. This section uses a simple toy dataset to illustrate these concepts, as described in the lab plan.
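
As a compact summary of the two steps described above, one iteration can be sketched as follows. The notation is introduced here for illustration and is not taken from the lab plan: $x_i$ is a data point, $\mu_j$ is the centroid of cluster $j$, $c_i$ is the cluster index assigned to point $i$, and $C_j$ is the set of points currently assigned to cluster $j$.

```latex
\begin{aligned}
\text{Assignment:}\quad & c_i = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert^2 \\
\text{Update:}\quad & \mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i,
  \qquad C_j = \{\, i : c_i = j \,\}
\end{aligned}
```

The update step is only defined when $C_j$ is non-empty, which is why empty clusters need special handling in code.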

In [None]:
import numpy as np

# Define a toy dataset of 2D points
X = np.array([[1, 2], [1, 4], [2, 3], [5, 8], [6, 9], [5, 6]])

# Set the number of clusters
k = 2

# Randomly assign a cluster for each data point
np.random.seed(42)  # for reproducibility
assignments = np.random.randint(0, k, size=X.shape[0])
print('Initial cluster assignments:', assignments)

The code defines a small toy dataset and randomly assigns each point to one of two clusters. The use of a fixed random seed ensures consistent results upon each run. 

Challenge: Experiment with a different random seed or change the number of clusters to observe how the initial assignments vary.

In [None]:
def squared_distance(a, b):
    """Compute the squared Euclidean distance between two vectors."""
    return np.sum((a - b) ** 2)

# Calculate the squared distance between the first two points
dist = squared_distance(X[0], X[1])
print('Squared distance between first two points:', dist)

The function above computes the squared Euclidean distance between two vectors. In the example, it calculates the squared distance between the first two points of the toy dataset, [1, 2] and [1, 4], which is (1 - 1)^2 + (2 - 4)^2 = 0 + 4 = 4.

Challenge: Enhance the function by adding a print statement that shows the individual differences between vector elements during the computation.
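
The squared_distance function above compares one pair of vectors at a time. As a minimal sketch, the same computation can be applied from one point to every row of a dataset using NumPy broadcasting. The helper name `squared_distances_to_all` is hypothetical and not part of the lab code.

```python
import numpy as np

# Same toy dataset as in the lab
X = np.array([[1, 2], [1, 4], [2, 3], [5, 8], [6, 9], [5, 6]])

def squared_distances_to_all(point, points):
    """Squared Euclidean distance from `point` to each row of `points`.

    Hypothetical helper for illustration only.
    """
    diffs = points - point              # broadcasting: shape (n_points, n_features)
    return np.sum(diffs ** 2, axis=1)   # one squared distance per row

dists = squared_distances_to_all(X[0], X)
print('Squared distances from the first point:', dists)
```

The second entry of the result corresponds to the pair compared by squared_distance(X[0], X[1]), so it should equal 4.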

In [None]:
# Compute centroids based on initial cluster assignments
centroids = []
for i in range(k):
    cluster_points = X[assignments == i]
    centroids.append(np.mean(cluster_points, axis=0))
centroids = np.array(centroids)
print('Computed centroids:', centroids)

The above code calculates centroids by computing the mean of the data points assigned to each cluster. The resulting centroids are then printed. 

Challenge: Modify the centroid computation to use the median of the points instead of the mean, and compare the outcomes.

In [None]:
import matplotlib.pyplot as plt

# Visualize the toy dataset with the initial cluster assignments
plt.scatter(X[:, 0], X[:, 1], c=assignments, cmap='viridis', label='Data Points')

# Plot the computed centroids
for i, center in enumerate(centroids):
    plt.scatter(center[0], center[1], marker='x', color='red', s=100, label=f'Centroid {i}')

plt.title('Toy Dataset Cluster Visualization')
plt.legend()
plt.show()

This visualization displays the toy data points colored by their initial cluster assignments, with red 'x' markers indicating the centroids.

Challenge: Modify the visualization by changing the colormap or marker style, and discuss the effect on your interpretation of the clusters.

## Code-Along Activity: Building K-Means from Scratch

In this section, we build the K-Means algorithm step by step. We begin by defining helper functions to compute cluster assignments and update centroids, and then combine these into the main K-Means function.

In [None]:
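import numpy as np

def compute_clusters(X, centroids):
    """Return the index of the nearest centroid for each point in X."""
    clusters = []
    for x in X:
        # Squared Euclidean distance from x to every centroid
        distances = [np.sum((x - c) ** 2) for c in centroids]
        clusters.append(np.argmin(distances))
    return np.array(clusters)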

# Example usage:
# clusters = compute_clusters(X, centroids)

The function compute_clusters assigns each data point in X to the closest centroid using squared Euclidean distance. 

Challenge: Modify this function to use an alternative distance metric, such as Manhattan distance.
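
The loop in compute_clusters visits one point at a time. A vectorized variant, sketched below under the assumption that X and the centroids are 2D NumPy arrays, computes all point-to-centroid squared distances in one step. The name `compute_clusters_vectorized` and the `example_centroids` values are hypothetical and chosen only for illustration.

```python
import numpy as np

def compute_clusters_vectorized(X, centroids):
    """Assign each row of X to the index of its nearest centroid.

    Hypothetical vectorized counterpart to compute_clusters.
    """
    # diffs has shape (n_points, n_centroids, n_features)
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)   # shape (n_points, n_centroids)
    return np.argmin(sq_dists, axis=1)      # nearest centroid per point

X = np.array([[1, 2], [1, 4], [2, 3], [5, 8], [6, 9], [5, 6]])
example_centroids = np.array([[1.0, 3.0], [5.0, 8.0]])  # illustrative values, not lab output

labels = compute_clusters_vectorized(X, example_centroids)
print('Assignments:', labels)
```

With these illustrative centroids, the first three points fall in cluster 0 and the last three in cluster 1. The loop-based version is easier to read; the broadcast version mainly pays off on larger datasets.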

In [None]:
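import numpy as np

def update_centroids(X, clusters, k):
    """Return the mean of the points assigned to each of the k clusters."""
    new_centroids = []
    for i in range(k):
        points = X[clusters == i]
        # Note: an empty cluster yields a NaN centroid here
        new_centroids.append(np.mean(points, axis=0))
    return np.array(new_centroids)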

# Example usage:
# new_centroids = update_centroids(X, assignments, k)

The update_centroids function recalculates the centroids by finding the mean of data points in each cluster. 

Challenge: Adjust the function to handle empty clusters (for example, by keeping the previous centroid if no points are assigned).
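
To see why empty clusters matter, the short sketch below shows what np.mean returns when no points are assigned to a cluster. The two-point dataset and the assignments are made up for illustration, and the warning filter is there only to keep the output quiet.

```python
import warnings

import numpy as np

points = np.array([[1.0, 2.0], [3.0, 4.0]])   # made-up data for illustration
assignments = np.array([0, 0])                # every point lands in cluster 0

empty = points[assignments == 1]              # cluster 1 has no members: shape (0, 2)

with warnings.catch_warnings():
    # NumPy emits RuntimeWarnings when averaging an empty slice
    warnings.simplefilter('ignore', category=RuntimeWarning)
    centroid = np.mean(empty, axis=0)

print('Shape of empty cluster:', empty.shape)
print('Centroid of empty cluster:', centroid)
```

The resulting centroid contains only NaN values, which then propagate into every later distance computation.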

In [None]:
import numpy as np

def k_means(X, k, max_iter=10):
    """Run K-Means on X and return (centroids, clusters)."""
    # Initialize centroids by selecting k distinct data points at random
    indices = np.random.choice(len(X), k, replace=False)
    centroids = X[indices]

    for iteration in range(max_iter):
        clusters = compute_clusters(X, centroids)
        new_centroids = update_centroids(X, clusters, k)

        # Stop once the centroids no longer move (within floating-point tolerance)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, clusters

# Run K-Means on the toy dataset with a higher iteration limit
centroids_final, clusters_final = k_means(X, k, max_iter=20)
print('Final centroids:', centroids_final)
print('Final cluster assignments:', clusters_final)

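The k_means function integrates the helper functions to perform clustering. It initializes the centroids with k randomly chosen data points, alternates between updating cluster assignments and centroids, and stops when the centroids no longer change or after max_iter iterations. Because the initialization is not seeded, the result can differ between runs.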

Challenge: Re-run the k_means function with a different value of k (for example, k=3) and observe how the final centroids and assignments differ.
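
When comparing runs of k_means, it helps to have a single number that summarizes how tight the clusters are. One common choice is the within-cluster sum of squared distances, sketched below. The function name `within_cluster_ss` and the example assignments are assumptions made for illustration and are not output from the lab code.

```python
import numpy as np

def within_cluster_ss(X, centroids, clusters):
    """Sum of squared distances from each point to its assigned centroid.

    Hypothetical helper for illustration only.
    """
    # centroids[clusters] picks the assigned centroid for every point
    return float(np.sum((X - centroids[clusters]) ** 2))

X = np.array([[1, 2], [1, 4], [2, 3], [5, 8], [6, 9], [5, 6]])
example_clusters = np.array([0, 0, 0, 1, 1, 1])   # illustrative assignments
example_centroids = np.array(
    [X[example_clusters == j].mean(axis=0) for j in range(2)]
)

score = within_cluster_ss(X, example_centroids, example_clusters)
print('Within-cluster sum of squares:', score)
```

A lower value indicates tighter clusters for the same k. Because the value generally decreases as k grows, it is most useful for comparing runs that use the same number of clusters.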

## Working with a Real-World Inspired Dataset

In this section we apply our K-Means algorithm to a real-world dataset. The Iris dataset is used with its first two features (sepal length and sepal width) to facilitate 2D visualization. This practical example mirrors the real-world clustering challenges described in the lab plan.

In [None]:
from sklearn.datasets import load_iris

data = load_iris()
# Extract the first two features (sepal length and sepal width)
X_real = data.data[:, :2]

The above code loads the Iris dataset and selects its first two features to simplify the clustering visualization. 

Challenge: Try selecting a different pair of features (for example, the last two features) and analyze how the resulting clusters change.

In [None]:
centroids_real, clusters_real = k_means(X_real, 3, max_iter=20)

import matplotlib.pyplot as plt

plt.scatter(X_real[:, 0], X_real[:, 1], c=clusters_real, cmap='viridis', label='Data Points')
plt.scatter(centroids_real[:, 0], centroids_real[:, 1], marker='x', color='red', s=100, label='Centroids')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('K-Means Clustering on Iris Dataset (first 2 features)')
plt.legend()
plt.show()

This visualization displays the Iris dataset with k=3 clusters. Data points are colored by cluster assignment and centroids are marked in red. 

Challenge: Experiment with different values of k and describe how the cluster visualization changes.

## Extensions: Filtering Techniques and Practical Applications

Clustering results can be used to filter data in applications such as collaborative and content-based filtering. In this section, we filter the Iris dataset by cluster assignment to simulate this process.

In [None]:
# Filter the Iris dataset for data points assigned to cluster 0
filtered_data = X_real[clusters_real == 0]
print('Data points in Cluster 0:')
print(filtered_data)

The code filters the dataset to display only those data points that belong to cluster 0. This approach can be extended to various filtering techniques in real-world applications. 

Challenge: Modify the filtering criterion to display data points from a different cluster and analyze the differences.
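
The same boolean-mask filtering can be repeated for every cluster to get a quick summary. The sketch below uses the toy dataset with made-up assignments (`toy_labels` is hypothetical, not output from k_means) and reports the size and mean of each cluster.

```python
import numpy as np

X_toy = np.array([[1, 2], [1, 4], [2, 3], [5, 8], [6, 9], [5, 6]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])   # illustrative assignments
n_clusters = 2

# Count how many points fall in each cluster
sizes = np.bincount(toy_labels, minlength=n_clusters)
print('Points per cluster:', sizes)

# Filter the data one cluster at a time with a boolean mask
for j in range(n_clusters):
    members = X_toy[toy_labels == j]
    print(f'Cluster {j}: {len(members)} points, mean = {members.mean(axis=0)}')
```

Replacing `X_toy` and `toy_labels` with X_real and clusters_real would apply the same summary to the Iris results.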

## Reflection and Professional Relevance

Reflect on the clustering process and its practical implications. Consider the effects of random initialization, choice of k, and feature scaling on the clustering results, as well as how these aspects relate to real-world data science challenges.

In [None]:
def print_reflections():
    questions = [
        'How does random initialization affect the final clusters?',
        'What changes do you observe when you modify the number of clusters (k)?',
        'How might feature scaling impact the clustering outcome?'
    ]
    for q in questions:
        print(q)

print_reflections()

The function above prints a set of reflective questions to help you consider the sensitive aspects of the K-Means algorithm: initialization, the choice of k, and feature scaling.

Challenge: Extend the list with an additional question regarding the impact of convergence tolerance on algorithm performance.
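
To explore the feature scaling question, one option is to standardize each feature before clustering so that no single feature dominates the squared distances. The sketch below is a minimal z-score implementation. The function name `standardize` and the `X_demo` values are hypothetical and chosen only to show features on very different scales.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit standard deviation.

    Hypothetical helper for illustration only.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # leave constant columns unchanged instead of dividing by zero
    return (X - mean) / std

# Made-up data: the second feature has a much larger range than the first
X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]])
X_scaled = standardize(X_demo)

print('Column means after scaling:', X_scaled.mean(axis=0))
print('Column standard deviations after scaling:', X_scaled.std(axis=0))
```

Running k_means on the scaled data and on the raw data, then comparing the assignments, is one way to observe the effect the reflection question asks about.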