# Unsupervised Machine Learning Tutorials

Run the cell below if you need to install the required libraries. In Google Colab they come pre-installed.


In [None]:
!pip install scikit-learn matplotlib

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_digits
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

## k-means clustering

k-means partitions observations into `k` groups by iteratively updating cluster assignments and centroids. Starting with randomly chosen centroids, each iteration performs two steps:
1. **Assign step:** assign each sample to the nearest centroid using Euclidean distance.
2. **Update step:** recompute each centroid as the mean of all samples assigned to it.

The goal is to minimize the within-cluster sum of squared distances
$$
J = \sum_{i=1}^n \|x_i - \mu_{c_i}\|^2,
$$
where $x_i$ is a sample and $\mu_{c_i}$ is the centroid of its cluster. Iterations stop once assignments stabilize or a maximum iteration count is reached.
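
The two steps and the objective can be traced on a toy dataset. The sketch below uses made-up 2-D points and hand-picked starting centroids (both are illustrative assumptions, not part of the tutorial's data) and checks that one assign/update round does not increase $J$.

```python
import numpy as np

# Toy data: two loose groups of 2-D points (illustrative values only).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
# Hand-picked starting centroids (an assumption for this demo).
centroids_toy = np.array([[0.0, 0.0], [1.0, 0.0]])

def objective(X, labels, centroids):
    """Within-cluster sum of squared distances J."""
    return float(((X - centroids[labels]) ** 2).sum())

# Assign step: index of the nearest centroid for every sample.
dists = np.linalg.norm(X_toy[:, None] - centroids_toy[None], axis=2)
labels_toy = dists.argmin(axis=1)
J_before = objective(X_toy, labels_toy, centroids_toy)

# Update step: each centroid becomes the mean of its assigned samples.
centroids_new = np.array([X_toy[labels_toy == i].mean(axis=0) for i in range(2)])
J_after = objective(X_toy, labels_toy, centroids_new)

print('J before update:', J_before)
print('J after update: ', J_after)
```

Moving each centroid to the mean of its assigned samples can only lower (or keep) $J$ for fixed assignments, which is why the iteration converges.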

In [None]:
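def kmeans_scratch(X, k, max_iter=100, random_state=42):
    """Lloyd's algorithm: returns (labels, centroids) for k clusters."""
    rng = np.random.default_rng(random_state)
    # Initialize centroids with k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of assigned samples; keep the old centroid if a cluster is empty.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids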


In [None]:
X, _ = load_iris(return_X_y=True)
scratch_labels, scratch_centers = kmeans_scratch(X, k=3)
print('Cluster counts (scratch):', np.bincount(scratch_labels))
print('Centroids (scratch):', scratch_centers)


### scikit-learn k-means

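`KMeans` provides an optimized implementation of the same algorithm. By default it initializes centroids with k-means++ rather than purely at random, and `n_init=10` runs the algorithm from 10 different initializations and keeps the solution with the lowest $J$.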
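
As a quick check that `KMeans` reports the objective $J$ defined earlier, the sketch below recomputes $J$ from the fitted labels and centers and compares it with the estimator's `inertia_` attribute. The variable names are illustrative; the hyperparameters mirror the ones used in this tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X_iris, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_iris)

# Recompute J: squared distance from each sample to its assigned center.
assigned_centers = km.cluster_centers_[km.labels_]
J_manual = ((X_iris - assigned_centers) ** 2).sum()

print('Manual J:', J_manual)
print('inertia_:', km.inertia_)
```

The two numbers should agree up to floating-point rounding.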

In [None]:
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
sk_labels = kmeans.fit_predict(X)
print('Cluster counts (scikit-learn):', np.bincount(sk_labels))


In [None]:
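# Plot the first two features (sepal length and width) colored by cluster.
plt.scatter(X[:, 0], X[:, 1], c=sk_labels, cmap='viridis', s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100, linewidths=2, label='Centers')
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('k-means Clustering')
plt.legend()
plt.tight_layout()
plt.show()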


## Principal component analysis

PCA reduces dimensionality by projecting data onto the directions of maximum variance. For a centered data matrix $X$ with $n$ samples, the covariance matrix $C = \frac{1}{n-1} X^T X$ is eigendecomposed. The eigenvectors with the largest eigenvalues form the columns of the projection matrix $W$, so the projected samples are $Z = X W$. Each eigenvalue divided by the sum of all eigenvalues gives that component's explained variance ratio.
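
These relationships can be checked numerically. The sketch below uses synthetic data (random values chosen only for illustration) to confirm that $\frac{1}{n-1} X^T X$ matches `np.cov` for centered data, and that the variance of each projected column of $Z$ equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with correlated features (illustrative only).
A = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
X_c = A - A.mean(axis=0)              # center each feature
n = X_c.shape[0]

C = X_c.T @ X_c / (n - 1)             # covariance matrix from the formula
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]     # sort descending by variance
eigvals, W = eigvals[order], eigvecs[:, order]

Z = X_c @ W                           # projected samples
print('Matches np.cov:', np.allclose(C, np.cov(X_c, rowvar=False)))
print('Var(Z) equals eigenvalues:', np.allclose(Z.var(axis=0, ddof=1), eigvals))
print('Explained variance ratio:', eigvals / eigvals.sum())
```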

In [None]:
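def pca_scratch(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    X_c = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(X_c, rowvar=False)          # features are columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending eigenvalues
    idx = np.argsort(eigvals)[::-1]          # reorder by descending variance
    eigvals = eigvals[idx]
    eigvecs = eigvecs[:, idx]
    components = eigvecs[:, :n_components]   # projection matrix W
    explained = eigvals[:n_components] / eigvals.sum()
    reduced = X_c @ components               # projected samples Z
    return reduced, explained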


In [None]:
X, y = load_digits(return_X_y=True)
scratch_reduced, scratch_ratio = pca_scratch(X, n_components=2)
print('Explained variance ratio (scratch):', scratch_ratio)
print('Reduced shape:', scratch_reduced.shape)


### scikit-learn PCA

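`PCA` handles the centering and the decomposition for us and exposes the variance ratio via `explained_variance_ratio_`. The signs of individual components may differ from the scratch version, since eigenvectors are only defined up to sign.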
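
To see that the two routes agree, the sketch below compares a covariance-eigendecomposition projection with `PCA.fit_transform` on the digits data. The comparison uses absolute values so that a sign flip in a component does not count as a mismatch. The variable names are illustrative, and `svd_solver='full'` is an extra setting chosen here (not used elsewhere in this tutorial) to request an exact decomposition.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)
X_c = X_digits - X_digits.mean(axis=0)

# Scratch route: eigendecomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X_c, rowvar=False))
order = np.argsort(eigvals)[::-1][:2]
Z_scratch = X_c @ eigvecs[:, order]
ratio_scratch = eigvals[order] / eigvals.sum()

# Library route; 'full' avoids approximate solvers for this comparison.
pca_check = PCA(n_components=2, svd_solver='full')
Z_sklearn = pca_check.fit_transform(X_digits)

# Columns may be flipped in sign, so compare magnitudes.
print('Projections agree up to sign:',
      np.allclose(np.abs(Z_scratch), np.abs(Z_sklearn), atol=1e-6))
print('Variance ratios agree:',
      np.allclose(ratio_scratch, pca_check.explained_variance_ratio_))
```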

In [None]:
pca = PCA(n_components=2)
sk_reduced = pca.fit_transform(X)
print('Explained variance ratio (scikit-learn):', pca.explained_variance_ratio_)


In [None]:
plt.scatter(sk_reduced[:,0], sk_reduced[:,1], c=y, cmap='tab10', s=15)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Digits')
plt.tight_layout()
plt.show()


This concludes the brief tour of unsupervised learning examples with and without scikit-learn. Feel free to experiment with other datasets or algorithms.