# Intro

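Clustering is an unsupervised learning task: it groups unlabeled data points by similarity. These notes focus on k-means, one of the most common clustering algorithms.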

> http://scikit-learn.org/stable/modules/clustering

# Steps

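- Determine how many clusters (k) to look for
- Initialize random centers (VERY IMPORTANT)
- Repeat (assign & optimize) until the centers stop moving:
    + **Assign**: Assign each point to its closest center (the boundary between two centers is the perpendicular bisector of the line connecting them)
    + **Optimize**: Move each center to the mean of its assigned points, which minimizes the total squared distance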
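
The steps above can be sketched in plain NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans`, its parameters, and the two synthetic blobs are assumptions made for this example, and the starting centers are simply drawn from the data points.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate the assign and optimize steps."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign: label each point with the index of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Optimize: move each center to the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in ([0, 0], [3, 3])])
centers, labels = kmeans(X, k=2)
print(centers)
```

Each pass can only lower (or keep) the total squared distance, which is why the loop always terminates, but also why it can stop at a local minimum.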
    

Example: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

# Evaluating

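Use an elbow plot to choose "k": plot the total within-cluster squared distance against the number of clusters and pick the point where adding more clusters stops helping much (diminishing returns).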

![](images/dim_returns.png)
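
The numbers behind an elbow plot can be computed as below. This is a sketch using scikit-learn's `KMeans` and its `inertia_` attribute (the sum of squared distances to the nearest center); the three synthetic blobs and the range of k values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs (illustrative data, not from the notebook).
X = np.vstack([
    rng.normal(c, 0.1, size=(30, 2)) for c in ([0, 4], [0, 2], [1, 3])
])

ks = range(1, 7)
# Fit k-means for each k and record the within-cluster sum of squares.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]
for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.2f}")
```

Plotting `inertias` against `ks` (for example with `plt.plot(list(ks), inertias, marker="o")`) gives the elbow curve; for this data the drop should flatten after k=3.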

# Issues

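- Initial centers matter: k-means is a hill-climbing (greedy local search) procedure, so the result depends on where the centers start
- Clusters are not always what we expect, because the algorithm can stop at a local minimum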

## Demonstration: Finding "Incorrect" Clusters

> Can I find a local minimum where the "obvious" groups are incorrectly identified?

In [None]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(27)

### Example 1

In [None]:
sigma_x = sigma_y = 0.1  # spread of each group
n = 10                   # points per group

# Two groups on the left, one on the right
for c_x, c_y in [[0, 4], [0, 2], [1, 3]]:
    a_x = np.random.normal(c_x, sigma_x, n)
    a_y = np.random.normal(c_y, sigma_y, n)

    pts = np.column_stack((a_x, a_y))

    plt.scatter(pts[:, 0], pts[:, 1])
plt.show()

> Using **3 centroids**, find a local minimum where the two groups on the left are merged into one cluster and the group on the right is kept separate.
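
One way to reach such a local minimum is to hand-pick the starting centers. The sketch below uses scikit-learn's `KMeans` with an explicit `init` array and `n_init=1`; the data is regenerated with its own random generator, and the starting positions in `bad_init` are an assumption chosen for illustration (one center between the left groups, two on points of the right group).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(27)
# Same layout as Example 1: two groups on the left, one on the right.
groups = [rng.normal(c, 0.1, size=(10, 2)) for c in ([0, 4], [0, 2], [1, 3])]
X = np.vstack(groups)

# Hand-picked start (an assumption for illustration): one center between
# the two left groups, two centers on points of the right group.
bad_init = np.array([[0.0, 3.0], groups[2][0], groups[2][1]])
stuck = KMeans(n_clusters=3, init=bad_init, n_init=1).fit(X)

# Reference: default k-means++ initialization with several restarts.
best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The first 20 rows of X are the two left groups.
print("labels of the left groups:", set(stuck.labels_[:20].tolist()))
print(f"inertia stuck={stuck.inertia_:.2f}, best={best.inertia_:.2f}")
```

The center that starts between the left groups is closer to all of their points than either right-hand center, so no point ever changes cluster and the run converges with a much higher inertia than the reference fit.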

### Example 2

In [None]:
sigma_x = 0.1  # narrow in x
sigma_y = 0.5  # wide in y
n = 15         # points per group

# Two tall, narrow groups side by side
centers = [[0, 2], [1, 2]]

for c_x, c_y in centers:
    a_x = np.random.normal(c_x, sigma_x, n)
    a_y = np.random.normal(c_y, sigma_y, n)

    pts = np.column_stack((a_x, a_y))

    plt.scatter(pts[:, 0], pts[:, 1])
plt.show()


> Using **2 centroids**, find a local minimum where the points are not split into the group on the left and the group on the right (for example, a top/bottom split).
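
Whether such a minimum exists for this data is left open here. One way to explore the question is to run k-means from many single random starts and count how often the left/right split comes back. The sketch below assumes scikit-learn's `KMeans` and `adjusted_rand_score`; the number of runs and the regenerated data are illustrative choices, and no particular outcome is claimed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(27)
n = 15
# Same layout as Example 2: two tall, narrow groups side by side.
X = np.vstack([
    np.column_stack((rng.normal(c_x, 0.1, n), rng.normal(c_y, 0.5, n)))
    for c_x, c_y in [[0, 2], [1, 2]]
])
true_labels = np.repeat([0, 1], n)  # 0 = left group, 1 = right group

n_runs = 50
scores = []
for seed in range(n_runs):
    # One random start per run, so each run can end in a different minimum.
    km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
    scores.append(adjusted_rand_score(true_labels, km.labels_))

# A score of 1 means the run reproduced the left/right grouping exactly.
n_recovered = sum(score > 0.999 for score in scores)
print(f"{n_recovered} of {n_runs} runs recovered the left/right split")
```

Runs with a score well below 1 are the candidates to inspect: plotting their labels shows which alternative split the algorithm settled on.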

## In Summary

> K-means assigns each point to its nearest center, so the boundaries between clusters are straight lines → it looks for compact, roughly "circular" groups of data

## Datasets

A clustering algorithm's performance depends on the structure and distribution of the data it is given:

- Random (no structure)
- Crescents
- Concentric circles
- Clusters
- Clusters with varying density
- Elongated (line-shaped) densities

<img src='https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_0011.png'/>
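
A small experiment along the same lines is sketched below for three of these shapes, using scikit-learn's `make_blobs`, `make_moons`, and `make_circles` generators. The sample sizes, noise levels, and blob centers are assumptions chosen for illustration; the adjusted Rand index (ARI) compares the k-means labels with the generating labels, where 1 is a perfect match and values near 0 are no better than chance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_circles, make_moons
from sklearn.metrics import adjusted_rand_score

# Each generator returns (points, true labels).
datasets = {
    "clusters": make_blobs(n_samples=300, centers=[[-5, 0], [0, 5], [5, 0]],
                           cluster_std=0.5, random_state=0),
    "crescents": make_moons(n_samples=300, noise=0.05, random_state=0),
    "concentric circles": make_circles(n_samples=300, factor=0.3, noise=0.05,
                                       random_state=0),
}

scores = {}
for name, (X, y) in datasets.items():
    k = len(np.unique(y))  # use the true number of groups
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[name] = adjusted_rand_score(y, labels)
    print(f"{name}: ARI = {scores[name]:.2f}")
```

K-means should recover the compact clusters almost perfectly but cut the concentric circles in half with a straight line, since no pair of centers can separate an inner ring from an outer one.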