# Unsupervised Learning

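Unsupervised learning works without labels. This notebook covers its two most common tasks: clustering, which groups similar samples, and dimensionality reduction, which maps samples to fewer features while preserving as much structure as possible.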

### Clustering and Dimensionality Reduction
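k-Means minimizes the within-cluster sum of squared distances $$\sum_i \|x_i - c_{z_i}\|^2$$ where $x_i$ is the $i$-th sample, $z_i$ is the index of the cluster it is assigned to, and $c_{z_i}$ is that cluster's centroid.

PCA seeks orthogonal directions of maximal variance in the data and projects the samples onto the leading ones.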
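
Both statements can be checked numerically. The sketch below is illustrative only: it uses synthetic data from `make_blobs` rather than the notebook's `moons.csv`, and the names `X_demo`, `km_demo`, and `pca_demo` are hypothetical. It recomputes the k-Means objective from the fitted centroids and compares it with scikit-learn's `inertia_`, then compares PCA's `explained_variance_` with the largest eigenvalues of the sample covariance matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in data (an assumption; this is not the notebook's moons.csv)
X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=0)

# --- k-Means objective ---
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
# c_{z_i}: look up the centroid assigned to each sample
assigned = km_demo.cluster_centers_[km_demo.labels_]
# Sum over all samples of the squared distance to the assigned centroid
objective = np.sum((X_demo - assigned) ** 2)
print(np.isclose(objective, km_demo.inertia_, rtol=1e-4))

# --- PCA directions ---
pca_demo = PCA(n_components=2).fit(X_demo)
# Eigenvalues of the sample covariance matrix, sorted largest first
eigvals = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
# The variance along each principal direction should match the top eigenvalues
print(np.allclose(pca_demo.explained_variance_, eigvals[:2]))
```

Both comparisons are expected to agree up to floating-point error, since `inertia_` is the objective above evaluated at the fitted centroids and the principal directions are eigenvectors of the covariance matrix.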

### Advantages and Disadvantages
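**k-Means**: simple and fast, but implicitly assumes roughly spherical clusters of similar size and requires choosing the number of clusters $k$ in advance.
**DBSCAN**: handles arbitrarily shaped clusters and marks outliers as noise, but is sensitive to its `eps` and `min_samples` parameters.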
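
The trade-off above can be seen on a two-moons dataset. This is a minimal sketch under stated assumptions: the data comes from `make_moons` rather than the notebook's CSV, the DBSCAN values `eps=0.2` and `min_samples=5` are hand-picked for this noise level, and the adjusted Rand index (ARI) is used here only as a convenient way to compare predicted clusters with the known moon membership.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a classic non-spherical cluster shape
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-Means splits the plane with a straight boundary between two centroids
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# DBSCAN grows clusters through dense neighbourhoods (eps, min_samples are assumptions)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# ARI is 1.0 for a perfect match with the true grouping and near 0 for a random one
ari_km = adjusted_rand_score(y_moons, km_labels)
ari_db = adjusted_rand_score(y_moons, db_labels)
print(f"k-Means ARI: {ari_km:.2f}")
print(f"DBSCAN  ARI: {ari_db:.2f}")
```

Changing `eps` by a modest amount can merge the two moons or fragment them into noise, which is the parameter sensitivity noted above.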

In [1]:
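import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Load the features; the 'Label' column is held out and not used for fitting
X = pd.read_csv('../data/moons.csv')
y = X.pop('Label')

# Fix the seed so the k-Means initialisation (and its labels) are reproducible
km = KMeans(n_clusters=2, random_state=0).fit(X)

# Show the first three samples in PCA coordinates
PCA(n_components=2).fit_transform(X)[:3]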

array([[ 0.48614471, -0.45739385],
       [ 0.24720623,  0.74301456],
       [ 0.79162163, -0.49702093]])

### Evaluation with Silhouette Score
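The silhouette coefficient for a single sample is:
$$s = \frac{b-a}{\max(a,b)}$$
where $a$ is the mean distance from the sample to the other members of its own cluster and $b$ is the mean distance to the members of the nearest other cluster. It ranges from $-1$ to $1$: values near $1$ indicate a sample that sits well inside its cluster, values near $0$ a sample on the boundary between two clusters, and negative values a sample that is probably assigned to the wrong cluster. `silhouette_score` reports the mean of $s$ over all samples.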
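
To make the formula concrete, the sketch below computes $a$, $b$, and $s$ by hand and compares the result with scikit-learn. It is illustrative only: the data is synthetic (`make_blobs`), and `silhouette_of` is a hypothetical helper written for this example, not a library function. It assumes Euclidean distance, which is scikit-learn's default metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (an assumption; not the notebook's moons.csv)
X_demo, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

def silhouette_of(i, X, labels):
    """Silhouette coefficient of sample i (hypothetical helper)."""
    dists = np.linalg.norm(X - X[i], axis=1)  # distances from sample i to all samples
    own = labels == labels[i]
    n_own = own.sum()
    if n_own == 1:
        return 0.0  # convention for a single-member cluster
    # a: mean distance to the *other* members of the same cluster
    a = dists[own].sum() / (n_own - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, X_demo, labels) for i in range(len(X_demo))])

# Per-sample values should match silhouette_samples; their mean is silhouette_score
print(np.allclose(manual, silhouette_samples(X_demo, labels)))
print(np.isclose(manual.mean(), silhouette_score(X_demo, labels)))
```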

In [ ]:
from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples, using the k-Means labels
silhouette_score(X, km.labels_)

### Exercises & Further Reading
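1. Vary `n_clusters` in k-Means and compare the resulting silhouette scores.
2. Cluster the same data with DBSCAN and compare its labels with those from k-Means.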
3. [sklearn clustering](https://scikit-learn.org/stable/modules/clustering.html)