# Clustering
This is an *unsupervised method*: you ask the computer to create groups without giving it labels. It is useful in recommendation systems, in cohort grouping, and for deriving labels and features for supervised learning.

In [None]:
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from sklearn import cluster, datasets, metrics
from yellowbrick.cluster.elbow import KElbowVisualizer
from yellowbrick.cluster.silhouette import SilhouetteVisualizer

## K-Means Clustering

Process:

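* Choose the number of clusters (K)
* Randomly pick K observations as the initial *centroids*
* Assign each remaining observation to the centroid it is closest to
* Compute the new centroid (the mean) of each cluster
* Reassign observations to the nearest new centroid, and repeat until assignments stop changing or a maximum number of iterations is reached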
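
The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the synthetic `toy` points are all made up for this example.

```python
import numpy as np


def kmeans_sketch(points, n_clusters, n_iters=100, seed=0):
    """Hypothetical helper: a bare-bones k-means loop for illustration only."""
    rng = np.random.default_rng(seed)
    # Pick K distinct observations as the starting centroids.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[start]

    def assign(cents):
        # Distance from every observation to every centroid: shape (n, K).
        dists = np.linalg.norm(points[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(centroids)
        # Move each centroid to the mean of its assigned observations;
        # keep the old centroid if a cluster ended up empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stopped moving
        centroids = new_centroids
    # Final assignment against the final centroids.
    return assign(centroids), centroids


# Two well-separated blobs of synthetic points (invented for this sketch).
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                 rng.normal(5, 0.5, size=(20, 2))])
labels, centroids = kmeans_sketch(toy, n_clusters=2)
print(labels)
print(centroids)
```

In practice `cluster.KMeans` adds better initialization (k-means++) and multiple restarts, which this sketch leaves out.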

In [None]:
iris = datasets.load_iris()
print(iris.DESCR)

In [None]:
dir(iris)

In [None]:
iris.target_names

In [None]:
target = pd.Series(iris.target)
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = target
# Leave the label column out of X so the clustering never sees it
X = iris_df.drop(columns='target')

In [None]:
iris_df

In [None]:
#Actuals
X.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)',
          c=target, cmap='plasma')

In [None]:
# Plot the clusters after 1, 3, 10, and 100 iterations
axs = []
X_2d = X[['sepal length (cm)', 'sepal width (cm)']]
for i in [1, 3, 10, 100]:
    # init='random' makes the convergence visible;
    # the default init (k-means++) gives better starting points
    k = cluster.KMeans(n_clusters=3, max_iter=i, n_init=1, init='random', 
                       random_state=420)
    pred = k.fit_predict(X_2d)
    axs.append(X.plot(title=f'iter {i}', kind='scatter', 
                      x='sepal length (cm)',
                      y='sepal width (cm)',
                      c=pred, cmap='plasma'))
    centroids = k.cluster_centers_
    axs.append(plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='r', zorder=10))

## Exercise: Run K-Means

The (wheat) seed dataset has a feature-engineered column, compactness:
\begin{align}
C = \frac{4 \pi \cdot \text{area}}{\text{perimeter}^2}
\end{align}
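
As a quick sanity check on the compactness formula, a circle should score exactly 1 and any less round shape should score lower. The helper name `compactness` and the two shapes below are assumptions made for this illustration; they are not part of the dataset.

```python
import math


def compactness(area, perimeter):
    """Hypothetical helper: C = 4 * pi * area / perimeter ** 2."""
    return 4 * math.pi * area / perimeter ** 2


# A circle of radius 2: area = pi * r^2, perimeter = 2 * pi * r.
r = 2.0
circle = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A unit square: area = 1, perimeter = 4, so C = pi / 4.
square = compactness(1.0, 4.0)

print(circle, square)
```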

* Run K-means with 3 clusters on this data set. (Ignore variety)
* Scatter plot the result.

The file is at ``../data/seeds_dataset.txt``

Use the ``sep`` parameter and split on whitespace (might need a regex)!


It has the following fields:

1. area A
2. perimeter P
3. compactness C = 4*pi*A/P^2
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove
8. variety (Kama, Rosa, Canadian)

https://archive.ics.uci.edu/ml/datasets/seeds


## How Many Clusters?
We can calculate the *WCSS* (within-cluster sum of squares) for a range of values of K. This value starts off large and goes to 0 as K approaches the number of observations.
At some point there might be an "elbow", where adding more clusters stops reducing the WCSS by much. The K around that elbow is a reasonable minimum number of clusters.
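
To make the WCSS concrete, the sketch below computes it by hand for one value of K and compares it with the `inertia_` attribute that `KMeans` reports. The variable names (`data`, `km`, `assigned`, `wcss`) and the choice of K=3 on the iris features are assumptions for illustration.

```python
import numpy as np
from sklearn import cluster, datasets

data = datasets.load_iris().data
km = cluster.KMeans(n_clusters=3, n_init=10, random_state=42).fit(data)

# Look up, for every observation, the centroid of the cluster it was assigned to.
assigned = km.cluster_centers_[km.labels_]

# WCSS: sum of squared distances from each observation to its own centroid.
wcss = ((data - assigned) ** 2).sum()

print(wcss, km.inertia_)
```

The two numbers should agree up to floating-point error, which is why the elbow plot can read `inertia_` directly.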

In [None]:
def plot_elbow(data, k_candidates, fig_opts=None):
    """Plot WCSS (KMeans inertia) against each candidate K."""
    inertias = []
    for k in k_candidates:
        kmeans = cluster.KMeans(n_clusters=k, random_state=42)
        kmeans.fit(data)  # fit on the argument, not the global X
        inertias.append(kmeans.inertia_)
    fig_opts = fig_opts or {}
    plt.figure(**fig_opts)
    plt.plot(k_candidates, inertias)
    plt.xlabel('K')
    plt.ylabel('WCSS (inertia)')
    plt.title('Elbow Plot')

plot_elbow(X, range(2, 20))

In [None]:
cluster.KMeans

In [None]:
viz = KElbowVisualizer(cluster.KMeans(random_state=42), k=(2,20), metric='silhouette')
viz.fit(X)
viz.show()  # show() replaces poof(), which is deprecated as of yellowbrick 1.0

In [None]:
# Create 5 clusters - using all dimensions
# Plot those results into 2 dimensions (sepal length/width)
k = cluster.KMeans(n_clusters=5, random_state=50)
pred = k.fit_predict(X)
X.plot(title='K-means, 5 clusters', kind='scatter', x='sepal length (cm)',
       y='sepal width (cm)',
       c=pred, cmap='plasma')
# The first two centroid coordinates are sepal length and sepal width
centroids = k.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='r', zorder=10)

## Exercise: Elbow Curve
* Run an elbow curve on the seed data. Is there an elbow?

## Cluster size with Silhouette Analysis

http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

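Computes a score for each observation and plots the scores grouped by cluster. The score is the normalized difference between the mean distance to the other observations in the same cluster (*a*) and the mean distance to the observations in the nearest other cluster (*b*): (b - a) / max(a, b). A score near -1 means the observation is probably in the wrong cluster, a score near 0 means it sits on the boundary between two clusters, and a score near 1 means it is well matched to its cluster.

The red line is the average silhouette score. Clusters falling below the average may indicate poor clustering.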
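
The same numbers can be obtained without a plot from `sklearn.metrics`. The sketch below is one way to do it, assuming K=3 on the iris features; the variable names are made up for this example.

```python
import numpy as np
from sklearn import cluster, datasets, metrics

data = datasets.load_iris().data
labels = cluster.KMeans(n_clusters=3, n_init=10,
                        random_state=42).fit_predict(data)

# One silhouette value per observation, each in the range [-1, 1].
per_sample = metrics.silhouette_samples(data, labels)

# The overall score is the mean of the per-observation values.
average = metrics.silhouette_score(data, labels)

# Mean score inside each cluster, to compare against the overall average.
per_cluster = {int(c): float(per_sample[labels == c].mean())
               for c in np.unique(labels)}

print(average)
print(per_cluster)
```

A cluster whose mean sits well below `average` is the numeric counterpart of a cluster that falls short of the red line in the plot.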

In [None]:
for i in range(2, 10):
    sviz = SilhouetteVisualizer(cluster.KMeans(
        n_clusters=i, random_state=42))
    sviz.fit(X)
    sviz.show()  # show() replaces poof(), which is deprecated as of yellowbrick 1.0

## Exercise: Silhouette Analysis

* Run Silhouette Analysis on the seed data

## Hierarchical Clustering
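There are two types, *agglomerative* and *divisive*. We are going to look at agglomerative clustering, which starts by treating each observation as its own cluster. Using some distance metric, it joins the closest pair of clusters and repeats until only one cluster remains. This can be slow! The naive algorithm has time complexity
\begin{align}
O(n^3)
\end{align}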

The merges are tracked in a *dendrogram*. The vertical axis of the dendrogram shows the distance at which clusters were joined. The taller the line, the less similar the clusters are (*dissimilarity*). One method for choosing the number of clusters is to imagine every horizontal line extended across the whole plot, find the tallest vertical line that none of them crosses, and draw a horizontal cutoff through it. The number of vertical lines the cutoff intersects is the number of clusters.
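
One way to apply the cutoff rule in code is to look for the largest gap between successive merge heights and cut in the middle of it with `hierarchy.fcluster`. This is only one reading of the rule, shown on the iris features as an assumption; the variable names are made up for this sketch.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn import datasets

data = datasets.load_iris().data
Z = hierarchy.linkage(data, method='ward')

# Column 2 of the linkage matrix holds the height of each merge.
# With Ward linkage these heights never decrease from one merge to the next.
heights = Z[:, 2]

# The largest jump between successive merges is the tallest uncrossed span.
gaps = np.diff(heights)
idx = int(gaps.argmax())
cutoff = (heights[idx] + heights[idx + 1]) / 2

# Keep only the merges that happened below the cutoff.
labels = hierarchy.fcluster(Z, t=cutoff, criterion='distance')
n_clusters = len(np.unique(labels))

print(cutoff, n_clusters)
```

To request a fixed number of clusters instead, `hierarchy.fcluster(Z, t=3, criterion='maxclust')` cuts the same tree so that at most 3 clusters remain.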


In [None]:
# ward clustering minimizes the sum of the squares in the clusters (like k-means)

dend = hierarchy.dendrogram(hierarchy.linkage(X, method='ward'))

In [None]:
# leaf counts in brackets
fig = plt.figure(figsize=(14,10))
dend = hierarchy.dendrogram(
    hierarchy.linkage(X, method='ward'),
    truncate_mode='lastp',
    p=20,
    show_contracted=True) # shows density

In [None]:
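# Ward linkage always uses Euclidean distance, so no metric argument is needed
# (the affinity parameter was removed in recent scikit-learn releases)
hc = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')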
hc.fit_predict(X)

In [None]:
# Plot Actuals & compare to agg and kmeans
X.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)',
          c=target, cmap='plasma')

In [None]:
# Plot agglomerative clustering
pred = hc.fit_predict(X)
X.plot(title='Agg Cluster', kind='scatter',
       x='sepal length (cm)', y='sepal width (cm)',
       c=pred, cmap='plasma')

In [None]:
# vs k-means (random_state fixed so the plot is reproducible)
k2 = cluster.KMeans(n_clusters=3, random_state=42)
pred = k2.fit_predict(X)
X.plot(title='K-means Cluster', kind='scatter',
       x='sepal length (cm)', y='sepal width (cm)',
       c=pred, cmap='plasma')

## Exercise: Hierarchical Clustering 
* Plot a dendrogram for the seed data. Is there a logical cut point?