In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs,make_moons


#### Imports for Clustering

In [3]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

import scipy.cluster.hierarchy as clusterHierarchy
from sklearn.cluster import AgglomerativeClustering

from sklearn.cluster import DBSCAN

In [None]:
from scipy.cluster.hierarchy import dendrogram,linkage
from scipy.spatial.distance import pdist, squareform


 
## Clustering  

* Unsupervised learning method (i.e. no labels)
    - Data: $(x_1,x_2,...,x_n), \text{  }x \in R^p$
* Broad set of techniques for finding subgroups or clusters in the data
* Partition data into groups so that observations within a group are **similar** to each other
* Meaning of similar can depend on the domain
* Comparison to PCA
    - PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
    - Clustering looks for **homogeneous subgroups** among the observations.
* Clustering methods
    - Prototype-based Clustering
        - Centroid: average of similar points with continuous features
        - Medoid: most representative or most frequently occurring point for categorical features
        - Modeler specifies number of clusters
    - Hierarchical
        - Number of clusters does not need to be specified in advance
    - Density-based clustering
 

           
### K-Means clustering

* A prototype-based algorithm
* Easy to implement and computationally efficient compared to other clustering algorithms
* Observations in the same cluster will be 'similar' based on a distance metric
* The modeler must specify k, the number of clusters
* Assumes at least one item per cluster and that clusters do not overlap

#### How many clusters? 

In [None]:
iris = sns.load_dataset('iris')
fig,(ax1,ax2) = plt.subplots(1,2,figsize = (10,6))
sns.scatterplot(x='petal_length',y='petal_width',data=iris,ax = ax1)
sns.scatterplot(x='sepal_length',y='sepal_width',data=iris,ax = ax2)
plt.tight_layout()

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize = (10,6))
sns.scatterplot(x='petal_length',y='petal_width',hue='species',data=iris, ax = ax1)
sns.scatterplot(x='sepal_length',y='sepal_width',hue='species',data=iris, ax =ax2)
plt.tight_layout()

#### Similarity Measure

* For continuous features: squared Euclidean distance between x and y, $x,y \in R^m$

<div style="font-size: 115%;">
$$ d(\vec{x},\vec{y})^2 = \sum_{j=1}^{m}(x_j - y_j)^2 = ||\vec{x} - \vec{y}||_2^2$$
</div>

* The index j refers to the jth dimension (i.e. the jth feature) of the vectors $\vec{x}$ and $\vec{y}$
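
As a quick check of the formula above, here is a minimal sketch that computes the squared Euclidean distance for two made-up 3-feature vectors. The helper name `squared_euclidean` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def squared_euclidean(x, y):
    """Squared Euclidean distance between two feature vectors of equal length."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.dot(diff, diff))

# Made-up example vectors with m = 3 features
x_vec = np.array([1.0, 2.0, 3.0])
y_vec = np.array([4.0, 6.0, 3.0])

# (1-4)^2 + (2-6)^2 + (3-3)^2 = 9 + 16 + 0
d2 = squared_euclidean(x_vec, y_vec)
print(d2)  # 25.0
```

The same value can be obtained as `np.linalg.norm(x_vec - y_vec) ** 2`, which matches the norm notation on the right-hand side of the formula.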

#### K-Means as optimization

Let:  
i = sample index  
j = cluster index  
k = number of clusters  
n = number of samples
  
* Each observation $x^{(i)}$ belongs to one cluster j  

* Clusters are non-overlapping  

#### Cluster Inertia

* Cluster inertia: The within-cluster Sum of Squared Errors (SSE)  

<div style="font-size: 125%;">
$$SSE = \sum_{i=1}^{n}\sum_{j=1}^{k}w^{(i,j)}||x^{(i)} - \mu^{(j)}||_2^2$$
</div>

$\mu^{(j)}$ is the centroid for cluster j     
$w^{(i,j)}$ = 1 if sample $x^{(i)}$ is in cluster j and 0 otherwise  

* Inertia is a measure of how internally coherent clusters are.

* The K-Means algorithm is an iterative procedure for minimizing the within cluster SSE.

* Choice of starting points can change cluster outcome
    - Run algorithm many times with different randomly chosen starting points  

* Drawbacks:
    - Inertia makes the assumption that clusters are convex 
    - It responds poorly to elongated clusters, or irregular shapes.
    - Inertia is not a normalized metric: we only know that lower values are better and that zero is optimal.
        - In very high-dimensional spaces, Euclidean distances tend to become inflated (curse of dimensionality).
        - Reducing dimensionality (e.g. with PCA) before k-means clustering can alleviate this problem
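
To make the SSE formula concrete, the sketch below builds the indicator matrix $w^{(i,j)}$ explicitly and evaluates the double sum, then compares it with the `inertia_` attribute that sklearn reports. The variable names (`X_demo`, `km_demo`, `w`, `sq_dist`) and the small synthetic dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small synthetic dataset (illustrative only)
X_demo, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.75, random_state=0)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

n, k = X_demo.shape[0], km_demo.n_clusters

# w[i, j] = 1 if sample i is assigned to cluster j, 0 otherwise
w = np.zeros((n, k))
w[np.arange(n), km_demo.labels_] = 1

# Squared distance from every sample to every centroid, shape (n, k)
sq_dist = ((X_demo[:, None, :] - km_demo.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Double sum over samples i and clusters j
sse = (w * sq_dist).sum()
print(sse, km_demo.inertia_)
```

The two printed values should agree up to floating-point error, since each row of `w` selects exactly one squared distance per sample.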


#### K-means Clustering Algorithm
 
1. Randomly pick k centroids from the sample points as initial cluster centers    
2. Assign each sample to the nearest centroid $\mu^{(j)}$, $j \in \{1,...,k\}$  
3. Calculate the new centroids based on the assignments made in step 2. 
4. Repeat steps 2 and 3 until the cluster assignments do not change or a maximum number of iterations is reached.
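
The four steps above can be sketched in plain NumPy as follows. This is a minimal teaching sketch, not sklearn's implementation: the function name `kmeans_lloyd`, the toy data, and the choice to keep the old centroid when a cluster becomes empty are all assumptions.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the four steps listed above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick k distinct sample points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        # Step 4: stop when the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned samples
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two tight, well-separated groups of points (made-up data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_lloyd(X_toy, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points should end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.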
 


### sklearn K-Means

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

* K-Means relies on distances, so with real data the features should be scaled to comparable ranges before fitting
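
A small sketch of why scaling matters, using three hypothetical customers with income in dollars and a spending score on a 1-100 scale (the numbers are invented for illustration). Before scaling, the income column dominates every distance; after `StandardScaler`, both features contribute on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [income in dollars, spending score 1-100]
data = np.array([[30_000.0, 20.0],
                 [30_500.0, 90.0],    # similar income to row 0, very different spending
                 [60_000.0, 20.0]])   # same spending as row 0, very different income

def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return float(((a - b) ** 2).sum())

# Raw features: the income difference swamps the spending difference
raw_01 = sq_dist(data[0], data[1])
raw_02 = sq_dist(data[0], data[2])

# Standardized features: each column has mean 0 and unit variance
scaled = StandardScaler().fit_transform(data)
scaled_01 = sq_dist(scaled[0], scaled[1])
scaled_02 = sq_dist(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

In the raw data the second distance is thousands of times larger than the first; after standardizing, the two are of similar size.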

#### KMeans parameters
 
* n_clusters: Number of clusters
* n_init: number of times to run with different centroid seeds
* init: Initialization method
* max_iter: Maximum number of iterations for a single run.
    - KMeans will stop before max_iter if it converges (i.e. within-cluster SSE doesn't change within tolerance)
* tol:  controls the tolerance to changes in the within-cluster SSE to declare convergence.
    - Larger values will allow faster convergence

#### Generate data

In [None]:


X, y = make_blobs(n_samples=150,
                    n_features=2,
                      centers=3,
                      cluster_std=0.75,
                      shuffle=True,
                      random_state=0)
X.shape,y.shape

In [None]:
plt.scatter(X[:,0], X[:,1], c=y, marker='o', edgecolor='black', s=50)
plt.grid()

In [None]:
X[0:11,:]

In [None]:
y

#### Build Model

In [None]:
model =  KMeans(n_clusters=3,
                init='random',
                n_init=10,  # Run 10 times with different centroid seeds
                max_iter=300,
                tol=1e-04,
                random_state=42)

model.fit(X)

#### Predict new sample (-1,4)

In [None]:
yhat = model.predict(np.array([-1,4]).reshape(-1,2))
yhat.shape

####  KMeans output
 
* cluster_centers_: coordinates of cluster centers
* labels_: Assigned label for each point
* inertia_: Sum of squared distances of samples to their closest cluster center.  
* n_iter_: Number of iterations run

#### Display outputs

In [None]:
print(f'Cluster Centers \n {model.cluster_centers_}')
print(f'Model Labels \n {model.labels_}')
print("Model Inertia: ", model.inertia_)
print(f'Predicted Cluster {yhat[0]} for new point (-1,4)')


In [None]:
y

In [None]:
df = pd.DataFrame({'X1': X[:,0],'X2': X[:,1], 'y':model.labels_})

def plot_kmeans(model,df,yhat):
    sns.scatterplot(x = 'X1', y = 'X2',hue = 'y',data = df)
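    # Pass marker and color explicitly; combining a format string such as 'ro' with marker= defines the marker twice
    plt.plot(model.cluster_centers_[:,0],model.cluster_centers_[:,1], color = 'red', marker = '*', linestyle = 'None', markersize=10)
    plt.plot(-1,4, color = 'blue', marker = 'X', linestyle = 'None', markersize=10)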
    plt.title('Stars: Cluster Centers, Cross: Predicted sample')
    
plot_kmeans(model,df,yhat)

#### Issues with KMeans

* Random centroid initialization can produce very different results
* Modeler must choose number of clusters
* Makes assumptions about shape of clusters

### K-Means++

* An improved way to place initial cluster centroids

* In regular K-Means the initial centers are chosen randomly; the algorithm is run multiple times and the best-performing run is kept
    - Can result in bad centers or slow convergence
  
* To improve upon this Arthur and Vassilvitskii (2007) developed the K-Means++ centroid initialization algorithm
    - Centers should be as far from each other as possible

#### K-Means++ init algorithm

* Place centroids far from each other according to the following algorithm

From the original paper: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf  

Let X be array of sample observations.  

Let D(x) denote the shortest distance from a data point to the closest center we have already chosen

1a. Take one center $c_1$, chosen uniformly at random from X  
1b. Take a new center $c_i$, choosing x $\in$  X with probability 

$$P(c_i = x) = \frac{D(x)^2}{\sum_{x' \in X}D(x')^2}$$

1c. Repeat Step 1b. until we have taken k centers altogether.

In [None]:
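def init_kmeans(X, k):
    X = np.asarray(X, dtype=float)         # accept lists as well as arrays
    idx = np.random.randint(0,len(X))      # 1a. first center chosen uniformly at random
    C = [X[idx]]
    for _ in range(1, k):                  # loop variable no longer shadows k
        print('C: ', C)
        # D(x)^2: squared distance from each point to its closest chosen center
        D2 = np.array([min([np.dot(c-x,c-x) for c in C]) for x in X])
        probs = D2/D2.sum()  # P(c_i)
        print('Distance: ',D2)
        print('Probs: ', probs)
        # 1b. Sample an index (not a value) according to P(c_i), so X may hold multi-feature points
        idx = np.random.choice(len(X),p=probs)
        C.append(X[idx])
    return C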

In [None]:
np.random.seed(42)
x = [0,1,2,3,4]
k = 3
init_kmeans(x,k)

#### K-Means++ in sklearn


* init parameter = 'k-means++'
    - this is the default


In [None]:
model =  KMeans(n_clusters=3, n_init=10, max_iter=300, tol=1e-04, random_state=42)
model.fit(X)
yhat = model.predict(np.array([-1,4]).reshape(-1,2))

In [None]:
print(f'Cluster Centers \n {model.cluster_centers_}')
print(f'Model Labels \n {model.labels_}')
print("Model Inertia: ", model.inertia_)
print(f'Predicted Cluster {yhat[0]} for new point (-1,4)')

In [None]:
df = pd.DataFrame({'X1': X[:,0],'X2': X[:,1], 'y':model.labels_})
plot_kmeans(model,df,yhat)

### Elbow Method to find best k

In [None]:
def Elbow(X):
    SSEs = []
    for i in range(1, 11):
        model = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=42)
        model.fit(X)
        SSEs.append(model.inertia_)

    plt.plot(range(1,11), SSEs, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('SSEs')

In [None]:
Elbow(X)

In [None]:
X, y = make_blobs(n_samples=150,
                    n_features=2,
                      centers=5,
                      cluster_std= 1.5,
                      shuffle=True,
                      random_state=0)
X.shape,y.shape

In [None]:
plt.scatter(X[:,0], X[:,1],c=y,marker='o', edgecolor='black',s=50)
plt.grid()

In [None]:
Elbow(X)

### Cluster Evaluation

#### Silhouette Analysis

* Quantify the quality of the clustering performed

* **silhouette coefficient**: a measure of how tightly grouped the samples in the cluster are.

#### Algorithm

1. Calculate **cluster cohesion**: $a^{(i)}$ as the average distance between a sample $x^{(i)}$ and all other samples in the same cluster

2. Calculate **cluster separation**: $b^{(i)}$ from the next closest cluster for a sample $x^{(i)}$ as the average distance between the sample and all samples in that cluster.

3. Calculate the silhouette $s^{(i)}$ as the difference between cluster separation and cluster cohesion, divided by the greater of the two.

<div style="font-size: 125%;">
$$ s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max(b^{(i)},a^{(i)})}$$
</div>

* $-1 \le s^{(i)} \le 1$

* $b^{(i)}$: quantifies how dissimilar a sample is to other clusters

* $a^{(i)}$: quantifies within cluster similarity

#### Interpretation

* The best value is 1 and the worst value is -1. 
* Values near 0 indicate overlapping clusters. 
* Negative values generally indicate that a sample has been assigned to the wrong cluster (i.e. a different cluster is more similar).
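
The definition above can be checked by hand on a tiny example. The sketch below computes $a^{(i)}$, $b^{(i)}$ and $s^{(i)}$ directly and compares the result with sklearn's `silhouette_samples`. The helper `silhouette_one`, the toy data, and the given labels are assumptions for illustration; the helper assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data: two clusters on a line, with labels assumed to be given
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

def silhouette_one(X, labels, i):
    """Silhouette coefficient of sample i, computed from the definition."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                       # exclude the sample itself
    a = dists[same].mean()                # cohesion: mean distance within own cluster
    b = min(dists[labels == c].mean()     # separation: mean distance to nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Sample 0: a = mean(1, 2) = 1.5, b = mean(10, 11, 12) = 11, s = 9.5 / 11
s0 = silhouette_one(X_toy, labels_toy, 0)
manual = np.array([silhouette_one(X_toy, labels_toy, i) for i in range(len(X_toy))])
print(s0)
print(np.allclose(manual, silhouette_samples(X_toy, labels_toy)))
```

All values are close to 1 here because the two groups are compact and far apart.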

#### sklearn silhouette_samples

* Calculates silhouette coefficient for each data sample

In [None]:

model =  KMeans(n_clusters=5, n_init=10, max_iter=300, tol=1e-04, random_state=42)
model.fit(X)
yhat = model.predict(X)
silhouette_vals = silhouette_samples(X,yhat,metric='euclidean')
print(silhouette_vals.shape)
sil_mean = np.mean(silhouette_vals)
print(f'Silhouette mean: {np.round(sil_mean,4)}')

In [None]:
bad_vals = []
for i,v in enumerate(silhouette_vals):
    if v < 0: bad_vals.append((i,v)) 
bad_vals

#### KMeans with Mall Customers dataset

In [None]:
Cust = pd.read_csv('Mall_Customers.csv')
Cust.tail()

In [None]:
sns.scatterplot(x='Income',y='Spending',data=Cust);

In [None]:
X = Cust.iloc[:, [3, 4]].values

sc = StandardScaler()
X_s = sc.fit_transform(X)

Elbow(X_s)

In [None]:
k = 5
# Fit to the data
model = KMeans(n_clusters = k, n_init = 20, random_state = 42)
clusters = model.fit_predict(X_s)
print('Model inertia: ',model.inertia_)

In [None]:
Cust['Labels'] = model.labels_
Cust.head()

In [None]:
# Visualize the clusters
centers  = sc.inverse_transform(model.cluster_centers_)
fig,ax = plt.subplots(figsize = (12,6))
sns.scatterplot(x = 'Income', y = 'Spending',hue = 'Labels',data = Cust,ax = ax)
ax.scatter(centers[:, 0], centers[:, 1], s = 100, c = 'red', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Income')
plt.ylabel('Spending')
plt.legend();


In [None]:
vals = silhouette_samples(X_s,clusters,metric='euclidean')
print(f'Silhouette coefficient {np.round(np.mean(vals),3)}')

    
### Hierarchical Clustering

* No need to choose K ahead of time
* Bottom-up or agglomerative clustering
    - Build dendrogram from leaves to trunk
* Requires manual cutoff point
     
#### Hierarchical Clustering Algorithm:

Given a set of N items to be clustered and an NxN distance (or similarity) matrix  
1.  Start by assigning each observation to its own cluster, giving N clusters. The distances (or dissimilarities) between clusters are just the distances (or dissimilarities) between the items.  
2.  Find the closest (most similar) pair of clusters and merge them into a single cluster.  
3.  Compute distances (dissimilarities) between the new cluster and each of the old clusters  
4.  Repeat steps 2 and 3 until all items are clustered into a single cluster of size N  

Step 3, comparing two clusters, can be done in different ways; the choice is called the linkage.  
    
#### Linkage - how to compare two clusters
 
* Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities. (i.e. the ones farthest apart)
* Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. (i.e. the closest ones)
* Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
* Centroid: The distance between the centroids of the clusters
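
To compare the four linkage rules side by side, here is a minimal sketch on two made-up clusters. It uses `scipy.spatial.distance.cdist` to get all pairwise distances between members of cluster A and cluster B; the cluster coordinates and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small made-up clusters on the x-axis
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between members of A and members of B
pairwise = cdist(A, B)            # [[4, 6], [3, 5]]

complete_link = pairwise.max()    # farthest pair: 6
single_link = pairwise.min()      # closest pair: 3
average_link = pairwise.mean()    # mean of all pairs: 4.5
centroid_link = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # |0.5 - 5.0| = 4.5

print(complete_link, single_link, average_link, centroid_link)
```

Average and centroid linkage coincide here only because the points lie on a line with A entirely to the left of B; in general they differ.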
    
#### Choice of Dissimilarity Measure
 
* Euclidean distance
* Correlation
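
The choice of dissimilarity measure can change which observations count as similar. The sketch below compares Euclidean distance with scipy's `'correlation'` metric (1 minus the Pearson correlation) on three made-up profiles; the data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [10.0, 20.0, 30.0, 40.0],   # same shape as row 0, much larger magnitude
                     [4.0, 3.0, 2.0, 1.0]])      # similar magnitude to row 0, opposite trend

euc = squareform(pdist(profiles, metric='euclidean'))
cor = squareform(pdist(profiles, metric='correlation'))

# Euclidean: row 0 is closer to row 2 than to row 1
print(euc[0, 1], euc[0, 2])
# Correlation: row 0 is identical in shape to row 1 (distance 0) and opposite to row 2 (distance 2)
print(cor[0, 1], cor[0, 2])
```

Euclidean distance groups profiles by magnitude, while correlation-based distance groups them by shape, so the two measures rank the neighbors of row 0 in opposite order.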
    
#### Plot using Dendrogram

* Length of vertical bars is the distance measure
* Where to cut?
    - Heuristic: Extend every horizontal line across the plot, find the longest vertical bar that none of them crosses, and cut through it
    
![](dendrogram.png)
$$\text{Figure 1. Clustering with Dendrogram}$$
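
One way to apply the cutting heuristic programmatically is to look for the largest gap between consecutive merge heights in the linkage matrix and cut in the middle of that gap. This is a sketch of one reading of the heuristic, not the only way to choose a cut; the toy data and variable names are assumptions, and `fcluster` with `criterion='distance'` is used to extract the flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight, well-separated groups of points (made-up data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
Z = linkage(X_toy, method='complete', metric='euclidean')

# Column 2 of the linkage matrix holds the height of each merge
heights = Z[:, 2]

# Largest jump between consecutive merge heights, i.e. the longest uncrossed vertical stretch
gaps = np.diff(heights)
cut_height = heights[gaps.argmax()] + gaps.max() / 2

# Flat cluster labels (1-based) obtained by cutting the tree at cut_height
flat = fcluster(Z, t=cut_height, criterion='distance')
print(cut_height, flat)
```

On this data the cut falls between the within-group merges and the final merge, which yields two flat clusters.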
    

### Agglomerative Clustering with Complete Linkage

1. Compute the distance matrix of all samples.
2. Represent each data point as a singleton cluster.
3. Merge the two closest clusters based on the distance between the most dissimilar (distant) members.
4. Update the distance matrix.
5. Repeat steps 3-4 until one single cluster remains.

#### Generate some data

In [None]:
np.random.seed(123)
variables = ['X', 'Y', 'Z']
labels = ['A','E','D','B','C']
X = np.random.random_sample([5,3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df

#### scipy distance metric

* calculates specified distance metric between pairwise samples in a data frame

In [None]:

d = pdist(df, metric='euclidean')
d


#### scipy squareform

* converts scipy pdist output to a symmetric square matrix

In [None]:
s = squareform(d)
row_dist = pd.DataFrame(s,columns=labels, index=labels)
row_dist

#### Apply complete linkage agglomeration

In [None]:
row_clusters = linkage(d,method='complete') # from pdist
row_clusters

In [None]:
row_clusters = linkage(df.values,method='complete') # data frame input
row_clusters

In [None]:
df = pd.DataFrame(row_clusters,
    columns=['row label 1','row label 2','distance','no. of items in clust.'],
    index=['cluster %d' %(i+1) for i in range(row_clusters.shape[0])])

df


#### dendrogram

In [None]:

row_dendr = dendrogram(row_clusters,labels=labels)
plt.ylabel('Euclidean distance')
plt.tight_layout()
plt.title("Dendrogram");


#### Hierarchical Clustering in sklearn

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

In [None]:
# Hierarchical Clustering
Cust = pd.read_csv('Mall_Customers.csv')
X = Cust.iloc[:, [3, 4]].values

# Use the dendrogram to find the optimal number of clusters
model = clusterHierarchy.linkage(X, method = 'complete',metric = 'euclidean')

In [None]:
fig,ax = plt.subplots(figsize = (10,12))
dgram = clusterHierarchy.dendrogram(model)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.tight_layout()

In [None]:
cut = 5
# Fit to the data
# Euclidean distance is the default; the 'affinity' keyword is no longer accepted by newer scikit-learn releases
m = AgglomerativeClustering(n_clusters = cut, linkage = 'complete')
clusters = m.fit_predict(X)

In [None]:
Cust['Clusters'] = clusters

# Visualize the clusters
fig,ax = plt.subplots(figsize = (12,6))
sns.scatterplot(x = 'Income', y = 'Spending',hue = 'Clusters',data = Cust,ax = ax)
plt.title('Clusters of customers')
plt.xlabel('Income')
plt.ylabel('Spending')
plt.legend();


### Density-based Spatial Clustering of Applications with Noise (DBSCAN)

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

* Assigns cluster labels based on dense regions of points
* Doesn't assume spherical clusters (as k-means does)
* Doesn't partition the dataset into hierarchies that require a manual cut-off point
* **Density**: the number of points within a specified radius $\epsilon$

#### Labeling

* Assign to each sample (point) using the following criteria:  
    - A point is a **core point** if at least MinPts (e.g. 3) neighboring points fall within the specified radius $\epsilon$  
    - A **border point** is a point that has fewer neighbors than MinPts within $\epsilon$ , but lies within the $\epsilon$ radius of a core point  
    - All other points that are neither core nor border points are considered **noise points**

![](DBSCAN.png)
$$\text{Figure 2. Density Based Clustering}$$

#### DBSCAN Algorithm

1. Label the points
2. Form a separate cluster for each core point or connected group of core points (i.e. core points are connected if they are no farther away than $\epsilon$  ).
3. Assign each border point to the cluster of its corresponding core point.
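
The labeling rules can be inspected on a tiny example using sklearn's `DBSCAN`. The sketch below recovers core, border, and noise points from the fitted model's `core_sample_indices_` and `labels_` attributes (noise is labeled -1). The toy layout, `eps`, and `min_samples` values are assumptions; note that sklearn's `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up layout: a dense run of points, looser points at each end, and one far outlier
X_toy = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.5, 0.0],
                  [2.2, 0.0],     # within eps of (1.5, 0) only
                  [10.0, 0.0]])   # far from everything
db_toy = DBSCAN(eps=0.8, min_samples=3).fit(X_toy)

# Core points are reported directly by the fitted model
core_mask = np.zeros(len(X_toy), dtype=bool)
core_mask[db_toy.core_sample_indices_] = True

# Noise points receive the label -1; border points are clustered but not core
noise_mask = db_toy.labels_ == -1
kinds = np.where(core_mask, 'core', np.where(noise_mask, 'noise', 'border'))

print(db_toy.labels_)
print(kinds)
```

The two end points of the run have only one neighbor within `eps`, so they are border points attached to the cluster formed by the three connected core points.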

### Compare KMeans, Hierarchical and DBSCAN Clustering

#### Generate non-linear data

In [None]:
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)
plt.scatter(X[:,0], X[:,1]);

#### KMeans and Hierarchical (Agglomerative)

In [None]:
km = KMeans(n_clusters=2,random_state=0)
y_km = km.fit_predict(X)
# Euclidean distance is the default; the 'affinity' keyword is no longer accepted by newer scikit-learn releases
ac = AgglomerativeClustering(n_clusters=2,linkage='complete')
y_ac = ac.fit_predict(X)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.scatter(X[y_km==0,0],X[y_km==0,1],c='lightblue',edgecolor='black',marker='o',s=40, label='cluster 1')
ax1.scatter(X[y_km==1,0],X[y_km==1,1],c='red',edgecolor='black',marker='s',s=40,label='cluster 2')
ax1.set_title('K-means clustering')

ax2.scatter(X[y_ac==0,0], X[y_ac==0,1],c='lightblue',edgecolor='black',marker='o',s=40,label='cluster 1')
ax2.scatter(X[y_ac==1,0],X[y_ac==1,1],c='red',edgecolor='black',marker='s',s=40,label='cluster 2')
ax2.set_title('Agglomerative clustering')
plt.legend();


#### DBSCAN

In [None]:
db = DBSCAN(eps=0.2,min_samples=3,metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db==0,0],X[y_db==0,1],c='lightblue',edgecolor='black',marker='o',s=40,label='cluster 1')
plt.scatter(X[y_db==1,0],X[y_db==1,1],c='red',edgecolor='black',marker='s',s=40,label='cluster 2')
plt.legend();

#### References

Raschka, Sebastian & Mirjalili, Vahid (2017). Python Machine Learning, 2nd Edition, Packt Publishing.

Source Figure 1: "An Introduction to Statistical Learning, with applications in R"  (Springer, 2013) with permission from the authors: G. James, D. Witten,  T. Hastie and R. Tibshirani 