---

## Working with Unlabeled Data: Clustering Analysis  


### Unit Convenor & Lecturer {-}
[George Milunovich](https://www.georgemilunovich.com)  
[george.milunovich@mq.edu.au](mailto:george.milunovich@mq.edu.au)

### References {-}

1. Python Machine Learning 3rd Edition by Raschka & Mirjalili - Chapter 11  
2. Various open-source material  

### Overview {-}

- Grouping objects by similarity using k-means  
  - K-means clustering using scikit-learn  
  - A smarter way of placing the initial cluster centroids using k-means++  
  - Using the elbow method and silhouette plots to find the optimal number of clusters  
- Organizing clusters as a hierarchical tree  
  - Grouping clusters in bottom-up fashion  
  - Performing hierarchical clustering on a distance matrix  
  - Attaching dendrograms to a heat map  
  - Applying agglomerative clustering via scikit-learn  
- Locating regions of high density via DBSCAN  

---

## Working with Unlabeled Data

Working with unlabeled data presents a unique set of challenges and opportunities, notably through clustering analysis.

- **Clustering** is a technique that organises data into clusters or groups based on similarity without prior knowledge of group assignments.
    - Similar to sorting different objects into distinct categories based solely on their features, without a predefined list of object types.
- The process involves algorithms like **K-means** or **hierarchical clustering** to detect patterns and relationships that aren't immediately apparent.

**Clustering is a type of unsupervised learning** 


- So far we have considered supervised learning techniques to build machine learning models where the **target variable is known**  
- In clustering:
  - We don't know what the target variable is  
  - We can find some commonalities/patterns in the data
  
**Objective: Find groupings in the data so that items in the same cluster are more similar to each other than to items from different clusters**

**Cluster Analysis in Business**

Clustering is particularly valuable in business applications. Some examples include the following:

- **Customer Segmentation**: Businesses use clustering to segment customers based on demographics, purchasing behaviors, preferences, and responsiveness to marketing. This helps in tailoring marketing strategies, improving customer engagement, and enhancing service delivery.
- **Market Segmentation**: Clustering helps identify different market segments, allowing companies to target specific clusters with customised products or services. This approach can increase market penetration and optimise investments.
- **Inventory Management**: Clustering algorithms can categorise inventory based on movement velocity, value, and other characteristics. This enables more efficient inventory control and optimisation of the supply chain.
- **Product Recommendation Systems**: E-commerce platforms use clustering to understand customer preferences and behavior, which helps in recommending products that customers are more likely to purchase.
- **Risk Management**: In insurance and finance, clustering helps in identifying risk profiles and categorising customers into risk groups, aiding in setting premiums and coverage policies.
- **Human Resources**: HR departments apply clustering to group employees based on skills, performance, and behavior. This can help in talent management, team formation, and identifying training needs.


---
## Grouping Objects by Similarity Using k-Means 


**k-Means** method  
- Category of **prototype-based clustering**  
- Each cluster is represented by a prototype:  
    - **Centroid** (average) in the case of continuous features  
    - **Medoid** (the most representative point) in the case of categorical features  
- We need to specify the number of clusters $k$ *a priori* (before the analysis)   
    - When dealing with more than 3 dimensions it becomes impossible to visualise data effectively  
    - This is a significant *limitation* of the method as it can be difficult to choose the right $k$
    - Inappropriate choice of $k$ can result in a poor clustering performance  
        - The elbow method and silhouette plots can be used to evaluate the quality of clustering and **determine $k$**  
    - One or more clusters can potentially be empty  
        - Although scikit-learn implements a solution for this  
- Clusters cannot be hierarchical (inside one another)       
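To make the two kinds of prototype listed above concrete, here is a minimal sketch on a tiny made-up array (the data and variable names are illustrative assumptions, not part of the lecture example). The centroid is the coordinate-wise mean and need not be an observed point; the medoid is taken here as the observed point with the smallest total Euclidean distance to all points in the cluster.

```python
import numpy as np

# Hypothetical toy cluster of four 2-d points (illustrative only)
points = np.array([[1.0, 1.0],
                   [2.0, 1.0],
                   [1.0, 3.0],
                   [6.0, 6.0]])

# Centroid: coordinate-wise mean of the cluster
centroid = points.mean(axis=0)

# Medoid: observed point minimising the sum of Euclidean distances to all points
diffs = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=2))
medoid = points[pairwise.sum(axis=1).argmin()]

print('Centroid:', centroid)  # [2.5  2.75]
print('Medoid:  ', medoid)    # [2. 1.]
```

Note how the outlying point (6, 6) pulls the centroid away from the bulk of the data, while the medoid remains an actual observation.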
    
**k-Means Algorithm**  
1. Randomly pick $k$ points as initial cluster centres $\mu^{(j)}, j\in\{1,2,\dots,k\}$   
2. Assign each example to the nearest centroid $\mu^{(j)}$    
3. Move each centroid/medoid to the centre of the examples that were assigned to it  
4. Repeat steps 2 and 3 until the centroids/medoids don't change and examples don't move between clusters  
    - Or until a user-defined tolerance or maximum number of iterations is reached (if there is no convergence)  
- Watch [https://youtu.be/RD0nNK51Fp8](https://youtu.be/RD0nNK51Fp8) for a good explanation of this algorithm  

<img src="images/11_17.png" alt="Drawing" style="width: 600px;"/>  
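The four steps of the algorithm can be sketched in plain NumPy as follows. This is only an illustration under simple assumptions (random initialisation from the data, Euclidean distance, a made-up toy dataset, and the hypothetical function name `kmeans_sketch`); it is not how scikit-learn implements `KMeans`.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means with random initialisation (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct examples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each example to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned examples
        # (an empty cluster keeps its previous centroid in this sketch)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have (almost) stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 2-d points
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
                   rng.normal(5, 0.3, size=(20, 2))])

labels_toy, centroids_toy = kmeans_sketch(X_toy, k=2)
print(labels_toy)
print(centroids_toy)
```

The `tol` and `max_iter` arguments play the role of the user-defined tolerance and maximum number of iterations mentioned in step 4.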


**Measuring Distance**  
- Measuring distance in one-dimensional data is easy: we use the absolute value of the difference  
    - E.g. what is the distance between 3 and -2?  
        - $|3 - (-2)|=|-2-3|=5$  
- However, what if we have multi-dimensional data? Consider the following 2-d case, where we have 2 variables:  
    - E.g. $\mathbf{x}=\left(\begin{array}{c} 
\text{Sale Amount}\\
\text{Customer Income}
\end{array}\right)$

  
- To measure the distance between two m-dimensional vectors $x$ and $y$ we typically use **squared Euclidean distance**  
    - $d(x, y)^2=\sum_{j=1}^m(x_j-y_j)^2=||x-y||_2^2$  
    - Note that in two dimensions, when $y$ is a vector of zeros (the origin), this is the square of the hypotenuse of a right-angled triangle whose other two sides have lengths $x_1$ and $x_2$   
        - E.g. If $y=[0 \quad 0], x=[3 \quad 4]$ then $d(x,y)^2=[3-0]^2 + [4-0]^2 = 25$  

  
- Using the squared Euclidean distance we can define the k-means algorithm as an optimization problem  
    - Minimize within-cluster sum of squared errors (SSE) $SSE=\sum_{i=1}^n\sum_{j=1}^kw^{(i,j)}||x^{(i)}-\mu^{(j)}||_2^2$ where  
        - $w^{(i,j)}=\left\{ 
\begin{array}{ll}
1 & \text{if }x^{(i)}\text{ is in cluster }j \\ 
0 & \text{otherwise}
\end{array}
\right.$
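As a quick numerical check of the two formulas above, the sketch below computes the squared Euclidean distance for the 3-4-5 triangle and the within-cluster SSE for a small made-up dataset. The toy points and the cluster assignment are assumptions chosen so the arithmetic is easy to verify by hand.

```python
import numpy as np

# Squared Euclidean distance between x = [3, 4] and the origin
x = np.array([3.0, 4.0])
y = np.array([0.0, 0.0])
sq_dist = np.sum((x - y) ** 2)
print(sq_dist)  # 25.0

# Within-cluster SSE for a hypothetical assignment of four points to two clusters
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])  # cluster index of each example
centroids = np.array([X_toy[labels == j].mean(axis=0) for j in range(2)])

# w^(i,j) = 1 only for the cluster that example i belongs to, so the double sum
# reduces to the squared distance between each example and its own centroid
sse = np.sum((X_toy - centroids[labels]) ** 2)
print(sse)  # 10.0
```

Here the centroids are (1, 0) and (10, 2), so the SSE is (1 + 1) + (4 + 4) = 10.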
  

In scikit-learn use `from sklearn.cluster import KMeans`  
    - [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)           

<hr style="width:35%;margin-left:0;"> 

<span style='background:orange'>  **Example: k-Means clustering**</span>    
    
1. Import `make_blobs` data from sklearn.datasets and plot it in a scatterplot
    - 150 randomly generated points in 2-d that are grouped into 3 regions ($k$) with higher density
2. Use `KMeans` from `sklearn` to classify data into 3 clusters
3. Plot the clusters and classified data in a scatterplot 

<hr style="width:35%;margin-left:0;"> 

1. Import `make_blobs` data from sklearn.datasets and plot it in a scatterplot
- 150 randomly generated points in 2-d that are grouped into 3 regions ($k$) with higher density

```
import warnings 
warnings.filterwarnings("ignore")


from sklearn.datasets import make_blobs # create dataset

import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=150, 
                  n_features=2, 
                  # n_features=3,
                  centers=3, 
                  cluster_std=0.5, 
                  shuffle=True, 
                  random_state=0)


fig = plt.figure()

ax = fig.add_subplot()
ax.scatter(X[:, 0], X[:, 1], c='steelblue', marker='o', edgecolor='black', s=20)

# ax = fig.add_subplot(projection='3d')
# ax.scatter(X[:, 0], X[:, 1], X[:, 2], c='steelblue', marker='o', edgecolor='black', s=20)

plt.grid()
plt.tight_layout()
#plt.savefig('images/11_01.png', dpi=300)
plt.show()
```



```
y

```

<hr style="width:35%;margin-left:0;"> 

2. Use `KMeans` from `sklearn` to classify data into 3 clusters 

```
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, 
            init='random',  # use standard k-means rather than k-means++ (see below)
            n_init=10,      # run 10 times with different random centroids to choose the final model with the lowest SSE
            max_iter=300,   # max number of iterations for each run
            random_state=0)

y_km = km.fit_predict(X)

print(y_km)
print(f'\nCluster centres:\n {km.cluster_centers_}')
```

<hr style="width:35%;margin-left:0;"> 

3. Plot the clusters and classified data in a scatterplot 

```
plt.scatter(X[y_km == 0, 0], X[y_km == 0, 1], s=50, c='lightgreen', marker='s', edgecolor='black', label='Cluster 1')
plt.scatter(X[y_km == 1, 0], X[y_km == 1, 1], s=50, c='orange', marker='o', edgecolor='black', label='Cluster 2')
plt.scatter(X[y_km == 2, 0], X[y_km == 2, 1], s=50, c='lightblue', marker='v', edgecolor='black', label='Cluster 3')
# plt.scatter(X[y_km == 3, 0], X[y_km == 3, 1], s=50, c='yellow', marker='*', edgecolor='black', label='Cluster 4')

plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=250, marker='*', c='red', edgecolor='black', label='Centroids')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_02.png', dpi=300)
plt.show()

```

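If the dataset is generated with three features (`n_features=3` in `make_blobs` above) and `KMeans` is refitted on it, the clusters can be plotted in 3-d as follows (the code below fails on the 2-d data because it reads a third column of `X`):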

```
fig = plt.figure()

ax = fig.add_subplot(projection='3d')

ax.scatter(X[y_km == 0, 0], X[y_km == 0, 1],X[y_km == 0, 2], s=50, c='lightgreen', marker='s', edgecolor='black', label='Cluster 1')
ax.scatter(X[y_km == 1, 0], X[y_km == 1, 1],X[y_km == 1, 2], s=50, c='orange', marker='o', edgecolor='black', label='Cluster 2')
ax.scatter(X[y_km == 2, 0], X[y_km == 2, 1],X[y_km == 2, 2],s=50, c='lightblue', marker='v', edgecolor='black', label='Cluster 3')

ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], km.cluster_centers_[:, 2], s=250, marker='*', c='red', edgecolor='black', label='Centroids')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_02.png', dpi=300)
plt.show()

```

---
### A Smarter Way of Placing the Initial Cluster Centroids Using k-Means++ {-}

- k-Means++ is an improved method to strategically select initial centroids (rather than randomly choose them as in k-Means)
    - Random selection can lead to bad clustering or slow convergence
- k-Means++ chooses initial centroids sequentially such that they have a high probability of being far away from each other (and so are unlikely to start in the same cluster)
    - See textbook for algorithm details if interested
- For a good explanation of k-Means++ see here [https://youtu.be/HatwtJSsj5Q](https://youtu.be/HatwtJSsj5Q)
- Implement in scikit-learn by setting `init='k-means++'` in the `KMeans` object (this is the default setting)
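The seeding idea can be sketched as follows: the first centroid is drawn uniformly at random, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the toy data are assumptions made for illustration, and scikit-learn's own implementation may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose k initial centroids by squared-distance sampling (illustrative only)."""
    rng = np.random.default_rng(seed)
    # First centroid: one example chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every example to its nearest chosen centroid
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Sample the next centroid with probability proportional to that distance,
        # so far-away examples are more likely to be picked
        next_idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)

# Hypothetical toy data: three tight groups of 2-d points
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0, 5, 10)])

init_centroids = kmeans_pp_init(X_toy, k=3)
print(init_centroids)
```

An already chosen example has distance zero to itself, so it can never be selected twice; the resulting centroids would then be passed to the usual k-means iterations.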

<hr style="width:35%;margin-left:0;"> 

<span style='background:orange'>  **Example: k-means++ clustering**</span>    
    
1. Repeat the above analysis using k-means++
    
```
km_plus = KMeans(n_clusters=3, 
            init='k-means++',  # place initial centroids with k-means++ instead of purely at random
            n_init=10,      # run 10 times with different initial centroids and keep the model with the lowest SSE
            max_iter=300,   # max number of iterations for each run
            random_state=0)

y_km_plus = km_plus.fit_predict(X)

print(y_km_plus)
print(f'\nCluster centres:\n {km_plus.cluster_centers_}')
```

<hr style="width:35%;margin-left:0;"> 

### Using the Elbow Method to Find the Optimal Number of Clusters {-}

- The main challenge in unsupervised learning is that we don't know the values of class labels (classification) or what the target variable is (regression)  
    - Need to use metrics, e.g. within-cluster SSE (cluster inertia) to compare the performance of different k-Means clusterings   
    - Within-cluster SSE is the sum of the squared distances from each point to its cluster's centroid $SSE=\sum_{i=1}^n\sum_{j=1}^kw^{(i,j)}||x^{(i)}-\mu^{(j)}||_2^2$
    - In scikit-learn use `inertia_` attribute to get within-cluster SSE (sum of squared distances of samples to their closest cluster centre)    

- **Elbow Method** - a heuristic used in determining the number of clusters $k$ using inertia
    - Note: A heuristic is an approach to problem-solving that simplifies decision-making to produce efficient, though not necessarily optimal, solutions. 
    - As $k$ increases, inertia decreases ($k\uparrow\Rightarrow$ inertia $\downarrow$), so inertia alone always favours more clusters and it's easy to overfit
    - Plot inertia (within-cluster SSE) vs $k$ (the number of clusters)  
    - Choose $k$ so that adding another cluster doesn't provide large improvement in SSE
        - **Identify $k$ where inertia stops decreasing rapidly**  
        - We say that the elbow of the curve is located at that $k$   
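To connect the `inertia_` attribute to the SSE formula, the following sketch recomputes the within-cluster SSE by hand from the fitted labels and centroids and compares it with the value reported by scikit-learn. The variable names (`X_demo`, `km_demo`) are assumptions, and the data is regenerated so the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic blob data, regenerated here so the snippet is self-contained
X_demo, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                       cluster_std=0.5, random_state=0)

km_demo = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels_demo = km_demo.fit_predict(X_demo)

# Manual within-cluster SSE: squared distance of each point to its own centroid
manual_sse = np.sum((X_demo - km_demo.cluster_centers_[labels_demo]) ** 2)

print(f'inertia_:   {km_demo.inertia_:.4f}')
print(f'manual SSE: {manual_sse:.4f}')
```

The two numbers should agree up to floating-point error, which is why `inertia_` can be plotted directly against $k$ in the elbow method.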

<hr style="width:35%;margin-left:0;"> 

<span style='background:orange'>  **Example: Determining $k$ using the elbow method**</span>  

1. Compute within-cluster SSE (inertia) for $k\in\{1,2,\dots,10\}$ and plot inertia against $k$  
2. Determine the optimal $k$ according to the plot.  

<hr style="width:35%;margin-left:0;"> 

1. Compute within-cluster SSE (inertia) for $k\in\{1,2,\dots,10\}$ and plot inertia against $k$

```
inertias = [] # empty list

for i in range(1, 11):
    km = KMeans(n_clusters=i, 
                init='k-means++', 
                n_init=10, 
                max_iter=300, 
                random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Cluster inertia (within-cluster SSE)')
plt.xticks(range(1,11))
plt.tight_layout()
#plt.savefig('images/11_03.png', dpi=300)
plt.show()
```

<hr style="width:35%;margin-left:0;"> 

2. Determine the optimal $k$ according to the plot.

- According to the above plot the optimal number of clusters is 3.

<hr style="width:35%;margin-left:0;"> 

### Quantifying the Quality of Clustering via Silhouette Plots {-}  

- Silhouette analysis - graphical tool to plot a measure of how tightly grouped the examples in the clusters are
    - Used to assess the quality of clustering in a dataset   
    - Tells us how well each object has been classified in its cluster, based on the object's similarity to its own cluster compared to other clusters.   



- Silhouette coefficient (for an example $x^{(i)}$)   
    - Cluster Cohesion - $a^{(i)}$ - the average distance between $x^{(i)}$ and all other points in the same cluster  
        - How similar $x^{(i)}$ is to other examples in its own cluster  
        - Smaller $a^{(i)}$ is better  
    - Cluster Separation - $b^{(i)}$ - the average distance between $x^{(i)}$ and all examples in the nearest cluster  
        - Larger $b^{(i)}$ is better  
    - Silhouette coefficient - $s^{(i)}=\frac{b^{(i)}-a^{(i)}}{\text{max}\{b^{(i)},a^{(i)}\}}\in[-1,1]$
        - $s^{(i)}$ close to 1: $x^{(i)}$ is appropriately clustered
        - $s^{(i)}$ close to -1: $x^{(i)}$ would be better placed in its neighbouring cluster
        - $s^{(i)}$ close to 0: $x^{(i)}$ is on the border of two natural clusters         
- In scikit-learn use `from sklearn.metrics import silhouette_samples`  
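The definitions of $a^{(i)}$, $b^{(i)}$ and $s^{(i)}$ can be checked by hand for a single example. The sketch below uses a made-up six-point dataset with a fixed two-cluster labelling (both are assumptions for illustration) and compares the manual value with `silhouette_samples`.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

# Hypothetical toy data with a fixed two-cluster labelling
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0                                 # example whose coefficient we compute
dist_i = cdist(X_toy[[i]], X_toy)[0]  # distances from x^(i) to every point

# Cohesion a^(i): mean distance to the other points in the same cluster
same = labels == labels[i]
same[i] = False                       # exclude x^(i) itself
a_i = dist_i[same].mean()

# Separation b^(i): smallest mean distance to the points of any other cluster
b_i = min(dist_i[labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_i = (b_i - a_i) / max(a_i, b_i)
sklearn_s_i = silhouette_samples(X_toy, labels, metric='euclidean')[i]

print(f'manual s:  {s_i:.4f}')
print(f'sklearn s: {sklearn_s_i:.4f}')
```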


---

<span style='background:orange'>  **Example: Determining $k$ via Silhouette Plots**</span>    

1. Group the data into 3 clusters and compute a silhouette coefficient for each example (observation) using `silhouette_samples`  
- Plot all silhouette coefficients for each example (from smallest to largest) grouped by clusters  
- Compute and plot the average silhouette coefficient across all observations  
    
2. Repeat 1. by grouping data into only 2 clusters and observe the difference in silhouette coefficients  
    

---
```
import numpy as np
from matplotlib import cm

from sklearn.metrics import silhouette_samples

km = KMeans(n_clusters=3,  
            init='k-means++', 
            n_init=10, 
            max_iter=300,
            tol=1e-04,
            random_state=0)

y_km = km.fit_predict(X)
# print(y_km)

cluster_labels = np.unique(y_km)
# print(cluster_labels)

n_clusters = cluster_labels.shape[0]

silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
# print('silhouette_vals\n', silhouette_vals)

## ------- plotting silhouette values -------

y_ax_lower, y_ax_upper = 0, 0
yticks = []

for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
    
    
silhouette_avg = np.mean(silhouette_vals)
print(f'silhouette_avg: {silhouette_avg:.2f}')

plt.axvline(silhouette_avg, color="red", linestyle="--") # plot vertical average line

plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')

plt.tight_layout()
#plt.savefig('images/11_04.png', dpi=300)
plt.show()

```

- None of the silhouette coefficients are close to 0 -> an indicator of good clustering
- Average silhouette coefficient is 0.83 (out of the maximum of 1) -> also a good sign 
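For step 2 of the example, one possible shortcut is to compare only the average silhouette coefficient for 2 and 3 clusters, using `silhouette_score` from `sklearn.metrics` rather than redrawing the full plot. This is a sketch: the blob data is regenerated so the snippet runs on its own, and the names `X_blobs` and `avg_silhouette` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Regenerate the blob data so the snippet is self-contained
X_blobs, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                        cluster_std=0.5, shuffle=True, random_state=0)

avg_silhouette = {}
for k in (2, 3):
    km_k = KMeans(n_clusters=k, init='k-means++', n_init=10,
                  max_iter=300, tol=1e-04, random_state=0)
    labels_k = km_k.fit_predict(X_blobs)
    # silhouette_score returns the mean silhouette coefficient over all examples
    avg_silhouette[k] = silhouette_score(X_blobs, labels_k, metric='euclidean')

for k, score in avg_silhouette.items():
    print(f'k={k}: average silhouette = {score:.2f}')
```

A clearly lower average for one value of $k$ suggests a worse grouping, but the full silhouette plot is still useful for spotting individual clusters that are poorly separated.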

---

## Organizing Clusters as a Hierarchical Tree {-}  

- Hierarchical Clustering    
    - Method of cluster analysis which seeks to build a hierarchy of clusters    

<img src="images/11_18.png" alt="Drawing" style="width: 400px;"/>  

- There are two approaches to hierarchical clustering  
    - Agglomerative (bottom-up approach) - covered in this lecture  
        1. Start by assuming that each example is a single cluster  
        2. Merge closest pairs of clusters iteratively until only one cluster remains  
    - Divisive (top-down approach)  
        1. Start with one cluster  
        2. Split the cluster into smaller clusters iteratively until each cluster contains only one example  

- For a good explanation of hierarchical clustering see [https://youtu.be/rg2cjfMsCk4](https://youtu.be/rg2cjfMsCk4)  

<hr style="width:35%;margin-left:0;">   

### Grouping Clusters in Agglomerative (Bottom-Up) Approach {-}  

- Key Operation: Combine two nearest clusters
- Key Problem: How to measure distance between clusters


There are two standard algorithms for agglomerative hierarchical clustering:  
1. **Single Linkage Approach**    
- Compute the distances between the **most similar** members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest  
2. **Complete Linkage Approach**    
- Compute the distance between the **most dissimilar** members for each pair of clusters and merge the two clusters for which the distance between the most dissimilar members is the smallest  

<img src="images/11_07.png" alt="Drawing" style="width: 400px;"/>  
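The difference between the two linkage criteria can be seen on a pair of tiny made-up clusters (the points are assumptions chosen so the distances are easy to read off): single linkage uses the smallest pairwise distance between members of the two clusters, and complete linkage uses the largest.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-d points lying on the x-axis
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between members of the two clusters
pair_dists = cdist(cluster_a, cluster_b)

single_link = pair_dists.min()    # distance between the most similar members
complete_link = pair_dists.max()  # distance between the most dissimilar members

print(single_link)    # 2.0
print(complete_link)  # 6.0
```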

**Agglomerative Complete Linkage Algorithm**  
1. Compute the distance matrix containing the pairwise distances between all examples  
2. Represent each data point as a cluster  
3. Merge the two closest clusters based on the distance between the most dissimilar members  
4. Update the distance matrix  
5. Repeat 3 - 4 until only one cluster remains  
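The steps above can be sketched naively in NumPy. This version is written for readability rather than speed: instead of updating the distance matrix after each merge, it re-derives cluster distances from the original pairwise distances. The function name `complete_linkage_sketch` and the random toy data are assumptions; the merge distances are compared with those reported by SciPy's `linkage`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def complete_linkage_sketch(X):
    """Naive agglomerative clustering with complete linkage (illustrative only)."""
    D = squareform(pdist(X, metric='euclidean'))  # step 1: pairwise distances
    clusters = {i: [i] for i in range(len(X))}    # step 2: one cluster per point
    merge_distances = []
    while len(clusters) > 1:
        # Step 3: find the pair of clusters whose most dissimilar members are closest
        best = None
        keys = list(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 4: merge the pair; cluster distances are recomputed on the next pass
        clusters[a] = clusters[a] + clusters.pop(b)
        merge_distances.append(d)
    return np.array(merge_distances)

# Hypothetical random data: 6 examples with 3 features
rng = np.random.default_rng(0)
X_toy = rng.random((6, 3)) * 10

sketch_dists = complete_linkage_sketch(X_toy)
scipy_dists = linkage(X_toy, method='complete', metric='euclidean')[:, 2]

print(sketch_dists)
print(scipy_dists)
```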


**Visualising Hierarchical Clusters**

- A **dendrogram** is a tree-like diagram that is used to illustrate the arrangement of the elements or clusters formed during hierarchical clustering analysis. 
    - Visually displays the sequence of cluster mergings and the distance at which each merging occurred.
    - Particularly useful in agglomerative (bottom-up) hierarchical clustering, although they can represent divisive (top-down) clustering processes as well.


<hr style="width:35%;margin-left:0;">   

<span style='background:orange'> **Example: Clustering Data using hierarchical/agglomerative clustering with Euclidean norm & visualize linkage matrix in a dendrogram (using both scipy and scikit-learn)**</span>    

1. Generate data for 3 random variables X, Y and Z to be used in hierarchical cluster analysis using [https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html](https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html)  
2. Compute the distance matrix with Euclidean norm using  
[https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html)  
3. Use [https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) to compute the **linkage matrix** and perform hierarchical/agglomerative clustering  
4. Visualize the linkage matrix in a dendrogram  
5. Plot the data in a heatmap and attach the dendrogram to it  
6. Repeat the agglomerative clustering analysis in scikit-learn   

<hr style="width:35%;margin-left:0;">   

1. Generate data for 3 random variables X, Y and Z to be used in hierarchical cluster analysis using [https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html](https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html)

```
import pandas as pd
import numpy as np

np.random.seed(123)

variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']

X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df
```

<hr style="width:35%;margin-left:0;">   

2. Compute the distance matrix with Euclidean norm using
[https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html)
- Print the distance matrix in square form (via `squareform`)
- Print the distance matrix in condensed form

```
from scipy.spatial.distance import pdist, squareform

row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels,
                        index=labels)

print(row_dist)


pd.Series(pdist(df, metric='euclidean'))

```



---

We can check the distance between ID_0 and ID_1

```

print('ID_0 \n', df.iloc[0,:])
print('ID_1 \n',df.iloc[1,:])

print('(ID_0 - ID_1)^2 \n',(df.iloc[0,:] - df.iloc[1,:])**2)
print('sqrt(sum((ID_0 - ID_1)^2)) \n', np.sqrt(np.sum((df.iloc[0,:] - df.iloc[1,:])**2)))

```

<hr style="width:35%;margin-left:0;">   

3. Use [https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) to compute the **linkage matrix** and perform hierarchical/agglomerative clustering  

- Either pass a condensed distance matrix (upper triangular) from the `pdist` function, or   
- Pass the "original" data array and define the `metric='euclidean'` argument in `linkage`.   

However, we should not pass the squareform distance matrix, which would yield different distance values although the overall clustering could be the same.  

```
# ---- 1st Approach: input condensed distance matrix ----

from scipy.cluster.hierarchy import linkage


row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {i + 1}' for i in range(row_clusters.shape[0])])
```

```
# ---- 2nd Approach: input the original data array ----
row_clusters = linkage(df.values, method='complete', metric='euclidean')

pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {i + 1}' for i in range(row_clusters.shape[0])])

#              index=['cluster %d' % (i + 1)
#                     for i in range(row_clusters.shape[0])])             
```

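Note:  
- Each row represents one merge  
- 1st and 2nd columns - indices of the two clusters merged in that step (indices 0 to $n-1$ are the original examples; index $n+i$ is the cluster formed in row $i$, counting rows from 0)  
- 3rd column - distance between the two merged clusters, which under complete linkage is the distance between their most dissimilar members  
- 4th column - number of original examples in the newly formed cluster  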
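The sketch below illustrates how a linkage matrix can be read and used. It relies on SciPy's documented convention that an index below $n$ refers to an original example and index $n+i$ to the cluster formed in row $i$. It uses a made-up random array (`X_toy`, an assumption, not the lecture's DataFrame), checks that the condensed distance matrix and the raw data give the same linkage, prints each merge in words, and cuts the tree into two flat clusters with `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical data: 5 examples with 3 features
rng = np.random.default_rng(123)
X_toy = rng.random((5, 3)) * 10
n = len(X_toy)

# Condensed distances and the raw data array produce the same linkage matrix
Z_condensed = linkage(pdist(X_toy, metric='euclidean'), method='complete')
Z_raw = linkage(X_toy, method='complete', metric='euclidean')
print(np.allclose(Z_condensed, Z_raw))  # True

# Read each merge: indices below n are original examples,
# index n + i is the cluster created in row i
for i, (left, right, dist, size) in enumerate(Z_raw):
    print(f'row {i}: merge {int(left)} and {int(right)} '
          f'at distance {dist:.2f} -> cluster {n + i} with {int(size)} examples')

# Cut the tree so that at most 2 flat clusters remain
flat_labels = fcluster(Z_raw, t=2, criterion='maxclust')
print(flat_labels)
```

The last row always reports a cluster containing all $n$ examples, since the procedure stops when only one cluster remains.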

<hr style="width:35%;margin-left:0;"> 

4. Visualize the linkage matrix in a dendrogram
- A dendrogram shows how we merged datapoints into clusters

```
from scipy.cluster.hierarchy import dendrogram

import matplotlib.pyplot as plt

# make dendrogram black (part 1/2)

from scipy.cluster.hierarchy import set_link_color_palette
set_link_color_palette(['black'])

row_dendr = dendrogram(row_clusters, 
                       labels=labels,
                       # make dendrogram black (part 2/2)
                       color_threshold=np.inf
                       )
plt.tight_layout()
plt.ylabel('Euclidean distance')
#plt.savefig('images/11_11.png', dpi=300, 
#            bbox_inches='tight')
plt.show()
```

<hr style="width:35%;margin-left:0;"> 

5. Plot the data in a heatmap and attach the dendrogram to it

```
# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])

# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')
# print(row_dendr)

# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]

axd.set_xticks([])
axd.set_yticks([])

# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)

# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6])  # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
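# set tick positions explicitly so the labels line up with the heat map cells
axm.set_xticks(range(len(df_rowclust.columns)))
axm.set_xticklabels(df_rowclust.columns)
axm.set_yticks(range(len(df_rowclust.index)))
axm.set_yticklabels(df_rowclust.index)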

#plt.savefig('images/11_12.png', dpi=300)
plt.show()
```

<hr style="width:35%;margin-left:0;"> 

6. Repeat the agglomerative clustering analysis in scikit-learn

### Applying Agglomerative Clustering via scikit-learn {-}

Apply agglomerative clustering in scikit-learn to identify 
- 3 clusters
- 2 clusters


```
from sklearn.cluster import AgglomerativeClustering

# ---------- 2 clusters --------------

ac = AgglomerativeClustering(n_clusters=2, 
                             metric='euclidean',   # this argument was called `affinity` before scikit-learn 1.2
                             linkage='complete')
ac_labels = ac.fit_predict(X)   # separate name so the ID `labels` list used for the dendrogram is not overwritten
print(f'Cluster labels: {ac_labels}')

# ---------- 3 clusters --------------

ac = AgglomerativeClustering(n_clusters=3, 
                             metric='euclidean', 
                             linkage='complete')
ac_labels = ac.fit_predict(X)
print(f'Cluster labels: {ac_labels}')

```

---

## Locating Regions of High Density via DBSCAN {-}

- DBSCAN (Density-based spatial clustering of applications with noise)  
    - Group clusters based on dense regions of points  
    -  <span style='background:lightgrey'> **Density**: number of points within a specified radius $\varepsilon$</span>  
    
- Each observation is classified as follows  
    - Core Point: if at least a specified number (MinPts) of neighbouring points fall within the specified radius $\varepsilon$  
    - Border Point: has fewer neighbours than MinPts within $\varepsilon$, but lies within the $\varepsilon$ radius of a core point  
    - Noise Point: any point that is neither a core nor border point  
    
<img src="images/11_13.png" alt="Drawing" style="width: 400px;"/> 

**DBSCAN Algorithm**  

1. Specify the radius $\varepsilon$  
2. Set the minimum number of neighbouring points - MinPts   
3. Label each point as either a core, border or noise point  
4. Form a separate cluster for each core point or connected group of core points (core points are connected if they are no further away than $\varepsilon$) 
5. Assign each border point to the cluster of its corresponding core point  
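The three point types can be recovered from a fitted scikit-learn `DBSCAN` object, as sketched below: core points are listed in `core_sample_indices_`, noise points receive the label `-1`, and the remaining points are border points. The data, the parameter values (`eps=0.2`, `min_samples=5`) and the variable names are assumptions chosen only for illustration. The manual check at the end counts neighbours within $\varepsilon$ and includes the point itself, because scikit-learn counts it towards `min_samples`.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative data and parameter values (assumptions for this demo)
X_demo, _ = make_moons(n_samples=200, noise=0.1, random_state=0)
eps, min_pts = 0.2, 5

db_demo = DBSCAN(eps=eps, min_samples=min_pts, metric='euclidean').fit(X_demo)

# Core points are exposed directly; noise points get the label -1
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db_demo.core_sample_indices_] = True
is_noise = db_demo.labels_ == -1
is_border = ~is_core & ~is_noise  # in a cluster, but not dense enough to be core

print(f'core: {is_core.sum()}, border: {is_border.sum()}, noise: {is_noise.sum()}')

# Manual check of the core-point rule: count the points within eps of each point
# (the point itself is included in the count)
neighbour_counts = (cdist(X_demo, X_demo) <= eps).sum(axis=1)
manual_core = neighbour_counts >= min_pts
print(np.array_equal(manual_core, is_core))
```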

Advantages of DBSCAN:   
1. Not all points are assigned to a cluster since points can be removed as noise points  
2. Does not require *a priori* specification of the number of clusters  

In scikit-learn   
- `from sklearn.cluster import DBSCAN`  
- [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)  

<hr style="width:35%;margin-left:0;">   

<span style='background:orange'> **Example: Cluster make_moons with K-means,  hierarchical clustering and DBSCAN**</span>    
1. Create a linearly inseparable dataset using `make_moons`   
2. Attempt to identify the two clusters using K-means and hierarchical clustering in scikit-learn  
3. Cluster the data on the basis of DBSCAN  

<hr style="width:35%;margin-left:0;">

1. Create a linearly inseparable dataset using `make_moons` 

```
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
plt.tight_layout()
#plt.savefig('images/11_14.png', dpi=300)
plt.show()
# X
# y
```

<hr style="width:35%;margin-left:0;">

2. Attempt to identify the two clusters using k-Means and hierarchical clustering in scikit-learn

```
from sklearn.cluster import KMeans

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

#----------------- k-means clustering with 2 clusters ------------------
km = KMeans(n_clusters=2, random_state=0)
y_km = km.fit_predict(X)


ax1.scatter(X[y_km == 0, 0], X[y_km == 0, 1],edgecolor='black', c='lightblue', marker='o', s=40, label='cluster 1')
ax1.scatter(X[y_km == 1, 0], X[y_km == 1, 1], edgecolor='black', c='red', marker='s', s=40, label='cluster 2')
ax1.set_title('K-means clustering')

#----------------- agglomerative clustering using two clusters --------------
ac = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='complete')  # `metric` was called `affinity` before scikit-learn 1.2
y_ac = ac.fit_predict(X)

ax2.scatter(X[y_ac == 0, 0], X[y_ac == 0, 1], c='lightblue', edgecolor='black', marker='o', s=40, label='Cluster 1')
ax2.scatter(X[y_ac == 1, 0], X[y_ac == 1, 1], c='red', edgecolor='black', marker='s', s=40, label='Cluster 2')
ax2.set_title('Agglomerative clustering')

plt.legend()
plt.tight_layout()
#plt.savefig('images/11_15.png', dpi=300)
plt.show()

print('clusters according to KMeans\n', y_km)
```

<hr style="width:35%;margin-left:0;">

3. Cluster the data on the basis of DBSCAN

```
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10, metric='euclidean')

y_db = db.fit_predict(X)

plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1], c='lightblue', marker='o', s=40, edgecolor='black',  label='Cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1], c='red', marker='s', s=40, edgecolor='black', label='Cluster 2')
plt.legend()
plt.tight_layout()
#plt.savefig('images/11_16.png', dpi=300)
plt.show()
```