# Lecture 12: Clustering
Description: Prof. Guttag discusses clustering.

Instructor: John Guttag

## 1- Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. This approach can be classified into two types: **Agglomerative Hierarchical Clustering** (bottom-up approach) and **Divisive Hierarchical Clustering** (top-down approach). Here, we'll primarily focus on the agglomerative method, which is more commonly used.

### Key Concepts

1. **Agglomerative Hierarchical Clustering (AHC)**:
   - **Bottom-Up Approach**: Each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
   - **Steps**:
     1. Start with each data point as a single cluster.
     2. Compute the distance (or similarity) between all pairs of clusters.
     3. Merge the two closest clusters.
     4. Repeat steps 2 and 3 until there is only one cluster left.

2. **Divisive Hierarchical Clustering**:
   - **Top-Down Approach**: All data points start in one cluster, and the cluster is recursively split into smaller clusters.
   - **Steps**:
     1. Start with all data points in one cluster.
     2. Split a cluster into the two sub-clusters that are least similar to each other.
     3. Repeat step 2 on each resulting cluster until every data point is in its own cluster.

### Distance Metrics

Hierarchical clustering relies on a distance (or similarity) measure to compare data points or clusters. Common choices include:
- **Euclidean Distance**: The straight-line distance between two points in Euclidean space.
- **Manhattan Distance**: The sum of the absolute differences between coordinates.
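- **Cosine Similarity**: Measures the cosine of the angle between two vectors. Because it is a similarity rather than a distance (larger means closer), it is usually converted to a cosine distance, 1 minus the similarity, before clustering.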
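
The three measures above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the lecture; the function names are made up here, and the vectors are assumed to be equal-length sequences of numbers (non-zero for the cosine functions).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Turn the similarity into a dissimilarity: 0 means same direction."""
    return 1.0 - cosine_similarity(a, b)
```

For the points `(0, 0)` and `(3, 4)`, `euclidean` returns `5.0` and `manhattan` returns `7`. Perpendicular vectors such as `(1, 0)` and `(0, 1)` have a cosine similarity of `0.0`.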

### Linkage Criteria

When deciding which clusters to merge, various linkage criteria can be used:
- **Single Linkage (Minimum Linkage)**: The distance between the closest pair of points, one from each cluster.
- **Complete Linkage (Maximum Linkage)**: The distance between the farthest pair of points, one from each cluster.
- **Average Linkage**: The average distance over all pairs of points, one from each cluster.
- **Centroid Linkage**: The distance between the centroids of the clusters.
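
To make the four criteria concrete, here is a small sketch that assumes Euclidean distance and represents each cluster as a list of coordinate tuples. The function names are illustrative, not taken from the lecture.

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """Distance between the closest pair of points across the two clusters."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the farthest pair of points across the two clusters."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    total = sum(math.dist(p, q) for p, q in product(c1, c2))
    return total / (len(c1) * len(c2))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_linkage(c1, c2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))
```

With `a = [(0, 0), (1, 0)]` and `b = [(3, 0), (5, 0)]`, single linkage gives `2.0`, complete linkage gives `5.0`, and both average and centroid linkage give `3.5`. The last two coincide here only because the points lie on a line; in general they differ.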

### Dendrogram

A dendrogram is a tree-like diagram that records the sequence of merges or splits. It is a visual representation of the hierarchical clustering process: the height at which two branches join shows the distance at which those clusters were merged, which helps in understanding the structure of the clusters formed at different levels of the hierarchy.

### Advantages and Disadvantages

**Advantages**:
- Does not require the number of clusters to be specified in advance.
- Dendrograms provide a visual summary of the clustering process.
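- Can capture non-spherical cluster shapes, depending on the linkage criterion (single linkage in particular can follow elongated clusters).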

**Disadvantages**:
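- Computationally intensive, especially for large datasets: a naive agglomerative implementation takes \( O(n^3) \) time and \( O(n^2) \) memory for \( n \) points.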
- Sensitive to noise and outliers.
- The results can be difficult to interpret without a clear criterion for cutting the dendrogram.
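
One common criterion for cutting the dendrogram is a distance threshold: keep every merge that happened at or below the threshold and ignore the rest. The sketch below assumes a hypothetical merge-history format, a list of `(cluster_a, cluster_b, distance)` tuples over point indices in the order the merges happened, with non-decreasing distances (which holds for single, complete, and average linkage).

```python
def cut_dendrogram(n_points, merges, max_distance):
    """Return the clusters obtained by undoing every merge above max_distance.

    merges: list of (cluster_a, cluster_b, distance) in merge order, where
    each cluster is a tuple of point indices (an assumed format).
    """
    clusters = {frozenset([i]) for i in range(n_points)}
    for left, right, distance in merges:
        if distance > max_distance:
            break  # distances are assumed non-decreasing, so we can stop here
        left, right = frozenset(left), frozenset(right)
        clusters -= {left, right}
        clusters.add(left | right)
    return sorted(sorted(c) for c in clusters)

# Four points on a line at x = 0, 1, 5, 6, merged with single linkage.
merges = [((0,), (1,), 1.0), ((2,), (3,), 1.0), ((0, 1), (2, 3), 4.0)]
print(cut_dendrogram(4, merges, max_distance=2.0))  # [[0, 1], [2, 3]]
```

Raising the threshold to `4.0` or more yields a single cluster, and lowering it below `1.0` leaves every point on its own.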

### Example Algorithm (Pseudocode for Agglomerative Clustering)

1. Initialize each data point as a single cluster.
2. Compute the distance matrix for all clusters.
3. While more than one cluster remains:
   - Find the pair of clusters with the smallest distance.
   - Merge these two clusters.
   - Update the distance matrix to reflect the merge.
4. Construct the dendrogram based on the merging steps.
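
The pseudocode above can be turned into a short runnable sketch. Two simplifications are assumed here: it uses single linkage with Euclidean distance by default, and it recomputes linkage distances on every pass instead of maintaining and updating a distance matrix, which is simpler but slower. The returned merge history (a made-up format of `(cluster_a, cluster_b, distance)` tuples over point indices) is the information a dendrogram would be drawn from.

```python
import math
from itertools import combinations

def single_linkage(c1, c2, points):
    """Closest-pair distance between two clusters of point indices."""
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, linkage=single_linkage):
    """Merge clusters bottom-up and return the merge history."""
    # Step 1: every point starts as its own cluster (stored as index tuples).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    # Step 3: keep merging until a single cluster remains.
    while len(clusters) > 1:
        # Find the closest pair of clusters (naive recomputation each pass).
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(pair[0], pair[1], points),
        )
        distance = linkage(a, b, points)
        # Merge the pair and record the step for the dendrogram.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, distance))
    return merges

points = [(0, 0), (1, 0), (5, 0), (6, 0)]
for a, b, d in agglomerative(points):
    print(a, b, d)
```

On these four points the two nearby pairs are merged first, each at distance `1.0`, and the resulting two clusters are merged last at distance `4.0`.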

### Applications

Hierarchical clustering is used in various fields including:
- **Bioinformatics**: Grouping genes with similar expression patterns.
- **Marketing**: Segmenting customers based on purchasing behavior.
- **Document Clustering**: Organizing a large set of documents into categories.


## 2- K-Means Clustering

K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms for partitioning a dataset into distinct, non-overlapping subgroups (clusters). It organizes the data into \( K \) clusters so that each data point belongs to the cluster with the nearest mean; that mean (the centroid) serves as the prototype of the cluster.

### Key Concepts

1. **Centroid**:
   - The center of a cluster, calculated as the mean of all the points in the cluster.
   
2. **Cluster**:
   - A group of data points that are grouped together based on similarity.
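
In symbols, the centroid of a cluster is the mean of its members, and K-Means is commonly described as trying to minimize the within-cluster sum of squares. The notation \( C_k \) (the set of points in cluster \( k \)), \( \mu_k \) (its centroid), and \( J \) (the objective) is introduced here for illustration and does not come from the lecture:

\[
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x,
\qquad
J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2
\]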

### Algorithm Steps

1. **Initialization**:
   - Select \( K \) initial centroids, typically by picking \( K \) points at random from the dataset.

2. **Assignment**:
   - Assign each data point to the nearest centroid, forming \( K \) clusters.

3. **Update**:
   - Recalculate the centroid of each cluster by taking the mean of all the points in the cluster.

4. **Repeat**:
   - Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached.

### Example Pseudocode

1. Initialize centroids randomly.
2. Repeat until convergence or for a fixed number of iterations:
   - Assign each data point to the nearest centroid.
   - Update centroids by computing the mean of all points in the cluster.
3. Return the final cluster assignments and centroids.
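
A minimal Python sketch of this loop follows. It assumes Euclidean distance, points given as a list of coordinate tuples, initial centroids sampled from the data, and convergence defined as "no assignment changed". The `seed` parameter and the choice to leave the centroid of an empty cluster where it was are decisions made for this sketch, not details from the lecture.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Return (assignments, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignments, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

On this toy dataset the first three points end up in one cluster and the last three in the other. Which cluster gets label `0` depends on the random initialization.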

### Choosing the Number of Clusters (K)

One of the key challenges in K-Means clustering is determining the appropriate number of clusters (\( K \)). Several methods are used to determine this, including:

1. **Elbow Method**:
   - Plot the sum of squared distances from each point to its assigned centroid (within-cluster sum of squares) against the number of clusters.
   - The point at which the decrease in the sum of squared distances slows down (forming an elbow) indicates the appropriate number of clusters.

2. **Silhouette Score**:
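   - Measures how similar a point is to its own cluster compared to other clusters. The score ranges from -1 to 1, and higher values indicate that points sit well inside their own cluster and far from the others.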
   - The average silhouette score for different values of \( K \) can help in choosing the best \( K \).
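
Both quantities can be computed directly from a clustering. The sketch below assumes Euclidean distance, integer labels that index into the list of centroids, at least two clusters, and at least two distinct points. Scoring a singleton cluster as `0.0` is a common convention adopted here. The function names are illustrative.

```python
import math

def wcss(points, labels, centroids):
    """Within-cluster sum of squares: the quantity plotted in the elbow method."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(wcss(points, [0, 0, 1, 1], [(0, 0.5), (10, 0.5)]))
print(silhouette_score(points, [0, 0, 1, 1]))
print(silhouette_score(points, [0, 1, 0, 1]))
```

The natural grouping `[0, 0, 1, 1]` scores close to 0.9, while the mismatched grouping `[0, 1, 0, 1]` scores below zero. To apply either method, run the clustering for a range of \( K \) values and compare the resulting numbers.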

### Applications

1. **Customer Segmentation**:
   - Grouping customers based on purchasing behavior to tailor marketing strategies.
   
2. **Image Compression**:
   - Reducing the number of colors in an image by clustering similar colors.

3. **Document Clustering**:
   - Organizing documents into topics based on content similarity.

### Advantages and Disadvantages

**Advantages**:
- Simple to implement and computationally efficient: each iteration costs roughly \( O(nKd) \) for \( n \) points in \( d \) dimensions.
- Works well with large datasets.
- Easy to interpret and understand.

**Disadvantages**:
- Requires specifying the number of clusters in advance.
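- Sensitive to the initial placement of centroids: different starting points can converge to different local optima, so the algorithm is often run several times and the best result kept.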
- Assumes clusters are spherical and equally sized, which may not always be the case.

