### Hierarchical Clustering (Agglomerative) Formulas and Key Concepts

Hierarchical clustering builds a tree-like structure called a **dendrogram**. In agglomerative hierarchical clustering, we start with each point as its own cluster and iteratively merge the closest clusters.

1. **Distance Between Clusters**:
   The distance between two clusters \( A \) and \( B \) is computed based on different linkage criteria.

#### a) **Single Linkage (Min Linkage)**:
   $$d(A, B) = \min_{x \in A, y \in B} \|x - y\|$$
   - **Explanation**: The distance between two clusters is the smallest distance between a point in one cluster and a point in the other.
   
#### b) **Complete Linkage (Max Linkage)**:
   $$d(A, B) = \max_{x \in A, y \in B} \|x - y\|$$
   - **Explanation**: The distance between two clusters is the largest distance between a point in one cluster and a point in the other.
   
#### c) **Average Linkage**:
   $$d(A, B) = \frac{1}{|A| |B|} \sum_{x \in A, y \in B} \|x - y\|$$
   - **Explanation**: The distance between two clusters is the average distance between all pairs of points from the two clusters.

#### d) **Ward’s Linkage**:
   $$d(A, B) = \sqrt{\frac{|A| |B|}{|A| + |B|} \| \bar{x}_A - \bar{x}_B \|^2}$$
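   - **Explanation**: The distance between two clusters is based on the increase in the sum of squared errors (SSE) when the two clusters are merged; the quantity under the square root is exactly that increase. At each step, the pair whose merge increases SSE the least is chosen, which keeps the variance within clusters low.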
     - $\bar{x}_A$ and $\bar{x}_B$: Centroids of clusters $A$ and $B$.
     - $|A|$ and $|B|$: Number of points in clusters $A$ and $B$.
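
The four linkage formulas above can be sketched directly in NumPy. This is a minimal illustration, not a library implementation; the function names and the two example clusters are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distance between every point in A and every point in B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def single_linkage(A, B):
    return pairwise_distances(A, B).min()

def complete_linkage(A, B):
    return pairwise_distances(A, B).max()

def average_linkage(A, B):
    # Mean over all |A| * |B| cross-cluster pairs.
    return pairwise_distances(A, B).mean()

def ward_linkage(A, B):
    # Square root of the SSE increase caused by merging A and B.
    n_a, n_b = len(A), len(B)
    gap = A.mean(axis=0) - B.mean(axis=0)
    return np.sqrt(n_a * n_b / (n_a + n_b) * (gap @ gap))

# Hypothetical clusters: two vertical pairs of points, 3 units apart.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 2.0]])

print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # sqrt(13), about 3.606
print(average_linkage(A, B))   # (3 + sqrt(13)) / 2, about 3.303
print(ward_linkage(A, B))      # 3.0
```

For the same pair of clusters, single linkage never exceeds average linkage, and average linkage never exceeds complete linkage.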

2. **Dendrogram**:
   - The hierarchical structure is visualized as a **dendrogram**, where:
     - Each node represents a cluster.
     - The height at which clusters are merged represents the distance between them.
   - You can cut the dendrogram at a desired height to obtain a specific number of clusters.
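
Cutting the dendrogram can be sketched with SciPy's `scipy.cluster.hierarchy` module. The data below is hypothetical (two well-separated groups), and the cut height `5.0` is an assumption chosen to fall between the within-group and between-group merge heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Each row of Z records one merge: [cluster i, cluster j, height, new size].
Z = linkage(X, method="average", metric="euclidean")
heights = Z[:, 2]

# Cut by height: merges above t are undone, leaving separate clusters.
labels_by_height = fcluster(Z, t=5.0, criterion="distance")

# Cut by count: ask for at most 2 flat clusters.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the tree layout without drawing anything.
tree = dendrogram(Z, no_plot=True)

print(heights)
print(labels_by_height, labels_by_count)
```

Dropping `no_plot=True` draws the tree when matplotlib is installed.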

### Agglomerative Clustering Process:
1. Start with each point as its own cluster.
2. Compute the distance between each pair of clusters using one of the linkage methods.
3. Merge the closest pair of clusters.
4. Repeat steps 2 and 3 until all points belong to a single cluster.
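
The four steps above translate into a short, deliberately naive loop. This is a sketch for understanding, assuming a hypothetical `agglomerative` helper that takes any linkage function; it recomputes every pairwise distance on each pass, so it is far slower than library implementations.

```python
import numpy as np

def single_linkage(A, B):
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min()

def agglomerative(X, linkage_fn, n_clusters=1):
    """Naive agglomerative clustering; returns final clusters and a merge log."""
    clusters = [[i] for i in range(len(X))]      # step 1: one cluster per point
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):           # step 2: all pairwise distances
            for j in range(i + 1, len(clusters)):
                d = linkage_fn(X[clusters[i]], X[clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best                           # step 3: merge the closest pair
        merges.append((clusters[i], clusters[j], float(d)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                  # step 4: loop again
    return clusters, merges

# Hypothetical 1-D data: two tight pairs and one outlier.
X = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])
clusters, merges = agglomerative(X, single_linkage, n_clusters=2)
print(clusters)                      # [[4], [0, 1, 2, 3]]
print([m[2] for m in merges])        # [1.0, 1.0, 4.0]
```

Setting `n_clusters=1` runs the loop to completion, which matches the stopping rule in step 4.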

### Key Parameters:
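- **Linkage**: Defines how the distance between clusters is measured: single (min), complete (max), average, or Ward’s.
- **Distance Metric**: Typically Euclidean distance; others such as Manhattan distance can be used with single, complete, and average linkage. Ward’s linkage is defined in terms of Euclidean distance, since it relies on centroids and squared errors.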
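
As a rough illustration of how these two parameters are passed in practice, the sketch below uses SciPy's `linkage`, where `method` selects the linkage and `metric` selects the point-to-point distance (`"cityblock"` is SciPy's name for Manhattan distance). The synthetic data and the particular method/metric pairings are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two Gaussian blobs centred at (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])

settings = [("single", "euclidean"),
            ("complete", "cityblock"),
            ("average", "cityblock"),
            ("ward", "euclidean")]   # Ward is paired with Euclidean distance

results = {}
for method, metric in settings:
    Z = linkage(X, method=method, metric=metric)
    results[(method, metric)] = fcluster(Z, t=2, criterion="maxclust")

for key, labels in results.items():
    print(key, labels)
```

On well-separated blobs like these, all four settings should recover the same two groups; the choices matter more when clusters are elongated, noisy, or unevenly sized.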


### **Ward's Linkage in Agglomerative Clustering (Step-by-Step)**

Ward's Linkage is a method to calculate the distance between clusters. It focuses on minimizing the increase in **within-cluster variance** (or Sum of Squared Errors, SSE) when clusters are merged. Here's how it works:

---

#### **Step-by-Step Process**:

1. **Start with each data point as its own cluster**:
   - Initially, every data point is treated as an individual cluster.

2. **Compute the centroids of clusters**:
   - A **centroid** is the mean of all points in a cluster.
   - For a cluster $A$, its centroid $\bar{x}_A$ is calculated as:
     $$
     \bar{x}_A = \frac{1}{|A|} \sum_{x \in A} x
     $$
     - $|A|$: Number of points in cluster $A$.
     - $x$: Data points in cluster $A$.

3. **Calculate the increase in variance (SSE) if two clusters are merged**:
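   - For two clusters $A$ and $B$, the increase in SSE is given by:
     $$
     \Delta(A, B) = \frac{|A| \cdot |B|}{|A| + |B|} \cdot \|\bar{x}_A - \bar{x}_B\|^2
     $$
   - The Ward distance is the square root of this increase, matching the linkage formula given earlier:
     $$
     d(A, B) = \sqrt{\Delta(A, B)}
     $$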
     - $\bar{x}_A$ and $\bar{x}_B$: Centroids of clusters $A$ and $B$.
     - $|A|$ and $|B|$: Number of points in clusters $A$ and $B$.
   - This measures how much the "compactness" of clusters decreases when they are merged.

4. **Find the pair of clusters with the smallest increase in variance**:
   - Among all possible pairs of clusters, select the pair $(A, B)$ with the smallest $d(A, B)$.

5. **Merge the selected clusters**:
   - Combine the two clusters $A$ and $B$ into a single cluster.

6. **Repeat steps 2-5**:
   - Recompute the centroids and distances for the updated clusters.
   - Continue merging clusters until all points belong to a single cluster or until the desired number of clusters is reached.
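
The claim in step 3 can be checked numerically: the SSE increase from a merge, computed directly as SSE(A ∪ B) − SSE(A) − SSE(B), should equal the closed form $\frac{|A||B|}{|A|+|B|}\|\bar{x}_A - \bar{x}_B\|^2$. The sketch below assumes hypothetical helper names and random test clusters.

```python
import numpy as np

def centroid(C):
    """Mean of all points in a cluster (step 2)."""
    return C.mean(axis=0)

def sse(C):
    """Sum of squared distances from each point to the cluster centroid."""
    return ((C - centroid(C)) ** 2).sum()

def sse_increase(A, B):
    """Closed-form SSE increase from merging A and B (step 3)."""
    n_a, n_b = len(A), len(B)
    gap = centroid(A) - centroid(B)
    return n_a * n_b / (n_a + n_b) * (gap @ gap)

# Hypothetical clusters of different sizes in 3 dimensions.
rng = np.random.default_rng(42)
A = rng.normal(size=(5, 3))
B = rng.normal(loc=2.0, size=(8, 3))

direct = sse(np.vstack([A, B])) - sse(A) - sse(B)
closed_form = sse_increase(A, B)
ward_distance = np.sqrt(closed_form)

print(direct, closed_form)   # the two values agree up to rounding
```

Because the closed form needs only cluster sizes and centroids, each candidate merge can be scored without revisiting individual points.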

---

#### **Why Ward's Linkage?**
- At each merge, Ward’s method picks the pair of clusters whose union increases the **within-cluster variance** (SSE) the least. This greedy choice tends to produce compact, roughly spherical clusters, so it suits datasets whose groups have that shape.

---

#### **Example (Intuitive Walkthrough)**:

- Suppose we have 4 points: $A$, $B$, $C$, $D$.
- Each starts as its own cluster.
- Step-by-step:
  1. Compute centroids of all clusters.
  2. Calculate the SSE increase for merging every pair (e.g., $d(A, B)$, $d(A, C)$, etc.).
  3. Merge the pair with the smallest increase in SSE.
  4. Update centroids and repeat.
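
To make the walkthrough concrete, the sketch below assigns hypothetical coordinates to the four points and runs the merge loop, logging the SSE increase at each step. The coordinates are an assumption; the source does not specify them.

```python
import numpy as np
from itertools import combinations

# Hypothetical coordinates for the four points.
points = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (4.0, 0.0), "D": (4.0, 2.0)}

def sse_increase(P, Q):
    n_p, n_q = len(P), len(Q)
    gap = P.mean(axis=0) - Q.mean(axis=0)
    return n_p * n_q / (n_p + n_q) * (gap @ gap)

# Each point starts as its own cluster, keyed by name.
clusters = {name: np.array([xy]) for name, xy in points.items()}
history = []

while len(clusters) > 1:
    # Score every pair and pick the one with the smallest SSE increase.
    a, b = min(combinations(clusters, 2),
               key=lambda pair: sse_increase(clusters[pair[0]], clusters[pair[1]]))
    delta = float(sse_increase(clusters[a], clusters[b]))
    history.append((a, b, delta))
    # Merge: the new cluster's name is the concatenation of the old names.
    clusters[a + b] = np.vstack([clusters.pop(a), clusters.pop(b)])

for a, b, delta in history:
    print(f"merge {a} + {b}: SSE increase = {delta}")
# merge A + B: SSE increase = 0.5
# merge C + D: SSE increase = 2.0
# merge AB + CD: SSE increase = 16.25
```

The three increases sum to 18.75, which is the total SSE of all four points around their overall mean: every unit of SSE in the final cluster was introduced by some merge.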

---

#### **Dendrogram**:
- The results of Ward's Linkage are often visualized with a **dendrogram**.
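- The height of each merge in the dendrogram reflects the increase in SSE at that step. With the Ward distance as defined above, the height is the square root of the increase; libraries may scale it differently.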
- You can "cut" the dendrogram at a specific height to decide the number of clusters.
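
As a hedged check of how merge heights relate to SSE in one real library: SciPy's `linkage(..., method="ward")` appears to report heights equal to $\sqrt{2\,\Delta}$, where $\Delta$ is the SSE increase of the merge. The sketch below tests that relationship numerically on four hypothetical points rather than taking it for granted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two close pairs, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 2.0]])

Z = linkage(X, method="ward")
heights = Z[:, 2]

# Assumed relationship for SciPy: height = sqrt(2 * SSE increase).
sse_increases = heights ** 2 / 2

# Total SSE of the single final cluster, for comparison.
total_sse = ((X - X.mean(axis=0)) ** 2).sum()

# Cut the tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

print(heights)                          # merge heights, lowest first
print(sse_increases.sum(), total_sse)   # should agree if the assumption holds
print(labels)
```

If the two printed totals match, the heights can be converted back to SSE increases, which helps when choosing a cut height in terms of variance.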