# Hierarchical Cluster Analysis (HCA)

"""
## Introduction
Hierarchical Cluster Analysis (HCA) groups similar objects into clusters.
Unlike flat clustering (e.g., k-means), HCA builds a nested hierarchy of clusters, so the number of clusters does not have to be fixed in advance.
There are two main types of hierarchical clustering:

1. **Agglomerative (bottom-up)**: Each data point starts as its own cluster, and clusters are merged iteratively.
2. **Divisive (top-down)**: All data points start in one cluster, and clusters are split iteratively.

This notebook will focus on agglomerative clustering using the **average linkage** method.

## Steps of HCA
1. Compute the pairwise distances between all points.
2. Identify the two closest clusters and merge them.
3. Update the distance matrix so it holds the distances between the new cluster and the remaining clusters.
4. Repeat steps 2 and 3 until all points are in one cluster.
5. Represent the results using a **dendrogram**.
"""

# StatsPlots re-exports Plots and supplies the dendrogram plot recipe for hclust results.
using Random, Clustering, Distances, StatsPlots
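
"""
The merge loop in steps 2 to 4 can be sketched in a few lines of plain Julia. This is a minimal illustration, not how `hclust` is implemented: `naive_average_linkage` and `average_distance` are hypothetical helpers, and instead of updating a distance matrix in place the sketch recomputes average-linkage distances from the original matrix at every step, which is simpler but slower.
"""

```julia
# Naive average-linkage agglomeration, written for clarity rather than speed.
# Both functions are hypothetical helpers, not part of Clustering.jl.

# Mean of all pairwise distances between the members of clusters a and b.
average_distance(D, a, b) = sum(D[i, j] for i in a, j in b) / (length(a) * length(b))

function naive_average_linkage(D::AbstractMatrix{<:Real})
    clusters = [[i] for i in 1:size(D, 1)]   # every point starts as its own cluster
    history = Tuple{Vector{Int},Vector{Int},Float64}[]
    while length(clusters) > 1
        # Step 2: find the closest pair of clusters.
        best_d, best_a, best_b = Inf, 0, 0
        for a in 1:length(clusters)-1, b in a+1:length(clusters)
            d = average_distance(D, clusters[a], clusters[b])
            if d < best_d
                best_d, best_a, best_b = d, a, b
            end
        end
        # Record the merge, then replace the two clusters with their union.
        push!(history, (copy(clusters[best_a]), copy(clusters[best_b]), best_d))
        append!(clusters[best_a], clusters[best_b])
        deleteat!(clusters, best_b)
    end
    return history
end

# Step 1: pairwise distances for four points on a line at 0, 1, 5 and 7.
coords = [0.0, 1.0, 5.0, 7.0]
D = [abs(p - q) for p in coords, q in coords]

for (a, b, d) in naive_average_linkage(D)
    println("merge ", a, " + ", b, " at distance ", d)
end
```

"""
On this toy input the two tight pairs merge first, and the resulting two groups are joined last at the mean of their four cross distances.
"""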

"""
## 1. Generating Data
We generate two distinct clusters of data points to illustrate HCA.
"""

Random.seed!(42)
# 10×2 matrix, one point per row: 5 points in [1, 2)² and 5 points in [4, 5)²
X = vcat(rand(5, 2) .+ 1, rand(5, 2) .+ 4)

"""
## 2. Computing Pairwise Distances
HCA relies on a distance matrix that quantifies the dissimilarity between every pair of points.
We use **Euclidean distance** to measure the distance between data points.
"""

# Points are stored as rows of X, so transpose and treat columns as observations (dims=2).
dist_matrix = pairwise(Euclidean(), X', dims=2)

display(dist_matrix)  # Show distance matrix
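
"""
As a sanity check on what a Euclidean distance matrix contains, the sketch below builds one by hand for three made-up points using only the standard library. The names `P` and `D` are illustrative assumptions; the result should agree with `pairwise(Euclidean(), P', dims=2)` up to rounding.
"""

```julia
using LinearAlgebra  # norm, issymmetric, diag

# Three 2-D points, one per row (same layout as X above).
P = [0.0 0.0;
     3.0 4.0;
     6.0 8.0]

# Entry (i, j) is the Euclidean distance between rows i and j.
n = size(P, 1)
D = [norm(P[i, :] - P[j, :]) for i in 1:n, j in 1:n]

display(D)
println("symmetric:     ", issymmetric(D))
println("zero diagonal: ", all(iszero, diag(D)))
```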

"""
## 3. Performing Hierarchical Clustering
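We use **average linkage**, which defines the distance between two clusters as the mean of all pairwise distances between their members.
Other linkage methods include:
- **Single linkage**: Distance between the closest pair of points, one from each cluster.
- **Complete linkage**: Distance between the farthest pair of points, one from each cluster.
- **Centroid linkage**: Distance between the centroids of two clusters.

`hclust` from Clustering.jl accepts `:single`, `:average`, `:complete`, `:ward`, and `:ward_presquared`; centroid linkage is listed here for comparison only.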
"""

linkage = hclust(dist_matrix, linkage=:average)
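
"""
To make the linkage definitions in this section concrete, the following sketch computes each criterion by hand for two tiny made-up clusters. The names `A`, `B` and `between` are illustrative assumptions, and only the standard library is used.
"""

```julia
using LinearAlgebra, Statistics  # norm, mean

# Two small 2-D clusters; each inner vector is one point.
A = [[0.0, 0.0], [0.0, 2.0]]
B = [[3.0, 0.0], [3.0, 2.0]]

# All distances between a member of A and a member of B.
between = [norm(a - b) for a in A, b in B]

println("single:   ", minimum(between))            # closest pair
println("complete: ", maximum(between))            # farthest pair
println("average:  ", mean(between))               # mean over all pairs
println("centroid: ", norm(mean(A) - mean(B)))     # distance between cluster means
```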

"""
## 4. Visualizing the Dendrogram
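A dendrogram represents the hierarchical structure of clusters. The height of each join is the linkage distance at which two clusters were merged.
Cutting the dendrogram at a given height yields the clusters that exist at that distance, so the cut height determines the number of clusters.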
"""

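# Leaves are drawn in the order given by linkage.order, so label each tick with the point index it represents.
plot(linkage, xticks=(1:size(X,1), string.(linkage.order)), title="Hierarchical Clustering Dendrogram", xlabel="Data Points", ylabel="Distance")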
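
"""
The dendrogram is drawn from the fields of the object returned by `hclust`. The sketch below inspects those fields on a made-up four-point dataset and shows how a cut height maps to a number of clusters. It assumes merge heights that never decrease, which holds for average linkage; `coords`, `D` and `hc` are illustrative names.
"""

```julia
using Clustering

# Toy data: four points on a line at 0, 1, 5 and 7.
coords = [0.0, 1.0, 5.0, 7.0]
D = [abs(p - q) for p in coords, q in coords]
hc = hclust(D, linkage=:average)

# Each row of `merges` is one merge step. Negative entries refer to original
# points; positive entries refer to the cluster formed in that earlier row.
display(hc.merges)
println("merge heights: ", hc.heights)   # y-coordinate of each join in the dendrogram
println("leaf order:    ", hc.order)     # left-to-right order of the leaves

# Cutting at height h keeps every merge at or below h, so the number of
# clusters is the number of points minus the number of merges kept.
n = length(hc.order)
for h in (0.5, 1.5, 3.0, 6.0)
    println("cut at ", h, " -> ", n - count(x -> x <= h, hc.heights), " clusters")
end
```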

"""
## 5. Additional Examples
To further explore HCA, we generate different datasets and apply hierarchical clustering using various linkage methods.
"""

# Example 1: two Gaussian clusters centered at (3, 3) and (-3, -3), single linkage
Random.seed!(100)
Y = vcat(randn(5, 2) .+ 3, randn(5, 2) .- 3)
dist_matrix_Y = pairwise(Euclidean(), Y', dims=2)
linkage_Y = hclust(dist_matrix_Y, linkage=:single)
# Label each leaf with the index of the point it represents.
plot(linkage_Y, xticks=(1:size(Y,1), string.(linkage_Y.order)), title="HCA with Single Linkage", xlabel="Data Points", ylabel="Distance")

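# Example 2: 15 points drawn uniformly from [0, 10)², with no built-in cluster structure, complete linkage
Z = rand(15, 2) * 10
dist_matrix_Z = pairwise(Euclidean(), Z', dims=2)
linkage_Z = hclust(dist_matrix_Z, linkage=:complete)
# Label each leaf with the index of the point it represents.
plot(linkage_Z, xticks=(1:size(Z,1), string.(linkage_Z.order)), title="HCA with Complete Linkage", xlabel="Data Points", ylabel="Distance")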

"""
## 6. Exercises
Try the following exercises to reinforce your understanding of hierarchical clustering.

### Exercise 1: Implement a Custom Clustering Function
Write a function that takes a dataset and a linkage method and returns the hierarchical clustering. Use it to compare different linkage methods (e.g., `:single`, `:complete`, `:ward`).

### Exercise 2: Generate a Larger Dataset
Create a dataset with 50 points and apply hierarchical clustering. Visualize the results.

### Exercise 3: Experiment with Different Distance Metrics
Modify the distance metric to use **Manhattan** distance instead of Euclidean. Compare the dendrograms.

### Exercise 4: Cut the Dendrogram at a Fixed Height
Write a function to extract cluster assignments by cutting the dendrogram at a predefined distance threshold.
"""

# Solutions

"""
## Solution to Exercise 1
Wrap the distance computation and the call to `hclust` in a function that takes the linkage method as an argument.
"""

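# `points` holds one observation per row; `method` is any linkage symbol accepted by hclust.
function hierarchical_clustering(points, method=:average)
    return hclust(pairwise(Euclidean(), points', dims=2), linkage=method)
end

# Example usage: complete linkage on the two-cluster data X
hc_complete = hierarchical_clustering(X, :complete)
plot(hc_complete)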
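
"""
One way to compare several linkage methods is to loop over them and look at a summary of each tree. The sketch below prints the height of the final merge for each method. It regenerates data with the same layout as `X`; the names `pts`, `D` and `hc_m` are illustrative assumptions.
"""

```julia
using Clustering, Distances, Random

Random.seed!(42)
pts = vcat(rand(5, 2) .+ 1, rand(5, 2) .+ 4)   # one point per row, two groups of five
D = pairwise(Euclidean(), pts, dims=1)         # dims=1: rows are observations

for method in (:single, :average, :complete, :ward)
    hc_m = hclust(D, linkage=method)
    # The largest height is the distance at which the last two clusters merge.
    println(rpad(string(method), 9), " final merge height = ", round(maximum(hc_m.heights), digits=3))
end
```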

"""
## Solution to Exercise 2
Generate a larger dataset and apply clustering.
"""

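Random.seed!(123)
large_X = rand(50, 2) * 10  # 50 points drawn uniformly from [0, 10)²
dist_large = pairwise(Euclidean(), large_X', dims=2)
linkage_large = hclust(dist_large, linkage=:average)
# Label each leaf with the index of the point it represents.
plot(linkage_large, xticks=(1:size(large_X,1), string.(linkage_large.order)), title="HCA on Large Dataset", xlabel="Data Points", ylabel="Distance")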

"""
## Solution to Exercise 3
Use Manhattan distance instead of Euclidean.
"""

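# Cityblock() is the Distances.jl name for Manhattan (L1) distance.
dist_manhattan = pairwise(Cityblock(), X', dims=2)
linkage_manhattan = hclust(dist_manhattan, linkage=:average)
# Label each leaf with the index of the point it represents.
plot(linkage_manhattan, xticks=(1:size(X,1), string.(linkage_manhattan.order)), title="HCA with Manhattan Distance", xlabel="Data Points", ylabel="Distance")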

"""
## Solution to Exercise 4
Extract clusters by cutting the dendrogram at a threshold.
"""

# Keep only the merges at or below `threshold`; `cutree` returns one cluster label per point.
function cut_dendrogram(tree, threshold)
    return cutree(tree, h=threshold)
end

clusters = cut_dendrogram(linkage, 1.5)
println("Cluster assignments: ", clusters)
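
"""
As a complement to cutting at a distance threshold, `cutree` also accepts a target number of clusters through its `k` keyword. The sketch below, on a made-up four-point dataset, compares both ways of cutting and counts the points per cluster. The names `coords`, `D`, `hc`, `by_height` and `by_count` are illustrative assumptions.
"""

```julia
using Clustering

# Toy data: four points on a line at 0, 1, 5 and 7.
coords = [0.0, 1.0, 5.0, 7.0]
D = [abs(p - q) for p in coords, q in coords]
hc = hclust(D, linkage=:average)   # merges happen at heights 1.0, 2.0 and 5.5

by_height = cutree(hc, h=3.0)   # cut between the merges at 2.0 and 5.5
by_count  = cutree(hc, k=2)     # ask for two clusters directly

println("h = 3.0 -> ", by_height)
println("k = 2   -> ", by_count)

# Cluster sizes: how many points received each label.
for label in sort(unique(by_height))
    println("cluster ", label, ": ", count(x -> x == label, by_height), " points")
end
```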
