# STA663 Final Project Report

### Abstract

We implement the Bayesian hierarchical clustering algorithm described by Heller and Ghahramani (2005) and optimize its speed using Cython.

### Background

Hierarchical clustering is an unsupervised learning algorithm that organizes data into a binary tree. By cutting the tree at various heights, one can obtain different clusterings of the data. The traditional approach is agglomerative hierarchical clustering. This algorithm starts with each data point in its own cluster and works bottom-up, iteratively merging the two most similar clusters until all the data have been grouped into one cluster. The similarity of clusters is evaluated using a pre-specified distance measure (e.g. Euclidean distance).
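
The bottom-up merging described above can be sketched in a few lines of plain Python. This is only an illustration: the single-linkage rule (cluster distance = distance between the closest members), the function name `single_linkage`, and the toy points are all assumptions, not part of the project code.

```python
import math
from itertools import combinations

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    # Every point starts in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]

    def distance(a, b):
        # Single linkage (assumed): smallest Euclidean distance between members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Pick the pair of clusters (i < j) that are closest to each other.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters

# Toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.3, 4.9), (9.0, 0.0)]
clusters = single_linkage(points, n_clusters=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Note that `n_clusters` has to be supplied by the user; nothing in the procedure itself indicates where to stop merging.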

The major drawback of the traditional hierarchical clustering method is that it provides no principled guidance for choosing the correct number of clusters or an appropriate distance measure. Furthermore, the algorithm does not define a probabilistic model for the data, so it cannot give predictions or assign new data points to existing clusters.

Heller and Ghahramani (2005) proposed the Bayesian hierarchical clustering model to overcome these problems. The model is similar to the traditional hierarchical clustering method, but it decides which clusters to merge using the marginal likelihood of the data rather than a distance measure. The major advantage of Bayesian hierarchical clustering over the traditional method is that it has a natural way to choose the number of clusters, and it can also give the predictive probability that new data belong to each existing cluster. The disadvantage is that the algorithm is more computationally intensive, since it requires calculating marginal likelihoods at each step. Bayesian hierarchical clustering works best when the user has no prior knowledge of the number of clusters, or when the user wants to predict the probability of new data belonging to any existing cluster.

### Description of algorithm

The Bayesian hierarchical clustering algorithm starts with every data point in its own node. It then iteratively merges the pair of nodes with the highest merge probability until only a single node is left (all data points in the same node). The merge probability is the ratio of the prior-weighted marginal likelihood of the two nodes' data belonging to a single cluster to the marginal likelihood of the data under all partitions that do not violate the existing tree structure. The tree can then be cut at the nodes where the merge probability is less than 50%.

In mathematical terms, if we represent the hypothesis that all the data in node $k$ belong to the same cluster by $H_1^k$ and the tree rooted at node $k$ by $T_k$, then we can write out the algorithm as follows:

1. Input: data = {$x^{(1)}, \ldots, x^{(n)}$}, model (likelihood) $p(x \mid \theta)$, prior $p(\theta \mid \beta)$

2. Initialize: number of clusters $c = n$, $D_i = $ {$x^{(i)}$} for $i = 1, \ldots, n$

3. while $c > 1$ do

    - Find the pair of clusters $D_i$ and $D_j$ with the highest merge probability $r_k = \frac{\pi_k p(D_k \mid H_1^k)}{p(D_k \mid T_k)}$

    - Merge $D_i$ and $D_j$ into $D_k = D_i \cup D_j$, and $T_i$ and $T_j$ into $T_k$

    - Delete $D_i$ and $D_j$, and set $c = c - 1$

    end while

To calculate the merge probabilities, we need to compute $p(D_k \mid T_k)$, which is
$$
p(D_k \mid T_k) = \pi_k p(D_k \mid H_1^k) + (1 - \pi_k) p(D_i \mid T_i)p(D_j \mid T_j)
$$
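
Marginal likelihoods of even moderately sized clusters underflow double precision, so the recursion above is usually evaluated on log probabilities. The following is a minimal sketch of that idea; the helper names `log_add` and `merge_scores` and the example numbers are hypothetical and are not taken from the project code.

```python
import math

def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving log space."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def merge_scores(log_pi, log_h1, log_tree_i, log_tree_j):
    """Return (log p(D_k | T_k), log r_k) for a candidate merge.

    log_pi      -- log pi_k, with 0 < pi_k < 1 (true for any internal node)
    log_h1      -- log p(D_k | H_1^k)
    log_tree_i  -- log p(D_i | T_i)
    log_tree_j  -- log p(D_j | T_j)
    """
    log_merged = log_pi + log_h1                      # log(pi_k p(D_k | H_1^k))
    log_split = (math.log1p(-math.exp(log_pi))        # log(1 - pi_k)
                 + log_tree_i + log_tree_j)
    log_tree_k = log_add(log_merged, log_split)       # log p(D_k | T_k)
    return log_tree_k, log_merged - log_tree_k        # second item is log r_k

# Made-up log likelihoods that are far too small to exponentiate directly.
log_tree_k, log_r = merge_scores(math.log(0.5), -1000.0, -501.0, -501.0)
print(math.exp(-1000.0))   # 0.0, the naive product underflows
print(math.exp(log_r))     # merge probability r_k, about 0.88
```

In this example the merged and split terms differ by a factor of $e^{2}$, so $r_k = 1 / (1 + e^{-2})$ even though neither term is representable as an ordinary float.
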
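The term $\pi_k$ can be calculated bottom-up as the tree is built, where $n_k$ is the number of data points in node $k$ and $\alpha$ is the concentration parameter of the Dirichlet process mixture model:

1. Initialize: each leaf $i$ with $d_i = \alpha, \pi_i = 1$

2. for each internal node $k$ do

    - $d_k = \alpha \Gamma(n_k) + d_{\text{left}_k} d_{\text{right}_k}$

    - $\pi_k = \frac{\alpha \Gamma(n_k)}{d_k}$

    end for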
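
Putting the pieces together, the whole procedure can be sketched in plain Python. This is a toy illustration under stated assumptions, not the project's Cython implementation: it assumes binary data with an independent Beta-Bernoulli model per column, $\alpha = 1$, and a Beta(1, 1) prior, and every name in it (`Node`, `bhc`, `log_h1`, ...) is hypothetical. All quantities are kept in log space, and $1 - \pi_k$ is computed as $d_{\text{left}_k} d_{\text{right}_k} / d_k$.

```python
import math
from itertools import combinations

ALPHA = 1.0      # concentration parameter alpha (assumed value)
A, B = 1.0, 1.0  # Beta(a, b) prior on each Bernoulli column (assumed)

def log_add(x, y):
    """log(exp(x) + exp(y)) computed stably."""
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_h1(rows):
    """log p(D | H_1): all rows share one Bernoulli parameter per column."""
    n = len(rows)
    return sum(log_beta(A + m, B + n - m) - log_beta(A, B)
               for m in (sum(col) for col in zip(*rows)))

class Node:
    def __init__(self, rows, members, left=None, right=None):
        self.rows, self.members = rows, members
        self.left, self.right = left, right
        if left is None:
            # Leaf: d_i = alpha, pi_i = 1, so p(D_i | T_i) = p(D_i | H_1).
            self.log_d = math.log(ALPHA)
            self.log_tree = log_h1(rows)
            self.log_r = 0.0
        else:
            log_ag = math.log(ALPHA) + math.lgamma(len(rows))  # log(alpha Gamma(n_k))
            log_dd = left.log_d + right.log_d                  # log(d_left d_right)
            self.log_d = log_add(log_ag, log_dd)               # log d_k
            # log(pi_k p(D_k | H_1^k)), with pi_k = alpha Gamma(n_k) / d_k
            merged = log_ag - self.log_d + log_h1(rows)
            # log((1 - pi_k) p(D_i | T_i) p(D_j | T_j))
            split = log_dd - self.log_d + left.log_tree + right.log_tree
            self.log_tree = log_add(merged, split)             # log p(D_k | T_k)
            self.log_r = merged - self.log_tree                # log r_k

def bhc(data):
    """Greedy merging; returns a list of (sorted member indices, r_k)."""
    nodes = [Node([row], [i]) for i, row in enumerate(data)]
    merges = []
    while len(nodes) > 1:
        # Score every candidate pair and keep the most probable merge.
        best = max((Node(a.rows + b.rows, a.members + b.members, a, b)
                    for a, b in combinations(nodes, 2)),
                   key=lambda k: k.log_r)
        nodes = [n for n in nodes
                 if n is not best.left and n is not best.right] + [best]
        merges.append((sorted(best.members), math.exp(best.log_r)))
    return merges

# Two groups of binary vectors with mostly opposite patterns.
data = [(1, 1, 1, 0, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 0, 0, 0, 0),
        (0, 0, 0, 1, 1, 1), (0, 0, 0, 1, 1, 1), (0, 0, 1, 1, 1, 1)]
merges = bhc(data)
for members, r in merges:
    print(members, round(r, 4))
```

On this toy input the merges within each group have $r_k > 0.5$, while the final merge that joins the two groups has $r_k$ far below 0.5, so cutting the tree at that node recovers the two groups. The sketch rescores every pair on each iteration for clarity; caching the pairwise scores would avoid the repeated work.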

### Optimization for performance

### Applications to simulated data sets

In [16]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [None]:
def purity_score(cluster_true, cluster_pred):
    """
    Return the purity score of each predicted cluster, given the true
    and the predicted cluster label of every data point.
    """

    def most_common(lst):
        """Helper function that finds the most frequent item in a list"""
        return max(set(lst), key=lst.count)

    cluster_pred = np.asarray(cluster_pred)
    purity = []
    # np.unique returns the labels sorted, so the scores come back in a stable order
    for cluster in np.unique(cluster_pred):
        # True labels of the points assigned to this predicted cluster
        cluster_element = [cluster_true[index] for index in np.where(cluster_pred == cluster)[0]]
        dominant_item = most_common(cluster_element)
        # Fraction of the cluster's points that carry its dominant true label
        purity.append(cluster_element.count(dominant_item) / len(cluster_element))

    return purity
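
As a quick check of what purity measures, the sketch below computes it on a small hand-made labelling. So that it runs on its own, it uses a hypothetical stand-alone variant, `purity_by_cluster`, built on `collections.Counter`; it returns a dictionary keyed by predicted label so that each score stays attached to its cluster.

```python
from collections import Counter

def purity_by_cluster(cluster_true, cluster_pred):
    """Map each predicted cluster label to the share of its dominant true label."""
    groups = {}
    for truth, pred in zip(cluster_true, cluster_pred):
        groups.setdefault(pred, []).append(truth)
    return {pred: Counter(members).most_common(1)[0][1] / len(members)
            for pred, members in groups.items()}

truth = ["a", "a", "a", "b", "b", "c"]
pred = [0, 0, 1, 1, 1, 1]
scores = purity_by_cluster(truth, pred)
print(scores)  # {0: 1.0, 1: 0.5}
```

Cluster 0 contains only points labelled `a`, so its purity is 1.0; cluster 1 mixes three true labels and its dominant label `b` covers two of its four points, giving 0.5.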

### Applications to real data sets
We tested our BHC algorithm on the glass dataset mentioned in the original paper, as well as the famous iris dataset by Fisher. Both are available from the UCI Machine Learning Repository (Dua and Graff, 2019).

In [20]:
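# Neither file has a header row; the first column of glass.data is an ID used as the index
glass = pd.read_csv("data/glass.data", index_col=0, header=None)
iris = pd.read_csv("data/iris.data", header=None)
# Keep only the feature columns (9 for glass, 4 for iris); the last column holds the class label
glass_x = glass.iloc[:, 0:9]
iris_x = iris.iloc[:, 0:4]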

### Comparative analysis with competing algorithms

### Conclusion

### References

Heller, K. A., & Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05), 297–304. doi:10.1145/1102351.1102389

Dua, D., & Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.