# Principal Component Analysis and Clustering of Hospital Quarterly Financial Data
### Jeevitha Bandi, Doah Karabey, Elling Payne

## 2.4 Technical Background - Agglomerative Hierarchical Clustering

In hierarchical clustering, the goal is to build a hierarchical map of relationships between clusters. In agglomerative hierarchical clustering (AHC), this is done iteratively, starting from the lowest level: the relationships between individual data points, each of which begins as its own cluster. At each iteration, the two most similar clusters are linked, and from the next iteration onward they are treated as a single cluster. This process continues until all clusters have been linked together, so each iteration yields a clustering with one fewer cluster than the last. Which two clusters are most similar at a given step depends mainly on the _linkage type_ and the _distance measure_ used. The most common _distance measure_ is the Euclidean distance between two points in feature space. The _linkage type_ determines which points in each cluster matter for calculating the distance between clusters, and how the point-to-point distances are aggregated into a single distance for each cluster pair. Three of the most common are _single linkage_, _complete linkage_, and _average linkage_.
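
The iteration described above can be sketched in a few lines of plain Python. This is a naive illustration under stated assumptions, not the implementation used in this project: the function name `agglomerate`, the `LINKAGES` table, and the example points are all made up for the example, and Euclidean distance is assumed.

```python
from itertools import combinations
from math import dist
from statistics import mean

# How each linkage type aggregates the pairwise point distances.
LINKAGES = {"single": min, "complete": max, "average": mean}


def agglomerate(points, linkage="average"):
    """Naive agglomerative clustering; returns the merge history.

    Each history entry is (cluster_a, cluster_b, distance), where the
    clusters are tuples of point indices.
    """
    aggregate = LINKAGES[linkage]
    clusters = [(i,) for i in range(len(points))]  # start: one cluster per point
    history = []
    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = aggregate([dist(points[i], points[j]) for i in a for j in b])
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        # Link the pair: from now on they are treated as a single cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, d))
    return history


# Illustrative 2-D points (made up, not the hospital data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (9.0, 9.0)]
history = agglomerate(points, linkage="complete")
for a, b, d in history:
    print(a, b, round(d, 3))
```

With five points the history has four entries, one per merge. The sketch recomputes every pairwise distance at every step, so it is only suitable for small examples; library routines such as SciPy's `scipy.cluster.hierarchy.linkage` are far more efficient.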

Compared with K-means clustering, hierarchical clustering produces multiple clustering solutions, with cluster counts ranging from one to the size of the data set, rather than requiring that the number of clusters be set beforehand. As a bonus, the process of building the hierarchy can be represented nicely in a _dendrogram_. The height at which two branches of a dendrogram join corresponds to the linkage distance between the two clusters merged at that step, which are the most similar pair of clusters at that point. Choosing a clustering step corresponds to cutting the dendrogram at a particular height. Put another way, cutting the dendrogram at a particular height selects the minimum dissimilarity at which two clusters are considered truly different. Only those cluster splits that result in clusters at least that dissimilar are retained.
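
Cutting at a height can be sketched as replaying only the merges that happen at or below that height. The function `cut_dendrogram` and the merge list below are hypothetical, written for illustration; the sketch assumes merge heights that never decrease from one step to the next, which holds for single, complete, and average linkage.

```python
def cut_dendrogram(n_points, merges, height):
    """Return a cluster label per point after cutting at `height`.

    `merges` lists (point_i, point_j, merge_height): any one member of each
    of the two clusters being joined, and the dissimilarity at which they join.
    """
    parent = list(range(n_points))  # union-find forest, one root per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j, merge_height in merges:
        if merge_height <= height:  # links above the cut are ignored
            parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n_points)]


# Made-up merge history for five points (not the project's results).
merges = [(0, 1, 1.0), (2, 3, 1.5), (0, 2, 5.2), (0, 4, 12.7)]
print(cut_dendrogram(5, merges, height=3.0))  # [0, 0, 1, 1, 2]
print(cut_dendrogram(5, merges, height=6.0))  # [0, 0, 0, 0, 1]
```

Raising the cut height keeps more links and so yields fewer clusters. In SciPy, `scipy.cluster.hierarchy.fcluster` with `criterion="distance"` performs a comparable cut on a linkage matrix.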

### 2.4.1 Single Linkage
With single linkage, the linkage distance between two clusters depends only on the closest pair of points, one from each cluster. This might be useful when the closest points are the most important for understanding the relationships of interest. For instance, one can imagine a social process in which two groups are more likely to come into contact if each has a member who can initiate contact with someone in the other group. If the goal were to merge friend groups, it is conceivable that the most similar members of two groups matter more than the dissimilar ones. In practice, however, such situations are fairly uncommon, and single linkage tends to produce quite unbalanced clusters. This is because the largest clusters at any given step are the most likely to contain the closest pair of points, so they tend to keep absorbing their neighbors.
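
As a minimal sketch, assuming Euclidean distance and using made-up points, the single linkage distance is the minimum over all cross-cluster pairs. The function name `single_linkage` is chosen for this example only.

```python
from math import dist


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


# Made-up clusters: B has one point near A and one far away.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(2.0, 0.0), (9.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))  # 1.0
```

The distant point at `(9.0, 0.0)` has no effect on the result, which is why a single close pair is enough to link two otherwise dissimilar clusters.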


### 2.4.2 Complete Linkage
In complete linkage, the distance between two clusters is based on the pair of points, one from each cluster, that are farthest from each other. This might be considered the most conservative of the three linkage types mentioned, in that it tends to enforce a more balanced clustering. In contrast with single linkage, the two largest clusters are now the least likely to be the most similar. However, in some scenarios this approach might merge a single outlier into a cluster before merging another cluster that, by other measures, might be considered more similar.
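
A minimal sketch of the complete linkage distance, again assuming Euclidean distance and made-up points, takes the maximum over all cross-cluster pairs. The function name `complete_linkage` is hypothetical.

```python
from math import dist


def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)


# Made-up clusters: B has one point near A and one far away.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(2.0, 0.0), (9.0, 0.0)]
print(complete_linkage(cluster_a, cluster_b))  # 9.0
```

Here the one distant point at `(9.0, 0.0)` determines the result on its own, which illustrates how sensitive complete linkage is to extreme points within a cluster.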

### 2.4.3 Average Linkage
Average linkage uses the average of all pairwise distances between points in the two clusters. This is not to be confused with _centroid linkage_, which measures the distance between the two cluster centroids and is not used in this project. Average linkage is a compromise between single and complete linkage. Like complete linkage, it tends to create more balanced clusters than single linkage, but to a lesser degree.
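
The difference between average linkage and centroid linkage can be seen in a small sketch. It assumes Euclidean distance; the function names and the points are made up for illustration.

```python
from math import dist
from statistics import mean


def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances between the two clusters."""
    return mean([dist(p, q) for p in cluster_a for q in cluster_b])


def centroid_distance(cluster_a, cluster_b):
    """Distance between the two cluster centroids (shown for contrast only)."""
    centroid_a = tuple(mean(coords) for coords in zip(*cluster_a))
    centroid_b = tuple(mean(coords) for coords in zip(*cluster_b))
    return dist(centroid_a, centroid_b)


# Made-up clusters chosen so that the two measures disagree.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(1.0, 1.0)]
print(round(average_linkage(cluster_a, cluster_b), 3))  # 1.414
print(centroid_distance(cluster_a, cluster_b))          # 1.0
```

Both points of the first cluster lie at distance √2 from the second cluster's point, so the average is √2, while the centroid `(0.0, 1.0)` lies only 1.0 away.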

## 4.4 Results - Agglomerative Hierarchical Clustering

### 4.4.1 Single Linkage

![Single Linkage Clusters by K](data/hierarchical_clustering/single_linkage_clusters.png)

### 4.4.2 Complete Linkage

### 4.4.3 Average Linkage

## 5.1 Discussion - Singular Values and Scree Plots

## 5.4 Discussion - Clustering Interpretations

The hope is that these clusterings represent meaningful groups, where the clustering unit is not the hospital but an individual financial quarter at a given hospital. One expected outcome, then, is that different quarters from the same hospital will tend to fall in the same cluster unless some drastic change occurred at that hospital. This is visible in the dendrogram for agglomerative clustering with complete linkage, in which quarterly records from the same hospital typically occur in the same groups.
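
One possible way to quantify this tendacy-free of visual inspection is to compute, for each hospital, the share of its quarters that land in its most common cluster. The sketch below is an assumption about how such a check might look, not an analysis performed in this project; the function name `hospital_cohesion`, the hospital identifiers, and the labels are all hypothetical.

```python
from collections import Counter, defaultdict


def hospital_cohesion(hospital_ids, cluster_labels):
    """For each hospital, the share of its quarters in its most common cluster."""
    by_hospital = defaultdict(list)
    for hospital, label in zip(hospital_ids, cluster_labels):
        by_hospital[hospital].append(label)
    return {
        hospital: Counter(labels).most_common(1)[0][1] / len(labels)
        for hospital, labels in by_hospital.items()
    }


# Hypothetical records: one entry per hospital-quarter (not the project's data).
hospital_ids = ["A", "A", "A", "A", "B", "B", "B", "B"]
cluster_labels = [0, 0, 0, 0, 1, 1, 2, 1]
print(hospital_cohesion(hospital_ids, cluster_labels))  # {'A': 1.0, 'B': 0.75}
```

A value of 1.0 means every quarter from that hospital shares a cluster; lower values flag hospitals whose quarters are split, which could point to a change worth investigating.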

## 7 References