## Hierarchical Clustering (Seeding)
Each year, 64 teams qualify for the NCAA March Madness tournament. 
A committee groups these 64 teams into 16 seeds. 
Each seed consists of 4 teams with approximately the same performance, 
with seed number 1 containing the four best teams and seed number 16 containing 
the four weakest teams.

The task is to cluster the tournament teams based on their season statistics 
and then compare the clustering to the committee's decision.
Hierarchical clustering is used because it yields groupings with a variable number of clusters, 
depending on where the tree is cut. As the distinction between 16 seeds might be too
hard, an even number of seeds can be merged into a single group, and the clustering validation
can be compared for different numbers of seed groups.
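
The merging of seeds into coarser groups can be sketched as follows. This is a minimal illustration that assumes adjacent seeds are merged into equally sized groups (for example seeds 1 and 2 together); the helper name `merge_seeds` is hypothetical and the text does not state which seeds were actually combined.

```python
N_SEEDS = 16  # number of seeds assigned by the committee


def merge_seeds(seed: int, n_groups: int) -> int:
    """Map a seed (1..16) to a coarser group index (0..n_groups-1).

    Assumption: adjacent seeds are merged into equally sized groups.
    """
    if N_SEEDS % n_groups != 0:
        raise ValueError("n_groups must divide 16")
    seeds_per_group = N_SEEDS // n_groups
    return (seed - 1) // seeds_per_group


# Show the group index of every seed for each coarseness level.
for n_groups in (2, 4, 8, 16):
    print(n_groups, [merge_seeds(s, n_groups) for s in range(1, N_SEEDS + 1)])
```

With `n_groups=16` every seed keeps its own group, which corresponds to the committee's original seeding.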

The histograms of the feature distributions show that all 14 features used
are normalized to the interval from 0 to 1.

<img src="images/max_datahistogram.png" style="float: left margin-right: 10px;"/>
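
One common way to bring features into the range from 0 to 1 is min-max scaling. The sketch below is only an illustration: the text does not say how the features were normalized, and the random matrix is placeholder data, not the real season statistics.

```python
import numpy as np


def min_max_scale(X: np.ndarray) -> np.ndarray:
    """Rescale each column of X linearly to the interval [0, 1]."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / col_range


# Placeholder data standing in for 14 season statistics of 64 teams.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=70.0, scale=10.0, size=(64, 14))
X_scaled = min_max_scale(X_raw)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```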

Through PCA, the data can be visualized in 3D using the first three principal components. If only data from a single year is selected, it is hard to group the data points into clusters using only these three components. 

<img src="/images/max_3dpca_2018.png" style="float: center margin-right: 10px;"/>

If the data from all nine years is plotted, some patterns and clusters become visible, but it is still hard to distinguish between some of the overlapping seed clusters.

<img src="/images/max_3dpca.png" style="float: center margin-right: 10px;"/>

Principal component analysis tells us that at least 9 of the 14 principal components are needed to retain **95.1%** of the variance. This can be seen in the following scree plot.

<img src="/images/max_screeplot_seeding.png" style="float: center margin-right: 10px;"/>
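
The number of components needed for a given retained variance can be read off the cumulative explained variance. The following sketch computes it with a plain SVD; the function name `n_components_for_variance` and the random input are assumptions for illustration and will not reproduce the 95.1% figure.

```python
import numpy as np


def n_components_for_variance(X: np.ndarray, target: float = 0.95) -> int:
    """Smallest number of principal components retaining `target` variance."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    # Index of the first cumulative value reaching the target, as a count.
    return int(np.searchsorted(cumulative, target) + 1)


rng = np.random.default_rng(0)
X = rng.random((64, 14))  # placeholder for the normalized features
print(n_components_for_variance(X, 0.95))
```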

For hierarchical clustering, two important settings need to be considered. One is the distance metric, for which Euclidean distance is commonly used. The other is the linkage, which can be single, complete, average, and so on. Average linkage was chosen here because it is commonly recommended for general tasks, avoids chaining, and tends to produce evenly sized groups, which is necessary for our task. The effect can be seen in the following dendrogram.

<img src="/images/max_dendrogram.png" style="float: center margin-right: 10px;"/>
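
A clustering with these settings could be set up with SciPy roughly as follows. This is a sketch under assumptions: the random matrix stands in for the real team features, and the cluster counts 2, 4, 8 and 16 are chosen to mirror possible seed groupings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.random((64, 14))  # placeholder for normalized team features

# Build the merge tree with average linkage and Euclidean distance.
Z = linkage(X, method="average", metric="euclidean")

# Cut the same tree at several cluster counts.
labels_by_k = {k: fcluster(Z, t=k, criterion="maxclust") for k in (2, 4, 8, 16)}
for k, labels in labels_by_k.items():
    # Cluster labels start at 1, so drop the empty bin for label 0.
    print(k, np.bincount(labels)[1:])
```

Because the tree is built once and only the cut changes, comparing different numbers of seed groups does not require re-running the clustering.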

After clustering is performed, we evaluate how good it is by means of different validation metrics.

<img src="/images/max_cluster_val.png" style="float: center margin-right: 10px;"/>

Purity indicates to what extent each cluster consists of data points from a single ground-truth class, with 1 being the best value. For the given task, purity is highest when we partition into only two seed groups and decreases as the number of seed groups grows, as expected.
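
Purity can be computed by counting, for each cluster, the members of its most frequent ground-truth class. The helper `purity_score` below is a hypothetical name for a minimal sketch, shown on toy labels rather than the tournament data.

```python
import numpy as np


def purity_score(labels_true, labels_pred) -> float:
    """Fraction of points that belong to the majority class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    majority_total = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        # Size of the most frequent true class within this cluster.
        _, counts = np.unique(members, return_counts=True)
        majority_total += counts.max()
    return majority_total / labels_true.size


truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(purity_score(truth, pred))  # 5 of 6 points match their cluster's majority
```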

Mutual information (MI) measures the amount of information shared between the clustering and the ground truth. The adjusted variant (AMI) corrects for chance agreement; without this correction, a higher number of clusters would tend to give a better MI score. Larger values indicate a better clustering. Our results show a slight decrease in AMI for an increased number of seeds.
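
The effect of the adjustment can be illustrated with scikit-learn by scoring purely random cluster labels. The data here is synthetic and only meant as a sketch: MI should grow with the number of random clusters, while AMI should stay near zero.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

rng = np.random.default_rng(0)
truth = rng.integers(0, 4, size=1000)

# Random cluster labels carry no real information about the truth.
for k in (2, 16, 64):
    random_labels = rng.integers(0, k, size=1000)
    mi = mutual_info_score(truth, random_labels)
    ami = adjusted_mutual_info_score(truth, random_labels)
    print(f"k={k:2d}  MI={mi:.3f}  AMI={ami:.3f}")

# A relabeled copy of the truth is still a perfect clustering.
permuted = (truth + 1) % 4
ami_perfect = adjusted_mutual_info_score(truth, permuted)
print(ami_perfect)
```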

The Rand index is a pairwise measure: the fraction of true positives and true negatives over the total number of pairs. The adjusted Rand index (ARI) is centered and normalized to adjust for chance. Negative values are bad, values close to zero correspond to random labeling, and a score of one means that the cluster assignments are identical up to label permutations. For our task, the ARI decreases slightly with the number of seeds.
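
The pair counting behind both scores can be written out directly. The function `rand_scores` is a hypothetical helper for this sketch; it uses the standard contingency-based formula for the ARI and assumes at least two data points.

```python
from collections import Counter
from math import comb


def rand_scores(labels_true, labels_pred):
    """Return (Rand index, adjusted Rand index) for two labelings."""
    total_pairs = comb(len(labels_true), 2)
    # Pairs grouped together in both labelings (true positives).
    same_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Pairs separated in both labelings (true negatives).
    diff_both = total_pairs - same_true - same_pred + same_both
    rand = (same_both + diff_both) / total_pairs
    # Chance correction: subtract the expected index, divide by the maximum gap.
    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return rand, 1.0
    return rand, (same_both - expected) / (max_index - expected)


print(rand_scores([0, 0, 1, 2], [0, 0, 1, 1]))
print(rand_scores([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
```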

An overall trend of decreasing performance can be observed as the partitioning of the teams into seeds gets finer. With a purity below 20% for the regular 16 seeds, the task turns out to be harder than expected given the provided features.


### Hierarchical Clustering (Tournament Qualification)

Another clustering task is to determine which teams qualified for the tournament.
The 3D visualization via PCA applied to the full dataset (not just qualified teams) shows that the clusters of qualified vs. non-qualified teams appear easier to distinguish.


<img src="/images/max_3dpca_qual.png" style="float: center margin-right: 10px;"/>

This is also visible in the decrease in variance of the principal components. Over **50%** of the variance is contained in the first component, whereas the other components only have minor contributions.

<img src="/images/max_screeplot_qual.png" style="float: center margin-right: 10px;"/>

We need 9 components to recover **96.2%** of the variance.
A purity of **82%** for the qualified vs. non-qualified clustering is reached, as both clusters contain mainly points from one ground-truth class (qualified or non-qualified). But from the contingency matrix 

| Clustering/Qualification       | NQ   | Q  |
|:-----------------------------:|:----:|:---:|
| **C1**                            | 2521 | 576 |
| **C2**                            | 28   |   0 |

we can see that both clusters are dominated by the same ground-truth class (non-qualified), so the clustering does not separate the two groups. This results in poor ARI and AMI scores close to 0. These results suggest that qualified and non-qualified teams are not properly separable through hierarchical clustering with the distance metric used. The heavy imbalance toward non-qualified teams (only 64 teams make the tournament each year) may be a contributing factor.
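
As a check, the scores can be recomputed from the contingency matrix alone. The sketch below expands the four counts into label arrays and passes them to scikit-learn; the variable names are illustrative, and only the counts are taken from the table.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Contingency matrix from the table: rows are clusters C1 and C2,
# columns are non-qualified (NQ) and qualified (Q).
contingency = np.array([[2521, 576],
                        [28, 0]])

# Purity: majority-class count per cluster, summed, over all points.
purity = contingency.max(axis=1).sum() / contingency.sum()

# Expand the counts into one (cluster, class) label pair per team.
cluster_idx, class_idx = np.indices(contingency.shape)
counts = contingency.ravel()
labels_pred = np.repeat(cluster_idx.ravel(), counts)
labels_true = np.repeat(class_idx.ravel(), counts)

ari = adjusted_rand_score(labels_true, labels_pred)
ami = adjusted_mutual_info_score(labels_true, labels_pred)
print(f"purity={purity:.3f}  ARI={ari:.3f}  AMI={ami:.3f}")
```

The purity evaluates to 2549/3125, roughly 0.82, which matches the reported value even though cluster C2 contains no qualified team at all. This shows why purity alone is a weak indicator on imbalanced data.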