# Dimension Reduction and Clustering

<style>body {text-align: justify}</style>

Unsupervised machine learning consists of working with unlabeled data, typically in order to create clusters, or groups, of observations sharing similar features. Contrary to supervised learning, the dataset only stores observations, with no associated labels:

$$
\mathcal{D} = (X_1, \dots, X_n) \in \mathcal{X}^n
$$

However, the dimension of $ \mathcal{X} $ often makes it hard to work with. Therefore, one of the first tasks in unsupervised machine learning is to reduce the dimension of the space. This can be done with a map $ \phi $ from $ \mathcal{X} $ to a new space $ \mathcal{X}' $ of smaller dimension. Two criteria deserve a close look in this process: the reconstruction error obtained when mapping back with a map $ \tilde{\phi} $ from $ \mathcal{X}' $ to $ \mathcal{X} $, i.e. how far $ \tilde{\phi}(\phi(X_i)) $ is from $ X_i $, and relationship preservation, i.e. $(\phi(X_i),\phi(X_j))$ should have a relationship similar to that of $ (X_i,X_j)$.
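As a minimal sketch of these two criteria, the snippet below builds a linear map $ \phi $ and its companion $ \tilde{\phi} $ with NumPy, then measures a relative reconstruction error and how well pairwise distances are preserved. The data, the sizes `n`, `d`, `k`, the matrix `W` (drawn at random purely for illustration) and the two metrics are assumptions made for this example, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n observations in dimension d, reduced to dimension k
# (hypothetical sizes chosen for illustration).
n, d, k = 200, 50, 10
X = rng.normal(size=(n, d))

# A linear map phi: R^d -> R^k defined by a matrix W with orthonormal columns.
# W is random here; a data-driven method would choose it from X instead.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))


def phi(X):
    """Map observations to the reduced space X'."""
    return X @ W


def phi_tilde(Z):
    """Map reduced points back to the original space X."""
    return Z @ W.T


def pairwise_distances(A):
    """Euclidean distances between all rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding


Z = phi(X)
X_hat = phi_tilde(Z)

# Criterion 1: relative reconstruction error (0 means perfect reconstruction).
reconstruction_error = np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2

# Criterion 2: relationship preservation, measured here as the correlation
# between pairwise distances before and after the reduction (pairs i < j).
iu = np.triu_indices(n, k=1)
dist_original = pairwise_distances(X)[iu]
dist_reduced = pairwise_distances(Z)[iu]
distance_correlation = np.corrcoef(dist_original, dist_reduced)[0, 1]

print(f"relative reconstruction error: {reconstruction_error:.3f}")
print(f"distance correlation:          {distance_correlation:.3f}")
```

A random `W` ignores the structure of the data, so both scores are expected to be mediocre here; the point is only to show how each criterion can be quantified.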

The curse of high-dimensional geometry (folk theorem):

- If $X_1, \dots, X_n$ are points in the hypercube of dimension $d$ whose coordinates are i.i.d., then:

$$
\frac{\min_{i \neq j} \lVert X_i - X_j \rVert_p}{\max_{i \neq j} \lVert X_i - X_j \rVert_p} = 1 + \mathcal{O}_p\left(\sqrt{\frac{\log(n)}{d}}\right)
$$

This means that when $d$ is large enough, all points are almost equidistant, so distances carry little information for telling observations apart.
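The concentration of distances can be observed numerically. The sketch below assumes points drawn uniformly in $[0, 1]^d$ and the Euclidean norm ($p = 2$); the function name `min_max_distance_ratio` and the chosen values of `n` and `d` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def min_max_distance_ratio(n, d, rng):
    """Ratio of the smallest to the largest pairwise Euclidean distance
    among n points drawn uniformly in the hypercube [0, 1]^d."""
    X = rng.uniform(size=(n, d))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared distances
    iu = np.triu_indices(n, k=1)                  # keep pairs with i < j
    dist = np.sqrt(np.maximum(d2[iu], 0.0))       # clip rounding noise
    return dist.min() / dist.max()


n = 100
ratios = {d: min_max_distance_ratio(n, d, rng) for d in (2, 10, 100, 1000, 10000)}

for d, ratio in ratios.items():
    print(f"d = {d:>5}: min/max = {ratio:.3f}")
```

With $n$ fixed, the printed ratio should move toward 1 as $d$ grows, in line with the $\sqrt{\log(n)/d}$ rate above.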

## Principal Component Analysis

As the name suggests, the idea behind principal component analysis (PCA) is to find the most important directions in the data (combinations of the original features) in order to create a subspace of smaller dimension that retains as much information as possible. It aims at decreasing the dimension of the data from $d$ to $d'$, with $ d' \le d$. We measure the importance of a direction by the dispersion of the data along it, i.e. the more dispersed the data is along a direction, the more useful that direction will be in order to create clusters.

<center>
<figure>
<img src="./pictures/PCA_rotation.png">
<figcaption>Fig.1 - PCA example from Wikipedia</figcaption>
</figure>
</center>

In the example above, we can clearly see that the feature represented on the X-axis is the principal component, whereas the one on the Y-axis is less important. Indeed, the variation is greater along the X-axis than along the Y-axis. The objective is to rotate the data as shown in red in order to maximize the variance along the X-axis and minimize the variance along the Y-axis.

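In a real-life scenario, where there are many more features and thus dimensions, we rank the importance of each component by computing the covariance matrix of the data: its eigenvectors give the principal components, and the associated eigenvalues give the variance carried by each of them.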
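A minimal sketch of this procedure with NumPy is given below: center the data, compute the covariance matrix, take its eigendecomposition, and keep the directions with the largest variance. The toy dataset, the variable names and the choice `k = 1` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 observations with 3 features, two of which are strongly
# correlated through a shared latent variable (hypothetical example).
n = 500
latent = rng.normal(size=(n, 1))
X = np.hstack([
    3.0 * latent + 0.1 * rng.normal(size=(n, 1)),
    1.0 * latent + 0.1 * rng.normal(size=(n, 1)),
    0.1 * rng.normal(size=(n, 1)),
])

# 1. Center the data so that the covariance measures spread around the mean.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (shape: features x features).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition. eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort the components by decreasing variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# 5. Keep the first k components and project the data onto them.
k = 1
Z = X_centered @ eigenvectors[:, :k]

print("explained variance ratio:", np.round(explained_variance_ratio, 4))
print("reduced data shape:", Z.shape)
```

Because the first two features are driven by the same latent variable, a single component should carry almost all of the variance in this toy dataset.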

## Clustering

There is a wide variety of clustering methods, and they are very difficult to compare. The image below, from the scikit-learn documentation, shows how several of them behave on the same toy datasets.
<center>
<figure>
<img src="./pictures/plot_cluster_comparison.png">
<figcaption>Fig.2 - Comparison of clustering methods, from the scikit-learn documentation</figcaption>
</figure>
</center>