# Dimensionality Reduction Techniques

_Kevin Siswandi_  
**Fundamentals of machine learning**  
June 2020  

- Most dimensionality reduction methods have several parameters that need to be selected.
- The fidelity of the low-dimensional representation depends on the chosen parameters and on the intrinsic dimensionality of the data.
- The parameters control the tradeoff between stiffness and flexibility.

We say that a lower-dimensional projection is:
- too stiff when faraway points get mapped to nearby locations
- too flexible when nearby points get mapped to faraway locations

In real-world datasets, the intrinsic dimensionality is usually much lower than the nominal dimensionality. The first thing to try is almost always PCA.

## Principal Component Analysis

We will approach PCA here from the perspective of Singular Value Decomposition. We are looking for a low-rank (rank-$r$) approximation of the data, whose basis vectors are the principal components. This is achieved by decomposing the original data matrix $X$ ($p \times n$) into:
- principal components -- new basis in terms of old basis
- scores -- coordinates in the new, lower dimensional basis

Some properties:
- The principal component matrix $U$ is of shape $p \times r$, where $r \leq p$.
- The scores matrix $Z$ is $r \times n$.
- The principal components are orthonormal to each other, i.e. $U^T U = I$

Singular Value Decomposition says that X can always be decomposed into

$$ X = U Z = U S V^T$$

where S is a diagonal matrix. The optimization problem solved by PCA is

$$ \arg\min_{U, Z} \| X - UZ \|_F^2 $$

such that the product $UZ$ has rank $r$. The decomposition is unique (up to the signs of the singular vectors) if the singular values in the diagonal matrix $S$ are distinct (nondegenerate). As a concrete example, consider a data matrix $X$ of shape $p \times n$, where $p \leq n$. Then:
* $U$ is of shape $p \times p$
* $S$ is of shape $p \times p$
* $V^T$ is of shape $p \times n$
* $U^T U = I_p$ (identity matrix of dimension $p$)
* $V^T V = I_p$ (the $p$ columns of $V$ are orthonormal; $V V^T = I_n$ holds only when $p = n$)
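
The shapes and orthonormality properties in this example can be checked numerically. The following is a minimal sketch, assuming NumPy's thin SVD (`full_matrices=False`) and a small random matrix as stand-in data; the sizes `p = 3` and `n = 8` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 8                          # p features, n observations, p <= n
X = rng.normal(size=(p, n))          # stand-in data matrix

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)

print(U.shape, S.shape, Vt.shape)    # (3, 3) (3, 3) (3, 8)

# Columns of U are orthonormal, and so are the rows of Vt.
u_orthonormal = np.allclose(U.T @ U, np.eye(p))
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(p))

# The three factors reproduce X.
reconstructs = np.allclose(U @ S @ Vt, X)
print(u_orthonormal, v_orthonormal, reconstructs)
```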

The optimization problem can also be written as

$$ \arg\min_{\tilde{X}} \| X - \tilde{X} \|_F^2 $$

such that rank($\tilde{X}$) = r, where:
* X -- original data
* Z -- low dimensional representation (encoding of X)
* $\tilde{X}$ -- approximation to original data/decoding of Z

Note that the encoding can be done via $Z = U^T X$, and the approximation/decoding computed as $\tilde{X} = UZ$.
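
The encoding and decoding steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: `pca_svd` is a hypothetical helper name, the data are synthetic, and the features are centered before the SVD (a common preprocessing step that the notes above do not spell out).

```python
import numpy as np

def pca_svd(X, r):
    """Rank-r PCA of a (p, n) data matrix whose columns are observations."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_r = U[:, :r]          # first r principal components, shape (p, r)
    Z = U_r.T @ X           # encoding (scores), shape (r, n)
    X_tilde = U_r @ Z       # decoding (rank-r approximation), shape (p, n)
    return U_r, Z, X_tilde

rng = np.random.default_rng(0)

# Synthetic data: intrinsic dimensionality 2, nominal dimensionality 5.
latent = rng.normal(size=(2, 200))
mixing = rng.normal(size=(5, 2))
X = mixing @ latent + 0.01 * rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)   # assumption: center each feature

U_r, Z, X_tilde = pca_svd(X, r=2)
rel_err = np.linalg.norm(X - X_tilde) / np.linalg.norm(X)
print(Z.shape, rel_err)
```

Because the synthetic data are two-dimensional up to small noise, the relative reconstruction error with `r=2` should be small; with `r=1` it would be noticeably larger.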

## Autoencoders

Note that both the encoding and decoding in PCA are linear operations. How do we go nonlinear? In autoencoders, the encoding and decoding are implemented by nonlinear neural networks. This is achieved by:

$$ \arg\min_{\theta_e, \theta_d} \sum_i L(x_i, f_d(f_e(x_i; \theta_e); \theta_d)) $$

where $L$ is the loss function, and $f_e$ and $f_d$ are the encoding and decoding functions (neural networks with parameters $\theta_e$ and $\theta_d$).
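
To make the objective concrete, here is a minimal NumPy sketch of a tiny autoencoder trained by plain gradient descent. Everything specific in it is an assumption chosen for illustration: the toy data (points on a parabola), the layer sizes, the `tanh` activations, the squared-error loss, the learning rate, and the hand-written backpropagation. In practice one would normally use a deep learning framework instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points on a parabola in 2-D (intrinsic dimensionality 1).
n = 200
t = np.linspace(-1.0, 1.0, n)
X = np.stack([t, t ** 2])                    # shape (p, n) = (2, n)

p, h, r = 2, 8, 1                            # input dim, hidden width, code dim

def init(rows, cols):
    return rng.normal(scale=0.5, size=(rows, cols))

# theta_e = (W1, b1, W2, b2), theta_d = (W3, b3, W4, b4)
params = {
    "W1": init(h, p), "b1": np.zeros((h, 1)),
    "W2": init(r, h), "b2": np.zeros((r, 1)),
    "W3": init(h, r), "b3": np.zeros((h, 1)),
    "W4": init(p, h), "b4": np.zeros((p, 1)),
}

def forward(X, P):
    A1 = np.tanh(P["W1"] @ X + P["b1"])      # encoder hidden layer
    Z = P["W2"] @ A1 + P["b2"]               # code z = f_e(x)
    A3 = np.tanh(P["W3"] @ Z + P["b3"])      # decoder hidden layer
    X_hat = P["W4"] @ A3 + P["b4"]           # reconstruction f_d(z)
    return A1, Z, A3, X_hat

def loss_and_grads(X, P):
    A1, Z, A3, X_hat = forward(X, P)
    m = X.shape[1]
    loss = np.sum((X_hat - X) ** 2) / m      # mean squared reconstruction error
    dX_hat = 2.0 * (X_hat - X) / m
    G = {}
    G["W4"] = dX_hat @ A3.T
    G["b4"] = dX_hat.sum(axis=1, keepdims=True)
    dH3 = (P["W4"].T @ dX_hat) * (1.0 - A3 ** 2)   # tanh'(u) = 1 - tanh(u)^2
    G["W3"] = dH3 @ Z.T
    G["b3"] = dH3.sum(axis=1, keepdims=True)
    dZ = P["W3"].T @ dH3
    G["W2"] = dZ @ A1.T
    G["b2"] = dZ.sum(axis=1, keepdims=True)
    dH1 = (P["W2"].T @ dZ) * (1.0 - A1 ** 2)
    G["W1"] = dH1 @ X.T
    G["b1"] = dH1.sum(axis=1, keepdims=True)
    return loss, G

lr = 0.02                                    # assumed step size
losses = []
for step in range(5000):
    loss, grads = loss_and_grads(X, params)
    losses.append(loss)
    for name in params:
        params[name] -= lr * grads[name]

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Note that the decoder needs its own nonlinearity: with a one-dimensional code and a purely linear decoder, all reconstructions would lie on a straight line, which is no more expressive than PCA.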

## Other techniques

Here we review some methods that make high- and low-dimensional dissimilarities match.

### Metric Multidimensional Scaling
* Dissimilarity measure in high-dimensional space: Euclidean distance $d_{ij}^* = || x_i - x_j ||_2$
* Dissimilarity measure in low-dimensional space: Euclidean distance $d_{ij} = || z_i - z_j ||_2$
* Optimization objective: $\arg\min_z \sum_i \sum_j (d_{ij}^* - d_{ij})^2$

The drawbacks of this method are twofold:
- It emphasizes long distances too much.
- It is a non-convex optimization problem.

There are also variants of this method, such as weighting the optimization objective by the distances (in either the low- or the high-dimensional space).
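
The unweighted objective and one simple way to minimize it can be sketched as follows. This is an illustration only: the helper names (`pairwise_dist`, `stress`, `stress_grad`) are hypothetical, observations are stored as rows here for convenience, and plain gradient descent with a fixed step size is an assumed optimizer, not one prescribed by the method.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance matrix between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D_star, Z):
    """sum_i sum_j (d*_ij - d_ij)^2 for low-dimensional coordinates Z."""
    return ((D_star - pairwise_dist(Z)) ** 2).sum()

def stress_grad(D_star, Z, eps=1e-12):
    """Gradient of the stress with respect to every row z_i of Z."""
    diff = Z[:, None, :] - Z[None, :, :]           # z_i - z_j, shape (n, n, r)
    D = np.sqrt((diff ** 2).sum(axis=-1))
    coef = (D - D_star) / (D + eps)                # zero on the diagonal
    return 4.0 * (coef[:, :, None] * diff).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))       # 30 observations in 5 dimensions
D_star = pairwise_dist(X)          # high-dimensional dissimilarities

Z = rng.normal(size=(30, 2))       # random start; the result depends on it
history = [stress(D_star, Z)]
lr = 0.002                         # assumed step size
for _ in range(500):
    Z -= lr * stress_grad(D_star, Z)
    history.append(stress(D_star, Z))

print(history[0], history[-1])
```

Because the problem is non-convex, a different random start for `Z` can lead to a different final configuration and a different final stress.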

### Curvilinear Component Analysis

This method [Demartines, 1996] extends the MDS method by weighting the optimization objective by the low-dimensional distances. This allows for stronger or more non-linear unfolding of manifolds.

### Stochastic Neighbour Embedding

In this method [Hinton, 2003], the similarity is measured using a radial basis function (RBF) kernel:

$$ k^*(x_i, x_j) = \exp(-\kappa_i ||x_i - x_j||^2) $$

where the constant $\kappa_i$ can be different for each observation. The similarity in the low-dimensional space is also measured using an RBF kernel, and the optimization minimizes the KL divergence between the two similarities.
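
A minimal sketch of the two ingredients, the RBF similarities and the KL divergence between them, is given below. Several details are assumptions made for illustration: the similarities are normalised per row so that each row is a probability distribution over neighbours, self-similarities are set to zero, and the values of `kappa` are fixed by hand rather than tuned per observation. The helper names are hypothetical, and rows are observations.

```python
import numpy as np

def sq_dists(A):
    """Squared Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

def rbf_similarities(A, kappa):
    """Row-normalised RBF similarities; kappa holds one constant per observation."""
    K = np.exp(-kappa[:, None] * sq_dists(A))
    np.fill_diagonal(K, 0.0)                 # a point is not its own neighbour
    return K / K.sum(axis=1, keepdims=True)

def kl_divergence(P, Q, eps=1e-12):
    """Sum over all rows of KL(P_i || Q_i)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                # high-dimensional observations
Z = rng.normal(size=(20, 2))                 # candidate low-dimensional coordinates

kappa_high = np.full(20, 0.1)                # hand-picked values (assumption)
kappa_low = np.full(20, 1.0)

P = rbf_similarities(X, kappa_high)
Q = rbf_similarities(Z, kappa_low)
kl = kl_divergence(P, Q)
print(kl)
```

An optimizer would then move the rows of `Z` to decrease this value; that step is omitted here.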

### t-SNE

t-SNE [van der Maaten, 2008] uses an RBF kernel in the high-dimensional space, but a Student-t kernel in the low-dimensional space:

$$ k(z_i, z_j) = \frac{1}{1 + \alpha ||z_i - z_j||_2^2} $$

However, this is still a hard optimization problem: depending on the initialization, you can end up in different local minima. A workaround is to initialize with a deterministic procedure (e.g. the results of PCA). It is also computationally expensive, because we minimize over all coordinates in the low-dimensional space: with 1 million points embedded in two dimensions, there are two million coordinates to optimize (each observation has two coordinates).
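
The low-dimensional kernel and a deterministic PCA-based starting point can be sketched as follows. This is an illustration under assumptions: `student_t_similarities` and `pca_init` are hypothetical helpers, the similarities are normalised over all pairs, `alpha = 1` is an arbitrary choice, and rows are observations. The optimization loop itself is not shown.

```python
import numpy as np

def student_t_similarities(Z, alpha=1.0):
    """Student-t kernel 1 / (1 + alpha * ||z_i - z_j||^2), normalised over all pairs."""
    diff = Z[:, None, :] - Z[None, :, :]
    K = 1.0 / (1.0 + alpha * (diff ** 2).sum(axis=-1))
    np.fill_diagonal(K, 0.0)                 # ignore self-similarity
    return K / K.sum()

def pca_init(X, r=2):
    """Deterministic start: scores on the first r principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)   # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:r].T                     # shape (n, r)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                # 50 observations, 10 features

Z0 = pca_init(X)                             # same X always gives the same Z0
Q = student_t_similarities(Z0)

# Number of free variables in the optimization: n points times r coordinates.
print(Z0.shape, Z0.size)                     # (50, 2) 100
```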

### UMAP

Uniform Manifold Approximation and Projection (UMAP) uses a similarity kernel that is constant up to the nearest neighbour (and RBF otherwise) in the high-dimensional space. In the low-dimensional space, a single-sided exponential kernel is used as the similarity kernel. The optimization problem minimizes the cross entropy between the two similarities, and initialization is done using a Laplacian eigenmap.
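
The cross-entropy objective alone can be sketched as below. This covers only the objective, written up to terms that do not depend on the low-dimensional similarities; the kernels, the neighbour search, and the optimizer are left out. `fuzzy_cross_entropy` is a hypothetical helper, not a function of the `umap-learn` package, and the similarity values are made up for illustration.

```python
import numpy as np

def fuzzy_cross_entropy(P, Q, eps=1e-12):
    """Sum over pairs of p*log(p/q) + (1-p)*log((1-p)/(1-q))."""
    P = np.clip(P, eps, 1.0 - eps)           # keep the logarithms finite
    Q = np.clip(Q, eps, 1.0 - eps)
    attract = P * np.log(P / Q)                          # pulls similar pairs together
    repel = (1.0 - P) * np.log((1.0 - P) / (1.0 - Q))    # pushes dissimilar pairs apart
    return float(np.sum(attract + repel))

# Made-up similarities for three pairs of points.
P = np.array([0.9, 0.5, 0.1])                # high-dimensional similarities
Q_good = np.array([0.8, 0.5, 0.2])           # embedding that roughly agrees with P
Q_bad = np.array([0.1, 0.5, 0.9])            # embedding that contradicts P

ce_good = fuzzy_cross_entropy(P, Q_good)
ce_bad = fuzzy_cross_entropy(P, Q_bad)
print(ce_good < ce_bad)                      # True
```

The second term is what distinguishes this objective from a plain KL divergence: it penalizes pairs that are dissimilar in the high-dimensional space but end up close together in the embedding.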