Dimensionality Reduction

* Objectives:
    * Why reduce dimensionality?
    * Methods for reducing dimensionality
    * One common technique: Principal Components Analysis (PCA)
    * Crude facial recognition with PCA and kNN (e.g. Eigenfaces)
    * Singular Value Decomposition (SVD)
    * SVD vs. PCA
    * SVD for capturing latent features

1) The Importance of Reducing Dimensionality
* dimensionality = number of features = number of predictors
* Why reduce dimensionality?
    * High dimensional data causes many problems
    * **Curse of Dimensionality** - points are "far away" in high dimensions, and it's easy to overfit small datasets (sparsity of sample data points)
        * Distance between nearest neighbors is very large
        * Simulation: uniformly distributed points in a cube
        ![distance_in_hyperspace](distance_in_hyperspace.png)
        * Simulation: uniformly distributed points in a hypercube (most data points lie close to the boundary of the sample space)
        ![hypercube](hypercube.png)
        * **Sample sparsity** - when the dimensionality $d$ increases, the volume of the space increases so fast that the available data becomes sparse
            * Example: for uniformly distributed data points ($N=1000$), consider a sub-region that covers 20% of the range of each coordinate
            ![sample_sparsity](sample_sparsity.png)
            * In ten dimensions we need to cover 80% of the range of each coordinate to capture 10% of the data
            ![10_dimensions](10_dimensions.png)
                * For a fraction $r$ of unit volume: (Edge Length) $e(r)=r^{\frac{1}{d}}$
                * Reducing $r$ gives fewer observations to average, and higher variance of fit
        * **Classifier Performance** - as the dimensionality increases, the classifier's performance increases until the optimal number of features is reached
        ![classifier_performance](classifier_performance.png)
            * Increasing the dimensionality further **without** increasing the number of training samples results in a **decrease** in classifier performance
    * **Difficult Visualization** - it's hard to visualize anything with more than 3 dimensions
    * **Finding Latent Features** - often the most relevant features are not explicitly present in the raw high dimensional data (especially for image/video data)
    * **Removing Correlation** - with many, many features (dimensions), there will most likely be a lot of correlations (e.g. consider neighboring pixels in an image dataset)
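The edge-length formula $e(r)=r^{\frac{1}{d}}$ from the sample-sparsity discussion is easy to tabulate. The following is a minimal sketch in plain Python; the function name `edge_length` is an illustrative choice, not something defined in these notes.

```python
def edge_length(r: float, d: int) -> float:
    """Edge length of a sub-cube that holds a fraction r of the volume
    of a unit hypercube in d dimensions: e(r) = r ** (1 / d)."""
    return r ** (1.0 / d)


# How much of each coordinate's range is needed to capture 1% or 10%
# of uniformly distributed data as the dimensionality grows.
for d in (1, 2, 3, 10, 100):
    print(f"d={d:>3}  e(0.01)={edge_length(0.01, d):.3f}  "
          f"e(0.10)={edge_length(0.10, d):.3f}")
```

For $d=10$ and $r=0.1$ the edge length is roughly 0.79, which matches the "80% of the range to capture 10% of the data" statement above.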

2) Other Methods For Reducing Dimensionality
* Subset Selection of Features (e.g. **Forward Stepwise Selection**) - selecting the best model with the best subset of features based on some metric (e.g. Mallows's $C_p$, AIC, BIC, Adjusted $R^2$, or cross-validation)
* **Lasso Regression** - creates sparse models by zeroing out the coefficients of features that are less important
* **Relaxed Lasso**
    1. Using cross-validation and lasso regression, find the best value for $\lambda$
    2. Keep only the features with non-zero coefficients
    3. Re-fit using ordinary least squares (OLS) regression on the kept features (i.e. refit with no regularization)
* **Upper-layer Features** (in NN for labeled training data) - train the neural network, then interpret the output of the hidden neurons (fully-connected layer before output) as high-level features
* **Autoencoders** (in NN for labeled or unlabeled training data) - neural networks that learn to **reconstruct the input** instead of learning the target
![autoencoders](autoencoders.png)
    * Force the information through a **bottleneck hidden layer**
    * With linear activations and a squared-error loss, the bottleneck learns the same subspace as PCA
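The relaxed lasso steps can be sketched with NumPy alone. This is a rough illustration under stated assumptions: the data are synthetic, the penalty `lam` is fixed by hand rather than chosen by cross-validation (step 1 is skipped), and `lasso_cd` is a hypothetical helper implementing plain coordinate descent for the objective $\frac{1}{2n}\Vert y-X\beta\Vert^2+\lambda\Vert\beta\Vert_1$.

```python
import numpy as np


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent (X and y assumed centered)."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta


# Synthetic data: only the first three of ten features matter.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
true_beta = np.zeros(d)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.1 * rng.normal(size=n)
X = X - X.mean(axis=0)  # center so no intercept is needed
y = y - y.mean()

lam = 0.5                                    # assumed; normally cross-validated
beta_lasso = lasso_cd(X, y, lam)             # sparse but shrunken coefficients
selected = np.flatnonzero(beta_lasso != 0)   # step 2: keep non-zero features
beta_ols, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)  # step 3: OLS refit

print("selected features:", selected)
print("lasso coefficients:", np.round(beta_lasso[selected], 2))
print("refit coefficients:", np.round(beta_ols, 2))
```

The refit removes the shrinkage that the lasso penalty applies to the surviving coefficients.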

3) **Principal Components Analysis (PCA)** - common dimensionality reduction technique that doesn't require labeled data
* First goal of PCA is to remove correlation between features
* A side effect is that we can use the process to reduce the dimensionality of the data while preserving most of the variance in our data
* **Rank and Dimensionality**
    * $\mathbf{x}_i^T$ are $d$-dimensional row vectors (feature space)
    * $\mathbf{y}_j$ are column vectors (dependent space)
    * Linear Independence:
        * For any set of $n$ vectors, consider a linear combination set equal to zero: $c_1\mathbf{y}_1+c_2\mathbf{y}_2+\cdots+c_n\mathbf{y}_n=\mathbf{0}$
            * If this equation holds only when all $c_i$'s are zero, then the $\mathbf{y}_i$'s are **linearly independent** vectors
            * If this equation holds with at least one $c_i$ **not equal** to zero, then the $\mathbf{y}_i$'s are **linearly dependent** vectors. This means we can express at least one of the vectors as a linear combination of the others, e.g. if $c_1\neq 0$: $\mathbf{y}_1=k_2\mathbf{y}_2+\cdots+k_n\mathbf{y}_n$ where $k_j=\frac{-c_j}{c_1}$
    * Rank of Matrix $\mathbf{D}$:
        * Maximum number of linearly independent column vectors of $\mathbf{D}$
        * Maximum number of linearly independent row vectors of $\mathbf{D}$
        * $\mathbf{D}=\left[\begin{array}{cccc}
            \mathbf{y}_1 & \mathbf{y}_2 & \cdots & \mathbf{y}_d
            \end{array}\right]$
            * $\mathbf{D}$ is an $n \times d$ matrix
            * $\operatorname{rank}(\mathbf{D})=r\leq \min(n,d)$
        * Rank of data matrix gives the dimensionality of the data
    * The number of linearly independent **basis vectors** needed to represent a data vector gives the dimensionality of the data
        * equation: $\mathbf{x}=x_1\mathbf{e}_1+x_2\mathbf{e}_2+\cdots+x_d\mathbf{e}_d$
        ![basis_vectors](basis_vectors.png)
        * The data points apparently reside in a $d$-dimensional **attribute space**
        * But, if $r<d$, then the data points actually reside in a lower $r$-dimensional space
    * Determining vector space for orthonormal basis vectors:
        ![orthonormal_basis_vectors](orthonormal_basis_vectors.png)
        * $\mathbf{D}=\left[\begin{array}{cc}
            \mathbf{x}_1^T \\
            \mathbf{x}_2^T \\
            \vdots \\
            \mathbf{x}_n^T
            \end{array}\right]$
        * each point $\mathbf{x}=(x_1,x_2,\dots,x_d)^T$ (a row $\mathbf{x}_i^T$ of $\mathbf{D}$) is a vector in $d$-dimensional vector space
        * write $\mathbf{x}$ as: $\mathbf{x}=\sum_{i=1}^d x_i\mathbf{e}_i$ where $\mathbf{e}_i$ are **orthonormal basis vectors**:
            * $\mathbf{e}_i^T\mathbf{e}_j=1$ if $i=j$
            * $\mathbf{e}_i^T\mathbf{e}_j=0$ if $i\neq j$
    * Changing the dimensional space to fit along the data vector space
        ![orthonormal_basis_vectors_2](orthonormal_basis_vectors_2.png)
        * $\mathbf{D}=\left[\begin{array}{cc}
            \mathbf{x}_1^T \\
            \mathbf{x}_2^T \\
            \vdots \\
            \mathbf{x}_n^T
            \end{array}\right]$
        * each point $\mathbf{x}=(x_1,x_2,\dots,x_d)^T$ (a row $\mathbf{x}_i^T$ of $\mathbf{D}$) is a vector in $d$-dimensional vector space
        * given any other set of $d$ orthonormal vectors, write $\mathbf{x}$ as: $\mathbf{x}=\sum_{i=1}^d a_i\mathbf{u}_i$ where $\mathbf{u}_i$ are **orthonormal basis vectors**:
            * $\mathbf{u}_i^T\mathbf{u}_j=1$ if $i=j$
            * $\mathbf{u}_i^T\mathbf{u}_j=0$ if $i\neq j$
        * $\mathbf{A}=\left[\begin{array}{cc}
            \mathbf{a}_1^T \\
            \mathbf{a}_2^T \\
            \vdots \\
            \mathbf{a}_n^T
            \end{array}\right]$
        * each row $\mathbf{a}_i^T$ holds the coordinates $\mathbf{a}=(a_1,a_2,\dots,a_d)^T$ of the corresponding point in the new basis, again a vector in $d$-dimensional vector space
        * $a_j=\mathbf{u}_j^T\mathbf{x}$
        * in vector form:
            * $\mathbf{a}=\mathbf{U}^T\mathbf{x}$
            * $\mathbf{U}=(\mathbf{u}_1,\mathbf{u}_2,\dots,\mathbf{u}_d)$
    * Finding the optimal set of orthonormal basis vectors:
        * Because there are infinitely many choices for the set of orthonormal basis vectors, one natural question is whether there exists an **optimal** basis, for a suitable notion of optimality
        * Finding a reduced dimensionality subspace that still preserves the essential characteristics of the data (high variance explained)
        * Project data points from a $d$-dimensional space to an $r$-dimensional space where $r<d$
        * $\mathbf{x}'=\sum_{i=1}^r a_i \mathbf{u}_i$
        * $\mathbf{\epsilon}=\sum_{i=r+1}^d a_i \mathbf{u}_i = \mathbf{x}-\mathbf{x}'$ (error vector)
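The ideas of rank, change of basis ($\mathbf{a}=\mathbf{U}^T\mathbf{x}$), and projection error can be checked numerically. The sketch below uses synthetic data and an arbitrary orthonormal basis obtained from a QR factorization; both are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3

# Points that look 3-dimensional but actually lie on a 2-D plane:
# the third column is a linear combination of the first two.
Z = rng.normal(size=(n, 2))
D = np.column_stack([Z[:, 0], Z[:, 1], 2.0 * Z[:, 0] - Z[:, 1]])
rank = np.linalg.matrix_rank(D)
print("apparent dimensionality d =", d, " rank r =", rank)

# An arbitrary orthonormal basis U = (u_1, ..., u_d), one vector per column.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = D[0]
a = U.T @ x                 # coordinates of x in the new basis
x_back = U @ a              # using all d basis vectors recovers x exactly

r = 2
x_proj = U[:, :r] @ a[:r]   # x' = sum_{i<=r} a_i u_i
eps = x - x_proj            # error vector
eps_tail = U[:, r:] @ a[r:] # the same error, built from the discarded directions
print("reconstruction exact:", np.allclose(x_back, x))
print("error lies in discarded directions:", np.allclose(eps, eps_tail))
```

With an arbitrary basis the error is generally not small; PCA is about choosing $\mathbf{U}$ so that it is as small as possible.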
* PCA basic terminology:
    * **Principal Component Analysis (PCA)** is a technique that seeks an $r$-dimensional basis that best captures the variance in the data
    * **First Principal Component** - the direction with the largest projected variance
    * **Second Principal Component** - the orthogonal direction that captures the second largest projected variance
    ![principal_component](principal_component.png)
* First Principal Component calculation:
    * choose the direction $\mathbf{u}$ such that the variance of the projected points is maximized
    * the projected variance along $\mathbf{u}$ is: $\sigma_u^2=\frac{1}{n}\sum_{i=1}^n (a_i-\mu_{\mathbf{u}})^2$
    * for centered data: $\sigma_u^2=\mathbf{u}^T\mathbf{\Sigma}\mathbf{u}$
    * $\mathbf{\Sigma}$ is covariance matrix of centered $\mathbf{D}$: $\mathbf{\Sigma}=\frac{1}{n}\mathbf{D}^T\mathbf{D}$
    * maximizing $\sigma_u^2$ (with constraint $\mathbf{u}^T\mathbf{u}=1$) gives:
        * $\mathbf{\Sigma}\mathbf{u}=\mathbf{\lambda}\mathbf{u}$
        * $\sigma_{\mathbf{u}}^2=\mathbf{\lambda}$
    * to maximize the projected variance, choose the largest eigenvalue of $\mathbf{\Sigma}$
    * eigenvector $\mathbf{u}$ with maximum $\mathbf{\lambda}$ specifies the direction of most variance (First Principal Component)
* Finding all eigenvectors corresponding to $\lambda$:
    * to find the best $r$-dim approximation to $\mathbf{D}$, compute the eigenvalues of the covariance matrix $\mathbf{\Sigma}$
    * eigenvalues of $\mathbf{\Sigma}$ are non-negative, and can be sorted in decreasing order: $\lambda_1\geq \lambda_2\geq \cdots \geq \lambda_r \geq \lambda_{r+1} \geq \cdots \geq \lambda_d \geq 0$
        * eigenvector corresponding to $\lambda_1$ gives first principal component
        * eigenvector corresponding to $\lambda_2$ gives second principal component
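As a sanity check on the claim that the projected variance along the top eigenvector equals the largest eigenvalue, here is a small NumPy sketch. The 2-D data set is synthetic, and `np.linalg.eigh` is used because $\mathbf{\Sigma}$ is symmetric; it returns eigenvalues in ascending order, so they are reversed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Correlated 2-D data, then centered so that Sigma = D^T D / n.
raw = rng.normal(size=(n, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
D = raw - raw.mean(axis=0)
Sigma = D.T @ D / n

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]          # sort into decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]             # first principal component
a = D @ u1                     # projections a_i = u1^T x_i
projected_var = a.var()        # population variance (1/n), as in the notes

print("largest eigenvalue      :", eigvals[0])
print("variance along u1       :", projected_var)
print("second principal comp.  :", eigvecs[:, 1])
```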
* Projected variance and minimizing MSE:
    * reduced $r$-dimensional data matrix: $\mathbf{A}=[\mathbf{\alpha}_1,\mathbf{\alpha}_2,\cdots,\mathbf{\alpha}_r]$
    * total variance of $\mathbf{A}$: $var(\mathbf{A})=\sum_{i=1}^r \lambda_i$
    * the first $r$ principal components maximize the projected variance, $var(\mathbf{A})$, and thus minimize the MSE:
        * $MSE=\frac{1}{n}\sum_{i=1}^n \mathbf{\epsilon}_i^T\mathbf{\epsilon}_i=var(\mathbf{D})-var(\mathbf{A})$
        * $\mathbf{\epsilon}_i = \mathbf{x}_i-\mathbf{x}_i'$
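The identity $MSE=var(\mathbf{D})-var(\mathbf{A})$ can be verified directly. This is a sketch on synthetic data, where $var(\mathbf{D})$ is taken to be the sum of all eigenvalues (the trace of $\mathbf{\Sigma}$) and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 300, 5, 2

# Synthetic data with very different variances per feature, then centered.
raw = rng.normal(size=(n, d)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2])
D = raw - raw.mean(axis=0)
Sigma = D.T @ D / n

eigvals, U = np.linalg.eigh(Sigma)
eigvals, U = eigvals[::-1], U[:, ::-1]     # decreasing order

A = D @ U[:, :r]               # reduced n x r data matrix
D_proj = A @ U[:, :r].T        # projected points x'_i in the original space
eps = D - D_proj               # error vectors, one per row

mse = np.mean(np.sum(eps ** 2, axis=1))
var_D = eigvals.sum()          # total variance = trace(Sigma)
var_A = eigvals[:r].sum()      # variance kept by the first r components

print("MSE             :", mse)
print("var(D) - var(A) :", var_D - var_A)
```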
* **Choosing the dimensionality** - how many dimensions, $r$, to use for a good approximation
    * compute the fraction of the total variance captured by the first $r$ principal components: $f(r)=\frac{\lambda_1+\lambda_2+\cdots+\lambda_r}{\lambda_1+\lambda_2+\cdots+\lambda_d}=\frac{var(\mathbf{A})}{var(\mathbf{D})}$
    * starting from the first principal component, keep adding components and stop at the smallest value $r$ for which $f(r)\geq\alpha$
    * in practice, $\alpha$ is usually set to 0.9 or higher, so that the reduced dataset captures at least 90% of the total variance
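The stopping rule $f(r)\geq\alpha$ amounts to a cumulative sum over the sorted eigenvalues. A minimal sketch follows; `choose_r` is a hypothetical helper and the eigenvalues are made up for illustration.

```python
import numpy as np


def choose_r(eigvals, alpha=0.9):
    """Smallest r whose leading eigenvalues capture at least a fraction
    alpha of the total variance. Returns (r, f) with f[k-1] = f(k)."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    f = np.cumsum(eigvals) / eigvals.sum()
    # First index where f >= alpha; clamp in case of floating-point round-off.
    r = min(int(np.searchsorted(f, alpha)) + 1, len(eigvals))
    return r, f


# Made-up spectrum: f(1)=0.60, f(2)=0.85, f(3)=0.95, ...
r, f = choose_r([6.0, 2.5, 1.0, 0.3, 0.2], alpha=0.9)
print("cumulative fractions:", np.round(f, 2))
print("components to keep  :", r)
```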
* PCA for $n<d$:
    * the covariance matrix is a $d\times d$ matrix
    * typical algorithms for finding eigenvectors of $d \times d$ matrix have a computational cost that scales like $O(d^3)$
    * for $n<d$, the covariance matrix has at most $n$ non-zero eigenvalues (at most $n-1$ once the data are centered)
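    * starting from the eigenvalue equation of the $d \times d$ covariance matrix:
        * $\mathbf{\Sigma}\mathbf{u}_i=\mathbf{\lambda}_i\mathbf{u}_i$ with $\mathbf{\Sigma}=\frac{1}{n}\mathbf{D}^T\mathbf{D}$
    * multiplying both sides by $\mathbf{D}$ converts it to the eigenvalue equation of an $n \times n$ matrix with the same non-zero eigenvalues:
        * $\frac{1}{n}\mathbf{D}\mathbf{D}^T\mathbf{v}_i=\mathbf{\lambda}_i\mathbf{v}_i$ where $\mathbf{v}_i\propto\mathbf{D}\mathbf{u}_i$
    * mapping back to the original data space (with $\mathbf{v}_i$ normalized to unit length, so that $\mathbf{u}_i$ has unit length too):
        * $\mathbf{u}_i=\frac{1}{\sqrt{n\lambda_i}}\mathbf{D}^T\mathbf{v}_i$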
    * Summary of steps:
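        1. Evaluate $\frac{1}{n}\mathbf{D}\mathbf{D}^T$
        2. Find its eigenvectors $\mathbf{v}_i$ and eigenvalues $\lambda_i$
        3. Compute the eigenvectors in the original data space: $\mathbf{u}_i=\frac{1}{\sqrt{n\lambda_i}}\mathbf{D}^T\mathbf{v}_i$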
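The $n<d$ shortcut can be sketched as follows, assuming synthetic data and unit-length eigenvectors $\mathbf{v}_i$ of $\frac{1}{n}\mathbf{D}\mathbf{D}^T$ (which is what `np.linalg.eigh` returns). Near-zero eigenvalues are dropped with an ad hoc tolerance before mapping back.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 500                       # far fewer samples than features

raw = rng.normal(size=(n, d))
D = raw - raw.mean(axis=0)           # centered data matrix, n x d

# Steps 1-2: eigendecomposition of the small n x n matrix (1/n) D D^T.
G = D @ D.T / n
lam, V = np.linalg.eigh(G)
lam, V = lam[::-1], V[:, ::-1]       # decreasing order

# Centering removes one degree of freedom, so one eigenvalue is ~0.
keep = lam > 1e-10 * lam[0]          # tolerance is an arbitrary choice
lam, V = lam[keep], V[:, keep]

# Step 3: map back to the original d-dimensional space.
U = D.T @ V / np.sqrt(n * lam)       # columns are unit-length u_i

print("non-zero eigenvalues:", lam.size)
print("U shape             :", U.shape)
```

Only an $n \times n$ eigenproblem is solved here; the $d \times d$ covariance matrix is never needed to obtain `U`.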
* **Summary of PCA Steps**:
    1. create the **centered design matrix** ($M$) - center the data by subtracting the mean of each feature, then arrange the centered data as a matrix where each row is one example
    2. compute the **covariance matrix** ($\frac{1}{n}M^TM$)
    3. principal components are the **eigenvectors** of the covariance matrix - order the eigenvectors by decreasing eigenvalue to get an orthonormal basis of uncorrelated directions, capturing the most-to-least variance in your data
        * **Eigenvalue** - the size of each eigenvalue denotes the amount of variance captured by its eigenvector
* Example: Yale Face Database (using a subset of 105 face images)
    * each image is 320x240 pixels, grayscale, centered, and cropped identically
    ![face_images](face_images.png)
    * **Eigenfaces** - result of applying PCA on images and looking at the eigenvectors of the face database covariance matrix:
    ![eigenfaces](eigenfaces.png)
        * What is each eigenface capturing?
    * Drawbacks of Eigenfaces: (Using the old method in 1987-1991)
        * (-) Faces must be aligned eyes to eyes, mouth to mouth - differences in translation and scale are captured by PCA (which isn't what we want)
        * (-) Faces must be lit the same - differences in lighting are captured by PCA (which isn't what we want)
    * Improvements to methodology:
        * **Fisherfaces** - uses LDA and labels to help remove lighting effects
        * **Active Shape Model** - use PCA on shapes detected in the image
    * Crude Facial Recognition via PCA and kNN - using PCA on cropped face images (aka Eigenfaces) and combined with kNN, the model yields a rough facial recognition system
    ![pca_knn](pca_knn.png)
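The PCA plus kNN pipeline can be illustrated on synthetic "faces". This sketch does not use the Yale Face Database: each person is a random template vector plus pixel noise, the number of components `r` is picked by hand, and a 1-nearest-neighbor rule stands in for kNN. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n_people, n_pixels, r = 4, 256, 3

# Hypothetical data: one random "template" image per person.
templates = rng.normal(size=(n_people, n_pixels))


def sample(n_each):
    """Draw n_each noisy images of every person, with integer labels."""
    X = np.repeat(templates, n_each, axis=0)
    X = X + 0.5 * rng.normal(size=X.shape)
    y = np.repeat(np.arange(n_people), n_each)
    return X, y


X_train, y_train = sample(10)
X_test, y_test = sample(5)

# PCA on the training images: center, covariance, top-r eigenvectors.
mean_face = X_train.mean(axis=0)
M = X_train - mean_face
Sigma = M.T @ M / M.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigenfaces = eigvecs[:, ::-1][:, :r]            # n_pixels x r

# Project both sets into the r-dimensional eigenface space.
A_train = M @ eigenfaces
A_test = (X_test - mean_face) @ eigenfaces

# 1-nearest-neighbor classification in the reduced space.
d2 = ((A_test[:, None, :] - A_train[None, :, :]) ** 2).sum(axis=-1)
pred = y_train[d2.argmin(axis=1)]
accuracy = (pred == y_test).mean()
print("test accuracy:", accuracy)
```

On real face images the alignment and lighting drawbacks listed above apply, so results would be far less clean than on this toy data.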
* Example: MNIST dataset (dataset of handwritten digits, 10 classes (0-9), 28x28 pixels (yielding 784-dimensional vector space), grayscale, 60,000 training images / 10,000 test images)
    * Each eigenvector of the covariance matrix is a vector in the original $d$-dimensional space that can be represented as images of the same size as the data points
    ![lambda_eigenvectors](lambda_eigenvectors.png)
    * What are the sizes of the eigenvalues? (recall that the size of each eigenvector's eigenvalue denotes the amount of variance captured by that eigenvector)
        * Plot of the complete spectrum of eigenvalues, sorted into decreasing order
    ![eigenvalues](eigenvalues.png)
    * Reconstructing the input by a linear combination of eigenvectors:
    ![comb_eigenvectors](comb_eigenvectors.png)
    * Embedding in 2D:
    ![embedding_2d](embedding_2d.png)
* When to use PCA (generally):
    * kNN on high dimensional data
    * clustering high dimensional data
    * visualization (e.g. embeddings)
    * working with images (e.g. feeding an image into a decision tree model)
* When **not** to use PCA:
    * when you need to retain interpretability of the feature space
    * when the model doesn't need reduced dimensional data (e.g. OLS on relatively small dataset)

4) Singular Value Decomposition (SVD) - more generalized matrix decomposition
* **Singular Value Decomposition** - factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix to any $m \times n$ matrix
    * an $n \times d$ data matrix $\mathbf{D}$ can be factorized as: 
        * $\mathbf{D}=\mathbf{L}\mathbf{\Delta}\mathbf{R}^T$
            * $\mathbf{L} = (n \times n)$ left singular matrix
            * $\mathbf{R} = (d \times d)$ right singular matrix
            * $\mathbf{\Delta} = (n \times d)$ diagonal matrix
    * $\mathbf{\Delta}(i,j) = \begin{cases} 
            \delta_i & \text{if } i=j \\
            0 & \text{if } i\neq j 
            \end{cases}$
        * $i=1,\dots,n$
        * $j=1,\dots,d$
        * $\delta_i$ = singular values
    * if the rank of $\mathbf{D}$ is $r \leq \min(n,d)$, then there will be only $r$ non-zero singular values: 
        * $\delta_1 \geq \delta_2\geq \cdots \geq \delta_r > 0$
    * can discard those **left** and **right** singular vectors that correspond to zero singular values, to obtain the reduced **SVD** as: 
        * $\mathbf{D}_r=\mathbf{L}_r\mathbf{\Delta}_r\mathbf{R}_r^T$
        ![svd](svd.png)
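A short NumPy sketch of the full and reduced SVD follows, using a synthetic rank-2 matrix. `np.linalg.svd` returns the singular values as a 1-D array and the right singular vectors already transposed, so $\mathbf{\Delta}$ is rebuilt explicitly here; the rank tolerance is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, true_rank = 7, 5, 2

# A 7 x 5 matrix of rank 2, built as a product of two thin factors.
D = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, d))

# Full SVD: L is n x n, Rt (= R^T) is d x d, delta holds min(n, d) values.
L, delta, Rt = np.linalg.svd(D, full_matrices=True)
Delta = np.zeros((n, d))
Delta[:delta.size, :delta.size] = np.diag(delta)   # n x d "diagonal" matrix

# Count the non-zero singular values to get the rank.
rank = int(np.sum(delta > 1e-10 * delta[0]))

# Reduced SVD: drop singular vectors paired with zero singular values.
L_r = L[:, :rank]
Delta_r = np.diag(delta[:rank])
Rt_r = Rt[:rank, :]
D_r = L_r @ Delta_r @ Rt_r

print("rank:", rank)
print("full SVD reproduces D   :", np.allclose(L @ Delta @ Rt, D))
print("reduced SVD reproduces D:", np.allclose(D_r, D))
```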
* SVD vs PCA:
    * PCA is a **special case** of more general matrix decomposition (SVD)
    * PCA yields the following decomposition of covariance matrix:
        * $\mathbf{\Sigma}=\mathbf{U}\mathbf{\Lambda} \mathbf{U}^T$
        * Covariance matrix in new basis: $\mathbf{\Lambda}=\left[\begin{array}{cc}
            \mathbf{\lambda}_1 & 0 & \cdots & 0 \\
            0 & \mathbf{\lambda}_2 & \cdots & 0 \\
            \vdots & \vdots & \ddots & \vdots \\
            0 & 0 & \cdots & \mathbf{\lambda}_d \\
            \end{array}\right]$
        * $\mathbf{\sigma}_i^2=\mathbf{\lambda}_i=\mathbf{u}_i^T\mathbf{\Sigma}\mathbf{u}_i$
    * Matrix comparison:
        * PCA gives: $\mathbf{D}^T\mathbf{D}=n\mathbf{\Sigma}=\mathbf{U}(n\mathbf{\Lambda})\mathbf{U}^T$
        * SVD gives: $\mathbf{D}^T\mathbf{D}=(\mathbf{L}\mathbf{\Delta}\mathbf{R}^T)^T(\mathbf{L}\mathbf{\Delta}\mathbf{R}^T)=\mathbf{R}\mathbf{\Delta}^T\mathbf{L}^T\mathbf{L}\mathbf{\Delta}\mathbf{R}^T=\mathbf{R}\mathbf{\Delta}_d^2\mathbf{R}^T$
            * where $\mathbf{\Delta}_d^2$ is the ($d\times d$) diagonal matrix defined as: $\mathbf{\Delta}_d^2(i,i)=\delta_i^2$
        * Comparing both: $\delta_i^2=n\lambda_i$, $\mathbf{R}=\mathbf{U}$
    * Singular vector comparison:
        * The right singular vectors in $\mathbf{R}$ are the same as eigenvectors of $\mathbf{\Sigma}$
        * The left singular vectors in $\mathbf{L}$ are the eigenvectors of the $n \times n$ matrix $\mathbf{D}\mathbf{D}^T$, and the corresponding eigenvalues are given as $\delta_i^2$
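The relations $\delta_i^2=n\lambda_i$ and $\mathbf{R}=\mathbf{U}$ can be checked numerically. The sketch assumes centered synthetic data with well-separated variances; eigenvectors and singular vectors are only defined up to sign, so the comparison uses the absolute value of their inner products.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 4

raw = rng.normal(size=(n, d)) * np.array([4.0, 2.0, 1.0, 0.5])
D = raw - raw.mean(axis=0)                  # centered data matrix

# PCA route: eigendecomposition of the covariance matrix.
Sigma = D.T @ D / n
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]              # decreasing order

# SVD route: thin SVD of the data matrix itself.
L, delta, Rt = np.linalg.svd(D, full_matrices=False)
R = Rt.T

# |u_i^T r_i| should be 1 for every i (same direction up to sign).
alignment = np.abs(np.sum(U * R, axis=0))

print("delta^2 == n * lambda          :", np.allclose(delta ** 2, n * lam))
print("R matches U up to sign         :", np.allclose(alignment, 1.0))
print("L holds eigenvectors of D D^T  :",
      np.allclose(D @ D.T @ L, L * delta ** 2))
```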
* Capturing Latent Features in SVD (Finding Topics/Concepts)
    * SVD can be used to find latent topics in the data
    * Example: People and Food Preferences Matrix
    ![latent_topics](latent_topics.png)
    * $\mathbf{\Delta}=$
    ```python
    array([[ 8.45, 0.  , 0.  , 0.  , 0.  ],
           [ 0.  , 6.98, 0.  , 0.  , 0.  ],
           [ 0.  , 0.  , 1.83, 0.  , 0.  ],
           [ 0.  , 0.  , 0.  , 1.5 , 0.  ],
           [ 0.  , 0.  , 0.  , 0.  , 1.1 ],
           [ 0.  , 0.  , 0.  , 0.  , 0.  ],
           [ 0.  , 0.  , 0.  , 0.  , 0.  ]])
    ```
    * $\mathbf{R}^T=$
    ![latent_topics_rt](latent_topics_rt.png)
    * $\mathbf{R}_2=$
    ![latent_topics_r2](latent_topics_r2.png)
* **Multidimensional Scaling (MDS)** - a process used to find a lower dimensional representation that gives the same distances between points
    * distance matrix or proximity matrix:
        * $\mathbf{\Delta}=\left[\begin{array}{cc}
            \mathbf{\delta}_{11} & \mathbf{\delta}_{12} & \cdots & \mathbf{\delta}_{1n} \\
            \mathbf{\delta}_{21} & \mathbf{\delta}_{22} & \cdots & \mathbf{\delta}_{2n} \\
            \vdots & \vdots & \ddots & \vdots \\
            \mathbf{\delta}_{n1} & \mathbf{\delta}_{n2} & \cdots & \mathbf{\delta}_{nn} \\
            \end{array}\right]$
        * $\mathbf{\delta}_{ij}=\Vert \mathbf{x}_i-\mathbf{x}_j\Vert$ (Euclidean distance between two vectors)
    * MDS algorithm steps:
        1. calculate the matrix of squared proximities: $\mathbf{\Delta}^2$ (each entry squared, i.e. $\delta_{ij}^2$)
        2. calculate the double-centered matrix: $\mathbf{B}=-\frac{1}{2}\mathbf{J}\mathbf{\Delta}^2\mathbf{J}$, $\mathbf{J}=\mathbf{I}-\frac{1}{n}\mathbf{1}\mathbf{1}^T$
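        3. Obtain the eigendecomposition of $\mathbf{B}$ (which coincides with its SVD when $\mathbf{B}$ is symmetric positive semidefinite, as it is for Euclidean distances): $\mathbf{B}=\mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$
    * For a 2-dimensional representation, keep the two eigenvectors corresponding to the largest eigenvalues, each scaled by the square root of its eigenvalue: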
    ![mds](mds.png)
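The MDS steps can be sketched in NumPy as follows. This uses the standard classical-MDS convention $\mathbf{B}=-\frac{1}{2}\mathbf{J}\mathbf{\Delta}^2\mathbf{J}$ (note the minus sign) with element-wise squared distances, and scales the kept eigenvectors by the square roots of their eigenvalues. The points are synthetic and 2-dimensional, so a 2-D embedding should reproduce the distances exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 10, 2

# Synthetic points; MDS only gets to see their pairwise distances.
X = rng.normal(size=(n, 2))
diff = X[:, None, :] - X[None, :, :]
Delta = np.sqrt((diff ** 2).sum(axis=-1))      # n x n distance matrix

# Step 1-2: squared proximities, then double centering.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (Delta ** 2) @ J

# Step 3: eigendecomposition of the symmetric matrix B.
eigvals, eigvecs = np.linalg.eigh(B)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Keep the top-k eigenvectors, scaled by sqrt(eigenvalue).
Y = eigvecs[:, :k] * np.sqrt(eigvals[:k])      # n x k embedding

diff_Y = Y[:, None, :] - Y[None, :, :]
Delta_Y = np.sqrt((diff_Y ** 2).sum(axis=-1))
print("distances preserved:", np.allclose(Delta_Y, Delta))
```

The embedding `Y` matches `X` only up to rotation, reflection, and translation, since those leave all pairwise distances unchanged.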

5) **t-Distributed Stochastic Neighbor Embedding (t-SNE)** - a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot
* Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points
    * It is often used to visualize high-level representations learned by an artificial neural network
    ![t_sne_digits](http://2.bp.blogspot.com/--l8yNRipldU/Vg5ECxalQmI/AAAAAAAAAws/s1rDpWuvtaY/s1600/tsne.png)
* t-SNE algorithm comprises two main stages:
    1. t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked
    2. t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the **Kullback–Leibler (KL) divergence** between the two distributions with respect to the locations of the points in the map
        * Note that whilst the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this should be changed as appropriate
* t-SNE algorithm:
    * Given a set of $N$ high-dimensional objects $x_1,\dots,x_N$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of the objects $x_i$ and $x_j$
        * The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j \mid i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$: $$p_{j\mid i} = \frac{\exp\big(\frac{-\Vert x_i - x_j \Vert^2}{2\sigma^2_i}\big)}{\sum_{k \neq i}\exp\big(\frac{-\Vert x_i - x_k \Vert^2}{2\sigma^2_i}\big)}$$
        * The conditional probabilities are then symmetrized: $$p_{ij}=\frac{p_{j \mid i} + p_{i \mid j}}{2N}$$
    * t-SNE aims to learn a $d$-dimensional map $y_1,\dots,y_N$ (with $y_i \in \mathbb{R}^d$) that reflects the similarities $p_{ij}$ as well as possible.
        * To this end, it measures similarities $q_{ij}$ between two points in the map $y_i$ and $y_j$, using a very similar approach, normalized over all pairs of distinct points: $$q_{ij} = \frac{(1+\Vert y_i-y_j \Vert^2)^{-1}}{\sum_{k \neq l} (1+\Vert y_k-y_l \Vert^2)^{-1}}$$
        * Herein a heavy-tailed **Student-t distribution** (with one degree of freedom) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map
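The two similarity distributions and the KL objective can be computed in a few lines of NumPy. This is only a sketch of the quantities involved, not a t-SNE implementation: a single fixed bandwidth `sigma` is assumed for every point (practical implementations tune each $\sigma_i$ to match a target perplexity), $q_{ij}$ is normalized over all pairs of distinct points so that it sums to one, and no gradient descent on the map is performed. Function names are illustrative.

```python
import numpy as np


def squared_distances(Z):
    """Matrix of squared Euclidean distances between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)


def high_dim_affinities(X, sigma=1.0):
    """Symmetrized p_ij from Gaussian conditionals p_{j|i} (fixed sigma)."""
    N = X.shape[0]
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)               # exclude k = i
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)     # row i holds p_{j|i}
    return (P_cond + P_cond.T) / (2.0 * N)


def low_dim_affinities(Y):
    """q_ij from a Student-t kernel with one degree of freedom."""
    w = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(w, 0.0)                        # no self-similarity
    return w / w.sum()                              # normalize over all pairs


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), skipping entries where P is zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))


rng = np.random.default_rng(7)
X = rng.normal(size=(30, 10))       # synthetic high-dimensional objects
Y = rng.normal(size=(30, 2))        # a random (untrained) 2-D map

P = high_dim_affinities(X, sigma=2.0)
Q = low_dim_affinities(Y)
kl = kl_divergence(P, Q)
print("sum of P:", P.sum(), " sum of Q:", Q.sum())
print("KL(P || Q) for the random map:", kl)
```

t-SNE would now move the points in `Y` to reduce this KL value.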
* t-SNE has been used in a wide range of applications:
    * computer security research
    * music analysis
    * cancer research
    * bioinformatics
    * biomedical signal processing