---
title: \[Paper Review\] Dimensionality Reduction as Probabilistic Inference
categories: [Dimension]
tags: [Dimension Reduction]
excerpt: Probabilistic Framework to interpret non-probabilistic Algorithms
sidebar:
  - title: "Dimensionality Reduction"
    image: /assets/img/dimension.png
    image_alt: "image"
    nav: Dimension
author_profile: False
---

# 1. Probabilistic Dimensionality Reduction

Dimensionality reduction is a field that turns intractable high-dimensional data into tractable low-dimensional data by extracting its essential information. By shrinking the size of the data, we can expect the following advantages.

- Reduced computational complexity
- Identification of the essential factors of the data
- Visualization of high-dimensional data

There are many frameworks for approaching dimension reduction, but most of them share the common idea that there are hidden factors deep down in the data, and that what we can observe is only the result of manipulated, noised, or reproduced information of those factors.

<img src="C:\Users\mhg95\Documents\GitHub\HGmin1159.github.io\_posts\2023-09-15-probdr1.assets\image-20230915132715481.png" alt="image-20230915132715481" style="zoom: 33%;" />

By setting a hypothesis about the process through which the hidden factors are realized, we can follow the process in reverse and estimate the hidden factors using the observable data. Because there are various points of view on this process, various algorithms have been developed. In the probabilistic view, the process is set as follows.

$$
\begin{align*}
X & \sim p(\cdot) \\
Y & \sim \text{exponential}(f(X))
\end{align*}
$$

where $X$ is the hidden random vector and $Y$ is the observable random vector. $p$ is an unknown process generating $X$, and $f(\cdot)$ is the way $X$ is transformed into $Y$. Here, "exponential" means any member of the exponential family of distributions, which makes things easier to handle.

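As a small illustration of this setup (my own sketch, not from the paper), the code below instantiates $f$ as a linear map with Gaussian noise, i.e., the probabilistic-PCA special case of the exponential-family observation model; all variable names and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 10, 2                 # samples, observed dim, hidden dim

# Hidden factors: X ~ p(.), here just a standard Gaussian prior
X = rng.standard_normal((n, q))

# f maps the hidden factors to the observed space; the observation model is
# a member of the exponential family (a Gaussian in this sketch)
W = rng.standard_normal((q, p))
sigma = 0.1
Y = X @ W + sigma * rng.standard_normal((n, p))   # Y ~ N(f(X), sigma^2 I)

# Dimensionality reduction then amounts to inferring X (and W) from Y alone
print(Y.shape)   # (500, 10): only Y is observable in practice
```

In practice only `Y` is observed; each probabilistic DR method corresponds to a particular choice of $p$ and $f$ together with an inference procedure for recovering `X`.
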
Below is a list of the most famous dimension reduction algorithms that use the probabilistic framework; I have covered most of them in earlier posts.

- [PCA (Principal Component Analysis)](https://hgmin1159.github.io/dimension/dimension_reduction/)
- [FA (Factor Analysis)](https://hgmin1159.github.io/dimension/dimension_reduction/)
- [GMMs (Gaussian Mixture Models)](https://hgmin1159.github.io/bayesian/GMM/)
- [NMF (Nonnegative Matrix Factorization)](https://hgmin1159.github.io/dimension/mf2/)
- LDA (Latent Dirichlet Allocation)
- [ICA (Independent Component Analysis)](https://hgmin1159.github.io/dimension/mf3/)

The above algorithms are rather traditional. These days, many machine-learning-based methods that are not built on the probabilistic view have been developed, and they have shown amazing results in processing big data. The most popular machine-learning-based algorithms are SNE, t-SNE, and UMAP. However, these algorithms have some weaknesses: they cannot handle "uncertainty", and their process is hard to explain.

The paper ["Aditya Ravuri and Neil D. Lawrence, 2023, Dimensionality Reduction as Probabilistic Inference"](https://arxiv.org/abs/2304.07658) aims to understand non-probabilistic DR within a probabilistic framework. Using various statistical methods, it reconstructs the famous ML-based dimension reduction algorithms, including SNE, t-SNE, UMAP, CMDS, Isomap, Kernel PCA, MVU, LE, Spectral Embeddings, LLE, and Diffusion Maps.

They used Variational Inference, GPLVMs, and the Matern-$\nu$ graph Gaussian Process to reconstruct the DR methods. Therefore, before getting into ProbDR, let's go over each of these fields.

1. Learning-based DR - SNE, t-SNE, UMAP
2. Kernel-based DR - Multidimensional Scaling (MDS), Isomap, Kernel PCA, MVU (Maximum Variance Unfolding), Diffusion Maps
3. Other DR algorithms - Laplacian Eigenmaps, Spectral Embedding, Locally Linear Embedding (LLE)
4. Variational Inference
5. Gaussian Process Latent Variable Models (GPLVMs)
6. Matern-$\nu$ Graph Gaussian Process

***

This post is ongoing. After I have written all of the posts in the list above, I'll finish it.

---
title: \[Paper Review\] Dimensionality Reduction as Probabilistic Inference
categories: [Dimension]
tags: [Dimension Reduction]
excerpt: Probabilistic Framework to interpret non-probabilistic Algorithms
sidebar:
  - title: "Dimensionality Reduction"
    image: /assets/img/dimension.png
    image_alt: "image"
    nav: Dimension
author_profile: False
---

# 1. SNE and t-SNE

Stochastic Neighbor Embedding (SNE) is a nonlinear dimensionality reduction method developed by Geoffrey Hinton. It assumes a distribution over the given data points $X_i \in \mathbb{R}^p, i=1,2,...,n$ and their connectivity. Here $n$ is the number of samples and $p$ is the dimension of the original data. By constructing low-dimensional data points $Y_i \in \mathbb{R}^q, i=1,2,...,n$ that preserve the same connectivity as the original data points, we obtain a new data set with much lower dimension that still preserves the connectivity of the original data. t-SNE is an algorithm that adds t-distribution concepts to SNE.

Let's look at the detailed process. To understand SNE and t-SNE, we need to focus on three parts of the algorithms.

First, we define the measure of connectivity of the original data.

In the original paper, they suggested the following measure:

> $p_{i \mid j} = \frac{\exp( -d_{ij}^2)}{\sum_{k \neq i} \exp( -d_{ik}^2)}$
>
> where $d_{ij}$ is a metric between the i-th and j-th data points. For example, $d_{ij}^2 = \parallel x_i - x_{j} \parallel_2^2/2\sigma_i^2$ works.

Regarding this measure, there are several noteworthy points.

- The measure $p_{i \mid j}$ is asymmetric. That is, $p_{i \mid j}$ and $p_{j \mid i}$ are different, so we need to make it symmetric, e.g. $p_{ij} = \frac{1}{c}(p_{i \mid j }+ p_{j \mid i})$, to get a consistent result. Here $c$ is a scaling parameter, and the authors suggest setting it in terms of the sample size $n$.

- If we use $d_{ij}^2 = \parallel x_{i} - x_{j} \parallel_2^2/2\sigma_i^2$, then $p_{i \mid j}$ becomes the probability that "$x_i$ follows the same distribution as $x_j$, given that all of the points follow Gaussian distributions and $x_i$ belongs to one of them."

  > Let's assume $X_i \sim N(\mu_i ,\sigma_i^2)$ and $\mu_i = \mu_k$ for one of $k \neq i$.
  >
  > Then, $Pr(\mu_i = \mu_j \mid x_i) = \frac{Pr(x_i \mid \mu_i=\mu_j)Pr(\mu_i = \mu_j)}{\sum_{k\neq i}Pr(x_i\mid \mu_i=\mu_k)Pr(\mu_i=\mu_k)} = \frac{\exp( - \parallel x_i - x_j \parallel_2^2/2\sigma_i^2)}{\sum_{k \neq i} \exp( - \parallel x_i - x_k \parallel_2^2/2\sigma_i^2)}$
  >
  > given that $Pr(\mu_i = \mu_j)$ is the same for every $j$, which means that before seeing any data point, we conjecture that the probability of $\mu_i=\mu_j$ is equal across all $j$.

- Similarly, if we let $d_{ij} = \parallel x_i - x_j \parallel_1^{1/2}$, we get a similar result, but under the assumption that the data points follow a Laplace distribution.
- Using this measure, we can build a kind of adjacency matrix $P = \{p_{ij}\}_{i,j=1}^{n,n}$, as sketched in the code below.

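As a rough illustration of this first step, here is a minimal NumPy sketch (my own, not from the paper) that builds the symmetric affinity matrix, assuming Gaussian affinities with a single global bandwidth `sigma`; the real algorithms tune a per-point $\sigma_i$ through a perplexity target, which is skipped here.

```python
import numpy as np

def affinity_matrix(X, sigma=1.0):
    """Symmetric SNE-style affinities P from data X of shape (n, p)."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # (n, n)

    # Conditional probabilities p_{j|i}, with the diagonal excluded
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # forces p_{i|i} = 0
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)

    # Symmetrize: p_ij = (p_{i|j} + p_{j|i}) / (2n), so the entries sum to 1
    return (P_cond + P_cond.T) / (2.0 * n)

P = affinity_matrix(np.random.default_rng(0).standard_normal((100, 5)))
print(P.sum())   # ~1.0
```

Here the symmetrized matrix is scaled by $2n$ so that its entries sum to one, which is one concrete choice for the scaling constant $c$ above.
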
Secondly, we set up the dissimilarity structure of the reduced data $Y$.

The original SNE paper suggests a form similar to that of $p$. However, Laurens van der Maaten suggested using the t-distribution with one degree of freedom.

> Original SNE
>
> $q_{i \mid j} =\frac{\exp( -\parallel y_i - y_j \parallel_2^2/2\sigma_{i}^2)}{\sum_{k \neq i} \exp( -\parallel y_k -y_j \parallel_2^2/2\sigma_{i}^2)}$
>
> t-SNE
>
> $q_{i \mid j} = \frac{(1+\parallel y_i - y_j \parallel^2)^{-1}}{\sum_{k} \sum_{l\neq k}(1+\parallel y_k - y_l \parallel^2)^{-1}}$
>
> $q_{ij} =\frac{1}{c}(q_{i \mid j} + q_{j \mid i })$

Either way, we assume another distribution that lives in a much lower dimension than the original data. Therefore, the aim is to build a new data set that has the same connectivity but follows a different, less complicated distribution.

It is said that the Cauchy distribution is much more robust to outliers in high-dimensional data.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Student_t_pdf.svg/1024px-Student_t_pdf.svg.png" alt="undefined" style="zoom: 33%;" />

The figure above shows the shape of the t-distribution with $\nu$ degrees of freedom. It converges to the standard normal distribution as $\nu$ goes to infinity. Observe that the probability in the extreme regions becomes larger as $\nu$ becomes small. That is, by using the t-distribution, we place less confidence and allow more flexibility in the distributional assumption. Distributions with this kind of shape are called heavy-tailed distributions.

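To make the heavy-tail point concrete, here is a quick numerical check of the two-sided tail probability $P(\lvert T \rvert > 3)$ for several degrees of freedom. It assumes SciPy is available, which is an extra dependency not mentioned in the post.

```python
from scipy import stats

# Two-sided tail probability P(|T| > 3) for t-distributions vs. the normal
for nu in [1, 2, 5, 30]:
    print(f"t(nu={nu:2d}): {2 * stats.t.sf(3, df=nu):.4f}")
print(f"normal  : {2 * stats.norm.sf(3):.4f}")
```

The tail mass shrinks monotonically as `nu` grows, and the $\nu=1$ case (the Cauchy kernel used by t-SNE) keeps by far the most probability far from the center.
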
Finally, let's see how we can find $Y$ with the structure above. In the original paper, they used the Kullback-Leibler divergence, which measures the difference between two distributions.

> $\min_{Y} KL(P\mid \mid Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

- The KL divergence is $E(\log p(X) - \log q(X) \mid \text{the measure } p \text{ is true in reality})$, i.e., the expectation of the difference of log-probabilities taken under the true distribution $P$.
- Note that the KL divergence is also asymmetric. But in this case we have a designated true distribution $P$, so we don't need to adjust for the asymmetry.

Now, we use gradient-based methods to find $Y$. Basically, we can use gradient descent.

> $Y^{+} = Y - t \nabla_Y KL(P \mid \mid Q) $
>
> Original SNE : $\frac{\partial}{\partial y_i} KL(P \mid \mid Q) = c \sum_{j \neq i}(p_{i,j} -q_{i,j})(y_i-y_j)$
>
> t-SNE : $\frac{\partial}{\partial y_i} KL(P \mid \mid Q) = c \sum_{j \neq i}(p_{i,j} -q_{i,j})(1+ \parallel y_i-y_j \parallel^2)^{-1}(y_i-y_j)$

So we can finally find the reduced data set $Y$. A minimal sketch of this loop is given below.

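Putting the three parts together, here is a minimal (and deliberately unoptimized) sketch of one t-SNE gradient step, reusing the `affinity_matrix` helper sketched earlier. Real implementations add perplexity calibration, momentum, and early exaggeration, all of which are omitted; the learning rate and iteration count are arbitrary choices of mine.

```python
import numpy as np

def tsne_step(Y, P, lr=100.0):
    """One gradient-descent update of the embedding Y under KL(P || Q)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)                # Student-t (Cauchy) kernel
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                         # joint q_ij

    # dKL/dy_i = 4 * sum_j (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^{-1} (y_i - y_j)
    PQ = (P - Q) * inv                          # symmetric (n, n) matrix
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return Y - lr * grad

# Toy usage: affinity_matrix is the sketch from the first step above
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
P = affinity_matrix(X)
Y = 1e-2 * rng.standard_normal((100, 2))
for _ in range(200):
    Y = tsne_step(Y, P)
```
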
# 2. UMAP

# 3. MDS

Multidimensional Scaling (MDS) is a method to find reduced axes that preserve the pairwise distance information of the data. It seems similar to SNE, but its structure and the way the solution is found are quite different.

The ordinary MDS is also called Principal Coordinate Analysis (PCoA). As the name suggests, it has a structure similar to that of PCA. The objective function of PCoA is as follows.

$$
\min_{X \in \mathbb{R}^{n \times q}} \left(\frac{\sum_{i,j}(b_{ij}-x_i^tx_j)^2}{\sum_{i,j}b_{ij}^2}\right)^{1/2}
$$

- The vector $x_i$ is q-dimensional and represents the features of the i-th sample. Meanwhile, $b_{ij}$ indicates the closeness or similarity between the i-th and j-th elements of the original data, being 0 if they're distant and 1 if they're close.
- $\left(\sum_{i,j} a_{i,j}^2\right)^{1/2} = (tr(A^tA))^{1/2}$ is a matrix norm called the Schatten-2 norm, the Hilbert-Schmidt norm, or the Frobenius norm.
- Therefore, the cost function is proportional to $\parallel B-XX^t \parallel_F$. By the Eckart-Young-Mirsky theorem, the minimizer of this cost function is the rank-q low-rank approximation of $B$, as sketched in the code below.

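As an illustration, here is a small sketch of classical MDS / PCoA under the usual convention: square the distances, double-center them to obtain $B$, and keep the top-$q$ eigenpairs. The function name and the toy data are my own choices, not code from the paper or any particular library.

```python
import numpy as np

def classical_mds(D, q=2):
    """Classical MDS / PCoA from an (n, n) matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix

    # Top-q eigenpairs of B give its best rank-q approximation (Eckart-Young)
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:q]
    L = np.clip(eigvals[idx], 0.0, None)       # guard against tiny negatives
    return eigvecs[:, idx] * np.sqrt(L)        # (n, q) embedding X with XX^t ~ B

# Toy usage: recover a 2-D configuration from its own distance matrix
rng = np.random.default_rng(0)
pts = rng.standard_normal((50, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = classical_mds(D, q=2)
print(X.shape)   # (50, 2)
```

Note that everything here operates on an $n \times n$ matrix, which is why the computational remarks in the next list matter.
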
The classical MDS has the following traits.

- As a result, $X$ is an $n \times q$ reduced matrix whose inner product matrix $XX^t$ resembles the distance-derived similarity matrix of the original data (assuming the data are centered).
- Note that PCA is a low-rank approximation of the covariance matrix of the original data. On the other hand, PCoA is a low-rank approximation of the distance-based matrix of the original data.
- For an $n \times p$ data matrix, the covariance matrix is $p \times p$ but the distance matrix is $n \times n$. Because the spectral decomposition of an $m \times m$ matrix has about $O(m^{2.3})$ time complexity and we usually have $n \gg p$, PCoA needs much more computational resources.

So, there is another way to approach the MDS problem, **Metric Multidimensional Scaling (mMDS)**. Let's see its objective function.

$$
Stress_{D}(X) = \left( \sum_{i\neq j} (d_{ij}- \parallel x_i - x_j\parallel)^2 \right)^{1/2}
$$

- It is a much more direct approach in that it targets matching the distances of the original data $d_{ij}$ with the distances of the reduced data $\parallel x_i - x_j \parallel$. The way to measure the distance can be chosen according to the user's taste.
- The optimization strategies for mMDS are somewhat specialized, such as stress majorization. Most of them are iterative methods, including gradient descent and SMACOF; a plain gradient-descent version is sketched after this list.
- The main differences between classical MDS and mMDS lie in 1. the structure of the target function, 2. the flexibility in the choice of distance measure, and 3. the optimization methods.
- Because the distance matrix of the result is similar to the distance matrix of the original data, it can be considered a more intuitive method.

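For intuition, here is a naive gradient-descent sketch of the stress minimization above. This is not SMACOF; the learning rate, iteration count, and random initialization are arbitrary choices of mine.

```python
import numpy as np

def mmds_gradient_descent(D, q=2, lr=0.05, n_iter=500, seed=0):
    """Minimize sum_{i != j} (d_ij - ||x_i - x_j||)^2 by plain gradient descent."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.standard_normal((n, q))
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]            # diff[i, j] = x_i - x_j
        dist = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(dist, 1.0)                      # avoid division by zero
        # d/dx_i of sum_j (d_ij - dist_ij)^2 = -2 sum_j (d_ij - dist_ij)(x_i - x_j)/dist_ij
        coeff = (D - dist) / dist
        np.fill_diagonal(coeff, 0.0)
        grad = -2.0 * np.sum(coeff[:, :, None] * diff, axis=1)
        X -= lr * grad
    return X

# Toy usage with a distance matrix built from random 3-D points
pts = np.random.default_rng(1).standard_normal((40, 3))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = mmds_gradient_descent(D, q=2)
```
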
On the other hand, we may need to find an adequate way to measure the distance, because sometimes the usual way of calculating similarity does not work.

- Just think about the case where the features contain nominal variables. There is no common method to measure the distance between nominal data.

- Consider situations where differences aren't linear, such as the distribution of wealth on Earth. It is strikingly exponential, given that the top 1% of the wealthiest individuals possess half of the world's riches.

- The variant that handles such cases, by replacing the raw dissimilarities with a monotone transformation of them, is called Non-metric Multidimensional Scaling (NMDS).

$$
\underset{X \in \mathbb{R}^{n\times q},\; f \in F}{\min} \left( \sum_{i\neq j} \big(f(d_{ij})- \parallel x_i - x_j \parallel\big)^2 \right)^{1/2}
$$

where $F$ is the set of monotone increasing functions.

1. Learning-based DR - SNE, t-SNE, UMAP
2. Kernel-based DR - Multidimensional Scaling (MDS), Isomap, Kernel PCA, MVU (Maximum Variance Unfolding), Diffusion Maps
3. Other DR algorithms - Laplacian Eigenmaps, Spectral Embedding, Locally Linear Embedding (LLE)
4. Variational Inference
5. Gaussian Process Latent Variable Models (GPLVMs)
6. Matern-$\nu$ Graph Gaussian Process
