## t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE is purely a visualization method: it gives intuition about the structure of a high-dimensional embedding

### idea and interpretation

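idea: minimize the divergence between 2 distributions

- a distribution that measures pairwise similarities of the **word embedding vectors** $\phi(w_i) \in \mathbb{R}^d$

- a distribution that measures pairwise similarities of the **corresponding visualization vectors** $y_i \in \mathbb{R}^2$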


interpretation: close will be close, far will be far

- if word embedding vector $\phi(w_i) \in \mathbb{R}^d$ is very **close** to $\phi(w_j)$,

    then visualization vector $y_i \in \mathbb{R}^2$ will also be **close** to $y_j$

- while if word embedding vector $\phi(w_i)$ is very **far** from $\phi(w_j)$,

    then visualization vector $y_i$ will also be **far** from $y_j$


- long distances may be stretched further, so the gaps between far-apart groups should not be read as literal distances

### algorithm

1. form a Gaussian kernel over vocabulary based on embedding vectors $\phi(w_i) \in \mathbb{R}^d$


2. scale and symmetrize to produce a matrix $P=[P_{ij}]$


3. represent word $i$ by a visualization vector $y_i \in \mathbb{R}^2$ (suppose we want to visualize in 2D)

    measure similarities between the $y_i$ with a heavy-tailed Student t-distribution with df=1, producing a matrix $Q=[Q_{ij}]$
    

4. solve for the optimal $y_i$ using gradient descent (the original t-SNE algorithm uses full-batch gradient descent with momentum)

- **first**, for each word $w_i$, compute a language model

    a conditional distribution model: given word $i$, the probability that word $j$ is picked as word $i$'s neighbor:

$$
P_{j|i} \propto \exp\left [ -\frac{\left \| \phi (w_i)-\phi (w_j)\right \|^2}{2 h_i^2} \right ]
$$


normalizing the Gaussian kernel gives:

$$
P_{j|i} =\frac{\exp\left [ -\frac{\left \| \phi (w_i)-\phi (w_j)\right \|^2}{2 h_i^2} \right ]}{\sum _{k \neq i}\exp\left [ -\frac{\left \| \phi (w_i)-\phi (w_k)\right \|^2}{2 h_i^2} \right ]} 
$$

where $k$ runs over all words in the vocabulary other than $i$, and $P_{i|i}=0$

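$h_i$ is the bandwidth for word $i$; it is chosen separately for each word so that the perplexity of its conditional distribution equals a fixed target (10 here)

this puts all the conditional distributions on the same scale

the target perplexity is often set between 5 and 50: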
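A minimal NumPy sketch of the conditional distribution $P_{j|i}$ above. The function name `conditional_probabilities`, the random stand-in data `X`, and the fixed bandwidths `h = 1` are assumptions for illustration; in the actual procedure each $h_i$ is tuned to match the target perplexity.

```python
import numpy as np

def conditional_probabilities(X, h):
    """Row i holds P_{j|i}: a Gaussian kernel centred on X[i], normalized over j != i.

    X: (n, d) array of embedding vectors (stand-ins for phi(w_i)).
    h: (n,) array of per-point bandwidths h_i (assumed given here).
    """
    # Squared Euclidean distances ||x_i - x_j||^2 for all pairs.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * h[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                      # 6 toy "words" in R^4
P_cond = conditional_probabilities(X, h=np.ones(6))
print(P_cond.sum(axis=1))                        # each row is a probability distribution
```

Note that `P_cond` is generally not symmetric, because each row has its own normalizer (and, in general, its own bandwidth).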

$$
e^{H(j|i)} \approx 10
$$



where $H(j|i)$ is the entropy of the conditional distribution for word $i$

$$
H(j|i)=-\sum_j P_{j|i} \log(P_{j|i})
$$
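One common way to pick $h_i$ is a bisection search, since the perplexity $e^{H(j|i)}$ grows as the bandwidth grows. The sketch below assumes natural logarithms (matching the $e^{H}$ form above); the helper names `row_perplexity` and `find_bandwidth`, the search bounds, and the toy data are illustrative choices, not part of the notes.

```python
import numpy as np

def row_perplexity(sq_dists_i, h):
    """Perplexity exp(H) of P_{.|i}, given squared distances from point i to the others."""
    logits = -sq_dists_i / (2.0 * h ** 2)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    nz = p > 0                             # skip underflowed entries (0 * log 0 = 0)
    entropy = -np.sum(p[nz] * np.log(p[nz]))
    return np.exp(entropy)

def find_bandwidth(sq_dists_i, target=10.0, iters=100):
    """Bisect on h (in log space) until the perplexity matches the target."""
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        mid = np.sqrt(lo * hi)             # geometric midpoint
        if row_perplexity(sq_dists_i, mid) > target:
            hi = mid                       # too many effective neighbors: shrink h
        else:
            lo = mid                       # too few: grow h
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
sq_dists_0 = np.sum((X[1:] - X[0]) ** 2, axis=1)   # point 0 versus the other 49
h_0 = find_bandwidth(sq_dists_0, target=10.0)
perp_0 = row_perplexity(sq_dists_0, h_0)
print(h_0, perp_0)
```

The target must lie between 1 and the number of other points (49 here), because those are the smallest and largest perplexities a distribution over 49 outcomes can have.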

the language model $P_{j|i}$ is where the non-linearity comes from: it is a non-linear function of the distances between embedding vectors

a simple way of symmetrizing:

$$
P_{ij} = \frac{1}{2} (P_{j|i}+P_{i|j})
$$

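where $P_{ij}$ is an entry of the symmetric matrix $P$

with $N$ words, the entries of this $P$ sum to $N$; dividing by $N$ as well, i.e. $P_{ij} = \frac{1}{2N}(P_{j|i}+P_{i|j})$, makes them sum to 1, which matches the normalization of $Q$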
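A small worked example of the symmetrization, using a hand-written 3x3 conditional matrix (the numbers are made up for illustration). It also shows the extra division that turns $P$ into a joint distribution summing to 1.

```python
import numpy as np

# Rows are P_{.|i}: each row sums to 1, the diagonal is 0 (toy values).
P_cond = np.array([
    [0.0, 0.7, 0.3],
    [0.5, 0.0, 0.5],
    [0.2, 0.8, 0.0],
])

P = 0.5 * (P_cond + P_cond.T)   # P_ij = (P_{j|i} + P_{i|j}) / 2
P_joint = P / P.sum()           # same as dividing by 2N; entries now sum to 1

print(P)
print(P.sum(), P_joint.sum())
```

The simple average is symmetric but sums to the number of points; the rescaled `P_joint` is the version that can be compared with a normalized $Q$ in a KL divergence.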

for comparison, Kernel PCA:

- first embed the data in a high dimensional space, get a kernel matrix (the analogue of $P_{j|i}$)


- then do SVD on that matrix to project onto a finite-dimensional space

- **second**, form a Student-t distribution based on the visualization vectors $y_i \in \mathbb{R}^2$ 

it has fatter tails than the Gaussian, which encourages close vectors to be closer and spread-out vectors to be more spread out

each entry of matrix $Q$ is proportional to:

$$
Q_{ij} \propto \left ( 1+\left \| y_i -y_j \right \|_2^2 \right )^{-1}
$$

which comes from the normalized form

$$
Q_{ij} = \frac{\left ( 1+\left \| y_i -y_j \right \|_2^2 \right )^{-1}}{\sum _{k \neq l}\left ( 1+\left \| y_k -y_l \right \|_2^2 \right )^{-1}}
$$

where the denominator sums over all pairs $k \neq l$, so the entries of $Q$ sum to 1
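The matrix $Q$ can be sketched in a few lines of NumPy. The function name `student_t_affinities` and the three hand-picked 2D points are assumptions for illustration.

```python
import numpy as np

def student_t_affinities(Y):
    """Q_ij proportional to (1 + ||y_i - y_j||^2)^{-1}, normalized over all pairs i != j."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)   # unnormalized Student-t (df=1) kernel
    np.fill_diagonal(W, 0.0)     # pairs with i == j are excluded
    return W / W.sum()           # one global normalizer, not one per row

# Squared distances: (0,1) -> 1, (0,2) -> 9, (1,2) -> 10.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
Q = student_t_affinities(Y)
print(Q)
```

Unlike $P_{j|i}$, which is normalized row by row, $Q$ uses a single normalizer over the whole matrix, so it is symmetric by construction.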

- **finally**, run gradient descent to minimize the Kullback-Leibler divergence between matrices $P$ and $Q$ over the vectors $y_i$ 

    Kullback-Leibler divergence is a kind of non-linear loss

$$
\hat y = \arg \min_{y} \sum_{ij} P_{ij} \log \left ( \frac{P_{ij}}{Q_{ij}} \right ) 
$$

where $P_{ij}$ is observed (fixed, computed from the embedding vectors) and $Q_{ij}$ is unknown (we are learning it), since it is dictated by the visualization vectors $y_i$; recall that

$$
P_{ij} = \frac{1}{2} (P_{j|i}+P_{i|j})
$$
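The optimization step can be sketched with plain full-batch gradient descent. This is a simplified illustration, not the full algorithm: it omits momentum and other refinements, builds a toy $P$ from a single shared bandwidth instead of per-word perplexity tuning, and assumes $P$ is symmetric and normalized to sum to 1 (which the gradient formula in `kl_gradient` relies on). The gradient used is $\partial C / \partial y_i = 4 \sum_j (P_{ij}-Q_{ij})(y_i-y_j)(1+\|y_i-y_j\|^2)^{-1}$; the learning rate, step count, and toy data are arbitrary choices.

```python
import numpy as np

def student_t_affinities(Y):
    """Return (Q, W): normalized and unnormalized Student-t (df=1) similarities."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl_divergence(P, Q):
    """KL(P || Q) summed over all pairs with P_ij > 0 (the diagonal is skipped)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def kl_gradient(P, Y):
    """Gradient of KL(P || Q) w.r.t. each y_i; assumes P is symmetric and sums to 1."""
    Q, W = student_t_affinities(Y)
    M = (P - Q) * W
    # grad_i = 4 * sum_j M_ij * (y_i - y_j)
    return 4.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)

# Toy target P: Gaussian similarities between two well-separated clusters in R^5.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(4, 5)),
               rng.normal(3.0, 0.3, size=(4, 5))])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
P = np.exp(-sq / 2.0)
np.fill_diagonal(P, 0.0)
P = P / P.sum()

Y = rng.normal(0.0, 0.1, size=(8, 2))    # small random initial layout
kl_start = kl_divergence(P, student_t_affinities(Y)[0])
for _ in range(1000):
    Y -= 0.1 * kl_gradient(P, Y)         # plain gradient step, small learning rate
kl_end = kl_divergence(P, student_t_affinities(Y)[0])
print(f"KL before: {kl_start:.4f}, after: {kl_end:.4f}")
```

For real data, a library implementation is the better choice; this sketch is only meant to make the loss and its gradient concrete.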

### Student t-distribution

- general form of Student t-distribution

$$
f(t)=\frac{\Gamma (\frac{v+1}{2})}{\sqrt{v\pi}\Gamma (\frac{v}{2})}\left ( 1+\frac{t^2}{v} \right )^{-\frac{v+1}{2}}
$$

where $v$ is the number of degrees of freedom


- simplified form with 1 degree of freedom, also called **Cauchy distribution** 

$$
f(t)=\frac{1}{\pi}\left ( 1+t^2 \right )^{-1}
$$
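As a quick check of this simplification, using the standard values $\Gamma(1)=1$ and $\Gamma(\frac{1}{2})=\sqrt{\pi}$, setting $v=1$ in the general form gives:

$$
f(t)=\frac{\Gamma (1)}{\sqrt{\pi}\,\Gamma (\frac{1}{2})}\left ( 1+t^2 \right )^{-1}=\frac{1}{\sqrt{\pi}\sqrt{\pi}}\left ( 1+t^2 \right )^{-1}=\frac{1}{\pi}\left ( 1+t^2 \right )^{-1}
$$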

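by comparison, the Gaussian kernel behind the language model $P_{j|i}$ decays like $e^{-t^2}$ (up to the bandwidth scaling), where $t$ is the distance between two vectors; this is much faster than the $(1+t^2)^{-1}$ decay of the Student-t kernel behind $Q_{ij}$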
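To see the difference in tails numerically, the sketch below compares the two unnormalized kernels at a few distances. The function names and the bandwidth `h = 1` are assumptions for illustration.

```python
import math

def gaussian_kernel(t, h=1.0):
    """Unnormalized Gaussian similarity exp(-t^2 / (2 h^2))."""
    return math.exp(-t ** 2 / (2.0 * h ** 2))

def cauchy_kernel(t):
    """Unnormalized Student-t (df=1) similarity (1 + t^2)^{-1}."""
    return 1.0 / (1.0 + t ** 2)

for t in (0.5, 1.0, 2.0, 5.0):
    print(f"t={t}: gaussian={gaussian_kernel(t):.2e}, cauchy={cauchy_kernel(t):.2e}")
```

At small distances the two kernels are of similar size, but at large distances the Gaussian value collapses toward zero while the Student-t value shrinks only polynomially. A moderate distance in the embedding space therefore has to be matched by a larger distance in the visualization, which is where the stretching of long distances comes from.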