# UMAP tutorial
# Resources
* https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
* https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
* https://towardsdatascience.com/how-to-program-umap-from-scratch-e6eff67f55fe
* https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/
* https://en.wikipedia.org/wiki/Connectivity_(graph_theory)
* https://en.wikipedia.org/wiki/Euclidean_distance

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

In [28]:
expr = pd.read_csv('../../../data/CAFs.txt', sep='\t')

In [29]:
expr.head()

Unnamed: 0,1110020A21Rik,1110046J04Rik,1190002F15Rik,1500015A07Rik,1500015O10Rik,1700010K23Rik,1700012D01Rik,1810011H11Rik,2010204K13Rik,2310057J18Rik,...,Wif1,Wisp2,Yy2,Zfp2,Zfp36,Zfp454,Zfp652os,Zfp81,Zfp944,cluster
SS2_15_0048_A3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,107.526495,0.0,201.533078,0.0,0.0,0.0,0.0,1
SS2_15_0048_A6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,230.755035,0.0,0.0,...,0.0,0.0,0.0,0.0,175.071938,0.0,0.0,0.0,0.0,1
SS2_15_0048_A5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,375.177236,0.0,0.0,...,0.0,0.0,0.0,0.0,290.743379,0.0,0.0,0.0,0.0,1
SS2_15_0048_A4,0.0,0.0,0.0,0.0,891.488043,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
SS2_15_0048_A1,43.324338,0.0,0.0,0.0,0.0,0.0,0.0,20.527119,65.766243,0.0,...,0.0,0.0,0.0,0.0,768.997431,0.0,0.0,0.0,2.413509,1


In [42]:
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the  data set: ")
print(X_train.shape, y_train.shape)


This data set contains 716 samples

Dimensions of the  data set: 
(716, 557) (716,)


### 1. Compute Squared Pairwise Euclidean Distance Matrix
This will be used for the initial high-dimensional data set. Treat each row of a matrix, `X`, as a vector; the squared pairwise Euclidean distance matrix then holds the squared distance between every pair of rows. Recall that the Euclidean distance between two $n$-dimensional vectors, $x$ and $y$, is defined as:

$$\text{Euclidean Distance} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} $$

We can define it in code as:




In [83]:
a = np.array([1,4,3,7])
b = np.array([5,1,6,7])

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b)**2))

euclidean_distance(a, b)

5.830951894845301

Note that for this implementation we will be **squaring** the Euclidean distances. Scikit-learn provides the `euclidean_distances` function, which we can use:

In [84]:
dist = np.square(euclidean_distances(X_train, X_train))

In [86]:
pd.DataFrame(dist)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,706,707,708,709,710,711,712,713,714,715
0,0.000000,914.950163,1477.468361,3036.911722,1433.734302,3246.777478,3196.617421,1683.945244,1493.029622,1474.278633,...,2408.598901,2230.479562,1306.721575,1316.501903,1563.305853,3093.630690,3400.304971,1058.566144,3328.595824,1478.189282
1,914.950163,0.000000,1307.392946,2960.415600,1513.019853,2988.217818,3180.651504,1756.129013,1493.183536,1319.675555,...,2391.512893,2205.686665,1386.919424,1373.626611,1648.813104,3313.505222,3330.411220,867.051976,3044.978347,1376.963129
2,1477.468361,1307.392946,0.000000,2678.344426,1404.936894,2920.436888,3022.925486,1860.169864,1394.458730,1293.047515,...,2352.261664,2781.247570,1741.879213,1863.113818,1859.956229,3627.469676,3197.032903,1400.747050,2986.189517,1308.086666
3,3036.911722,2960.415600,2678.344426,0.000000,2833.064000,1690.638732,1828.177877,2629.107910,2280.411129,3186.498756,...,1772.263126,2555.364064,2779.614583,2904.331176,2710.834583,3457.504665,1849.690465,2843.794892,1343.375827,2707.317181
4,1433.734302,1513.019853,1404.936894,2833.064000,0.000000,3102.982919,3290.133727,1619.241883,1139.226883,1465.024756,...,1917.132342,3058.591015,1727.042658,1978.056400,1888.497975,3586.046188,2996.018197,1572.172421,3057.076635,1110.071095
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
711,3093.630690,3313.505222,3627.469676,3457.504665,3586.046188,3625.620604,3986.961621,3432.711108,3428.893683,3810.853003,...,3378.107145,3006.843464,3311.964090,3187.086360,3350.420918,0.000000,3490.796471,3121.574516,3808.310115,3132.950741
712,3400.304971,3330.411220,3197.032903,1849.690465,2996.018197,2390.711559,2229.701920,3054.763244,2613.221643,3577.300923,...,1920.441807,2764.638365,3096.147705,3296.387405,3052.190845,3490.796471,0.000000,3130.734064,1913.949385,2799.440042
713,1058.566144,867.051976,1400.747050,2843.794892,1572.172421,2837.078426,2950.435268,1493.283819,1349.517334,1436.280819,...,2270.485056,2038.257291,1334.687524,1270.466425,1471.776297,3121.574516,3130.734064,0.000000,2849.678751,1253.931976
714,3328.595824,3044.978347,2986.189517,1343.375827,3057.076635,1659.823876,1465.107868,2780.799018,2465.900252,3548.084912,...,2032.339536,2599.005481,2844.485974,3055.108440,2883.391408,3808.310115,1913.949385,2849.678751,0.000000,2748.920217


We can check to make sure the first entry is calculated as we expected:

In [87]:
a = X_train[0]
b = X_train[1]

np.sqrt(np.sum((a - b)**2)) **2

914.9501631139505

And we do in fact end up with the expected squared distance of `914.95`. Another way to calculate this would be:

In [90]:
np.linalg.norm(a-b)**2

914.9501631139502
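
As a minimal sketch, the whole squared-distance matrix can also be built in plain NumPy from the identity $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a \cdot b$. The helper name `squared_pairwise_distances` and the toy matrix `X_toy` are assumptions made for illustration; they are not part of the notebook above.

```python
import numpy as np

def squared_pairwise_distances(X):
    """Squared Euclidean distance between every pair of rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * (x_i . x_j)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Round-off can leave tiny negative values; clip them to zero.
    return np.maximum(d2, 0.0)

# Hypothetical toy data (5 samples, 3 features), not the CAFs matrix.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))
d2_toy = squared_pairwise_distances(X_toy)

# Brute-force reference: apply the definition to every pair of rows.
d2_ref = np.array([[np.sum((a - b) ** 2) for b in X_toy] for a in X_toy])
agrees = np.allclose(d2_toy, d2_ref)
```

On this toy input the vectorized result should agree with the brute-force definition up to floating-point round-off, which is what `agrees` checks.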

While we are here, we will define the **local connectivity parameter**, $\rho$. In order to define $\rho$, we must discuss the **high-dimensional probability space**. This probability space is a way of describing the similarity of points in the high-dimensional space.

### $t$-SNE aside
Recall that $t$-SNE converts high-dimensional Euclidean distances between data points into conditional probabilities that represent **similarities** (for more detail see [here](https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/)). The similarity of data point (a row) $x_i$ to data point $x_j$ is the conditional probability, $p(x_j \mid x_i)$. This can be interpreted as:

> The probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$.

This probability for a given $x_i$ and $x_j$ is defined as:

$$
p(x_j \mid x_i) = 
\frac{\exp \left(\frac{- \| x_i - x_j \|^2}{2\sigma_i^2} \right)}
{\sum_{k \neq i} \exp \left(\frac{- \| x_i - x_k \|^2}{2\sigma_i^2} \right)}
$$

Here the double vertical bars denote Euclidean distance, and $\sigma_i$ is the bandwidth of the Gaussian centered at $x_i$. For nearby points, the conditional probability of being a neighbor is relatively high, whereas for widely separated points, the conditional probability will be almost infinitesimal.
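
A minimal sketch of this formula, assuming the bandwidths $\sigma_i$ are already known (in $t$-SNE they are tuned per point to match a target perplexity, which is not shown here). The function name `tsne_conditional_probabilities` and the three toy points are hypothetical.

```python
import numpy as np

def tsne_conditional_probabilities(sq_dist, sigma):
    """Row i holds p(x_j | x_i) for a given bandwidth sigma[i]."""
    sq_dist = np.asarray(sq_dist, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    logits = -sq_dist / (2.0 * sigma[:, None] ** 2)
    # Exclude the point itself (k == i) from both numerator and denominator.
    np.fill_diagonal(logits, -np.inf)
    # Subtract the row maximum before exponentiating for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    # Normalize each row so the probabilities for a given x_i sum to 1.
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical toy example: three 1-D points at 0, 1 and 5.
points = np.array([0.0, 1.0, 5.0])
sq_dist_toy = (points[:, None] - points[None, :]) ** 2
P_toy = tsne_conditional_probabilities(sq_dist_toy, sigma=np.ones(3))
```

Each row of `P_toy` sums to 1, and the point at 0 assigns far more probability to its close neighbor at 1 than to the distant point at 5.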

It is important to remember that:
> The conditional probability is the probability that $x_i$ would pick $x_j$ to be its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$.

In other words, **we can think of $p(x_j \mid x_i)$ as the probability that $x_j$ is a neighbor of $x_i$**. 

### Back to UMAP and $\rho$
With that said, UMAP also defines a conditional probability that $x_j$ is a neighbor of $x_i$:

$$p(x_j \mid x_i) = \exp\left( - \frac{d(x_i, x_j) - \rho_i}{\sigma_i} \right)$$

Here $\rho_i$ is the distance from the $i$th data point to its first nearest neighbor (the closest data point to $x_i$). This ensures **local connectivity of the manifold**: for the nearest neighbor, $d(x_i, x_j) = \rho_i$, so the exponent is zero and the probability is 1, meaning every point is connected to at least one other point. In other words, this gives a locally adaptive exponential kernel for each data point, so the **distance metric varies from point to point**.

Note the key differences between the conditional probability that UMAP defines and that of $t$-SNE:
* UMAP does not normalize the conditional probability.
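
A minimal sketch of the UMAP probability above, assuming a fixed $\sigma_i$ for every point (UMAP itself fits $\sigma_i$ per point, which is not shown here). The function name `umap_conditional_probabilities` and the toy points are hypothetical, and setting the diagonal to zero is an assumption of this sketch, since the formula only concerns pairs with $j \neq i$.

```python
import numpy as np

def umap_conditional_probabilities(dist, sigma):
    """Return (rho, p), where p[i, j] = exp(-(dist[i, j] - rho[i]) / sigma[i])."""
    dist = np.asarray(dist, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # After sorting a row, index 0 is the self-distance (0),
    # so index 1 is the distance to the nearest neighbor.
    rho = np.sort(dist, axis=1)[:, 1]
    p = np.exp(-(dist - rho[:, None]) / sigma[:, None])
    # Assumption of this sketch: a point is not its own neighbor.
    np.fill_diagonal(p, 0.0)
    return rho, p

# Hypothetical toy example: three 1-D points at 0, 1 and 5.
points = np.array([0.0, 1.0, 5.0])
dist_toy = np.abs(points[:, None] - points[None, :])
rho_toy, p_toy = umap_conditional_probabilities(dist_toy, sigma=np.ones(3))
row_sums = p_toy.sum(axis=1)
```

Two things are visible on the toy input: every point gives probability exactly 1 to its nearest neighbor (local connectivity), and the rows of `p_toy` do not sum to 1, because no normalization is applied.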


LEAVING OFF
* What is a kernel? 
* What is purpose of $\rho$? 
* What is local connectivity? 

In [41]:
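# dist holds squared distances. After sorting a row, index 0 is the
# self-distance (0), so index 1 is the (squared) distance to the nearest neighbor.
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]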

In [44]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,547,548,549,550,551,552,553,554,555,556
0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,4.686994,0.0,5.310903,0.000000,0.0,0.0,0.000000
1,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,5.445681,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,5.170893,0.000000,0.0,0.0,0.000000
2,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,5.930060,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,5.675875,0.000000,0.0,0.0,0.000000
3,0.000000,0.0,0.000000,0.0,6.794013,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000
4,3.791534,0.0,0.000000,0.0,0.000000,0.0,0.0,3.069313,4.201198,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,6.646387,0.000000,0.0,0.0,1.227741
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
711,0.000000,0.0,0.000000,0.0,6.921657,0.0,0.0,0.000000,3.643672,0.000000,...,3.100076,0.0,0.0,3.111447,0.0,3.025396,1.054224,0.0,0.0,3.570110
712,3.655752,0.0,0.000000,0.0,7.354515,0.0,0.0,0.000000,0.000000,0.000000,...,4.343768,0.0,0.0,0.000000,0.0,3.752833,0.000000,0.0,0.0,0.000000
713,0.000000,0.0,3.993747,0.0,1.919504,0.0,0.0,4.729406,4.151709,3.024468,...,0.000000,0.0,0.0,0.000000,0.0,4.266139,0.000000,0.0,0.0,0.000000
714,0.000000,0.0,0.000000,0.0,7.788297,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,4.042943
