# Genomics & High-Dimensional Data

#### Preliminaries

##### Libraries

In [1]:
from os import path
import numpy as np
import pandas as pd
from sklearn import decomposition, cluster, manifold
import plotly.express as px
import plotly.graph_objects as go
from alive_progress import alive_it

##### Utilities

In [2]:
from utilities import json as utl_json

##### Configuration

In [3]:
env_config = utl_json.to_dict(file_path="../../config/env.json")

## Dataset

In [4]:
X = np.load(
    file=path.normpath(
        path.join(
            env_config['root'],
            "modules/m2/data/p1",
            "X.npy"
        )
    )
)

In [5]:
X_log = np.log2((X + 1))

## T-SNE PCs

In [6]:
pc_lst = [
    10,
    50,
    100,
    250,
    500
]

In [7]:
tsne_dict = {}
for components in alive_it(pc_lst):
    pca = decomposition.PCA(n_components=components)
    X_redux = pca.fit_transform(X_log)
    tsne = manifold.TSNE(n_components=2, max_iter=1000, n_iter_without_progress=500)
    tsne_X_redux = tsne.fit_transform(X_redux)
    tsne_dict[components] = tsne_X_redux

|████████████████████████████████████████| 5/5 [100%] in 28.3s (0.18/s) 


### 10 PCs

In [8]:
pc_components = pc_lst[0]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=tsne_dict[pc_components][:,0],
        y=tsne_dict[pc_components][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [{pc_components} PCs]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

### 50 PCs

In [9]:
pc_components = pc_lst[1]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=tsne_dict[pc_components][:,0],
        y=tsne_dict[pc_components][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [{pc_components} PCs]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

### 100 PCs

In [10]:
pc_components = pc_lst[2]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=tsne_dict[pc_components][:,0],
        y=tsne_dict[pc_components][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [{pc_components} PCs]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

### 250 PCs

In [11]:
pc_components = pc_lst[3]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=tsne_dict[pc_components][:,0],
        y=tsne_dict[pc_components][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [{pc_components} PCs]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

### 500 PCs

In [12]:
pc_components = pc_lst[4]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=tsne_dict[pc_components][:,0],
        y=tsne_dict[pc_components][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [{pc_components} PCs]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

As the number of principal components increases, the noticeable clusters move closer together. 
<br>
This trend continues until the observable clusters merge. 
<br>
At 10 principal components, there are 5 distinct groups of cells. 
<br>
There is a marked distance between clusters.  
<br>
At 500 PCs, there are two distinct groups of observations, and the relative distance between clusters is smaller. 
<br>
Overall, fewer principal components appear to reduce the noise enough for T-SNE to keep close data points together and relatively distant data points apart. 
<br>
With more components, this separation is less pronounced, and only broad groups are identified within the noise.  
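
The noise-reduction reading above can be checked by looking at how much variance each PC count retains. What follows is a minimal sketch on synthetic data, not the notebook's own analysis: the helper `cumulative_variance` and the array `X_demo` are hypothetical, and the data are assumed to be a few latent signals plus Gaussian noise.

```python
import numpy as np
from sklearn import decomposition


def cumulative_variance(X, pc_counts):
    """Return {n_components: fraction of total variance retained}."""
    # Fit once with the largest count; smaller counts are prefixes of the same fit.
    pca = decomposition.PCA(n_components=max(pc_counts)).fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return {n: float(cumulative[n - 1]) for n in pc_counts}


rng = np.random.default_rng(0)
# Synthetic stand-in: 5 latent signals spread over 200 features, plus noise.
latent = rng.normal(size=(300, 5))
loadings = rng.normal(size=(5, 200))
X_demo = latent @ loadings + rng.normal(scale=0.5, size=(300, 200))

retained = cumulative_variance(X_demo, pc_counts=[5, 10, 50, 100])
for n, frac in retained.items():
    print(f"{n:>3} PCs retain {frac:.1%} of the variance")
```

If most of the variance is already captured by the first few components, the remaining components mostly add noise to the distances that T-SNE works with, which would be consistent with the plots above.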

## Hyperparameters

Because T-SNE can provide different results depending on the initialization of its parameters, the following two will be studied:
* perplexity
* initialization 

### Perplexity

From the scikit-learn documentation:
<br>
***"...perplexity is related to the number of nearest neighbors that is used...Consider selecting a value between 5 and 50. Different values can result in significantly different results. The perplexity must be less than the number of samples."***
<br>
Since there are ~500 observations in the reduced data set, the following values might be of interest:
* 5
* 20
* 40 
* 100
* 300
* 500 
<br>
Default values will be kept for all other parameters, and 50 PCs will be used for the initial dimensionality reduction.
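
Because the quoted documentation requires the perplexity to be less than the number of samples, the largest candidate is only usable if the data set has more than 500 rows. A small sketch of that check follows; `usable_perplexities` is a hypothetical helper, and the two sample counts are illustrative since the exact count is not stated here.

```python
def usable_perplexities(candidates, n_samples):
    """Keep only perplexity values strictly below the number of samples."""
    return [p for p in candidates if 0 < p < n_samples]


candidates = [5, 20, 40, 100, 300, 500]

# With exactly 500 samples the value 500 is dropped.
print(usable_perplexities(candidates, n_samples=500))  # [5, 20, 40, 100, 300]

# With slightly more samples every candidate is kept.
print(usable_perplexities(candidates, n_samples=511))  # [5, 20, 40, 100, 300, 500]
```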

In [13]:
perplex_lst = [
    5,
    20,
    40,
    100,
    300,
    500
]

In [14]:
per_dict = {}
pca = decomposition.PCA(n_components=50)
X_redux = pca.fit_transform(X_log)
for per in alive_it(perplex_lst):
    print("Now using perplexity: ", per)
    tsne = manifold.TSNE(n_components=2, perplexity=per)
    tsne_X_redux = tsne.fit_transform(X_redux)
    per_dict[per] = tsne_X_redux

on 0: Now using perplexity:  5
on 1: Now using perplexity:  20
on 2: Now using perplexity:  40
on 3: Now using perplexity:  100
on 4: Now using perplexity:  300
on 5: Now using perplexity:  500
|████████████████████████████████████████| 6/6 [100%] in 12.8s (0.47/s) 


In [15]:
per = perplex_lst[0]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=per_dict[per][:,0],
        y=per_dict[per][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [perplexity: {per}]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

In [16]:
per = perplex_lst[1]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=per_dict[per][:,0],
        y=per_dict[per][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [perplexity: {per}]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

In [17]:
per = perplex_lst[2]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=per_dict[per][:,0],
        y=per_dict[per][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [perplexity: {per}]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

In [18]:
per = perplex_lst[3]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=per_dict[per][:,0],
        y=per_dict[per][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [perplexity: {per}]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

In [19]:
per = perplex_lst[4]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=per_dict[per][:,0],
        y=per_dict[per][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [perplexity: {per}]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

In [20]:
per = perplex_lst[5]
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=per_dict[per][:,0],
        y=per_dict[per][:,1],
        mode="markers",
    )
)
fig.update_layout(
    title_text = f"T-SNE Method [perplexity: {per}]"
)
fig.update_yaxes(title_text = "T-SNE Component 2")
fig.update_xaxes(title_text ="T-SNE Component 1")
fig.show()

In [26]:
perplexity_frame = (
    pd.concat(
        objs=[
            (
                pd.DataFrame(
                    per_dict[i],
                    columns= ["T-SNE Component 1", "T-SNE Component 2"]
                    )
                .assign(
                    perplexity =  i
                )
            ) for i in perplex_lst
        ],
        ignore_index=True
    )
)

In [27]:
fig = px.scatter(
    perplexity_frame, 
    x="T-SNE Component 1", 
    y="T-SNE Component 2",
    facet_col="perplexity", 
    facet_col_wrap = 1)
fig.update_layout(title_text = "T-SNE Perplexity Effect")
fig.update_yaxes(visible=False, showticklabels = True)
fig.show()

The first observation is that, as the perplexity parameter increases, the points crowd toward the middle. 
<br>
However, the first two values (5 and 20) appear to overextend the relative distance between observations.
<br> 
Lacking cohesion, small perplexity values appear to make distinct groups harder to identify. 
<br>
A value at the high end of the suggested range (30 <= perplexity <= 40) appears to make these groups distinct. 
<br>
For our purposes, that of cell sub-type identification, ensuring well-defined groups is the important factor. 
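
The visual comparison above could be complemented with a numeric score. The sketch below uses scikit-learn's `trustworthiness`, a swapped-in metric that this notebook does not use, on synthetic blobs (`X_demo` is hypothetical data). Note that it measures how well local neighbourhoods are preserved, not how well groups are separated, so it is only a partial proxy for the criterion stated above.

```python
from sklearn import datasets, manifold

# Synthetic stand-in for the PCA-reduced matrix (hypothetical data).
X_demo, _ = datasets.make_blobs(
    n_samples=90, centers=3, n_features=20, random_state=0
)

scores = {}
for perplexity in [5, 25]:
    embedding = manifold.TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=0
    ).fit_transform(X_demo)
    # Score in [0, 1]; higher means local neighbourhoods are better preserved.
    scores[perplexity] = manifold.trustworthiness(X_demo, embedding, n_neighbors=5)

for perplexity, score in scores.items():
    print(f"perplexity={perplexity:>2}: trustworthiness={score:.3f}")
```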

### Initialization

The **init** param will be changed from its default, ***pca***, to ***random***. 
<br>
Then, the **random_state** of the initialization will be varied from 0 to 1000. 
<br>
The **kl_divergence_** attribute of the resulting objects will be plotted to find instances where the **random_state** might have sharply affected the final value of the cost function.
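
Seeds worth a closer look can also be picked programmatically rather than by eye. The sketch below is one possible approach, under the assumption that "sharply affected" means "far from the median KL divergence"; the helper `extreme_seeds` and the values in `kl_demo` are hypothetical and are not results from this data set.

```python
import numpy as np


def extreme_seeds(kl_by_seed, n_seeds=5):
    """Return the seeds whose KL divergence lies farthest from the median."""
    seeds = np.array(list(kl_by_seed.keys()))
    values = np.array(list(kl_by_seed.values()), dtype=float)
    deviation = np.abs(values - np.median(values))
    # argsort is ascending, so the last n_seeds entries have the largest deviations.
    top = np.argsort(deviation)[-n_seeds:]
    return sorted(int(s) for s in seeds[top])


# Hypothetical KL values: most seeds land near 0.80, two are forced to be outliers.
rng = np.random.default_rng(0)
kl_demo = {seed: 0.80 + rng.normal(scale=0.005) for seed in range(100)}
kl_demo[17] = 0.95
kl_demo[63] = 0.70

print(extreme_seeds(kl_demo, n_seeds=2))
```

A dictionary shaped like the one filled by the loop in this section (seed mapped to KL divergence) could be passed in the same way.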

In [30]:
random_dict = {}
kl_div_dict = {}
pca = decomposition.PCA(n_components=50)
X_redux = pca.fit_transform(X_log)
for i in alive_it(range(0, 1001)):
    print("Now on random state,", i)
    tsne = manifold.TSNE(n_components=2, init = 'random', random_state=i)
    tsne_X_redux = tsne.fit_transform(X_redux)
    random_dict[i] = tsne_X_redux
    kl_div_dict[i] = tsne.kl_divergence_

on 0: Now on random state, 0
on 1: Now on random state, 1
on 2: Now on random state, 2
on 3: Now on random state, 3
on 4: Now on random state, 4
on 5: Now on random state, 5
on 6: Now on random state, 6
on 7: Now on random state, 7
on 8: Now on random state, 8
on 9: Now on random state, 9
on 10: Now on random state, 10
on 11: Now on random state, 11
on 12: Now on random state, 12
on 13: Now on random state, 13
on 14: Now on random state, 14
on 15: Now on random state, 15
on 16: Now on random state, 16
on 17: Now on random state, 17
on 18: Now on random state, 18
on 19: Now on random state, 19
on 20: Now on random state, 20
on 21: Now on random state, 21
on 22: Now on random state, 22
on 23: Now on random state, 23
on 24: Now on random state, 24
on 25: Now on random state, 25
on 26: Now on random state, 26
on 27: Now on random state, 27
on 28: Now on random state, 28
on 29: Now on random state, 29
on 30: Now on random state, 30
on 31: Now on random state, 31
on 32: Now on random state, 

In [38]:
fig = go.Figure()
fig.add_traces(
    go.Scatter(
        x=list(kl_div_dict.keys()),
        y=list(kl_div_dict.values()),
        mode="markers",
        marker={
            "color":"blue"
        }
    )
)
fig.add_vline(x = 45)
fig.add_vline(x=762)
fig.add_vline(x=220)
fig.add_vline(x=418)
fig.add_vline(x=98)
fig.update_layout(title_text = "KL Divergence vs Random State on T-SNE")
fig.update_xaxes(title_text = "Random Seed [dimensionless]")
fig.update_yaxes(title_text = "KL Divergence")
fig.show()

In [40]:
random_seed_frame = (
    pd.concat(
        objs=[
            (
                pd.DataFrame(
                    random_dict[i],
                    columns= ["T-SNE Component 1", "T-SNE Component 2"]
                    )
                .assign(
                    random_seed =  i
                )
            ) for i in [45, 98, 220, 418, 762]
        ],
        ignore_index=True
    )
)

In [41]:
fig = px.scatter(
    random_seed_frame, 
    x="T-SNE Component 1", 
    y="T-SNE Component 2",
    facet_col="random_seed", 
    facet_col_wrap = 1)
fig.update_layout(title_text = "T-SNE Initialization Effect")
fig.update_yaxes(visible=False, showticklabels = True)
fig.show()

### Effect of number of PC's chosen on clustering

For this hyperparameter, the same list of PC values as in the first task will be re-used. 
* 10
* 50 
* 100 
* 250
* 500
<br>
The K-means algorithm will be run on each of these data sets, with the number of clusters varying from 2 to 10. 
<br>
The idea is to plot an inertia curve for each PC data set. 
<br>
Hopefully, this approach helps to showcase how quickly the ***elbow cluster*** is reached as more principal components are added. 
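
The elbow can also be read off numerically from an inertia curve. The sketch below is an illustration under stated assumptions: the helpers `relative_drops` and `elbow_k` are hypothetical, the 10% threshold is an arbitrary choice, and `sse_demo` holds made-up inertia values rather than results from this data set.

```python
def relative_drops(sse_by_k):
    """Fractional inertia reduction gained by each additional cluster."""
    ks = sorted(sse_by_k)
    return {
        k: (sse_by_k[prev] - sse_by_k[k]) / sse_by_k[prev]
        for prev, k in zip(ks, ks[1:])
    }


def elbow_k(sse_by_k, min_gain=0.10):
    """Last k before the gain from adding a cluster first falls below min_gain."""
    ks = sorted(sse_by_k)
    drops = relative_drops(sse_by_k)
    for prev, k in zip(ks, ks[1:]):
        if drops[k] < min_gain:
            return prev
    return ks[-1]


# Made-up inertia curve: a large drop from 2 to 3 clusters, small gains afterwards.
sse_demo = {2: 1000.0, 3: 600.0, 4: 560.0, 5: 540.0, 6: 525.0}
print(relative_drops(sse_demo))
print(elbow_k(sse_demo))  # 3
```

A dictionary of inertia values keyed by the number of clusters, like the `sse` dictionary built in the loop of this section, has the shape these helpers expect.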

In [45]:
pc_k_dict = {}
for components in alive_it(pc_lst):
    print("Now on pca, ", components)
    pca = decomposition.PCA(n_components=components)
    X_redux = pca.fit_transform(X_log)
    sse = {}
    labels = {}
    for k in range(2, 11):
        print("Now on n_cluster", k)
        kmeans = cluster.KMeans(n_clusters=k, max_iter=1000, random_state=1)
        kmeans.fit(X_redux)
        labels[k]=kmeans.labels_
        sse[k]=kmeans.inertia_
    pc_k_dict[components] = {
        "labels":labels,
        "sse":sse
    }

on 0: Now on pca,  10


on 0: Now on n_cluster 2
on 0: Now on n_cluster 3
on 0: Now on n_cluster 4
on 0: Now on n_cluster 5
on 0: Now on n_cluster 6
on 0: Now on n_cluster 7
on 0: Now on n_cluster 8
on 0: Now on n_cluster 9
on 0: Now on n_cluster 10
on 1: Now on pca,  50
on 1: Now on n_cluster 2
on 1: Now on n_cluster 3
on 1: Now on n_cluster 4
on 1: Now on n_cluster 5
on 1: Now on n_cluster 6
on 1: Now on n_cluster 7
on 1: Now on n_cluster 8
on 1: Now on n_cluster 9
on 1: Now on n_cluster 10
on 2: Now on pca,  100
on 2: Now on n_cluster 2
on 2: Now on n_cluster 3
on 2: Now on n_cluster 4
on 2: Now on n_cluster 5
on 2: Now on n_cluster 6
on 2: Now on n_cluster 7
on 2: Now on n_cluster 8
on 2: Now on n_cluster 9
on 2: Now on n_cluster 10
on 3: Now on pca,  250
on 3: Now on n_cluster 2
on 3: Now on n_cluster 3
on 3: Now on n_cluster 4
on 3: Now on n_cluster 5
on 3: Now on n_cluster 6
on 3: Now on n_cluster 7
on 3: Now on n_cluster 8
on 3: Now on n_cluster 9
on 3: Now on n_cluster 10
on 4: Now on pca,  500
on 4:

In [46]:
fig = go.Figure()
for components, info_dict in pc_k_dict.items():
    fig.add_traces(
        go.Scatter(
            x = list(info_dict['sse'].keys()),
            y = list(info_dict['sse'].values()),
            name=f"{components} PCs"
        )
    )
fig.update_layout(title_text = "Effect of number of PC's chosen on clustering")
fig.update_xaxes(title_text = "Number of Clusters")
fig.update_yaxes(title_text = "K-means Inertia")
fig.show()

Even though the K-means inertia differs between PC sets, that might not be an appropriate comparison to make, since each inertia is computed in a space of different dimensionality. 
<br>
The aim is rather to investigate whether increasing or decreasing the number of PCs helps to identify the ***'optimal'*** number of clusters. 
<br>
The answer appears to be in the negative. 
<br>
For all PC sets, there is not a lot to be gained by using more than 3 clusters. 
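
To make curves from different PC counts comparable, the inertia can be divided by the total sum of squares of the reduced data. The sketch below shows this on synthetic blobs; the helper `explained_fraction` and the array `X_demo` are hypothetical, and this normalization is one possible choice that the notebook does not use.

```python
import numpy as np
from sklearn import cluster, datasets, decomposition


def explained_fraction(X, k, random_state=1):
    """Share of the total sum of squares explained by a k-cluster solution."""
    kmeans = cluster.KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    total_ss = float(np.sum((X - X.mean(axis=0)) ** 2))
    return 1.0 - kmeans.inertia_ / total_ss


# Synthetic stand-in: three well-separated groups in 40 dimensions.
X_demo, _ = datasets.make_blobs(
    n_samples=300, centers=3, n_features=40, cluster_std=2.0, random_state=0
)

for n_pcs in [2, 10, 30]:
    X_redux = decomposition.PCA(n_components=n_pcs).fit_transform(X_demo)
    fractions = {k: round(explained_fraction(X_redux, k), 3) for k in range(2, 6)}
    print(f"{n_pcs:>2} PCs:", fractions)
```

On this scale, a flat curve beyond a given number of clusters means the same thing for every PC set, regardless of how much variance each set retains.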