# Dimensionality Reduction 2

Learning objectives
1. Apply non-negative matrix factorisation (NMF, also abbreviated NNMF) and multidimensional scaling (MDS) to different datasets and interpret the outputs
2. Learn how to visualise the model outputs
3. Interpret the results and compare them with PCA

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import NMF, PCA
from sklearn.neighbors import DistanceMetric
from sklearn.manifold import MDS

## Load in datasets
Read in the omics datasets using the pandas [read_excel()](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) function. For this example we will use a urine metabolomics dataset from a diabetes cohort.

In [None]:
diabetes_urine_metab = pd.read_excel("../Data/diabetes_metabolomics_urine.xlsx")

In [None]:
diabetes_urine_metab

We can see there are several covariates in the first columns, including Age, Gender, BMI, ETH, and T2D (diabetes status). To obtain just the metabolite relative abundance data, we can use `diabetes_urine_metab.iloc[:, 6:]`. This takes all rows of the data and every column from position 6 onward. Positions are zero-based, so the first six columns are skipped.
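
The positional slicing described above can be sketched on a small made-up table. Everything in this snippet is hypothetical (the `toy` DataFrame, the `SampleID` column, and the `metab_*` names); it only illustrates how `iloc[:, :6]` and `iloc[:, 6:]` split a table into metadata and measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: six metadata columns followed by three metabolite columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SampleID": ["s1", "s2", "s3", "s4"],  # assumed identifier column
    "Age": [54, 61, 47, 58],
    "Gender": ["F", "M", "F", "M"],
    "BMI": [27.1, 31.4, 24.8, 29.9],
    "ETH": ["A", "B", "A", "B"],
    "T2D": [0, 1, 0, 1],
    "metab_1": rng.random(4),
    "metab_2": rng.random(4),
    "metab_3": rng.random(4),
})

metadata = toy.iloc[:, :6]     # columns at positions 0 to 5
metabolites = toy.iloc[:, 6:]  # every column from position 6 onward

print(metadata.columns.tolist())
print(metabolites.columns.tolist())

# NMF needs a non-negative input matrix, so it is worth checking this up front
print((metabolites.to_numpy() >= 0).all())
```

Checking the column names after slicing is a cheap way to confirm that no covariate has slipped into the matrix passed to the model.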

## NMF

We can use the [`sklearn.decomposition.NMF`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) class to perform non-negative matrix factorisation on the dataset. We will initialise the model with five components, and use the "random" initialiser.

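The input matrix $V$ (the metabolomics dataset without covariates/metadata) will be factorised into two non-negative matrices: $W$ (the weights, i.e. the samples projected onto the components) and $H$ (the components, analogous to the loadings in PCA). If $V$ has $n$ samples and $p$ metabolites and we choose $k$ components, then $W$ is $n \times k$ and $H$ is $k \times p$.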

$V \approx WH$

In [None]:
# instantiate the NMF model
# here we will select 5 components
model = NMF(n_components=5, init='random', random_state=0)

# set V as the input omics data matrix
V = diabetes_urine_metab.iloc[:, 6:]

# the weights matrix W is obtained by projecting the original data
W = model.fit_transform(V)

# the components matrix H is obtained by using the .components_ attribute
H = model.components_


Let's look at the shape of the resulting matrices:

In [None]:
W.shape

In [None]:
H.shape

We should be able to approximately reconstruct the original matrix $V$ by calculating the matrix product of $W$ and $H$:

In [None]:
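# obtain the matrix product of W and H
# reuse the row and column labels of V so the two tables are easy to compare
V_approx = pd.DataFrame(np.dot(W, H), index=V.index, columns=V.columns)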

# This is the same shape as the original matrix V
V_approx.shape

We can compare the two matrices V and V_approx by eye:

In [None]:
V_approx

In [None]:
V

We can also quantify the reconstruction error, which the fitted model stores as the Frobenius norm of the difference between $V$ and $WH$:

In [None]:
model.reconstruction_err_

How does the reconstruction error change as you vary `n_components`?
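
As a minimal sketch of one way to explore this, the snippet below recomputes the Frobenius norm by hand, compares it with `reconstruction_err_`, and then refits the model for several values of `n_components`. It uses a small synthetic low-rank matrix rather than the diabetes data, and the names `V_toy`, `toy_model`, and `errors` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative matrix with approximately rank-3 structure (illustrative only)
rng = np.random.default_rng(0)
V_toy = rng.random((30, 3)) @ rng.random((3, 12)) + 0.01 * rng.random((30, 12))

# Fit a 3-component model and recompute the error by hand
toy_model = NMF(n_components=3, init="random", random_state=0, max_iter=2000)
W_toy = toy_model.fit_transform(V_toy)
H_toy = toy_model.components_
manual_err = np.linalg.norm(V_toy - W_toy @ H_toy, ord="fro")
print("manual:", manual_err, "sklearn:", toy_model.reconstruction_err_)

# Refit with different numbers of components and record the error each time
errors = {}
for k in (1, 2, 3, 5):
    m = NMF(n_components=k, init="random", random_state=0, max_iter=2000)
    m.fit(V_toy)
    errors[k] = m.reconstruction_err_
print(errors)
```

On data with a clear low-rank structure, one would typically expect the error to fall as components are added and to level off once the number of components reaches the underlying rank. Real omics data rarely show such a clean elbow.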

### Interpreting NMF

The $H$ matrix contains the components (analogous to the loadings in PCA). It summarises the importance of each feature (metabolites in this case) in each of the new components, of which we have selected 5.

In [None]:
# add metabolite names to the matrix H to help interpretation
H_df = pd.DataFrame(H.T, columns=range(1, H.shape[0]+1), index=V.columns)

In [None]:
H_df

If we sort each column, we can determine the most important metabolites driving each component (1-5)


In [None]:
# for each component, print out the top 5 features driving that component
for col in H_df:
    print("Component: "+ str(col))
    print(H_df[col].sort_values(ascending=False)[0:5].index.tolist())

Visualise the first (not top) 20 metabolites in the metabolomics data, alongside their contribution to the 5 latent factors:

In [None]:
plt.figure(figsize=(8, 8))
plt.imshow(H_df.iloc[0:20, :], cmap="viridis")
plt.yticks(range(len(H_df.iloc[0:20, :].index)), H_df.iloc[0:20, :].index)
plt.xticks(range(0, 5), [1, 2, 3, 4, 5])
plt.colorbar()
plt.show()

Visualise the first two components of the NMF projected data ($W$) using a scatterplot:

In [None]:
sns.set_style("ticks")
sns.set_context("notebook")
plt.figure(figsize=(8, 8))

p = sns.scatterplot(x=W[:, 0],
 y=W[:, 1], 
 hue=diabetes_urine_metab["T2D"])

p.set_xlabel("Dim 1")
p.set_ylabel("Dim 2")

plt.show()

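We can also plot a clustered heatmap of the NMF projected data $W$. The samples (columns) are hierarchically clustered on the five latent factors, and the colour bar above the heatmap marks each sample's T2D status, so we can see whether the clusters correspond to diabetes status: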

In [None]:
sns.clustermap(W.T,
               col_colors=["tab:orange" if i == 1 else "tab:blue" for i in diabetes_urine_metab["T2D"]],
               # one colour per latent factor (row)
               row_colors=sns.color_palette("tab10", n_colors=W.shape[1]),
               xticklabels=False)

plt.show()

### NMF parameters: alpha and L1 ratio
Find the full list of parameters [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) with explanations and default values. 

We can control the amount of regularisation in the NMF model by adjusting the L1 and L2 penalties. The balance between L1 and L2 regularisation is set with `l1_ratio`, and the strength of the regularisation is set with the `alpha_W` and `alpha_H` parameters (applied to $W$ and $H$ respectively). By default, `alpha_W` is 0.0 (no regularisation) and `alpha_H` is set to the same value as `alpha_W`.

The `l1_ratio` is set to 0.0 by default. For l1_ratio = 0 the penalty is an elementwise L2 penalty (aka Frobenius Norm). For l1_ratio = 1 it is an elementwise L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

Read in more detail about the various sklearn NMF parameters [here](https://scikit-learn.org/stable/modules/decomposition.html#nmf).

In [None]:
# instantiate the NMF model
# set the L1 ratio to 0.5 and the regularisation strength alpha_W to 0.5
model_l1_05 = NMF(n_components=5, init='random', l1_ratio=0.5, alpha_W=0.5)

# set V as the input omics data matrix
V = diabetes_urine_metab.iloc[:, 6:]

# fit the regularised model (not the earlier `model`) to obtain the weights matrix W
W_l1_05 = model_l1_05.fit_transform(V)

# the components matrix H is obtained by using the .components_ attribute
H_l1_05 = model_l1_05.components_

In [None]:
# instantiate the NMF model
# set the L1 ratio to 0 (a pure L2 penalty) and the regularisation strength alpha_W to 1
model_l1_08 = NMF(n_components=5, init='random', l1_ratio=0, alpha_W=1)

# fit this model (not the earlier `model`) to obtain the weights matrix W
W_l1_08 = model_l1_08.fit_transform(V)

# the components matrix H is obtained by using the .components_ attribute
H_l1_08 = model_l1_08.components_

## Multidimensional scaling (MDS)

[Multidimensional scaling](https://scikit-learn.org/0.24/modules/manifold.html#multidimensional-scaling) (MDS) seeks a low-dimensional representation of the data in which the pairwise distances between points match, as closely as possible, the distances in the original high-dimensional space. MDS (both metric and non-metric) can be performed using the [`sklearn.manifold.MDS`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html) class.
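
A rough way to check how well an embedding preserves distances is to compare the pairwise distances before and after the projection. The sketch below does this on synthetic data that is close to two-dimensional by construction. The names `X_toy`, `X_2d`, and `corr` are hypothetical, and using a correlation between distances is just one simple diagnostic chosen for illustration, alongside the `stress_` attribute of the fitted model.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Synthetic data: 25 points lying close to a 2-D plane inside a 10-D space
rng = np.random.default_rng(0)
latent = rng.normal(size=(25, 2))
X_toy = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(25, 10))

mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X_toy)

# Compare pairwise Euclidean distances in the original and embedded spaces
d_high = pairwise_distances(X_toy)
d_low = pairwise_distances(X_2d)
upper = np.triu_indices_from(d_high, k=1)  # each pair once, diagonal excluded
corr = np.corrcoef(d_high[upper], d_low[upper])[0, 1]

print("correlation between distances:", corr)
print("stress:", mds.stress_)
```

A correlation close to 1 suggests the embedding keeps most of the distance structure. A lower value means two dimensions are not enough to represent the original distances faithfully.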

### Metric MDS

In [None]:
embedding = MDS(n_components=2, metric=True)

In [None]:
embedding_euclidean = embedding.fit_transform(diabetes_urine_metab.iloc[:, 6:])

Visualise the first two MDS components:

In [None]:
sns.set_style("ticks")
sns.set_context("notebook")
plt.figure(figsize=(8, 8))

p = sns.scatterplot(x=embedding_euclidean[:, 0],
 y=embedding_euclidean[:, 1], 
 hue=diabetes_urine_metab["T2D"])

p.set_xlabel("Dim 1")
p.set_ylabel("Dim 2")
plt.title("Metric MDS")
plt.show()

### Dissimilarity metrics
We can use different metrics to compute the dissimilarity matrix for metric MDS. By default, the MDS function computes Euclidean distances from the input data. Alternatively, we can specify `dissimilarity="precomputed"` and pass in a dissimilarity matrix that we have already computed ourselves.

We can compute the dissimilarity matrix using [various metrics](https://scikit-learn.org/0.24/modules/generated/sklearn.neighbors.DistanceMetric.html).
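
Several of these metrics are closely related. As a minimal sketch, the snippet below shows that the Minkowski distance with `p=1` matches the Manhattan distance and with `p=2` matches the Euclidean distance, while other values of `p` give a different matrix. It uses `sklearn.metrics.pairwise_distances` instead of `DistanceMetric` (a substitution made here for convenience), and `X_toy` is synthetic data.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Small synthetic data matrix: 6 samples, 4 features (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))

d_euclidean = pairwise_distances(X_toy, metric="euclidean")
d_manhattan = pairwise_distances(X_toy, metric="manhattan")

# Minkowski distance for different orders p
d_mink_p1 = pairwise_distances(X_toy, metric="minkowski", p=1)
d_mink_p2 = pairwise_distances(X_toy, metric="minkowski", p=2)
d_mink_p3 = pairwise_distances(X_toy, metric="minkowski", p=3)

print("p=1 equals Manhattan:", np.allclose(d_mink_p1, d_manhattan))
print("p=2 equals Euclidean:", np.allclose(d_mink_p2, d_euclidean))
print("p=3 equals Euclidean:", np.allclose(d_mink_p3, d_euclidean))
```

Each of these matrices is square and symmetric with zeros on the diagonal, which is the form expected when `dissimilarity="precomputed"` is used.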

Running MDS on a precomputed Manhattan distance dissimilarity matrix:

In [None]:
manhattan_dist = DistanceMetric.get_metric('manhattan').pairwise(diabetes_urine_metab.iloc[:, 6:])

embedding_manhattan = MDS(n_components=2, metric=True, dissimilarity="precomputed")
MDS_manhattan = embedding_manhattan.fit_transform(manhattan_dist)

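and on a precomputed Minkowski distance matrix. Note that the Minkowski metric defaults to order `p=2`, which is identical to the Euclidean distance, so any difference from the Euclidean plot comes only from the random initialisation of MDS. Pass a different order, for example `DistanceMetric.get_metric('minkowski', p=3)`, to obtain a genuinely different metric: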

In [None]:
minkowski_dist = DistanceMetric.get_metric('minkowski').pairwise(diabetes_urine_metab.iloc[:, 6:])

embedding_minkowski = MDS(n_components=2, metric=True, dissimilarity="precomputed")
MDS_minkowski = embedding_minkowski.fit_transform(minkowski_dist)

Compare the results obtained using metric MDS with different dissimilarity metrics:

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

sns.scatterplot(x=embedding_euclidean[:, 0], y=embedding_euclidean[:, 1], hue=diabetes_urine_metab["T2D"], ax=ax1)
ax1.set_title("Euclidean")

sns.scatterplot(x=MDS_manhattan[:, 0], y=MDS_manhattan[:, 1], hue=diabetes_urine_metab["T2D"], ax=ax2)
ax2.set_title("Manhattan")

sns.scatterplot(x=MDS_minkowski[:, 0], y=MDS_minkowski[:, 1], hue=diabetes_urine_metab["T2D"], ax=ax3)
ax3.set_title("Minkowski")

plt.tight_layout()
plt.show()

Using the previous code examples, compute another dissimilarity metric from the metabolomics data (full list [here](https://scikit-learn.org/0.24/modules/generated/sklearn.neighbors.DistanceMetric.html)) and visualise the first two MDS components using a scatterplot:

In [None]:
# Write code here...

### Non-metric MDS
To compute non-metric MDS, set the parameter `metric=False`:

In [None]:
embedding_nonmetric = MDS(n_components=2, metric=False)
diabetes_urine_metab_transformed_nMDS = embedding_nonmetric.fit_transform(diabetes_urine_metab.iloc[:, 6:])

Visualise the first two non-metric MDS components:

In [None]:
sns.set_style("ticks")
sns.set_context("notebook")
plt.figure(figsize=(8, 8))

p = sns.scatterplot(x=diabetes_urine_metab_transformed_nMDS[:, 0],
 y=diabetes_urine_metab_transformed_nMDS[:, 1], 
 hue=diabetes_urine_metab["T2D"])

p.set_xlabel("Dim 1")
p.set_ylabel("Dim 2")
plt.title("Non-metric MDS")
plt.show()

## How do NNMF and MDS compare with PCA?

Using what you have learnt from the previous tutorial on PCA, scale the input metabolomics data, apply PCA and project the metabolomics data into the latent space, and visualise the data using a scatterplot. Colour the datapoints by Type 2 diabetes status "T2D".

In [None]:
# import scaler
from sklearn.preprocessing import StandardScaler

# scale the data
diabetes_urine_metab_scaled = StandardScaler().fit_transform(diabetes_urine_metab.iloc[:, 6:])

In [None]:
# perform PCA and project the data
pca_res = PCA(n_components=2).fit_transform(diabetes_urine_metab_scaled)

In [None]:
# plot two components of the projected data on a scatter plot
sns.set_style("ticks")
sns.set_context("notebook")
plt.figure(figsize=(8, 8))

p = sns.scatterplot(x=pca_res[:, 0],
 y=pca_res[:, 1],
 hue=diabetes_urine_metab["T2D"])

p.set_xlabel("PC1")
p.set_ylabel("PC2")

plt.title("PCA")

plt.show()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

sns.scatterplot(x=pca_res[:, 0], y=pca_res[:, 1], hue=diabetes_urine_metab["T2D"], ax=ax1)
ax1.set_title("PCA")

sns.scatterplot(x=W[:, 0], y=W[:, 1], hue=diabetes_urine_metab["T2D"], ax=ax2)
ax2.set_title("NNMF")

sns.scatterplot(x=embedding_euclidean[:, 0], y=embedding_euclidean[:, 1], hue=diabetes_urine_metab["T2D"], ax=ax3)
ax3.set_title("MDS (metric)")

plt.tight_layout()
plt.show()

Note: look at the axes. With NNMF all weights are non-negative (zero or positive), unlike in PCA or MDS, where the projected values can be negative.
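
This difference can be checked numerically by looking at the smallest value in each projection. The sketch below does so on synthetic data, and `X_toy`, `W_toy`, and `pca_toy` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA
from sklearn.preprocessing import StandardScaler

# Synthetic non-negative data (illustrative only, not the diabetes dataset)
rng = np.random.default_rng(0)
X_toy = rng.random((30, 8))

# NMF weights for two components
W_toy = NMF(n_components=2, init="random", random_state=0, max_iter=1000).fit_transform(X_toy)

# PCA scores for two components, computed on scaled data as in the cells above
pca_toy = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_toy))

print("smallest NMF weight:", W_toy.min())
print("smallest PCA score:", pca_toy.min())
```

PCA scores are centred on zero, so each component has both positive and negative values, whereas the NMF weights are constrained to be non-negative.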

## Your turn
Choose one of the other omics datasets in the `Data` folder and perform either MDS or NNMF on it. Visualise the results.

In [None]:
# Import dataset

In [None]:
# Perform NNMF/MDS

In [None]:
# Visualise the results