# Unsupervised machine learning techniques

Today we will continue working with the gene expression data of our mice, and try to find patterns in them with the help of unsupervised machine learning algorithms. Unsupervised methods are useful when we have so-called unlabeled data: samples with no group membership information, only their raw values.

Now, our expression data isn't actually unlabeled: for every sample we know 1) which diet it was on and 2) which strain it belonged to. But we will not give that information to the upcoming ML methods. Instead, we will ask these methods to score / separate / cluster the samples based on their raw values only, and then verify whether the result is consistent with the labels that we had hidden from the algorithms.

The unsupervised techniques we will use today are principal component analysis (PCA), hierarchical clustering and K-means clustering, provided by the wonderful, feature-rich and easy-to-use `scikit-learn` package. Their website is worth a look for anyone interested in machine learning: it is not just documentation, but also a great guide to a lot of techniques.

In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

## Load the expression data

We will need the original raw expression data and the CSV file from your differential expression analysis. Adapt the steps below to make sure we are on the same page.

In [6]:
xls = pd.read_excel('../data-livermito/aad0189_DataFileS5.xlsx', header=2)
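# numeric_only=True skips text columns, which recent pandas versions refuse to average
expr = xls.groupby('Gene.1').mean(numeric_only=True)  # or 'Gene' if you had used that for your DE calculations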
expr = expr.loc[:, expr.columns.str.contains('Liver')]
is_hfd = pd.Series(expr.columns.str.contains('HFD'), index=expr.columns)  # diet labels for verification

Since `scikit-learn` expects one sample per row, we should transpose our expression data matrix so that rows stand for mice and columns stand for genes. Let's call the transposed form of `expr` `data`.

In [None]:
# data = ...

## Task 1: principal component analysis
PCA takes a set of high-dimensional samples (vectors) and transforms them into a smaller set of variables using an orthogonal transformation. It finds a set of mutually orthogonal vectors, each with the same dimension as your data points, and projects every (mean-centered) sample onto each of them with a simple dot product. The projections are called principal components, and they have some interesting properties that we will not go into in detail just yet.
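
To make the "projection with a dot product" idea concrete, here is a minimal sketch on made-up random numbers (not the mouse data). The array `X` and its shape are assumptions for illustration only. It relies on the fact that scikit-learn's `PCA` centers the data before projecting and stores the feature means in `mean_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (made up): 6 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # one row per sample, one column per component

# Each component vector has the same dimension as a sample
print(pca.components_.shape)    # (2, 5)

# The scores are dot products of the centered samples with the component vectors
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual))
```

The same attributes (`components_`, `mean_`) are available on the PCA object you will fit on the real expression matrix.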

### 1.1 Initialize a PCA object with 4 components and transform your data.

### 1.2 Plot the components against each other

Plotting the first component against the second is simple enough with matplotlib `scatter`, but you can do better and create a scatter plot for each pair using seaborn's `pairplot`. This requires your data to be a `DataFrame`, but you know how to turn a numpy matrix into a DF.

### 1.2.1 Add color information based on the diet
Are you impressed?

### 1.2.2 Optional: is it enough to use every 200th gene and still get a nice PCA plot?

### 1.3 Find the component vectors that were used for the transformation. Are they orthogonal as promised? How would you verify it?

### 1.3.1 Optional: How correlated are the transformed values with each other? Why is it important?

### 1.4 Compare the first component vector's weights with the log fold-change vector from your differential expression analysis. Visualize them on a scatter plot, and interpret what you see.
Remember, you computed the fold-change values by comparing expression values between the CD and HFD diets. PCA had no access to this information, and yet... Well, this is why it's such a popular data exploration technique.

You can also try a scatter plot for weight vs p-values, log10-p, etc. Some of them might look familiar.

### 1.5 Create a scatter plot for PC1 and PC2 only, but this time connect pairs of points that come from the same strain

This will involve `plt.scatter`, and a `for` loop with `plt.plot` calls.
The pattern of the resulting lines might be quite interesting. If it is, try to explain why.
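
If the combination of `plt.scatter` and a loop of `plt.plot` calls is unfamiliar, the sketch below shows the mechanics on made-up 2D points. The arrays `first` and `second` are hypothetical stand-ins for paired measurements, not PCA output.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up coordinates: 3 hypothetical pairs of points
rng = np.random.default_rng(1)
first = rng.normal(size=(3, 2))
second = first + np.array([2.0, 0.5]) + rng.normal(scale=0.2, size=(3, 2))

plt.figure()
plt.scatter(first[:, 0], first[:, 1], label="group A")
plt.scatter(second[:, 0], second[:, 1], label="group B")

# One line segment per pair: plt.plot takes the x values, then the y values
for a, b in zip(first, second):
    plt.plot([a[0], b[0]], [a[1], b[1]], color="gray", linewidth=0.8)

plt.legend()
```

On the real data you will first have to work out which two samples belong to the same strain.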

## Task 2: Hierarchical clustering

You created a clustermap before (Day 2), and you probably remember the dendrograms on the top and left edges of the figure. They are the result of an unsupervised technique called hierarchical clustering. It starts from single data points and iteratively merges the two most similar clusters, one merge at a time, until all points belong to one big cluster. Seaborn's `clustermap` does this on both axes by default, and produces a heatmap of the values as well.
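
The merging process can also be inspected directly with SciPy's `linkage` function. The sketch below uses five made-up points on a line and is only meant to show what the merge history looks like; the choice of `method="average"` is an assumption for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up one-dimensional points
points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])

Z = linkage(points, method="average")

# One row per merge: n points need n - 1 merges to end up in one cluster
print(Z.shape)        # (4, 4)

# Columns: the two merged cluster ids, the merge distance, the size of the new cluster
print(Z[0])           # the first merge joins points 0 and 1 at distance 1
print(int(Z[-1, 3]))  # 5: the last merge contains every point
```

A dendrogram is a drawing of exactly this table: each merge becomes a junction whose height is the merge distance.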

### 2.1 Create a clustermap
From your first PCA component vector, take the 15 genes with the largest positive weights and the 15 genes with the most negative weights, and create a smaller expression matrix with these 30 genes only. Use seaborn's `clustermap` to do a hierarchical clustering of genes and mice alike, as well as a heatmap.


### 2.2 Display the diet at the bottom of the mice's dendrogram
You will have to convert `is_hfd`'s values to color values first: `y` or `yellow` for `False` (CD) and `b` or `blue` for `True` (HFD) should do fine.

### 2.3 Standardize the matrix
It is hard to see the fine differences between the gene expression levels across mice, because the overall expression levels vary over a much wider range from gene to gene. To get around this, we usually standardize our data: for every gene, subtract the mean and divide by the standard deviation.

You can do this with a pandas one-liner, or use sklearn's `StandardScaler` tool. You can simply extend the above cell with this.
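
The two routes give almost, but not exactly, the same numbers. The sketch below illustrates this on a made-up table with two hypothetical genes as columns; if your genes are in rows, you would have to transpose first or pass the appropriate `axis` arguments.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up matrix: rows are samples, columns are two hypothetical genes
df = pd.DataFrame({"geneA": [100.0, 110.0, 90.0, 105.0],
                   "geneB": [1.0, 3.0, 2.0, 6.0]})

# pandas works column-wise by default; std() uses the sample estimate (ddof=1)
z_pandas = (df - df.mean()) / df.std()

# StandardScaler is also column-wise, but uses the population estimate (ddof=0)
z_sklearn = StandardScaler().fit_transform(df)

print(np.allclose(z_pandas, z_sklearn))                            # False
print(np.allclose((df - df.mean()) / df.std(ddof=0), z_sklearn))   # True
```

The difference is a constant factor per column, so it does not change the shape of a clustering, only the scale of the values.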

### 2.4 Try different linkage methods
The linkage method determines how the distance between two clusters is computed. For example, `single` defines it as the smallest distance between any two elements, one taken from each cluster. Its opposite is `complete`, which takes the largest such distance. Middle grounds are `centroid`, `average`, and a few more. Look at how they affect the topology of the clusters. Which one seems most suitable in our case?
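
The definitions can be checked on a tiny example. The sketch below uses four made-up points forming two obvious groups and calls SciPy's `linkage` directly; `clustermap` accepts the same names through its `method` argument.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two obvious groups on a line: {0, 1} and {5, 7}
points = np.array([[0.0], [1.0], [5.0], [7.0]])

merge_height = {}
for method in ["single", "complete", "average"]:
    Z = linkage(points, method=method)
    # Column 2 of the last row is the distance at which the two groups merge
    merge_height[method] = float(Z[-1, 2])

print(merge_height)  # {'single': 4.0, 'complete': 7.0, 'average': 5.5}
```

Here `single` reports the gap between 1 and 5, `complete` the span between 0 and 7, and `average` the mean of all four cross-group distances.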

## Task 3: K-means clustering

K-means clustering attempts to create `k` virtual samples such that the total squared distance from every actual sample to its nearest virtual sample is as small as possible. These `k` virtual samples are called cluster centers, prototypes or centroids. Each sample is assigned to the nearest centroid, which partitions the samples into `k` clusters.
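
As a minimal sketch of these ideas, the example below runs scikit-learn's `KMeans` on two made-up, well-separated blobs of points. The data, `n_init=10` and `random_state=0` are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(20, 2)),
               rng.normal(loc=10.0, size=(20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# k centroids, each with the same dimension as a sample
print(km.cluster_centers_.shape)   # (2, 2)

# Every sample's label is the index of its nearest centroid
dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
print(np.array_equal(km.labels_, dists.argmin(axis=1)))

# inertia_ is the minimized quantity: summed squared distance to the nearest centroid
print(np.isclose(km.inertia_, (dists.min(axis=1) ** 2).sum()))
```

Note that the cluster numbers themselves are arbitrary: which blob ends up as cluster 0 depends on the initialization.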

### 3.1 Perform k-means clustering on the expression dataset. What k should you choose?

### 3.2 Compare the resulting clusters with the diet labels

### 3.3 Optional: with k=2, take the centroids, and transform them with the same PCA that you had trained earlier. Mark them on the PC1 vs PC2 plot.

### 3.4 Train a 2-means clustering on every second sample and predict on the other half of the samples
Since our CD and HFD samples come in two big batches, you can just use `::2` and `1::2` to split them. What do you find?