## Visualising cell types (single-cell RNA-seq)

In this practical exercise we will explore a biological dataset: the single-cell RNA-seq dataset from Macosko et al., "Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets," Cell (2015).

The original dataset contains gene expression measurements from 44,808 individual cells, and the labels correspond to their cell types. Instead of using genes as features, the data has already been pre-processed using PCA. The 50 principal components are stored for each cell, together with a label drawn from 12 unique cell types.

We will use t-SNE to visualise the samples and identify clusters based only on their gene expression. The clusters should ideally correspond to the cell types.

Note that for the purpose of this practical, we have randomly sampled 10,000 rows from the original dataset. This lets us run t-SNE with different sets of parameters without waiting too long. Running t-SNE on the full dataset would usually take 20 minutes or more.

### Loading libraries and data

As usual, let's start by loading some essential libraries: `pandas`, `numpy`, and `matplotlib`.

For plotting, we will use `seaborn` in this practical. It is a Python data visualisation library based on `matplotlib`, which provides a high-level interface for drawing attractive and informative statistical graphics.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.manifold import TSNE
import seaborn as sns

### Loading the data

Our data is available as `pandas` DataFrames stored in pickle format. We will load them and convert them to numpy arrays:
* Load the main dataset `data/x.pkl` into a dataframe using `pd.read_pickle`. Call `.values` on the dataframe to convert it to a numpy array and save the result in a variable called `x` 
* Load our labels, `data/y.pkl` into a variable (numpy array) called `y` by doing the same as above
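
The load-and-convert pattern described above can be sketched on stand-in files. The file names `x_demo.pkl` and `y_demo.pkl`, the temporary directory, and the tiny synthetic frames are all assumptions made for illustration; the real exercise reads `data/x.pkl` and `data/y.pkl`, whose exact layout is not shown here.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real pickles, written to a temporary directory.
rng = np.random.default_rng(0)
demo_dir = Path(tempfile.mkdtemp())
pd.DataFrame(rng.normal(size=(6, 3))).to_pickle(demo_dir / "x_demo.pkl")
pd.DataFrame(
    {"label": ["Rods", "Cones", "Rods", "Rods", "Cones", "Rods"]}
).to_pickle(demo_dir / "y_demo.pkl")

# Same pattern as the exercise: read the pickle, then take the underlying array.
x_demo = pd.read_pickle(demo_dir / "x_demo.pkl").values

# A single-column DataFrame gives an array of shape (n, 1); ravel() flattens it.
# Whether the real labels need this depends on how y.pkl was saved.
y_demo = pd.read_pickle(demo_dir / "y_demo.pkl").values.ravel()

print(type(x_demo), x_demo.shape)
print(type(y_demo), y_demo.shape)
```
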

In [None]:
# Your code here...


### Inspecting the data

* Inspect the data, print the number of samples and features for instance (using `.shape`)
* Print the cell types or labels. Since `y` is a numpy array, you can use `np.unique(y)` to return a sorted array with all unique values in it.
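
As a rough illustration of these two inspection steps, the sketch below uses small synthetic arrays (`x_demo` and `y_demo` are made-up stand-ins, not the practical's data). It also passes `return_counts=True` to `np.unique`, which is optional but shows how many samples each label has.

```python
import numpy as np

# Synthetic stand-ins: 8 samples with 50 features, and one label per sample.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(8, 50))
y_demo = np.array(["Rods", "Cones", "Rods", "Rods", "Cones", "Rods", "Rods", "Cones"])

# .shape is (number of samples, number of features) for a 2-D array.
n_samples, n_features = x_demo.shape
print(f"{n_samples} samples, {n_features} features")

# np.unique returns the sorted unique values; return_counts adds their frequencies.
labels, counts = np.unique(y_demo, return_counts=True)
for label, count in zip(labels, counts):
    print(f"{label}: {count}")
```
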

In [None]:
# Your code here...


### Plot using PCA

The original data was already reduced using PCA, in order to reduce the number of features. This is common practice among biologists, because gene expression matrices are very high-dimensional, on the order of tens of thousands of genes, most of which have very low or practically zero expression in any given cell.

We can therefore use the already computed principal components to take a first look at the data.

* We will use `sns.scatterplot` to plot our data

`seaborn` operates on `pandas` DataFrames. Define a `pandas` DataFrame with the data and use it to plot the first two principal components (the first two columns of `x`).
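
A minimal sketch of this step, assuming a synthetic matrix `x_demo` in place of the real data. The column names `PC1` and `PC2` and the marker size are illustrative choices, not something the practical prescribes.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the PCA-reduced matrix (100 samples, 50 components).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 50))

# Keep only the first two columns and name them so the axes get readable labels.
pca_df_demo = pd.DataFrame(x_demo[:, :2], columns=["PC1", "PC2"])

# scatterplot looks up the x and y columns by name in the DataFrame.
ax = sns.scatterplot(data=pca_df_demo, x="PC1", y="PC2", s=10)
ax.set_title("First two principal components (synthetic data)")
```
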

In [None]:
# Your code here...


Now plot the DataFrame with `seaborn` again, this time coloring each point by its label through the `hue` argument. Pass the pre-defined color palette through the `palette` argument:

```Python
MACOSKO_COLORS = {
    "Amacrine cells": "#A5C93D",
    "Astrocytes": "#8B006B",
    "Bipolar cells": "#2000D7",
    "Cones": "#538CBA",
    "Fibroblasts": "#8B006B",
    "Horizontal cells": "#B33B19",
    "Microglia": "#8B006B",
    "Muller glia": "#8B006B",
    "Pericytes": "#8B006B",
    "Retinal ganglion cells": "#C38A1F",
    "Rods": "#538CBA",
    "Vascular endothelium": "#8B006B",
}
```
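
One way a dictionary palette like this might be used is sketched below. The DataFrame, its `cell_type` column, and the two-entry `palette_demo` (copied from the palette above) are assumptions for illustration only. The check before plotting is there because a dictionary palette needs an entry for every label present in the data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Two entries copied from the palette above, enough for the synthetic labels below.
palette_demo = {"Rods": "#538CBA", "Bipolar cells": "#2000D7"}

# Synthetic stand-in: 60 points, 30 per cell type.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "PC1": rng.normal(size=60),
    "PC2": rng.normal(size=60),
    "cell_type": np.repeat(["Rods", "Bipolar cells"], 30),
})

# Every label needs a colour in the dictionary, so check before plotting.
missing = set(df_demo["cell_type"]) - set(palette_demo)
assert not missing, f"labels without a colour: {missing}"

# hue picks the column to colour by; palette maps each label to its colour.
ax = sns.scatterplot(
    data=df_demo, x="PC1", y="PC2", hue="cell_type", palette=palette_demo, s=10
)
# Move the legend outside the axes so it does not cover the points.
ax.legend(title="Cell type", bbox_to_anchor=(1.02, 1), loc="upper left")
```
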




In [None]:
# Your code here...


We can see that, even though the first 2 principal components capture more of the data's variance than any other pair of components, they are still not enough to provide a visualisation with separable clusters or cell types.


### Run t-SNE

Instantiate the `TSNE` class and store it in a variable called `tsne`. The cell after it fits the model and transforms the data.

Set `verbose = True` to print some of the operation details while training.
Set `random_state = 0` to seed the internal random number generator, so that results are reproducible and comparable across models.
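
The sketch below shows these settings on a small synthetic dataset, so that it runs in a few seconds. The two Gaussian blobs in `x_demo` and the `_demo` variable names are assumptions chosen to avoid clashing with the exercise's own `x` and `tsne`; it is not meant as the reference solution.

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated synthetic blobs in 50 dimensions (75 points each).
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(75, 50)),
    rng.normal(6.0, 1.0, size=(75, 50)),
])

# verbose=True prints progress; random_state=0 makes the run reproducible.
tsne_demo = TSNE(n_components=2, verbose=True, random_state=0)

# fit_transform learns the embedding and returns one 2-D point per sample.
embedding_demo = tsne_demo.fit_transform(x_demo)

print(embedding_demo.shape)
print(tsne_demo.kl_divergence_)  # final value of the t-SNE cost function
```
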

In [None]:
# Your code here...


In [None]:
%time tsne_embedding = tsne.fit_transform(x)

### Plot t-SNE

Similarly to what we did before, define and execute a function that uses the t-SNE embedding to plot the data using `seaborn`.
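
One possible shape for such a function is sketched below. The name `plot_embedding`, its arguments, and the column names are hypothetical, and the random `embedding_demo` only stands in for a real t-SNE result.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_embedding(embedding, labels, palette=None, title=None, ax=None):
    """Scatter a 2-D embedding coloured by label and return the Axes."""
    df = pd.DataFrame({
        "dim1": embedding[:, 0],
        "dim2": embedding[:, 1],
        "label": np.asarray(labels).ravel(),  # accept (n,) or (n, 1) label arrays
    })
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 6))
    sns.scatterplot(
        data=df, x="dim1", y="dim2", hue="label", palette=palette, s=10, ax=ax
    )
    ax.set_title(title or "")
    return ax


# Random points standing in for a t-SNE embedding, 20 per label.
rng = np.random.default_rng(0)
embedding_demo = rng.normal(size=(40, 2))
labels_demo = np.repeat(["Rods", "Cones"], 20)
ax = plot_embedding(embedding_demo, labels_demo, title="Synthetic embedding")
```

Returning the Axes and accepting an optional `ax` makes it straightforward to reuse the same function for side-by-side comparisons in the later sections.
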

In [None]:
# Your code here...


### Explore different numbers of iterations

Let's see what the embeddings look like at different stages of training. This is controlled by the number of iterations that we allow the algorithm to run for. 

* We can modify the number of algorithm iterations by setting `n_iter` (named `max_iter` in recent `sklearn` releases)

When the number of iterations is not set, the default is a maximum of 1000 iterations. The optimisation can stop earlier if the cost has not improved for `n_iter_without_progress` iterations.

More information can be found in the official `sklearn` documentation at https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
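
A hedged sketch of such a comparison follows. Because the iteration argument is called `n_iter` in older `sklearn` releases and `max_iter` in newer ones, the code inspects the constructor and uses whichever name the installed version accepts. The synthetic `x_demo` and the chosen iteration counts are illustrative assumptions.

```python
import inspect

import numpy as np
from sklearn.manifold import TSNE

# Pick the iteration-limit argument that the installed sklearn version accepts.
tsne_params = inspect.signature(TSNE.__init__).parameters
iter_arg = "max_iter" if "max_iter" in tsne_params else "n_iter"

# Two well-separated synthetic blobs in 50 dimensions (75 points each).
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(75, 50)),
    rng.normal(6.0, 1.0, size=(75, 50)),
])

# Fit one model per iteration budget; 250 is the smallest value TSNE accepts.
embeddings_by_iter = {}
for n in (250, 500, 1000):
    tsne_demo = TSNE(random_state=0, **{iter_arg: n})
    embeddings_by_iter[n] = tsne_demo.fit_transform(x_demo)
    print(
        f"{iter_arg}={n}: KL divergence = {tsne_demo.kl_divergence_:.3f}, "
        f"iterations run = {tsne_demo.n_iter_}"
    )
```

Each stored embedding can then be plotted in its own panel to see how the clusters take shape as the iteration budget grows.
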

In [None]:
# Your code here...


### Explore different perplexities

The `perplexity` hyper-parameter is directly related to the number of neighbours considered for each sample during training.

It therefore affects the balance between local and global structure in the t-SNE embedding: low values emphasise local neighbourhoods, while higher values take more of the global structure into account.

Try different values of `perplexity` and observe their effect on the t-SNE embedding. Modify the `perplexity` value passed to `TSNE`.

The default value is `perplexity=30`.
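
The loop below is a minimal sketch of such an experiment on synthetic data. The values 5, 30 and 50 and the blobs in `x_demo` are arbitrary choices for illustration; note that `perplexity` has to be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated synthetic blobs in 50 dimensions (75 points each).
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(75, 50)),
    rng.normal(6.0, 1.0, size=(75, 50)),
])

# Fit one model per perplexity, keeping the seed fixed so only perplexity varies.
embeddings_by_perplexity = {}
for perplexity in (5, 30, 50):
    tsne_demo = TSNE(perplexity=perplexity, random_state=0)
    embeddings_by_perplexity[perplexity] = tsne_demo.fit_transform(x_demo)
    print(f"perplexity={perplexity}: embedding shape = {embeddings_by_perplexity[perplexity].shape}")
```

Plotting the stored embeddings side by side is usually the easiest way to compare how tightly the clusters are drawn at each setting.
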

In [None]:
# Your code here...
