# Visualisation of single-cell expression data using PCA
In this lab you will use PCA to visualise single-cell gene expression data from Guo et al., "Resolution of Cell Fate Decisions Revealed by Single-Cell Gene Expression Analysis from Zygote to Blastocyst", Developmental Cell, Volume 18, Issue 4, 20 April 2010, Pages 675-685, available from http://dx.doi.org/10.1016/j.devcel.2010.02.012. A PDF of the paper is available in the handouts folder for Week 10.

Exercise 2: In the Guo et al. paper there are PCA plots in Figure 1B and 1C. Can you reproduce these or equivalent? You will have to modify the dataset, run PCA again and make some new plots. 

Note: Our data does not include information about which embryo each cell comes from, so you won't be able to colour the cells by embryo of origin as is done in Figure 1B.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import visualisation # some functions defined for this lab
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=False)
# Toggle between interactive plotly rendering and static matplotlib figures
doInteractive = True
def interactivePlots(fig, axes):
    # Helper: render a matplotlib figure with plotly when doInteractive is set
    if doInteractive:
        plotly.offline.iplot_mpl(fig, show_link=False, strip_style=True) # offline ipython notebook



In [2]:
GuoDataAll = pd.read_csv('GuoData.csv', index_col=[0])
labelsAll = GuoDataAll.index # The labels give the cell-type for each cell 

## Figure 1B contains only 64-cell stage samples

In [3]:
# Take the data subset containing only the 64-cell stage labels
frames = [GuoDataAll.iloc[labelsAll=='64 TE',:],GuoDataAll.iloc[labelsAll=='64 EPI',:],
          GuoDataAll.iloc[labelsAll=='64 PE',:]]
GuoData = pd.concat(frames)
labels = GuoData.index
print('data shape\n', GuoData.shape)
labels

data shape
 (159, 48)


Index(['64 TE', '64 TE', '64 TE', '64 TE', '64 TE', '64 TE', '64 TE', '64 TE',
       '64 TE', '64 TE',
       ...
       '64 PE', '64 PE', '64 PE', '64 PE', '64 PE', '64 PE', '64 PE', '64 PE',
       '64 PE', '64 PE'],
      dtype='object', length=159)
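
The three-way `pd.concat` above can also be written as a single boolean mask. The following is a minimal sketch on a toy table, not the lab's data: the frame `toy`, its gene columns and the `32 ICM` row are made-up placeholders. Note that `isin` keeps rows in their original order, whereas the `concat` version groups them as TE, EPI, PE.

```python
import pandas as pd

# Toy stand-in for GuoDataAll: the row index holds the cell-type labels
toy = pd.DataFrame(
    {'gene1': [1.0, 2.0, 3.0, 4.0], 'gene2': [0.5, 0.1, 0.9, 0.3]},
    index=['32 ICM', '64 TE', '64 PE', '64 EPI'],  # placeholder labels
)

# Keep only rows whose label is one of the 64-cell stage cell types
wanted = ['64 TE', '64 EPI', '64 PE']
subset = toy.loc[toy.index.isin(wanted)]
print(subset.shape)
```

Either approach selects the same cells; only the row order differs, which does not affect PCA.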

## Carry out PCA on this new dataset

In [4]:
Wt, X_proj, fracs = visualisation.do_pca(GuoData)
X_proj = X_proj/abs(X_proj).max().max()
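
The internals of `visualisation.do_pca` are not shown in this notebook. As a rough guide to what such a function might compute, here is a minimal sketch of PCA via the singular value decomposition. The name `pca_sketch`, the `PC1`, `PC2`, ... labels and the toy data are assumptions for illustration; the lab's own function may centre, scale or order things differently.

```python
import numpy as np
import pandas as pd

def pca_sketch(data):
    """Hypothetical stand-in for visualisation.do_pca (an assumption, not the lab's code)."""
    # Centre each gene (column) so that it has zero mean across cells
    centred = data - data.mean(axis=0)
    # SVD: centred = U @ diag(s) @ Vt; the rows of Vt are the principal directions
    U, s, Vt = np.linalg.svd(centred.to_numpy(), full_matrices=False)
    pc_names = ['PC%d' % (i + 1) for i in range(len(s))]
    # Loadings: one row per component, one column per gene
    Wt = pd.DataFrame(Vt, index=pc_names, columns=data.columns)
    # Projection of each cell onto the components (equals centred @ Vt.T)
    X_proj = pd.DataFrame(U * s, index=data.index, columns=pc_names)
    # Fraction of total variance explained by each component
    fracs = s**2 / np.sum(s**2)
    return Wt, X_proj, fracs

# Toy data: 20 "cells" by 5 "genes" of random numbers
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 5)), columns=['g%d' % i for i in range(5)])
Wt_toy, proj_toy, fracs_toy = pca_sketch(toy)
```

Results from a sketch like this may differ from `do_pca` in the sign of individual components or in overall scaling, neither of which changes the structure seen in the plots.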

## Plot the 2D projection of the data
We don't have the embryo of origin in the data, so we cannot colour the cells by embryo as in the paper, but the projection looks like a reasonable match to Figure 1B. By focussing on the later stage we see better separation of the relevant cell types than in the earlier plots using all cells.

There are some subtle differences between this plot and the paper, perhaps due to differences in how the data were processed. That's why it is useful to include all the code for data analysis when publishing papers, so that other scientists can reproduce the analysis exactly. It is best practice to publish analysis notebooks like this one along with journal articles.

In [5]:
PC1 = 0
PC2 = 1
# Create one trace per cell type
traceCelltype = []
labelList = np.unique(labels)
# Check that all three 64-cell stage cell types are present in the subset
assert set(labelList) == {'64 TE', '64 EPI', '64 PE'}, 'Missing labels'
for lbl in labelList:
    traceCelltype.append(go.Scatter(
        x = X_proj[labels==lbl].iloc[:,PC1],
        y = X_proj[labels==lbl].iloc[:,PC2],
        mode='markers',
        name = lbl,
        text = lbl
        ))
iplot(traceCelltype, filename='GuoPCAFig1B')

## Figure 1C shows the factor loadings, which are in Wt
Here we plot the PC1 row of Wt against the PC2 row and label each point with the corresponding gene name (feature).

The plot is not identical to the paper due to subtle differences in the data normalisation used.

In [6]:
GeneNames = GuoData.columns # Collect together the gene names
# Create a trace
traceGene = go.Scatter(
    x = Wt.loc['PC1',:],
    y = Wt.loc['PC2',:],
    text = GeneNames,
    mode = 'markers'
        )
data = [traceGene]
# Plot and embed in ipython notebook!
iplot(data, filename='GuoPCAFig1C')
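
Hovering over points is one way to read the loadings plot; another is to rank genes by the size of their loading on a component. A minimal sketch, assuming loadings are stored as in `Wt` above (rows are components, columns are genes). The table `Wt_demo` and its gene names are hypothetical placeholders, not values from the Guo et al. data.

```python
import pandas as pd

# Hypothetical loadings table: rows are components, columns are placeholder genes
Wt_demo = pd.DataFrame(
    [[0.8, -0.1, 0.5, -0.3],
     [0.1, 0.9, -0.2, 0.4]],
    index=['PC1', 'PC2'],
    columns=['geneA', 'geneB', 'geneC', 'geneD'],
)

# Rank genes by absolute loading on PC1 and keep the two largest
top_pc1 = Wt_demo.loc['PC1'].abs().sort_values(ascending=False).head(2)
print(top_pc1.index.tolist())
```

Genes with large absolute loadings contribute most to that component, so they are natural candidates to compare against the genes highlighted in Figure 1C.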