# Applying GPinfo on the cell cycle single cell nCounter data of PC3 human prostate cancer
_Sumon Ahmed_, 2017

This notebook describes how GPinfo, with an informative prior over the latent space, can be used to infer cell cycle stages from the single cell nCounter data of the PC3 human prostate cancer cell line.

In [None]:
import pandas as pd
import numpy as np
from BGPLVM import GPinfo
from utils import plot, plot_comparison, calcroughness
%matplotlib inline

## Data description
<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4102402/" target="_blank">McDavid et al. (2014)</a> assayed the expression profiles of the PC3 human prostate cancer cell line. They identified the cells in the G0/G1, S and G2/M cell cycle stages.

The cells identified as G0/G1, S and G2/M have been mapped to capture times of 1, 2 and 3, respectively. Because optimizing pseudotime parameters is more challenging for periodic data, the prior has been initialized with the random pseudotimes that gave the largest log likelihood when estimating the cell cycle peak time points.
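
The stage-to-capture-time mapping described above can be sketched as follows. This is only an illustration: the stage labels below are made up, and the variable names (`stages`, `stage_to_capture`, `capture_time`) are assumptions rather than columns of the actual metadata file.

```python
import pandas as pd

# Hypothetical stage labels for five cells; the real labels come from
# the cell metadata, whose column names are not assumed here.
stages = pd.Series(["G0/G1", "S", "G2/M", "S", "G0/G1"], name="stage")

# Mapping stated in the text: G0/G1 -> 1, S -> 2, G2/M -> 3.
stage_to_capture = {"G0/G1": 1, "S": 2, "G2/M": 3}
capture_time = stages.map(stage_to_capture)

print(capture_time.tolist())  # [1, 2, 3, 2, 1]
```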


The "McDavidtrainingData.csv" file contains the expression profiles of the top 56 differentially expressed genes in 361 cells from the PC3 human prostate cancer cell line. These profiles are used in the inference.

The "McDavidCellMeta.csv" file contains additional information about the data, such as the capture time of each cell and different initializations of the pseudotimes.

## Model Construction
The first step in using GPinfo is to initialize the model with the observed data and, optionally, the additional metadata.

### Build sparse Bayesian GPLVM model
The following initializations are important for good optimization. If a value is not initialized, its default is used.  
<ul>
<li>__kernel:__ Covariance function that defines the mapping from the latent space to the data space in the Gaussian process prior.
<!--
    <ul>
        <li>name</li>
        <li>ls</li>
        <li>var</li>
        <li>period</li>
    </ul>
-->
</li>

<li>__vParams:__ Variational Parameters
    <ul>
        <li>Xmean - mean of the latent dimensions. ndarray of size $N \times Q$.</li>
        <li>Xvar - variance over the latent dimensions. A single floating point value or an ndarray of size $N \times Q$.</li>
        <li>Z - inducing inputs. ndarray of size $M \times Q$.</li>
    </ul>
</li>
<li>__priors:__ Prior over the latent input dimensions
    <ul>
        <li>Priormean - mean of the prior distribution. ndarray of size $N \times D$.</li>
        <li>Priorvar - variance of the prior distribution. A floating point value or an ndarray of size $N \times D$.</li>
    </ul>
</li>

<li>__latent_dims:__ Number of latent dimensions. An integer.</li>
<li>__n_inducing_points:__ Number of inducing points. An integer.</li>
</ul>
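
As a rough illustration of the shapes listed above, the arrays could be prepared with NumPy as in the sketch below. Everything here is an assumption made for illustration: the values of `Q` and `M`, the random capture times, and the way `Xmean` and `Z` are filled in. Only $N = 361$ comes from the data description, and the sketch does not call GPinfo itself.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 361  # number of cells (from the data description)
Q = 1    # number of latent dimensions (assumed for this sketch)
M = 30   # number of inducing points (assumed for this sketch)

# Hypothetical capture times in {1, 2, 3}, one per cell.
capture_time = rng.integers(1, 4, size=N).astype(float)

# Variational parameters.
Xmean = capture_time[:, None] + 0.1 * rng.standard_normal((N, Q))  # N x Q
Xvar = 0.1                                                          # scalar, or an N x Q array
Z = np.linspace(Xmean.min(), Xmean.max(), M)[:, None]               # M x Q (valid for Q = 1)

# Prior over the latent inputs: one column, centred on the capture times.
prior_mean = capture_time[:, None]
prior_var = 1.0                                                     # scalar, or an array

print(Xmean.shape, Z.shape, prior_mean.shape)  # (361, 1) (30, 1) (361, 1)
```

Spreading the inducing inputs evenly over the range of `Xmean` is one simple choice; other initializations are equally possible.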

### Run the model
- `fit_model` optimizes the model.
- `get_pseudotime` returns the estimated pseudotime points.

## Visualize the results
The expression profiles of some interesting genes have been plotted against the estimated pseudotime. Each point corresponds to the expression of a particular gene in a single cell.

The points are coloured by cell cycle stage according to <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4102402/" target="_blank" style="text-decoration:none;">McDavid et al. (2014)</a>. The circular horizontal axis (where both the first and last labels are G2/M) represents the periodicity realized by the method in pseudotime inference.
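
One common way for a Gaussian process model to express this kind of periodicity is a periodic covariance function. The sketch below uses the standard periodic kernel $k(x, x') = \sigma^2 \exp\left(-2 \sin^2(\pi (x - x') / p) / \ell^2\right)$. It is not necessarily the kernel GPinfo uses, and the function name `periodic_kernel` and the default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def periodic_kernel(x1, x2, ls=1.0, var=1.0, period=3.0):
    """Standard periodic kernel between two 1-D arrays of inputs."""
    diff = x1[:, None] - x2[None, :]
    return var * np.exp(-2.0 * np.sin(np.pi * diff / period) ** 2 / ls ** 2)

# Three pseudotime points; the first and last are exactly one period apart.
t = np.array([0.0, 1.5, 3.0])
K = periodic_kernel(t, t)

# Points one full period apart are (numerically) perfectly correlated,
# which is what makes the two ends of the axis coincide.
print(np.isclose(K[0, 2], 1.0))  # True
```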

The solid black line is the posterior predicted mean of expression profiles while the grey ribbon depicts the 95% confidence interval. 
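
A ribbon of this kind is typically computed from the predictive mean and variance. The sketch below assumes a Gaussian predictive distribution, so the 95% band is the mean plus or minus 1.96 standard deviations. The `mean` and `var` values are made up, and this is not necessarily how the plotting utilities compute the ribbon.

```python
import numpy as np

# Hypothetical predictive mean and variance at five pseudotime points.
mean = np.array([0.0, 0.5, 1.0, 0.5, 0.0])
var = np.array([0.04, 0.01, 0.04, 0.09, 0.04])

z = 1.96  # two-sided 95% quantile of the standard normal distribution
std = np.sqrt(var)
lower = mean - z * std
upper = mean + z * std

# Each row is (lower bound, upper bound) at one pseudotime point.
print(np.round(np.column_stack([lower, upper]), 3))
```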

The vertical dotted lines are the CycleBase peak times for the selected genes.

To see the expression profiles of a different set of genes, pass a list of gene names to the function `plot_genes`.