# GPcounts applied to bulk RNA-Seq time series data

Nuha BinTayyash, 2023

GPcounts is a Gaussian process regression package for counts data with negative binomial and zero-inflated negative binomial likelihoods as described in the paper "Non-parametric modelling of temporal and spatial counts data from RNA-seq experiments".

This notebook shows how to run GPcounts with a negative binomial likelihood and compares the results with the more standard Gaussian likelihood. We look for differentially expressed genes in three settings (trajectory inference, the one-sample test and the two-sample test) using the [fission yeast](https://bioconductor.org/packages/release/data/experiment/html/fission.html) gene expression dataset.
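For orientation, the negative binomial likelihood can be sketched in plain Python. The block below is a minimal illustration, not GPcounts code: it assumes the common parameterisation by a mean `mu` and a dispersion `alpha`, under which the variance is `mu + alpha * mu**2`, and it assumes that the `alpha` hyper-parameter set later in this notebook plays this role. The function name `nb_logpmf` and the numbers are hypothetical.

```python
import math

def nb_logpmf(y: int, mu: float, alpha: float) -> float:
    """Log-probability of count y under a negative binomial with
    mean mu and dispersion alpha (assumed parameterisation)."""
    r = 1.0 / alpha  # inverse dispersion
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )

# Hypothetical mean and dispersion, chosen only for illustration
mu, alpha = 5.0, 0.5

# Numerically check normalisation, mean and variance over a long range of counts
probs = [math.exp(nb_logpmf(y, mu, alpha)) for y in range(2000)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(total, mean, var)  # variance should match mu + alpha * mu**2
```

As `alpha` shrinks towards zero the variance approaches the mean, which is the Poisson limit; larger `alpha` allows the over-dispersion typical of RNA-seq counts.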

In [None]:
import numpy as np
import pandas as pd
import gpflow
from IPython.display import display

In [None]:
import tensorflow as tf 

In [None]:
tf.__version__

In [None]:
gpflow.__version__

In [None]:
filename = '../data/fission_normalized_counts.csv'
Y = pd.read_csv(filename,index_col=[0])
X = pd.read_csv('../data/fission_col_data.csv',index_col=[0])
X = X[['minute']]

In [None]:
Y.shape

In [None]:
from GPcounts.RNA_seq_GP import rna_seq_gp

Extract time series data for one gene

In [None]:
genes_name = ['SPAC11D3.01c']
gp_counts = rna_seq_gp(X.iloc[0:18,:],Y.iloc[:,0:18].loc[genes_name]) 

## 1. Using GP regression to learn hyper-parameters and infer trajectories

In [None]:
likelihood = 'Negative_binomial' # choose the likelihood
results = gp_counts.Infer_trajectory(likelihood)
display(results)

### Save GPflow models

By default, GPcounts creates a GPcounts_models folder and saves GPflow models there as checkpoints, using tf.train.Checkpoint throughout the training procedure. The package saves one GPflow model per GP fit, for each gene and each likelihood.

To change the default folder name, set the folder name parameter on the gp_counts object.

###  Print GP hyper-parameters

To print the GP hyper-parameters, call the load_predict_models() method on the GPcounts object. It loads the GPflow models for a list of genes and makes predictions with the selected likelihood. The predict argument is True by default; to load the models without predicting, set predict to False.

\** Note that the gp_counts object is shared between different tests and likelihoods, so you have to specify the test and the likelihood whose results you want to load.

In [None]:
genes_name = results.index.values # list of gene names
test_name = 'Infer_trajectory' # name of the test
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)

###### The load_predict_models() method returns params, a dictionary of three items

I. params['models'][0] is the list of GPflow models for the first gene <br />

In [None]:
gpflow.utilities.print_summary(params['models'][0], fmt='notebook')

II. params['means'][0] is the list of means predicted using the GPflow models <br />

III. params['vars'][0] is the list of variances/percentiles predicted using the GPflow models <br />

### Plot GP posterior predictive distribution 

To plot the GP fit for each gene, use the plot function from the helper.py file. It plots the GP posterior predictive distribution for each model, showing $\pm1$ standard deviation in dark shade and $\pm2$ standard deviations in light shade for the Gaussian likelihood, and the equivalent percentiles for non-Gaussian likelihoods.

In [None]:
from helper import plot 
plot(params,X.iloc[0:18,:].values,Y.iloc[:,0:18].loc[genes_name],results, test_name)
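The "equivalent percentiles" used for the shaded bands can be made concrete with the standard normal distribution. The sketch below only shows which percentiles correspond to $\pm1$ and $\pm2$ standard deviations under a Gaussian; that the plot function uses exactly these values for non-Gaussian likelihoods is an assumption, and `sd_band_percentiles` is a hypothetical helper name.

```python
from statistics import NormalDist

def sd_band_percentiles(k: float) -> tuple[float, float]:
    """Return the (lower, upper) percentiles that bound +/- k standard
    deviations around the mean of a Gaussian distribution."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return 100 * std_normal.cdf(-k), 100 * std_normal.cdf(k)

for k in (1, 2):
    lower, upper = sd_band_percentiles(k)
    print(f"+/-{k} sd -> {lower:.2f}th to {upper:.2f}th percentile")
```

For a skewed predictive distribution such as the negative binomial, percentile bands of this kind are not symmetric around the mean, which is why percentiles rather than standard deviations are reported.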

###  Initialize GP hyper-parameters

To replace the initial values of the GPflow hyper-parameters, set the length_scale, variance, alpha and km attributes of the GPcounts object.

In [None]:
gp_counts.length_scale = 1.
gp_counts.variance = 1.
gp_counts.alpha = 1.
gp_counts.km = 1.

In [None]:
results = gp_counts.Infer_trajectory(likelihood)
display(results)

To restore the default hyper-parameter initialization of GPcounts, set the hyper-parameters to None.

In [None]:
gp_counts.length_scale = None
gp_counts.variance = None
gp_counts.alpha = None
gp_counts.km = None

To use GPcounts with a Gaussian likelihood on log-transformed counts, $\log(y+1)$, change the likelihood parameter.

In [None]:
likelihood = 'Gaussian' # the Gaussian likelihood applies the log count transformation log(y+1)
results = gp_counts.Infer_trajectory(likelihood)
display(results)
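The $\log(y+1)$ transformation itself is easy to reproduce outside the package. The following sketch uses hypothetical count values and only the standard library; it illustrates the transformation and its inverse, not how GPcounts applies it internally.

```python
import math

# Hypothetical normalized counts for one gene, for illustration only
counts = [0, 3, 120, 4500]

# log(y + 1): zero counts map to zero instead of minus infinity
log_counts = [math.log1p(y) for y in counts]

# The inverse transformation, exp(v) - 1, recovers the original scale
recovered = [math.expm1(v) for v in log_counts]

print(log_counts)
print(recovered)
```

The added 1 keeps zero counts finite, at the price of compressing differences between small counts, which is one motivation for modelling the counts directly with a negative binomial likelihood.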

Load GPflow models for Gaussian likelihood and print hyper-parameters

In [None]:
test_name = 'Infer_trajectory' # name of the test
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)

In [None]:
gpflow.utilities.print_summary(params['models'][0], fmt='notebook')

Plot GP fit for Gaussian likelihood

In [None]:
plot(params,X.iloc[0:18,:].values,Y.iloc[:,0:18].loc[genes_name],results)

To use GPcounts with a Gaussian likelihood and any other transformation (not log counts), set the likelihood parameter to Gaussian, set the transform parameter to False and pass the already transformed y to GPcounts.

\* Y is transformed using the Anscombe transformation.ipynb notebook from the [SpatialDE](https://github.com/Teichlab/SpatialDE) package.
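As a rough illustration of what such a transformation looks like, one form of Anscombe's variance-stabilising approximation for negative binomial counts is $\log(y + 1/(2\phi))$, where $\phi$ is the dispersion. Whether the notebook mentioned above applies exactly this form is an assumption here, and the function name `anscombe_nb` and the value of `phi` are hypothetical; in practice $\phi$ would be estimated from the data.

```python
import math

def anscombe_nb(y: float, phi: float) -> float:
    """Anscombe-style variance-stabilising transform for negative binomial
    counts with dispersion phi (assumed form: log(y + 1 / (2 * phi)))."""
    return math.log(y + 1.0 / (2.0 * phi))

# Hypothetical counts and dispersion, for illustration only
counts = [0, 3, 120, 4500]
phi = 0.5
transformed = [anscombe_nb(y, phi) for y in counts]
print(transformed)
```

Because the data are transformed before they reach GPcounts, the transform parameter must be set to False so that the package does not apply $\log(y+1)$ on top.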

In [None]:
filename = '../data/Anscombe_transformation_fission_normalized_counts.csv'
Y_transformed = pd.read_csv(filename,index_col=[0]) # Y is transformed using Anscombe transformation
gp_counts = rna_seq_gp(X.iloc[0:18,:],Y_transformed.iloc[:,0:18].loc[genes_name]) 

In [None]:
likelihood = 'Gaussian'
results = gp_counts.Infer_trajectory(likelihood,transform = False)
display(results)

In [None]:
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)
plot(params,X.iloc[0:18,:].values,Y_transformed.iloc[:,0:18].loc[genes_name],results)

## 2. One-sample test

In a one-sample test we compute the log-likelihood ratio (LLR) between a dynamic and a constant model.

In [None]:
gp_counts = rna_seq_gp(X.iloc[0:18,:],Y.iloc[:,0:18].loc[genes_name]) 

In [None]:
likelihood = 'Negative_binomial' 
results = gp_counts.One_sample_test(likelihood)
display(results)
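The statistic reported by the one-sample test can be written out explicitly. This is a minimal sketch with hypothetical log-likelihood values; the function name is illustrative and the sign convention (dynamic minus constant) is an assumption consistent with the description above.

```python
def one_sample_llr(ll_dynamic: float, ll_constant: float) -> float:
    """Log-likelihood ratio between a time-varying (dynamic) GP model
    and a constant model fitted to the same gene."""
    return ll_dynamic - ll_constant

# Hypothetical log-likelihoods, for illustration only
llr = one_sample_llr(ll_dynamic=-95.2, ll_constant=-110.7)
print(llr)  # a large positive value favours the dynamic model
```

A value near zero means the constant model explains the data about as well as the dynamic one, so the gene shows little evidence of change over time.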

Set the test name to One_sample_test, then load the GPflow models and plot the GP fit.

In [None]:
test_name = 'One_sample_test' 
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)
plot(params,X.iloc[0:18,:].values,Y.iloc[:,0:18].loc[genes_name],results)

In [None]:
likelihood = 'Gaussian' 
results = gp_counts.One_sample_test(likelihood)
display(results)

In [None]:
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)
plot(params,X.iloc[0:18,:].values,Y.iloc[:,0:18].loc[genes_name],results)

## 3. Two-sample test

In a two-sample test we compute the LLR between a model where the two time series are replicates (same mean trajectory) and a model where the trajectories are different (independent trajectories).

First we create a new GPcounts object containing time series from two different conditions.

In [None]:
gp_counts = rna_seq_gp(X,Y.loc[genes_name])

Below we carry out a two-sample test with a negative binomial likelihood.

The shared-trajectory model has a lower log-likelihood than the sum of the independent models' log-likelihoods, providing evidence that the trajectories are different.

In [None]:
likelihood = 'Negative_binomial' 
results = gp_counts.Two_samples_test(likelihood)
display(results)
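The comparison between the shared and independent models can be sketched in the same way. The values below are hypothetical and the function name is illustrative; the sign convention (independent minus shared) is an assumption chosen so that positive values indicate different trajectories.

```python
def two_sample_llr(ll_shared: float, ll_a: float, ll_b: float) -> float:
    """Log-likelihood ratio between two independent models (one per
    condition) and a single shared-trajectory model."""
    return (ll_a + ll_b) - ll_shared

# Hypothetical log-likelihoods, for illustration only
llr = two_sample_llr(ll_shared=-250.0, ll_a=-118.0, ll_b=-121.5)
print(llr)  # positive: the independent models fit better than the shared one
```

The independent log-likelihoods are summed because the two conditions are modelled separately, so their joint log-likelihood is the sum of the two fits.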

In [None]:
from helper import plot 

In [None]:
test_name = 'Two_samples_test'
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)
plot(params,X.values,Y.loc[genes_name],results)


### Safe mode option

Set safe_mode to True to check:
1. if the mean of the GP posterior predictive distribution is in the mean of the data.
2. if the log-likelihood ratio LLR < 0 for small lengthscale or if the LLR takes on extreme values that would indicate a very large difference between the time-varying GP and constant model.

In [None]:
gp_counts = rna_seq_gp(X,Y.loc[genes_name],safe_mode = True)
likelihood = 'Negative_binomial' 
results = gp_counts.Two_samples_test(likelihood)
display(results)

In [None]:
test_name = 'Two_samples_test'
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)
plot(params,X.values,Y.loc[genes_name],results)

Now we carry out a two-sample test with a Gaussian likelihood. In this case the shared model has a higher log-likelihood than the sum of the independent models' log-likelihoods.

In [None]:
likelihood = 'Gaussian' 
results = gp_counts.Two_samples_test(likelihood)
display(results)

In [None]:
params = gp_counts.load_predict_models(genes_name,test_name,likelihood)
plot(params,X.values,Y.loc[genes_name],results)