Welcome! This is a tutorial about RASpy (Reaction Activity Scores in Python). 
In this notebook, we will show how to remove the cell-cycle effect on the RAS matrix. 

Note that cell-cycle removal cannot
be applied to the count matrix before RAS computation.
The correction can introduce negative values into the count matrix, which would make
it meaningless to evaluate, e.g., an OR operator using the sum operation.
For this reason, RASpy allows such operations to be applied directly to
the RAS matrix.
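
To make the problem with negative values concrete, here is a minimal sketch. It assumes only the convention mentioned above (an OR between genes is evaluated as a sum); the helper `gpr_or` and the gene values are hypothetical and are not part of the RASpy API.

```python
# Hypothetical helper, not the RASpy API: an OR rule evaluated as a sum.
def gpr_or(*values):
    return sum(values)

# Non-negative expression: two isoforms (A or B) add up as expected.
raw = {"A": 3.0, "B": 2.0}
ras_raw = gpr_or(raw["A"], raw["B"])

# After a regression-based correction, a gene value can become negative.
corrected = {"A": 3.0, "B": -2.0}
ras_corrected = gpr_or(corrected["A"], corrected["B"])

print(ras_raw)        # 5.0
print(ras_corrected)  # 1.0, lower than gene A alone
```

With a negative value, adding an alternative isoform lowers the reaction score, which contradicts the meaning of an OR rule.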

## Load the data

Load the count matrix (h5ad format). The dataset is reported as TPM and was downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110949).

In [None]:
import scanpy as sc
adata=sc.read_h5ad("../datasets/GSE110949_raw_adata")
adata

Load the metabolic model

In [None]:
from cobra.io import read_sbml_model
model=read_sbml_model('../metabolic_models/RECON3_symbol.xml')
model

## Processing on the count data

Annotate all the mitochondrial genes

In [None]:
# in case of gene-symbol annotation for genes
adata.var['mt'] = adata.var_names.str.startswith('MT-')

Compute quality metrics

In [None]:
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

Basic filtering:
- exclude low-quality cells with fewer than 2000 expressed (!=0) genes
- exclude all genes that were not detected in at least three cells
- exclude cells with a high fraction of mitochondrial counts (10% or more)
- exclude cells with too many total counts (more than 15000)


In [None]:
sc.pp.filter_cells(adata, min_genes=2000)
sc.pp.filter_genes(adata, min_cells=3)

adata = adata[adata.obs.total_counts <= 15000, :]
adata = adata[adata.obs.pct_counts_mt < 10, :].copy()  # copy the view before in-place steps

adata

In [None]:
# normalize_per_cell is deprecated in Scanpy; normalize_total is its replacement
sc.pp.normalize_total(adata, target_sum=1e4)

## Annotate cell cycle

Suppose you want to annotate the cell-cycle phase of each cell, so that its effect can be removed before clustering the RAS matrix (see notebook "RAS cluster analysis")

In [None]:
adata_for_cellcycle=adata.copy()

Apply log-transformation and scaling (normalization was already applied above) to prepare the data for cell-cycle scoring

In [None]:
sc.pp.log1p(adata_for_cellcycle)
sc.pp.scale(adata_for_cellcycle)

Load the cell-cycle genes defined in Tirosh et al., 2015. This is a list of 97 human genes, identified by gene symbol: the first 43 are used as S-phase genes and the remaining 54 as G2/M-phase genes.

In [None]:
# in case of gene-symbol annotation for genes
with open('../utils_files/regev_lab_cell_cycle_genes_symbol.txt') as f:
    cell_cycle_genes = [x.strip() for x in f]
s_genes = cell_cycle_genes[:43]    # first 43 entries: S-phase genes
g2m_genes = cell_cycle_genes[43:]  # remaining entries: G2/M-phase genes

In [None]:
#in case of ENSG annotation for genes
#cell_cycle_genes = [x.strip() for x in open('../utils_files/regev_lab_cell_cycle_genes_ensg.txt')]
#s_genes = cell_cycle_genes[:42]
#g2m_genes = cell_cycle_genes[42:]

Filter out genes not in the data

In [None]:
s_genes = [x for x in s_genes if x in adata.var_names]
g2m_genes = [x for x in g2m_genes if x in adata.var_names]
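
Hard-coded split indices are easy to get wrong when switching annotation files, and silent filtering can hide a poor overlap with the dataset. The following is a minimal sketch of a check that reports which genes are missing. The helper `split_and_check` is hypothetical, and the short gene lists are toy data, not the real 97-gene file.

```python
def split_and_check(genes, n_s, available):
    """Split a gene list into S and G2/M genes and report missing ones.

    genes: ordered list, S-phase genes first (assumed layout of the file)
    n_s: number of S-phase genes at the start of the list
    available: gene names present in the dataset
    """
    available = set(available)
    s_genes, g2m_genes = genes[:n_s], genes[n_s:]
    kept_s = [g for g in s_genes if g in available]
    kept_g2m = [g for g in g2m_genes if g in available]
    missing = [g for g in genes if g not in available]
    return kept_s, kept_g2m, missing


# Toy data: three S-phase genes followed by two G2/M genes.
genes = ["MCM5", "PCNA", "TYMS", "HMGB2", "CDK1"]
available = ["PCNA", "TYMS", "CDK1", "ACTB"]

kept_s, kept_g2m, missing = split_and_check(genes, 3, available)
print(kept_s)    # ['PCNA', 'TYMS']
print(kept_g2m)  # ['CDK1']
print(missing)   # ['MCM5', 'HMGB2']
```

If many genes end up in `missing`, the gene identifiers in the list and in the dataset probably use different annotations.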

Perform cell cycle scoring

In [None]:
sc.tl.score_genes_cell_cycle(adata_for_cellcycle, s_genes=s_genes, g2m_genes=g2m_genes)

In [None]:
adata_for_cellcycle
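
The scoring step adds the columns S_score, G2M_score and phase to the observations. As a rough guide to reading them, the sketch below shows a simplified version of the rule that, to our understanding, Scanpy uses to derive the phase from the two scores. The function `assign_phase` is hypothetical and is written here only for illustration.

```python
def assign_phase(s_score, g2m_score):
    """Simplified phase assignment from the two cell-cycle scores.

    Assumed rule: G1 when both scores are negative, otherwise the
    phase with the larger score (S on ties).
    """
    if s_score < 0 and g2m_score < 0:
        return "G1"
    return "G2M" if g2m_score > s_score else "S"


print(assign_phase(0.5, 0.1))    # S
print(assign_phase(0.1, 0.4))    # G2M
print(assign_phase(-0.2, -0.1))  # G1
```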

## Effect of cell cycle on the count matrix

Reduce the dimensionality of the data by running principal component analysis (PCA)

In [None]:
sc.tl.pca(adata_for_cellcycle, svd_solver='arpack')

Make a scatter plot in the PCA coordinates, colored by cell-cycle phase, to see how the phases separate

In [None]:
sc.pl.pca(adata_for_cellcycle, color=['phase'])

Save the cell-cycle information in the original adata object

In [None]:
adata.obs['phase']=adata_for_cellcycle.obs['phase']
adata.obs['S_score']=adata_for_cellcycle.obs['S_score']
adata.obs['G2M_score']=adata_for_cellcycle.obs['G2M_score']

## RAS computation

In [None]:
import sys
sys.path.insert(1, '../raspy/')

In [None]:
from ras import RAS_computation as rc

In [None]:
ras_object=rc(adata,model)

In [None]:
import time
t0 = time.perf_counter()
ras_adata = ras_object.compute()
t1 = time.perf_counter() - t0
print("Time elapsed: ", t1)  # wall-clock seconds elapsed (floating point)

Note that the cell-cycle information is saved in the ras_adata structure with the prefix "countmatrix_" (for example, "countmatrix_phase")

In [None]:
ras_adata.obs[["countmatrix_S_score","countmatrix_G2M_score"]]

## Pre-processing of the RAS matrix

Drop duplicate reactions (for example, two reactions having the same GPR)

In [None]:
reactions=list(ras_adata.to_df().T.drop_duplicates().index)
ras_adata=ras_adata[:,reactions]
ras_adata
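
The cell above compares score profiles, not GPR strings: two reactions are treated as duplicates when their scores are identical in every cell, which is what happens when they share a GPR. A plain-Python analogue of that logic is sketched below. The helper `drop_duplicate_reactions` and the toy scores are hypothetical.

```python
def drop_duplicate_reactions(ras):
    """Keep the first reaction for each distinct score profile.

    ras: dict mapping a reaction id to its list of scores across cells.
    """
    seen = set()
    kept = []
    for reaction, scores in ras.items():
        profile = tuple(scores)
        if profile not in seen:
            seen.add(profile)
            kept.append(reaction)
    return kept


ras = {
    "R1": [1.0, 0.0, 2.5],
    "R2": [1.0, 0.0, 2.5],  # identical to R1, e.g. same GPR
    "R3": [0.0, 4.0, 1.0],
}
kept = drop_duplicate_reactions(ras)
print(kept)  # ['R1', 'R3']
```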

Total-count normalization

In [None]:
sc.pp.normalize_total(ras_adata, target_sum=1e4)

Logarithmize the data

In [None]:
sc.pp.log1p(ras_adata)
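
For intuition, the two steps above can be sketched for a single cell in plain Python: the scores are rescaled so that they sum to the target value, then transformed with log(1 + x). The helper `normalize_and_log` is hypothetical and only mirrors the idea, not Scanpy's implementation.

```python
import math


def normalize_and_log(row, target_sum=1e4):
    """Rescale one cell's scores to sum to target_sum, then apply log1p."""
    total = sum(row)
    if total == 0:
        # Nothing to rescale; log1p(0) is 0 for every entry.
        return [0.0 for _ in row]
    return [math.log1p(v * target_sum / total) for v in row]


cell = [1.0, 3.0, 0.0]
out = normalize_and_log(cell)

# Undoing the log recovers values that sum to the target.
recovered_total = sum(math.expm1(v) for v in out)
print(recovered_total)
```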

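Scale each reaction score to zero mean and unit variance. As written, the call below does not clip values; pass max_value=10 to sc.pp.scale to clip values exceeding 10 standard deviations.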

In [None]:
sc.pp.scale(ras_adata)
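
What scaling does to one reaction can be sketched as follows. This is an illustration under stated assumptions, not Scanpy's code: the helper `scale_column` is hypothetical, it uses the sample standard deviation, and it caps only values above `max_value`.

```python
import statistics


def scale_column(values, max_value=None):
    """Center one reaction to zero mean and scale it to unit variance."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    if std == 0:
        std = 1.0  # constant reaction: avoid dividing by zero
    scaled = [(v - mean) / std for v in values]
    if max_value is not None:
        # Cap large positive values, as clipping would do.
        scaled = [min(v, max_value) for v in scaled]
    return scaled


scores = [1.0, 2.0, 3.0]
print(scale_column(scores))                 # [-1.0, 0.0, 1.0]
print(scale_column(scores, max_value=0.5))  # [-1.0, 0.0, 0.5]
```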

## Effect of cell cycle on the RAS matrix

Reduce the dimensionality of the data by running principal component analysis (PCA)

In [None]:
sc.tl.pca(ras_adata, svd_solver='arpack')

Make a scatter plot in the PCA coordinates, colored by cell-cycle phase, to see how the phases separate in the RAS matrix

In [None]:
sc.pl.pca(ras_adata, color=['countmatrix_phase'])

Remove the cell-cycle effect (the scores were computed on the count matrix, see notebook "Pre-processing of the count matrix")

In [None]:
sc.pp.regress_out(ras_adata, keys=['countmatrix_S_score','countmatrix_G2M_score'])
sc.pp.scale(ras_adata)
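
Conceptually, regressing out the two scores means fitting a linear model of each reaction on S_score and G2M_score and keeping the residuals. The sketch below reproduces this idea with NumPy on synthetic data for a single reaction. The variable names and the simulated values are assumptions made for illustration, and this is not Scanpy's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200

# Synthetic cell-cycle scores (stand-ins for the two score columns).
s_score = rng.normal(size=n_cells)
g2m_score = rng.normal(size=n_cells)

# Toy reaction score that depends partly on the cell cycle.
ras = 2.0 * s_score - 1.5 * g2m_score + rng.normal(size=n_cells)

# Ordinary least squares with an intercept and the two covariates.
design = np.column_stack([np.ones(n_cells), s_score, g2m_score])
coef, *_ = np.linalg.lstsq(design, ras, rcond=None)
residual = ras - design @ coef

corr_before = np.corrcoef(ras, s_score)[0, 1]
corr_after = np.corrcoef(residual, s_score)[0, 1]
print(corr_before, corr_after)
```

The residuals are uncorrelated with both scores by construction, which is why the phases are expected to overlap more in the PCA plot after this step.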

Reduce the dimensionality of the data by running principal component analysis (PCA)

In [None]:
sc.tl.pca(ras_adata, svd_solver='arpack')

Make a scatter plot in the PCA coordinates, colored by cell-cycle phase, to check the separation after cell-cycle effect removal

In [None]:
sc.pl.pca(ras_adata, color=['countmatrix_phase'])