Welcome! This is a tutorial about RASpy (Reaction Activity Scores in Python). 
In this notebook, we will show how to perform cluster analysis on the RAS matrix using the Scanpy toolkit.

In [None]:
import sys
sys.path.insert(1, '../raspy/')

## Load the data

Load the RAS matrix (h5ad format) previously computed in the notebook "Ras computation".

In [None]:
import scanpy as sc
ras_adata=sc.read_h5ad("../datasets/E-GEOD-86618_ras_adata")
ras_adata

## Pre-processing of the RAS matrix

Drop duplicate reactions (for example, two reactions having the same GPR).

In [None]:
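# Transpose so that reactions are rows, then keep the first of each set of identical rows
reactions=list(ras_adata.to_df().T.drop_duplicates().index)
# Copy so that later in-place steps act on a real AnnData object rather than a view
ras_adata=ras_adata[:,reactions].copy()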
ras_adata
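
To see what the de-duplication step does, here is a minimal sketch on a toy table. The reaction names (`R1`, `R2`, `R3`) and the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy RAS table: rows are cells, columns are reactions (hypothetical names).
toy_ras = pd.DataFrame(
    {
        "R1": [1.0, 2.0, 3.0],
        "R2": [1.0, 2.0, 3.0],  # identical to R1, e.g. same GPR rule
        "R3": [0.5, 0.0, 4.0],
    },
    index=["cell_a", "cell_b", "cell_c"],
)

# Transpose so each reaction is a row, then keep the first row of each
# group of identical rows (the pandas default is keep="first").
kept_reactions = list(toy_ras.T.drop_duplicates().index)
print(kept_reactions)
```

Only the first reaction of each group of identical score profiles is kept, so `R2` is dropped here.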

Normalize the data

In [None]:
sc.pp.normalize_total(ras_adata, target_sum=1e4)

Logarithmize the data

In [None]:
sc.pp.log1p(ras_adata)
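
The arithmetic behind the two steps above can be sketched with NumPy on a toy matrix. This is an illustration only, not Scanpy's implementation: it ignores optional arguments and assumes no row sums to zero. The array `toy` is made up.

```python
import numpy as np

# Toy matrix: rows are cells, columns are reactions (made-up values).
toy = np.array([[1.0, 3.0],
                [10.0, 30.0],
                [0.0, 5.0]])

target_sum = 1e4

# Total normalization: rescale every row so that it sums to target_sum.
row_totals = toy.sum(axis=1, keepdims=True)
normalized = toy / row_totals * target_sum

# Logarithmize with log(1 + x), which maps 0 to 0.
logged = np.log1p(normalized)

print(normalized)
print(logged)
```

After normalization the first two rows become identical, because they differ only by a constant factor.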

Set the .raw attribute of the AnnData object to the normalized and logarithmized RAS values for later use.

In [None]:
ras_adata.raw = ras_adata

Scale each reaction score to zero mean and unit variance. Clip values exceeding 10 standard deviations.

In [None]:
sc.pp.scale(ras_adata, max_value=10)
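
As a rough sketch of what scaling and clipping do to each column, the following uses NumPy on made-up values. The use of the sample standard deviation (`ddof=1`) and the upper-only clipping are assumptions made for this illustration, and the clipping threshold is set to 1 instead of 10 so that the effect is visible on a tiny example.

```python
import numpy as np

# Toy matrix: rows are cells, columns are reactions (made-up values).
toy = np.array([[0.0, 1.0],
                [0.0, 2.0],
                [0.0, 3.0],
                [30.0, 2.0]])

# Center each column and divide by its standard deviation.
mean = toy.mean(axis=0)
std = toy.std(axis=0, ddof=1)  # assumption: sample standard deviation
scaled = (toy - mean) / std

# Clip large values; 1.0 is used here only to make clipping visible.
max_value = 1.0
clipped = np.minimum(scaled, max_value)

print(scaled)
print(clipped)
```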

## Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA)

In [None]:
sc.tl.pca(ras_adata, svd_solver='arpack')

Make a scatter plot in PCA coordinates, colored by disease status, to look for differences between conditions.

In [None]:
# Color each cell by its disease annotation
sc.pl.pca(ras_adata, color=['countmatrix_Factor Value[disease]'])

Let us inspect the contribution of individual PCs to the total variance in the data. This gives an indication of how many PCs to consider when computing the neighborhood graph.

In [None]:
sc.pl.pca_variance_ratio(ras_adata, log=True)
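
The quantity shown in this plot can be reproduced in a few lines of NumPy. The sketch below runs on random toy data (not the RAS matrix) and computes the fraction of variance carried by each principal component from the singular values of the centered matrix.

```python
import numpy as np

# Random toy data whose four columns have very different spreads.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

# PCA via SVD of the column-centered matrix.
centered = toy - toy.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Variance along each component, then its share of the total variance.
explained_variance = singular_values**2 / (toy.shape[0] - 1)
variance_ratio = explained_variance / explained_variance.sum()

print(variance_ratio)
```

The ratios are sorted in decreasing order and sum to 1 when all components are kept.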

## Compute the clustering (default cluster parameters)

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. 

In [None]:
sc.pp.neighbors(ras_adata)
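
To give an intuition for the neighborhood graph, here is a brute-force k-nearest-neighbor search in NumPy on made-up 2D points. This is only a conceptual sketch: Scanpy works on the PCA coordinates, uses its own neighbor search, and additionally computes connectivity weights, none of which is reproduced here.

```python
import numpy as np

# Five made-up points forming two well-separated groups.
points = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [0.0, 0.2],
                   [5.0, 5.0],
                   [5.1, 5.0]])
k = 2

# Pairwise Euclidean distances.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

# Exclude each point from its own neighbor list.
np.fill_diagonal(dist, np.inf)

# Indices of the k closest points for every point.
knn_idx = np.argsort(dist, axis=1)[:, :k]
print(knn_idx)
```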

Cluster the cells using the Leiden algorithm

In [None]:
sc.tl.leiden(ras_adata)

Embed the graph in two dimensions using UMAP

In [None]:
sc.tl.umap(ras_adata)
sc.pl.umap(ras_adata,  color=['leiden','countmatrix_Factor Value[disease]'])

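## Find best clustering

Search a grid of Leiden resolutions, numbers of principal components, and numbers of neighbors with `find_bh`, then keep the combination with the best clustering score.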

In [None]:
from utils import find_bh

In [None]:
resolutions=[0.25,0.5,0.75,1,1.25,1.5]
n_pcs=[5,10,15,20]
n_neighbors=[5,10,15,20]

In [None]:
df=find_bh(ras_adata,resolutions=resolutions,
    n_pcs=n_pcs,
    n_neighbors=n_neighbors,
    names_of_groud_truth=[])
df
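
The three lists passed to `find_bh` define a grid of candidate settings. How `find_bh` iterates internally is not shown in this notebook, so the sketch below only enumerates the combinations, assuming every resolution is paired with every number of PCs and every number of neighbors.

```python
from itertools import product

resolutions = [0.25, 0.5, 0.75, 1, 1.25, 1.5]
n_pcs = [5, 10, 15, 20]
n_neighbors = [5, 10, 15, 20]

# Cartesian product: one tuple (resolution, n_pcs, n_neighbors) per setting.
grid = list(product(resolutions, n_pcs, n_neighbors))

print(len(grid))  # 6 * 4 * 4 = 96 combinations
print(grid[0])
```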

In [None]:
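# Column of df used as the objective to maximize
obj_fun="cluster_values_sil"
# Positional index of the row with the highest score
index=df[obj_fun].argmax()
res,n_pc,n_neighbor=df.iloc[index][["res","pcs_values","neigh_values"]].values
print(res,n_pc,n_neighbor)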
df[obj_fun].max()
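
The column name `cluster_values_sil` suggests a silhouette score, although the notebook does not show how `find_bh` computes it. Assuming that reading, the following is a minimal NumPy sketch of the mean silhouette for a given labeling. The function `silhouette_mean` and the toy points are hypothetical and written only for this illustration.

```python
import numpy as np

def silhouette_mean(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = dist[i, same].mean()
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated pairs of made-up points.
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_mean(toy_points, [0, 0, 1, 1])  # matches the geometry
bad = silhouette_mean(toy_points, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

A labeling that matches the geometry scores close to 1, while one that mixes the groups scores below 0, which is why taking the maximum of such a score is a reasonable selection rule.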

In [None]:
sc.pp.neighbors(ras_adata, n_neighbors=int(n_neighbor), n_pcs=int(n_pc))
sc.tl.leiden(ras_adata,resolution=res)
sc.tl.umap(ras_adata)
sc.pl.umap(ras_adata, color=['leiden'],
          palette={"0":"yellow",
                   "1":"green",
                   "2":"pink"})
sc.pl.umap(ras_adata, color=['countmatrix_Factor Value[disease]'],
           palette={"normal":"purple",
                    "idiopathic pulmonary fibrosis":"orange"})

## Finding marker reactions

Let us rank the reactions with the most differential RAS in each cluster.

In [None]:
from utils import rank_reactions_groups
rank_reactions_groups(ras_adata, 'leiden', method='t-test')
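
The internals of `rank_reactions_groups` are not shown here. As an illustration of the general idea, the sketch below ranks reactions for one cluster against all other cells with Welch's t-statistic, on made-up scores. The helper `welch_t`, the reaction names, and the one-vs-rest comparison are assumptions made for this example.

```python
import numpy as np

def welch_t(group, rest):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    group = np.asarray(group, dtype=float)
    rest = np.asarray(rest, dtype=float)
    var_g = group.var(ddof=1) / len(group)
    var_r = rest.var(ddof=1) / len(rest)
    return (group.mean() - rest.mean()) / np.sqrt(var_g + var_r)

# Made-up scores: rows are cells, columns are two hypothetical reactions.
scores = np.array([[5.0, 1.0],
                   [6.0, 1.2],
                   [5.5, 0.8],
                   [1.0, 1.1],
                   [1.5, 0.9],
                   [0.5, 1.0]])
labels = np.array(["0", "0", "0", "1", "1", "1"])
names = ["R_high_in_0", "R_flat"]

# Compare cluster "0" against all remaining cells, one reaction at a time.
in_cluster = labels == "0"
t_values = [welch_t(scores[in_cluster, j], scores[~in_cluster, j])
            for j in range(scores.shape[1])]

# Highest statistic first.
ranking = [names[j] for j in np.argsort(t_values)[::-1]]
print(ranking)
```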

Show the 5 top-ranked reactions per cluster.

In [None]:
import pandas as pd
df_markers=pd.DataFrame(ras_adata.uns['rank_genes_groups']['names']).head(5)
df_markers

Flatten the dataframe into an array of reaction names, cluster by cluster.

In [None]:
df_marker_list=df_markers.T.values.flatten()
df_marker_list
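
The order produced by this flattening is easy to check on a toy table. In the sketch below, the column names stand for cluster labels and the reaction names are made up.

```python
import pandas as pd

# One column per cluster, rows ordered by rank (hypothetical names).
toy_markers = pd.DataFrame({"0": ["R1", "R2"], "1": ["R3", "R4"]})

# Transposing first makes each cluster a row, so flattening lists all
# markers of cluster "0", then all markers of cluster "1".
flat = toy_markers.T.values.flatten()
print(list(flat))

# Optional: drop repeated names while preserving their order.
unique_in_order = list(dict.fromkeys(flat))
```

If the same reaction ranks among the top markers of more than one cluster, it appears more than once in the flattened array; the last line shows one way to remove such repeats.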

Make a dot plot of the RAS values of the marker reactions in each cluster.

In [None]:
sc.pl.dotplot(ras_adata, df_marker_list, groupby='leiden',
              use_raw=False,swap_axes=True);

Save the results 

In [None]:
ras_adata.write("../datasets/E-GEOD-86618ras_adata_clustering")