# Tutorial for human gastrula dataset

## CIARA functions
Install the `ciara_python` package with pip. The cell below does this from inside the notebook; the same command also works on the command line.

In [2]:
import sys
!{sys.executable} -m pip install --upgrade ciara_python

Collecting ciara_python
  Downloading ciara_python-0.9.8-py3-none-any.whl (4.1 kB)
Installing collected packages: ciara-python
  Attempting uninstall: ciara-python
    Found existing installation: ciara-python 0.9.7
    Uninstalling ciara-python-0.9.7:
      Successfully uninstalled ciara-python-0.9.7
Successfully installed ciara-python-0.9.8
You should consider upgrading via the '/opt/python/bin/python3.8 -m pip install --upgrade pip' command.


Import the two important CIARA functions and other packages needed for this notebook.

In [1]:
import scanpy as sc
import pandas as pd
import time
import numpy as np

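# Import from the pip-installed package (commented out in this notebook):
#from ciara_python import get_background_full, ciara
# This notebook instead imports both functions from local module files:
from ciara import ciara
from get_background_full import get_background_full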

## Import human gastrula dataset and KNN matrix

Note that an AnnData object stores the count matrix as cells x genes, i.e. transposed relative to the Seurat pipeline in R (genes x cells).
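To make the orientation concrete, here is a minimal sketch with a toy NumPy array (the values and gene labels are made up for illustration). It only shows what the transpose does to the shape; it does not use AnnData itself.

```python
import numpy as np

# Toy genes x cells matrix, as in the R/Seurat convention (3 genes, 2 cells).
genes_by_cells = np.array([
    [0.0, 1.5],   # hypothetical gene A
    [2.0, 0.0],   # hypothetical gene B
    [0.3, 0.7],   # hypothetical gene C
])

# AnnData expects cells x genes, so cells become rows after transposing.
cells_by_genes = genes_by_cells.T

print(genes_by_cells.shape)
print(cells_by_genes.shape)
```

After the transpose, row `i` holds the expression profile of cell `i`, which is the layout the rest of this notebook relies on.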

In [2]:
human_gast_norm = sc.read_csv('/root/host_home/Documents/CIARA/Data/norm_elmir_5_30_transposed.csv', delimiter=',')
#change to your data path

human_gast_norm = human_gast_norm.transpose()
print(human_gast_norm)

AnnData object with n_obs × n_vars = 1195 × 36570


Compute a PCA and, from it, the KNN graph for the dataset using scanpy's built-in functions.

In [11]:
#sc.pp.highly_variable_genes(human_gast_norm, n_top_genes=2000)
sc.tl.pca(human_gast_norm, n_comps=30)
sc.pp.neighbors(human_gast_norm, n_neighbors=5, use_rep='X_pca')
print(human_gast_norm.obsp["connectivities"].shape)

(1195, 1195)
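For intuition, the two steps above can be sketched in plain NumPy on random toy data. This is a simplified illustration, not scanpy's implementation: scanpy builds a weighted connectivity graph, whereas this sketch only returns the indices of the nearest cells by exact Euclidean distance. The helper names `pca_scores` and `knn_indices` are hypothetical.

```python
import numpy as np

def pca_scores(x, n_comps):
    """Project the rows of x onto the first n_comps principal components."""
    centered = x - x.mean(axis=0)
    # SVD of the centered matrix; u * s gives the component scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_comps] * s[:n_comps]

def knn_indices(points, k):
    """Return, for each row, the indices of its k nearest other rows."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each cell from its own neighbours
    return np.argsort(dist, axis=1)[:, :k]

rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 50))        # 20 cells x 50 genes, random toy data
scores = pca_scores(toy, n_comps=5)    # 20 x 5 matrix of PC scores
neighbours = knn_indices(scores, k=3)  # 20 x 3 matrix of neighbour indices
print(scores.shape, neighbours.shape)
```

The brute-force distance matrix is fine for a toy example but scales quadratically with the number of cells, which is one reason to prefer the scanpy functions on real data.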


## CIARA algorithm

### Step 1: Find background genes

The background genes are computed and stored as boolean values in the gene metadata of the AnnData object (in column `human_gast_norm.var["CIARA_background"]`):

In [12]:
t = time.perf_counter()

get_background_full(human_gast_norm, threshold=1, n_cells=3, n_cells_high=20)

elapsed_time = time.perf_counter() - t
print("Execution time: " + str(np.round(elapsed_time, 2)) + "s")

Background genes: 5057
Execution time: 0.08s
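The following sketch illustrates one plausible reading of the three parameters. It assumes that a gene counts as background when it is expressed above `threshold` in at least `n_cells` and at most `n_cells_high` cells; whether the package uses inclusive or strict bounds is not stated in this notebook, so treat the comparison operators as an assumption. The function name `background_mask` is hypothetical.

```python
import numpy as np

def background_mask(x, threshold=1, n_cells=3, n_cells_high=20):
    """Boolean mask over the genes (columns) of a cells x genes matrix.

    ASSUMPTION: a gene is kept when the number of cells expressing it
    above `threshold` lies in [n_cells, n_cells_high] (inclusive bounds).
    """
    n_expressing = (x > threshold).sum(axis=0)
    return (n_expressing >= n_cells) & (n_expressing <= n_cells_high)

# Toy data: 30 cells x 3 genes.
x = np.zeros((30, 3))
x[:2, 0] = 5.0    # expressed in 2 cells: too few
x[:5, 1] = 5.0    # expressed in 5 cells: within range
x[:25, 2] = 5.0   # expressed in 25 cells: too many

mask = background_mask(x)
print(mask.tolist())
```

Under this reading, the filter keeps genes that are neither absent nor broadly expressed, which matches the goal of finding markers for small groups of cells.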


### Step 2: Calculate entropy of mixing of background genes

The p-value for each background gene is added to the gene metadata of the AnnData object (in column `human_gast_norm.var["CIARA_p_value"]`):

**Runtime (4-core MacBook Pro) by number of genes (no approximation):**
- 1 gene: **0.2s**
- 10 genes: **0.5s**
- 100 genes: **4s**
- 1000 genes: **10s**
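- 5057 genes *(this dataset)*: **270s**

Note that the cell below is run with `approximation=True`, so its execution time is not directly comparable to the unapproximated timings listed above.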

In [13]:
t = time.perf_counter()

ciara(human_gast_norm, n_cores=4, p_value=0.001, odds_ratio=2, approximation=True, local_region=1)

elapsed_time = time.perf_counter() - t
print("\nExecution Time: " + str(np.round(elapsed_time, 2)) + "s")


## Running on 4 cores with a chunksize of 317

---- Finished sucessfully! ----

Execution Time: 27.26s


## CIARA results

The AnnData object now contains the results in its gene metadata. Genes that are not in the background set have no p-value:


In [14]:
human_gast_norm.var

| | CIARA_p_value | CIARA_background | highly_variable | highly_variable_rank | means | variances | variances_norm | dispersions | dispersions_norm |
|---|---|---|---|---|---|---|---|---|---|
| A1BG | | False | False | | 0.019167 | 0.004321 | 0.468777 | -1.058216 | 0.077933 |
| A1BG.AS1 | | False | False | | 0.002272 | 0.001009 | 1.103726 | 0.326512 | 1.011968 |
| A1CF | 1.0 | True | False | | 0.033122 | 0.015818 | 1.260520 | 0.567258 | 1.174357 |
| A2M | | False | True | 78.0 | 0.272114 | 0.136295 | 3.418209 | 3.335733 | 3.041764 |
| A2M.AS1 | | False | False | | 0.000310 | 0.000083 | 1.002971 | -0.991899 | 0.122666 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZXDC | 1.0 | True | False | | 0.050461 | 0.023791 | 1.112799 | 0.300321 | 0.994301 |
| ZYG11A | | False | False | | 0.001977 | 0.000618 | 1.283750 | -0.500891 | 0.453864 |
| ZYG11B | | False | False | | 0.142985 | 0.070245 | 1.153438 | 0.589084 | 1.189079 |
| ZYX | | False | True | | 0.219129 | 0.100239 | 1.089465 | 1.058813 | 1.505924 |
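A natural next step is to pull out the genes with small p-values. The sketch below uses a toy DataFrame standing in for `human_gast_norm.var`; the gene names and values are made up. The cutoff of 0.001 mirrors the `p_value` argument passed to `ciara` above, but whether CIARA applies exactly this filter internally is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for human_gast_norm.var; names and values are hypothetical.
var = pd.DataFrame(
    {
        "CIARA_p_value": [np.nan, 1.0, 0.0004, 0.00001],
        "CIARA_background": [False, True, True, True],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

# Comparisons with NaN evaluate to False, so genes without a p-value
# (those outside the background set) are dropped by the filter.
significant = var[var["CIARA_p_value"] <= 0.001].sort_values("CIARA_p_value")
print(significant.index.tolist())
```

On the real object, the same two lines applied to `human_gast_norm.var` would give the candidate genes ordered from most to least significant.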


In [15]:
#human_gast_norm.var.to_csv('CIARA_python_scanpy_KNN.csv', sep=',')