# Tutorial for the human gastrula dataset

## Entropy of Mixing functions
Install the entropy of mixing package with pip (this can also be done from the command line):

In [2]:
import sys
!{sys.executable} -m pip install --upgrade identom_python

Collecting identom_python
  Downloading identom_python-0.9.7-py3-none-any.whl (4.2 kB)
Installing collected packages: identom-python
  Attempting uninstall: identom-python
    Found existing installation: identom-python 0.9.6
    Uninstalling identom-python-0.9.6:
      Successfully uninstalled identom-python-0.9.6
Successfully installed identom-python-0.9.7


Import the two main entropy of mixing functions:

In [3]:
from identom_python import get_background_full, entropy_mixing

## Import human gastrula dataset and KNN matrix

Note that in an AnnData object the count matrix is stored as cells × genes, i.e. transposed relative to the Seurat pipeline in R (genes × cells).
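To make the orientation concrete, here is a minimal sketch on a toy matrix. The gene and cell counts are made up for illustration, and the example uses plain NumPy rather than the actual dataset.

```python
import numpy as np

# Toy count matrix in the R/Seurat orientation: 4 genes (rows) x 3 cells (columns).
genes_by_cells = np.arange(12).reshape(4, 3)

# AnnData stores observations (cells) as rows and variables (genes) as columns,
# so the matrix has to be transposed before it is wrapped in an AnnData object.
cells_by_genes = genes_by_cells.T

print(genes_by_cells.shape)  # (4, 3)
print(cells_by_genes.shape)  # (3, 4)
```

The value for gene `i` in cell `j` moves from position `[i, j]` to position `[j, i]`; nothing else changes.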

In [3]:
import scanpy as sc
import pandas as pd

human_gast_norm = sc.read_csv('/root/host_home/Documents/EntropyOfMixing/Data/norm_elmir_5_30_transposed.csv', delimiter=',')
human_gast_norm = human_gast_norm.transpose()
print(human_gast_norm)

knn_matrix = pd.read_csv('/root/host_home/Documents/EntropyOfMixing/Data/knn_matrix_elmir_5_30.csv', delimiter=',', index_col=0)


AnnData object with n_obs × n_vars = 1195 × 36570
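Before running the algorithm it can be worth checking that the KNN matrix and the expression data describe the same cells. The sketch below assumes that the rows of the KNN matrix are labelled with cell names (suggested by `index_col=0`, but not confirmed here); `check_knn_alignment` is a hypothetical helper and not part of `identom_python`.

```python
def check_knn_alignment(cell_names, knn_row_names):
    """Compare cell names in the data with the row labels of the KNN matrix.

    Hypothetical helper for illustration. Returns two sorted lists:
    cells missing from the KNN matrix, and KNN rows without a matching cell.
    """
    cells = set(cell_names)
    knn = set(knn_row_names)
    missing = sorted(cells - knn)
    extra = sorted(knn - cells)
    return missing, extra


# Toy example with made-up cell names.
missing, extra = check_knn_alignment(["c1", "c2", "c3"], ["c1", "c3", "c4"])
print(missing)  # ['c2']
print(extra)    # ['c4']

# With the objects loaded above, the call might look like this:
# check_knn_alignment(human_gast_norm.obs_names, knn_matrix.index)
```

Two empty lists mean that both objects cover the same set of cells.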


## Entropy of mixing algorithm

### Step 1: Find background genes

The background genes are computed and stored as a boolean flag in the gene metadata (`.var`) of the AnnData object:

In [4]:
import time
import numpy as np

t = time.perf_counter()

human_gast_norm.var["EOM_background"] = get_background_full(human_gast_norm, threshold=1, n_cells=3, n_cells_high=20)

elapsed_time = time.perf_counter() - t
print("Execution time: " + str(np.round(elapsed_time, 2)) + "s")

#background_genes = human_gast_norm.var_names[human_gast_norm.var["EOM_background"]]

Background genes: 5057
Execution time: 0.08s
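Once the flag is stored, the names of the background genes can be pulled out of the gene metadata with a boolean mask. A minimal sketch on a toy table standing in for `human_gast_norm.var` (only a handful of genes, chosen for illustration):

```python
import pandas as pd

# Toy stand-in for human_gast_norm.var after Step 1.
var = pd.DataFrame(
    {"EOM_background": [False, False, True, False, True]},
    index=["A1BG", "A1BG.AS1", "A1CF", "A2M", "ZXDC"],
)

# Boolean-mask indexing keeps only the genes flagged as background.
background_genes = var.index[var["EOM_background"]]

print(list(background_genes))            # ['A1CF', 'ZXDC']
print(int(var["EOM_background"].sum()))  # 2
```

On the real object the same pattern applies to `human_gast_norm.var_names` and `human_gast_norm.var["EOM_background"]`.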


### Step 2: Calculate entropy of mixing of background genes

The entropy and the associated p-value for each background gene are added to the gene metadata in the AnnData object:

**Runtime (4-core MacBook Pro) by number of genes (no approximation):**
- 1 gene: **0.2s**
- 10 genes: **0.5s**
- 100 genes: **4s**
- 1000 genes: **10s**
- 5057 genes *(this dataset)*: **270s**
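Timings like those above can be collected with a small helper instead of repeating the `time.perf_counter()` pattern in every cell. The following is a sketch only; `timed` is a hypothetical helper name, and the workload is a placeholder rather than a call into `identom_python`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock time of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Placeholder workload; in the notebook this would wrap entropy_mixing(...)
# on subsets of 1, 10, 100, ... genes.
with timed("toy workload", timings):
    total = sum(i * i for i in range(100_000))

print(f"toy workload: {timings['toy workload']:.4f}s")
```

Collecting the results in a dictionary makes it easy to compare runs for different numbers of genes afterwards.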

In [5]:
#human_gast_small = human_gast_norm[:,0:1000]
#human_gast_small = human_gast_small.copy()

t = time.perf_counter()

entropies, p_values = entropy_mixing(human_gast_norm, knn_matrix, n_cores=4, p_value=0.001, odds_ratio=2, approximation=True, local_region=1)

elapsed_time = time.perf_counter() - t
print("\nExecution Time: " + str(np.round(elapsed_time, 2)) + "s")

human_gast_norm.var["EOM_entropy"] = entropies
human_gast_norm.var["EOM_p_value"] = p_values


---- Finished sucessfully! ----

Execution Time: 69.7s
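For intuition about the scale of the entropy values, the sketch below computes the standard binary Shannon entropy (in bits) of the fraction of cells that express a gene. This is an assumption about the general idea only: the tutorial does not spell out how `identom_python` computes its score, and `binary_entropy` is a hypothetical name.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a two-state mixture with fraction p."""
    # By convention 0 * log2(0) is treated as 0, so pure states have zero entropy.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


print(binary_entropy(0.5))  # 1.0
print(binary_entropy(0.0))  # 0.0
```

Under this definition the entropy is 0 when all cells are in the same state and reaches its maximum of 1 when the two states are mixed in equal parts.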


## Entropy of mixing results

The result is an extended AnnData object whose gene metadata contains the entropy of mixing results:


In [6]:
human_gast_norm.var

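| | EOM_background | EOM_entropy | EOM_p_value |
|---|---|---|---|
| A1BG | False | NaN | NaN |
| A1BG.AS1 | False | NaN | NaN |
| A1CF | True | 0.000000 | 2.486022e-07 |
| A2M | False | NaN | NaN |
| A2M.AS1 | False | NaN | NaN |
| ... | ... | ... | ... |
| ZXDC | True | 1.000000 | 1.000000e+00 |
| ZYG11A | False | NaN | NaN |
| ZYG11B | False | NaN | NaN |
| ZYX | False | NaN | NaN |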
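A natural next step is to select the background genes with a small p-value. The sketch below rebuilds a few rows of the results table as a toy DataFrame; reusing 0.001 as the cutoff mirrors the `p_value` argument from Step 2, but treating it as a post-hoc filter is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Toy copy of a few rows from the results table.
var = pd.DataFrame(
    {
        "EOM_background": [False, True, False, True, False],
        "EOM_entropy": [np.nan, 0.0, np.nan, 1.0, np.nan],
        "EOM_p_value": [np.nan, 2.486022e-07, np.nan, 1.0, np.nan],
    },
    index=["A1BG", "A1CF", "A2M", "ZXDC", "ZYX"],
)

p_cutoff = 0.001  # assumed cutoff, same value as p_value in Step 2

# Comparisons with NaN evaluate to False, so non-background genes drop out.
significant = var[var["EOM_background"] & (var["EOM_p_value"] < p_cutoff)]
significant = significant.sort_values("EOM_p_value")

print(list(significant.index))  # ['A1CF']
```

On the real data the same two lines can be applied directly to `human_gast_norm.var`.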
