This notebook is for experimenting with and developing the pseudobulk/centroid approach. For details, see [issue #244](https://github.com/theislab/pertpy/issues/244).

In [1]:
import pertpy as pt
import decoupler as dc
import scanpy as sc
import pandas as pd

Global seed set to 0
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)


In [2]:
# Load the Papalexi et al. 2021 ECCITE-seq dataset (a multimodal MuData object)
mdata = pt.dt.papalexi_2021()

In [3]:
mdata

## Explanation - modalities

The modalities in this dataset come from ECCITE-seq (Expanded CRISPR-compatible Cellular Indexing of Transcriptomes and Epitopes by sequencing). ECCITE-seq is a single-cell multi-omics technique that simultaneously profiles gene expression (RNA), surface protein abundance (ADT - Antibody-Derived Tags), sample identity (HTO - Hashtag Oligonucleotides), and the CRISPR guide carried by each cell (GDO - Guide-Derived Oligonucleotides).

Here's a brief description of each modality:

**RNA (RNA-Seq):** This modality captures and quantifies the gene expression levels in individual cells. It provides information about the transcriptome, allowing researchers to study gene expression patterns and identify different cell types or states based on their gene expression profiles.

**ADT (Antibody-Derived Tags):** ADT is a protein-level modality that uses antibodies conjugated with unique DNA tags. Each antibody recognizes a specific protein epitope, allowing the measurement of protein abundance or presence in individual cells. ADT enables the investigation of protein expression and cellular phenotypes at the single-cell level.

**HTO (Hashtag Oligonucleotides):** HTOs are short DNA barcodes, attached to antibodies against ubiquitously expressed surface proteins, that label all cells of a sample with a sample-specific tag. This allows several samples to be pooled (multiplexed) in one run and assigned back to their sample of origin afterwards, and it helps flag doublets. Unlike ADTs, HTOs are not used to measure protein abundance.

**GDO (Guide-Derived Oligonucleotides):** GDOs are the sequenced readout of the CRISPR guide RNAs (sgRNAs) present in each cell. They do not introduce the perturbation themselves; they record which guide, and therefore which target gene, each cell received (see `guide_ID` and `gene_target` in `.obs`). This links each genetic perturbation to its effect on the cell's transcriptome and protein levels.

In [4]:
rna_only = mdata["rna"]

In [8]:
rna_only.obs

index,orig.ident,nCount_RNA,nFeature_RNA,nCount_HTO,nFeature_HTO,nCount_GDO,nCount_ADT,nFeature_ADT,percent.mito,MULTI_ID,HTO_classification,guide_ID,gene_target,NT,perturbation,replicate,S.Score,G2M.Score,Phase
l1_AAACCTGAGCCAGAAC,Lane1,17207,3942,99.0,4,576.0,801.0,4,2.295577,rep1-tx,rep1-tx,STAT2g2,STAT2,STAT2g2,Perturbed,rep1,-0.252716,-0.771309,G1
l1_AAACCTGAGTGGACGT,Lane1,9506,2948,35.0,5,190.0,545.0,4,4.512939,rep1-tx,rep1-tx,CAV1g4,CAV1,CAV1g4,Perturbed,rep1,-0.123802,-0.332603,G1
l1_AAACCTGCATGAGCGA,Lane1,15256,4258,66.0,4,212.0,344.0,4,4.116413,rep1-tx,rep1-tx,STAT1g2,STAT1,STAT1g2,Perturbed,rep1,-0.154633,-0.694418,G1
l1_AAACCTGTCTTGTCAT,Lane1,5135,1780,22.0,3,243.0,539.0,4,5.491723,rep1-tx,rep1-tx,CD86g1,CD86,CD86g1,Perturbed,rep1,-0.061262,-0.037820,G1
l1_AAACGGGAGAACAACT,Lane1,9673,2671,99.0,5,198.0,1053.0,4,3.359868,rep1-tx,rep1-tx,IRF7g2,IRF7,IRF7g2,Perturbed,rep1,-0.132188,-0.353156,G1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
l8_TTTGTCAGTCACTTCC,Lane8,20863,4884,294.0,4,290.0,398.0,4,5.469012,rep3-tx,rep3-tx,CMTM6g1,CMTM6,CMTM6g1,Perturbed,rep3,-0.323562,-0.794679,G1
l8_TTTGTCAGTGACGGTA,Lane8,17553,3787,528.0,3,870.0,3042.0,4,2.159175,rep4-tx,rep2-tx,NTg4,NT,NT,NT,rep2,-0.153514,-0.632655,G1
l8_TTTGTCAGTTCCACAA,Lane8,15106,4185,154.0,6,267.0,212.0,4,2.661194,rep3-tx,rep3-tx,ATF2g1,ATF2,ATF2g1,Perturbed,rep3,-0.191933,-0.574283,G1
l8_TTTGTCATCACGCATA,Lane8,11209,3204,132.0,3,202.0,258.0,4,7.369078,rep3-tx,rep3-tx,CAV1g2,CAV1,CAV1g2,Perturbed,rep3,-0.134585,-0.501513,G1
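
Before aggregating cells, it can help to see how many cells fall into each combination of target gene and replicate, since very small groups give noisy pseudobulk profiles. The sketch below is only an illustration: it rebuilds a tiny `obs`-like table by hand from the nine rows displayed above (three columns only) and tabulates it with `pd.crosstab`. On the real data, the same call should work on `rna_only.obs` directly, assuming the column names shown in the table.

```python
import pandas as pd

# Toy stand-in for rna_only.obs, copied by hand from the rows shown above.
# Only three of the columns are kept; this is NOT the full dataset.
obs = pd.DataFrame(
    {
        "gene_target": ["STAT2", "CAV1", "STAT1", "CD86", "IRF7",
                        "CMTM6", "NT", "ATF2", "CAV1"],
        "perturbation": ["Perturbed"] * 6 + ["NT", "Perturbed", "Perturbed"],
        "replicate": ["rep1"] * 5 + ["rep3", "rep2", "rep3", "rep3"],
    },
    index=[
        "l1_AAACCTGAGCCAGAAC", "l1_AAACCTGAGTGGACGT", "l1_AAACCTGCATGAGCGA",
        "l1_AAACCTGTCTTGTCAT", "l1_AAACGGGAGAACAACT", "l8_TTTGTCAGTCACTTCC",
        "l8_TTTGTCAGTGACGGTA", "l8_TTTGTCAGTTCCACAA", "l8_TTTGTCATCACGCATA",
    ],
)

# Number of cells per (target gene, replicate) combination.
cells_per_group = pd.crosstab(obs["gene_target"], obs["replicate"])
print(cells_per_group)
```

Groups with only a handful of cells are candidates for exclusion before pseudobulking.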


In [9]:
# Aggregate all cells of each replicate into one profile (no further grouping)
pseudobulk = dc.get_pseudobulk(rna_only, sample_col="replicate", groups_col=None)

In [10]:
pseudobulk

View of AnnData object with n_obs × n_vars = 3 × 6034
    obs: 'MULTI_ID', 'HTO_classification', 'replicate'

In [17]:
pseudobulk.X

ArrayView([[ 4873.,  6286.,  8661., ..., 57074.,  2455.,  5901.],
           [ 4910.,  4675.,  5369., ..., 37991.,  1660.,  6557.],
           [ 5493.,  5061.,  5918., ..., 41939.,  1841.,  7796.]],
          dtype=float32)

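I am not yet sure how to interpret this output. With `groups_col=None`, the three rows are presumably one aggregated expression profile per replicate, but I should check the [decoupler pseudobulk tutorial](https://decoupler-py.readthedocs.io/en/latest/notebooks/pseudobulk.html) to confirm.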
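
To make the shape of the `pseudobulk` output above easier to reason about, here is a minimal sketch of the two aggregations this notebook is concerned with: a pseudobulk (sum of counts per group) and a centroid (mean per group). It uses plain pandas on a made-up 6-cell by 3-gene matrix; the cell names, gene names, and counts are hypothetical. It is not decoupler's implementation, which may also apply filtering or other steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy count matrix: 6 cells x 3 genes (values are made up).
counts = pd.DataFrame(
    np.array(
        [[1, 0, 3],
         [2, 1, 0],
         [0, 4, 1],
         [5, 0, 2],
         [1, 1, 1],
         [3, 2, 0]],
        dtype=np.float32,
    ),
    index=[f"cell{i}" for i in range(6)],
    columns=["geneA", "geneB", "geneC"],
)

# Hypothetical replicate label for each cell, aligned on the cell index.
replicate = pd.Series(
    ["rep1", "rep1", "rep2", "rep2", "rep3", "rep3"],
    index=counts.index,
    name="replicate",
)

# Pseudobulk: sum the counts of all cells that share a replicate label.
pseudobulk_sum = counts.groupby(replicate).sum()

# Centroid: average expression profile of the cells in each replicate.
centroid = counts.groupby(replicate).mean()

print(pseudobulk_sum)
print(centroid)
```

Both results have one row per replicate and one column per gene, which matches the 3-row layout seen above. Grouping by a perturbation label instead of the replicate would give one profile per perturbation.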