# DANCE data object and dataset tutorial

In this short tutorial, we first show how to use DANCE's built-in datasets. The DANCE data object is built on top of the widely used [AnnData](https://anndata.readthedocs.io/en/latest/) object (or [MuData](https://mudata.readthedocs.io/en/latest/) for multi-modality data). In the second part, we demonstrate how to seamlessly construct a DANCE data object from an AnnData object.

## Optional installation script for running on Google Colab

Uncomment the following block to install the dependencies if you are running on Google Colab.

In [1]:
# # Colab comes with torch installed, so we do not need to install pytorch here
# # !pip3 install torch torchvision torchaudio

# !pip install -q torch_geometric==2.3.1
# !pip install -q dgl==1.1.0 -f https://data.dgl.ai/wheels/cu117/repo.html
# !pip install -q torchnmf==0.3.4

# # !pip install -q pydance  # Install latest DANCE release from PyPI
# !pip install git+https://github.com/OmicsML/dance  # alternatively, install the latest dev version of DANCE from github
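
If you would rather not decide by hand whether the install block applies, one common way to detect Colab is to check whether the `google.colab` module has already been loaded. The helper name `running_on_colab` below is our own and not part of DANCE, and the check assumes that Colab kernels load `google.colab` at startup; treat it as a minimal sketch.

```python
import sys


def running_on_colab() -> bool:
    """Return True when the google.colab module is already loaded."""
    # Assumption: Colab kernels load google.colab at startup,
    # while other environments normally do not.
    return "google.colab" in sys.modules


if running_on_colab():
    print("Colab detected: uncomment and run the install commands above.")
else:
    print("Not on Colab: install DANCE in your own environment instead.")
```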

## DANCE built-in datasets

In [2]:
import anndata as ad

from dance.data import Data
from dance.datasets.singlemodality import CellTypeAnnotationDataset

In [3]:
# Initialize a data loader object, which lazily loads the full data upon calling the load_data method
dataloader = CellTypeAnnotationDataset(
    data_dir="./data",
    train_dataset=[753],
    test_dataset=[2695],
    species="mouse",
    tissue="Brain",
)

In [4]:
# Load full DANCE data object (will download necessary raw data the first time it is called)
data = dataloader.load_data()

[INFO][2023-10-23 21:30:45,021][dance][is_complete] ./data/train/mouse/mouse_Spleen1970_celltype.csv
[INFO][2023-10-23 21:30:45,022][dance][is_complete] file mouse_Spleen1970_celltype.csv doesn't exist
[INFO][2023-10-23 21:30:45,022][dance][is_complete] ./data/train/mouse/mouse_Spleen1970_celltype.csv
[INFO][2023-10-23 21:30:45,023][dance][is_complete] file mouse_Spleen1970_celltype.csv doesn't exist
[INFO][2023-10-23 21:30:45,641][dance][download_file] Downloading: ./data/train/mouse/mouse_Spleen1970_celltype.csv Bytes: 65,703
100%|██████████| 64.2k/64.2k [00:00<00:00, 5.47MB/s]
[INFO][2023-10-23 21:30:46,449][dance][download_file] Downloading: ./data/train/mouse/mouse_Spleen1970_data.csv Bytes: 80,38

In [5]:
# Load full DANCE data object from downloaded data (no need to redownload)
data = dataloader.load_data()

[INFO][2023-10-23 21:31:12,633][dance][_load_dfs] Loading data from ./data/train/mouse/mouse_Brain753_data.csv
[INFO][2023-10-23 21:31:13,278][dance][_load_dfs] Loading data from ./data/test/mouse/mouse_Brain2695_data.csv
[INFO][2023-10-23 21:31:18,687][dance][_load_dfs] Loading data from ./data/train/mouse/mouse_Brain753_celltype.csv
[INFO][2023-10-23 21:31:18,692][dance][_load_dfs] Loading data from ./data/test/mouse/mouse_Brain2695_celltype.csv
[INFO][2023-10-23 21:31:19,382][dance][_load_raw_data] Loaded expression data: AnnData object with n_obs × n_vars = 3448 × 17378
[INFO][2023-10-23 21:31:19,383][dance][_load_raw_data] Number of training samples: 753
[INFO][2023-10-23 21:31:19,383][dance][_load_raw_data] Number of testing samples: 2,695
[INFO][2023-10-23 21:31:19,383][dance][_load_raw_data] Cell-types (n=10):
['Astrocyte',
 'Astroglial cell',
 'Granulocyte',
 'Macrophage',
 'Microglia',
 'Myelinating oligodendrocyte',
 'Neuron',
 'Oligodendrocyte precursor cell',
 'Pan-GABAerg
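
The two `load_data` calls above behave differently because the raw files are kept on disk: the first call downloads whatever is missing, and later calls read the local copies. The sketch below illustrates that check-then-download pattern in plain Python. It is not DANCE's implementation; `fetch_if_missing`, `fake_download`, and the file name are hypothetical, and the download step is simulated so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_if_missing(path: Path, download: Callable[[Path], None]) -> bool:
    """Call download(path) only when path does not exist yet.

    Returns True if a download was triggered, False if the local file was reused.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    download(path)
    return True


def fake_download(path: Path) -> None:
    # Stand-in for a real HTTP download: writes a tiny CSV instead.
    path.write_text("cell,label\nC_1,Microglia\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "train" / "mouse" / "example_celltype.csv"
    first = fetch_if_missing(target, fake_download)   # file absent, so it is fetched
    second = fetch_if_missing(target, fake_download)  # file present, so it is reused
    print(first, second)
```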

In [6]:
data

Data object that wraps (.data):
AnnData object with n_obs × n_vars = 3448 × 17378
    uns: 'dance_config'
    obsm: 'cell_type'

In [7]:
data.data.obsm['cell_type']

| | Astrocyte | Astroglial cell | Granulocyte | Macrophage | Microglia | Myelinating oligodendrocyte | Neuron | Oligodendrocyte precursor cell | Pan-GABAergic | Schwann cell |
|---|---|---|---|---|---|---|---|---|---|---|
| mouse_Brain753_C_1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mouse_Brain753_C_2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mouse_Brain753_C_3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mouse_Brain753_C_4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mouse_Brain753_C_5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| mouse_Brain2695_C_2691 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mouse_Brain2695_C_2692 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mouse_Brain2695_C_2693 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mouse_Brain2695_C_2694 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
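
The `cell_type` matrix stores one row per cell and one column per cell type, so a label can be read back by finding the column that holds the 1.0 in each row. The sketch below uses a small hand-made pandas DataFrame rather than the real data; the cell names and values are made up for illustration. It also counts the non-zero entries per row, which makes it easy to spot rows that are not strictly one-hot.

```python
import pandas as pd

# Toy stand-in for data.data.obsm["cell_type"]; values are illustrative only.
cell_type = pd.DataFrame(
    [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [1.0, 1.0, 1.0]],
    index=["cell_1", "cell_2", "cell_3"],
    columns=["Astrocyte", "Microglia", "Neuron"],
)

# Number of non-zero entries per row; exactly 1 means the row is one-hot.
n_labels = (cell_type != 0).sum(axis=1)

# Column name of the largest entry per row (the first column wins on ties).
labels = cell_type.idxmax(axis=1)

# Keep a label only where the row is unambiguous; other rows become missing.
decoded = labels.where(n_labels == 1)
print(decoded)
```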


## Construct DANCE datasets directly from AnnData

In [8]:
import scanpy as sc

from dance.data import Data
from dance.utils.preprocess import cell_label_to_df

In [9]:
# Load a small, preprocessed subset of the PBMC 68k dataset that ships with Scanpy
pbmc68k_subset_adata = sc.datasets.pbmc68k_reduced()
pbmc68k_subset_adata

AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

In [10]:
# Wrap the AnnData object in a DANCE Data object
pbmc68k_subset_ddata = Data(pbmc68k_subset_adata)
pbmc68k_subset_ddata

Data object that wraps (.data):
AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'dance_config'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

In [11]:
pbmc68k_subset_ddata.obs

| index | bulk_labels | n_genes | percent_mito | n_counts | S_score | G2M_score | phase | louvain |
|---|---|---|---|---|---|---|---|---|
| AAAGCCTGGCTAAC-1 | CD14+ Monocyte | 1003 | 0.023856 | 2557.0 | -0.119160 | -0.816889 | G1 | 1 |
| AAATTCGATGCACA-1 | Dendritic | 1080 | 0.027458 | 2695.0 | 0.067026 | -0.889498 | S | 1 |
| AACACGTGGTCTTT-1 | CD56+ NK | 1228 | 0.016819 | 3389.0 | -0.147977 | -0.941749 | G1 | 3 |
| AAGTGCACGTGCTA-1 | CD4+/CD25 T Reg | 1007 | 0.011797 | 2204.0 | 0.065216 | 1.469291 | G2M | 9 |
| ACACGAACGGAGTG-1 | Dendritic | 1178 | 0.017277 | 3878.0 | -0.122974 | -0.868185 | G1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TGGCACCTCCAACA-8 | Dendritic | 1166 | 0.008840 | 3733.0 | -0.124456 | -0.867484 | G1 | 2 |
| TGTGAGTGCTTTAC-8 | Dendritic | 1014 | 0.022068 | 2311.0 | -0.298056 | -0.649070 | G1 | 1 |
| TGTTACTGGCGATT-8 | CD4+/CD25 T Reg | 1079 | 0.012821 | 3354.0 | 0.216895 | -0.527338 | S | 0 |
| TTCAGTACCGGGAA-8 | CD19+ B | 1030 | 0.014169 | 2823.0 | 0.139054 | -0.981590 | S | 4 |


In [12]:
# One-hot encode the bulk labels and store the result as the cell-type annotation
pbmc68k_subset_ddata.obsm["cell_type"] = cell_label_to_df(
    pbmc68k_subset_ddata.obs.bulk_labels,
    index=pbmc68k_subset_ddata.obs.index,
)
pbmc68k_subset_ddata.obsm["cell_type"]

| index | CD14+ Monocyte | CD19+ B | CD34+ | CD4+/CD25 T Reg | CD4+/CD45RA+/CD25- Naive T | CD4+/CD45RO+ Memory | CD56+ NK | CD8+ Cytotoxic T | CD8+/CD45RA+ Naive Cytotoxic | Dendritic |
|---|---|---|---|---|---|---|---|---|---|---|
| AAAGCCTGGCTAAC-1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AAATTCGATGCACA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AACACGTGGTCTTT-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| AAGTGCACGTGCTA-1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ACACGAACGGAGTG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TGGCACCTCCAACA-8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| TGTGAGTGCTTTAC-8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| TGTTACTGGCGATT-8 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TTCAGTACCGGGAA-8 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
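
`cell_label_to_df` turns a column of label strings into the one-hot table shown above. To make the transformation concrete, the sketch below produces a table of the same shape with `pandas.get_dummies` on a few toy labels. This is an analogy, not DANCE's implementation, and it may differ from `cell_label_to_df` in details such as column order or dtype; the shortened cell barcodes are made up.

```python
import pandas as pd

# Toy labels indexed by made-up cell barcodes.
labels = pd.Series(
    ["CD14+ Monocyte", "Dendritic", "CD56+ NK", "Dendritic"],
    index=["AAAG-1", "AAAT-1", "AACA-1", "ACAC-1"],
)

# One column per distinct label, 1.0 where the cell carries that label.
one_hot = pd.get_dummies(labels, dtype=float)
print(one_hot)
```

Each row sums to 1.0 because every cell carries exactly one label.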
