# Load h5ad data with scanpy

[Scanpy](https://scanpy.readthedocs.io/en/stable/index.html) is a Python package for single-cell analysis. It provides a [function](https://scanpy.readthedocs.io/en/stable/generated/scanpy.read_h5ad.html#scanpy.read_h5ad) to load data in h5ad format. The paper's sparse matrix (mtx.gz) is also available in [h5ad format](http://catlas.org/catlas_downloads/humantissues/Cell_by_cCRE/).

**Note:** A customized function is provided in the fewshotbench example.

In [1]:
import scanpy
import numpy as np

In [2]:
# Load matrix.h5ad from http://catlas.org/catlas_downloads/humantissues/Cell_by_cCRE/
# NOTE: on a Google Cloud VM, memory allocation seems to be an issue when loading the full matrix.
# Passing backed="r" reduces memory usage; see https://github.com/scverse/scanpy/issues/434
adata = scanpy.read_h5ad("data/Cell_by_cCRE/matrix.h5ad", backed="r")
adata

AnnData object with n_obs × n_vars = 1323041 × 1154611 backed at 'data/Cell_by_cCRE/matrix.h5ad'
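To see why memory is a concern here, the following back-of-the-envelope sketch estimates the size of this matrix if it were stored densely. The shape comes from the summary above and the 8-byte value size from the `<i8` dtype reported for `adata.X`. The real sparse file is far smaller, but its exact size depends on the number of nonzero entries, which is not stated here.

```python
# Shape taken from the AnnData summary; '<i8' means 64-bit (8-byte) integers.
n_obs, n_vars = 1_323_041, 1_154_611
bytes_per_value = 8

# Size of the matrix if every entry (including zeros) were stored explicitly.
dense_bytes = n_obs * n_vars * bytes_per_value
dense_terabytes = dense_bytes / 1e12

print(f"dense int64 matrix: about {dense_terabytes:.1f} TB")
```

A dense copy would therefore not fit in the memory of an ordinary VM, which is consistent with keeping the matrix on disk via `backed="r"`.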

# AnnData format

[AnnData](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) is short for annotated data. The h5ad file provided on the dataset website has minimal annotation. We can build up observation annotations (e.g., the cell type annotation given in the paper) and variable annotations (e.g., the class of each cCRE: promoter, distal, etc.).

In [3]:
# initial data
# sparse matrix of shape (#cell, #feature)
adata.X

<HDF5 sparse dataset: format 'csr', shape (1323041, 1154611), type '<i8'>

In [4]:
# obs: cell ID
adata.obs

LungMap_D122_1+AAACTACCAGCTGCGCTTATCC
LungMap_D122_1+AACTGCGCCATCCACTTGGATA
LungMap_D122_1+AACTTCTGCTCACCTGTAAGAC
LungMap_D122_1+AATTCGGATGAGATCTGTGACG
LungMap_D122_1+AATTCGGATGGTCCGGTCCAAA
...
spleen_sample_57_1+TTGGTTAACCCTTCAGGCCATTGGCCAGGTCCTCGTCATA
spleen_sample_57_1+TTGGTTGGTACGTAGCCGTAGATAGCCGATTTGCTCGATT
spleen_sample_57_1+TTGGTTGGTACTAAGAGTTATACCTTAGCTACCAGTTATT
spleen_sample_57_1+TTGGTTGGTAGCATTAGGCGCCGGTCCTAATTGCTCGATT
thymus_sample_2_1+TAGCATTGATCTGGCAGCGGTTCTGGCGCAACTTAAGATA
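The cell IDs above appear to follow a `sample+barcode` layout. Assuming that reading is correct (it is inferred from the output, not from the dataset documentation), a small helper such as the hypothetical `split_cell_id` below could recover a per-cell sample name to use as an observation annotation.

```python
from typing import Tuple


def split_cell_id(cell_id: str) -> Tuple[str, str]:
    """Split a cell ID at the first '+' into (sample, barcode). Assumed layout."""
    sample, barcode = cell_id.split("+", 1)
    return sample, barcode


# A few IDs copied from the adata.obs output.
cell_ids = [
    "LungMap_D122_1+AAACTACCAGCTGCGCTTATCC",
    "spleen_sample_57_1+TTGGTTAACCCTTCAGGCCATTGGCCAGGTCCTCGTCATA",
    "thymus_sample_2_1+TAGCATTGATCTGGCAGCGGTTCTGGCGCAACTTAAGATA",
]

samples = [split_cell_id(cell_id)[0] for cell_id in cell_ids]
print(samples)
```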


In [5]:
# var: features, in this case candidate cis-regulatory elements (cCRE)
adata.var

chr1:9955-10355
chr1:29163-29563
chr1:79215-79615
chr1:102755-103155
chr1:180580-180980
...
chrY:56676947-56677347
chrY:56677442-56677842
chrY:56678029-56678429
chrY:56678600-56679000
chrY:56707025-56707425
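Each feature name encodes a genomic interval as `chrom:start-end`. The sketch below parses that string into its parts, which could then be stored as variable annotations. The `Region` type and `parse_region` helper are illustrative names, not part of scanpy or anndata, and no assumption is made about the coordinate convention (0-based or 1-based).

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(name: str) -> Region:
    """Parse 'chrom:start-end' into its three components."""
    chrom, span = name.rsplit(":", 1)
    start, end = span.split("-")
    return Region(chrom, int(start), int(end))


# Two names copied from the adata.var output.
regions = [parse_region(n) for n in ["chr1:9955-10355", "chrY:56707025-56707425"]]
widths = [r.end - r.start for r in regions]
print(regions, widths)
```

For the names listed in the output above, `end - start` is 400 in every case.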


# Interfacing pytorch models

AnnData can interface directly with PyTorch through the experimental `AnnLoader` (this still needs more checking): https://anndata.readthedocs.io/en/latest/tutorials/notebooks/annloader.html# 

In [6]:
import torch.nn as nn
import pandas as pd
from anndata.experimental.pytorch import AnnLoader

In [7]:
# Tentatively try the dataloader
# NOTE: if this fails, try restarting the kernel; see https://github.com/ipython/ipython/issues/13598
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

In [9]:
dataloader.dataset

AnnCollection object with n_obs × n_vars = 1323041 × 1154611
  constructed from 1 AnnData objects