# Multi-file data loading and iteration for training deep learning models

We have a potentially very large number of `h5ad` files, for instance from many donors or patients, and would like to train a deep learning model on these data, say for integration, annotation and interpretation.

In [1]:
import numpy as np
import scanpy as sc
import anndata as ad
from anndata._core.concat_obs import AnnDataConcatObs
from sklearn.model_selection import train_test_split

# Generate sample data

In [2]:
adata_ref = sc.datasets.pbmc3k_processed()
adata = sc.datasets.pbmc68k_reduced()

In [3]:
var_names = adata_ref.var_names.intersection(adata.var_names)
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
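
With more than two files, the same intersection has to run across the `var_names` of every file. The following is a minimal sketch in plain Python, assuming the gene names are available as lists; `shared_var_names` and the toy gene lists are hypothetical and not part of anndata or scanpy.

```python
from functools import reduce


def shared_var_names(var_name_lists):
    """Return the names present in every list, in the order of the first list."""
    if not var_name_lists:
        return []
    common = reduce(set.intersection, (set(names) for names in var_name_lists))
    return [name for name in var_name_lists[0] if name in common]


# Toy gene lists standing in for `adata.var_names` of three files (assumption).
genes_per_file = [
    ["CD3D", "MS4A1", "NKG7", "LYZ"],
    ["LYZ", "CD3D", "GNLY", "NKG7"],
    ["NKG7", "CD3D", "LYZ"],
]
shared = shared_var_names(genes_per_file)
print(shared)  # ['CD3D', 'NKG7', 'LYZ']
```

Keeping the order of the first list gives every subset the same column order, which matters when the matrices are later stacked row-wise.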

In [4]:
import os
os.makedirs('data', exist_ok=True)  # ensure the output directory exists
adata_ref.write('data/pbmc_procs.h5ad')
adata.write('data/pbmc_reduc.h5ad')

# Construct an AnnDataSet object

We start with a collection of files.

In [5]:
files = ['data/pbmc_procs.h5ad', 'data/pbmc_reduc.h5ad']

From that, we can generate a list of AnnDatas, either in backed mode or in memory.

<div class="alert alert-info">
    
**Note**

* Loading files in backed mode is very fast and costs almost no memory, because the data matrix stays on disk until it is accessed.
* Loading files into memory takes longer at load time and uses memory proportional to the size of the data.
</div> 

In [6]:
adatas = [sc.read(file, backed='r') for file in files]

From the collection of AnnDatas, generate an `AnnDataSet` object, which differs from concatenation in a fundamental way:

* Concatenating generates a new in-memory object with potentially very high memory requirements.
* `AnnDataSet`, by contrast, costs no noticeable additional memory, as it only manages access to the underlying collection of AnnDatas.
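
The difference between copying and managing access can be sketched with a small row-wise view. This is an illustration under stated assumptions, not the `AnnDataSet` implementation: the class `LazyConcat` and its `locate` method are hypothetical, and plain lists stand in for AnnData objects.

```python
from bisect import bisect_right
from itertools import accumulate


class LazyConcat:
    """Hypothetical row-wise view over several datasets; stores references only."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # offsets[k] is the global index one past the last row of dataset k.
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def locate(self, i):
        """Map a global row index to (dataset position, local row index)."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.offsets, i)
        start = self.offsets[k - 1] if k > 0 else 0
        return k, i - start

    def __getitem__(self, i):
        k, local = self.locate(i)
        return self.datasets[k][local]


a = ["a0", "a1", "a2"]  # stand-in for the rows of the first file
b = ["b0", "b1"]        # stand-in for the rows of the second file
view = LazyConcat([a, b])
print(len(view), view.locate(3), view[4])  # 5 (1, 0) b1
```

The view holds only the cumulative offsets and references to the original objects, so its memory footprint does not grow with the number of rows.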

In [7]:
adset = AnnDataSet(adatas, join_obs='inner', join_obsm=None)
# suggested defaults
# * join_obs='inner' with available {None, 'inner', 'outer'}
# * join_obsm=None with available {None, 'inner'} - meaning no fields from obsm are written to the shared obsm field

Most importantly, `AnnDataSet` constructs a shared index for all AnnDatas along with shared metadata fields.
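
How the `join_obs` options could decide which metadata fields are shared can be sketched on column names alone. The function `join_columns` is hypothetical, the second column list is a toy example, and treating `None` as "share no fields" is an assumption carried over from the `join_obsm` comment above.

```python
def join_columns(columns_per_dataset, join="inner"):
    """Combine per-dataset column names: 'inner' keeps shared ones, 'outer' keeps all."""
    if join is None or not columns_per_dataset:
        return []
    first, *rest = columns_per_dataset
    if join == "inner":
        return [c for c in first if all(c in cols for cols in rest)]
    if join == "outer":
        seen = {}
        for cols in columns_per_dataset:
            for c in cols:
                seen.setdefault(c, None)  # dicts keep first-seen order
        return list(seen)
    raise ValueError(f"unknown join: {join!r}")


# Toy column names (assumption), standing in for `adata.obs.columns` per file.
obs_columns = [
    ["n_genes", "percent_mito", "n_counts", "louvain"],
    ["batch", "n_genes", "percent_mito", "n_counts", "louvain", "phase"],
]
print(join_columns(obs_columns, join="inner"))
print(join_columns(obs_columns, join="outer"))
```

With `'inner'`, every shared field is defined for every cell; with `'outer'`, fields missing from a file would need a fill value for that file's cells.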

In [9]:
adset.obs  # with join_obs='inner', this contains the intersection of the obs fields

AAACATACAACCAC-1
AAACATTGAGCTAC-1
AAACATTGATCAGC-1
AAACCGTGCTTCCG-1
AAACCGTGTATGCG-1
...
TGGCACCTCCAACA-8
TGTGAGTGCTTTAC-8
TGTTACTGGCGATT-8
TTCAGTACCGGGAA-8
TTGAGGTGGAGAGC-8


## Create a train-test split of the data

The steps below could be packaged into `AnnDataSet.split_random()`, an in-place method with parameter `random_state=0`.

In [13]:
_, test = train_test_split(adset.obs_names, random_state=0)  # fixed seed for a reproducible split
adset.obs['split'] = 'train'
adset.obs.loc[test, 'split'] = 'test'
adset.obs['split'] = adset.obs['split'].astype('category')
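
A seeded random split of this kind can be sketched as a standalone function. This is only an illustration of the idea behind the proposed `split_random()`: the function below, its `test_size` parameter and the returned dictionary are assumptions, it does not modify anything in place, and it assumes unique observation names.

```python
import random


def split_random(obs_names, test_size=0.25, random_state=0):
    """Return {obs_name: 'train' | 'test'} based on a seeded shuffle."""
    names = list(obs_names)
    shuffled = names[:]
    random.Random(random_state).shuffle(shuffled)
    n_test = round(len(names) * test_size)
    test = set(shuffled[:n_test])
    # Preserve the original order of the observations in the result.
    return {name: ("test" if name in test else "train") for name in names}


cells = [f"cell-{i}" for i in range(8)]  # toy observation names
split = split_random(cells, test_size=0.25, random_state=0)
print(sum(label == "test" for label in split.values()))  # 2
```

Because the generator is seeded, calling the function again with the same `random_state` returns the same assignment.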

In [14]:
adset.obs

index,split
AAACATACAACCAC-1,train
AAACATTGAGCTAC-1,train
AAACATTGATCAGC-1,test
AAACCGTGCTTCCG-1,train
AAACCGTGTATGCG-1,train
...,...
TGGCACCTCCAACA-8,train
TGTGAGTGCTTTAC-8,train
TGTTACTGGCGATT-8,test
TTCAGTACCGGGAA-8,train


# Generate subsets

Access all the test data.

In [16]:
adset[adset.obs.split == 'test']

AnnDataConcatView object with n_obs × n_vars = 835 × 208
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    obsm: 'X_pca', 'X_umap'
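
Subsetting with a boolean mask over the shared index implies translating that mask into row selections for each underlying file. A minimal sketch, assuming the files are stacked in order; `split_mask_by_source` is a hypothetical helper, not part of the proposed API.

```python
def split_mask_by_source(mask, sizes):
    """Split a global boolean mask into local row indices per dataset."""
    if len(mask) != sum(sizes):
        raise ValueError("mask length must equal the total number of rows")
    local_indices, start = [], 0
    for size in sizes:
        chunk = mask[start:start + size]
        local_indices.append([i for i, keep in enumerate(chunk) if keep])
        start += size
    return local_indices


sizes = [3, 2]                            # rows in the first and second file
mask = [True, False, True, False, True]   # e.g. `obs.split == 'test'`
selection = split_mask_by_source(mask, sizes)
print(selection)  # [[0, 2], [1]]
```

Only the index lists are materialized, so the subset can remain a view until the data are actually read.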

# Generate a PyTorch-compatible DataLoader from the training split

Initialize an AnnDataLoader from the training split.

In [17]:
loader = AnnDataLoader(
    adset[adset.obs.split == 'train'],
    batch_size=32,
    shuffle=True,
)
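
What `batch_size` and `shuffle` mean for iteration can be sketched without PyTorch. The generator below is a stand-in written for illustration, not the `AnnDataLoader` implementation; the name `iter_batches` and the `seed` parameter are assumptions.

```python
import random


def iter_batches(n_obs, batch_size=32, shuffle=True, seed=0):
    """Yield lists of row indices that together cover range(n_obs) exactly once."""
    order = list(range(n_obs))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, n_obs, batch_size):
        yield order[start:start + batch_size]


batches = list(iter_batches(n_obs=70, batch_size=32, shuffle=True, seed=0))
print([len(batch) for batch in batches])  # [32, 32, 6]
```

Each batch is just a list of row indices; combined with a lazy view over the files, only the rows of the current batch need to be read from disk.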

Train an integration, annotation, or interpretation model on the data, for example:

* `scVI`
* `scGen`, `scArches`
* `Intercode`
* `MARS`
* ...

In [None]:
model = Model()  # placeholder for one of the models listed above

In [None]:
for batch in loader.iterate(
     layer='X', obs=['label1', 'label2', 'domain']
):
    batch.X  # is a tensor
    batch.obs  # is a dictionary/namespace of 1-dim tensors storing columns ['label1', 'label2', 'domain']
    batch.obs.label1  # is a 1-dim tensor
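
The batch layout described in the comments above can be sketched with a small collate function. Plain lists stand in for tensors here, and `collate` with its record format is a hypothetical illustration rather than the proposed `iterate()` implementation.

```python
from types import SimpleNamespace


def collate(rows, obs_keys):
    """Turn a list of per-cell records into one batch with column-wise obs."""
    X = [row["X"] for row in rows]
    obs = SimpleNamespace(
        **{key: [row["obs"][key] for row in rows] for key in obs_keys}
    )
    return SimpleNamespace(X=X, obs=obs)


# Toy records (assumption): one expression vector and a few labels per cell.
rows = [
    {"X": [0.0, 1.5], "obs": {"label1": 0, "label2": 1, "domain": 0}},
    {"X": [2.0, 0.5], "obs": {"label1": 1, "label2": 1, "domain": 1}},
]
batch = collate(rows, obs_keys=["label1", "domain"])
print(batch.X)           # [[0.0, 1.5], [2.0, 0.5]]
print(batch.obs.label1)  # [0, 1]
```

Only the requested `obs` columns are collected, which keeps unused metadata out of each batch.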

# Apply the trained model to the test set

Annotate the test set with cell type information: predict and label the cell type of each cell.

In [None]:
test_names = adset.obs_names[adset.obs.split == 'test']  # select the test cells once
adset.obs.loc[test_names, 'cell_type'] = model.predict(adset[test_names].X)
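
Writing predictions back relies on the predictions being in the same order as the test rows, while all other rows stay unlabeled (in pandas, rows not touched by a `.loc` assignment to a new column are left as missing values). A minimal sketch of that alignment, with a hypothetical helper and toy labels:

```python
def annotate_test_rows(split, predictions, missing=None):
    """Assign predictions to 'test' rows in order; other rows get `missing`."""
    n_test = sum(1 for label in split if label == "test")
    if n_test != len(predictions):
        raise ValueError("need exactly one prediction per test row")
    remaining = iter(predictions)
    return [next(remaining) if label == "test" else missing for label in split]


split = ["train", "test", "train", "test"]
cell_type = annotate_test_rows(split, ["B cell", "T cell"])  # toy predictions
print(cell_type)  # [None, 'B cell', None, 'T cell']
```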

# Visualize the integrated and annotated data

In [None]:
adconcat = adset.to_adata()  # does not have `.X`

In [None]:
sc.pl.umap(adconcat, color='cell_type')

# Save the results of joint computation

In [None]:
adset.to_adata('my_results.h5ad')