# Create split index for CellXGeneNexusDataModule

The NexusDB data loader consists of two layers: a front end and a back end. The front end serves data to GPUs across multiple nodes, while the back end is responsible for data storage. We use the universal data storage engine [TileDB](https://tiledb.com/) as the back end. For distributed data-parallel training, the front end is based on the [LitData package](https://github.com/Lightning-AI/litdata). NexusDB supports indexing, so the same dataset files can be reused for multiple training splits, and it works with the existing [CELLxGENE Census](https://chanzuckerberg.github.io/cellxgene-census/) dataset, which is based on [TileDB-SOMA](https://github.com/single-cell-data/TileDB-SOMA).

This notebook shows how to generate indexes for NexusDB.

## `dataset_id`-level split for cellxgene

First, refer to the `cellxgene_dataset_split` notebook to learn about the `dataset_id`-level split. The code below reuses `celltypes_split.csv` to generate the train and dev splits. Running the cell writes a new index to the `cellxgene_nexus_index` folder.
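
To make the idea of a `dataset_id`-level split concrete, here is a minimal sketch in plain Python. It does not use the `bmfm_targets` API; the dataset IDs, the `dataset_to_split` mapping, and the `split_cells_by_dataset` helper are all hypothetical stand-ins for whatever the split file actually contains. The point it illustrates is that every cell inherits the split of its dataset, so no dataset contributes cells to both train and dev.

```python
from collections import defaultdict

# Hypothetical dataset-level assignment (a stand-in for the real split file).
dataset_to_split = {
    "dataset_a": "train",
    "dataset_b": "train",
    "dataset_c": "dev",
}

# Hypothetical per-cell metadata: (cell row index, dataset_id).
cells = [
    (0, "dataset_a"),
    (1, "dataset_a"),
    (2, "dataset_b"),
    (3, "dataset_c"),
    (4, "dataset_c"),
]


def split_cells_by_dataset(cells, dataset_to_split):
    """Group cell indices by the split assigned to their dataset."""
    split_to_cells = defaultdict(list)
    for cell_index, dataset_id in cells:
        split_to_cells[dataset_to_split[dataset_id]].append(cell_index)
    return dict(split_to_cells)


split_to_cells = split_cells_by_dataset(cells, dataset_to_split)
print(split_to_cells)  # {'train': [0, 1, 2], 'dev': [3, 4]}
```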

In [None]:
from bmfm_targets.datasets.cellxgene.cellxgene_soma_utils import (
    build_index_from_dataset_id,
)

# Path to the local CELLxGENE Census SOMA snapshot
uri = "/dccstor/bmfm-targets/data/omics/transcriptome/scRNA/pretrain/cellxgene/soma-2023-12-15"
build_index_from_dataset_id(
    uri=uri,
    index_dir="cellxgene_nexus_index",
)

## Create a short index for debugging purposes

In [None]:
import os
import shutil

from bmfm_targets.datasets.cellxgene.cellxgene_soma_utils import build_range_index

uri = "/dccstor/bmfm-targets/data/omics/transcriptome/scRNA/pretrain/cellxgene/soma-2023-12-15"
index_dir = "cellxgene_debug_nexus_index"

# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs(index_dir, exist_ok=True)
train_index_dir = os.path.join(index_dir, "train")
build_range_index(
    uri,
    train_index_dir,
    n_records=32,
    chunk_size=8,
    label_columns=["cell_type", "tissue"],
    value_filter="is_primary_data == True and nnz <= 512",
)
# For debugging, the dev split is simply a copy of the train index
shutil.copytree(train_index_dir, os.path.join(index_dir, "dev"), dirs_exist_ok=True)
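
As a quick sanity check after the cell above, the following sketch lists the files under each split directory and confirms that `dev` mirrors `train`. It relies only on the standard library and makes no assumption about the file names or format that `build_range_index` produces; `list_split_files` is a hypothetical helper introduced here for illustration.

```python
import os


def list_split_files(index_dir, splits=("train", "dev")):
    """Return {split: sorted relative file paths} for each split sub-directory."""
    contents = {}
    for split in splits:
        split_dir = os.path.join(index_dir, split)
        files = []
        # os.walk yields nothing if split_dir does not exist, leaving an empty list
        for root, _dirs, filenames in os.walk(split_dir):
            for name in filenames:
                files.append(os.path.relpath(os.path.join(root, name), split_dir))
        contents[split] = sorted(files)
    return contents


debug_index_dir = "cellxgene_debug_nexus_index"
if os.path.isdir(debug_index_dir):
    contents = list_split_files(debug_index_dir)
    for split, files in contents.items():
        print(f"{split}: {len(files)} file(s)")
    # The dev split was copied from train, so the listings should match
    print("dev mirrors train:", contents["train"] == contents["dev"])
```

If the listings differ, the most likely cause is a stale `dev` directory left over from an earlier run, since `dirs_exist_ok=True` overwrites matching files but does not delete extra ones.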