# Load 'Visium' spatial transcriptomics datasets

This library provides an interface to load 10x Genomics Visium 'spatial gene expression' datasets.
Datasets are adapted from [here](https://www.10xgenomics.com/resources/datasets?query=&page=1&configure%5BhitsPerPage%5D=50&configure%5BmaxValuesPerFacet%5D=1000&refinementList%5Bproduct.name%5D%5B0%5D=Spatial%20Gene%20Expression)

This library uses the popular Hugging Face `datasets` format for sharing the datasets. It is recommended to cross-reference the `datasets` [documentation](https://huggingface.co/) when using this library.

In [1]:
from st_visium_datasets import setup_logging

setup_logging(level="DEBUG")

## List available dataset names

In Hugging Face nomenclature, this project provides a single dataset called `visium`, with several possible [configurations](https://huggingface.co/docs/datasets/load_hub#configurations).

The default config for `visium` is `all`: it contains an aggregation of all existing configs.

To get a list of all available configs, use `list_visium_datasets`.

Config names are generally in the format: `<species>_<anatomical_entity>`

In [2]:
from st_visium_datasets import list_visium_datasets

dataset_names = list_visium_datasets()
dataset_names

['all',
 'human',
 'human_heart',
 'human_lymph-node',
 'human_kidney',
 'human_colorectal',
 'human_skin',
 'human_prostate',
 'human_ovary',
 'human_brain',
 'human_large-intestine',
 'human_spinal-cord',
 'human_cerebellum',
 'human_brain-cerebellum',
 'human_lung',
 'human_breast',
 'human_colon',
 'mouse',
 'mouse_olfactory-bulb',
 'mouse_kidney',
 'mouse_brain',
 'mouse_kidney-brain',
 'mouse_mouse-embryo',
 'mouse_lung-brain']
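
Since config names generally follow `<species>_<anatomical_entity>`, they can be split back into their two parts. The following is a minimal sketch and not part of `st_visium_datasets`: `split_config_name` is a hypothetical helper, and the sample names are copied from the output above. Aggregate configs such as `all`, `human`, and `mouse` carry no anatomical entity, so the second element is `None` for them.

```python
from collections import defaultdict
from typing import Optional, Tuple


def split_config_name(name: str) -> Tuple[str, Optional[str]]:
    """Split '<species>_<anatomical_entity>' at the first underscore.

    Hypothetical helper: names without an underscore ('all', 'human', ...)
    are treated as aggregate configs with no anatomical entity.
    """
    species, sep, entity = name.partition("_")
    return species, (entity if sep else None)


# A few names copied from the list printed above.
config_names = [
    "all",
    "human",
    "human_heart",
    "human_lymph-node",
    "mouse",
    "mouse_kidney-brain",
]

# Group anatomical entities by species, skipping aggregate configs.
by_species = defaultdict(list)
for name in config_names:
    species, entity = split_config_name(name)
    if entity is not None:
        by_species[species].append(entity)

print(dict(by_species))
```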

## Simple stats

Two important pieces of information about each dataset are the number of spots under tissue and the number of genes detected. `st_visium_datasets` provides this information directly for each dataset config name.

In [3]:
from st_visium_datasets import gen_visium_dataset_stat

gen_visium_dataset_stat("human_heart") # returns a dict

{'name': 'human_heart',
 'number_of_spots_under_tissue': 8482,
 'number_of_genes_detected': 40491}
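
Because each stat is a plain dict, several of them can be compared with ordinary Python. This is a minimal sketch that assumes the dict shape shown above. The values are copied by hand from the stats table in this document rather than fetched from the library, and the ranking logic is only an illustration.

```python
# Stat dicts in the shape shown above; values copied from this document.
stats = [
    {"name": "human_heart", "number_of_spots_under_tissue": 8482, "number_of_genes_detected": 40491},
    {"name": "human_kidney", "number_of_spots_under_tissue": 5936, "number_of_genes_detected": 18068},
    {"name": "human_skin", "number_of_spots_under_tissue": 3458, "number_of_genes_detected": 18069},
]

# Rank configs from fewest to most spots under tissue.
ranked = sorted(stats, key=lambda s: s["number_of_spots_under_tissue"])

for s in ranked:
    print(f"{s['name']:<14} {s['number_of_spots_under_tissue']:>6} spots")
```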

To view stats for all available datasets, you can use:

In [4]:
from st_visium_datasets import gen_visium_dataset_stat_table

print(gen_visium_dataset_stat_table())

| name                   |   number_of_spots_under_tissue |   number_of_genes_detected |
|------------------------|--------------------------------|----------------------------|
| all                    |                         344961 |                    1651817 |
| human                  |                         192976 |                     863878 |
| human_heart            |                           8482 |                      40491 |
| human_lymph-node       |                           8074 |                      48178 |
| human_kidney           |                           5936 |                      18068 |
| human_colorectal       |                           9080 |                      18077 |
| human_skin             |                           3458 |                      18069 |
| human_prostate         |                          14334 |                      68215 |
| human_ovary            |                          15153 |                      77975 |
| human_brain        

## Load a 'visium' dataset

Before you take the time to download a dataset, it’s often helpful to quickly get some general information about a dataset. A dataset’s information is stored inside DatasetInfo and can include information such as the dataset description, features, and dataset size.

Use the `load_visium_dataset_builder` function to load a dataset builder and inspect a dataset’s attributes without committing to downloading it.

Note: `load_visium_dataset_builder` has exactly the same signature as `datasets.load_dataset_builder`, except for the `path` argument, which is implicitly set to `visium`.

In [5]:
from st_visium_datasets import load_visium_dataset_builder

ds_builder = load_visium_dataset_builder("human_heart")
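
The note about the shared signature can be pictured as a thin wrapper that pre-binds `path`. This is only an illustrative sketch, not the library's actual implementation: `fake_load_dataset_builder` is a made-up stand-in for `datasets.load_dataset_builder` so that the example runs without any download, and `load_builder_sketch` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def fake_load_dataset_builder(
    path: str, name: Optional[str] = None, **kwargs: Any
) -> Dict[str, Any]:
    """Made-up stand-in for `datasets.load_dataset_builder`.

    It only records the arguments it receives, so the wrapper below
    can be inspected without touching the network.
    """
    return {"path": path, "name": name, **kwargs}


def load_builder_sketch(name: str = "all", **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical wrapper: `path` is fixed to 'visium', the rest is forwarded."""
    return fake_load_dataset_builder("visium", name, **kwargs)


builder = load_builder_sketch("human_heart")
print(builder)
```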

In [6]:
# Inspect dataset description
ds_builder.info.description

'Visium datasets for spatial transcriptomics. This dataset is a collection of the following datasets from 10x Genomics: v1-human-heart-1-0-0, v1-human-heart-1-1-0'

In [7]:
# Inspect dataset features
for k, v in ds_builder.info.features.items():
    print(f"- {k}: {v}")

- species: Value(dtype='string', id=None)
- anatomical_entity: Value(dtype='string', id=None)
- disease_state: Value(dtype='string', id=None)
- spot_path: Value(dtype='string', id=None)
- spot_bytes: Value(dtype='binary', id=None)
- features_path: Value(dtype='string', id=None)
- features: Sequence(feature={'feature_id': Value(dtype='string', id=None), 'feature_type': Value(dtype='string', id=None), 'gene': Value(dtype='string', id=None), 'count': Value(dtype='int64', id=None)}, length=-1, id=None)
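
As a rough illustration of how these features might be consumed: in the `datasets` library, a `Sequence` of a dict feature is returned as a dict of lists. The sketch below uses a hand-written record with placeholder gene names and counts (not real data) and a hypothetical helper, `top_genes`, that ranks genes by count.

```python
from typing import Dict, List, Tuple


def top_genes(features: Dict[str, List], k: int = 2) -> List[Tuple[str, int]]:
    """Return the k (gene, count) pairs with the highest counts.

    Hypothetical helper; expects the dict-of-lists layout that `datasets`
    uses for a Sequence of dict features.
    """
    pairs = zip(features["gene"], features["count"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Placeholder record shaped like the features listed above (values invented).
record = {
    "species": "human",
    "anatomical_entity": "heart",
    "features": {
        "feature_id": ["ID_1", "ID_2", "ID_3"],
        "feature_type": ["Gene Expression"] * 3,
        "gene": ["GENE_A", "GENE_B", "GENE_C"],
        "count": [3, 12, 7],
    },
}

print(top_genes(record["features"]))
```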


If you’re happy with the dataset, then load it with `load_visium_dataset` (again, same API as `datasets.load_dataset`).

To speed up data download and loading, we can use multiprocessing. In that case, please make sure your machine has enough RAM to hold multiple large TIFF images in memory at the same time.

In [8]:
from st_visium_datasets import load_visium_dataset

num_proc = 2
ds = load_visium_dataset("human_heart", num_proc=num_proc)

Building datasets ...: 100%|██████████| 2/2 [00:00<00:00,  5.46it/s]


Generating default split: 0 examples [00:00, ? examples/s]
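
One way to act on the RAM caveat above is to cap `num_proc` by both the CPU count and a memory budget. This is a minimal sketch: `pick_num_proc` is a hypothetical helper, and the per-worker memory figure is a guess that you would need to measure for your own data.

```python
import os
from typing import Optional


def pick_num_proc(
    ram_budget_gb: float,
    per_worker_gb: float,
    cpu_count: Optional[int] = None,
) -> int:
    """Pick a worker count limited by CPUs and by a RAM budget.

    Hypothetical helper: `per_worker_gb` is an assumed upper bound on the
    memory one worker needs while holding a large image.
    """
    cpus = cpu_count or os.cpu_count() or 1
    by_memory = int(ram_budget_gb // per_worker_gb)
    # Always use at least one process, even if the budget is very small.
    return max(1, min(cpus, by_memory))


# Example: 16 GB budget, roughly 4 GB per worker (assumed), 8 CPUs.
num_proc = pick_num_proc(ram_budget_gb=16, per_worker_gb=4, cpu_count=8)
print(num_proc)
```

The resulting value could then be passed as `num_proc` to `load_visium_dataset`.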

In [14]:
import numpy as np
from IPython.display import display, Image

spot = np.frombuffer(ds[0]["spot_bytes"], dtype=np.uint8)
display(Image(data=spot))

array([147,  78,  85, ..., 164, 140, 149], dtype=uint8)
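
The first three bytes printed above (147, 78, 85) match the start of the NumPy `.npy` magic string `\x93NUMPY`. This suggests, though the document does not confirm it, that `spot_bytes` may hold an array serialized in `.npy` format. The sketch below demonstrates that idea on a toy array only: it shows that `np.frombuffer` exposes the raw file bytes, while `np.load` on an in-memory buffer recovers the original array. Whether this applies to the actual dataset is an assumption to verify.

```python
import io

import numpy as np

# Toy array standing in for image data (not taken from the dataset).
toy = np.arange(6, dtype=np.uint16).reshape(2, 3)

# Serialize it to .npy bytes in memory.
buffer = io.BytesIO()
np.save(buffer, toy)
raw_bytes = buffer.getvalue()

# Viewing the bytes as uint8 exposes the file header, not the pixel values.
as_uint8 = np.frombuffer(raw_bytes, dtype=np.uint8)
print(as_uint8[:3])  # starts with the .npy magic bytes

# Loading from an in-memory buffer recovers the array with shape and dtype.
restored = np.load(io.BytesIO(raw_bytes))
print(restored.shape, restored.dtype)
```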