# SPEED workflow: Training on spatial epigenomic data without prior information from single-cell data

Dataset: the E13 mouse embryo spatial CUT&Tag-RNA-seq dataset by Zhang et al. ([here](https://doi.org/10.5281/zenodo.14948507))

In [1]:
import torch
print("Whether GPU is detected:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)

Whether GPU is detected: True
CUDA version: 11.7


In [2]:
import SPEED
import scanpy as sc

adata_input_path = 'spCUT_Tag/tile_H3K27ac.h5ad'
adata_output_path = './H3K27ac_out'

## Load the data

Load the spatial epigenomic data without the corresponding single-cell data. 

In [3]:
adata = sc.read(adata_input_path)

## Initialize the SPEED model

Initialize the model with spatial data.

`is_spatial` is set to `True` during the second stage of training on spatial data.

`k_degree` is the degree of spatial neighbors used for spatial relative position encoding. For data with a 50 μm resolution, `k_degree` defaults to 5. For data with a 20 μm resolution, a value of 12 is recommended.

`adata_sc` is set to `None` when training without prior information from single-cell data.

In [4]:
speed = SPEED.SPEED(adata, image=None, is_spatial=True, k_degree=12, adata_sc=None)

matrix ready...
use 0-1 matrix...
cell_features ready...
peak features ready...
Without single-cell reference
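
The `k_degree` guidance above can be captured in a small lookup. This is only a sketch: `suggest_k_degree` is a hypothetical helper, not part of the SPEED package, and it covers only the two resolutions mentioned in this tutorial.

```python
# Hypothetical helper (not part of SPEED): maps the spot resolution in
# micrometres to the k_degree values quoted in this tutorial.
RECOMMENDED_K_DEGREE = {50: 5, 20: 12}


def suggest_k_degree(resolution_um):
    """Return the k_degree quoted in the tutorial for a given resolution."""
    try:
        return RECOMMENDED_K_DEGREE[resolution_um]
    except KeyError:
        # No guidance is given for other resolutions, so refuse to guess.
        raise ValueError(
            f"No recommended k_degree for {resolution_um} um resolution"
        ) from None


k_degree = suggest_k_degree(20)
print("k_degree:", k_degree)
```

The result could then be passed as `k_degree=k_degree` when constructing the model. For resolutions other than 20 μm or 50 μm the helper raises an error rather than interpolating, since the tutorial gives no recommendation for them.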


### Split the data into training and validation sets.

`num_workers` is the number of subprocesses for data loading (default = 4).

`data_type` sets the input data format used by SPEED. SPEED handles this format internally, so no further action is required from the user. To reduce GPU memory use and speed up training, set `dense = False` (the default) when training on GPU and `dense = True` when training on CPU.
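
To give a feel for why a non-dense format can save memory, here is a back-of-the-envelope comparison of dense storage against CSR sparse storage. It assumes the `dense` option distinguishes dense from sparse storage of the input matrix, which this tutorial does not state explicitly. The matrix size, the 1% fill rate, and the helper names are illustrative assumptions, not values read from SPEED or from the dataset.

```python
def dense_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed to store every entry of a dense matrix (float32 by default)."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=4, index_size=4):
    """Bytes for a CSR matrix: values + column indices + row pointers.

    Assumes 32-bit indices; real libraries may use 64-bit indices.
    """
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


# Illustrative sizes (assumed, not read from the dataset).
n_spots, n_peaks = 9370, 245219
nnz = n_spots * n_peaks // 100  # assume roughly 1% of entries are nonzero

print(f"dense: {dense_bytes(n_spots, n_peaks) / 1e9:.2f} GB")
print(f"CSR:   {csr_bytes(n_spots, nnz) / 1e9:.2f} GB")
```

Under these assumptions the sparse layout is far smaller, because its cost scales with the number of nonzero entries rather than with the full matrix size.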

`batch_size_cell` and `batch_size_peak` are the batch sizes at the cell level and the peak level. SPEED chooses them automatically according to the dataset size; if the chosen batch size is too large for your GPU, reduce it manually.

`split_ratio` sets the proportion of data held out for validation at the cell level and the peak level (default = [1/6, 1/6]).

In [5]:
speed.setup_data(num_workers=4)

batch_size_cell = 1024, batch_size_peak = 32768
split ready...
labels ready...
peak embedding is given
dataset ready...
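
To illustrate what a validation proportion of 1/6 means, the sketch below holds out a random sixth of the indices at the cell level and at the peak level. It is not SPEED's internal implementation: `split_indices`, the seed, and the toy sizes are assumptions made for this example.

```python
import random


def split_indices(n_items, val_ratio=1 / 6, seed=0):
    """Randomly split range(n_items) into (train, validation) index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_val = round(n_items * val_ratio)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


# Toy sizes chosen for illustration only.
train_cells, val_cells = split_indices(600)
train_peaks, val_peaks = split_indices(6000)

print("cells:", len(train_cells), "train /", len(val_cells), "validation")
print("peaks:", len(train_peaks), "train /", len(val_peaks), "validation")
```

Splitting cells and peaks independently mirrors the two entries of `split_ratio`, one per level.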


### Build the neural network model for SPEED.

`emb_features` is the number of embedding features (default = 32).

`dropout_p` is the dropout probability of the model. For training on spatial data, a value of 0.4 is recommended.

In [6]:
speed.build_model(emb_features=32, dropout_p=0.4)

## Train the SPEED model

`lr` is the learning rate. `device` specifies whether to train with GPU or CPU.

`epoch_num` is the maximum number of training epochs (default = 500). If no improvement is observed on the validation set within `epo_max` epochs, training is considered converged and will stop (default `epo_max=30`).

`alpha` represents the weight of the constraint on the similarity between peak embeddings of spatial data. The default value is 10. A larger `alpha` means the model relies more on single-cell prior information. 

`beta` represents the importance of image information for spot embedding. The default value is 1. A larger `beta` means the model relies more on image information.

In [None]:
speed.train(lr=1e-5, device='cuda')
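
The stopping rule described by `epoch_num` and `epo_max` can be sketched as a patience-based early-stopping loop. This is a generic illustration, not SPEED's training code: the function name is hypothetical, and it assumes the monitored quantity is a validation loss where lower is better.

```python
def early_stopping_epoch(val_losses, patience=30, max_epochs=500):
    """Return (last_epoch_run, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs bring no improvement,
    or after `max_epochs` epochs, whichever comes first (epochs are 0-indexed).
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0  # consecutive epochs without improvement
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return last_epoch, best_epoch


# Toy loss curve: improves for three epochs, then stalls.
losses = [1.0, 0.9, 0.8, 0.85, 0.86, 0.87, 0.7]
last_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print("stopped after epoch", last_epoch, "best epoch", best_epoch)
```

With a patience of 3, the loop stops before ever seeing the final value of 0.7, which shows the trade-off: a small patience saves time but can miss late improvements.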

## Get the results

Use `SPEED.SPEED.get_embedding` to get the low-dimensional embedding.

The spot/cell embeddings are stored in `adata.obsm['X_SPEED']`. The peak embeddings are stored in `adata.varm['peak_SPEED']`.

In [None]:
adata = speed.get_embedding(adata)

Use `SPEED.SPEED.get_denoise_result` to get the denoised matrix.

In [9]:
adata.X = speed.get_denoise_result()

In [10]:
adata = speed.binarize(adata)

100%|██████████| 9370/9370 [00:29<00:00, 313.07it/s]
100%|██████████| 245219/245219 [01:27<00:00, 2797.61it/s]


In [11]:
adata.write(f'{adata_output_path}/adata_speed.h5ad')

In [12]:
exit