scBOND: Biologically faithful bidirectional translation between single-cell transcriptomes and DNA methylomes with adaptability to paired data scarcity

scBOND is a framework for bidirectional cross-modality translation between scRNA-seq and scDNAm profiles with broad biological applicability. We show that scBOND accurately translates data while preserving biologically significant differences between closely related cell types. It also recovers functional and tissue-specific signals in the human brain and reveals stage-specific and cell type-specific transcriptional-epigenetic mechanisms in the oligodendrocyte lineage. We further introduce scBOND-Aug, an extension of scBOND that uses biologically guided data augmentation and outperforms traditional methods when paired data are limited.

Installation

We recommend creating a new environment for scBOND:

conda create -n scBond python==3.9
conda activate scBond

scBOND is available on PyPI and can be installed with:

pip install scBond

Installation from GitHub is also supported:

git clone https://github.com/Biox-NKU/scBOND
cd scBOND
pip install -r requirements.txt

Installation takes approximately 5 to 10 minutes, depending on your hardware and internet connection.

Quick Start

Taking translation between scRNA-seq and scDNAm data as an example, scBOND is used in 3 steps: data preprocessing, model training, and predicting and evaluating.

First, create a scBOND model as follows:

from scBond.bond import Bond
bond = Bond()

1. Data preprocessing

Before data preprocessing, you should load the raw count matrix of scRNA-seq and scDNAm data via bond.load_data:

bond.load_data(RNA_data, MET_data, train_id, test_id, validation_id)

Parameters	Description
RNA_data	AnnData object of shape `n_obs` × `n_vars`. Rows correspond to cells and columns to genes.
MET_data	AnnData object of shape `n_obs` × `n_vars`. Rows correspond to cells and columns to peaks.
train_id	A list of cell IDs for training.
test_id	A list of cell IDs for testing.
validation_id	An optional list of cell IDs for validation. If set to None, bond uses 20% of the cells in train_id by default.
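
As an illustration of how the three ID lists might be prepared, here is a minimal sketch that splits cell indices at random. `split_cell_ids` is a hypothetical helper, not part of scBOND, and it assumes that cell IDs are integer positions into the AnnData objects and that validation cells are kept out of `train_id`; check the scBOND documents if your IDs take another form.

```python
import random


def split_cell_ids(n_cells, test_frac=0.2, val_frac=0.2, seed=0):
    """Split integer cell indices into train, validation and test lists.

    Hypothetical helper, not part of scBOND. val_frac is taken from the
    cells left after removing the test set, mirroring the documented
    default of 20% of the training cells.
    """
    ids = list(range(n_cells))
    random.Random(seed).shuffle(ids)  # reproducible shuffle

    n_test = int(n_cells * test_frac)
    test_id = ids[:n_test]
    remaining = ids[n_test:]

    n_val = int(len(remaining) * val_frac)
    validation_id = remaining[:n_val]
    train_id = remaining[n_val:]
    return train_id, validation_id, test_id


train_id, validation_id, test_id = split_cell_ids(100)
print(len(train_id), len(validation_id), len(test_id))
```

The resulting lists can then be passed to `bond.load_data` together with the two AnnData objects.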

An AnnData object is a container for single-cell data provided by the Python package anndata, which integrates seamlessly with scanpy, a widely used Python library for single-cell data analysis.

To preprocess the data, use bond.data_preprocessing:

bond.data_preprocessing()

You can save the processed data or write the preprocessing log to a file using the following parameters.

Parameters	Description
save_data	optional, whether to save the processed data, default False.
file_path	optional, the path for saving processed data, only used if `save_data` is True, default None.
logging_path	optional, the path for the preprocessing log; set it to None to skip logging, default None.

scBOND also supports fine-tuning this process through the other parameters listed below. However, we strongly recommend the default settings for the best model results.

other parameters

Parameter	Description
`normalize_total`	Whether to normalize total RNA expression per cell, default True
`log1p`	Whether to apply log(1+x) transformation to RNA data, default True
`use_hvg`	Whether to select highly variable genes, default True
`n_top_genes`	Number of highly variable genes to select, default 3000
`imputation`	Imputation method for missing values in methylation data ('median' or other), default 'median'
`min_cells`	Filter out features present in fewer than this fraction of cells, default 0.007
`normalize`	Normalization method for methylation data ('scale' or other), default 'scale'
`add_noise`	Whether to add Gaussian noise to data, default False
`noise_rate`	Standard deviation of Gaussian noise to add, default 0.0
`noise_seed`	Random seed for noise generation, default 42
`save_data`	Whether to save processed data, default False
`file_path`	Path for saving processed data (if save_data is True), default None
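
To make the default transformations more concrete, the sketch below shows what total-count normalization followed by log(1+x) does to RNA counts, and what median imputation does to missing methylation values. It is a plain-Python illustration, not scBOND's implementation: the function names are hypothetical, `target_sum=10000` is an assumed value, and missing methylation entries are represented as `None` for simplicity.

```python
import math
from statistics import median


def normalize_total_log1p(counts, target_sum=10000.0):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    out = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are left as all zeros.
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(v * scale) for v in row])
    return out


def median_impute(matrix):
    """Replace None entries in each feature (column) by that column's median."""
    filled = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Assumption: a feature with no observed value at all is filled with 0.
        fill = median(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = fill
    return filled


rna = normalize_total_log1p([[1, 1], [2, 6]], target_sum=4)
met = median_impute([[0.2, None], [0.8, 0.5], [None, 0.7]])
print(rna)
print(met)
```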

2. Model training

Before model training, you can choose whether to use the data augmentation strategy. With augmentation enabled, scBOND generates synthetic samples using cell-type labels (if cell_type is present in adata.obs).

scBOND provides a data augmentation API:
```
bond.augmentation(enable_augmentation)
```
Set enable_augmentation to True to augment the data or to False to skip augmentation. Augmentation increases training time but is intended to improve prediction quality.
- If enable_augmentation = True, scBOND-Aug looks for cell_type in adata.obs. If it is not found, the setting automatically falls back to False.
- If you only want to train scBOND on the original data, set enable_augmentation = False.
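
The fallback rule in the list above can be sketched as a small function. `resolve_augmentation` is a hypothetical name used only for illustration; it restates the documented behavior and does not show how scBOND-Aug generates synthetic samples.

```python
def resolve_augmentation(enable_augmentation, obs_columns):
    """Return the effective augmentation flag.

    Hypothetical helper, not part of scBOND: augmentation stays enabled
    only when a 'cell_type' column is available among the cell annotations.
    """
    return bool(enable_augmentation) and "cell_type" in obs_columns


print(resolve_augmentation(True, ["cell_type", "batch"]))
print(resolve_augmentation(True, ["batch"]))
```

With an AnnData object, the second argument would be the column names of `adata.obs`.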

You can construct a scBOND model as follows:

bond.construct_model(chrom_list)

scBOND needs a list with the number of peaks on each chromosome, so make sure the peaks are sorted by chromosome.

Parameters	Description
chrom_list	a list with the number of peaks on each chromosome; peaks must be sorted by chromosome.
logging_path	optional, the path for the model structure log; set it to None to skip logging, default None.
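
One way to derive `chrom_list` is to count consecutive features that share a chromosome. The sketch below assumes feature names of the form `chr1:1000-2000`, which is a hypothetical format; adapt the parsing to however your methylation features are named. `build_chrom_list` is not part of scBOND.

```python
from itertools import groupby


def build_chrom_list(feature_names):
    """Count features per chromosome, in the order chromosomes appear.

    Hypothetical helper. Assumes names look like 'chr1:1000-2000' and that
    features are already sorted so each chromosome forms one contiguous run.
    """
    chroms = [name.split(":")[0] for name in feature_names]
    runs = [(chrom, sum(1 for _ in group)) for chrom, group in groupby(chroms)]

    # A chromosome appearing in two separate runs means the input is unsorted.
    seen = [chrom for chrom, _ in runs]
    if len(seen) != len(set(seen)):
        raise ValueError("features are not grouped by chromosome; sort them first")
    return [count for _, count in runs]


features = [
    "chr1:100-200", "chr1:300-400",
    "chr2:50-150",
    "chr3:10-20", "chr3:30-40", "chr3:50-60",
]
chrom_list = build_chrom_list(features)
print(chrom_list)  # [2, 1, 3]
```

The counts sum to the number of features, which is a quick sanity check before calling `bond.construct_model(chrom_list)`.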

The scBOND model can be trained as follows:

bond.train_model()

Parameters	Description
output_path	optional, path for model checkpoints; if None, './model' is used, default None.
load_model	optional, the path of a pretrained model to load; set it to None to train from scratch, default None.
logging_path	optional, the path for the training log; set it to None to skip logging, default None.

scBOND also supports customizing the model structure and training process through other parameters of bond.construct_model() and bond.train_model().

other parameters for model construction

Parameter	Description
`R_encoder_nlayer`	Layer counts of RNA encoder, default 2
`M_encoder_nlayer`	Layer counts of methylation data encoder, default 2
`R_decoder_nlayer`	Layer counts of RNA decoder, default 2
`M_decoder_nlayer`	Layer counts of methylation data decoder, default 2
`R_encoder_dim_list`	Dimension list of RNA encoder, length equal to R_encoder_nlayer, default [256, 128]
`M_encoder_dim_list`	Dimension list of methylation data encoder, length equal to M_encoder_nlayer, default [32, 128]
`R_decoder_dim_list`	Dimension list of RNA decoder, length equal to R_decoder_nlayer, default [128, 256]
`M_decoder_dim_list`	Dimension list of methylation data decoder, length equal to M_decoder_nlayer, default [128, 32]
`R_encoder_act_list`	Activation list of RNA encoder, length equal to R_encoder_nlayer, default [nn.LeakyReLU(), nn.LeakyReLU()]
`M_encoder_act_list`	Activation list of methylation data encoder, length equal to M_encoder_nlayer, default [nn.LeakyReLU(), nn.LeakyReLU()]
`R_decoder_act_list`	Activation list of RNA decoder, length equal to R_decoder_nlayer, default [nn.LeakyReLU(), nn.LeakyReLU()]
`M_decoder_act_list`	Activation list of methylation data decoder, length equal to M_decoder_nlayer, default [nn.LeakyReLU(), nn.Sigmoid()]
`translator_embed_dim`	Dimension of embedding space for translator, default 128
`translator_input_dim_r`	Dimension of input from RNA encoder for translator, default 128
`translator_input_dim_m`	Dimension of input from methylation data encoder for translator, default 128
`translator_embed_act_list`	Activation list for translator, involving [mean_activation, log_var_activation, decoder_activation], default [nn.LeakyReLU(), nn.LeakyReLU(), nn.LeakyReLU()]
`discriminator_nlayer`	Layer counts of discriminator, default 1
`discriminator_dim_list_R`	Dimension list of discriminator for RNA, length equal to discriminator_nlayer, default [128]
`discriminator_dim_list_M`	Dimension list of discriminator for methylation data, length equal to discriminator_nlayer, default [128]
`discriminator_act_list`	Activation list of discriminator, length equal to discriminator_nlayer, default [nn.Sigmoid()]
`dropout_rate`	Rate of dropout for network, default 0.1
`R_noise_rate`	Rate of setting part of RNA input data to 0, default 0.5
`M_noise_rate`	Rate of setting part of methylation data input to 0, default 0.3
`num_experts`	Number of experts for translator, default 6
`num_experts_single`	Number of experts for single translator, default 6
`num_heads`	Number of parallel attention heads, default 8
`attn_drop`	Dropout probability applied to attention weights, default 0.1
`proj_drop`	Dropout probability applied to the output projection, default 0.1
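
Several of the parameters above must agree with each other: each dimension list and activation list has to be as long as the matching layer count. The sketch below checks a configuration before it is passed to `bond.construct_model()`. `check_layer_config` is a hypothetical helper, activations are written as strings so the example runs without PyTorch, and the check against the translator input dimension is an assumption based on the default values, not a documented requirement.

```python
def check_layer_config(name, nlayer, dim_list, act_list, translator_input_dim=None):
    """Validate one encoder, decoder or discriminator configuration.

    Hypothetical helper, not part of scBOND.
    """
    if len(dim_list) != nlayer:
        raise ValueError(f"{name}: expected {nlayer} dimensions, got {len(dim_list)}")
    if len(act_list) != nlayer:
        raise ValueError(f"{name}: expected {nlayer} activations, got {len(act_list)}")
    if any(d <= 0 for d in dim_list):
        raise ValueError(f"{name}: layer dimensions must be positive")
    # Assumption: the last encoder layer feeds the translator, so its width
    # should equal translator_input_dim_r or translator_input_dim_m.
    if translator_input_dim is not None and dim_list[-1] != translator_input_dim:
        raise ValueError(
            f"{name}: last dimension {dim_list[-1]} does not match "
            f"translator input dimension {translator_input_dim}"
        )
    return True


# A deeper RNA encoder than the default [256, 128].
ok = check_layer_config(
    "R_encoder", 3, [512, 256, 128], ["LeakyReLU"] * 3, translator_input_dim=128
)
print(ok)
```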

other parameters for model training

Parameter	Description
`R_encoder_lr`	Learning rate of RNA encoder, default 0.001
`M_encoder_lr`	Learning rate of methylation data encoder, default 0.001
`R_decoder_lr`	Learning rate of RNA decoder, default 0.001
`M_decoder_lr`	Learning rate of methylation data decoder, default 0.001
`R_translator_lr`	Learning rate of RNA pretrain translator, default 0.0001
`M_translator_lr`	Learning rate of methylation data pretrain translator, default 0.0001
`translator_lr`	Learning rate of translator, default 0.0001
`discriminator_lr`	Learning rate of discriminator, default 0.005
`R2R_pretrain_epoch`	Max epoch for pretrain RNA autoencoder, default 100
`M2M_pretrain_epoch`	Max epoch for pretrain methylation data autoencoder, default 100
`lock_encoder_and_decoder`	Lock the pretrained encoder and decoder or not, default False
`translator_epoch`	Max epoch for train translator, default 200
`patience`	Patience for loss on validation, default 50
`batch_size`	Batch size for training and validation, default 64
`r_loss`	Loss function for RNA reconstruction, default nn.MSELoss(size_average=True)
`m_loss`	Loss function for methylation data reconstruction, default nn.BCELoss(size_average=True)
`d_loss`	Loss function for discriminator, default nn.BCELoss(size_average=True)
`loss_weight`	List of loss weights for [r_loss, m_loss, d_loss], default [1, 2, 1]
`seed`	Set up the random seed, default 19193
`kl_mean`	Size average for kl divergence or not, default True
`R_pretrain_kl_warmup`	Epoch of linear weight warm up for kl divergence in RNA pretrain, default 50
`M_pretrain_kl_warmup`	Epoch of linear weight warm up for kl divergence in methylation data pretrain, default 50
`translation_kl_warmup`	Epoch of linear weight warm up for kl divergence in translator pretrain, default 50
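
Two of the training parameters describe schedules that are easier to see in code: the KL warm-up epochs and `patience`. The sketch below shows a generic linear warm-up weight and patience-based early stopping. Both functions are hypothetical illustrations of the general techniques; the exact schedule and stopping rule used inside scBOND may differ.

```python
def kl_weight(epoch, warmup_epochs):
    """Linear warm-up: the weight rises from 0 to 1 over warmup_epochs, then stays at 1."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)


def early_stop_epoch(val_losses, patience):
    """Return the epoch at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


print([kl_weight(e, 50) for e in (0, 25, 50, 100)])  # [0.0, 0.5, 1.0, 1.0]
print(early_stop_epoch([1.0, 0.9, 0.95, 0.97, 0.99], patience=3))  # 4
```

With the defaults listed above (warm-up of 50 epochs, patience of 50), the KL term would reach full weight at epoch 50 under this linear scheme.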

3. Predicting and evaluating

scBOND provides a prediction API. You can obtain the predicted profiles as follows:

M2R_predict, R2M_predict = bond.test_model()

A set of evaluation methods is also integrated into this function. You can enable them with the following parameters:

Parameters	Description
output_path	optional, path for evaluation output; if None, './model' is used, default None.
load_model	optional, whether to load a pretrained model, default False.
model_path	optional, the path of the pretrained model, only used if `load_model` is True, default None.
test_cluster	optional, whether to compute clustering metrics, including AMI, ARI, HOM and NMI, default False.
test_figure	optional, whether to draw the tSNE visualization of the predictions, default False.
output_data	optional, whether to write the predictions to file; if True, they are saved to `output_path/A2R_predict.h5ad` and `output_path/R2A_predict.h5ad`, default False.
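
The metrics reported by `test_cluster` compare a clustering of the predicted profiles with reference labels. As an illustration of one of them, here is a plain-Python sketch of the adjusted Rand index (ARI). It is not scBOND's evaluation code; in practice the same quantity is available as `adjusted_rand_score` in scikit-learn.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: labelings trivially agree

    # Pairs of cells grouped together in both, in truth only, in prediction only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)


# Same partition with swapped cluster names scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

ARI is 1 for identical partitions and close to 0 for random ones, and it does not depend on how clusters are named.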

scBOND also provides a separate API for single-modality prediction. You can predict the DNAm profile from the RNA profile as follows:
```
R2M_predict = bond.predict_single_modal(data_type='rna')
```
You can predict the RNA profile from the DNAm profile as follows:
```
M2R_predict = bond.predict_single_modal(data_type='met')
```

Demo, document, tutorial and source code

We provide demos of the basic scBOND model and two variants (scBOND-Aug and single-modality prediction) on the GSE140493 dataset in scBOND usage, scBOND-aug usage and scBOND for single modality prediction.

We also provide more detailed tutorials and documentation in scBOND documents, including details of the APIs for customizing data preprocessing, model structure and training strategy.
