scBOND: Biologically faithful bidirectional translation between single-cell transcriptomes and DNA methylomes with adaptability to paired data scarcity
scBOND is a framework for bidirectional cross-modality translation between scRNA-seq and scDNAm profiles with broad biological applicability. We show that scBOND accurately translates data while preserving biologically significant differences between closely related cell types. It also recovers functional and tissue-specific signals in the human brain and reveals stage-specific and cell type-specific transcriptional-epigenetic mechanisms in the oligodendrocyte lineage. We further introduce scBOND-Aug, an enhancement of scBOND that leverages biologically guided data augmentation and outperforms traditional methods when paired data are limited.
We recommend creating a new conda environment for scBOND:
conda create -n scBond python==3.9
conda activate scBond
scBOND is available on PyPI and can be installed with:
pip install scBond
Installation from GitHub is also supported:
git clone https://github.com/Biox-NKU/scBOND
cd scBOND
pip install -r requirements.txt
This process takes approximately 5 to 10 minutes, depending on the user's hardware and internet connection.
Taking the translation between scRNA-seq and scDNAm data as an example, scBOND can be used in three steps: data preprocessing, model training, and prediction and evaluation.
First, create a scBOND model as follows:
from scBond.bond import Bond
bond = Bond()
Before data preprocessing, load the raw count matrices of the scRNA-seq and scDNAm data via bond.load_data:
bond.load_data(RNA_data, MET_data, train_id, test_id, validation_id)
| Parameters | Description |
| --- | --- |
| RNA_data | AnnData object of shape n_obs × n_vars. Rows correspond to cells and columns to genes. |
| MET_data | AnnData object of shape n_obs × n_vars. Rows correspond to cells and columns to peaks. |
| train_id | A list of cell IDs for training. |
| test_id | A list of cell IDs for testing. |
| validation_id | An optional list of cell IDs for validation. If set to None, bond uses a default of 20% of the cells in train_id. |

An AnnData object is a Python container for single-cell data, provided by the Python package anndata, which integrates seamlessly with scanpy, a widely used Python library for single-cell data analysis.
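The ID lists passed to bond.load_data can be produced in many ways. The following is a minimal sketch of a random split, not part of scBOND: the helper name split_cell_ids, the fractions, and the use of positional integer indices as cell IDs are all assumptions made for illustration. It keeps the three lists disjoint and takes the validation cells from the non-test cells, in the spirit of the 20% default described above.

```python
import random


def split_cell_ids(n_cells, test_frac=0.2, val_frac=0.2, seed=0):
    """Split positional cell indices into train / validation / test lists.

    Hypothetical helper for illustration only. val_frac is applied to the
    cells that remain after the test cells have been removed.
    """
    ids = list(range(n_cells))
    random.Random(seed).shuffle(ids)  # reproducible shuffle

    n_test = int(round(n_cells * test_frac))
    test_id = ids[:n_test]
    rest = ids[n_test:]

    n_val = int(round(len(rest) * val_frac))
    validation_id = rest[:n_val]
    train_id = rest[n_val:]
    return train_id, validation_id, test_id


train_id, validation_id, test_id = split_cell_ids(100)
```

The resulting lists could then be passed as train_id, test_id, and validation_id, assuming scBOND accepts positional indices; check the package documentation if your cells are identified by barcode instead.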
For data preprocessing, use bond.data_preprocessing:
bond.data_preprocessing()
You can save the processed data or write the preprocessing log to a file using the following parameters.

| Parameters | Description |
| --- | --- |
| save_data | optional, whether to save the processed data, default False. |
| file_path | optional, the path for saving processed data, only used if save_data is True, default None. |
| logging_path | optional, the path for the preprocessing log; set it to None to skip logging, default None. |

scBOND also supports refining this process with other parameters; however, we strongly recommend the default settings to obtain the best model results.
Other parameters

| Parameter | Description |
| --- | --- |
| normalize_total | Whether to normalize total RNA expression per cell, default True |
| log1p | Whether to apply log(1+x) transformation to RNA data, default True |
| use_hvg | Whether to select highly variable genes, default True |
| n_top_genes | Number of highly variable genes to select, default 3000 |
| imputation | Imputation method for missing values in methylation data ('median' or other), default 'median' |
| min_cells | Filter out features present in fewer than this fraction of cells, default 0.007 |
| normalize | Normalization method for methylation data ('scale' or other), default 'scale' |
| add_noise | Whether to add Gaussian noise to data, default False |
| noise_rate | Standard deviation of Gaussian noise to add, default 0.0 |
| noise_seed | Random seed for noise generation, default 42 |
| save_data | Whether to save processed data, default False |
| file_path | Path for saving processed data (if save_data is True), default None |
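To make the min_cells and imputation options more concrete, here is a small dependency-free sketch of what a coverage filter followed by median imputation can look like. It is not scBOND's implementation: the function name filter_and_impute, the use of None for missing values, and the choice of a per-feature median are assumptions for illustration.

```python
from statistics import median


def filter_and_impute(matrix, min_cells=0.007):
    """Drop sparsely covered features, then fill gaps with the feature median.

    matrix: list of rows (cells); None marks a missing methylation value.
    A feature is kept when the fraction of cells with an observed value
    is at least min_cells. Returns (imputed_matrix, kept_feature_indices).
    """
    n_cells = len(matrix)
    n_features = len(matrix[0])

    kept = []  # (feature index, median of observed values)
    for j in range(n_features):
        observed = [row[j] for row in matrix if row[j] is not None]
        if observed and len(observed) / n_cells >= min_cells:
            kept.append((j, median(observed)))

    imputed = [
        [row[j] if row[j] is not None else med for j, med in kept]
        for row in matrix
    ]
    return imputed, [j for j, _ in kept]


toy = [
    [0.2, None, 0.9],
    [0.4, None, None],
    [None, None, 0.7],
    [0.8, 0.5, 0.5],
]
imputed, kept_features = filter_and_impute(toy, min_cells=0.5)
```

In the toy matrix the middle feature is observed in only one of four cells, so it is removed at a threshold of 0.5, and the remaining gaps are filled with the medians 0.4 and 0.7.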
Before model training, you can choose whether to use the data augmentation strategy. With data augmentation, scBOND generates synthetic samples using cell-type labels (if cell_type is in adata.obs). scBOND provides a data augmentation API:
bond.augmentation(enable_augmentation)
Set the enable_augmentation parameter to True to augment the data or to False to skip augmentation. Augmentation increases training time but is intended to improve prediction results.
- If you choose enable_augmentation = True, scBOND-Aug will try to find cell_type in adata.obs. If that fails, it automatically falls back to False.
- If you just want to use the original data for scBOND training, set enable_augmentation = False.
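The fallback rule described above can be expressed in a few lines. This is only a sketch of the documented behavior, not scBOND's code; the function name resolve_augmentation is hypothetical.

```python
def resolve_augmentation(obs_columns, enable_augmentation):
    """Return the effective augmentation flag.

    obs_columns: the column names available in adata.obs.
    Augmentation stays enabled only if it was requested and a
    'cell_type' column is present; otherwise it falls back to False.
    """
    return bool(enable_augmentation) and "cell_type" in obs_columns


with_labels = resolve_augmentation(["cell_type", "batch"], True)
without_labels = resolve_augmentation(["batch"], True)
```

With an AnnData object, the column names would come from adata.obs.columns, so checking them before calling bond.augmentation tells you in advance whether augmentation will actually run.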
You can construct a scBOND model as follows:
bond.construct_model(chrom_list)
scBOND needs a list of peak counts for each chromosome; remember to sort the peaks by chromosome.

| Parameters | Description |
| --- | --- |
| chrom_list | a list of peak counts for each chromosome; remember to sort the peaks by chromosome. |
| logging_path | optional, the path for the model structure log; set it to None to skip logging, default None. |
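One way to obtain chrom_list is to start from the chromosome label of every feature, reorder the features so that each chromosome forms a contiguous block, and count the block sizes. The sketch below does this with the standard library only. The helper names and the chromosome ordering (numeric chromosomes first, then the rest alphabetically) are assumptions; the text above only requires that peaks are sorted by chromosome and that the counts match.

```python
from collections import Counter


def chrom_sort_key(chrom):
    """Order chr1, chr2, ..., chr10, ... before chrX, chrY (assumed order)."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def build_chrom_list(feature_chroms):
    """Return (order, chrom_names, chrom_list) for a list of chromosome labels.

    order: feature indices that group features by chromosome; sorted() is
           stable, so the original order within a chromosome is preserved.
    chrom_list: number of features per chromosome, in the same order.
    """
    order = sorted(
        range(len(feature_chroms)),
        key=lambda i: chrom_sort_key(feature_chroms[i]),
    )
    counts = Counter(feature_chroms)
    chrom_names = sorted(counts, key=chrom_sort_key)
    chrom_list = [counts[c] for c in chrom_names]
    return order, chrom_names, chrom_list


feature_chroms = ["chr2", "chr1", "chrX", "chr1", "chr10"]
order, chrom_names, chrom_list = build_chrom_list(feature_chroms)
```

The order list would be used to reorder the columns of the methylation matrix before preprocessing, and chrom_list is what bond.construct_model expects; the counts must sum to the number of features.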
The scBOND model can be trained as follows:
bond.train_model()
| Parameters | Description |
| --- | --- |
| output_path | optional, path for model checkpoints; if None, './model' is used, default None. |
| load_model | optional, the path of a pretrained model to load; set it to None to train from scratch, default None. |
| logging_path | optional, the path for the training log; set it to None to skip logging, default None. |

scBOND also supports refining the model structure and training process using other parameters for bond.construct_model() and bond.train_model().

Other parameters for model construction
| Parameter | Description |
| --- | --- |
| R_encoder_nlayer | Layer counts of RNA encoder, default 2 |
| M_encoder_nlayer | Layer counts of methylation data encoder, default 2 |
| R_decoder_nlayer | Layer counts of RNA decoder, default 2 |
| M_decoder_nlayer | Layer counts of methylation data decoder, default 2 |
| R_encoder_dim_list | Dimension list of RNA encoder, length equal to R_encoder_nlayer, default [256, 128] |
| M_encoder_dim_list | Dimension list of methylation data encoder, length equal to M_encoder_nlayer, default [32, 128] |
| R_decoder_dim_list | Dimension list of RNA decoder, length equal to R_decoder_nlayer, default [128, 256] |
| M_decoder_dim_list | Dimension list of methylation data decoder, length equal to M_decoder_nlayer, default [128, 32] |
| R_encoder_act_list | Activation list of RNA encoder, length equal to R_encoder_nlayer, default [nn.LeakyReLU(), nn.LeakyReLU()] |
| M_encoder_act_list | Activation list of methylation data encoder, length equal to M_encoder_nlayer, default [nn.LeakyReLU(), nn.LeakyReLU()] |
| R_decoder_act_list | Activation list of RNA decoder, length equal to R_decoder_nlayer, default [nn.LeakyReLU(), nn.LeakyReLU()] |
| M_decoder_act_list | Activation list of methylation data decoder, length equal to M_decoder_nlayer, default [nn.LeakyReLU(), nn.Sigmoid()] |
| translator_embed_dim | Dimension of embedding space for translator, default 128 |
| translator_input_dim_r | Dimension of input from RNA encoder for translator, default 128 |
| translator_input_dim_m | Dimension of input from methylation data encoder for translator, default 128 |
| translator_embed_act_list | Activation list for translator, involving [mean_activation, log_var_activation, decoder_activation], default [nn.LeakyReLU(), nn.LeakyReLU(), nn.LeakyReLU()] |
| discriminator_nlayer | Layer counts of discriminator, default 1 |
| discriminator_dim_list_R | Dimension list of discriminator for RNA, length equal to discriminator_nlayer, default [128] |
| discriminator_dim_list_M | Dimension list of discriminator for methylation data, length equal to discriminator_nlayer, default [128] |
| discriminator_act_list | Activation list of discriminator, length equal to discriminator_nlayer, default [nn.Sigmoid()] |
| dropout_rate | Rate of dropout for network, default 0.1 |
| R_noise_rate | Rate of setting part of RNA input data to 0, default 0.5 |
| M_noise_rate | Rate of setting part of methylation data input to 0, default 0.3 |
| num_experts | Number of experts for translator, default 6 |
| num_experts_single | Number of experts for single translator, default 6 |
| num_heads | Number of parallel attention heads, default 8 |
| attn_drop | Dropout probability applied to attention weights, default 0.1 |
| proj_drop | Dropout probability applied to the output projection, default 0.1 |

Other parameters for model training
| Parameter | Description |
| --- | --- |
| R_encoder_lr | Learning rate of RNA encoder, default 0.001 |
| M_encoder_lr | Learning rate of methylation data encoder, default 0.001 |
| R_decoder_lr | Learning rate of RNA decoder, default 0.001 |
| M_decoder_lr | Learning rate of methylation data decoder, default 0.001 |
| R_translator_lr | Learning rate of RNA pretrain translator, default 0.0001 |
| M_translator_lr | Learning rate of methylation data pretrain translator, default 0.0001 |
| translator_lr | Learning rate of translator, default 0.0001 |
| discriminator_lr | Learning rate of discriminator, default 0.005 |
| R2R_pretrain_epoch | Max epoch for pretrain RNA autoencoder, default 100 |
| M2M_pretrain_epoch | Max epoch for pretrain methylation data autoencoder, default 100 |
| lock_encoder_and_decoder | Lock the pretrained encoder and decoder or not, default False |
| translator_epoch | Max epoch for train translator, default 200 |
| patience | Patience for loss on validation, default 50 |
| batch_size | Batch size for training and validation, default 64 |
| r_loss | Loss function for RNA reconstruction, default nn.MSELoss(size_average=True) |
| m_loss | Loss function for methylation data reconstruction, default nn.BCELoss(size_average=True) |
| d_loss | Loss function for discriminator, default nn.BCELoss(size_average=True) |
| loss_weight | List of loss weight for [r_loss, a_loss, d_loss], default [1, 2, 1] |
| seed | Set up the random seed, default 19193 |
| kl_mean | Size average for kl divergence or not, default True |
| R_pretrain_kl_warmup | Epoch of linear weight warm up for kl divergence in RNA pretrain, default 50 |
| M_pretrain_kl_warmup | Epoch of linear weight warm up for kl divergence in methylation data pretrain, default 50 |
| translation_kl_warmup | Epoch of linear weight warm up for kl divergence in translator pretrain, default 50 |
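Two of the training parameters, the KL warm-up epochs and patience, describe schedules that are easy to picture in code. The sketch below shows one common form of each: a KL weight that rises linearly from 0 to 1 over the warm-up epochs, and early stopping once the validation loss has not improved for patience consecutive epochs. The function names are hypothetical and the exact rules used inside scBOND may differ.

```python
def kl_weight(epoch, warmup_epochs):
    """Linear KL warm-up: 0 at epoch 0, 1 from warmup_epochs onward."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)


def early_stop_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops when the validation loss has failed to improve on
    the best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return None


halfway = kl_weight(25, 50)
stop_at = early_stop_epoch([1.0, 0.9, 0.95, 0.97, 0.96], patience=3)
```

With the default warm-up of 50 epochs, the KL term contributes half its final weight at epoch 25 under this schedule. In the toy loss curve the best value occurs at epoch 1, so a patience of 3 stops training at epoch 4.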
scBOND provides a prediction API; you can get the predicted profiles as follows:
M2R_predict, R2M_predict = bond.test_model()
Several evaluation methods are also integrated into this function; you can enable them with the following parameters:

| Parameters | Description |
| --- | --- |
| output_path | optional, path for model evaluation output; if None, './model' is used, default None. |
| load_model | optional, whether to load a pretrained model, default False. |
| model_path | optional, the path of the pretrained model, only used if load_model is True, default None. |
| test_cluster | optional, whether to compute the clustering evaluation metrics, including AMI, ARI, HOM, NMI, default False. |
| test_figure | optional, whether to draw the tSNE visualization of the predictions, default False. |
| output_data | optional, whether to write the predictions to file; if True, they are written to output_path/A2R_predict.h5ad and output_path/R2A_predict.h5ad, default False. |
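For readers unfamiliar with the clustering metrics named above, the following sketch computes one of them, the adjusted Rand index (ARI), from scratch using the standard pair-counting formula. It only illustrates what the metric measures, assuming cluster assignments are compared against reference cell-type labels; it is not how scBOND computes it, and in practice a library routine such as scikit-learn's adjusted_rand_score is the usual choice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of cells that share both a true label and a predicted label.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())

    # Pairs that share a label within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (sum_cells - expected) / (max_index - expected)


perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
```

The first call scores 1 because the two labelings agree up to a renaming of clusters; the second scores 4/7, reflecting that one predicted cluster merges two reference groups.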
scBOND also provides a separate prediction API for single-modality prediction. You can predict the DNAm profile from the RNA profile as follows:
R2M_predict = bond.predict_single_modal(data_type='rna')
You can predict the RNA profile from the DNAm profile as follows:
M2R_predict = bond.predict_single_modal(data_type='met')
