scProTrans

A sequence knowledge-guided deep learning method for single-cell multi-omics translation (scProTrans)

Overview

Proteins, as the direct executors of cellular functions, are central to understanding cellular activity, disease mechanisms, and therapeutic strategies. Despite their importance, proteomics data remain scarce compared with the abundance of single-cell RNA sequencing (scRNA-seq) data, primarily because of experimental limitations and high costs. Advances in multi-omics sequencing technologies open a path between transcriptomics and proteomics: models trained on multi-omics datasets can translate scRNA-seq data into proteomics profiles, thereby constructing comprehensive multi-omics profiles. Here, we introduce ProTrans, a sequence knowledge-guided deep learning framework that bridges transcriptomics and proteomics by deciphering gene-protein relationships from CITE-seq datasets. ProTrans integrates gene, protein, and cell encoding to uncover cell-specific associations and enable zero-shot translation through sequence-to-embedding-to-profile learning. Extensive evaluations across 15 multi-omics datasets demonstrate that ProTrans surpasses state-of-the-art methods in proteomics translation and enhances downstream analyses, including cell clustering, subtype identification, and biomarker discovery. Additionally, ProTrans is extended to tri-omics scenarios by refactoring its encoders, demonstrating its flexibility and scalability. Notably, ProTrans not only elucidates cell-specific gene-protein relationships but also predicts protein profiles that are challenging to capture experimentally.

Requirements

anndata==0.11.3
h5py==3.10.0
numpy==2.2.2
pandas==2.2.3
scanpy==1.10.4
scikit_learn==1.4.2
scipy==1.15.1
scvi==0.6.8
torch==2.0.0

Datasets

All the original datasets can be downloaded from GEO under accession numbers GSE194122, GSE100866, GSE164378, GSE128639, GSE156473, GSE200417, GSE158013, and GSE96583. The pretrained gene and protein sequence embeddings are available at link1 and link2.

Usage

Detailed explanation of parameters

Parameter	Type	Default	Description
`data_dir`	`str`	`./data`	Directory containing the input data.
`out_dir`	`str`	`./result`	Directory to save the output results.
`mode`	`str`	`''`	Mode of translation (cell type name, batch name, 'zeroshot', 'all' or '').
`batch_size`	`int`	`48`	Number of samples per batch during training.
`epochs`	`int`	`200`	Total number of epochs to train the model.
`lr`	`float`	`0.001`	Learning rate for the optimizer.
`patience`	`int`	`50`	Number of epochs to wait for improvement before early stopping.
`seed`	`int`	`0`	Random seed for reproducibility.
`preprocessed`	`bool`	`False`	Whether to use preprocessed data.
`transpose`	`bool`	`False`	Whether to transpose the input data. If the columns of input data are cells, set transpose to True.
`attention`	`bool`	`False`	Whether to save the gene-protein relationship matrix.
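
The commands in this README pass boolean options as text (for example `--preprocessed True`). The following is a minimal sketch of how a parser with the defaults from the table above could handle that; it is an illustration, not the repository's actual argument handling, and `str2bool` and `build_parser` are hypothetical names.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False'-style strings such as the one in `--preprocessed True`."""
    text = value.strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the parameter table; defaults copied from it.
    parser = argparse.ArgumentParser(description="ProTrans options (sketch)")
    parser.add_argument("--data_dir", type=str, default="./data")
    parser.add_argument("--out_dir", type=str, default="./result")
    parser.add_argument("--mode", type=str, default="")
    parser.add_argument("--batch_size", type=int, default=48)
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--patience", type=int, default=50)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--preprocessed", type=str2bool, default=False)
    parser.add_argument("--transpose", type=str2bool, default=False)
    parser.add_argument("--attention", type=str2bool, default=False)
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--data_dir", "../dataset/GSE164378", "--preprocessed", "True", "--mode", "Mono"]
)
print(args.preprocessed, args.mode, args.batch_size)
```

A custom converter is used because `type=bool` in argparse would turn any non-empty string, including `False`, into `True`.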

Users set the mode parameter to choose the translation scenario. With a cell type name (e.g. Mono), ProTrans uses the Mono cells as the test set and all other cells as the training set. With a batch name (e.g. Batch1), ProTrans uses the cells belonging to Batch1 as the test set and all other cells as the training set. With 'zeroshot', ProTrans randomly divides the proteins into training and test sets at a 6:4 ratio. With 'all', ProTrans uses all cells to train the model. With '' (the default), ProTrans randomly divides the cells into training and test sets at a 6:4 ratio.
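
The split rules for each mode can be sketched as follows. This is an illustrative reading of the description above, not the code in ProTrans.py: `random_split` and `split_for_mode` are hypothetical helpers, the `annotations` mapping is an assumed representation of annotation.csv, and the choice to keep all cells on both sides in the 'zeroshot' case is an assumption.

```python
import random


def random_split(items, seed=0, train_fraction=0.6):
    """Shuffle items reproducibly and cut them into (train, test), 6:4 by default."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def split_for_mode(cells, proteins, annotations=None, mode="", seed=0):
    """Return (train_cells, test_cells, train_proteins, test_proteins).

    `annotations` maps each cell ID to its labels, e.g. {"c1": {"Mono", "Batch1"}}.
    """
    cells, proteins = list(cells), list(proteins)
    if mode == "all":  # train on every cell, nothing held out
        return cells, [], proteins, proteins
    if mode == "":  # random 6:4 split over cells
        train, test = random_split(cells, seed)
        return train, test, proteins, proteins
    if mode == "zeroshot":  # random 6:4 split over proteins (cells kept: assumption)
        train_p, test_p = random_split(proteins, seed)
        return cells, cells, train_p, test_p
    if annotations is None:
        raise ValueError("a cell type or batch mode needs annotations")
    # Otherwise treat mode as a cell type or batch name to hold out.
    test = [c for c in cells if mode in annotations.get(c, ())]
    train = [c for c in cells if mode not in annotations.get(c, ())]
    return train, test, proteins, proteins


cells = [f"c{i}" for i in range(10)]
proteins = [f"p{i}" for i in range(5)]
train_cells, test_cells, _, _ = split_for_mode(cells, proteins, mode="")
print(len(train_cells), len(test_cells))
```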

Users need to provide rna.csv and protein.csv containing raw expression read counts to train or evaluate ProTrans. The annotation file is optional and is only needed for omics translation across cell types or batches. Taking GSE164378 as an example, the directory layout and the expected shape of each input file are as follows:

 |-- dataset
        |-- GSE164378
              |-- rna.csv           # (cell, gene)
              |-- protein.csv       # (cell, protein)
              |-- annotation.csv    # (cell, annotation)

All output files will be saved in out_dir.
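
Before training, it can help to confirm that the two input files describe the same cells. The sketch below reads only the row and column labels; it assumes each CSV has a header row and that its first column holds the row identifiers, which this README does not state. `read_axes` is a hypothetical helper, and the inline strings stand in for real files.

```python
import csv
import io


def read_axes(handle, transpose=False):
    """Return (cell_ids, feature_names) from a CSV with a header row and an ID column.

    With transpose=True the columns are treated as cells, matching the
    `transpose` parameter described above.
    """
    reader = csv.reader(handle)
    header = next(reader)
    col_names = header[1:]  # the first header cell labels the ID column
    row_names = [row[0] for row in reader if row]
    return (col_names, row_names) if transpose else (row_names, col_names)


# Tiny in-memory stand-ins for rna.csv and protein.csv (made-up values).
rna = io.StringIO("cell,CD3E,CD4\nc1,5,0\nc2,0,3\n")
protein = io.StringIO("cell,CD3,CD4\nc1,120,8\nc2,4,95\n")

rna_cells, genes = read_axes(rna)
protein_cells, proteins = read_axes(protein)

protein_cell_set = set(protein_cells)
shared_cells = [c for c in rna_cells if c in protein_cell_set]
print(len(shared_cells), "cells present in both files")
```

With real data, replace the `io.StringIO` objects with `open(path, newline="")` handles.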

The proteomics translation for intra-datasets

Run with raw RNA and protein expression files

python ProTrans.py --data_dir ../dataset/GSE164378 --out_dir ./result

Run with preprocessed RNA and protein expression files

python ProTrans.py --data_dir ../dataset/GSE164378 --out_dir ./result --preprocessed True

The proteomics translation across cell types

python ProTrans.py --data_dir ../dataset/GSE164378 --out_dir ./result --preprocessed True --mode Mono

The proteomics translation across batches

python ProTrans.py --data_dir ../dataset/GSE164378 --out_dir ./result --preprocessed True --mode Batch1

The proteomics translation while saving gene-protein relationship

python ProTrans.py --data_dir ../dataset/GSE164378 --out_dir ./result --preprocessed True --mode all --attention True

The proteomics translation with zeroshot mechanism

python ProTrans.py --data_dir ../dataset/GSE164378 --out_dir ./result-zeroshot --preprocessed True --mode zeroshot

The proteomics translation across technologies

Taking GSE200417 as an example, the directory and specific instructions for input files are as follows:

 |-- dataset
        |-- GSE200417
            |-- CITE
                |-- rna.csv          # (cell, gene)
                |-- protein.csv      # (cell, protein)      
            |-- DOGMA
                |-- rna.csv          # (cell, gene)
                |-- protein.csv      # (cell, protein)

Run the command as follows:

python ProTrans-technology.py --data_dir ../dataset/GSE200417 --out_dir ./result

Extending ProTrans to tri-omics translation

Taking GSM5123953 as an example, users first follow gen_atac.ipynb to convert the h5 file into atac.csv (cell, peak). Next, follow atac2seq.ipynb to extract the sequences corresponding to the peaks, then follow seq2emb.ipynb to generate atac_emb.npz, which is used in the translation process. In addition, users need to unzip dataset/dna2vec/pre-trained DNA-8mers.7z to obtain pre-trained DNA-8mers.txt.
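
To give a feel for how a peak sequence can be turned into a fixed-length vector from pretrained k-mer embeddings, here is a minimal sketch that averages the vectors of all overlapping k-mers in a sequence. Averaging is an assumption made for illustration; seq2emb.ipynb may pool differently. `kmers` and `embed_sequence` are hypothetical helpers, and the toy 2-mer vectors below are made up.

```python
def kmers(sequence, k=8):
    """Return all overlapping k-mers of a DNA sequence, upper-cased."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed_sequence(sequence, vectors, k=8):
    """Average the vectors of the k-mers found in `vectors`.

    `vectors` maps a k-mer string to a list of floats. K-mers without a
    vector (for example those containing N) are skipped. Returns None when
    no k-mer of the sequence has a vector.
    """
    hits = [vectors[m] for m in kmers(sequence, k) if m in vectors]
    if not hits:
        return None
    dim = len(hits[0])
    return [sum(v[d] for v in hits) / len(hits) for d in range(dim)]


# Toy example with 2-mers and 2-dimensional vectors instead of real 8-mer embeddings.
toy_vectors = {"AC": [1.0, 0.0], "GT": [0.0, 1.0]}
peak_embedding = embed_sequence("acgt", toy_vectors, k=2)
print(peak_embedding)
```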

The directory and specific instructions for input files are as follows:

 |-- dataset
        |-- GSM5123953
            |-- ATAC-ADT
                |-- atac.csv          # (cell, peak)
                |-- protein.csv       # (cell, protein)      
            |-- ATAC-RNA
                |-- atac.csv          # (cell, peak)
                |-- rna.csv           # (cell, gene)

The proteomics translation based on epigenomics

python ProTrans-ATAC-ADT.py --data_dir ../dataset/GSM5123953 --out_dir ./result

The transcriptomics translation based on epigenomics

python ProTrans-ATAC-RNA.py --data_dir ../dataset/GSM5123953 --out_dir ./result

License

This project is licensed under the MIT License.
