CVAE-RNA-seq

GitHub repository for "Conditional Variational Autoencoder-based Generative Model for Gene Expression Data Augmentation" | Paper | Code

Overview

Gene expression data are used in many kinds of studies, including the prediction of disease prognosis. Collecting enough data is difficult, however, because of cost constraints. In this paper, we propose a gene expression data generation model based on a Conditional Variational Autoencoder (CVAE). Our results show that the proposed model generates higher-quality synthetic data than two other state-of-the-art approaches to gene expression data generation: a model based on the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP), and the structured data generation models CTGAN and TVAE.
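As a rough illustration of what a conditional VAE looks like in PyTorch, here is a minimal sketch. It is not the architecture in model.py: the class name `CVAE`, the function `cvae_loss`, the layer sizes, the latent dimension, and the choice of a one-hot tissue label as the condition are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CVAE(nn.Module):
    """Toy conditional VAE: encoder and decoder both receive the condition vector."""

    def __init__(self, n_genes=969, n_conditions=15, hidden=256, latent=32):
        super().__init__()
        self.latent = latent
        self.encoder = nn.Sequential(
            nn.Linear(n_genes + n_conditions, hidden), nn.ReLU()
        )
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(
            nn.Linear(latent + n_conditions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_genes),
        )

    def forward(self, x, c):
        h = self.encoder(torch.cat([x, c], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(torch.cat([z, c], dim=1)), mu, logvar

    @torch.no_grad()
    def generate(self, c):
        # Sample z from the prior and decode it together with the condition
        z = torch.randn(c.size(0), self.latent)
        return self.decoder(torch.cat([z, c], dim=1))


def cvae_loss(x_hat, x, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior,
    # averaged over the batch
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0)


# Usage on random stand-in data (hypothetical, not the real dataset)
torch.manual_seed(0)
model = CVAE()
x = torch.randn(8, 969)
c = F.one_hot(torch.randint(0, 15, (8,)), num_classes=15).float()
x_hat, mu, logvar = model(x, c)
loss = cvae_loss(x_hat, x, mu, logvar)
synthetic = model.generate(c)
```

Once trained, a model of this kind generates new samples by drawing a latent vector from the prior and decoding it with the desired condition, which is what `generate` does here.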

Simple Result

Test set: 2,745 samples, 969 L1000 landmark genes.
- Gamma score: 0.98
Comparison with [Ramon Viñas, Helena Andrés-Terré, Pietro Liò, Kevin Bryson, Adversarial generation of gene expression data, Bioinformatics, Volume 38, Issue 3, February 2022, Pages 730–737]:
- Gamma score: 0.96
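The README does not define the Gamma score. One plausible reading, assumed here, is the coefficient used in the cited Viñas et al. paper: the Pearson correlation between the upper-triangular entries of the gene-gene correlation matrices of the real and the synthetic data. The sketch below follows that assumption; the function names `upper_triangle` and `gamma_score` are hypothetical, and evaluation.ipynb may compute the metric differently.

```python
import numpy as np


def upper_triangle(matrix):
    """Return the entries strictly above the diagonal as a flat array."""
    rows, cols = np.triu_indices_from(matrix, k=1)
    return matrix[rows, cols]


def gamma_score(real, synthetic):
    """Pearson correlation between gene-gene correlation structures.

    Both inputs have shape (n_samples, n_genes). This is an assumed
    definition, not necessarily the one used in this repository.
    """
    corr_real = np.corrcoef(real, rowvar=False)
    corr_syn = np.corrcoef(synthetic, rowvar=False)
    return float(
        np.corrcoef(upper_triangle(corr_real), upper_triangle(corr_syn))[0, 1]
    )


# Usage on random stand-in data with correlated "genes"
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))
noisy_copy = real + rng.normal(scale=0.1, size=real.shape)
score_same = gamma_score(real, real)
score_noisy = gamma_score(real, noisy_copy)
```

Under this definition a score close to 1 means the synthetic data reproduce the pairwise gene relationships of the real data.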

Dataset

In this study, we used GTEx and TCGA samples from 15 common tissues (lung, breast, kidney, thyroid, colon, stomach, prostate, salivary gland, liver, esophagus muscularis, esophagus mucosa, esophagus gastroesophageal junction, bladder, uterus, and cervix). We followed the pipeline described by Wang et al. (2018) to integrate the data and correct for batch effects. We then selected the 969 genes shared with the L1000 landmark gene set, yielding a dataset of 9,146 samples and 969 genes.

- GTEx (Genotype-Tissue Expression) Dataset
- TCGA (The Cancer Genome Atlas) Dataset
- L1000 landmark genes
- RNA-seq (human transcriptomics) Dataset (9,147 samples and 18,154 genes)

Install dependencies

torch >= 1.12.1
python >= 3.7
Python packages
- umap-learn >= 0.5.3
- scikit-learn >= 1.1.1

Usage

The 969 landmark genes were preprocessed with a log2(expression_value + 1) transform followed by standardization. You can download sample data for training and testing from the Google Drive link below.

npy_data - Google Drive
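The preprocessing described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the repository's code: the function name `preprocess` is hypothetical, and the README does not say whether standardization is done per gene or whether the statistics come from the training split only, so both choices here are guesses.

```python
import numpy as np


def preprocess(train_expr, test_expr):
    """Apply log2(x + 1), then standardize each gene (column).

    Assumption: the mean and standard deviation are estimated on the
    training split and reused for the test split.
    """
    train_log = np.log2(train_expr + 1.0)
    test_log = np.log2(test_expr + 1.0)
    mean = train_log.mean(axis=0)
    std = train_log.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant genes
    return (train_log - mean) / std, (test_log - mean) / std


# Usage on random non-negative stand-in values (not the real data)
rng = np.random.default_rng(0)
train_raw = rng.gamma(shape=2.0, scale=50.0, size=(100, 969))
test_raw = rng.gamma(shape=2.0, scale=50.0, size=(20, 969))
train_x, test_x = preprocess(train_raw, test_raw)
```

If the downloaded .npy files are already preprocessed, this step is not needed before training.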

Model Train

python train.py

Evaluation Notebook

Please check the evaluation.ipynb file.

Contact

If you have any questions or problems, please send an email to sanseng@mju.ac.kr.
