
Hyunsu Bong, Minsik Oh. "Conditional Variational Autoencoder-Based Generative Model for Gene Expression Data Augmentation" (Journal of Broadcast Engineering 2023)

HyunSBong/CVAE-RNA-seq

CVAE-RNA-seq

Code for "Conditional Variational Autoencoder-based Generative Model for Gene Expression Data Augmentation" | Paper | Code

Overview

Gene expression data can be used in many kinds of studies, including the prediction of disease prognosis. However, collecting enough data is difficult because of cost constraints. In this paper, we propose a gene expression data generation model based on a Conditional Variational Autoencoder (CVAE). Our results show that the proposed model generates higher-quality synthetic data than two other state-of-the-art approaches to gene expression data generation: a model based on the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP), and the structured data generation models CTGAN and TVAE.
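For readers unfamiliar with CVAEs, the training objective usually takes the following form. This is the standard conditional evidence lower bound, not a formula taken from this repository. The symbols are assumptions for illustration: x is an expression profile, c is a condition label (for example, a tissue type), z is the latent variable, and in many implementations the prior p(z | c) is simplified to a standard normal.

```latex
\log p_\theta(x \mid c) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, c)}\bigl[\log p_\theta(x \mid z, c)\bigr]
\;-\; D_{\mathrm{KL}}\bigl(q_\phi(z \mid x, c)\,\|\,p(z \mid c)\bigr)
```

Under this formulation, synthetic samples for a given condition are obtained by drawing z from the prior and decoding it together with c.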

Simple Result

  • Test set: 2,745 samples, 969 L1000 landmark genes.

    • Gamma score: 0.98

  • Comparison using the dataset from [Ramon Viñas, Helena Andrés-Terré, Pietro Liò, Kevin Bryson, Adversarial generation of gene expression data, Bioinformatics, Volume 38, Issue 3, February 2022, Pages 730–737]

    • Gamma score: 0.96
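The gamma score is not defined in this README. The sketch below assumes it is the Pearson correlation between the upper-triangular entries of the gene-gene correlation matrices of the real and synthetic data, which is how a score of this kind is commonly computed in related work. The function name `gamma_score` and the toy data are hypothetical; see evaluation.ipynb for the computation actually used.

```python
import numpy as np


def gamma_score(real, synthetic):
    """Assumed definition: Pearson correlation between the pairwise
    gene-gene correlation structures of two (samples x genes) matrices."""
    real_corr = np.corrcoef(real, rowvar=False)      # genes x genes
    syn_corr = np.corrcoef(synthetic, rowvar=False)  # genes x genes
    # Use only the strict upper triangle: the diagonal is always 1 and
    # the lower triangle duplicates the upper one.
    iu = np.triu_indices(real_corr.shape[0], k=1)
    return float(np.corrcoef(real_corr[iu], syn_corr[iu])[0, 1])


# Toy example: two independent draws that share one correlation structure.
rng = np.random.default_rng(0)
mix = rng.normal(size=(5, 5))                  # hypothetical gene mixing matrix
real = rng.normal(size=(200, 5)) @ mix
synthetic = rng.normal(size=(200, 5)) @ mix

score = gamma_score(real, synthetic)
self_score = gamma_score(real, real)
```

A score near 1 means the synthetic data reproduces the pairwise gene relationships of the real data; comparing a dataset with itself gives exactly that upper bound.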

Dataset

In this study, we used GTEx and TCGA samples from 15 common tissues (lung, breast, kidney, thyroid, colon, stomach, prostate, salivary gland, liver, esophagus muscularis, esophagus mucosa, esophagus gastroesophageal junction, bladder, uterus, and cervix). We followed the pipeline described by Wang et al. (2018) to integrate the data and correct for batch effects. We then selected the 969 genes shared with the L1000 landmark gene set, producing a dataset of 9,146 samples and 969 genes.

  • GTEx (Genotype-Tissue Expression) dataset
  • TCGA (The Cancer Genome Atlas) dataset
  • L1000 landmark genes
  • RNA-seq (human transcriptomics) dataset (9147 samples and 18154 genes)

Install dependencies

  • torch >= 1.12.1
  • python >= 3.7
  • Python packages
    • umap-learn >= 0.5.3
    • scikit-learn >= 1.1.1

Usage

The 969 landmark genes were preprocessed with a log2(expression_value + 1) transformation followed by standardization. You can download sample data for training and testing from the Google Drive link below.
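The preprocessing described above can be sketched as follows. This is a minimal illustration, not the repository's own code: the function name `preprocess`, the per-gene standardization axis, the reuse of training statistics on test data, and the random toy matrix are all assumptions.

```python
import numpy as np


def preprocess(expr, mean=None, std=None, eps=1e-8):
    """Apply log2(x + 1), then standardize each gene (column).

    If mean/std are given (e.g. from the training split), they are
    reused so that test data is scaled consistently.
    """
    logged = np.log2(np.asarray(expr, dtype=np.float64) + 1.0)
    if mean is None:
        mean = logged.mean(axis=0)
    if std is None:
        std = logged.std(axis=0)
    scaled = (logged - mean) / (std + eps)  # eps guards against zero variance
    return scaled, mean, std


# Toy stand-in for a (samples x genes) expression matrix.
rng = np.random.default_rng(0)
toy_expr = rng.gamma(shape=2.0, scale=50.0, size=(6, 4))

train_x, mu, sigma = preprocess(toy_expr[:4])
test_x, _, _ = preprocess(toy_expr[4:], mean=mu, std=sigma)
```

Computing the statistics on the training split only is a common choice to avoid leaking test information; the downloadable .npy files may already have been preprocessed, in which case this step is unnecessary.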

npy_data - Google Drive

Model Train

python train.py

Evaluation Notebook

Please check the evaluation.ipynb file.

Contact

If you have any questions or problems, please send an email to sanseng@mju.ac.kr
