DeepHEM: A Novel Deep Domain-Adversarial Learning Framework for Identifying Human Essential MiRNAs

Overview

DeepHEM is a deep learning framework for identifying human essential miRNAs. This repository contains the source code, data, and experimental results for the paper "DeepHEM: A Novel Deep Domain-Adversarial Learning Framework for Identifying Human Essential MiRNAs". The framework uses domain adaptation to transfer knowledge from a well-studied species (e.g., mouse) to a less-studied species (e.g., human), enabling accurate identification of essential miRNAs even when labeled data are limited.

Project Structure

The repository is organized into the following main directories:

  • ablation studies/: Contains code and results for ablation experiments investigating the impact of distribution alignment loss, loss function combination, multi-modal fusion method, and feature representation module.
  • biological correlation analysis/: Includes scripts for analyzing the biological correlation of prediction results, such as trend boxplots and correlation matrix diagrams.
  • case study/: Contains code and data for in-depth case studies of specific human miRNAs.
  • comparative analysis experiment/: Includes implementations of comparison methods (DAN, DDC, TCA) and the proposed DeepHEM method.
  • data/: Contains raw and processed data used in the experiments.

Requirements

To run the code, you need the following dependencies:

  • Python 3.7+
  • PyTorch 1.7+
  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • scipy
  • openpyxl (Excel I/O)
  • seaborn (for plots in correlation analysis)

Installation

Clone the repository:

git clone https://github.com/SniperWong/DeepHEM.git
cd DeepHEM

Install the required dependencies:

pip install -r requirements.txt

Data Preparation

1. Dataset Description

The dataset consists of miRNA sequences and features from two species:

  • Source domain: Mouse (Mus musculus) miRNA data with known essentiality labels
  • Target domain: Human (Homo sapiens) miRNA data for essentiality identification

2. Data Files

All data files are located in the data/ directory:

  • source_data.xlsx: Mouse miRNA sequences
  • mmu_label.xlsx: Essentiality labels for mouse miRNAs
  • target_data.xlsx: Human miRNA sequences
  • mmu_feature.xlsx: Inherent features for mouse miRNAs
  • hsa_feature.xlsx: Inherent features for human miRNAs
  • mmu_mti_feature.xlsx: miRNA-target gene interaction features for mouse
  • hsa_mti_feature.xlsx: miRNA-target gene interaction features for human
  • Conservation score.xlsx: Conservation scores of human miRNAs
  • DSW score.xlsx: Disease spectrum width score (DSW) of human miRNAs

3. Preprocessing Steps

The preprocessing pipeline includes the following steps (already implemented inside the training scripts):

  1. Read miRNA sequences from Excel files using data_helper.read_mirna_sequences.
    • Expected columns: first column is ID, second column is sequence string.
  2. Extract sequence features using 3-mer frequency with sklearn.feature_extraction.text.CountVectorizer (k=3).
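
The 3-mer step above can be sketched as follows. This is a minimal illustration, not the repository's code: the two toy sequences, the column names, and the normalization from counts to frequencies are assumptions, and the in-memory DataFrame only stands in for whatever data_helper.read_mirna_sequences returns.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the two-column sheet (ID, sequence). The IDs and
# sequences here are made up for illustration.
df = pd.DataFrame(
    {
        "id": ["toy-miR-1", "toy-miR-2"],
        "sequence": ["UGAGGUAGUAGGUUGUAUAGUU", "UAGCAGCACGUAAAUAUUGGCG"],
    }
)

K = 3
# analyzer="char" with ngram_range=(K, K) counts overlapping K-mers;
# lowercase=False keeps the nucleotide letters exactly as written.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(K, K), lowercase=False)
counts = vectorizer.fit_transform(df["sequence"]).toarray()

# Assumption: turn raw counts into per-sequence frequencies (rows sum to 1).
kmer_freq = counts / counts.sum(axis=1, keepdims=True)

print(kmer_freq.shape)
```

A sequence of length n contributes n - 2 overlapping 3-mers, so each 22-nt toy sequence yields 20 counts before normalization.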

Model Training

1. Model Architecture

DeepHEM consists of the following key components:

  • Feature extractor: Transformer-based architecture for extracting representations from miRNA sequences, plus an MLP for the inherent and MTI-based (miRNA-target interaction) features
  • Classifier: Multi-layer perceptron (MLP) for function prediction
  • Domain discriminator: Adversarial network for domain adaptation
  • Feature fusion module: Concatenation for fusing different types of features
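
To make the four components above concrete, here is a rough PyTorch sketch of how they could fit together. It is not the authors' implementation: the class name ToyDeepHEM, every layer size, the use of separate MLPs per feature type, and the gradient reversal layer (a common way to train a domain discriminator adversarially) are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips and scales gradients on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient is returned for the scalar lambd.
        return -ctx.lambd * grad_output, None


class ToyDeepHEM(nn.Module):
    """Illustrative only: layer sizes and depths are assumptions."""

    def __init__(self, seq_dim=64, inh_dim=18, mti_dim=32, hidden=32):
        super().__init__()
        # Sequence branch: linear projection followed by one Transformer layer.
        self.seq_proj = nn.Linear(seq_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, dim_feedforward=64)
        self.seq_encoder = nn.TransformerEncoder(layer, num_layers=1)
        # MLP branches for the inherent and MTI features.
        self.inh_mlp = nn.Sequential(nn.Linear(inh_dim, hidden), nn.ReLU())
        self.mti_mlp = nn.Sequential(nn.Linear(mti_dim, hidden), nn.ReLU())
        # Both heads read the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.discriminator = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, seq, inh, mti, lambd=1.0):
        # Treat each k-mer vector as a length-1 sequence: (1, batch, hidden).
        s = self.seq_encoder(self.seq_proj(seq).unsqueeze(0)).squeeze(0)
        fused = torch.cat([s, self.inh_mlp(inh), self.mti_mlp(mti)], dim=1)
        label_logit = self.classifier(fused)
        domain_logit = self.discriminator(GradReverse.apply(fused, lambd))
        return label_logit, domain_logit


model = ToyDeepHEM()
seq, inh, mti = torch.randn(8, 64), torch.randn(8, 18), torch.randn(8, 32)
label_logit, domain_logit = model(seq, inh, mti)
print(label_logit.shape, domain_logit.shape)
```

With gradient reversal, the discriminator learns to tell mouse from human samples while the shared features are pushed to make that distinction harder, which is the usual intuition behind domain-adversarial training.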

2. Training Configuration

  • Optimizer: Adam with learning rate 0.001
  • Epochs: 100
  • Batch size: 64
  • Loss functions: Combination of classification loss, MMD (Maximum Mean Discrepancy) loss, and adversarial loss
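
As a hedged illustration of this configuration, the sketch below runs one Adam step (learning rate 0.001, batch size 64) on a classification loss plus an MMD term. The single Gaussian kernel, the bandwidth sigma, the weight MMD_WEIGHT, and the toy linear layers are assumptions; the adversarial term is left out to keep the example short.

```python
import torch
from torch import nn


def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise squared Euclidean distances, then RBF kernel values.
    d2 = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(dim=2)
    return torch.exp(-d2 / (2 * sigma ** 2))


def mmd_loss(source, target, sigma=1.0):
    # Biased estimate of squared MMD with a single Gaussian kernel.
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2 * k_st


torch.manual_seed(0)
extractor = nn.Linear(10, 16)   # toy feature extractor
classifier = nn.Linear(16, 1)   # toy classifier head
params = list(extractor.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

# One labeled source batch and one unlabeled target batch of 64 samples each.
x_src = torch.randn(64, 10)
y_src = torch.randint(0, 2, (64, 1)).float()
x_tgt = torch.randn(64, 10) + 1.0

f_src, f_tgt = extractor(x_src), extractor(x_tgt)
cls_loss = nn.functional.binary_cross_entropy_with_logits(classifier(f_src), y_src)
align_loss = mmd_loss(f_src, f_tgt)

MMD_WEIGHT = 1.0  # hypothetical trade-off weight
total = cls_loss + MMD_WEIGHT * align_loss  # adversarial term omitted here

optimizer.zero_grad()
total.backward()
optimizer.step()
```

The MMD term shrinks as the source and target feature distributions move closer together, so minimizing it alongside the classification loss encourages features that transfer across species.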

3. Running Procedure

The running procedure is: 1) load and preprocess the data, 2) train the model, 3) save the best weights, 4) generate predictions for all human miRNAs, and 5) compute biological correlations.

Evaluation

1. Evaluation Metrics

The model's performance is evaluated using the following metrics:

  • Pearson correlation: Measures the linear correlation (with its p-value) between predicted scores and conservation scores
  • Spearman correlation: Measures rank correlation between predicted scores and conservation scores
  • Kendall correlation: Non-parametric measure of rank correlation
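
The three coefficients can be computed with scipy.stats, as in the sketch below. The arrays are synthetic placeholders generated on the spot, not values from the repository's score files.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
conservation = rng.random(50)                       # placeholder biological scores
predicted = conservation + rng.normal(0, 0.1, 50)   # placeholder model outputs

# Each call returns the coefficient and its two-sided p-value.
pearson_r, pearson_p = stats.pearsonr(predicted, conservation)
spearman_rho, spearman_p = stats.spearmanr(predicted, conservation)
kendall_tau, kendall_p = stats.kendalltau(predicted, conservation)

for name, coef, p in [
    ("Pearson", pearson_r, pearson_p),
    ("Spearman", spearman_rho, spearman_p),
    ("Kendall", kendall_tau, kendall_p),
]:
    print(f"{name}: coefficient={coef:.3f}, p-value={p:.3g}")
```

Pearson captures linear association, while Spearman and Kendall depend only on ranks, so they are less sensitive to outliers and to nonlinear but monotonic relationships.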

2. Running Evaluation

After training, the model automatically evaluates its performance on the target domain data. The evaluation results include:

  • Predictions for all human miRNAs
  • Correlation analysis with biological scores (DSW and conservation scores)

3. Ablation Studies

To reproduce the ablation studies, navigate to the ablation studies/ subfolders and run the corresponding scripts. Note: some file and folder names contain Chinese (full-width) parentheses and spaces, so always wrap them in quotes on the command line, including on Windows.

  • Feature ablation (different feature combinations):

    cd "ablation studies/test--feature"
    python "main(only seq).py"
    python "main(only MTI).py"
    python "main(only 18).py"
    python "main(seq+MTI).py"
    python "main(seq+18).py"
    python "main(MTI+18) .py"
  • Fusion method ablation:

    cd "ablation studies/test--fusion"
    python "main(add).py"
    python "main(cross attention).py"
  • Kernel ablation (MMD variants):

    cd "ablation studies/test--kernel"
    python "main(MK-MMD).py"
    python "main(SK-MMD).py"
  • Loss function ablation:

    cd "ablation studies/test--loss"
    python "main(mmd+cl).py"
    python "main(adv+cl).py"

Comparative Analysis Experiment

This folder provides implementations and runnable entry points for the comparison methods evaluated alongside DeepHEM.

  • Directory: comparative analysis experiment/code/
    • DAN/ (Deep Adaptation Network baseline)
    • DDC/ (Deep Domain Confusion baseline)
    • TCA/ (Transfer Component Analysis baseline)

Run each baseline from the repository root:

    cd "comparative analysis experiment/code/DAN"
    python main.py

    cd "comparative analysis experiment/code/DDC"
    python main.py

    cd "comparative analysis experiment/code/TCA"
    python main.py

Required inputs (expected in the method directory you run from, or adjust paths in the scripts):

  • source_data.xlsx, source_label.xlsx, target_data.xlsx
  • mmu_feature.xlsx, hsa_feature.xlsx (18 handcrafted features)
  • mmu_feature_pca.xlsx, hsa_feature_pca.xlsx (MTI PCA features)
  • Biological correlation files for evaluation: hsa_dsw_score_1913.txt, dsw_name_1913.txt, hsa_conserv_score_1913.txt, conservation_name_1913.txt, pred_name_1913.txt

Biological Correlation Analysis

To analyze the biological relevance of the prediction results:

  1. Correlation matrix:

    cd "biological correlation analysis/plot correlation matrix"
    python "Drawing Correlation Matrix Diagram.py"
  2. Trend boxplots:

    cd "biological correlation analysis/draw trend boxplot"
    python "Drawing the trend boxplots.py"

Case Study

To run the case study analysis:

cd "case study"
python casestudy.py

Reproducibility

To ensure reproducibility of the results reported in the paper:

  1. For the biological correlation analysis, ablation studies, comparative analysis experiment, and case study, use the specific scripts and data provided in each subdirectory.
  2. Run the DeepHEM scripts with their default parameters.
