DeepHEM is a deep learning framework for identifying human essential miRNAs. This repository contains the source code, data, and experimental results for the paper "DeepHEM: A Novel Deep Domain-Adversarial Learning Framework for Identifying Human Essential MiRNAs". The framework uses domain adaptation to transfer knowledge from a well-studied species (e.g., mouse) to a less-studied species (e.g., human), so that essential miRNAs can be identified accurately even when labeled data are limited.
The repository is organized into the following main directories:
- ablation studies/: Contains code and results for ablation experiments investigating the impact of distribution alignment loss, loss function combination, multi-modal fusion method, and feature representation module.
- biological correlation analysis/: Includes scripts for analyzing the biological correlation of prediction results, such as trend boxplots and correlation matrix diagrams.
- case study/: Contains code and data for in-depth case studies of specific human miRNAs.
- comparative analysis experiment/: Includes implementations of comparison methods (DAN, DDC, TCA) and the proposed DeepHEM method.
- data/: Contains raw and processed data used in the experiments.
To run the code, you need the following dependencies:
- Python 3.7+
- PyTorch 1.7+
- pandas
- numpy
- scikit-learn
- matplotlib
- scipy
- openpyxl (Excel I/O)
- seaborn (for plots in correlation analysis)
Clone the repository:
git clone https://github.com/SniperWong/DeepHEM.git
cd DeepHEM
Install the required dependencies:
pip install -r requirements.txt
The dataset consists of miRNA sequences and features from two species:
- Source domain: Mouse (Mus musculus) miRNA data with known essentiality labels
- Target domain: Human (Homo sapiens) miRNA data for essentiality identification
All data files are located in the data/ directory:
- source_data.xlsx: Mouse miRNA sequences
- mmu_label.xlsx: Essentiality labels for mouse miRNAs
- target_data.xlsx: Human miRNA sequences
- mmu_feature.xlsx: Inherent features for mouse miRNAs
- hsa_feature.xlsx: Inherent features for human miRNAs
- mmu_mti_feature.xlsx: miRNA-target gene interaction features for mouse
- hsa_mti_feature.xlsx: miRNA-target gene interaction features for human
- Conservation score.xlsx: Conservation scores of human miRNAs
- DSW score.xlsx: Disease spectrum width (DSW) scores of human miRNAs
The preprocessing pipeline includes the following steps (already implemented inside the training scripts):
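- Read miRNA sequences from Excel files using data_helper.read_mirna_sequences.
  - Expected columns: the first column is the ID, the second column is the sequence string.
- Extract sequence features as 3-mer frequencies (k = 3) with sklearn.feature_extraction.text.CountVectorizer. Note that CountVectorizer has no k argument; character 3-mers are normally requested with analyzer="char" and ngram_range=(3, 3).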
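To make the 3-mer step concrete, the sketch below counts overlapping 3-mers with the standard library only. It is not the repository's code: the function name `kmer_frequencies`, the fixed A/C/G/U vocabulary, the T-to-U mapping, and the normalisation to relative frequencies are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; DNA-style "T" is mapped to "U" below


def kmer_frequencies(sequence, k=3):
    """Return relative frequencies of all 4**k k-mers in a fixed order."""
    seq = sequence.upper().replace("T", "U")
    # Fixed vocabulary so every miRNA yields a vector of the same length.
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:  # sequence shorter than k, or no valid k-mers
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


if __name__ == "__main__":
    features = kmer_frequencies("UGAGGUAGUAGGUUGUAUAGUU")  # arbitrary example sequence
    print(len(features))  # one entry per possible 3-mer
```

A fixed vocabulary keeps the feature dimension identical for mouse and human sequences, which matters when both domains feed the same network.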
DeepHEM consists of the following key components:
- Feature extractor: Transformer-based architecture for extracting representations from miRNA sequences; MLPs for the inherent and MTI-based miRNA features
- Classifier: Multi-layer perceptron (MLP) for function prediction
- Domain discriminator: Adversarial network for domain adaptation
- Feature fusion module: Concatenation for fusing different types of features
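The components above could be wired together roughly as follows. This is a minimal PyTorch sketch, not the repository's model: the names `GradientReversal` and `AdversarialFusionSketch`, every layer size, the token vocabulary, and the use of a gradient reversal layer to train the domain discriminator adversarially are assumptions made for illustration. Only the 18-dimensional inherent feature input is taken from this README.

```python
import torch
from torch import nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambd on the way back."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The second return value is the (non-existent) gradient for lambd.
        return -ctx.lambd * grad_output, None


class AdversarialFusionSketch(nn.Module):
    """Hypothetical wiring: three feature branches, concatenation, two heads."""

    def __init__(self, vocab_size=5, d_model=32, inherent_dim=18, mti_dim=16):
        super().__init__()
        # Sequence branch: token embedding followed by a small Transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=64)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # MLP branches for inherent and MTI features (sizes are placeholders).
        self.inherent_mlp = nn.Sequential(nn.Linear(inherent_dim, d_model), nn.ReLU())
        self.mti_mlp = nn.Sequential(nn.Linear(mti_dim, d_model), nn.ReLU())
        fused_dim = 3 * d_model
        self.classifier = nn.Sequential(nn.Linear(fused_dim, 32), nn.ReLU(), nn.Linear(32, 1))
        self.discriminator = nn.Sequential(nn.Linear(fused_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, tokens, inherent, mti, lambd=1.0):
        # tokens: (batch, length) integer-encoded nucleotides.
        h = self.embed(tokens).transpose(0, 1)      # (length, batch, d_model)
        seq_repr = self.encoder(h).mean(dim=0)      # (batch, d_model)
        # Fusion by concatenation of the three representations.
        fused = torch.cat([seq_repr, self.inherent_mlp(inherent), self.mti_mlp(mti)], dim=1)
        class_logit = self.classifier(fused)
        # The discriminator sees reversed gradients, pushing features to be domain-invariant.
        domain_logit = self.discriminator(GradientReversal.apply(fused, lambd))
        return class_logit, domain_logit
```

With this layout the classifier and the discriminator share one fused representation, so a single backward pass updates the feature extractor against the discriminator and for the classifier at the same time.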
Training uses the following settings:
- Optimizer: Adam with learning rate 0.001
- Epochs: 100
- Batch size: 64
- Loss functions: Combination of classification loss, MMD (Maximum Mean Discrepancy) loss, and adversarial loss
The training scripts 1) load and preprocess the data, 2) train the model, 3) save the best weights, 4) generate predictions for all human miRNAs, and 5) compute biological correlations.
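The MMD term listed among the loss functions measures how far apart the source (mouse) and target (human) feature distributions are. The NumPy sketch below shows a biased estimate of squared MMD with a Gaussian kernel, an equal-weight multi-kernel variant, and a weighted sum of the three loss terms. The function names, bandwidths, and loss weights are illustrative assumptions, not values from the paper or the repository.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def mmd2(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two feature batches."""
    k_ss = gaussian_kernel(source, source, bandwidth).mean()
    k_tt = gaussian_kernel(target, target, bandwidth).mean()
    k_st = gaussian_kernel(source, target, bandwidth).mean()
    return float(k_ss + k_tt - 2.0 * k_st)


def mk_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Multi-kernel variant: equal-weight average over several bandwidths."""
    return float(np.mean([mmd2(source, target, b) for b in bandwidths]))


def combined_loss(classification, mmd, adversarial, w_mmd=1.0, w_adv=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return classification + w_mmd * mmd + w_adv * adversarial
```

The estimate is close to zero when both batches come from the same distribution and grows as they drift apart, which is why minimising it aligns the two domains.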
The model's performance is evaluated using the following metrics:
- Pearson correlation: Measures linear correlation (reported with p-values) between predicted scores and conservation scores
- Spearman correlation: Measures rank correlation between predicted scores and conservation scores
- Kendall correlation: Non-parametric measure of rank correlation
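The three coefficients above can be computed with `scipy.stats`, which is already listed as a dependency. The helper below is a sketch under assumptions: the function name `correlation_report` and the returned dictionary layout are hypothetical, and the repository's evaluation code may organise this differently.

```python
from scipy import stats


def correlation_report(predicted, reference):
    """Return {method: (coefficient, p_value)} for three correlation measures."""
    pearson_r, pearson_p = stats.pearsonr(predicted, reference)
    spearman_r, spearman_p = stats.spearmanr(predicted, reference)
    kendall_t, kendall_p = stats.kendalltau(predicted, reference)
    return {
        "pearson": (float(pearson_r), float(pearson_p)),
        "spearman": (float(spearman_r), float(spearman_p)),
        "kendall": (float(kendall_t), float(kendall_p)),
    }


if __name__ == "__main__":
    # Toy numbers only; real inputs would be predicted scores and conservation or DSW scores.
    report = correlation_report([0.1, 0.4, 0.35, 0.8, 0.9], [1, 2, 3, 4, 5])
    for name, (coef, p_value) in report.items():
        print(f"{name}: coefficient={coef:.3f}, p={p_value:.3g}")
```

Pearson is sensitive to the scale of the scores, while Spearman and Kendall depend only on their ordering, so reporting all three shows whether an association is linear or merely monotonic.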
After training, the model automatically evaluates its performance on the target domain data. The evaluation results include:
- Predictions for all human miRNAs
- Correlation analysis with biological scores (DSW and conservation scores)
To reproduce the ablation studies, navigate to the ablation studies/ subfolders and run the corresponding scripts. Note: some filenames contain Chinese (full-width) parentheses and spaces; always wrap them in quotes on Windows.
- Feature ablation (different feature combinations):
  cd "ablation studies/test--feature"
  python "main(only seq).py"
  python "main(only MTI).py"
  python "main(only 18).py"
  python "main(seq+MTI).py"
  python "main(seq+18).py"
  python "main(MTI+18) .py"
- Fusion method ablation:
  cd "ablation studies/test--fusion"
  python "main(add).py"
  python "main(cross attention).py"
- Kernel ablation (MMD variants):
  cd "ablation studies/test--kernel"
  python "main(MK-MMD).py"
  python "main(SK-MMD).py"
- Loss function ablation:
  cd "ablation studies/test--loss"
  python "main(mmd+cl).py"
  python "main(adv+cl).py"
The comparative analysis experiment/ folder provides implementations and runnable entry points for the comparison methods evaluated alongside DeepHEM.
- Directory: comparative analysis experiment/code/
  - DAN/ (Deep Adaptation Network baseline)
  - DDC/ (Deep Domain Confusion baseline)
  - TCA/ (Transfer Component Analysis baseline)
cd "comparative analysis experiment/code/DAN"
python main.py
cd "comparative analysis experiment/code/DDC"
python main.py
cd "comparative analysis experiment/code/TCA"
python main.py
Required inputs (expected in the method directory you run from, or adjust paths in the scripts):
- source_data.xlsx, source_label.xlsx, target_data.xlsx
- mmu_feature.xlsx, hsa_feature.xlsx (18 handcrafted features)
- mmu_feature_pca.xlsx, hsa_feature_pca.xlsx (MTI PCA features)
- Biological correlation files for evaluation: hsa_dsw_score_1913.txt, dsw_name_1913.txt, hsa_conserv_score_1913.txt, conservation_name_1913.txt, pred_name_1913.txt
To analyze the biological relevance of the prediction results:
- Correlation matrix:
  cd "biological correlation analysis/plot correlation matrix"
  python "Drawing Correlation Matrix Diagram.py"
- Trend boxplots:
  cd "biological correlation analysis/draw trend boxplot"
  python "Drawing the trend boxplots.py"
To run the case study analysis:
cd "case study"
python casestudy.py
To ensure reproducibility of the results reported in the paper:
- For the biological correlation analysis, ablation studies, comparative analysis experiment, and case study, use the specific scripts and data provided in each subdirectory
- Run the DeepHEM scripts with default parameters