DeepARG-Attention: Interpretable ARG Identification in Metagenomics
License: MIT
Note: This repository serves as the implementation workspace for the undergraduate thesis research proposal: "Research on Deep Learning and Attention Mechanism-based Identification and Classification of Antibiotic Resistance Genes in Metagenomics".
🧬 Project Overview
Antimicrobial resistance (AMR) is a major global public health threat. Metagenomic sequencing allows antibiotic resistance genes (ARGs) to be detected directly in environmental samples. However, existing alignment-based tools (e.g., BLAST) and early deep learning models struggle to detect remote homologs (ARGs with low sequence similarity to known references), and they offer little insight into why a sequence is predicted to be resistant.
This project proposes a CNN-Attention Hybrid Architecture designed to:
- Surpass SOTA performance (DeepARG, HMD-ARG) on short metagenomic reads.
- Visualize active sites using Attention Mechanisms to explain why a gene is classified as resistant.
- Handle data imbalance effectively using advanced loss functions.

🏗️ Proposed Architecture
The model utilizes a "Hybrid" approach, combining the local feature extraction capabilities of Convolutional Neural Networks (CNNs) with the global context awareness of Self-Attention mechanisms.
DNA Sequence / One-hot → 1D-CNN Layers → ResNet Blocks → Multi-Head Self-Attention → Global Pooling → Fully Connected Layers → Output: ARG Class
Figure 1: Conceptual Architecture (Work in Progress)
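The pipeline in Figure 1 could be sketched in PyTorch roughly as follows. This is a minimal illustration, not the project's implementation: the names `one_hot_encode`, `ResBlock1d`, and `CNNAttentionClassifier`, the layer sizes, the kernel widths, the 150 bp input length, and the 10 output classes are all assumptions chosen for the example. Padding positions are not masked in the attention layer, which a real model would likely want to handle.

```python
import torch
import torch.nn as nn

NUCLEOTIDES = "ACGT"


def one_hot_encode(seq: str, length: int) -> torch.Tensor:
    """Encode a DNA string as a (4, length) tensor.

    Ambiguous bases (e.g. N) and padding positions stay all-zero.
    """
    x = torch.zeros(4, length)
    for i, base in enumerate(seq.upper()[:length]):
        j = NUCLEOTIDES.find(base)
        if j >= 0:
            x[j, i] = 1.0
    return x


class ResBlock1d(nn.Module):
    """Two 1D convolutions with a skip connection (hypothetical block design)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2  # keeps the sequence length unchanged
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.body(x))


class CNNAttentionClassifier(nn.Module):
    """1D-CNN -> ResNet blocks -> self-attention -> pooling -> FC layers."""

    def __init__(self, num_classes: int, channels: int = 64,
                 num_heads: int = 4, num_blocks: int = 2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv1d(4, channels, kernel_size=7, padding=3),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )
        self.res_blocks = nn.Sequential(
            *[ResBlock1d(channels) for _ in range(num_blocks)]
        )
        self.attention = nn.MultiheadAttention(
            channels, num_heads, batch_first=True
        )
        self.head = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(),
            nn.Linear(channels, num_classes),
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, 4, length)
        h = self.res_blocks(self.stem(x))   # (batch, channels, length)
        h = h.transpose(1, 2)               # (batch, length, channels)
        # attn_weights is averaged over heads: (batch, length, length)
        attn_out, attn_weights = self.attention(h, h, h, need_weights=True)
        pooled = attn_out.mean(dim=1)       # global average pooling
        return self.head(pooled), attn_weights


# Usage with two toy reads padded to 150 positions (values are illustrative).
model = CNNAttentionClassifier(num_classes=10)
model.eval()
batch = torch.stack([
    one_hot_encode("ACGTNACGT", 150),
    one_hot_encode("GGGTTTAAACCC", 150),
])
with torch.no_grad():
    logits, attn = model(batch)
print(logits.shape, attn.shape)  # torch.Size([2, 10]) torch.Size([2, 150, 150])
```

Returning the attention weights alongside the logits is one way to make the per-position weights available for the planned attention-map visualizations; each row of `attn` sums to 1 over the sequence positions.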
🎯 Key Features (Planned)
- End-to-End Learning: Direct input from DNA sequences (ACGT) without manual feature engineering.
- Attention Maps: Generate visualizations of high-weight regions corresponding to functional protein domains.
- Optimized for Short Reads: Robust classification for fragmented sequences (100-150 bp) common in NGS data.
- Strict Curation: Training on a rigorously clustered dataset (CD-HIT) to prevent data leakage from homologous sequences.

🛠️ Tech Stack
- Language: Python 3.9
- Deep Learning Framework: PyTorch
- Data Processing: Biopython, Pandas, NumPy
- Bioinformatics Tools: CD-HIT, DIAMOND, BLAST
- Visualization: Matplotlib, Seaborn (for Saliency Maps)

📅 Roadmap
This project follows a 16-week development lifecycle as outlined in the research proposal.
Phase 1: Preparation (Weeks 1-3)
- Literature review (DeepARG, TRAC, HMD-ARG).
- Environment setup.

Phase 2: Data Curation (Weeks 4-6)
- Integration of CARD and UniProt databases.
- Sequence clustering and dataset splitting (Train/Val/Test).

Phase 3: Development (Weeks 7-10)
- Implementation of the CNN-Attention backbone.
- Hyperparameter tuning using Bayesian optimization.

Phase 4: Evaluation (Weeks 11-13)
- Benchmarking against BLAST and DeepARG.
- Analysis of attention weights for interpretability.

Phase 5: Publication (Weeks 14-16)
- Thesis writing and code documentation.

📚 References
The development of this project is inspired by the following key works:
- Li, Y., et al. (2021). HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes.
- Arango-Argoty, G., et al. (2018). DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data.

Author: Chenhao Guo
Supervisor: Prof. Yu Li