ABILiLab/PIONEER

PIONEER: a structure-informed graph neural network for PE/PPE protein identification

The PE/PPE protein family of Mycobacterium tuberculosis plays a pivotal role in virulence, immune evasion, and host-pathogen interactions. However, the high GC content and repetitive sequences of PE/PPE genes have long hindered accurate gene identification and functional annotation. We propose PIONEER, a structure-aware deep learning framework based on an Equivariant Graph Neural Network (EGNN) that integrates embeddings from the pre-trained protein language model ESM-C with structural features. By representing proteins as residue-level graphs that encode both sequence semantics and 3D topological structure, PIONEER captures hierarchical geometric relationships within proteins. Comparative benchmarking shows that PIONEER outperforms 16 traditional machine learning algorithms and the existing deep learning model Digerati across multiple evaluation metrics, including accuracy, MCC, and F1 score. Ablation experiments confirm that ESM-C embeddings and secondary structure features are complementary, while t-SNE visualization reveals progressively better feature separability across model layers. Our results highlight the critical role of structural feature learning in identifying PE/PPE proteins. PIONEER not only improves classification accuracy but also provides a robust computational framework for functional annotation, analysis of pathogenicity mechanisms, and vaccine target discovery in Mycobacterium tuberculosis.

Data

The original dataset is in FASTA format, containing amino acid sequences of proteins. To obtain corresponding 3D structures in PDB format, we use ESMFold for structure prediction.
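Before structure prediction, the sequences have to be read out of the FASTA file. The following is a minimal sketch of that step in plain Python, not the code used in this repository. The function name `read_fasta` is hypothetical, and taking the first whitespace-delimited token of each header line as the record ID is an assumption.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from an open FASTA file."""
    record_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Assumption: the ID is the first token after '>'.
            header = line[1:].split()
            record_id = header[0] if header else ""
            chunks = []
        elif record_id is not None:
            # Sequence lines may be wrapped; collect them and join later.
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


# Small in-memory example standing in for a real FASTA file.
example = io.StringIO(">seq1 example protein\nMSFV\nTTQP\n>seq2\nMTLD\n")
records = list(read_fasta(example))
```

Each `(record_id, sequence)` pair can then be passed to ESMFold, with the record ID used to name the resulting PDB file.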

Installation

Environment

Anaconda
Python 3.8

Create Environment with Conda

First, download and install Anaconda (see https://www.anaconda.com/docs/getting-started/anaconda/install#linux-installer). Then create and activate the environment:

conda create -n PIONEER python=3.8
conda activate PIONEER

Dependency Installation

Install required Python packages:

pip install -r requirements.txt

Feature extraction and tool description

DSSP

Secondary structure features are extracted with DSSP. Set the DSSP path in PIONEER.py to the location of the DSSP executable on your system.
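If you are unsure where DSSP is installed, a small helper can look it up on your PATH. This is a minimal sketch and not part of PIONEER.py. The function name `find_dssp` is hypothetical, and the candidate names `mkdssp` and `dssp` are assumptions about how the executable is named on your system.

```python
import shutil
from typing import Optional, Sequence


def find_dssp(candidates: Sequence[str] = ("mkdssp", "dssp")) -> Optional[str]:
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


dssp_path = find_dssp()
if dssp_path is None:
    print("DSSP not found on PATH; set the path in PIONEER.py manually.")
else:
    print(f"DSSP found at: {dssp_path}")
```

The printed path is the value to put in PIONEER.py.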

ESM-C

We use the ESM-C 6B model from ESM (https://github.com/evolutionaryscale/esm) to extract sequence embeddings. This model is currently available only through the Forge API. Register an account at https://forge.evolutionaryscale.ai/ to obtain an API token.

Update the path in the code to match the location on your machine, and enter your own token. Example:

# Enter the API token you obtained from Forge.
token = '******'  # API token
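As an alternative to writing the token into the source file, it could be read from an environment variable so that it is never committed by accident. This is a sketch of that idea only, not what PIONEER.py does. The function name `load_forge_token` and the variable name `FORGE_TOKEN` are hypothetical.

```python
import os
from typing import Mapping


def load_forge_token(env: Mapping[str, str] = os.environ,
                     var: str = "FORGE_TOKEN") -> str:
    """Read the Forge API token from an environment mapping.

    FORGE_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var} environment variable to your Forge API token."
        )
    return token
```

With `export FORGE_TOKEN=...` set in the shell, `token = load_forge_token()` would replace the hard-coded assignment.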

Usage

For example:

python PIONEER.py --pdb test.pdb --ckpt /data/checkpoint.pth --output pre.csv
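To score several structures, the command above can be repeated once per PDB file. The sketch below assumes that PIONEER.py processes one PDB file per invocation, as in the example, and uses only the `--pdb`, `--ckpt`, and `--output` flags shown there. The helper names `build_command` and `run_batch`, and the one-CSV-per-structure naming, are hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(pdb: Path, ckpt: Path, out_dir: Path) -> List[str]:
    """Build the PIONEER.py command line for a single PDB file."""
    # Assumption: write one CSV per structure, named after the PDB file.
    output = out_dir / f"{pdb.stem}.csv"
    return [
        sys.executable, "PIONEER.py",
        "--pdb", str(pdb),
        "--ckpt", str(ckpt),
        "--output", str(output),
    ]


def run_batch(pdb_dir: Path, ckpt: Path, out_dir: Path) -> None:
    """Run PIONEER.py once for every *.pdb file in pdb_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdb in sorted(pdb_dir.glob("*.pdb")):
        # check=True stops the batch at the first failing run.
        subprocess.run(build_command(pdb, ckpt, out_dir), check=True)
```

For example, `run_batch(Path("structures"), Path("/data/checkpoint.pth"), Path("results"))` would, under these assumptions, write `results/<name>.csv` for each structure in a `structures` directory.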
