PIONEER: a structure-informed graph neural network for PE/PPE protein identification

The PE/PPE protein family of Mycobacterium tuberculosis plays a pivotal role in virulence, immune evasion, and host-pathogen interactions. However, the high GC content and repetitive sequences of the genes encoding these proteins have long hindered accurate gene identification and functional annotation. We propose PIONEER, a structure-aware deep learning framework based on an Equivariant Graph Neural Network (EGNN) that integrates embeddings from the pre-trained protein language model ESM-C with structural features. By representing proteins as residue-level graphs encoding both sequence semantics and 3D topological structure, PIONEER captures hierarchical geometric relationships within proteins. Comparative benchmarking shows that PIONEER outperforms 16 traditional machine learning algorithms and the existing deep learning model Digerati across multiple evaluation metrics, including accuracy, MCC, and F1 score. Ablation experiments confirm the complementary importance of ESM-C embeddings and secondary structure features, while t-SNE-based visual analysis reveals progressively enhanced feature separability across model layers. Our results highlight the critical role of structural feature learning in identifying PE/PPE proteins. PIONEER not only improves classification accuracy but also provides a robust computational framework for functional annotation, analysis of pathogenicity mechanisms, and vaccine target discovery in Mycobacterium tuberculosis.
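To make the residue-level graph idea concrete, the sketch below reads C-alpha coordinates from PDB text and connects residues that lie close together in space. It is only an illustration: the function names are hypothetical, and the 10 angstrom cutoff is an assumption, since the graph construction actually used by PIONEER is not described here.

```python
import math


def read_ca_coords(pdb_text):
    """Return (x, y, z) for each C-alpha ATOM record, in file order."""
    coords = []
    for line in pdb_text.splitlines():
        # Fixed-column PDB format: atom name in columns 13-16,
        # coordinates in columns 31-38, 39-46 and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def build_edges(coords, cutoff=10.0):
    """Connect residue pairs whose C-alpha atoms are within `cutoff` angstroms.

    The cutoff value is an assumption made for this sketch.
    Edges are returned in both directions, as most GNN libraries expect.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
                edges.append((j, i))
    return edges
```

In a full pipeline each node would additionally carry its per-residue features (for example the language model embedding and secondary structure state), while the coordinates feed the equivariant layers.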

Data

The original dataset is in FASTA format, containing amino acid sequences of proteins. To obtain corresponding 3D structures in PDB format, we use ESMFold for structure prediction.
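As a rough illustration of the first step, the following sketch reads a FASTA file into a dictionary of sequences using only the standard library. The function name is hypothetical and this is not the repository's own loader; each sequence returned here would then be passed to ESMFold to predict a PDB structure.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}.

    The record id is taken as the first whitespace-separated token
    of the header line (an assumption made for this sketch).
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            header = fields[0] if fields else ""
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record.
    return {name: "".join(parts) for name, parts in records.items()}
```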

Installation

Environment

Anaconda
Python 3.8

Create Environment with Conda

First, download and install Anaconda (see https://www.anaconda.com/docs/getting-started/anaconda/install#linux-installer). Then create and activate the environment:

conda create -n PIONEER python=3.8
conda activate PIONEER

Dependency Installation

Install required Python packages:

pip install -r requirements.txt

Feature extraction and tool description

DSSP

Secondary structure features are extracted with DSSP. Before running, set the DSSP path in PIONEER.py to the location of the DSSP executable on your system.
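One common way to turn DSSP output into node features is a one-hot encoding of the per-residue secondary structure state. The sketch below assumes the eight classical DSSP codes and treats any other character as coil. The names are hypothetical, and the encoding used in PIONEER's own feature code may differ.

```python
# The eight classical DSSP states; "-" stands for coil / no assignment.
DSSP_STATES = "HBEGITS-"


def one_hot_secondary_structure(ss_string):
    """Encode a DSSP state string as a list of 8-dimensional one-hot rows.

    Characters outside DSSP_STATES (for example a blank) are mapped
    to the coil state, which is an assumption made for this sketch.
    """
    rows = []
    for code in ss_string:
        if code not in DSSP_STATES:
            code = "-"
        rows.append([1 if code == state else 0 for state in DSSP_STATES])
    return rows
```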

ESM-C

We use the ESM-C 6B model provided by ESM (https://github.com/evolutionaryscale/esm) to extract sequence embeddings. This model is currently available only through the Forge API, so go to https://forge.evolutionaryscale.ai/ and register an account to obtain an API token.

Update the paths in the code to match your local setup, and replace the placeholder token with the API token you obtained. Example:

# Enter the token you obtained from Forge.
token = '******'  # API Token
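If you prefer not to store the token in a source file, one option is to read it from an environment variable. This is a minimal sketch and not part of the repository: the function name and the variable name FORGE_TOKEN are assumptions chosen for illustration.

```python
import os


def get_forge_token(env_var="FORGE_TOKEN"):
    """Return the Forge API token stored in an environment variable.

    `FORGE_TOKEN` is a hypothetical variable name used for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Forge API token."
        )
    return token
```

With this in place, the token line could read `token = get_forge_token()` after running `export FORGE_TOKEN=...` in the shell.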

Usage

Run prediction by passing a PDB structure, a model checkpoint, and an output file. For example:

python PIONEER.py --pdb test.pdb --ckpt /data/checkpoint.pth --output pre.csv
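To score several structures, the same command can be repeated once per PDB file. The helper below is a hypothetical sketch, not a script shipped with the repository. It uses only the three flags shown above and assumes that PIONEER.py processes one PDB file per invocation.

```python
from pathlib import Path


def build_commands(pdb_dir, ckpt, out_dir):
    """Build one PIONEER.py command per .pdb file found in `pdb_dir`.

    Each command writes its predictions to <out_dir>/<pdb name>.csv.
    """
    commands = []
    for pdb_path in sorted(Path(pdb_dir).glob("*.pdb")):
        output = Path(out_dir) / f"{pdb_path.stem}.csv"
        commands.append([
            "python", "PIONEER.py",
            "--pdb", str(pdb_path),
            "--ckpt", str(ckpt),
            "--output", str(output),
        ])
    return commands


# Example use (runs the commands one after another):
#   import subprocess
#   for cmd in build_commands("structures", "/data/checkpoint.pth", "results"):
#       subprocess.run(cmd, check=True)
```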
