The PE/PPE protein family of Mycobacterium tuberculosis plays a pivotal role in virulence, immune evasion, and host-pathogen interactions. However, the high GC content and repetitive sequences of the genes encoding these proteins have long hindered accurate gene identification and functional annotation. We propose PIONEER, a structure-aware deep learning framework based on an Equivariant Graph Neural Network (EGNN) that integrates embeddings from the pre-trained protein language model ESMC with structural features. By representing each protein as a residue-level graph that encodes both sequence semantics and 3D topology, PIONEER captures hierarchical geometric relationships within proteins. Comparative benchmarking shows that PIONEER outperforms 16 traditional machine learning algorithms and the existing deep learning model Digerati across multiple evaluation metrics, including accuracy, Matthews correlation coefficient (MCC), and F1 score. Ablation experiments confirm that ESMC embeddings and secondary structure features are complementary, and t-SNE visualization shows feature separability improving progressively across model layers. These results highlight the critical role of structural feature learning in identifying PE/PPE proteins. PIONEER not only improves classification accuracy but also provides a robust computational framework for functional annotation, analysis of pathogenicity mechanisms, and vaccine target discovery in Mycobacterium tuberculosis.
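To make the residue-level graph idea concrete, here is a minimal sketch that turns a PDB structure into nodes (one per C-alpha atom) and edges (pairs of residues within a distance cutoff). The function names and the 10 Å cutoff are illustrative assumptions; PIONEER's actual graph construction may use different atoms, cutoffs, or edge features.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def read_ca_coords(pdb_text: str) -> List[Coord]:
    """Collect C-alpha coordinates from the ATOM records of a PDB string."""
    coords = []
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, x/y/z in columns 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def build_edges(coords: List[Coord], cutoff: float = 10.0) -> List[Tuple[int, int]]:
    """Return undirected residue pairs (i, j), i < j, within `cutoff` angstroms.

    The 10 A default is an assumption for illustration only.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges
```

Each residue index can then carry per-residue features (for example, a language-model embedding and secondary structure codes) as node attributes.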
The original dataset is in FASTA format, containing amino acid sequences of proteins. To obtain corresponding 3D structures in PDB format, we use ESMFold for structure prediction.
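The FASTA-to-PDB step can be sketched as below. The helper names `read_fasta` and `fold_fasta` are hypothetical and not part of PIONEER; the structure predictor is passed in as `fold_fn`, a function that maps a sequence to PDB text. Assuming the `esm` package with ESMFold is installed, `fold_fn` could be the `infer_pdb` method of the model returned by `esm.pretrained.esmfold_v1()`.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Tuple


def read_fasta(path) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                header = line[1:].split()
                # Use the first whitespace-delimited token as the identifier.
                name, chunks = (header[0] if header else "unnamed"), []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fold_fasta(fasta_path, out_dir, fold_fn: Callable[[str], str]) -> List[Path]:
    """Write one <identifier>.pdb file per FASTA record using `fold_fn`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, seq in read_fasta(fasta_path):
        pdb_path = out / f"{name}.pdb"
        pdb_path.write_text(fold_fn(seq))  # fold_fn returns PDB-formatted text
        written.append(pdb_path)
    return written
```

Keeping the predictor as a parameter means the same loop works with a local ESMFold model or any other tool that returns PDB text.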
Anaconda
Python 3.8
First, create the environment. Download and install Anaconda (see https://www.anaconda.com/docs/getting-started/anaconda/install#linux-installer), then run:
conda create -n PIONEER python=3.8
conda activate PIONEER
Install the required Python packages:
pip install -r requirements.txt
Secondary structure features are extracted with DSSP. Set the DSSP path in PIONEER.py to the location of your DSSP installation.
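As an illustration of how DSSP output can become per-residue features, the sketch below one-hot encodes a string of classic 8-state DSSP codes. The name `one_hot_dssp` and the encoding itself are assumptions for illustration; the feature encoding actually used in PIONEER.py may differ.

```python
from typing import List

# Classic DSSP 8-state alphabet; "-" stands for coil / no assigned structure.
DSSP_STATES = "HBEGITS-"


def one_hot_dssp(ss: str) -> List[List[int]]:
    """One-hot encode a DSSP secondary structure string, one row per residue."""
    rows = []
    for code in ss:
        # Blank and unrecognized codes are treated as coil (an assumption).
        if code not in DSSP_STATES:
            code = "-"
        row = [0] * len(DSSP_STATES)
        row[DSSP_STATES.index(code)] = 1
        rows.append(row)
    return rows
```

Each row has length 8 and can be concatenated with other per-residue features before being attached to the graph nodes.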
Sequence embeddings are extracted with the ESM-C 6B model provided by ESM (https://github.com/evolutionaryscale/esm), which is currently available only through the Forge API. Register an account at https://forge.evolutionaryscale.ai/ to obtain an API token.
Change any paths to match the actual location of the files you downloaded. Example:
# Enter the token you applied for.
token = '******'  # API token
For example:
python PIONEER.py --pdb test.pdb --ckpt /data/checkpoint.pth --output pre.csv