SpecGP is a transformer-based model for predicting energy-adaptable structural spectra of glycopeptides. The project was developed with Python 3.9.13, and a GPU-enabled environment is strongly recommended for faster computation. Key dependency versions are: matplotlib (3.9.2), numpy (1.26.3), openpyxl (3.1.5), pandas (2.1.4), pymzml (2.5.10), pythonnet (3.0.4), PyYAML (6.0.2), scikit-learn (0.24.2), scipy (1.13.1), statsmodels (0.13.2), torch (2.4.1+cu124), and tqdm (4.66.5). Install these packages with pip:
pip install matplotlib==3.9.2
pip install numpy==1.26.3
pip install openpyxl==3.1.5
pip install pandas==2.1.4
pip install pymzml==2.5.10
pip install pythonnet==3.0.4
pip install PyYAML==6.0.2
pip install scikit-learn==0.24.2
pip install scipy==1.13.1
pip install statsmodels==0.13.2
pip install torch==2.4.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install tqdm==4.66.5
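After installation, it can be useful to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the standard library; the helper name check_versions is hypothetical and not part of SpecGP.

```python
from importlib import metadata

# Pinned versions copied from the install list above.
PINNED = {
    "matplotlib": "3.9.2",
    "numpy": "1.26.3",
    "openpyxl": "3.1.5",
    "pandas": "2.1.4",
    "pymzml": "2.5.10",
    "pythonnet": "3.0.4",
    "PyYAML": "6.0.2",
    "scikit-learn": "0.24.2",
    "scipy": "1.13.1",
    "statsmodels": "0.13.2",
    "torch": "2.4.1+cu124",
    "tqdm": "4.66.5",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.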
In addition, compile the src/algorithm/floyd_alg.pyx file with the following commands:
cd ./src/algorithm
pip install cython # Required if not installed
python setup.py build_ext --inplace
The complete dataset and pre-trained weights are available on Google Drive.
Glycopeptide mass spectrometry raw files generated in our laboratory can be obtained from the ProteomeXchange repository under accession PXD025859.
All raw files must first be processed with StrucGP to obtain identification results, and each raw file must be kept in the same folder as its corresponding result.xlsx file. To indicate the fragmentation energy settings, append one of two suffixes to the folder name: tce.20.33 (multiple MS/MS spectra of a single precursor ion acquired at different energies; here, energy-33 spectra triggered at energy 20) or sce.20.30.40 (stepped collision energy spectra combining energy levels 20, 30 and 40).
Because SpecGP predicts spectra across different collision energies, users must explicitly specify the energies used; any other energy conditions should follow the naming conventions above.
Search results from other tools can also be used after converting them to the StrucGP result format.
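To make the folder naming convention concrete, here is a small sketch that reads the energy suffix from a folder name. The function parse_energy_suffix is hypothetical (SpecGP's own parsing code is not shown here), and it assumes the energies are written as integers.

```python
import re

# Matches a trailing "tce" or "sce" followed by one or more ".<integer>" parts.
_SUFFIX = re.compile(r"(tce|sce)((?:\.\d+)+)$")


def parse_energy_suffix(folder_name):
    """Split a folder name such as 'data_tce.20.33' into (mode, energies)."""
    match = _SUFFIX.search(folder_name)
    if match is None:
        raise ValueError(f"no tce/sce energy suffix in {folder_name!r}")
    mode = match.group(1)
    energies = [int(e) for e in match.group(2).strip(".").split(".")]
    return mode, energies
```

For example, parse_energy_suffix("data_tce.20.33") yields the mode "tce" with energies 20 and 33, while a folder name without a suffix raises ValueError, which can serve as an early check before running the pre-processing step.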
The entire process contains three steps:
- Data processing
- Model training
- Model inference and testing
The related code files are appropriately numbered for ease of use, such as python convert.py.
These scripts are executed in a command-line interface. Advanced users can also adapt these commands for other command-line interfaces.
Pre-processing: Integrate spectrum information from the raw files with the corresponding glycopeptide structural data in result.xlsx to generate Glycopeptide Spectrum Matches (GPSMs), then serialize the pairs into a PKL file (demo_data.pkl).
python convert.py
Configuration parameters (set in convert.yaml):
Key settings:
raw_files: E:/specgp_files/0.1_demo_data/data_tce.20.33
result_files: E:/specgp_files/0.1_demo_data/data_tce.20.33
save_dir: E:/specgp_files/0.1_demo_data
save_name: demo_data.pkl
only_remove_duplicates: False
do_remove_duplicates: True
separate_train_and_val: True
ratio: 0.2
Parameter descriptions:
--raw_files: Path to the directory containing the original raw data (the folder must be named according to the rules above).
--result_files: Path to the StrucGP identification results (usually the same as --raw_files).
--save_dir: Output directory for the preprocessed data files in Python pickle (.pkl) format.
--save_name: Name of the output pickle file.
--only_remove_duplicates: When True, performs only the weighted averaging of duplicate Glycopeptide Spectrum Matches (GPSMs). Default: False (recommended).
--do_remove_duplicates: Whether to deduplicate redundant GPSMs. Default: True.
--separate_train_and_val: Whether to split the dataset directly into training and validation (test) sets.
--ratio: Fraction of the data assigned to the validation (test) set.
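As an illustration of what ratio: 0.2 means, the sketch below holds out 20% of the items as the validation set. It is an assumption about the general technique (a seeded random split); convert.py may shuffle or split differently, and split_train_val is a hypothetical name.

```python
import random


def split_train_val(items, ratio=0.2, seed=0):
    """Shuffle items and hold out `ratio` of them as the validation set."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = int(round(len(shuffled) * ratio))
    # First n_val shuffled items form the validation set, the rest the training set.
    return shuffled[n_val:], shuffled[:n_val]
```

With 100 GPSMs and ratio 0.2, this gives 80 training items and 20 validation items, with no item in both sets.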
Result output (demo data example):
demo_data.pkl: Contains all Glycopeptide-Spectrum Matches (GPSMs) prior to processing.
demo_data_dr.pkl: Stores deduplicated GPSMs, in which redundant glycopeptide structures have been removed (dr = deduplicated and refined).
demo_data_train.pkl: Training set derived from the deduplicated GPSMs.
demo_data_val.pkl: Validation/test set derived from the deduplicated GPSMs.
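A quick way to sanity-check the generated files is to load one and look at its top-level type and size. This sketch assumes the files are ordinary Python pickles; the layout of the stored objects is not documented here, and unpickling may require running from the repository root so that any SpecGP classes can be imported. Only load pickle files from sources you trust.

```python
import pickle
from pathlib import Path


def summarize_pkl(path):
    """Load a pickle file and report its top-level type name and length (if any)."""
    with Path(path).open("rb") as handle:
        data = pickle.load(handle)
    size = len(data) if hasattr(data, "__len__") else None
    return type(data).__name__, size
```

Comparing the sizes reported for demo_data.pkl and demo_data_dr.pkl gives a rough idea of how many GPSMs were merged during deduplication.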
Pre-processing for retention time (RT) prediction uses the same settings as MS/MS spectra-only prediction. Execute in sequence:
python convert.py
python rt_calibration.py --reference_file_name ShenJ_MouseBrain_C18_HILIC_IGP_CE20_33_Run1 --file_path E:/specgp_files/0.1_demo_data/demo_data.pkl --min_rt_v 11.1352 --max_rt_v 260.5374
Command-line parameter descriptions:
--reference_file_name: Specifies the file whose identified results will anchor the retention time (RT) calibration. All other files will be aligned to this reference.
--file_path: Path to the PKL file generated by convert.py.
--min_rt_v: Sets the minimum retention time (in minutes) for the analysis. If unset, defaults to the empirically observed minimum RT in the data.
--max_rt_v: Defines the maximum retention time (in minutes). When omitted, the actual maximum RT in the dataset is used.
This step generates the file demo_data_irt.pkl, which contains the RT-calibrated, preprocessed data.
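The method used by rt_calibration.py is not described in this document. As a rough sketch of one common approach, and purely as an assumption, a run can be aligned to the reference with a linear fit over the glycopeptides identified in both, and then rescaled to the range 0 to 1 using the minimum and maximum RT. The names fit_line and calibrate_run are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate_run(run_rts, reference_rts, min_rt, max_rt):
    """Map a run's RTs onto the reference scale, then rescale to [0, 1].

    run_rts and reference_rts are dicts {glycopeptide_id: RT in minutes}.
    """
    # Only glycopeptides identified in both runs can anchor the alignment.
    shared = sorted(set(run_rts) & set(reference_rts))
    slope, intercept = fit_line([run_rts[k] for k in shared],
                                [reference_rts[k] for k in shared])
    span = max_rt - min_rt
    return {k: (slope * rt + intercept - min_rt) / span
            for k, rt in run_rts.items()}
```

In this sketch, min_rt and max_rt play the role of --min_rt_v and --max_rt_v: fixing them explicitly keeps the scale identical across datasets instead of depending on each dataset's observed extremes.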
Modify the following parameters in convert.yaml (set only_remove_duplicates: True; update save_name: demo_data_irt.pkl), then run:
python convert.py
The script performs deduplication and dataset splitting on the RT-calibrated file demo_data_irt.pkl, producing:
demo_data_irt_train.pkl: Training set.
demo_data_irt_val.pkl: Validation set.
For retention time alignment across different tissues or species, use:
python rt_calibration_by_tissue.py
The example dataset is available in the data folder of the code repository.
python train.py
Configuration parameters (set in train.yaml):
Key settings:
train_path: E:/specgp_files/0.1_demo_data/demo_data_irt_train.pkl (Path to training set)
val_path: E:/specgp_files/0.1_demo_data/demo_data_irt_val.pkl (Path to validation set)
cfg: ./src/model.yaml (Model configuration file)
epochs: 40 (Number of training epochs)
resume: False (When False, training starts from scratch; when True, it resumes from the latest checkpoint in the ./runs/ directory. A specific checkpoint path can also be given.)
weights: E:/specgp_files/proteome_pretrian_base_model.pt (Pre-trained weights for the base model, which improve model accuracy and stabilize training convergence. For fine-tuning, set this to the path of existing model weights.)
The pre-trained weights are available in the data folder of the code repository. All training outputs, including model checkpoints and logs, are saved in the ./runs/train/ directory under the project root.
python predict.py --weights ./runs/train/exp/last.pt --data_path E:/specgp_files/0.1_demo_data/demo_data_irt_val.pkl --batch_size 8 --workers 1 --device cuda:0
Command-line parameter descriptions:
--weights: Path to the trained model weights (.pt or .pth file).
--data_path: Path to the input dataset (.pkl file) containing preprocessed test/validation data.
--batch_size: Inference batch size. Adjust based on GPU memory capacity (typical values: 8-64).
--workers: Number of data loading worker processes.
--device: Device to run on: a GPU ID such as cuda:0 when using GPU acceleration, or cpu for CPU-only execution.
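When several checkpoints or datasets need to be evaluated, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags documented above; build_predict_command is a hypothetical helper, not part of SpecGP.

```python
import sys


def build_predict_command(weights, data_path, batch_size=8, workers=1,
                          device="cuda:0"):
    """Return the predict.py invocation as an argument list for subprocess.run."""
    return [
        sys.executable, "predict.py",
        "--weights", str(weights),
        "--data_path", str(data_path),
        "--batch_size", str(batch_size),
        "--workers", str(workers),
        "--device", device,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the project root. Passing a list rather than a single string avoids quoting problems with paths that contain spaces.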
Dataset preparation
python sswt_convert.py --raw_and_result_files E:/specgp_files/demo_data --branch_db_path ./configs/branch_structures_17_20231018.xlsx --sce False --ces [20, 33]
--raw_and_result_files: Directory containing raw files and StrucGP identification result files
--branch_db_path: Path to the branch library file
--sce: Whether the spectra were acquired with stepped collision energy
--ces: Collision energies used for the spectra
python sswt_cache_training_data.py --raw_and_result_files E:/specgp_files/demo_data.pkl --save_dir E:/specgp_files/demo_train_data --branch_db_path ./configs/branch_structures_17_20231018.xlsx --max_exp_psm_num 15
--raw_and_result_files: GPSM result file generated by sswt_convert.py
--save_dir: Output cache directory for the training dataset
--branch_db_path: Path to the branch library file
--max_exp_psm_num: Maximum number of training GPSMs used for the same glycopeptide
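The effect of --max_exp_psm_num can be pictured as a per-glycopeptide cap. The sketch below keeps the first N GPSMs seen for each glycopeptide; which GPSMs sswt_cache_training_data.py actually keeps is not documented here, so the selection rule and the name cap_gpsms are assumptions.

```python
from collections import defaultdict


def cap_gpsms(gpsms, key, max_per_key=15):
    """Keep at most `max_per_key` GPSMs for each glycopeptide key, in input order."""
    counts = defaultdict(int)
    kept = []
    for gpsm in gpsms:
        k = key(gpsm)  # key identifies the glycopeptide a GPSM belongs to
        if counts[k] < max_per_key:
            counts[k] += 1
            kept.append(gpsm)
    return kept
```

A cap like this prevents a few frequently identified glycopeptides from dominating the training set.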
Model training
python sswt_train.py --train_data_paths ['E:/specgp_files/demo_train_data'] --batch_size 8 --train_retention_time False
--train_data_paths: Path to the training dataset generated by sswt_cache_training_data.py
--batch_size: Training batch size
--train_retention_time: Whether to train retention time prediction at the same time (when set to True, run rt_calibration.py or rt_calibration_by_tissue.py after sswt_convert.py to calibrate the retention time)
Here, we release a new version of StrucGP, which incorporates the re-scoring system based on SpecGP.
Operating system and software: Windows 11, Windows 10, or Windows 7 Service Pack 1 or above; Windows Server 2012 or above.
Hardware: Intel or AMD x86-64 processor, 4.0 GHz processor with 64 GB RAM
For detailed usage of StrucGP and instructions for obtaining a license, please refer to the original article at https://doi.org/10.5281/zenodo.4925441. Below is a brief guide to using StrucGP and its integrated SpecGP re-scoring system.
1. Run the main.exe file (in the StrucGP1.3-beta folder) to open the GUI along with a command console.
2. On the 'step 1' tab, click "Add MS file" to import the raw files to be searched against the database; click "Delete" to remove the selected raw files; click "Open fasta file" to select the matching protein database (downloaded from https://www.uniprot.org/); click "Open file" to select the branching database of glycan chains (in the StrucGP1.3-beta folder).
3. On the 'step 2' tab, select the enzyme digestion method, the maximum number of missed cleavage sites, modifications and other relevant settings.
4. On the 'step 3' tab, set the energy parameters (to analyze MS files acquired with stepped collision energy, set the high and low energies to the same value), ppm tolerance, number of threads, and re-scoring parameters.
Three models are provided in the StrucGP1.3-beta/models folder, corresponding to human and mouse configurations for dual collision energy and stepped collision energy settings. Users who need a custom model can train one with SpecGP.
5. When the program finishes, a result file with the suffix 'after_rescoring.xlsx' is generated in the same directory as the raw file; this is the final identification result. For example, on an AMD Ryzen 9 9900X CPU running Windows 11, glycan identification takes about 30 minutes and re-scoring about 15 minutes.
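After a batch of searches, the re-scored results can be collected by file suffix. This sketch lists files in a directory whose names end in after_rescoring.xlsx; find_rescored_results is a hypothetical helper and not part of StrucGP.

```python
from pathlib import Path


def find_rescored_results(raw_dir):
    """Return sorted paths of files ending in 'after_rescoring.xlsx' in raw_dir."""
    return sorted(Path(raw_dir).glob("*after_rescoring.xlsx"))
```

Each returned path can then be opened with a spreadsheet reader such as pandas or openpyxl, both of which are already listed among the dependencies.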
We thank the DeepGlyco and DeepGP teams for their open-source initiatives, which supported this research.
If you have any questions or suggestions, please contact sun_glycolab@126.com or post an issue on GitHub (https://github.com/Sun-GlycoLab/SpecGP).


