A probabilistic graphical model for taxonomic inference of viral proteome samples with associated confidence scores
Table of Contents
Our preprint is out now! You can read it here.
PepGM is a probabilistic graphical model embedded into a snakemake workflow for taxonomic inference of viral proteome samples. PepGM was developed by the the eScience group at BAM (Federal Institute for Materials Research and Testing).
The PepGM workflow includes the following steps:
- Optional host and cRAP filtering step
- SearchDB cleanup : cRAP DB ist added, host is added (if wanted), duplicate entries are removed using seqkit. generation of target-decoy DB using searchCLI. Susequent peptide search using searchCLI + PeptideShaker. Generation of a a peptide list
- All descendant strains of the target taxa are queried in the NCBI protein DB through the NCBI API. scripts: GetTargets.py, CreatePepGMGraph.py and FactorGraphGeneration.py
- Downloaded protein recordes are digested and queried against the protein ID list to generate a bipartite taxon-peptide graph. scripts: CreatePepGMGraph.py and FactorGraphGeneration.py
- The bipartite graph is transformed into a factor graph using convolution trees and conditional probability table factors (CPD). scripts: CreatePepGMGraph.py and FactorGraphGeneration.py
- For different sets of CPD parameters, the belief propagation algorithm is run until convergence to obtain the posterior probabilites of the taxa. scripts: belief_propagation.py and PepGM.py
- Through an empirically deduced metric, the ideal parameter set is inferred. script GridSearchAnalysis.py
- For this ideal parameter set, we output a results barchart and phylogenetic tree view showcasing the 15 best scoring tax. scripts: BarPlotResults, PhyloTreeView.py
If you find PepGM helpful for your research, please cite:
PepGM: A probabilistic graphical model for taxonomic inference of viral proteome samples with associated confidence scores
Tanja Holstein, Franziska Kistner, Lennart Martens, Thilo Muth
bioRxiv 2022.09.21.508832
doi: https://doi.org/10.1101/2022.09.21.508832
PepGM uses convolution trees. The code for the convolution trees was developed and is described in: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0091507
PepGM uses a version of the belief propagatin algorithm with a graphical network architecture previously described in https://pubs.acs.org/doi/10.1021/acs.jproteome.9b00566
- Your spectrum file in .mgf format
- A reference database in fasta format (see Preparation)
- A searchGUI .par parameters file with the database search parameters that can be generated using searchGUI
Additonally, you need:
- NCBI Entrez account
Make sure you have git installed and clone the repo:
git clone https://github.com/BAMeScience/PepGM.git
PepGM is a snakemake workflow developed with snakemake 5.10.0.
Installing snakemake requires mamba.
To install mamba:
conda install -n <your_env> -c conda-forge mamba
To install snakemake:
conda activate <your_env>
mamba create -c conda-forge -c bioconda -n <your_snakemake_env> snakemake
In accordance with the Snakemake recommendations, we suggest to save your sample data
in resources
folder. All outputs will be saved in results
.
Additional dependencies necessary are Java and GCC.
PepGM is tested for Linux OS and uses SearchGUI-4.1.14 and PeptideShaker-2.2.9 developed
by the CompOmics group at University of Ghent.
Download the necessary files at the following link:
- SearchGUI : http://compomics.github.io/projects/searchgui
- PeptideShaker : http://compomics.github.io/projects/peptide-shaker.html
We suggest to create a new directory bin
inside your PepGM
working directory and save the SearchGUI and PeptideShaker binaries there:
mkdir ./bin && cd bin
wget https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.1.23/SearchGUI-4.1.23-mac_and_linux.tar.gz
wget https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.16/PeptideShaker-2.2.16.zip
tar -xvf SearchGUI-4.1.23-mac_and_linux.tar.gz && unzip PeptideShaker-2.2.16.zip
You can delete the .zip files afterwards:
rm *.tar.gz && rm *.zip
We recommend using the RefSeq Viral database as a generic reference database. It can be downloaded from the NCBI ftp:
cd ./resources/Database
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/\*.protein.faa.gz &&
gzip -d viral.*.protein.faa.gz &&
cat viral.*.protein.faa> refSeqViral.fasta &&
rm viral.*.protein.faa
PepGM uses the NCBI Entrez API.
We strongly advise you to create an NCBI account with your own key due to drastic speed increase.
Find out how to obtain your NCBI API key here.
As PepGM relies on SearchGUI to perform the database search, a SearchGUI parameters file, specifying the database search parameters, has to be provided. The easiest way to generate this file is via the GUI provided by SearchGUI. Other than that, the CLI instructions to set SearchGUI parameters are described here.
PepGM needs a configuration file in yaml
format to set up the workflow.
An exemplary configuration file is provided in config/config.yaml
.
Please insert your NCBI account details (mail & key) and provide the required absolute paths to
- SamplePath
- ParametersFile
- SearchGUI & PeptideShaker binaries (SearchGUIDir & PeptideShakerDir)
Do not change the config file location.
Details on the configuration parameters
Run panel
Set up the workflow of your PepGM run by providing parameters that fill wildcards to locate input files
such as raw spectra or reference database files. Thus, use file basenames i.e., without file
suffix, that your files already have or rename them accordingly. Run: Name of your run that is used to create a subfolder in the results directory.
Sample: Name of your sample that is used to create a subfolder in the run directory.
Reference: Name of reference database (e.g. human).
Host: Trivial host name.
Scientific host: Scientific host name. Retain (scientific) host names from public libraries such as ProteomeXchange or PRIDE (e.g. homo sapiens).
Add host and crap database: Search database is extended by a host and cRAP database. Mutually exclusive to Filter Spectra.
Input panel
Specify input file and directory paths. Sample spectra: Path to raw spectra file.
Parameter: Path to SearchGUI parameter file.
Sample data: Path to directory that contains sample raw spectra files.
Database: Path to directory that contains the reference database.
Peptide Shaker: Path to PeptideShaker binary (.jar).
Search GUI (folder): Path to SearchGUI binary (.jar).
The following paths are part of the recommended project structure for Snakemake workflows. Find out more about reproducible Snakemake workflows here.
Resources: Relative path to resources folder
Results: Relative path to results folder
TaxID mapping: Relative path to folder that contains mapped taxIDs.
Search panel
Choose a search engine that SearchGUI is using and the desired FDR levels. PepGM panel
Grid search: Choose increments for alpha, beta and prior that are to be included in the grid search to tune
graphical model parameters. Do not put a comma between values. Results plotting: Number of taxa in the final strain identification barplot.
Config file panel
Provide your NCBI API mail and key.
The graphical user interface (GUI) is developed to run Snakemake workflows without modifying
the configuration file manually in a text editor.
You can write a config file from scratch or edit an existing config file.
When modifying the config file in between runs, make sure to press the Write button before running.
PepGM can also be run from the command line. To run the snakemake workflow, you need to be in your PepGM repository and have the Snakemake conda environment activated. Run the following command
snakemake --use-conda --conda-frontend conda --cores <n_cores>
Where n_cores
is the number of cores you want snakemake to use.
All PepGM output files are saved into the results folder and include the following:
Main results:
- PepGM_Results.csv: Table with values ID, score, type (contains all taxids under 'ID' and all probabilities under 'score' that were attributed by PepGM)
- PepGM_ResultsPlot.png: Posterior probabilities of n (default: 15) highest scoring taxa
- PhyloTreeView.png : n (default: 1 5) highest scoring taxa including their score visualized in a taxonomic tree
Additional (intermediate):
- Intermediate results folder sorted by their prior value for all possible grid search parameter combinations
- mapped_taxids_weights.csv: csv file of all taxids that had at least one protein map to them and their weight
- PepGM_graph.graphml: graphml file of the graphical model (without convolution tree factors). Useful to visualize the graph structure and peptide-taxon connections
- paramcheck.png: barplot of the metric used to determine the graphical model parameters for n (default: 15) best performing parameter combinations
- log files for bug fixing
We have provided a toy example (Cowpox virus Brighton Red) to ease the first steps with PepGM. You will find a reduced
viral reference database only containing peptides from cowpow and cowpox-related strains,
a SearchGUI parameter file and the host and cRAP peptide sequence database in /resources
. The cowpox MS2
spectra can be downloaded
here (PRIDE ftp archive).
Download the spectra file to /resources/SampleData/
wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014913/CPXV-0,1MOI-supernatant-HEp-24h.mgf
mv CPXV-0,1MOI-supernatant-HEp-24h.mgf spectrafile_PXD014913_cowpox_minimal_example.mgf
and adopt the reference database file basename in corresponding configuration parameter to minRefSeqViral. Finally, insert your API key and mail and replace the path to SamplePath, ParameterFile, SearchGUI and PeptideShaker with your individual locations.
- Damping oscillations
- Extension to metaproteomics+Unipept
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (
git checkout -b feature/AmazingFeature
) - Commit your Changes (
git commit -m 'Add some AmazingFeature'
) - Push to the Branch (
git push origin feature/AmazingFeature
) - Open a Pull Request
Distributed under the MIT License. See LICENSE.txt
for more information.
Tanja Holstein - @HolsteinTanja - tanja.holstein@bam.de
Franziska Kistner - LinkedIn - franziska.kistner@bam.de