Pipeline mode for TMT analysis
In this example we will see how to process and analyze the Clear Cell Renal Carcinoma (CCRC) cohort data from the third Clinical Proteomic Tumor Analysis Consortium (CPTAC 3) study using the Philosopher pipeline with MSFragger database search. These samples are TMT-10 multiplexed and fractionated. Pipeline mode runs all steps of the analysis automatically; to run each step manually, see the step-by-step tutorial.
We will need:
- Philosopher (version 2.1.2 or higher)
- MSFragger (version 2.3 or higher, see download instructions on the website)
- Java 8 Runtime Environment (required by MSFragger)
- mzML spectral files from the Clear Cell Renal Carcinoma data set from CPTAC 3 (download instructions below)
- A human protein sequence database (see below)
- A computer or server running GNU/Linux with at least 16 GB of RAM
We ran this example on Red Hat Linux 7, so the commands shown below are Linux compatible. On Windows, you will need to change the folder separators from '/' to '\'.
Download the data set
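Select the mzML files you want to download. In this example we will use two data sets from the 'Proteome' (non-phospho enriched) part of the study, which appear in the workspace below as 01CPTAC_CCRCC_Proteome_JHU_20171007 and 02CPTAC_CCRCC_Proteome_JHU_20171003. Select the mzML files for these two data sets and press 'DOWNLOAD'.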
We don't need to do any file conversion because we are already using the mzML files provided by the consortium, but you will need to unzip/decompress the files.
Organize the workspace
Start by creating a folder for the entire analysis called CPTAC3_CCRC_tutorial. Inside it, create a folder for each of the two whole proteome multiplexed samples we've downloaded. Each of these two folders should contain 25 mzML files, one for each fraction of the multiplexed TMT-10 sample.
Create a folder called bin for the software tools we will use, a folder called params for the configuration file, and a folder called database for the protein sequence FASTA file.
The workspace structure should look like this:
CPTAC3_CCRC_tutorial
|---- 01CPTAC_CCRCC_Proteome_JHU_20171007
|---- 02CPTAC_CCRCC_Proteome_JHU_20171003
|---- bin
|     |---- MSFragger-2.3.jar
|     |---- philosopher
|---- params
|     |---- philosopher.yaml
|---- database
|     |---- 2020-03-05-decoys-reviewed-contam-UP000005640.fas
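The folder skeleton above can be created in one step. This is a minimal sketch that assumes you run it from the directory that should contain the tutorial folder; the tools, configuration file, and FASTA file still have to be copied into place afterwards.

```shell
# Sketch: create the tutorial workspace skeleton (folder names taken from the layout above).
set -eu

root="CPTAC3_CCRC_tutorial"

mkdir -p \
  "$root/01CPTAC_CCRCC_Proteome_JHU_20171007" \
  "$root/02CPTAC_CCRCC_Proteome_JHU_20171003" \
  "$root/bin" \
  "$root/params" \
  "$root/database"

# List the resulting folders to confirm the layout.
find "$root" -maxdepth 1 -type d | sort
```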
Inside each one of the two data set folders, place the 25 mzML files corresponding to all fractions for that data set, e.g.:
01CPTAC_CCRCC_Proteome_JHU_20171007 |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f01.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f02.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f03.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f04.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f05.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f06.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f07.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f08.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f09.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f10.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f11.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f12.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f13.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f14.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f15.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f16.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f17.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f18.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f19.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f20.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f21.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f22.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f23.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f24.mzML |---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_fA.mzML |---- annotation.txt
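Before starting a long run, it can help to confirm that each data set folder is complete. The helper below is a hypothetical convenience function (check_dataset is our own name, not part of Philosopher) that checks for the 25 mzML files and the annotation.txt file described above.

```shell
# check_dataset DIR: succeed if DIR holds exactly 25 mzML files and an annotation.txt.
# (Hypothetical helper for this tutorial; not a Philosopher command.)
check_dataset() {
  dir=$1
  # Count the mzML files directly inside the folder.
  n=$(find "$dir" -maxdepth 1 -type f -name '*.mzML' 2>/dev/null | wc -l | tr -d ' ')
  if [ "$n" -ne 25 ]; then
    echo "$dir: expected 25 mzML files, found $n" >&2
    return 1
  fi
  if [ ! -f "$dir/annotation.txt" ]; then
    echo "$dir: annotation.txt is missing" >&2
    return 1
  fi
  echo "$dir: OK"
}
```

From the CPTAC3_CCRC_tutorial folder, it would be called once per data set, for example: check_dataset 01CPTAC_CCRCC_Proteome_JHU_20171007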
The annotation file is a simple text file with mappings between the TMT channels and the sample labels, which is needed to generate the final reports. Each data set folder should contain a text file called annotation.txt with the mapping. Below are the annotation files for data set #01 and #02:
Data set #01:
126 CPT0079430001
127N CPT0023360001
127C CPT0023350003
128N CPT0079410003
128C CPT0087040003
129N CPT0077310003
129C CPT0077320001
130N CPT0087050003
130C CPT0002270011
131N pool01

Data set #02:
126 NCI7-1
127N CPT0078840001
127C CPT0075570001
128N CPT0075560003
128C CPT0078830003
129N CPT0077490003
129C CPT0077500001
130N CPT0023690003
130C CPT0023710001
131N pool02
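A typo in annotation.txt is easy to miss, so a quick structural check may be worthwhile. The sketch below assumes the format shown above: ten lines, each with a TMT channel and a sample label separated by whitespace. The function name check_annotation is our own and is not part of Philosopher.

```shell
# check_annotation FILE: expect 10 lines of "<channel> <label>" with unique channels.
# (Hypothetical helper; it checks structure only, not whether the labels are correct.)
check_annotation() {
  awk '
    NF != 2    { printf "line %d: expected 2 fields, found %d\n", NR, NF; bad = 1 }
    seen[$1]++ { printf "line %d: duplicate channel %s\n", NR, $1; bad = 1 }
    END {
      if (NR != 10) { printf "expected 10 channels, found %d\n", NR; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}
```

The function prints nothing and returns 0 when the file looks well formed, and prints one message per problem otherwise.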
Labels for this and other data sets can also be found on the NIH CPTAC data portal in the
Download a sequence database
If you don't already have a human protein FASTA file downloaded from UniProt by Philosopher (e.g. [download-date]-decoys-reviewed-contam-UP000005640.fas), run the following two commands inside the database folder to download and format the protein sequences:
philosopher workspace --init
philosopher database --id UP000005640 --reviewed --contam
If you already have a FASTA file (.fas extension), place it inside the database folder.
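Whichever route you take, the search needs a FASTA file that contains decoy sequences. The sketch below counts entries with and without the rev_ prefix, which is the decoy_tag value used in the configuration file in this tutorial; count_entries is a hypothetical helper name. Contaminant entries are counted together with the other non-decoy entries.

```shell
# count_entries FASTA: print how many entries lack or carry the rev_ decoy prefix.
# (Hypothetical helper; assumes the decoy tag is rev_ as in this tutorial's config.)
count_entries() {
  [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
  # grep -c exits non-zero when it counts 0 matches, hence the '|| true'.
  total=$(grep -c '^>' "$1" || true)
  decoys=$(grep -c '^>rev_' "$1" || true)
  echo "non-decoy: $((total - decoys))  decoys: $decoys"
}
```

In a target-decoy database we would expect the two counts to be equal; a decoy count of 0 suggests the file was not formatted by the database command.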
Set up the Philosopher pipeline configuration file
We will run the analysis in the automated pipeline mode, which runs all the necessary steps for us. Pipeline mode is driven by the philosopher.yaml configuration file, which is divided into two sections: the first lists all the commands the program is able to automate, and the second contains the specific parameters for each command (see the documentation for more information). We will set each of the desired commands to yes in the first section, then configure the individual steps. We will use the philosopher.yaml file below. Make sure the full file paths for the MSFragger jar and the FASTA database are correct:
# Philosopher pipeline configuration file.
#
# The pipeline mode automates the processing done by Philosopher. First, check
# the steps you want to execute in the commands section and change them to
# 'yes'. For each selected command, go to its section and adjust the parameters
# according to your analysis.
#
# If you want to include MSFragger and TMT-Integrator in your analysis, you will
# have to download them separately and then add their locations to their configuration.
#
# Usage:
#   philosopher pipeline --config <this_configuration_file> [list_of_data_set_folders]

analytics: true  # reports when a workspace is created for usage estimation (default true)
slackToken:  # specify the Slack API token
slackChannel:  # specify the channel name

commands:
  workspace: yes  # manage the experiment workspace for the analysis
  database: yes  # target-decoy database formatting
  comet: no  # peptide spectrum matching with Comet
  msfragger: yes  # peptide spectrum matching with MSFragger
  peptideprophet: yes  # peptide assignment validation
  ptmprophet: no  # PTM site localization
  proteinprophet: no  # protein identification validation
  filter: yes  # statistical filtering, validation and False Discovery Rates assessment
  freequant: yes  # label-free Quantification
  labelquant: yes  # isobaric Labeling-Based Relative Quantification
  bioquant: no  # protein report based on Uniprot protein clusters
  report: yes  # multi-level reporting for both narrow-searches and open-searches
  abacus: yes  # combined analysis of LC-MS/MS results
  tmtintegrator: yes  # integrates channel abundances from multiple TMT samples

database:
  protein_database: /CPTAC3_CCRC_tutorial/database/2020-03-05-decoys-reviewed-contam-UP000005640.fas  # path to the target-decoy protein database
  decoy_tag: rev_  # prefix tag added to decoy sequences

comet:
  noindex: true  # skip raw file indexing
  param:  # comet parameter file (default "comet.params.txt")
  raw: mzML  # format of the spectra file

msfragger:  # v2.3
  path: /CPTAC3_CCRC_tutorial/bin/MSFragger-2.3.jar  # path to MSFragger jar
  memory: 16  # how much memory in GB to use
  param:  # MSFragger parameter file
  raw: mzML  # spectra format
  num_threads: 0  # 0=poll CPU to set num threads; else specify num threads directly (max 64)
  precursor_mass_lower: -20  # lower bound of the precursor mass window
  precursor_mass_upper: 20  # upper bound of the precursor mass window
  precursor_mass_units: 1  # 0=Daltons, 1=ppm
  precursor_true_tolerance: 20  # true precursor mass tolerance (window is +/- this value)
  precursor_true_units: 1  # 0=Daltons, 1=ppm
  fragment_mass_tolerance: 20  # fragment mass tolerance (window is +/- this value)
  fragment_mass_units: 1  # fragment mass tolerance units (0 for Da, 1 for ppm)
  calibrate_mass: 0  # 0=Off, 1=On, 2=On and find optimal parameters
  deisotope: 0  # activates deisotoping.
  isotope_error: -1/0/1/2/3  # 0=off, -1/0/1/2/3 (standard C13 error)
  mass_offsets: 0  # allow for additional precursor mass window shifts. Multiplexed with isotope_error. mass_offsets = 0/79.966 can be used as a restricted 'open' search that looks for unmodified and phosphorylated peptides (on any residue)
  precursor_mass_mode: selected  # selected or isolated
  localize_delta_mass: 0  # this allows shifted fragment ions - fragment ions with mass increased by the calculated mass difference, to be included in scoring
  delta_mass_exclude_ranges: (-1.5,3.5)  # exclude mass range for shifted ions searching
  fragment_ion_series: b,y  # ion series used in search
  search_enzyme_name: Trypsin  # name of enzyme to be written to the pepXML file
  search_enzyme_cutafter: KR  # residues after which the enzyme cuts
  search_enzyme_butnotafter: P  # residues that the enzyme will not cut before
  num_enzyme_termini: 2  # 2 for enzymatic, 1 for semi-enzymatic, 0 for nonspecific digestion
  allowed_missed_cleavage: 2  # maximum value is 5
  clip_nTerm_M: 1  # specifies the trimming of a protein N-terminal methionine as a variable modification (0 or 1)
  variable_mod_01: 15.99490 M 3  # variable modification
  variable_mod_02: 42.01060 [^ 1  # variable modification
  variable_mod_03: 229.162932 n^ 1  # variable modification
  variable_mod_04: 229.162932 S 1  # variable modification
  variable_mod_05:  # variable modification
  variable_mod_06:  # variable modification
  variable_mod_07:  # variable modification
  allow_multiple_variable_mods_on_residue: 1  # static mods are not considered
  max_variable_mods_per_peptide: 3  # maximum of 5
  max_variable_mods_combinations: 5000  # maximum of 65534, limits number of modified peptides generated from sequence
  output_file_extension: pepXML  # file extension of output files
  output_format: pepXML  # file format of output files (pepXML or tsv)
  output_report_topN: 3  # reports top N PSMs per input spectrum
  output_max_expect: 50  # suppresses reporting of PSM if top hit has expectation greater than this threshold
  report_alternative_proteins: 0  # 0=no, 1=yes
  precursor_charge: 1 6  # assume range of potential precursor charge states. Only relevant when override_charge is set to 1
  override_charge: 0  # 0=no, 1=yes to override existing precursor charge states with precursor_charge parameter
  digest_min_length: 7  # minimum length of peptides to be generated during in-silico digestion
  digest_max_length: 50  # maximum length of peptides to be generated during in-silico digestion
  digest_mass_range: 500.0 5000.0  # mass range of peptides to be generated during in-silico digestion in Daltons
  max_fragment_charge: 2  # maximum charge state for theoretical fragments to match (1-4)
  track_zero_topN: 0  # in addition to topN results, keep track of top results in zero bin
  zero_bin_accept_expect: 0  # boost top zero bin entry to top if it has expect under 0.01 - set to 0 to disable
  zero_bin_mult_expect: 1  # disabled if above passes - multiply expect of zero bin for ordering purposes (does not affect reported expect)
  add_topN_complementary: 0  # inserts complementary ions corresponding to the top N most intense fragments in each experimental spectra
  minimum_peaks: 15  # required minimum number of peaks in spectrum to search (default 10)
  use_topN_peaks: 150  # pre-process experimental spectrum to only use top N peaks
  min_fragments_modelling: 3  # minimum number of matched peaks in PSM for inclusion in statistical modeling
  min_matched_fragments: 4  # minimum number of matched peaks for PSM to be reported
  minimum_ratio: 0.01  # filters out all peaks in experimental spectrum less intense than this multiple of the base peak intensity
  clear_mz_range: 125.5 131.5  # for iTRAQ/TMT type data; will clear out all peaks in the specified m/z range
  remove_precursor_peak: 0  # remove precursor peaks from tandem mass spectra. 0=not remove; 1=remove the peak with precursor charge; 2=remove the peaks with all charge states.
  remove_precursor_range: -1.5,1.5  # m/z range in removing precursor peaks. Unit: Da.
  intensity_transform: 0  # transform peaks intensities with sqrt root. 0=not transform; 1=transform using sqrt root.
  add_Cterm_peptide: 0.000000  # c-term peptide fixed modifications
  add_Cterm_protein: 0.000000  # c-term protein fixed modifications
  add_Nterm_peptide: 0.000000  # n-term peptide fixed modifications
  add_Nterm_protein: 0.000000  # n-term protein fixed modifications
  add_A_alanine: 0.000000  # alanine fixed modifications
  add_C_cysteine: 57.021464  # cysteine fixed modifications
  add_D_aspartic_acid: 0.000000  # aspartic acid fixed modifications
  add_E_glutamic_acid: 0.000000  # glutamic acid fixed modifications
  add_F_phenylalanine: 0.000000  # phenylalanine fixed modifications
  add_G_glycine: 0.000000  # glycine fixed modifications
  add_H_histidine: 0.000000  # histidine fixed modifications
  add_I_isoleucine: 0.000000  # isoleucine fixed modifications
  add_K_lysine: 229.162932  # lysine fixed modifications
  add_L_leucine: 0.000000  # leucine fixed modifications
  add_M_methionine: 0.000000  # methionine fixed modifications
  add_N_asparagine: 0.000000  # asparagine fixed modifications
  add_P_proline: 0.000000  # proline fixed modifications
  add_Q_glutamine: 0.000000  # glutamine fixed modifications
  add_R_arginine: 0.000000  # arginine fixed modifications
  add_S_serine: 0.000000  # serine fixed modifications
  add_T_threonine: 0.000000  # threonine fixed modifications
  add_V_valine: 0.000000  # valine fixed modifications
  add_W_tryptophan: 0.000000  # tryptophan fixed modifications
  add_Y_tyrosine: 0.000000  # tyrosine fixed modifications

peptideprophet:  # v5.2
  extension: pepXML  # pepXML file extension
  clevel: 0  # set Conservative Level in neg_stdev from the neg_mean, low numbers are less conservative, high numbers are more conservative
  accmass: true  # use Accurate Mass model binning
  decoyprobs: true  # compute possible non-zero probabilities for Decoy entries on the last iteration
  enzyme: trypsin  # enzyme used in sample (optional)
  exclude: false  # exclude deltaCn*, Mascot*, and Comet* results from results (default Penalize * results)
  expectscore: true  # use expectation value as the only contributor to the f-value for modeling
  forcedistr: false  # bypass quality control checks, report model despite bad modeling
  glyc: false  # enable peptide Glyco motif model
  icat: false  # apply ICAT model (default Autodetect ICAT)
  instrwarn: false  # warn and continue if combined data was generated by different instrument models
  leave: false  # leave alone deltaCn*, Mascot*, and Comet* results from results (default Penalize * results)
  maldi: false  # enable MALDI mode
  masswidth: 5  # model mass width (default 5)
  minpeplen: 7  # minimum peptide length not rejected (default 7)
  minpintt: 2  # minimum number of NTT in a peptide used for positive pI model (default 2)
  minpiprob: 0.9  # minimum probability after first pass of a peptide used for positive pI model (default 0.9)
  minprob: 0.05  # report results with minimum probability (default 0.05)
  minrtntt: 2  # minimum number of NTT in a peptide used for positive RT model (default 2)
  minrtprob: 0.9  # minimum probability after first pass of a peptide used for positive RT model (default 0.9)
  neggamma: false  # use Gamma distribution to model the negative hits
  noicat: false  # do not apply ICAT model (default Autodetect ICAT)
  nomass: false  # disable mass model
  nonmc: false  # disable NMC missed cleavage model
  nonparam: true  # use semi-parametric modeling, must be used in conjunction with --decoy option
  nontt: false  # disable NTT enzymatic termini model
  optimizefval: false  # (SpectraST only) optimize f-value function f(dot,delta) using PCA
  phospho: false  # enable peptide Phospho motif model
  pi: false  # enable peptide pI model
  ppm: true  # use PPM mass error instead of Dalton for mass modeling
  zero: false  # report results with minimum probability 0

ptmprophet:  # v5.2
  autodirect: false  # use direct evidence when the lability is high, use in combination with LABILITY
  cions:  # use specified C-term ions, separate multiple ions by commas (default: y for CID, z for ETD)
  direct: false  # use only direct evidence for evaluating PTM site probabilities
  em: 2  # set EM models to 0 (no EM), 1 (Intensity EM Model Applied) or 2 (Intensity and Matched Peaks EM Models Applied)
  static: false  # use static fragppmtol for all PSMs instead of dynamically estimates offsets and tolerances
  fragppmtol: 15  # when computing PSM-specific mass_offset and mass_tolerance, use specified default +/- MS2 mz tolerance on fragment ions
  ifrags: false  # use internal fragments for localization
  keepold: false  # retain old PTMProphet results in the pepXML file
  lability: false  # compute Lability of PTMs
  massdiffmode: false  # use the Mass Difference and localize
  massoffset: 0  # adjust the massdiff by offset (0 = use default)
  maxfragz: 0  # limit maximum fragment charge (default: 0=precursor charge, negative values subtract from precursor charge)
  maxthreads: 4  # use specified number of threads for processing
  mino: 0  # use specified number of pseudo-counts when computing Oscore (0 = use default)
  minprob: 0  # use specified minimum probability to evaluate peptides
  mods:  # specify modifications
  nions:  # use specified N-term ions, separate multiple ions by commas (default: a,b for CID, c for ETD)
  nominofactor: false  # disable MINO factor correction when MINO= is set greater than 0 (default: apply MINO factor correction)
  ppmtol: 1  # use specified +/- MS1 ppm tolerance on peptides which may have a slight offset depending on search parameters
  verbose: false  # produce Warnings to help troubleshoot potential PTM shuffling or mass difference issues

proteinprophet:  # v5.2
  accuracy: false  # equivalent to --minprob 0
  allpeps: false  # consider all possible peptides in the database in the confidence model
  confem: false  # use the EM to compute probability given the confidence
  delude: false  # do NOT use peptide degeneracy information when assessing proteins
  excludezeros: false  # exclude zero prob entries
  fpkm: false  # model protein FPKM values
  glyc: false  # highlight peptide N-glycosylation motif
  icat: false  # highlight peptide cysteines
  instances: false  # use Expected Number of Ion Instances to adjust the peptide probabilities prior to NSP adjustment
  iprophet: false  # input is from iProphet
  logprobs: false  # use the log of the probabilities in the Confidence calculations
  maxppmdiff: 20  # maximum peptide mass difference in PPM (default 20)
  minprob: 0.05  # peptideProphet probability threshold (default 0.05)
  mufactor: 1  # fudge factor to scale MU calculation (default 1)
  nogroupwts: false  # check peptide's Protein weight against the threshold (default: check peptide's Protein Group weight against threshold)
  nonsp: false  # do not use NSP model
  nooccam: false  # non-conservative maximum protein list
  noprotlen: false  # do not report protein length
  normprotlen: false  # normalize NSP using Protein Length
  protmw: false  # get protein mol weights
  softoccam: false  # peptide weights are apportioned equally among proteins within each Protein Group (less conservative protein count estimate)
  unmapped: false  # report results for UNMAPPED proteins

filter:
  psmFDR: 0.01  # psm FDR level (default 0.01)
  peptideFDR: 0.01  # peptide FDR level (default 0.01)
  ionFDR: 0.01  # peptide ion FDR level (default 0.01)
  proteinFDR: 0.01  # protein FDR level (default 0.01)
  peptideProbability: 0.7  # top peptide probability threshold for the FDR filtering (default 0.7)
  proteinProbability: 0.5  # protein probability threshold for the FDR filtering (not used with the razor algorithm) (default 0.5)
  peptideWeight: 0.9  # threshold for defining peptide uniqueness (default 1)
  razor: true  # use razor peptides for protein FDR scoring
  picked: true  # apply the picked FDR algorithm before the protein scoring
  mapMods: true  # map modifications acquired by an open search
  models: true  # print model distribution
  sequential: true  # alternative algorithm that estimates FDR using both filtered PSM and Protein lists

freequant:
  peakTimeWindow: 0.4  # specify the time windows for the peak (minute) (default 0.4)
  retentionTimeWindow: 3  # specify the retention time window for xic (minute) (default 3)
  tolerance: 10  # m/z tolerance in ppm (default 10)
  isolated: true  # use the isolated ion instead of the selected ion for quantification

labelquant:
  annotation: annotation.txt  # annotation file with custom names for the TMT channels
  bestPSM: true  # select the best PSMs for protein quantification
  level: 2  # ms level for the quantification
  minProb: 0.7  # only use PSMs with a minimum probability score
  plex: 10  # number of channels
  purity: 0.5  # ion purity threshold (default 0.5)
  removeLow: 0.05  # ignore the lowest fraction of PSMs based on their summed abundances (0.05 = lowest 5%)
  tolerance: 20  # m/z tolerance in ppm (default 20)
  uniqueOnly: false  # report quantification based on only unique peptides

report:
  msstats: false  # create an output compatible to MSstats
  withDecoys: false  # add decoy observations to reports
  mzID: false  # create a mzID output

bioquant:
  organismUniProtID:  # UniProt proteome ID
  level: 0.9  # cluster identity level (default 0.9)

abacus:
  protein: true  # global level protein report
  peptide: false  # global level peptide report
  proteinProbability: 0.05  # minimum protein probability (default 0.9)
  peptideProbability: 0.5  # minimum peptide probability (default 0.5)
  uniqueOnly: false  # report TMT quantification based on only unique peptides
  reprint: false  # create abacus reports using the Reprint format

tmtintegrator:  # v1.1.2
  path:  # path to TMT-Integrator jar
  memory: 100  # memory allocation, in Gb
  output:  # the location of output files
  channel_num: 10  # number of channels in the multiplex (e.g. 10, 11)
  ref_tag: pool  # unique tag for identifying the reference channel (Bridge sample added to each multiplex)
  groupby: -1  # level of data summarization(0: PSM aggregation to the gene level; 1: protein; 2: peptide sequence; 3: PTM site; -1: generate reports at all levels)
  psm_norm: false  # perform additional retention time-based normalization at the PSM level
  outlier_removal: true  # perform outlier removal
  prot_norm: -1  # normalization (0: None; 1: MD (median centering); 2: GN (median centering + variance scaling); -1: generate reports with all normalization options)
  min_pep_prob: 0.9  # minimum PSM probability threshold (in addition to FDR-based filtering by Philosopher)
  min_purity: 0.5  # ion purity score threshold
  min_percent: 0.05  # remove low intensity PSMs (e.g. value of 0.05 indicates removal of PSMs with the summed TMT reporter ions intensity in the lowest 5% of all PSMs)
  unique_pep: false  # allow PSMs with unique peptides only (if true) or unique plus razor peptides (if false), as classified by Philosopher and defined in PSM.tsv files
  unique_gene: 0  # additional, gene-level uniqueness filter (0: allow all PSMs; 1: remove PSMs mapping to more than one GENE with evidence of expression in the dataset; 2:remove all PSMs mapping to more than one GENE in the fasta file)
  best_psm: true  # keep the best PSM only (highest summed TMT intensity) among all redundant PSMs within the same LC-MS run
  prot_exclude: sp|,tr|  # exclude proteins with specified tags at the beginning of the accession number (e.g. none: no exclusion; sp|,tr| : exclude protein with sp| or tr|)
  allow_overlabel: false  # allow PSMs with TMT on S (when overlabeling on S was allowed in the database search)
  allow_unlabeled: false  # allow PSMs without TMT tag or acetylation on the peptide n-terminus
  mod_tag: none  # PTM info for generation of PTM-specific reports (none: for Global data; S(79.9663),T(79.9663),Y(79.9663): for Phospho; K(42.0106): for K-Acetyl)
  min_site_prob: -1  # site localization confidence threshold (-1: for Global; 0: as determined by the search engine; above 0 (e.g. 0.75): PTMProphet probability, to be used with phosphorylation only)
  ms1_int: true  # use MS1 precursor ion intensity (if true) or MS2 summed TMT reporter ion intensity (if false) as part of the reference sample abundance estimation
  top3_pep: true  # use top 3 most intense peptide ions as part of the reference sample abundance estimation
  print_RefInt: false  # print individual reference sample abundance estimates for each multiplex in the final reports (in addition to the combined reference sample abundance estimate)
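Because a wrong file path is one of the easiest mistakes to make in this file, it can be worth checking the configured paths before launching the pipeline. The sketch below is not a YAML parser: it assumes the one-entry-per-line "key: value # comment" layout of philosopher.yaml, and it only looks at the path and protein_database keys. The function names config_paths and check_config_paths are our own.

```shell
# config_paths FILE: print the non-empty values of the 'path:' and 'protein_database:' keys.
# (Line-based sketch; assumes values contain no '#' character.)
config_paths() {
  sed -n -E 's/^[[:space:]]*(path|protein_database):[[:space:]]*([^#[:space:]][^#]*).*$/\2/p' "$1" |
    sed -E 's/[[:space:]]+$//'
}

# check_config_paths FILE: report each configured path as found or missing;
# return non-zero if at least one is missing.
check_config_paths() {
  status=0
  paths=$(config_paths "$1")
  while IFS= read -r p; do
    [ -n "$p" ] || continue
    if [ -e "$p" ]; then
      echo "found:   $p"
    else
      echo "missing: $p"
      status=1
    fi
  done <<EOF
$paths
EOF
  return $status
}
```

For this tutorial the call would be check_config_paths params/philosopher.yaml. Keys left empty are skipped rather than reported, so an empty path entry for a tool that is switched on in the commands section still needs to be checked by eye.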
Run the pipeline
To start the pipeline, we need to run Philosopher using the pipeline command, passing each of the data sets we wish to process together. The command below uses relative paths, so run it from inside the CPTAC3_CCRC_tutorial folder.
$ bin/philosopher pipeline --config params/philosopher.yaml 01CPTAC_CCRCC_Proteome_JHU_20171007 02CPTAC_CCRCC_Proteome_JHU_20171003
Each step will be executed sequentially, and no other commands or input from the user are necessary.
When the analysis is done, we will have individual results for each multiplexed TMT sample as well as the combined protein expression matrix containing all TMT channels labeled according to the annotation.txt file. You should have new .tsv files in your workspace, which contain the filtered PSM, peptide, ion, and protein identifications.
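To get a quick overview of what the run produced, the .tsv reports can be listed together with their row counts. This is a minimal sketch; summarize_reports is a hypothetical helper name, and it makes no assumption about which report files are present or what they are called.

```shell
# summarize_reports DIR: for every .tsv file under DIR, print the number of
# data rows (line count minus the header line) and the file path, tab-separated.
summarize_reports() {
  find "$1" -type f -name '*.tsv' | sort | while IFS= read -r f; do
    rows=$(awk 'END { print (NR > 0 ? NR - 1 : 0) }' "$f")
    printf '%s\t%s\n' "$rows" "$f"
  done
}
```

Running summarize_reports . from the CPTAC3_CCRC_tutorial folder covers both data set folders and any combined reports. A report with 0 rows usually deserves a closer look at the pipeline log.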