Skip to content

Step by step analysis with Comet

Sarah Haynes edited this page Jul 7, 2021 · 5 revisions

This tutorial will show you how to use Philosopher for a complete proteomics data analysis, starting from a raw LC-MS file and ending with TMT-quantified protein reports.

We ran this example on a Linux Red Hat 7, so the commands shown below are Linux compatible. For Windows, you will need to adjust the folder separators from '/' to ''.

What are the basic steps?

The commands in this tutorial should be executed in a particular order. Consider this the "default" order in which to perform an analysis:

  1. Create a workspace
  2. Download a database
  3. Search with Comet
  4. PeptideProphet
  5. ProteinProphet
  6. Filter
  7. Quantify
  8. Report

Before we start

For this tutorial, we will use a publicly-available LC-MS data file from a TMT 10-plex unenriched human cell line sample described in this publication. Download the first file, 01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.mzML, from the dataset FTP location (the full listing is here).

Philosopher requires this mzML spectral format for quantification. A tutorial on raw file conversion can be found here.

For additional help on any of the Philosopher commands, you can use the --help flag (e.g. philosopher workspace --help), which will provide a description of all available flags for each command.

1. Create a workspace

(Make sure to use the full path to the Philosopher binary file in place of philosopher in the following steps.) Place the 01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.mzML in a new folder, which we will call the 'workspace'. We will create the workspace with the Philosopher workspace command, which will enable the program to store processed data in a special binary format for quick access.

Inside your workspace folder, open a new terminal window and run this command: philosopher workspace --init

From now on, all steps should be executed inside this same directory.

2. Download a protein database

For the first step we will download and format a database file using the database command, but first we need to find the Proteome ID (PID) for our organism. Searching the UniProt proteome list, we can see that the Homo sapiens proteome ID is UP000005640, so let's prepare the database file with the following command:

philosopher database --id UP000005640 --reviewed --contam

Philosopher will retrieve all reviewed protein sequences from this proteome, add common contaminants, and generate decoy sequences labeled with the tag rev_.

You should see that a new file was created in the workspace. (Note: all databases must be reformatted through Philosopher, see building a custom database or annotating an existing database.)

3. Perform a database search with Comet

Create a comet.params file by running philosopher comet --print, then edit the resulting comet.params file to update the FASTA file path and the fixed and variable modifications. (Note that we specify the TMT label as a fixed modification on lysine, and as a variable modification on the peptide N-terminus and S to account for overlabeling.):

# comet_version 2019.01 rev. 1
# Comet MS/MS search engine parameters file.
# Everything following the '#' symbol is treated as a comment.

database_name = /path/to/FASTA
decoy_search = 0                       # 0=no (default), 1=concatenated search, 2=separate search
peff_format = 0                        # 0=no (normal fasta, default), 1=PEFF PSI-MOD, 2=PEFF Unimod
peff_obo =                             # path to PSI Mod or Unimod OBO file

num_threads = 0                        # 0=poll CPU to set num threads; else specify num threads directly (max 128)

#
# masses
#
peptide_mass_tolerance = 20.00
peptide_mass_units = 2                 # 0=amu, 1=mmu, 2=ppm
mass_type_parent = 1                   # 0=average masses, 1=monoisotopic masses
mass_type_fragment = 1                 # 0=average masses, 1=monoisotopic masses
precursor_tolerance_type = 1           # 0=MH+ (default), 1=precursor m/z; only valid for amu/mmu tolerances
isotope_error = 3                      # 0=off, 1=0/1 (C13 error), 2=0/1/2, 3=0/1/2/3, 4=-8/-4/0/4/8 (for +4/+8 labeling)

#
# search enzyme
#
search_enzyme_number = 1               # choose from list at end of this params file
search_enzyme2_number = 0              # second enzyme; set to 0 if no second enzyme
num_enzyme_termini = 2                 # 1 (semi-digested), 2 (fully digested, default), 8 C-term unspecific , 9 N-term unspecific
allowed_missed_cleavage = 2            # maximum value is 5; for enzyme search

#
# Up to 9 variable modifications are supported
# format:  <mass> <residues> <0=variable/else binary> <max_mods_per_peptide> <term_distance> <n/c-term> <required> <neutral_loss>
#     e.g. 79.966331 STY 0 3 -1 0 0 97.976896
#
variable_mod01 = 15.9949 M 0 3 0 0 0 0.0
variable_mod02 = 42.010565 n 0 1 -1 0 0 0.0
variable_mod03 = 229.162932 n 0 1 -1 0 0 0.0
variable_mod04 = 229.162932 S 0 1 -1 0 0 0.0
variable_mod05 = 0.0 X 0 3 -1 0 0 0.0
variable_mod06 = 0.0 X 0 3 -1 0 0 0.0
variable_mod07 = 0.0 X 0 3 -1 0 0 0.0
variable_mod08 = 0.0 X 0 3 -1 0 0 0.0
variable_mod09 = 0.0 X 0 3 -1 0 0 0.0
max_variable_mods_in_peptide = 5
require_variable_mod = 0

#
# fragment ions
#
# ion trap ms/ms:  1.0005 tolerance, 0.4 offset (mono masses), theoretical_fragment_ions = 1
# high res ms/ms:    0.02 tolerance, 0.0 offset (mono masses), theoretical_fragment_ions = 0, spectrum_batch_size = 15000
#
fragment_bin_tol = 1.0005              # binning to use on fragment ions
fragment_bin_offset = 0.4              # offset position to start the binning (0.0 to 1.0)
theoretical_fragment_ions = 1          # 0=use flanking peaks, 1=M peak only
use_A_ions = 0
use_B_ions = 1
use_C_ions = 0
use_X_ions = 0
use_Y_ions = 1
use_Z_ions = 0
use_NL_ions = 0                        # 0=no, 1=yes to consider NH3/H2O neutral loss peaks

#
# output
#
output_sqtstream = 0                   # 0=no, 1=yes  write sqt to standard output
output_sqtfile = 0                     # 0=no, 1=yes  write sqt file
output_txtfile = 0                     # 0=no, 1=yes  write tab-delimited txt file
output_pepxmlfile = 1                  # 0=no, 1=yes  write pep.xml file
output_percolatorfile = 0              # 0=no, 1=yes  write Percolator tab-delimited input file
print_expect_score = 1                 # 0=no, 1=yes to replace Sp with expect in out & sqt
num_output_lines = 5                   # num peptide results to show
show_fragment_ions = 0                 # 0=no, 1=yes for out files only

sample_enzyme_number = 1               # Sample enzyme which is possibly different than the one applied to the search.
                                       # Used to calculate NTT & NMC in pepXML output (default=1 for trypsin).

#
# mzXML parameters
#
scan_range = 0 0                       # start and end scan range to search; either entry can be set independently
precursor_charge = 0 0                 # precursor charge range to analyze; does not override any existing charge; 0 as 1st entry ignores parameter
override_charge = 0                    # 0=no, 1=override precursor charge states, 2=ignore precursor charges outside precursor_charge range, 3=see online
ms_level = 2                           # MS level to analyze, valid are levels 2 (default) or 3
activation_method = ALL                # activation method; used if activation method set; allowed ALL, CID, ECD, ETD, ETD+SA, PQD, HCD, IRMPD

#
# misc parameters
#
digest_mass_range = 500.0 5000.0       # MH+ peptide mass range to analyze
peptide_length_range = 7 50            # minimum and maximum peptide length to analyze (default 1 63; max length 63)
num_results = 100                      # number of search hits to store internally
max_duplicate_proteins = 20            # maximum number of protein names to report for each peptide identification; -1 reports all duplicates
skip_researching = 1                   # for '.out' file output only, 0=search everything again (default), 1=don't search if .out exists
max_fragment_charge = 3                # set maximum fragment charge state to analyze (allowed max 5)
max_precursor_charge = 6               # set maximum precursor charge state to analyze (allowed max 9)
nucleotide_reading_frame = 0           # 0=proteinDB, 1-6, 7=forward three, 8=reverse three, 9=all six
clip_nterm_methionine = 0              # 0=leave sequences as-is; 1=also consider sequence w/o N-term methionine
spectrum_batch_size = 15000            # max. # of spectra to search at a time; 0 to search the entire scan range in one loop
decoy_prefix = rev_                  # decoy entries are denoted by this string which is pre-pended to each protein accession
equal_I_and_L = 1                      # 0=treat I and L as different; 1=treat I and L as same
output_suffix =                        # add a suffix to output base names i.e. suffix "-C" generates base-C.pep.xml from base.mzXML input
mass_offsets =                         # one or more mass offsets to search (values substracted from deconvoluted precursor mass)
precursor_NL_ions =                    # one or more precursor neutral loss masses, will be added to xcorr analysis

#
# spectral processing
#
minimum_peaks = 10                     # required minimum number of peaks in spectrum to search (default 10)
minimum_intensity = 0                  # minimum intensity value to read in
remove_precursor_peak = 0              # 0=no, 1=yes, 2=all charge reduced precursor peaks (for ETD), 3=phosphate neutral loss peaks
remove_precursor_tolerance = 1.5       # +- Da tolerance for precursor removal
clear_mz_range = 0.0 0.0               # for iTRAQ/TMT type data; will clear out all peaks in the specified m/z range

#
# additional modifications
#

add_Cterm_peptide = 0.0
add_Nterm_peptide = 0.0
add_Cterm_protein = 0.0
add_Nterm_protein = 0.0

add_G_glycine = 0.0000                 # added to G - avg.  57.0513, mono.  57.02146
add_A_alanine = 0.0000                 # added to A - avg.  71.0779, mono.  71.03711
add_S_serine = 0.0000                  # added to S - avg.  87.0773, mono.  87.03203
add_P_proline = 0.0000                 # added to P - avg.  97.1152, mono.  97.05276
add_V_valine = 0.0000                  # added to V - avg.  99.1311, mono.  99.06841
add_T_threonine = 0.0000               # added to T - avg. 101.1038, mono. 101.04768
add_C_cysteine = 57.021464             # added to C - avg. 103.1429, mono. 103.00918
add_L_leucine = 0.0000                 # added to L - avg. 113.1576, mono. 113.08406
add_I_isoleucine = 0.0000              # added to I - avg. 113.1576, mono. 113.08406
add_N_asparagine = 0.0000              # added to N - avg. 114.1026, mono. 114.04293
add_D_aspartic_acid = 0.0000           # added to D - avg. 115.0874, mono. 115.02694
add_Q_glutamine = 0.0000               # added to Q - avg. 128.1292, mono. 128.05858
add_K_lysine = 229.162932              # added to K - avg. 128.1723, mono. 128.09496
add_E_glutamic_acid = 0.0000           # added to E - avg. 129.1140, mono. 129.04259
add_M_methionine = 0.0000              # added to M - avg. 131.1961, mono. 131.04048
add_O_ornithine = 0.0000               # added to O - avg. 132.1610, mono  132.08988
add_H_histidine = 0.0000               # added to H - avg. 137.1393, mono. 137.05891
add_F_phenylalanine = 0.0000           # added to F - avg. 147.1739, mono. 147.06841
add_U_selenocysteine = 0.0000          # added to U - avg. 150.0379, mono. 150.95363
add_R_arginine = 0.0000                # added to R - avg. 156.1857, mono. 156.10111
add_Y_tyrosine = 0.0000                # added to Y - avg. 163.0633, mono. 163.06333
add_W_tryptophan = 0.0000              # added to W - avg. 186.0793, mono. 186.07931
add_B_user_amino_acid = 0.0000         # added to B - avg.   0.0000, mono.   0.00000
add_J_user_amino_acid = 0.0000         # added to J - avg.   0.0000, mono.   0.00000
add_X_user_amino_acid = 0.0000         # added to X - avg.   0.0000, mono.   0.00000
add_Z_user_amino_acid = 0.0000         # added to Z - avg.   0.0000, mono.   0.00000

#
# COMET_ENZYME_INFO _must_ be at the end of this parameters file
#
[COMET_ENZYME_INFO]
0.  No_enzyme              0      -           -
1.  Trypsin                1      KR          P
2.  Trypsin/P              1      KR          -
3.  Lys_C                  1      K           P
4.  Lys_N                  0      K           -
5.  Arg_C                  1      R           P
6.  Asp_N                  0      D           -
7.  CNBr                   1      M           -
8.  Glu_C                  1      DE          P
9.  PepsinA                1      FL          P
10. Chymotrypsin           1      FWYL        P


Run the following to launch the Comet search:

philosopher comet --param comet.params 01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.mzML

The search should be done within a few minutes. The search hits are now stored in a file called 01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.pep.xml.

4. PeptideProphet

The next step is to validate the peptide hits with PeptideProphet:

philosopher peptideprophet --database 2020-03-05-decoys-reviewed-contam-UP000005640.fas --ppm --accmass --decoyprobs --decoy rev_ --nonparam 01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.pep.xml

This will generate a new file called interact-01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.pep.xml.

5. ProteinProphet

Next, perform protein inference and generate a protXML file:

philosopher proteinprophet interact-01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.pep.xml

6. Filter and estimate FDR

Now we have all necessary files to filter our data using the FDR approach:

philosopher filter --razor --pepxml interact-01_CPTAC_TMTS1-NCI7_P_JHUZ_20170509_LUMOS.pep.xml --protxml interact.prot.xml

The filter algorithm can be applied in many different ways, use the --help flag and choose the best method to analyze your data. Scoring results will be shown in the console, and all processed data will be stored in your workspace for further analysis.

7. Perform label-based quantification

Filtered search results can now be quantified. Before quantifying each TMT channel, make a new text file called annotation.txt and fill it with the following to tell Philosopher what sample is in each TMT channel:

126 control_1
127N treated_1
127C control_2
128N treated_2
128C control_3
129N treated_3
129C control_4
130N treated_4
130C control_5
131N treated_5

Save this annotation.txt file, then run the quantification.

philosopher labelquant --brand tmt --plex 10 --dir . , where the . indicates the current workspace.

For other types of TMT labeling (6, 11, or 16-plex), use the appropriate --plex value. For iTRAQ experiments, use --brand itraq.

8. Report the results

Now we can inspect the results by printing the PSM, peptide, and protein reports:

philosopher report

Backup

As an optional last step, backup your data in case you wish to print the reports again later.

philosopher workspace --backup

Concluding remarks

We've demonstrated how to run a complete proteomics analysis with TMT quantification using Philosopher. By providing easy access to advanced analysis software and custom processing algorithms, quantitative protein reports can be obtained from LC-MS files in just a few minutes.