Pipeline mode for TMT analysis

Sarah Haynes edited this page Jun 17, 2021 · 6 revisions

In this example we will process and analyze the Clear Cell Renal Carcinoma (CCRC) cohort data from the third Clinical Proteomic Tumor Analysis Consortium (CPTAC 3) study using the Philosopher pipeline with MSFragger database search. These samples are TMT-10 multiplexed and fractionated. Pipeline mode runs all steps of the analysis automatically; to run each step manually, see the step-by-step tutorial.

We will need:

  • Philosopher (version 2.1.2 or higher)
  • MSFragger (version 2.3 or higher, see download instructions on the website)
  • Java 8 Runtime Environment (required by MSFragger)
  • mzML spectral files from the Clear Cell Renal Carcinoma data set from CPTAC 3 (download instructions below)
  • A human protein sequence database (see below)
  • A computer or server running GNU/Linux with at least 16 GB of RAM

We ran this example on Red Hat Linux 7, so the commands shown below are Linux compatible. On Windows, you will need to change the folder separators from '/' to '\'.

Download the data set

The CPTAC 3 data can be downloaded from the NIH/CPTAC data portal, which requires an installation of the IBM Aspera Connect browser extension and application. You'll also need to agree to the terms of use for the data.

Select the mzML files you want to download. In this example we will use two data sets from the 'Proteome' (non-phospho-enriched) part of the study. Select the following two mzML data sets and press 'DOWNLOAD':

  • 01CPTAC_CCRCC_Proteome_JHU_20171007
  • 02CPTAC_CCRCC_Proteome_JHU_20171003

We don't need to do any file conversion because we are already using the mzML files provided by the consortium, but you will need to unzip/decompress the files.

Organize the workspace

Start by creating a folder for the entire analysis called CPTAC3_CCRC_tutorial. Inside it, create a folder for each of the two whole proteome plexes we've downloaded; each of these two folders should contain 25 mzML files representing the 25 fractions of the TMT-10 plex. Also create a folder called bin for the software tools we will use, a folder called params for the configuration file, and a folder called database for the protein sequence FASTA file.

The workspace structure should look like this:

CPTAC3_CCRC_tutorial
|---- 01CPTAC_CCRCC_Proteome_JHU_20171007
|---- 02CPTAC_CCRCC_Proteome_JHU_20171003
|---- bin
|   |---- MSFragger-2.3.jar
|   |---- philosopher
|---- params
|   |---- philosopher.yml
|---- database
|   |---- 2020-03-05-decoys-reviewed-contam-UP000005640.fas
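
The folder skeleton above can be created in one go. The following is a minimal sketch that assumes it is run from the directory where the tutorial workspace should live; it only creates the folders, so the tools, configuration file, and FASTA file still need to be placed inside them afterwards.

```shell
# Create the tutorial workspace skeleton (folder names taken from the tree above).
root=CPTAC3_CCRC_tutorial
for d in 01CPTAC_CCRCC_Proteome_JHU_20171007 \
         02CPTAC_CCRCC_Proteome_JHU_20171003 \
         bin params database; do
    mkdir -p "$root/$d"
done

# List the result to confirm the layout.
ls "$root"
```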

Inside each one of the two data set folders, place the 25 mzML files corresponding to all fractions for that data set, e.g.:

01CPTAC_CCRCC_Proteome_JHU_20171007
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f01.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f02.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f03.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f04.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f05.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f06.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f07.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f08.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f09.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f10.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f11.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f12.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f13.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f14.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f15.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f16.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f17.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f18.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f19.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f20.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f21.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f22.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f23.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_f24.mzML
|---- 01CPTAC_CCRCC_W_JHU_20171007_LUMOS_fA.mzML
|---- annotation.txt
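
Before going further, it can help to confirm that each data set folder really holds 25 mzML files and an annotation.txt. The check below is a small sketch, assuming it is run from inside CPTAC3_CCRC_tutorial; the helper name count_mzml is our own and not part of Philosopher.

```shell
# Count the mzML files directly inside a data set folder.
# count_mzml is a hypothetical helper name used only in this sketch.
count_mzml() {
    find "$1" -maxdepth 1 -type f -name '*.mzML' | wc -l | tr -d ' '
}

for d in 01CPTAC_CCRCC_Proteome_JHU_20171007 \
         02CPTAC_CCRCC_Proteome_JHU_20171003; do
    if [ ! -d "$d" ]; then
        echo "$d: folder not found"
        continue
    fi
    n=$(count_mzml "$d")
    # Each plex should contain 25 fractions plus the annotation file.
    [ "$n" -eq 25 ] || echo "$d: expected 25 mzML files, found $n"
    [ -f "$d/annotation.txt" ] || echo "$d: annotation.txt is missing"
done
```

The script prints nothing for a folder that is complete, and one line per problem otherwise.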

The annotation file is a simple text file that maps the TMT channels to the sample labels; it is needed to generate the final reports. Each data set folder should contain a text file called annotation.txt with this mapping. Below are the annotation files for data sets #01 and #02:

01CPTAC_CCRCC_Proteome_JHU_20171007:

126 CPT0079430001
127N CPT0023360001
127C CPT0023350003
128N CPT0079410003
128C CPT0087040003
129N CPT0077310003
129C CPT0077320001
130N CPT0087050003
130C CPT0002270011
131N pool01

02CPTAC_CCRCC_Proteome_JHU_20171003:

126 NCI7-1
127N CPT0078840001
127C CPT0075570001
128N CPT0075560003
128C CPT0078830003
129N CPT0077490003
129C CPT0077500001
130N CPT0023690003
130C CPT0023710001
131N pool02

Labels for this and other data sets can also be found on the NIH CPTAC data portal in the CPTAC_CCRCC_metadata folder.
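
As an illustration, the annotation file for data set #01 can be written from the command line with a here-document. This is a sketch that assumes it is run from inside CPTAC3_CCRC_tutorial; the closing awk check is our own addition and only verifies that the file has ten lines of two columns each, as in the listings above. The file for data set #02 can be written the same way with its own labels.

```shell
dir=01CPTAC_CCRCC_Proteome_JHU_20171007
mkdir -p "$dir"   # harmless if the folder already exists

# One line per TMT channel: channel name, then sample label.
cat > "$dir/annotation.txt" <<'EOF'
126 CPT0079430001
127N CPT0023360001
127C CPT0023350003
128N CPT0079410003
128C CPT0087040003
129N CPT0077310003
129C CPT0077320001
130N CPT0087050003
130C CPT0002270011
131N pool01
EOF

# Sanity check: a TMT-10 plex needs 10 lines with exactly 2 fields each.
awk 'NF != 2 { bad++ }
     END {
         if (bad || NR != 10) { print "unexpected annotation layout"; exit 1 }
         print "10 channels mapped"
     }' "$dir/annotation.txt"
```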

Download a sequence database

If you don't already have a human protein FASTA file downloaded from UniProt by Philosopher (e.g. [download-date]-decoys-reviewed-contam-UP000005640.fas), run the following two commands inside the database folder to download and format protein sequences:

philosopher workspace --init

philosopher database --id UP000005640 --reviewed --contam

If you already have a FASTA file (.fas extension), place it inside the database folder.
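
Whichever way the FASTA file was obtained, a quick look at its headers shows whether decoys are present. The sketch below assumes it is run from inside CPTAC3_CCRC_tutorial and that decoy entries carry the rev_ prefix; if your database was built with a different decoy tag, adjust the pattern. The helper name fasta_summary is hypothetical.

```shell
# Print the number of sequence entries and decoy entries in a FASTA file.
# Assumption: decoy headers start with ">rev_".
fasta_summary() {
    total=$(grep -c '^>' "$1" || true)
    decoys=$(grep -c '^>rev_' "$1" || true)
    echo "entries=$total decoys=$decoys"
}

for f in database/*.fas; do
    # If the glob matched nothing, $f is the literal pattern and does not exist.
    [ -e "$f" ] || { echo "no .fas file found in database/"; break; }
    printf '%s: ' "$f"
    fasta_summary "$f"
done
```

For a target-decoy database, roughly half of the entries would be expected to be decoys.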

Set up the Philosopher pipeline configuration file

We will do the analysis using pipeline mode, which automatically runs all the necessary steps for us. Pipeline mode uses the philosopher.yml configuration file.

From inside the params folder, print the default configuration file by running:

philosopher pipeline --print

The configuration file is divided into two sections: the first contains a list of all the commands the program is able to automate, and the second contains the specific parameters for individual commands (see the documentation for more information). We will set each of the desired commands to yes in the upper part, then configure the individual steps. Below are the specific areas of the configuration file that should be modified from the defaults and set to the values shown. Make sure the full file paths for the MSFragger .jar file and the FASTA database are correct.

Steps:
Database Search: yes
Peptide Validation: yes
Label-Free Quantification: yes
Isobaric Quantification: yes
FDR Filtering: yes
Individual Reports: yes
Integrated Reports: yes
Integrated Isobaric Quantification: yes

Database Search:
protein_database: [add the protein FASTA file name here]
search_engine: msfragger
path: [add the MSFragger binary file path here]
precursor_mass_lower: -20
precursor_mass_upper: 20
isotope_error: -1/0/1/2/3
search_enzyme_name: stricttrypsin
search_enzyme_cutafter: KR
allowed_missed_cleavage: 2
variable_mod_03: 229.162932 n^ 1
variable_mod_04: 229.162932 S 1
precursor_charge: 1 6
use_topN_peaks: 300
clear_mz_range: 125.5 131.5
add_K_lysine: 229.162932

Isobaric Quantification:
bestPSM: true
plex: 10
removeLow: 0.05

FDR Filtering:
razor: true
picked: true
mapMods: true
models: true
sequential: true
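
Since wrong file paths are an easy mistake to make here, the two path entries can be checked before launching anything. The sketch below assumes it is run from inside CPTAC3_CCRC_tutorial. The helper yaml_value is hypothetical: it is a naive line-based lookup that returns the first matching key and strips trailing comments, not a real YAML parser, so it may pick the wrong line if the same key name appears more than once in your file.

```shell
# Print the value of the first "key: value" line for key $2 in file $1.
# Naive, line-based lookup; not a YAML parser.
yaml_value() {
    sed -n "s/^[[:space:]]*$2:[[:space:]]*//p" "$1" \
        | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
        | head -n 1
}

cfg=params/philosopher.yml
if [ -f "$cfg" ]; then
    for key in protein_database path; do
        value=$(yaml_value "$cfg" "$key")
        if [ -n "$value" ] && [ -e "$value" ]; then
            echo "$key: found $value"
        else
            echo "$key: '$value' does not exist, check the path"
        fi
    done
else
    echo "$cfg not found"
fi
```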

Run the pipeline

To start the pipeline, run Philosopher from the CPTAC3_CCRC_tutorial folder using the pipeline command, passing the folder of each data set we wish to process together.

bin/philosopher pipeline --config params/philosopher.yml 01CPTAC_CCRCC_Proteome_JHU_20171007 02CPTAC_CCRCC_Proteome_JHU_20171003

Each step will be executed sequentially, and no other commands or input from the user are necessary.
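
Because the run is long and unattended, it can be worth wrapping the command so that it refuses to start when an input is missing and keeps a copy of the console output. The following is one possible sketch, assuming it is run from inside CPTAC3_CCRC_tutorial; the helper missing_inputs and the log file name pipeline.log are our own choices, not something Philosopher requires.

```shell
# Print every path in the argument list that does not exist.
# missing_inputs is a hypothetical helper used only in this sketch.
missing_inputs() {
    for p in "$@"; do
        [ -e "$p" ] || echo "$p"
    done
}

set1=01CPTAC_CCRCC_Proteome_JHU_20171007
set2=02CPTAC_CCRCC_Proteome_JHU_20171003

missing=$(missing_inputs bin/philosopher params/philosopher.yml "$set1" "$set2")
if [ -n "$missing" ]; then
    echo "Missing inputs, pipeline not started:"
    echo "$missing"
else
    # Same command as above; output is shown and also saved to pipeline.log.
    bin/philosopher pipeline --config params/philosopher.yml "$set1" "$set2" 2>&1 \
        | tee pipeline.log
fi
```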

Wrapping up

When the analysis is done, we will have individual results for each multiplexed TMT sample as well as the combined protein expression matrix containing all TMT channels labeled according to the annotation.txt file. You should have new .tsv files in your workspace, which contain the filtered PSM, peptide, ion, and protein identifications.
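
To get a quick overview of what was produced, the new report files can be listed together with their row counts. This is a minimal sketch, assuming it is run from inside CPTAC3_CCRC_tutorial and that each .tsv file has a single header line; the helper name tsv_rows is hypothetical.

```shell
# Number of data rows in a TSV file, not counting the header line.
tsv_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Print each .tsv file under the workspace with its row count.
find . -type f -name '*.tsv' | sort | while IFS= read -r f; do
    printf '%s\t%s\n' "$f" "$(tsv_rows "$f")"
done
```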