# Detailed description of software protocol
## A fully reproducible one-stop-shop for the analysis of iTRAQ/TMT data

This file documents the IsoLabeled protocol. It explains the folder structure, the available parameters and the output files, and gives instructions on how to access the underlying code.

### Folders
*IN*: This folder contains the files necessary to run the example, i.e. the spectra file _iTRAQCancer.mgf_, the human protein sequences (SwissProt) _sp_human.fasta_ and the experimental design _exp_design_example.tsv_. If you want to use this folder for your own analysis, you need to remove the example spectra file.

*OUT*: Fixed output folder for all intermediate and final result files. This folder is linked to the _data_ folder and can therefore be accessed from outside the Docker container.

*data*: Starting IsoLabeledProtocol via the _run.sh_/_run.bat_ scripts automatically maps the current folder (on the host) to _data_. Placing the spectra and sequence files into this folder greatly simplifies the analysis.

*bin*: Here you can find the executables of [SearchGUI](http://compomics.github.io/projects/searchgui.html) and [PeptideShaker](http://compomics.github.io/projects/peptide-shaker.html).

*Scripts*: Python scripts to handle parameter selection, execution of tools and R scripts

### Parameters
#### Folders and database search
*Folder for spectra files (files need to be mgf) and fasta database:* 
The folders _IN_ and _data_ can be selected. The latter is automatically mapped to the directory from which IsoLabeledProtocol is called (when using _run.sh_ or _run.bat_). The selected folder should contain the spectra files and a database of protein sequences in a FASTA file.
Raw files (e.g. Thermo raw files) can be converted to mgf with the msconvert tool of [ProteoWizard](http://proteowizard.sourceforge.net/tools.shtml). This requires a Windows computer. Files from different MS runs should be organized in folders as illustrated in Figure 1.


![Illustration of folder structure](misc/ExperimentalDesigns.svg)
**Figure 1**: Illustration of experimental designs and folder structure.
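
Before starting a search, it can help to check what the chosen folder actually contains. The following is a minimal sketch and not part of IsoLabeledProtocol: the function name `summarize_input_folder` is hypothetical, and it assumes that mgf files may sit either directly in the folder or in subfolders (one per MS run), with the FASTA file at the top level. Figure 1 remains the reference for the supported layouts.

```python
from pathlib import Path


def summarize_input_folder(folder):
    """Group mgf files by their subfolder and list top-level FASTA files.

    Hypothetical helper for illustration only. Returns a tuple
    (runs, fastas), where runs maps a relative subfolder name
    ("." for the top level) to the mgf file names found in it.
    """
    root = Path(folder)
    runs = {}
    for mgf in sorted(root.rglob("*.mgf")):
        rel = mgf.parent.relative_to(root).as_posix()
        runs.setdefault(rel, []).append(mgf.name)
    fastas = sorted(p.name for p in root.glob("*.fasta"))
    return runs, fastas
```

Calling `summarize_input_folder("IN")` on the example folder should report the example spectra file and _sp_human.fasta_, provided the files are still in place.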



#### Database search
*Fasta file*: Select one of the FASTA files in the folder given above.
The sequence headers of the file should follow a [UniProt](http://uniprot.org)-like format. See the [SearchGUI documentation](https://github.com/compomics/searchgui/wiki/DatabaseHelp) for more details on suitable FASTA formats. Do not provide decoy sequences, as they will be created automatically.
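
To spot headers that deviate from the UniProt style before running a search, a quick check along the following lines may be useful. This is only a sketch: the regular expression covers the common `>sp|ACCESSION|ENTRY_NAME ...` and `>tr|...` forms and is stricter than what SearchGUI may actually accept, and the function name `check_fasta_headers` is hypothetical.

```python
import re

# UniProt-style header: >db|Accession|EntryName followed by a free-text description.
# Assumption: only the "sp" (SwissProt) and "tr" (TrEMBL) database tags are expected.
UNIPROT_HEADER = re.compile(r"^>(sp|tr)\|([A-Z0-9-]+)\|(\w+)")


def check_fasta_headers(lines):
    """Return all header lines that do not match the UniProt-like pattern."""
    bad = []
    for line in lines:
        if line.startswith(">") and not UNIPROT_HEADER.match(line):
            bad.append(line.strip())
    return bad
```

For example, passing the lines of a FASTA file (`check_fasta_headers(open(path))`) returns an empty list when every header follows the pattern.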

*Precursor tolerance (ppm)*: MS tolerance in parts per million

*Fragment ion tolerance (da)*: MSn tolerance in Dalton
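
The two tolerances use different units: a ppm tolerance scales with the measured m/z, whereas a Dalton tolerance is an absolute window. The sketch below shows the standard conversion, tolerance in Da = m/z × ppm / 10^6. The function names are illustrative and not taken from the protocol's scripts.

```python
def ppm_to_da(mz, ppm):
    """Convert a relative tolerance in ppm into an absolute window in Da at a given m/z."""
    return mz * ppm / 1e6


def within_ppm(observed_mz, theoretical_mz, tol_ppm):
    """Check whether an observed m/z lies within tol_ppm of the theoretical m/z."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= tol_ppm
```

At m/z 1000, a precursor tolerance of 10 ppm therefore corresponds to a window of about ±0.01 Da.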

*Number of miscleavages*: Maximum number of missed cleavages allowed per peptide in the database search.

*Further fixed modifications*: Select Carbamidomethylation or None. The iTRAQ and TMT labels are added separately as fixed modifications.

*Further variable modifications (Hold Ctrl to select multiple)*: Select variable modifications for the database search. Modified sequences will be used in the quantification and summarization of the proteins.

*Target (protein, peptide, and PSM) FDR*: False discovery rate threshold applied to the protein, peptide and PSM identifications.
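
As background, the general target-decoy idea behind an FDR threshold can be sketched as follows. This is a simplified illustration and not the validation procedure implemented in PeptideShaker: it assumes that each PSM carries a single score where higher is better, estimates the FDR as the number of decoy hits divided by the number of target hits above a cutoff, and ignores tied scores. The function name `fdr_threshold` is hypothetical.

```python
def fdr_threshold(psms, target_fdr):
    """Find the lowest score cutoff whose estimated FDR is at most target_fdr.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    Returns the cutoff score, or None if no cutoff satisfies the target.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least `score`.
        if targets and decoys / targets <= target_fdr:
            cutoff = score
    return cutoff
```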

*Quantification method*: Labelling method used in the experiment (iTRAQ/TMT)

*Number of different conditions*: Number of different types of samples (e.g. time points, treatments, cell types). This parameter is needed to assign sample types to the different labeling channels.


#### Experimental design
This tab lets you assign the previously defined number of experimental conditions to the channels of the MS runs. The default labels can be changed. Pressing the _Save design_ button writes the entire experimental design into the table _exp_design.tsv_, which is located in the _OUT_ folder.
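
For readers who prefer to prepare a design table programmatically, the sketch below writes a tab-separated table from a list of assignments. The column names `run`, `channel` and `condition` are hypothetical and chosen for illustration only; the actual layout of _exp_design.tsv_ is defined by the protocol, so consult _exp_design_example.tsv_ in the _IN_ folder for the real format.

```python
import csv
import io


def design_to_tsv(assignments):
    """Serialize (run, channel, condition) tuples as tab-separated text.

    The three-column layout is an assumption made for this example
    and may not match the file written by the protocol.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["run", "channel", "condition"])
    writer.writerows(assignments)
    return buf.getvalue()
```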


#### Quantitative analysis
*Summarization method*: Select one of the following methods, provided by the Bioconductor package [MSnbase](https://bioconductor.org/packages/release/bioc/html/MSnbase.html), to summarize peptide/PSM quantifications into protein abundance changes:

- iPQF: feature-based weighting of peptide spectra according to this [paper](https://www.ncbi.nlm.nih.gov/pubmed/26589272)
- Average over all PSMs (on log-scale)
- Median over all PSMs (on log-scale)
- Median over all PSMs after outlier removal (on log-scale)
- Robust summarization using the R function `rlm` (on log-scale)
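
To illustrate what summarizing "on log-scale" means for the two simplest options, the sketch below averages or takes the median of log2-transformed PSM intensities of one protein in one channel. It is written in Python for illustration only; the protocol itself relies on the R implementations in MSnbase, and the function name `summarize_protein` as well as the choice to drop non-positive intensities are assumptions.

```python
import math
import statistics


def summarize_protein(psm_intensities, method="median"):
    """Summarize the PSM intensities of one protein on the log2 scale.

    Non-positive intensities are skipped because their logarithm is undefined
    (an assumption of this sketch, not necessarily what MSnbase does).
    """
    logs = [math.log2(x) for x in psm_intensities if x > 0]
    if not logs:
        raise ValueError("no positive intensities to summarize")
    if method == "mean":
        return statistics.mean(logs)
    if method == "median":
        return statistics.median(logs)
    raise ValueError(f"unknown method: {method}")
```

The median is less sensitive to a single extreme PSM than the mean, which is one reason to prefer it when outliers are expected.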


*Minimum number of PSMs per protein*: Proteins or protein groups will only be quantified when at least this number of PSMs (e.g. 2) is available. This will be extended to unique peptides in the future.
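
The effect of this filter can be sketched as a simple count per protein. The function name and the input format (one protein or protein group identifier per PSM) are assumptions made for this example and do not reflect the actual R implementation.

```python
from collections import Counter


def quantifiable_proteins(psm_proteins, min_psms=2):
    """Return the proteins supported by at least min_psms PSMs.

    psm_proteins: one protein (or protein group) identifier per PSM.
    """
    counts = Counter(psm_proteins)
    return sorted(p for p, n in counts.items() if n >= min_psms)
```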

*Use PTMs for quantification*: Include peptides carrying variable PTMs, such as oxidations, in the protein summarization.



### Output files and figures


### Code structure and scripts
#### Access

#### Parameter interface

#### R scripts
