# Table of Contents
* [1 Introduction](#Introduction)
    * [1.1 Generating base-FASTA with intensities](#Generating-base-FASTA-with-intensities)
    * [1.2 Simulating and quantifying the data](#Simulating-and-quantifying-the-data)
    * [1.3 Generating intensity matrix](#Generating-intensity-matrix)
    * [1.4 Dependencies](#Dependencies)
* [2 Prepare for simulation](#Prepare-for-simulation)
    * [2.1 Setup](#Setup)
    * [2.2 Prepare design matrix](#Prepare-design-matrix)
        * [2.2.1 Setup variables with sample names](#Setup-variables-with-sample-names)
    * [2.3 Generate intensity-annotated FASTA](#Generate-intensity-annotated-FASTA)
* [3 Simulate MS-data](#Simulate-MS-data)
* [4 Quantification](#Quantification)
    * [4.1 Prepare quantity matrix](#Prepare-quantity-matrix)

# Introduction

## Generating base-FASTA with intensities

The purpose of this notebook is to generate a simulated label-free quantification proteomics dataset. This is done by first generating a FASTA file with headers in the following format:

```
>sp|P75925|C56I_ECOLI [# intensity=974548.6052639355 #]
MSFTNTPERYGVISAAFHWLSAIIVYGMFALGLWMVTLSYYDGWYHKAPELHKSIGILLM
```
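
To make the header format concrete, here is a minimal Python sketch of how such an annotation could be read back. The regular expression and the function name `parse_header` are illustrative assumptions, not part of `lfqtk.py` or OpenMS.

```python
import re

# Pattern for the "[# intensity=<number> #]" annotation shown above.
# The exact grammar accepted by downstream tools is an assumption here.
INTENSITY_RE = re.compile(r"\[#\s*intensity=([0-9.eE+-]+)\s*#\]")


def parse_header(header):
    """Return (entry_id, intensity) from an intensity-annotated FASTA header."""
    match = INTENSITY_RE.search(header)
    if match is None:
        raise ValueError(f"no intensity annotation in: {header!r}")
    # The entry ID is the first whitespace-delimited token after ">".
    entry_id = header.lstrip(">").split()[0]
    return entry_id, float(match.group(1))


entry_id, intensity = parse_header(
    ">sp|P75925|C56I_ECOLI [# intensity=974548.6052639355 #]"
)
print(entry_id, intensity)
```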

The FASTA file is generated from two separate FASTA files: entries from the first are used as background, and entries from the second are used as spike-in. The command `lfqtk.py generate_spikein_set` varies the intensities of the entries according to the following rules:

* A base-line intensity is set (on protein level)
* For the spike-in data intensities are set based on the base intensity times a sample-specific fold change
* The batch effect is applied as a constant intensity shift for all proteins in particular samples
* A random noise factor is applied

Because this is performed at the protein level, intensities will vary between peptides.
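
The rules above can be sketched as a small Python function. This is only one plausible reading: the additive combination of the terms, the Gaussian noise, and the name `simulate_intensity` are assumptions for illustration, and `lfqtk.py` may combine them differently.

```python
import random


def simulate_intensity(base_int, fold, offset_mean, offset_fold, noise_std, rng):
    """Hypothetical model: scaled base intensity + batch shift + Gaussian noise."""
    signal = base_int * fold                 # fold is 1 for background proteins
    batch_shift = offset_mean * offset_fold  # 0 or 1 switches the batch shift on
    noise = rng.gauss(0.0, noise_std)        # assumed additive noise term
    return signal + batch_shift + noise


rng = random.Random(0)
# A spike-in protein with a two-fold change in a batch-shifted sample.
# All numbers are made up for the example.
print(simulate_intensity(1_000_000, 2, 500_000, 1, 200_000, rng))
```

With `noise_std` set to zero the result is fully determined by the other terms, which makes it easy to check the expected intensity of a given sample by hand.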

## Simulating and quantifying the data

Simulation of the data is performed using the OpenMS tool `MSSimulator`. The simulation is run with default settings, and the output is written in `featureXML` format. The features in these files are linked to the identities of the entries used to generate them.

The quantification is performed using the OpenMS tool `ProteinQuantifier`, which is run with default settings. Quantification is only performed at the peptide level.

## Generating intensity matrix

The final step is to merge the resulting tab-delimited files from the quantification step into a matrix representing the intensity of each peptide in each sample. This is performed with a custom script, `combine_quant_pd.py`, which simply links intensities that share the same peptide and protein identification. This is reasonable here because we know that the identifications in the simulated data are correct and unique.
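
The linking step can be illustrated with a short Python sketch that joins per-sample tables on the (peptide, protein) pair. It is not the code of `combine_quant_pd.py`; the function name `combine_quant`, the input layout, and the intensities are made up for the example.

```python
from collections import defaultdict


def combine_quant(sample_tables):
    """Merge {sample: {(peptide, protein): intensity}} into one matrix.

    Returns (sample_names, rows), where each row is
    (peptide, protein, [intensity or None for each sample]).
    """
    samples = sorted(sample_tables)
    merged = defaultdict(dict)
    for sample, table in sample_tables.items():
        for key, intensity in table.items():
            merged[key][sample] = intensity
    rows = [
        (peptide, protein, [merged[(peptide, protein)].get(s) for s in samples])
        for peptide, protein in sorted(merged)
    ]
    return samples, rows


# Made-up intensities for two samples; LCR is missing in b1.
tables = {
    "a1": {("FCR", "protA"): 10.0, ("LCR", "protB"): 4.0},
    "b1": {("FCR", "protA"): 12.0},
}
samples, rows = combine_quant(tables)
for peptide, protein, values in rows:
    print(peptide, protein, values)
```

Keying on the pair rather than on the peptide alone keeps identical peptide sequences from different proteins in separate rows.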

## Dependencies

To run this notebook, the following dependencies are required. For demonstration purposes, resulting data files are included in the repository meaning that the analysis itself can be tested independently without running through this notebook.

* OpenMS binaries in PATH (version 2.2.0)
* Jupyter Bash kernel (https://github.com/takluyver/bash_kernel)
* GNU parallel (https://www.gnu.org/software/parallel/)

# Prepare for simulation

## Setup

```bash
threads=3
run=example_data_test
# -p avoids an error if the directory already exists when the cell is re-run
mkdir -p "${run}"
```

## Prepare design matrix

This tab-separated file contains information about each sample that will be generated: which spike-in condition it belongs to and in which batch it was processed.

This design matrix is also used in the analysis step together with the resulting intensity matrix.

```bash
echo -e "name\tbiorepgroup\ttechrepgroup\tcondition\tbatch" > ${run}/design.tsv
echo -e "a1\t1\t1\ta\t1" >> ${run}/design.tsv
echo -e "a2\t2\t1\ta\t1" >> ${run}/design.tsv
echo -e "a3\t3\t1\ta\t1" >> ${run}/design.tsv
echo -e "a4\t4\t1\ta\t2" >> ${run}/design.tsv
echo -e "a5\t5\t1\ta\t2" >> ${run}/design.tsv
echo -e "a6\t6\t1\ta\t2" >> ${run}/design.tsv
echo -e "b1\t1\t1\tb\t1" >> ${run}/design.tsv
echo -e "b2\t2\t1\tb\t1" >> ${run}/design.tsv
echo -e "b3\t3\t1\tb\t1" >> ${run}/design.tsv
echo -e "b4\t4\t1\tb\t2" >> ${run}/design.tsv
echo -e "b5\t5\t1\tb\t2" >> ${run}/design.tsv
echo -e "b6\t6\t1\tb\t2" >> ${run}/design.tsv
```

```bash
cat ${run}/design.tsv
```

```
name	biorepgroup	techrepgroup	condition	batch
a1	1	1	a	1
a2	2	1	a	1
a3	3	1	a	1
a4	4	1	a	2
a5	5	1	a	2
a6	6	1	a	2
b1	1	1	b	1
b2	2	1	b	1
b3	3	1	b	1
b4	4	1	b	2
b5	5	1	b	2
b6	6	1	b	2
```
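
As an illustration of how the design matrix can be consumed in a later analysis step, the following Python sketch groups sample names by condition and batch. The function name `group_samples` is hypothetical, and only an excerpt of the table is inlined so that the sketch is self-contained.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the design matrix, inlined so the sketch is self-contained.
DESIGN_TSV = (
    "name\tbiorepgroup\ttechrepgroup\tcondition\tbatch\n"
    "a1\t1\t1\ta\t1\n"
    "a4\t4\t1\ta\t2\n"
    "b1\t1\t1\tb\t1\n"
    "b4\t4\t1\tb\t2\n"
)


def group_samples(handle):
    """Map (condition, batch) to the list of sample names in that group."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[(row["condition"], row["batch"])].append(row["name"])
    return dict(groups)


groups = group_samples(io.StringIO(DESIGN_TSV))
print(groups)
```

Passing `open("example_data_test/design.tsv")` instead of the `io.StringIO` object would read the full file written above.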


### Setup variables with sample names

Strings with the sample names, the comma-delimited FASTA filenames, and a per-sample `batch` variable are generated from the design matrix.

```bash
sample_names=$(cut -f1 ${run}/design.tsv \
    | tr "\n" " " | cut -f2- -d" ")
file_names=$(cut -f1 ${run}/design.tsv \
    | tr "\n" " " | cut -f2- -d" " \
    | sed "s/ /.fa,/g" | sed "s/,$//")
# The batch assignment is the fifth column of the design matrix
# (column 2 is biorepgroup)
batch=$(cut -f5 ${run}/design.tsv | tr "\n" " " \
    | cut -f2- -d" ")

echo ${sample_names}
echo ${file_names}
echo ${batch}
```

```
a1 a2 a3 a4 a5 a6 b1 b2 b3 b4 b5 b6
a1.fa,a2.fa,a3.fa,a4.fa,a5.fa,a6.fa,b1.fa,b2.fa,b3.fa,b4.fa,b5.fa,b6.fa
1 1 1 2 2 2 1 1 1 2 2 2
```


## Generate intensity-annotated FASTA

We generate a FASTA file containing both background and spike-in entries.

```bash
python3 util_scripts/lfqtk_orig/lfqtk.py generate_spikein_set \
    --background_fa data/uniprot_ecoli.pure.fasta \
    --spikein_fa data/uniprot_potato.pure.fasta \
    --offset_mean 500000 \
    --offset_std 0 \
    --base_int 1000000 \
    --noise_std 200000 \
    --back_count 50 \
    --spike_count 10 \
    --spike_folds "1,1,1,1,1,1,2,2,2,2,2,2" \
    --offset_folds "0,0,0,1,1,1,0,0,0,1,1,1" \
    --out_base ${run} \
    --sample_names ${file_names} \
    --verbose
```

```
4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/a1.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/a2.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/a3.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/a4.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/a5.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/a6.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/b1.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/b2.fa
50 entries picked as background, 10 as spike-in
60 entries written to example_data_test/b3.fa
50 entries picked 
```

# Simulate MS-data

Run the OpenMS tool `MSSimulator` on the generated FASTA files with assigned intensities. The output used here is in `featureXML` format, which contains the annotation of the entry from which each feature was derived as well as intensity values that can be used for downstream quantification.

```bash
time parallel -j ${threads} \
"echo \"Processing {}\"
MSSimulator \
    -in ${run}/{}.fa \
    -out_fm ${run}/{}.ground.featureXML" \
    ::: ${sample_names}
```

```
Processing a2
Loading sequence data from example_data_test/a2.fa ...
done (60 protein(s) loaded)
Starting simulation
2017/11/13, 11:21:14: Digest Simulation ... started
2017/11/13, 11:21:14: RT Simulation ... started
2017/11/13, 11:21:16: Predicting RT ... done
2017/11/13, 11:21:16: RT prediction gave 'invalid' results for 1701 peptide(s), making them unobservable.
2017/11/13, 11:21:16:   (List is too big to show)
2017/11/13, 11:21:16: Creating experiment with #501 scans ... done
2017/11/13, 11:21:16: Detectability Simulation ... started
2017/11/13, 11:21:16: Ionization Simulation ... started
esi_impurity_probabilities_[0]: 1
weights[0]: 10
2017/11/13, 11:21:16: Simulating 1956 features
Progress of 'Ionization':
-- done [took 9.56 s (CPU), 9.57 s (Wall)] -- 
2017/11/13, 11:21:26: #Peptides not ionized: 0
2017/11/13, 11:21:26: #Peptides outside mz range: 735
2017/11/13, 11:21:26: Raw MS1 Simulation ... started
2017/11/13, 11:21:26:   Simulating signal for 5459 features ...
Progress of '

-- done [took 4.93 s (CPU), 4.96 s (Wall)] -- 
2017/11/13, 11:21:52: Contaminants out-of-RT-range: 204 / 486
2017/11/13, 11:21:52: Contaminants out-of-MZ-range: 111 / 486
2017/11/13, 11:21:53: Compressed data to grid ... 10118120 --> 9072385 (89%)
2017/11/13, 11:21:53: Adding white noise to spectra ...
2017/11/13, 11:21:53: Adding detector noise to spectra ...
2017/11/13, 11:21:53: Detector noise was disabled.
2017/11/13, 11:21:53: Tandem MS Simulation ... disabled
2017/11/13, 11:21:53: Final number of simulated features: 5493
2017/11/13, 11:21:53: Simulation took 19.386643 seconds
2017/11/13, 11:21:53: Storing simulated features in: example_data_test/a6.ground.featureXML
2017/11/13, 11:21:54: MSSimulator took 19.87 s (wall), 19.82 s (CPU), 0.00 s (system), 19.82 s (user).
Processing a5
Loading sequence data from example_data_test/a5.fa ...
done (60 protein(s) loaded)
Starting simulation
2017/11/13, 11:21:34: Digest Simulation ... started
2017/11/13, 11:21:34: RT Simulation ... started

2017/11/13, 11:22:15: Predicting RT ... done
2017/11/13, 11:22:15: RT prediction gave 'invalid' results for 1696 peptide(s), making them unobservable.
2017/11/13, 11:22:15:   (List is too big to show)
2017/11/13, 11:22:15: Creating experiment with #501 scans ... done
2017/11/13, 11:22:15: Detectability Simulation ... started
2017/11/13, 11:22:15: Ionization Simulation ... started
esi_impurity_probabilities_[0]: 1
weights[0]: 10
2017/11/13, 11:22:15: Simulating 1961 features
Progress of 'Ionization':
-- done [took 11.00 s (CPU), 11.00 s (Wall)] -- 
2017/11/13, 11:22:26: #Peptides not ionized: 0
2017/11/13, 11:22:26: #Peptides outside mz range: 737
2017/11/13, 11:22:26: Raw MS1 Simulation ... started
2017/11/13, 11:22:26:   Simulating signal for 5472 features ...
Progress of 'RawMSSignal':
-- done [took 4.91 s (CPU), 4.92 s (Wall)] -- 
2017/11/13, 11:22:32: Contaminants out-of-RT-range: 204 / 486
2017/11/13, 11:22:32: Contaminants out-of-MZ-range: 111 / 486
2017/11/13, 11:22:33: Compress
```

# Quantification

Each of the `featureXML` files is quantified using the `ProteinQuantifier` tool.

```bash
for xml in ${run}/*.featureXML; do 
    ProteinQuantifier \
        -in ${xml} \
        -peptide_out ${xml%.*}.csv
done
```

```
Processing summary - number of...
...features: 5462 used for quantification, 5462 total (0 no annotation, 0 ambiguous annotation)
...peptides: 1951 quantified, 1951 identified (considering best hits only)
ProteinQuantifier took 1.27 s (wall), 1.25 s (CPU), 0.00 s (system), 1.25 s (user).

Processing summary - number of...
...features: 5459 used for quantification, 5459 total (0 no annotation, 0 ambiguous annotation)
...peptides: 1951 quantified, 1951 identified (considering best hits only)
ProteinQuantifier took 1.30 s (wall), 1.26 s (CPU), 0.00 s (system), 1.26 s (user).

Processing summary - number of...
...features: 5460 used for quantification, 5460 total (0 no annotation, 0 ambiguous annotation)
...peptides: 1951 quantified, 1951 identified (considering best hits only)
ProteinQuantifier took 1.23 s (wall), 1.22 s (CPU), 0.00 s (system), 1.22 s (user).

Processing summary - number of...
...features: 5469 used for quantification, 5469 total (0 no annotation, 0 ambiguous annotation)
```

We obtain the same set of peptides for every sample because the same proteins are picked and cleaved in each of them, even though the intensities vary.
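
The reason the peptide set does not depend on intensity can be shown with a minimal in-silico digestion sketch in Python. It assumes trypsin-like cleavage (after K or R, not before P) with no missed cleavages, which is a simplification and not necessarily what `MSSimulator` does by default. Peptides such as `KFCR` next to `FCR` in the final matrix hint at missed cleavages, which this sketch ignores. The sequence and intensities are made up.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # remaining C-terminal peptide
    return peptides


sequence = "MSFKTNRPERYGK"  # made-up protein sequence

# The same protein with different assigned intensities in two samples.
sample_a = {"protX": (sequence, 1.0e6)}
sample_b = {"protX": (sequence, 2.0e6)}

peptides_a = {p for seq, _ in sample_a.values() for p in tryptic_digest(seq)}
peptides_b = {p for seq, _ in sample_b.values() for p in tryptic_digest(seq)}

print(tryptic_digest(sequence))
print(peptides_a == peptides_b)
```

Because the digestion only looks at the sequence, changing the intensity attached to a protein cannot change which peptides it yields.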

## Prepare quantity matrix

The resulting quantities for the individual samples are combined into a single feature/sample-intensity matrix.

```bash
python3 util_scripts/combine_quant_pd.py \
    --dfs ${run}/*.csv \
    --out_fp ${run}/full_quant.tsv
```

```bash
wc -l ${run}/full_quant.tsv
head ${run}/full_quant.tsv
```

```
1943 example_data_test/full_quant.tsv
peptide	protein	a1	a2	a3	a4	a5	a6	b1	b2	b3	b4	b5	b6
FCR	sp|P0A7T7|RS18_ECOLI	157047800.0	149562700.0	133145904.0	179230100.0	181699200.0	174635900.0	177382000.0	113099000.0	103921296.0	189102000.0	222760704.0	316437312.0
LCR	sp|Q9JMR4|YUBK_ECOLI	44536000.0	38658100.0	36623500.0	42959800.0	69944600.0	63120700.0	35776100.0	28235900.0	69595200.0	65605100.0	69165000.0	94251696.0
FCQR	sp|P39357|YJHF_ECOLI	167540500.0	113365000.0	144550600.0	175582592.0	191633700.0	169260400.0	121626100.0	123871400.0	172349400.0	148213900.0	162357400.0	172989600.0
FLFK	sp|P34094|PHYB_SOLTU	88948800.0	113657704.0	149669700.0	164628100.0	232585100.0	181982000.0	227503600.0	205573900.0	197698392.0	273151500.0	266329100.0	324701600.0
FYLS	sp|A5A617|YDGU_ECOLI	84306000.0	89692600.0	51485500.0	99713000.0	113209000.0	141666000.0	96567504.0	118839000.0	46758200.0	140686000.0	130185000.0	46913400.0
KFCR	sp|P0A7T7|RS18_ECOLI	58980190.0	53195230.0	52687850.0	125105900.0	116526200.0
```