# Table of Contents

* [1 Introduction](#Introduction)
    * [1.1 Generating base-FASTA with intensities](#Generating-base-FASTA-with-intensities)
    * [1.2 Simulating and quantifying the data](#Simulating-and-quantifying-the-data)
    * [1.3 Generating intensity matrix](#Generating-intensity-matrix)
    * [1.4 Dependencies](#Dependencies)
* [2 Prepare for simulation](#Prepare-for-simulation)
    * [2.1 Setup](#Setup)
    * [2.2 Prepare design matrix](#Prepare-design-matrix)
        * [2.2.1 Setup variables with sample names](#Setup-variables-with-sample-names)
    * [2.3 Generate intensity-annotated FASTA](#Generate-intensity-annotated-FASTA)
* [3 Simulate MS-data](#Simulate-MS-data)
* [4 Quantification](#Quantification)
    * [4.1 Prepare quantity matrix](#Prepare-quantity-matrix)

# Introduction

## Generating base-FASTA with intensities

The purpose of this notebook is to generate a simulated label-free quantification proteomics dataset. This is done by first generating a FASTA file with headers in the following format:

```
>sp|P75925|C56I_ECOLI [# intensity=974548.6052639355 #]
MSFTNTPERYGVISAAFHWLSAIIVYGMFALGLWMVTLSYYDGWYHKAPELHKSIGILLM
```
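
To make the header format concrete, here is a minimal Python sketch that splits such a header into its identifier and intensity. The function name `parse_header` and the regular expression are illustrative assumptions; the source only shows the header layout, not how the tools parse it.

```python
import re

# Pattern for the "[# intensity=<number> #]" annotation shown above.
# The exact grammar accepted by the downstream tools is an assumption.
INTENSITY_RE = re.compile(r"\[#\s*intensity=([0-9.eE+-]+)\s*#\]")


def parse_header(header):
    """Split an annotated FASTA header into (identifier, intensity)."""
    match = INTENSITY_RE.search(header)
    if match is None:
        raise ValueError(f"no intensity annotation in: {header!r}")
    # Everything before the annotation, minus the leading '>', is the identifier.
    identifier = header[:match.start()].lstrip(">").strip()
    return identifier, float(match.group(1))


header = ">sp|P75925|C56I_ECOLI [# intensity=974548.6052639355 #]"
identifier, intensity = parse_header(header)
print(identifier, intensity)
```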

The FASTA is generated from two separate FASTA files, where entries from the first are used as background and entries from the second are used as spike-in. The command `lfqtk.py generate_spikein_set` varies the intensities according to the following rules:

* A base-line intensity is set (on protein level)
* For the spike-in proteins, intensities are set to the base intensity times a sample-specific fold change
* The batch effect is applied as a constant intensity shift for all proteins in particular samples
* A random noise factor is applied

Because intensities are assigned at the protein level, the resulting intensities will still vary between the peptides of a protein.
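
The rules above can be sketched as a small Python function. This is only one plausible reading of them, not the actual code of `lfqtk.py`: it assumes the noise is additive and Gaussian, that the batch shift is added after the fold change, and that intensities are clamped at zero. The function name and arguments are hypothetical; the numeric values are the ones passed to `lfqtk.py` later in this notebook.

```python
import random


def simulate_intensity(base_int, fold, offset, noise_std, rng):
    """One protein-level intensity under an assumed additive model."""
    noise = rng.gauss(0.0, noise_std)
    # Assumption: clamp at zero so a large negative noise draw
    # cannot produce a negative intensity.
    return max(0.0, base_int * fold + offset + noise)


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
cases = [
    ("background, batch 1", 1, 0),
    ("spike-in, batch 1", 2, 0),
    ("spike-in, batch 2", 2, 8_000_000),
]
for label, fold, offset in cases:
    value = simulate_intensity(1_000_000, fold, offset, 200_000, rng)
    print(f"{label}: {value:.1f}")
```

With the noise set to zero, a spike-in protein with fold 2 in a shifted batch would come out at exactly 2 × 1000000 + 8000000 under this model.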

## Simulating and quantifying the data

Simulation of the data is performed using the OpenMS tool `MSSimulator`. The simulation is run with default settings, and the output is written in `featureXML` format. The features in these files are linked to the identities of the entries used to generate them.

Quantification is performed using the OpenMS tool `ProteinQuantifier`, which is run with default settings. Quantification is only performed at the peptide level.

## Generating intensity matrix

The final step is to merge the tab-delimited files from the quantification step into a matrix of intensities, with one row per peptide and one column per sample. This is performed with a custom script, `combine_quant_pd.py`, which simply links intensities that share the same peptide and protein identification. This is reasonable here because the data are simulated, so we know that the identifications are correct and unique.
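
The merge can be illustrated with a short standard-library Python sketch. It is not the code of `combine_quant_pd.py`; the function name `combine_quant` and the in-memory input layout are assumptions made for the example. The values are the `a1` and `a2` intensities of two peptides from the final matrix printed at the end of this notebook.

```python
from collections import defaultdict


def combine_quant(tables):
    """Merge {sample: {(peptide, protein): intensity}} into one matrix.

    Returns (samples, rows), where each row is
    (peptide, protein, [intensity or None for each sample]).
    """
    samples = list(tables)
    merged = defaultdict(dict)
    for sample, table in tables.items():
        for key, intensity in table.items():
            merged[key][sample] = intensity
    # A peptide missing from a sample gets None in that column.
    rows = [
        (pep, prot, [merged[(pep, prot)].get(s) for s in samples])
        for pep, prot in sorted(merged)
    ]
    return samples, rows


tables = {
    "a1": {("FCR", "sp|P0A7T7|RS18_ECOLI"): 143364100.0,
           ("LCR", "sp|Q9JMR4|YUBK_ECOLI"): 63306100.0},
    "a2": {("FCR", "sp|P0A7T7|RS18_ECOLI"): 138053804.0,
           ("LCR", "sp|Q9JMR4|YUBK_ECOLI"): 35053800.0},
}
samples, rows = combine_quant(tables)
print("\t".join(["peptide", "protein"] + samples))
for pep, prot, values in rows:
    print("\t".join([pep, prot] + [str(v) for v in values]))
```

Keying on the (peptide, protein) pair rather than on the peptide alone keeps identical sequences from different proteins in separate rows.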

## Dependencies

To run this notebook, the following dependencies are required. For demonstration purposes, the resulting data files are included in the repository, so the analysis itself can be tested without running this notebook.

* OpenMS binaries in PATH (version 2.2.0)
* Jupyter Bash kernel (https://github.com/takluyver/bash_kernel)
* GNU parallel (https://www.gnu.org/software/parallel/)
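
Before running the cells, it can help to confirm that the required executables are reachable. The following Python sketch does this with `shutil.which`; the tool list is assembled from the commands invoked in this notebook, and the helper name `missing_tools` is hypothetical. It does not check the OpenMS version.

```python
import shutil

# Executables called by the cells of this notebook.
REQUIRED_TOOLS = ["MSSimulator", "ProteinQuantifier", "parallel", "python3"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools found.")
```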

# Prepare for simulation

## Setup

In [1]:
threads=7
run=batch_harddata
mkdir -p ${run}

## Prepare design matrix

This tab-separated file describes each sample that will be generated: which spike-in condition it belongs to and in which batch it was processed.

This design matrix is also used in the analysis step together with the resulting intensity matrix.

In [2]:
echo -e "name\tbiorepgroup\ttechrepgroup\tcondition\tbatch" > ${run}/design.tsv
echo -e "a1\t1\t1\ta\t1" >> ${run}/design.tsv
echo -e "a2\t2\t1\ta\t1" >> ${run}/design.tsv
echo -e "a3\t3\t1\ta\t1" >> ${run}/design.tsv
echo -e "a4\t4\t1\ta\t2" >> ${run}/design.tsv
echo -e "a5\t5\t1\ta\t2" >> ${run}/design.tsv
echo -e "b1\t1\t1\tb\t1" >> ${run}/design.tsv
echo -e "b2\t2\t1\tb\t1" >> ${run}/design.tsv
echo -e "b3\t3\t1\tb\t1" >> ${run}/design.tsv
echo -e "b4\t4\t1\tb\t2" >> ${run}/design.tsv
echo -e "b5\t5\t1\tb\t2" >> ${run}/design.tsv


In [3]:
cat ${run}/design.tsv

name	biorepgroup	techrepgroup	condition	batch
a1	1	1	a	1
a2	2	1	a	1
a3	3	1	a	1
a4	4	1	a	2
a5	5	1	a	2
b1	1	1	b	1
b2	2	1	b	1
b3	3	1	b	1
b4	4	1	b	2
b5	5	1	b	2
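
The `condition` and `batch` columns line up with the `--spike_folds` and `--offset_folds` values passed to `lfqtk.py` in the next section. The Python sketch below derives those two strings from the design matrix. The two mapping tables are assumptions inferred from comparing the design with those arguments, and the function name `fold_arguments` is hypothetical.

```python
import csv
import io

# The design matrix printed above, embedded so the sketch is self-contained.
DESIGN_TSV = (
    "name\tbiorepgroup\ttechrepgroup\tcondition\tbatch\n"
    "a1\t1\t1\ta\t1\n"
    "a2\t2\t1\ta\t1\n"
    "a3\t3\t1\ta\t1\n"
    "a4\t4\t1\ta\t2\n"
    "a5\t5\t1\ta\t2\n"
    "b1\t1\t1\tb\t1\n"
    "b2\t2\t1\tb\t1\n"
    "b3\t3\t1\tb\t1\n"
    "b4\t4\t1\tb\t2\n"
    "b5\t5\t1\tb\t2\n"
)

# Assumed mappings: condition -> spike-in fold, batch -> offset multiplier.
SPIKE_FOLD = {"a": 1, "b": 2}
OFFSET_FOLD = {"1": 0, "2": 1}


def fold_arguments(design_text):
    """Build comma-delimited fold strings, one value per sample row."""
    rows = list(csv.DictReader(io.StringIO(design_text), delimiter="\t"))
    spike = ",".join(str(SPIKE_FOLD[row["condition"]]) for row in rows)
    offset = ",".join(str(OFFSET_FOLD[row["batch"]]) for row in rows)
    return spike, offset


spike_folds, offset_folds = fold_arguments(DESIGN_TSV)
print(spike_folds)   # 1,1,1,1,1,2,2,2,2,2
print(offset_folds)  # 0,0,0,1,1,0,0,0,1,1
```

Deriving both strings from the design file keeps the simulation arguments and the later analysis in agreement if the design changes.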


### Setup variables with sample names

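Three strings are derived from the design matrix: a space-delimited list of sample names, a comma-delimited list of the corresponding FASTA filenames, and a space-delimited list stored in `batch`. Note that `batch` is filled from the second column of the design matrix (`biorepgroup`); the `batch` column itself is the fifth.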

In [4]:
sample_names=$(cut -f1 ${run}/design.tsv \
    | tr "\n" " " | cut -f2- -d" ")
file_names=$(cut -f1 ${run}/design.tsv \
    | tr "\n" " " | cut -f2- -d" " \
    | sed "s/ /.fa,/g" | sed "s/,$//")
batch=$(cut -f2 ${run}/design.tsv | tr "\n" " " \
    | cut -f2- -d" ")

echo ${sample_names}
echo ${file_names}
echo ${batch}

a1 a2 a3 a4 a5 b1 b2 b3 b4 b5
a1.fa,a2.fa,a3.fa,a4.fa,a5.fa,b1.fa,b2.fa,b3.fa,b4.fa,b5.fa
1 2 3 4 5 1 2 3 4 5


## Generate intensity-annotated FASTA

For each sample, we generate a FASTA file containing both background and spike-in entries.

In [5]:
python3 util_scripts/lfqtk_orig/lfqtk.py generate_spikein_set \
    --background_fa data/uniprot_ecoli.pure.fasta \
    --spikein_fa data/uniprot_potato.pure.fasta \
    --offset_mean 8000000 \
    --offset_std 0 \
    --base_int 1000000 \
    --noise_std 200000 \
    --back_count 50 \
    --spike_count 10 \
    --spike_folds "1,1,1,1,1,2,2,2,2,2" \
    --offset_folds "0,0,0,1,1,0,0,0,1,1" \
    --out_base ${run} \
    --sample_names ${file_names} \
    --verbose

4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/a1.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/a2.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/a3.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/a4.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/a5.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/b1.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/b2.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/b3.fa
50 entries picked as background, 10 as spike-in
60 entries written to batch_harddata/b4.fa
50 entries picked as background, 10 as spike-

# Simulate MS-data

Run the OpenMS tool `MSSimulator` on the generated FASTA files with assigned intensities. We use the `featureXML` output, which contains, for each feature, the annotation of the entry from which it was derived as well as intensity values that can be used for downstream quantification.

In [6]:
time parallel -j ${threads} \
"echo \"Processing {}\"
MSSimulator \
    -in ${run}/{}.fa \
    -out_fm ${run}/{}.ground.featureXML" \
    ::: ${sample_names}


Processing a2
Loading sequence data from batch_harddata/a2.fa ...
done (60 protein(s) loaded)
Starting simulation
2017/11/13, 12:46:14: Digest Simulation ... started
2017/11/13, 12:46:14: RT Simulation ... started
2017/11/13, 12:46:17: Predicting RT ... done
2017/11/13, 12:46:17: RT prediction gave 'invalid' results for 1700 peptide(s), making them unobservable.
2017/11/13, 12:46:17:   (List is too big to show)
2017/11/13, 12:46:17: Creating experiment with #501 scans ... done
2017/11/13, 12:46:17: Detectability Simulation ... started
2017/11/13, 12:46:17: Ionization Simulation ... started
esi_impurity_probabilities_[0]: 1
weights[0]: 10
2017/11/13, 12:46:17: Simulating 1957 features
Progress of 'Ionization':
-- done [took 10.28 s (CPU), 10.28 s (Wall)] -- 
2017/11/13, 12:46:27: #Peptides not ionized: 0
2017/11/13, 12:46:27: #Peptides outside mz range: 729
2017/11/13, 12:46:27: Raw MS1 Simulation ... started
2017/11/13, 12:46:27:   Simulating signal for 5469 features ...
Progress of 'R

2017/11/13, 12:46:45: Contaminants out-of-RT-range: 204 / 486
2017/11/13, 12:46:45: Contaminants out-of-MZ-range: 111 / 486
2017/11/13, 12:46:47: Compressed data to grid ... 10065756 --> 9008508 (89%)
2017/11/13, 12:46:47: Adding white noise to spectra ...
2017/11/13, 12:46:47: Adding detector noise to spectra ...
2017/11/13, 12:46:47: Detector noise was disabled.
2017/11/13, 12:46:47: Tandem MS Simulation ... disabled
2017/11/13, 12:46:47: Final number of simulated features: 5471
2017/11/13, 12:46:47: Simulation took 32.85211 seconds
2017/11/13, 12:46:47: Storing simulated features in: batch_harddata/a4.ground.featureXML
2017/11/13, 12:46:48: MSSimulator took 33.44 s (wall), 33.31 s (CPU), 0.00 s (system), 33.31 s (user).
Processing b1
Loading sequence data from batch_harddata/b1.fa ...
done (60 protein(s) loaded)
Starting simulation
2017/11/13, 12:46:14: Digest Simulation ... started
2017/11/13, 12:46:14: RT Simulation ... started
2017/11/13, 12:46:17: Predicting RT ... done
2017/11/

2017/11/13, 12:46:50:   (List is too big to show)
2017/11/13, 12:46:50: Creating experiment with #501 scans ... done
2017/11/13, 12:46:50: Detectability Simulation ... started
2017/11/13, 12:46:50: Ionization Simulation ... started
esi_impurity_probabilities_[0]: 1
weights[0]: 10
2017/11/13, 12:46:50: Simulating 1959 features
Progress of 'Ionization':
-- done [took 10.50 s (CPU), 10.51 s (Wall)] -- 
2017/11/13, 12:47:00: #Peptides not ionized: 0
2017/11/13, 12:47:00: #Peptides outside mz range: 739
2017/11/13, 12:47:00: Raw MS1 Simulation ... started
2017/11/13, 12:47:00:   Simulating signal for 5469 features ...
Progress of 'RawMSSignal':
-- done [took 4.89 s (CPU), 4.92 s (Wall)] -- 
2017/11/13, 12:47:06: Contaminants out-of-RT-range: 204 / 486
2017/11/13, 12:47:06: Contaminants out-of-MZ-range: 111 / 486
2017/11/13, 12:47:07: Compressed data to grid ... 10048179 --> 9019062 (89%)
2017/11/13, 12:47:07: Adding white noise to spectra ...
2017/11/13, 12:47:07: Adding detector noise to s

# Quantification

Each of the `featureXML` files is quantified using the `ProteinQuantifier` tool.

In [7]:
for xml in ${run}/*.featureXML; do 
    ProteinQuantifier \
        -in ${xml} \
        -peptide_out ${xml%.*}.csv
done


Processing summary - number of...
...features: 5473 used for quantification, 5473 total (0 no annotation, 0 ambiguous annotation)
...peptides: 1952 quantified, 1952 identified (considering best hits only)
ProteinQuantifier took 1.22 s (wall), 1.19 s (CPU), 0.00 s (system), 1.19 s (user).

Processing summary - number of...
...features: 5469 used for quantification, 5469 total (0 no annotation, 0 ambiguous annotation)
...peptides: 1952 quantified, 1952 identified (considering best hits only)
ProteinQuantifier took 1.21 s (wall), 1.19 s (CPU), 0.00 s (system), 1.19 s (user).

Processing summary - number of...
...features: 5472 used for quantification, 5472 total (0 no annotation, 0 ambiguous annotation)
...peptides: 1952 quantified, 1952 identified (considering best hits only)
ProteinQuantifier took 1.22 s (wall), 1.18 s (CPU), 0.00 s (system), 1.18 s (user).

Processing summary - number of...
...features: 5471 used for quantification, 5471 total (0 no annotation, 0 ambiguous annotation)

All samples yield the same set of peptides because the same proteins are picked and cleaved in every sample, even though their intensities vary.
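
This claim can be checked once the per-sample peptide lists are loaded. The Python sketch below compares the lists as sets; the function name `compare_peptide_sets` and the toy input are illustrative assumptions, and reading the lists from the `ProteinQuantifier` output files is left out.

```python
def compare_peptide_sets(peptides_by_sample):
    """Return (shared, extras): peptides seen in every sample, and per-sample extras."""
    sets = {sample: set(peps) for sample, peps in peptides_by_sample.items()}
    shared = set.intersection(*sets.values())
    extras = {sample: sorted(peps - shared) for sample, peps in sets.items()}
    return shared, extras


# Toy input: peptide sequences as they might be read from two per-sample tables.
peptides_by_sample = {
    "a1": ["FCR", "LCR", "FCQR"],
    "b1": ["FCQR", "FCR", "LCR"],
}
shared, extras = compare_peptide_sets(peptides_by_sample)
print(sorted(shared))
print(extras)
```

If every list in `extras` is empty, all samples contain exactly the same peptides.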

## Prepare quantity matrix

The resulting quantities for the individual samples are combined into a single feature/sample-intensity matrix.

In [8]:
python3 util_scripts/combine_quant_pd.py \
    --dfs ${run}/*.csv \
    --out_fp ${run}/full_quant.tsv

In [9]:
wc -l ${run}/full_quant.tsv
head ${run}/full_quant.tsv

1947 batch_harddata/full_quant.tsv
peptide	protein	a1	a2	a3	a4	a5	b1	b2	b3	b4	b5
FCR	sp|P0A7T7|RS18_ECOLI	143364100.0	138053804.0	183435500.0	1285064960.0	1253592992.0	191191800.0	69824800.0	147808600.0	1276491968.0	1304431008.0
LCR	sp|Q9JMR4|YUBK_ECOLI	63306100.0	35053800.0	64352500.0	460408000.0	466686016.0	70630800.0	58335800.0	70208600.0	500944992.0	615046016.0
FCQR	sp|P39357|YJHF_ECOLI	101330800.0	157452900.0	202996400.0	1325604032.0	1452273056.0	227052292.0	176222800.0	157744900.0	1050848032.0	1091950016.0
FLFK	sp|P34094|PHYB_SOLTU	102128500.0	78585800.0	74083100.0	792976000.0	786582016.0	172088508.0	152445700.0	239726808.0	1574925024.0	1493992928.0
FYLS	sp|A5A617|YDGU_ECOLI	68813296.0	98732200.0	87036600.0	663654976.0	674534016.0	94788400.0	66909500.0	53805200.0	615339008.0	721168000.0
KFCR	sp|P0A7T7|RS18_ECOLI	56455290.0	55010320.0	73233220.0	541506100.0	503624108.0	79413414.0	28419240.0	59655620.0	492054516.0	622462384.0
LFLR	sp|P23840|DIND_ECOLI	136592300.0	120737700.0	855880