# Table of Contents
* [1 Generate pre-processed FASTA files](#Generate-pre-processed-FASTA-files)
    * [1.1 Setup](#Setup)
    * [1.2 Prepare low-spike](#Prepare-low-spike)
    * [1.3 Prepare high-spikes](#Prepare-high-spikes)
* [2 Simulate MS-data](#Simulate-MS-data)
* [3 Post-process data](#Post-process-data)

# Generate pre-processed FASTA files

Generated using the custom-made `lfqtk generate_spikein` script.

Dependencies:

* OpenMS binaries in PATH (version 2.2.0)
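
A quick way to confirm the dependency before running anything is to look each tool up in PATH. This is only a sketch: `missing_tools` is a hypothetical helper, and the tool list is an assumption collected from the commands called in the cells below.

```shell
# Hypothetical helper: print every given command that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "${tool}" > /dev/null 2>&1 || echo "${tool}"
    done
}

# Assumed list, taken from the commands used later in this notebook.
required="lfqtk parallel MSSimulator FeatureFinderCentroided IDMapper \
MapAlignerPoseClustering FeatureLinkerUnlabeledQT TextExporter"

missing=$(missing_tools ${required})
if [ -n "${missing}" ]; then
    echo "Missing from PATH:" ${missing}
else
    echo "All tools found"
fi
```

This only checks presence, not the OpenMS version.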

## Setup

In [1]:
threads=6
run=batch6_pure
mkdir -p ${run}

In [8]:
sample_names_low="l1 l2 l3"
sample_names_high="h1 h2 h3"
samples="${sample_names_low} ${sample_names_high}"

In [9]:
echo -e "name\tbiorepgroup\ttechrepgroup\tcondition" > ${run}/design.tsv
echo -e "l1\t1\t1\tlow" >> ${run}/design.tsv
echo -e "l2\t2\t1\tlow" >> ${run}/design.tsv
echo -e "l3\t3\t1\tlow" >> ${run}/design.tsv
echo -e "h1\t1\t1\thigh" >> ${run}/design.tsv
echo -e "h2\t2\t1\thigh" >> ${run}/design.tsv
echo -e "h3\t3\t1\thigh" >> ${run}/design.tsv

cat ${run}/design.tsv

name	biorepgroup	techrepgroup	condition
l1	1	1	low
l2	2	1	low
l3	3	1	low
h1	1	1	high
h2	2	1	high
h3	3	1	high
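
The same table could also be derived from the sample names instead of being typed row by row. A minimal sketch, assuming names follow the `l<N>` / `h<N>` pattern used here, with `<N>` as the biological replicate and a single technical replicate group; `write_design` is a hypothetical helper.

```shell
# Hypothetical helper: print the design table for the given sample names.
# Assumes "l<N>" means low and "h<N>" means high, with N as the biorepgroup.
write_design() {
    printf 'name\tbiorepgroup\ttechrepgroup\tcondition\n'
    for name in "$@"; do
        case "${name}" in
            l*) condition=low ;;
            h*) condition=high ;;
            *)  echo "Unknown sample prefix: ${name}" >&2; return 1 ;;
        esac
        # ${name#?} strips the first character, leaving the replicate number.
        printf '%s\t%s\t1\t%s\n' "${name}" "${name#?}" "${condition}"
    done
}

write_design l1 l2 l3 h1 h2 h3
```

In this notebook the call would be `write_design ${samples} > ${run}/design.tsv`.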


## Prepare low-spike

In [10]:
for name in ${sample_names_low}; do
    echo "Generating sample: ${name}"
    lfqtk generate_spikein \
        --background_fa data/uniprot_ecoli.pure.fasta \
        --spikein_fa data/uniprot_potato.pure.fasta \
        --output_fa ${run}/${name}.fa \
        --offset_mean 0 \
        --offset_std 0 \
        --back_int 1000000 \
        --back_noise_std 0 \
        --back_count 100 \
        --spike_int 1000000 \
        --spike_noise_std 0 \
        --spike_count 20 \
        --verbose
done

Generating sample: l1
4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
100 entries picked as background, 20 as spike-in
120 entries written to batch6_pure/l1.fa
Generating sample: l2
4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
100 entries picked as background, 20 as spike-in
120 entries written to batch6_pure/l2.fa
Generating sample: l3
4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
100 entries picked as background, 20 as spike-in
120 entries written to batch6_pure/l3.fa
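
The written entry counts can be double-checked by counting FASTA header lines. A small sketch, assuming the output has one `>` header per entry; `count_fasta_entries` is a hypothetical helper and the toy file content is made up.

```shell
# Hypothetical helper: count FASTA records (header lines start with ">").
count_fasta_entries() {
    grep -c '^>' "$1"
}

# Toy file standing in for ${run}/l1.fa; the sequences are made up.
toy_fa=$(mktemp)
printf '>protA\nMKT\nLLV\n>protB\nMSE\n' > "${toy_fa}"
count_fasta_entries "${toy_fa}"
rm -f "${toy_fa}"
```

For the real files, `count_fasta_entries ${run}/l1.fa` should agree with the 120 entries reported above (100 background plus 20 spike-in).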


## Prepare high-spikes

In [11]:
for name in ${sample_names_high}; do
    echo "Generating sample: ${name}"
    lfqtk generate_spikein \
        --background_fa data/uniprot_ecoli.pure.fasta \
        --spikein_fa data/uniprot_potato.pure.fasta \
        --output_fa ${run}/${name}.fa \
        --offset_mean 0 \
        --offset_std 0 \
        --back_int 1000000 \
        --back_noise_std 0 \
        --back_count 100 \
        --spike_int 8000000 \
        --spike_noise_std 0 \
        --spike_count 20 \
        --verbose
done

Generating sample: h1
4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
100 entries picked as background, 20 as spike-in
120 entries written to batch6_pure/h1.fa
Generating sample: h2
4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
100 entries picked as background, 20 as spike-in
120 entries written to batch6_pure/h2.fa
Generating sample: h3
4436 entries loaded from data/uniprot_ecoli.pure.fasta as background
403 entries loaded from data/uniprot_potato.pure.fasta as spike-in
100 entries picked as background, 20 as spike-in
120 entries written to batch6_pure/h3.fa
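
The low and high cells differ only in `--spike_int` (1000000 versus 8000000), and all noise parameters are zero. Assuming simulated intensity scales with that value, the expected spike-in fold change works out as below; `expected_fc` is a hypothetical helper.

```shell
# Hypothetical helper: print the fold change and its log2 for two intensities.
expected_fc() {
    awk -v lo="$1" -v hi="$2" \
        'BEGIN { fc = hi / lo; printf "%g %g\n", fc, log(fc) / log(2) }'
}

# --spike_int for the low and high samples, as set in the cells above.
expected_fc 1000000 8000000
```

This prints `8 3`: an 8-fold change, or 3 on the log2 scale. The background uses the same `--back_int` in both groups, so its expected log2 fold change is 0.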


# Simulate MS-data

Generate simulated MS data using the OpenMS tool `MSSimulator`.

In [12]:
time parallel -j ${threads} \
"echo \"Processing {}\"
MSSimulator \
    -in ${run}/{}.fa \
    -out_fm ${run}/{}.ground.featureXML \
    -out_id ${run}/{}.ground.idXML \
    -out ${run}/{}.mzML \
    -out_pm ${run}/{}.centroided.mzML" \
    ::: ${samples}


Processing h1
Loading sequence data from batch6_pure/h1.fa ...
done (120 protein(s) loaded)
Starting simulation
2017/11/03, 16:24:32: Digest Simulation ... started
2017/11/03, 16:24:32: RT Simulation ... started
2017/11/03, 16:24:36: Predicting RT ... done
2017/11/03, 16:24:36: RT prediction gave 'invalid' results for 2629 peptide(s), making them unobservable.
2017/11/03, 16:24:36:   (List is too big to show)
2017/11/03, 16:24:36: Creating experiment with #501 scans ... done
2017/11/03, 16:24:36: Detectability Simulation ... started
2017/11/03, 16:24:36: Ionization Simulation ... started
esi_impurity_probabilities_[0]: 1
weights[0]: 10
2017/11/03, 16:24:36: Simulating 3284 features
Progress of 'Ionization':
-- done [took 17.97 s (CPU), 17.99 s (Wall)] -- 
2017/11/03, 16:24:54: #Peptides not ionized: 0
2017/11/03, 16:24:54: #Peptides outside mz range: 1214
2017/11/03, 16:24:54: Raw MS1 Simulation ... started
2017/11/03, 16:24:54:   Simulating signal for 9219 features ...
Progress of 'Ra

Processing h2
Loading sequence data from batch6_pure/h2.fa ...
done (120 protein(s) loaded)
Starting simulation
2017/11/03, 16:24:32: Digest Simulation ... started
2017/11/03, 16:24:32: RT Simulation ... started
2017/11/03, 16:24:36: Predicting RT ... done
2017/11/03, 16:24:36: RT prediction gave 'invalid' results for 3305 peptide(s), making them unobservable.
2017/11/03, 16:24:36:   (List is too big to show)
2017/11/03, 16:24:36: Creating experiment with #501 scans ... done
2017/11/03, 16:24:36: Detectability Simulation ... started
2017/11/03, 16:24:36: Ionization Simulation ... started
esi_impurity_probabilities_[0]: 1
weights[0]: 10
2017/11/03, 16:24:36: Simulating 3748 features
Progress of 'Ionization':
-- done [took 27.37 s (CPU), 27.45 s (Wall)] -- 
2017/11/03, 16:25:04: #Peptides not ionized: 0
2017/11/03, 16:25:04: #Peptides outside mz range: 1308
2017/11/03, 16:25:04: Raw MS1 Simulation ... started
2017/11/03, 16:25:04:   Simulating signal for 10616 features ...
Progress of 'R
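
The log above is cut off, so it may be worth confirming that every sample produced all four outputs before moving on. A sketch only: `missing_sim_outputs` is a hypothetical helper, the suffixes are taken from the `MSSimulator` call above, and the demo runs on placeholder files.

```shell
# Hypothetical helper: list expected MSSimulator outputs that are missing
# or empty. Usage: missing_sim_outputs <run_dir> <sample>...
missing_sim_outputs() {
    dir=$1; shift
    for s in "$@"; do
        for suffix in ground.featureXML ground.idXML mzML centroided.mzML; do
            [ -s "${dir}/${s}.${suffix}" ] || echo "${dir}/${s}.${suffix}"
        done
    done
}

# Toy run directory: s1 is complete, s2 lacks its centroided mzML.
demo=$(mktemp -d)
for suffix in ground.featureXML ground.idXML mzML centroided.mzML; do
    echo "placeholder" > "${demo}/s1.${suffix}"
    echo "placeholder" > "${demo}/s2.${suffix}"
done
rm "${demo}/s2.centroided.mzML"
missing_sim_outputs "${demo}" s1 s2
rm -r "${demo}"
```

In this notebook the call would be `missing_sim_outputs ${run} ${samples}`, which prints nothing when all files are in place.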

In [13]:
time parallel -j ${threads} \
"echo \"Processing {}\"
FeatureFinderCentroided \
    -in ${run}/{}.centroided.mzML \
    -out ${run}/{}.featureXML" \
    ::: ${samples}


Processing h1
Progress of 'loading spectra list':
-- done [took 0.15 s (CPU), 0.15 s (Wall)] -- 
Progress of 'Precalculating intensity scores':
-- done [took 0.24 s (CPU), 0.25 s (Wall)] -- 
Progress of 'Precalculating mass trace scores':
-- done [took 0.64 s (CPU), 0.64 s (Wall)] -- 
Progress of 'Precalculating isotope distributions':
-- done [took 0.02 s (CPU), 0.02 s (Wall)] -- 
Progress of 'Calculating isotope pattern scores for charge 1':
-- done [took 1.42 s (CPU), 1.43 s (Wall)] -- 
Progress of 'Finding seeds for charge 1':
-- done [took 0.10 s (CPU), 0.10 s (Wall)] -- 
Found 56822 seeds for charge 1.
Progress of 'Extending seeds for charge 1':
-- done [took 01:04 m (CPU), 01:05 m (Wall)] -- 
Found 12863 feature candidates for charge 1.
Progress of 'Calculating isotope pattern scores for charge 2':
-- done [took 1.58 s (CPU), 1.58 s (Wall)] -- 
Progress of 'Finding seeds for charge 2':
-- done [took 0.15 s (CPU), 0.15 s (Wall)] -- 
Found 30098 seeds for charge 2.
Progress of 'Ex

-- done [took 1.84 s (CPU), 1.85 s (Wall)] -- 
Progress of 'Finding seeds for charge 2':
-- done [took 0.18 s (CPU), 0.18 s (Wall)] -- 
Found 31732 seeds for charge 2.
Progress of 'Extending seeds for charge 2':
-- done [took 29.02 s (CPU), 29.23 s (Wall)] -- 
Found 5171 feature candidates for charge 2.
Progress of 'Calculating isotope pattern scores for charge 3':
-- done [took 2.01 s (CPU), 2.04 s (Wall)] -- 
Progress of 'Finding seeds for charge 3':
-- done [took 0.07 s (CPU), 0.08 s (Wall)] -- 
Found 19775 seeds for charge 3.
Progress of 'Extending seeds for charge 3':
-- done [took 15.74 s (CPU), 15.89 s (Wall)] -- 
Found 2532 feature candidates for charge 3.
Progress of 'Calculating isotope pattern scores for charge 4':
-- done [took 2.15 s (CPU), 2.17 s (Wall)] -- 
Progress of 'Finding seeds for charge 4':
-- done [took 0.03 s (CPU), 0.02 s (Wall)] -- 
Found 9067 seeds for charge 4.
Progress of 'Extending seeds for charge 4':
-- done [took 5.30 s (CPU), 5.30 s (Wall)] -- 
Found 

In [14]:
for sample in ${samples}; do
    echo "Processing sample: ${sample}"
    IDMapper \
        -id ${run}/${sample}.ground.idXML \
        -in ${run}/${sample}.featureXML \
        -out ${run}/${sample}.mapped.featureXML \
        > ${run}/${sample}.mapped.featureXML.log
done

Processing sample: l1
Processing sample: l2
Processing sample: l3
Processing sample: h1
Processing sample: h2
Processing sample: h3


Here, we align the retention times (RT) of the feature maps, now that identities have been mapped onto the features.

In [15]:
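# Build matching input and output lists in the same sample order.
# A glob such as ${run}/*.mapped.featureXML expands alphabetically
# (h1 h2 h3 l1 l2 l3), which would not line up with ${out_strings}.
in_strings=""
out_strings=""
for sample in ${samples}; do
    in_strings="${in_strings} ${run}/${sample}.mapped.featureXML"
    out_strings="${out_strings} ${run}/${sample}.mapped.aligned.featureXML"
done

echo ${out_strings}

MapAlignerPoseClustering \
    -in ${in_strings} \
    -out ${out_strings}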


batch6_pure/l1.mapped.aligned.featureXML batch6_pure/l2.mapped.aligned.featureXML batch6_pure/l3.mapped.aligned.featureXML batch6_pure/h1.mapped.aligned.featureXML batch6_pure/h2.mapped.aligned.featureXML batch6_pure/h3.mapped.aligned.featureXML
Picking a reference (by size) ... done
Progress of 'Aligning input maps':
-- done [took 9.96 s (CPU), 10.01 s (Wall)] -- 
MapAlignerPoseClustering took 11.24 s (wall), 11.16 s (CPU), 0.00 s (system), 11.16 s (user).
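
As far as I know, `MapAlignerPoseClustering` pairs its `-in` and `-out` files by position, so both lists need to be in the same order. The sketch below uses empty placeholder files in a temporary directory to show how a glob orders the files compared with a list built from `${samples}`.

```shell
# Placeholder files only; nothing here calls OpenMS.
demo=$(mktemp -d)
samples="l1 l2 l3 h1 h2 h3"
for s in ${samples}; do : > "${demo}/${s}.mapped.featureXML"; done

# Glob expansion is sorted by file name, so h* comes before l*.
in_glob=$(cd "${demo}" && echo *.mapped.featureXML)

# A list built from ${samples} keeps the order l1 l2 l3 h1 h2 h3.
in_list=""
for s in ${samples}; do in_list="${in_list} ${s}.mapped.featureXML"; done
in_list=${in_list# }

echo "glob order: ${in_glob}"
echo "list order: ${in_list}"
rm -r "${demo}"
```

Building both the input and the output list from `${samples}` keeps each aligned file under the name of the sample it came from.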


In [16]:
FeatureLinkerUnlabeledQT \
    -in ${run}/*.mapped.aligned.featureXML \
    -out ${run}/combined.consensusXML

Progress of 'reading input':
-- done [took 5.29 s (CPU), 5.32 s (Wall)] -- 
Progress of 'linking features':
-- done [took 4.23 s (CPU), 4.24 s (Wall)] -- 
Number of consensus features:
  of size  6:    143
  of size  5:    573
  of size  4:   1705
  of size  3:   3586
  of size  2:   7162
  of size  1:  22032
  total:       35201
FeatureLinkerUnlabeledQT took 16.70 s (wall), 16.60 s (CPU), 0.00 s (system), 16.60 s (user).
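
To put the size distribution in perspective, the counts above can be turned into fractions. The numbers are copied from the linker output; `link_summary` is a hypothetical helper that reads "size count" pairs.

```shell
# Hypothetical helper: summarise "size count" pairs read from stdin.
link_summary() {
    awk '{ n[$1] = $2; total += $2 }
         END {
             printf "total: %d\n", total
             printf "found in all 6 samples: %.1f%%\n", 100 * n[6] / total
             printf "found in 1 sample only: %.1f%%\n", 100 * n[1] / total
         }'
}

# Counts copied from the FeatureLinkerUnlabeledQT output above.
printf '6 143\n5 573\n4 1705\n3 3586\n2 7162\n1 22032\n' | link_summary
```

The sum matches the reported total of 35201. Only about 0.4% of the consensus features are present in all six samples, while about 62.6% are singletons.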


# Post-process data

Export the consensus data with `TextExporter` and convert it to a tabular format suitable for normalization.

In [17]:
TextExporter \
    -in ${run}/combined.consensusXML \
    -out ${run}/combined.linked_features.csv \
    -consensus:features ${run}/combined.features.csv

TextExporter took 3.64 s (wall), 3.56 s (CPU), 0.00 s (system), 3.56 s (user).


For reasons I have not tracked down, the `combined.features.csv` file includes extra empty fields numbered beyond the actual sample range. They are cut away here by keeping only columns 1 to `s_count * 5 + 9`, but this should probably be fixed upstream instead.

In [18]:
s_count=$(echo ${samples} | tr " " "\n" | wc -l)
cut_end=$(echo "${s_count} * 5 + 9" | bc)
cut -f 1-${cut_end} ${run}/combined.features.csv \
    > ${run}/combined.features.sub.csv
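
For the six samples used here, the column cut-off works out as follows. This sketch uses shell arithmetic instead of `bc`; reading the formula as 9 leading columns plus 5 columns per sample is my assumption about the export layout.

```shell
samples="l1 l2 l3 h1 h2 h3"

# Count the samples; the outer $(( )) strips any padding wc may add.
s_count=$(( $(echo ${samples} | wc -w) ))

# Assumed layout: 9 leading columns plus 5 columns per sample.
cut_end=$(( s_count * 5 + 9 ))

echo "samples: ${s_count}, last column kept: ${cut_end}"
```

With six samples this keeps columns 1 to 39.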

In [19]:
util_scripts/openms_to_normalyzer.py \
    -i ${run}/combined.features.sub.csv \
    -o ${run}/combined.final.tsv \
    --design ${run}/design.tsv

Writing dataframe with shape (35203, 13), to batch6_pure/combined.final.tsv
Done!
