# Prepare files for PECAN


Prior to running PECAN you must do the following:

  1. Remove extraneous header text from geoduck gonad transcriptome fasta file (our background proteome)
  2. Add Peptide Retention Time Calibration mixture fasta sequence to background database file
  3. Digest background proteome fasta into peptide fragments using Protein Digestion Simulator  
  4. Convert the isolation scheme file (extracted from the Lumos and sent by Emma) to a .csv file 
  5. Create .txt file with list of paths to all mzML files  

Note: you also need a .txt file listing the name of the background proteome database. This has already been created, and is located in [DNR_Geoduck_DatabasePath.txt](../analyses/DIA/DNR_Geoduck_DatabasePath.txt)

----

## Step 1. Remove extraneous info from background proteome 

I received a protein fasta file from Steven, which will be used as the background proteome in PECAN.  First step is to edit the header data of each protein sequence to remove extraneous text. I do so in the following steps.  The resulting fasta file is saved on Owl, so feel free to download this file and skip to step 3. 

#### Input File: Geoduck gonad transcriptome fasta file [Geoduck-transcriptome-v2.transdecoder.pep](https://raw.githubusercontent.com/sr320/paper-pano-go/52c6b18b5b09e5c3a49250cf47ad4ddc8e9dc004/data-results/Geoduck-transcriptome-v2.transdecoder.pep)
#### Output File: Geoduck gonad transcriptome fasta file with extraneous info removed from each header line [Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep](http://owl.fish.washington.edu/generosa/Generosa_DNR/Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep)

In [14]:
# Let's see what the current header looks like
!head ../../data/DIA/Geoduck-transcriptome-v2.transdecoder.pep

>cds.comp100047_c0_seq1|m.5980 comp100047_c0_seq1|g.5980  ORF comp100047_c0_seq1|g.5980 comp100047_c0_seq1|m.5980 type:internal len:142 (-) comp100047_c0_seq1:3-425(-)
NAECRDLYKIFTQILSVRSQEGKIVIPDEFATKIRNWLGNKEELFKEAHNQKIITFYNEY
TREENTFNPIRGKRPMSVPDMPERKYIDQLSRKTQSQCDFCKYKTFTAEDTFGRIDSNFS
CSASNAFKLDHWHALFLLKTH
>cds.comp100068_c0_seq1|m.5981 comp100068_c0_seq1|g.5981  ORF comp100068_c0_seq1|g.5981 comp100068_c0_seq1|m.5981 type:internal len:106 (-) comp100068_c0_seq1:1-315(-)
LFLDKSGKRICSFNNLTAVIEKATERASRIRLAKGLSQPKYLSCGNVDKVPAPGYLTASF
TQLSVNKTRKDKGRNHLLLWDQTSSYSYIGPGIHYKDGKIRVNTT
>cds.comp100097_c0_seq1|m.5982 comp100097_c0_seq1|g.5982  ORF comp100097_c0_seq1|g.5982 comp100097_c0_seq1|m.5982 type:internal len:227 (+) comp100097_c0_seq1:2-679(+)
GTENLRICLKVIETYLLLGPREFLELYSGDLVHSLSNLLSDLRTEGVLLVLRVIELVLKS
FPTEGPALFKSMLPEFLRAVLNKDEHPVVMSLYLTLFGRIVLQNQEFFWNFLDQMAMESH


In [15]:
# Count how many sequences (header lines containing '>') there are in the fasta file before trimming
! grep -c '>' ../../data/DIA/Geoduck-transcriptome-v2.transdecoder.pep

35951


In [18]:
# Remove extraneous text 
! cut -d " " -f 1 ../../data/DIA/Geoduck-transcriptome-v2.transdecoder.pep > \
../../data/DIA/Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep

In [19]:
# Confirm that I didn't lose any sequences (the header count should still be 35951)
! grep -c '>' ../../data/DIA/Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep

35951


In [20]:
# Preview the edited fasta file 
! head ../../data/DIA/Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep

>cds.comp100047_c0_seq1|m.5980
NAECRDLYKIFTQILSVRSQEGKIVIPDEFATKIRNWLGNKEELFKEAHNQKIITFYNEY
TREENTFNPIRGKRPMSVPDMPERKYIDQLSRKTQSQCDFCKYKTFTAEDTFGRIDSNFS
CSASNAFKLDHWHALFLLKTH
>cds.comp100068_c0_seq1|m.5981
LFLDKSGKRICSFNNLTAVIEKATERASRIRLAKGLSQPKYLSCGNVDKVPAPGYLTASF
TQLSVNKTRKDKGRNHLLLWDQTSSYSYIGPGIHYKDGKIRVNTT
>cds.comp100097_c0_seq1|m.5982
GTENLRICLKVIETYLLLGPREFLELYSGDLVHSLSNLLSDLRTEGVLLVLRVIELVLKS
FPTEGPALFKSMLPEFLRAVLNKDEHPVVMSLYLTLFGRIVLQNQEFFWNFLDQMAMESH
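
The `cut` command above keeps only the first space-delimited field of every line. As a minimal sketch of the same idea in Python (the function names `trim_fasta_headers` and `count_headers` are hypothetical, and the example header is abbreviated from the real one), assuming sequence lines never contain spaces:

```python
from typing import Iterable, Iterator


def trim_fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, keeping only the first space-delimited token of each header."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Same effect as `cut -d " " -f 1` on a header line
            line = line.split(" ", 1)[0]
        yield line


def count_headers(lines: Iterable[str]) -> int:
    """Count sequences, i.e. lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Abbreviated example record (hypothetical shortening of the first header in the file)
example = [
    ">cds.comp100047_c0_seq1|m.5980 comp100047_c0_seq1|g.5980  ORF type:internal len:142 (-)",
    "NAECRDLYKIFTQILSVRSQEGKIVIPDEFATKIRNWLGNKEELFKEAHNQKIITFYNEY",
]
trimmed = list(trim_fasta_headers(example))
print(trimmed[0])  # >cds.comp100047_c0_seq1|m.5980
```

Unlike `grep -c '>'`, `count_headers` only counts lines that begin with `>`; for a well-formed fasta file the two counts should agree.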


--- 

## Step 2. Combine PRTC fasta with transcriptome fasta

We added a standard, the Peptide Retention Time Calibration mixture (PRTC), to each sample before injection during our mass spec run. We need to include the PRTC sequence in our background database so that PECAN assigns the transitions associated with PRTC correctly.

* **Input Files:**
  - Peptide Retention Time Calibration mixture (PRTC) protein sequence, fasta file: [P00000_Pierce_prtc.fasta](../../data/DIA/P00000_Pierce_prtc.fasta)
  - Geoduck transcriptome with trimmed header, from Step 1: [Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep](http://owl.fish.washington.edu/generosa/Generosa_DNR/Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep)  
* **Output File:** Combined geoduck transcriptome + PRTC fasta file: [GeoTranscriptomePRTC.fasta](http://owl.fish.washington.edu/generosa/Generosa_DNR/GeoTranscriptomePRTC.fasta)

In [9]:
# Inspect PRTC fasta; it's short so we can print the whole thing out
! cat ../data/DIA/P00000_Pierce_prtc.fasta

>P00000 Pierce Peptide Retention Time Calibration Mixture
SSAAPPPPPRGISNEGQNASIKHVLTSIGEKDIPVPKPKIGDYAGIKTASEFDSAIAQDKSAAGAFGPELSRELGQSGVDTYLQTKGLILVGGYGTR
GILFVGSGVSGGEEGARSFANQPLEVVYSKLTILEELRNGFILDGFPRELASGLSFPVGFKLSSEAPALFQFDLK


In [16]:
# Inspect the trimmed header geoduck transcriptome fasta; it's super long so let's just look at the tail, since we'll be adding the PRTC sequence to the end of this file
! tail ../data/DIA/Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep

QLHVLNLLVLLLPSVHRDMLEAVLDFLEKVVEHSATNKMSLSNVAMIMAPNLFMSPKVRA
SPPGKTKRAWEIEIKMA
>cds.comp99988_c0_seq1|m.5978
INVNFSRFNESNLSLSGWANSGFHPAIEFECSKPLPLVGVSLFNPCREGEANGTLEVLDK
DKVLICMNVNLVYDASKHYVDVMFQKPIHIDATKRYTLRQTLKGTDLTHGLNGNNVIEDK
GVKVAFFTSNKDTGGSYEVYGQFFGIIFKC*
>cds.comp99988_c0_seq2|m.5979
INVNFSRFNESNLSLSGWANSGFHPAIEFECSKPLPLVGVSLFNPCREGEANGTLEVLDK
DKVLICMNVNLVYDASKHYVDVMFQKPIHIDATKRYTLRQTLKGTDLTHGLNGNNVIEDK
GVKVAFFTSNKDTGGSYEVYGQFFGIIFKC*


In [13]:
# Combine the two files
! cat ../data/DIA/Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep ../data/DIA/P00000_Pierce_prtc.fasta > GeoTranscriptomePRTC.fasta

In [15]:
# Inspect the tail of the resulting file
! tail GeoTranscriptomePRTC.fasta

INVNFSRFNESNLSLSGWANSGFHPAIEFECSKPLPLVGVSLFNPCREGEANGTLEVLDK
DKVLICMNVNLVYDASKHYVDVMFQKPIHIDATKRYTLRQTLKGTDLTHGLNGNNVIEDK
GVKVAFFTSNKDTGGSYEVYGQFFGIIFKC*
>cds.comp99988_c0_seq2|m.5979
INVNFSRFNESNLSLSGWANSGFHPAIEFECSKPLPLVGVSLFNPCREGEANGTLEVLDK
DKVLICMNVNLVYDASKHYVDVMFQKPIHIDATKRYTLRQTLKGTDLTHGLNGNNVIEDK
GVKVAFFTSNKDTGGSYEVYGQFFGIIFKC*
>P00000 Pierce Peptide Retention Time Calibration Mixture
SSAAPPPPPRGISNEGQNASIKHVLTSIGEKDIPVPKPKIGDYAGIKTASEFDSAIAQDKSAAGAFGPELSRELGQSGVDTYLQTKGLILVGGYGTR
GILFVGSGVSGGEEGARSFANQPLEVVYSKLTILEELRNGFILDGFPRELASGLSFPVGFKLSSEAPALFQFDLK


In [17]:
# Count the sequences (header lines) in the resulting file. From Step 1 we know we should have 35951 geoduck + 1 PRTC = 35952 header lines
! grep -c '>' GeoTranscriptomePRTC.fasta

35952
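
One caveat with plain `cat`: if the first file did not end in a newline, the PRTC header would be glued onto the last geoduck sequence line. The following is a minimal Python sketch (the function name `combine_fasta` is hypothetical) that concatenates fasta files while guaranteeing a newline after every line, and returns the header count so it can be checked against the expected total:

```python
from pathlib import Path
from typing import Sequence


def combine_fasta(inputs: Sequence[Path], output: Path) -> int:
    """Concatenate FASTA files into `output`; return the number of header lines written."""
    n_headers = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line.startswith(">"):
                        n_headers += 1
                    # Always terminate the line, even if the input file lacked a final newline
                    out.write(line + "\n")
    return n_headers


# Example call with the file names used in this step (paths are assumptions):
# n = combine_fasta(
#     [Path("Geoduck-transcriptome-v2.transdecoder_TrimmedHeadr.pep"),
#      Path("P00000_Pierce_prtc.fasta")],
#     Path("GeoTranscriptomePRTC.fasta"),
# )
# assert n == 35952
```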


---

## Step 3. Digest background fasta into peptides

The last step before running PECAN is to perform a tryptic digest of our geoduck transcriptome + PRTC fasta file (generated in Step 2) _in silico_. We will use the Protein Digestion Simulator (PDS) program, which breaks the protein sequences in our fasta into peptide fragments, just as trypsin did during our sample prep. The resulting file will constitute the "background database" used to identify peptides measured in the DIA data. 

* **Input file:** Combined geoduck transcriptome + PRTC fasta file: [GeoTranscriptomePRTC.fasta](http://owl.fish.washington.edu/generosa/Generosa_DNR/GeoTranscriptomePRTC.fasta)
* **Output File:** Combined geoduck transcriptome + PRTC fasta file, tryptic digested into peptides [GeoTranscriptomePRTC_digested_Mass400to6000.txt](http://owl.fish.washington.edu/generosa/Generosa_DNR/GeoTranscriptomePRTC_digested_Mass400to6000.txt)

### Software Needed: 
 * Command/Terminal window in Windows
 * [Protein Digestion Simulator](https://omics.pnl.gov/software/protein-digestion-simulator); version used, installed on Woodpecker: 2.2.6471.25262 
![PDS about](../../images/PDS00.png)

#### The following are screen shots showing settings used in PDS; note that the input file is the combined transcriptome + PRTC fasta, which we generated in Step 2 of this notebook. 

#### Tab 1:
![PDS tab 1](../../images/PDS01.png)

#### Tab 2:
![PDS tab 2](../../images/PDS02.png)

#### Tab 3:
![PDS tab 3](../../images/PDS03.png)

#### Tab 4:
![PDS tab 4](../../images/PDS04.png)

#### Go back to Tab 2 to execute the digestion: 
![PDS pushing "go"](../../images/PDS06.png)

#### The digestion took ~30 minutes; you can watch the % complete on the Progress tab: 
![PDS progress](../../images/PDS07.png)

#### When complete a box will pop up describing how many proteins were processed. We know we should have 35952 proteins (from step 2 of this notebook):
![PDS complete](../../images/PDS08.png)
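
To illustrate what an _in silico_ tryptic digest does, here is a minimal Python sketch. It is not the PDS implementation: it assumes the commonly used trypsin rule (cleave after K or R, except when the next residue is P), allows no missed cleavages, and applies no mass filter, whereas the actual PDS settings are those shown in the screen shots above. The function name `tryptic_digest` is hypothetical.

```python
from typing import List


def tryptic_digest(sequence: str) -> List[str]:
    """Split a protein sequence after K or R, unless the next residue is P."""
    seq = sequence.rstrip("*")  # drop a trailing stop-codon marker, if present
    peptides = []
    start = 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        # Remaining C-terminal fragment that does not end in K or R
        peptides.append(seq[start:])
    return peptides


# Beginning of the PRTC sequence from Step 2
prtc_fragment = "SSAAPPPPPRGISNEGQNASIKHVLTSIGEKDIPVPKPKIGDYAGIK"
peptides = tryptic_digest(prtc_fragment)
print(peptides)
# ['SSAAPPPPPR', 'GISNEGQNASIK', 'HVLTSIGEK', 'DIPVPKPK', 'IGDYAGIK']
```

Note how `DIPVPKPK` stays intact: its internal K is followed by P, so no cleavage occurs there under this rule.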

---
## Step 4. Convert isolation scheme text file to .csv

Emma sent us the isolation scheme file, extracted from the Lumos. To use it in PECAN we need it in .csv format.

* **Input File:** [2017_January_23_envtstress_geoduck1_isoscheme.txt](../data/DIA/2017_January_23_envtstress_geoduck1_isoscheme.txt)
* **Output File:** [DNR_Geoduck_IsolationScheme.csv](../data/DIA/DNR_Geoduck_IsolationScheme.csv)

To convert, I simply opened the .txt file in Excel, then saved as .csv and re-named:
![save-as csv](../images/Isolation-scheme-csv.png)
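
If you would rather not round-trip through Excel, the same conversion can be scripted. This is a minimal sketch that assumes the isolation scheme .txt file is tab-delimited (I have not verified the delimiter; pass a different `delimiter` if needed). The function name `text_to_csv` is hypothetical.

```python
import csv
from pathlib import Path


def text_to_csv(txt_path: Path, csv_path: Path, delimiter: str = "\t") -> int:
    """Rewrite a delimited text file as comma-separated values; return the row count."""
    n_rows = 0
    with open(txt_path, newline="") as src, open(csv_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)  # delimiter is an assumption
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            n_rows += 1
    return n_rows


# Example call with the file names used in this step (paths are assumptions):
# text_to_csv(
#     Path("2017_January_23_envtstress_geoduck1_isoscheme.txt"),
#     Path("DNR_Geoduck_IsolationScheme.csv"),
# )
```

It is worth opening the resulting .csv once to confirm that each isolation window ended up in its own row with the expected number of columns.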

---
## Step 5. Create a .txt file with list of paths to all mzML files

PECAN needs a list of the DIA data files (in .mzML format) in order to execute the run, so we need to create a .txt file that lists all .mzML file names. 

* **Input File:** R-script to extract .mzML file names and create a .txt file: [Script01-File-path-for-PECAN.R](https://github.com/RobertsLab/Paper-DNR-Geoduck-Proteomics/raw/master/analyses/DIA/Script01-File-path-for-PECAN.R)
* **Output File:** Text file: [DNR_Geoduck_mzMLpath.txt](../analyses/DIA/2017-Geoduck-DIA-raw/DNR_Geoduck_mzMLpath.txt)   
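
The linked R script does this job; for readers working in Python, a rough equivalent might look like the sketch below. It is not a translation of the R script. It assumes all .mzML files sit directly in one directory and that one absolute path per line is an acceptable format for the list; the function name `write_mzml_path_list` is hypothetical.

```python
from pathlib import Path
from typing import List


def write_mzml_path_list(data_dir: Path, output_txt: Path) -> List[Path]:
    """Write one absolute .mzML path per line to `output_txt`; return the paths found."""
    # Sorted so the list is reproducible between runs
    paths = sorted(p.resolve() for p in Path(data_dir).glob("*.mzML"))
    Path(output_txt).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Example call (directory and output names are assumptions):
# found = write_mzml_path_list(Path("2017-Geoduck-DIA-raw"), Path("DNR_Geoduck_mzMLpath.txt"))
# print(len(found), "mzML files listed")
```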

---

### You are now ready to move to Notebook 03, [Building a Spectral Library with PECAN](../notebooks/DIA/03-Building%20Spectral%20Library%20with%20PECAN%20.ipynb)