### _De novo_ peptide sequencing with PEAKS ###

Last year, Rick used [PEAKS](http://www1.bioinfor.com/peaks/features/denovo.html), a _de novo_ + database searching software package designed by Bin Ma (who also developed Novor), to find peptidoglycan peptides from a WA margin sample. I have that data in an Excel file. I want to run the same analysis on these PEAKS peptides that I've already done on the Novor-sequenced peptides.

In [8]:
!date

Tue Dec  6 12:53:53 PST 2016


In [7]:
!system_profiler SPSoftwareDataType

Software:

    System Software Overview:

      System Version: macOS 10.12.1 (16B2555)
      Kernel Version: Darwin 16.1.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: Megan’s MacBook Pro
      User Name: Megan Duffy (meganduffy)
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled
      Time since boot: 22 days 3:56



In [14]:
# here are the raw peptides that PEAKS predicted (in descending order by length)
# these data were from before the Keil lab had proteomics-capable LC-MS/MS systems
# therefore, it's most certainly DDA (data-dependent acquisition) data
# the 'raw peptides' column was extracted from ~/Documents/Keil Lab/pepditoglycan/De Novo spreadsheet for proposal
# there is no scoring in the spreadsheet
!head -50 /Users/meganduffy/Documents/git-repos/FISH-546/data/peaks_peptides.txt

LICCPVQQPPLMMNSQGGGGSGGGGGFEGMIRPAK
QPPSTTSMYTEEKPMNAPSSDSNNLFSGLDMVSK
TGLPAGSELSLYDVAPVVPGVAVDLSHIPTDVK
IDSVNAAPGIPENVALDDIQSSFLTDGQPGER
LASGQEIPCFALTSPEAGSDASSIPDYGVVCK
VQGHQCVACPAGTTNDGGDDASGGDTQCDITR
VNGQLLCGFTGLWTDVLTVMQDLSSEVALK
ADYSSEVATLASAGGDAVAVIGYLDQGGK
LELAQFDELAAFSQFASDLDASTQQQLER
TDFLQQEIDAGNYNPFGGTYNDPAVIDR
TDFLQQEIDAGNYNPFGGTYNDPAVIDR
GDGDGDGDGDGDGDGDGDGDGDDSGGR
GLDPINPGGPLDFFNVASNPQDLALLK
SGNFDNSIFGTGLTFVETAMSSGNSVK
SIPNKLGGVLALLLSILILTIVPMLHK
TVVIMELINNIAMNHGGYSVFAGVGER
VLNGLKETYNSLGVPIGATVQAIQAMK
VSEEEETQGLDIGEHGMEAYPDFASAK
YLYGTDFDSLDVSQSGNTCSMNNANVR
DDWDGATSDDDDDDDDGGGGGGGDGG
DGDGDGDGGGDGDGGGDGDGDGGGGK
ENDDDDDDDDGDENEGGAGDVSADER
LVVSLALLQFVLVIPALLIISDVPVK
NLAGYTDNRPLNEVLNTGNVGLSPFK
QCPENHVSFADLGDVTEADHSVEVFK
STLLSSIVGVPNSLGPGELLQHYGTK
TEQAFELPTGGAALMNSGENLMYFAR
ANDAAGDGTTTATVLAQSIVNEGLK
APAAKPSTSTVSASATPGGRTVSDK
AWDSDDDDDDDDDDDEGAGGAGAGK
DASDDASDDDDDDDDDAVDDDASIK
GGIDFQPITVLVPGGEEFPFTFSSK
ILQGYQYVAANPDEVCPANWTPGEK
SLGGEVVGEDYLPLGNTE

In [12]:
# how many peptides are there?
!wc -l /Users/meganduffy/Documents/git-repos/FISH-546/data/peaks_peptides.txt

     454 /Users/meganduffy/Documents/git-repos/FISH-546/data/peaks_peptides.txt
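
The line count from `wc -l` includes repeated sequences (for example, `TDFLQQEIDAGNYNPFGGTYNDPAVIDR` appears twice in the listing above). Below is a minimal Python sketch for counting unique sequences, assuming one peptide per line. The helper name `unique_peptides` is hypothetical and was not part of the original analysis.

```python
def unique_peptides(lines):
    """Return peptides in first-seen order with blank lines and repeats removed."""
    seen = set()
    unique = []
    for line in lines:
        peptide = line.strip()
        if peptide and peptide not in seen:
            seen.add(peptide)
            unique.append(peptide)
    return unique


# Three lines copied from the listing above; the second and third are identical.
example = [
    "LELAQFDELAAFSQFASDLDASTQQQLER\n",
    "TDFLQQEIDAGNYNPFGGTYNDPAVIDR\n",
    "TDFLQQEIDAGNYNPFGGTYNDPAVIDR\n",
]
print(len(example), "lines,", len(unique_peptides(example)), "unique peptides")
```

To apply this to the real file, pass an open file handle (for example `open("peaks_peptides.txt")`) instead of the `example` list.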


In [15]:
# what does Unipept lowest common ancestor analysis look like for these peptides?
# running unipept pept2lca
!unipept pept2lca -i ~/Documents/git-repos/FISH-546/data/peaks_peptides.txt \
-o ~/Documents/git-repos/FISH-546/analyses/peaks_peptides_lca.txt

In [16]:
# here is the output csv file:
!head ~/Documents/git-repos/FISH-546/analyses/peaks_peptides_lca.txt

peptide,taxon_id,taxon_name,taxon_rank
LICCPVQQPPLMMNSQGGGGSGGGGGFEGMIRPAK,3068,Volvox carteri f. nagariensis,forma
QPPSTTSMYTEEKPMNAPSSDSNNLFSGLDMVSK,653948,Albugo laibachii,species
TGLPAGSELSLYDVAPVVPGVAVDLSHIPTDVK,53246,Pseudoalteromonas,genus
IDSVNAAPGIPENVALDDIQSSFLTDGQPGER,152297,Pseudoalteromonas issachenkonii,species
LASGQEIPCFALTSPEAGSDASSIPDYGVVCK,53246,Pseudoalteromonas,genus
VQGHQCVACPAGTTNDGGDDASGGDTQCDITR,242159,Ostreococcus 'lucimarinus',species
VNGQLLCGFTGLWTDVLTVMQDLSSEVALK,2880,Ectocarpus siliculosus,species
ADYSSEVATLASAGGDAVAVIGYLDQGGK,198252,Candidatus Pelagibacter ubique,species
LELAQFDELAAFSQFASDLDASTQQQLER,1129,Synechococcus,genus
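
Before looking at the sunburst, the LCA output can be tallied by taxon or rank. Below is a minimal Python sketch, assuming the four-column CSV layout shown above. The function name `count_by_column` is hypothetical, and the sample rows are copied from the output above rather than read from the real file.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows of a pept2lca CSV by the value in one named column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# A few rows copied from the output above, standing in for the real file.
sample = io.StringIO(
    "peptide,taxon_id,taxon_name,taxon_rank\n"
    "TGLPAGSELSLYDVAPVVPGVAVDLSHIPTDVK,53246,Pseudoalteromonas,genus\n"
    "IDSVNAAPGIPENVALDDIQSSFLTDGQPGER,152297,Pseudoalteromonas issachenkonii,species\n"
    "LASGQEIPCFALTSPEAGSDASSIPDYGVVCK,53246,Pseudoalteromonas,genus\n"
)
for rank, n in count_by_column(sample, "taxon_rank").most_common():
    print(rank, n)
```

Passing `"taxon_name"` instead of `"taxon_rank"` gives the number of peptides assigned to each taxon.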


Here is the graphical output (sunburst) of the LCA analysis:

This is a WA margin sample, so we expect to see lots of diatoms. And we do see peptides from diatoms (Bacillariophyta)! Four are specific to _Thalassiosira pseudonana_, in fact. We also see cyanobacterial peptides (125), and one peptide specific to the water flea.

![peaks peptides lca sunburst](https://raw.githubusercontent.com/MeganEDuffy/FISH-546/master/analyses/2016-12-06-wa-peaks-unipept_lca.png)

### PEAKS _de novo_ search on all ETNP depth samples

I ran all the ETNP .mgf files through a PEAKS _de novo_ search (variable modifications: carboxymethylation on cysteine, oxidation on methionine). I also ran them without modifications as a test, but since we used IAA (iodoacetic acid) in the protein digest prep, the modifications should be included in the search.

Now I'm running all of them through a Unipept pipeline (both ```unipept pept2lca``` and ```unipept pept2prot```), which I've written a script for. See my Unipept [notebook](https://github.com/MeganEDuffy/FISH-546/blob/master/notebooks/etnp-unipept/04-pept2lca-all-depths.ipynb) for those analyses.
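
As a rough illustration of what such a pipeline does per file (this is not the actual script, which lives in the linked notebook), the Python sketch below builds one `unipept` command per analysis. It assumes that `pept2prot` accepts the same `-i`/`-o` options used with `pept2lca` earlier in this notebook. The function name `unipept_commands` and the output naming scheme are hypothetical.

```python
from pathlib import Path


def unipept_commands(peptide_file, out_dir):
    """Build (but do not run) one unipept command per analysis type."""
    peptide_file = Path(peptide_file)
    commands = []
    for subcommand in ("pept2lca", "pept2prot"):
        # Hypothetical naming scheme: <input stem>_<subcommand>.csv
        out_file = Path(out_dir) / f"{peptide_file.stem}_{subcommand}.csv"
        commands.append(
            ["unipept", subcommand, "-i", str(peptide_file), "-o", str(out_file)]
        )
    return commands


for command in unipept_commands("052716_100fmol_Hi3_1.txt", "analyses"):
    print(" ".join(command))
```

Each command is returned as a list, so it could be handed to `subprocess.run(command, check=True)` once the peptide files exist.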

 - The first step of this process is to extract the peptide list from the PEAKS output .csv file. I'll use ```awk``` to do this:

In [1]:
cd ~/Documents/git-repos/FISH-546/data

/Users/meganduffy/Documents/git-repos/FISH-546/data


In [2]:
# Here's what the files look like:
!head ETNP-peaks-exports/052716_100fmol_Hi3_1.csv

Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
45,29402,052716_100fmol Hi3_1.mgf,VPSSLR,6,80,6,658.3881,1,58.58,,657.3809,-0.2,,68 60 79 87 95 95,VPSSLR,CID
45,29403,052716_100fmol Hi3_1.mgf,DLLPDEAASSLR,12,75,12,643.8519,2,58.58,,1285.6514,29.4,,44 50 46 28 90 93 95 95 92 88 92 92,DLLPDEAASSLR,CID
45,29390,052716_100fmol Hi3_1.mgf,VSNLAR,6,67,6,659.3831,1,54.50,,658.3762,-0.6,,79 79 67 81 44 49,VSNLAR,CID
45,29401,052716_100fmol Hi3_1.mgf,RGPALDEAASSLR,13,65,13,448.2097,3,58.57,,1341.7000,-69.0,,33 16 18 24 40 88 93 93 93 88 82 88 89,RGPALDEAASSLR,CID
45,29395,052716_100fmol Hi3_1.mgf,TDVAHVDAELAEVLAR,16,57,16,570.2692,3,54.53,,1707.8792,-54.6,,44 54 32 38 37 29 69 61 76 90 74 60 54 76 50 71,TDVAHVDAELAEVLAR,CID
45,29392,052716_100fmol Hi3_1.mgf,TDVVTPSAGGLAEVLAR,17,56,17,552.6296,3,54.51,,1654.8889,-13.3,,51 60 32 37 39 28 38 29 28 49 90 78 66 61 85 88 95,TDVVTPSAGGLAEVLAR,CID
45,29399,052716_1

In [None]:
# remove non-peptide columns: keep only the Peptide column (column 4 in the header above) and skip the header row
!awk -F "\"*,\"*" 'NR > 1 {print $4}' ETNP-peaks-exports/052716_100fmol_Hi3_1.csv > 052716_100fmol_Hi3_1.txt
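
The same extraction can also be sketched in Python, selecting the column by its header name (`Peptide`, as shown in the `head` output above) rather than by position, so it keeps working if the export's column order changes. The function name `extract_peptides` is hypothetical. The sample uses a header shortened to the first six columns and two rows copied from the file shown above.

```python
import csv
import io


def extract_peptides(csv_file, column="Peptide"):
    """Return the values of one named column from a PEAKS CSV export."""
    reader = csv.DictReader(csv_file)  # consumes the header row automatically
    return [row[column] for row in reader]


# Shortened header and two rows from the export shown above.
sample = io.StringIO(
    "Fraction,Scan,Source File,Peptide,Tag Length,ALC (%)\n"
    "45,29402,052716_100fmol Hi3_1.mgf,VPSSLR,6,80\n"
    "45,29403,052716_100fmol Hi3_1.mgf,DLLPDEAASSLR,12,75\n"
)
print("\n".join(extract_peptides(sample)))
```

For a real export, open the file with `open(path, newline="")` and pass the handle in place of `sample`.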