## PROTEUS2 protein secondary structure analysis:

### This is a notebook to organize and evaluate the output of peptide secondary structure prediction by [Proteus2](http://www.proteus2.ca/proteus2/), described in [Montgomerie et al., 2008](https://academic.oup.com/nar/article/36/suppl_2/W202/2506231) in _Nucleic Acids Research_. 

### Proteus2 requires FASTA-formatted input. We get a FASTA file with the Peaks DB protein results, but our database has some X (unknown) residues, which Proteus2 will not accept.

In [4]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_322-T0dig-all_PEAKS_75/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_322-T0dig-all_PEAKS_75


In [5]:
# LIBRARIES
# os for checking the working directory
import os
os.getcwd()
# pandas for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# regular expressions (regex)
import re
# check pandas version
pd.__version__

'1.0.5'

In [6]:
ls

322-T0dig-all_PEAKS_75_DB-search-psm.csv
322-T0dig-all_PEAKS_75_DNO.csv
322-T0dig-all_PEAKS_75_DNO.xml
322-T0dig-all_PEAKS_75_peptide.csv
322-T0dig-all_PEAKS_75_peptides.pep.xml
322-T0dig-all_PEAKS_75_peptide.xlsx
322-T0dig-all_PEAKS_75_protein-peptides.csv
322-T0dig-all_PEAKS_75_proteins.csv
322-T0dig-all_PEAKS_75_proteins.fasta


In [14]:
#look at the FASTA

!head -30 322-T0dig-all_PEAKS_75_proteins.fasta

>gi|54036848|sp|P63284.1|CLPB_ECOLI RecName: Full=Chaperone protein ClpB; AltName: Full=Heat shock protein F84.1
MRLDRLTNKFQLALADAQSLALGHDNQFIEPLHLMSALLNQEGGSVSPLL
TSAGINAGQLRTDINQALNRLPQVEGTGGDVQPSQDLVRVLNLCDKLAQK
RGDNFISSELFVLAALESRGTLADILKAAGATTANITQAIEQMRGGESVN
DQGAEDQRQALKKYTIDLTERAEQGKLDPVIGRDEEIRRTIQVLQRRTKN
NPVLIGEPGVGKTAIVEGLAQRIINGEVPEGLKGRRVLALDMGALVAGAK
YRGEFEERLKGVLNDLAKQEGNVILFIDELHTMVGAGKADGAMDAGNMLK
PALARGELHCVGATTLDEYRQYIEKDAALERRFQKVFVAEPSVEDTIAIL
RGLKERYELHHHVQITDPAIVAAATLSHRYIADRQLPDKAIDLIDEAASS
IRMQIDSKPEELDRLDRRIIQLKLEQQALMKESDEASKKRLDMLNEELSD
KERQYSELEEEWKAEKASLSGTQTIKAELEQAKIAIEQARRVGDLARMSE
LQYGKIPELEKQLEAATQLEGKTMRLLRNKVTDAEIAEVLARWTGIPVSR
MMESEREKLLRMEQELHHRVIGQNEAVDAVSNAIRRSRAGLADPNRPIGS
FLFLGPTGVGKTELCKALANFMFDSDEAMVRIDMSEFMEKHSVSRLVGAP
PGYVGYEEGGYLTEAVRRRPYSVILLDEVEKAHPDVFNILLQVLDDGRLT
DGQGRTVDFRNTVVIMTSNLGSDLIQERFGELDYAHMKELVLGVVSHNFR
PEFINRIDEVVVFHPLGEQHIASIAQIQLKRLYKRLEERGYEIHISDEAL
KLLSENGYDPVYGARPLKRAIQQQIENPLAQQILSGE
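
Before stripping anything, it can help to know how many X residues a file actually holds. The block below is a minimal sketch and not part of the original workflow: `count_x_residues` is a hypothetical helper that reads a FASTA file and counts, per record, the X characters found in sequence lines only (header lines starting with `>` are skipped).

```python
from collections import Counter


def count_x_residues(fasta_path):
    """Return a Counter mapping each FASTA header to its number of X residues."""
    counts = Counter()
    header = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                header = line[1:]
                counts[header] += 0  # register the record even if it has no X
            elif header is not None:
                counts[header] += line.count("X")
    return counts


# Example call, assuming the notebook is in the sample directory:
# counts = count_x_residues("322-T0dig-all_PEAKS_75_proteins.fasta")
# sum(counts.values()) gives the total number of X residues in the file
```

Records with a nonzero count are the ones whose sequences will change when X is removed.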

In [12]:
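# use sed to remove all occurrences of X from sequence lines
# (header lines starting with > are skipped so protein names and accessions stay intact)

!sed '/^>/!s/X//g' 322-T0dig-all_PEAKS_75_proteins.fasta > 322-T0dig-all_PEAKS_75_proteins-noX.fasta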

In [13]:
# take a look

!head -30 322-T0dig-all_PEAKS_75_proteins-noX.fasta

>gi|54036848|sp|P63284.1|CLPB_ECOLI RecName: Full=Chaperone protein ClpB; AltName: Full=Heat shock protein F84.1
MRLDRLTNKFQLALADAQSLALGHDNQFIEPLHLMSALLNQEGGSVSPLL
TSAGINAGQLRTDINQALNRLPQVEGTGGDVQPSQDLVRVLNLCDKLAQK
RGDNFISSELFVLAALESRGTLADILKAAGATTANITQAIEQMRGGESVN
DQGAEDQRQALKKYTIDLTERAEQGKLDPVIGRDEEIRRTIQVLQRRTKN
NPVLIGEPGVGKTAIVEGLAQRIINGEVPEGLKGRRVLALDMGALVAGAK
YRGEFEERLKGVLNDLAKQEGNVILFIDELHTMVGAGKADGAMDAGNMLK
PALARGELHCVGATTLDEYRQYIEKDAALERRFQKVFVAEPSVEDTIAIL
RGLKERYELHHHVQITDPAIVAAATLSHRYIADRQLPDKAIDLIDEAASS
IRMQIDSKPEELDRLDRRIIQLKLEQQALMKESDEASKKRLDMLNEELSD
KERQYSELEEEWKAEKASLSGTQTIKAELEQAKIAIEQARRVGDLARMSE
LQYGKIPELEKQLEAATQLEGKTMRLLRNKVTDAEIAEVLARWTGIPVSR
MMESEREKLLRMEQELHHRVIGQNEAVDAVSNAIRRSRAGLADPNRPIGS
FLFLGPTGVGKTELCKALANFMFDSDEAMVRIDMSEFMEKHSVSRLVGAP
PGYVGYEEGGYLTEAVRRRPYSVILLDEVEKAHPDVFNILLQVLDDGRLT
DGQGRTVDFRNTVVIMTSNLGSDLIQERFGELDYAHMKELVLGVVSHNFR
PEFINRIDEVVVFHPLGEQHIASIAQIQLKRLYKRLEERGYEIHISDEAL
KLLSENGYDPVYGARPLKRAIQQQIENPLAQQILSGE
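
The same clean-up can also be written in Python, which makes it explicit that only sequence lines are edited. This is a sketch under stated assumptions rather than the method used in this notebook: `strip_x_from_fasta` is a hypothetical helper, and it assumes a plain-text FASTA file where every header line starts with `>`.

```python
def strip_x_from_fasta(in_path, out_path, residue="X"):
    """Copy a FASTA file, deleting `residue` from sequence lines only.

    Header lines are written unchanged. Sequence lines left empty after
    the deletion are dropped. Returns the number of residues removed.
    """
    removed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)
                continue
            removed += line.count(residue)
            cleaned = line.replace(residue, "")
            if cleaned.strip():
                dst.write(cleaned)
    return removed


# Example call, assuming the notebook is in the sample directory:
# strip_x_from_fasta("322-T0dig-all_PEAKS_75_proteins.fasta",
#                    "322-T0dig-all_PEAKS_75_proteins-noX.fasta")
```

Deleting a residue shifts the numbering of every residue after it, so positions in the Proteus2 output refer to the X-free sequence, not to the original database entry.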

### Repeating the same process for the remaining samples
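
Since every sample directory follows the same naming scheme, the per-sample steps could also be run in one loop. The block below is a hedged sketch: `strip_x_in_all_samples` and its default glob pattern are assumptions based on the directory and file names seen in this notebook, and `data_root` would be the `MED_Weissrot_Fusion_UWPR2021` directory.

```python
from pathlib import Path


def strip_x_in_all_samples(
    data_root, pattern="MED_Weissrot_Fusion_*_PEAKS_*/*_proteins.fasta"
):
    """Write a *_proteins-noX.fasta file next to each matching FASTA file.

    X is removed from sequence lines only; header lines are copied as-is.
    Returns the list of output paths that were written.
    """
    written = []
    for fasta in sorted(Path(data_root).glob(pattern)):
        out_path = fasta.with_name(fasta.stem + "-noX.fasta")
        with open(fasta) as src, open(out_path, "w") as dst:
            for line in src:
                if line.startswith(">"):
                    dst.write(line)
                else:
                    dst.write(line.replace("X", ""))
        written.append(out_path)
    return written


# Example call (path taken from the cd cells in this notebook):
# strip_x_in_all_samples(
#     "/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021"
# )
```

The pattern ends in `_proteins.fasta`, so files already named `_proteins-noX.fasta` are not picked up if the loop is run a second time.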

### 323 Day 2 trypsin PeaksDB

In [15]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_323-T2dig-all_PEAKS_77/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_323-T2dig-all_PEAKS_77


In [17]:
# use sed to remove all occurrences of X from sequence lines (header lines starting with > are skipped)

!sed '/^>/!s/X//g' 323-T2dig-all_PEAKS_77_proteins.fasta > 323-T2dig-all_PEAKS_77_proteins-noX.fasta

### 324 Day 5 trypsin PeaksDB

In [22]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_324-T5dig-all_PEAKS_79/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_324-T5dig-all_PEAKS_79


In [24]:
# use sed to remove all occurrences of X from sequence lines (header lines starting with > are skipped)

!sed '/^>/!s/X//g' 324-T5dig-all_PEAKS_79_proteins.fasta > 324-T5dig-all_PEAKS_79_proteins-noX.fasta

### 325 Day 12 trypsin PeaksDB

In [26]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_325-T12dig-all_PEAKS_81/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_325-T12dig-all_PEAKS_81


In [28]:
# use sed to remove all occurrences of X from sequence lines (header lines starting with > are skipped)

!sed '/^>/!s/X//g' 325-T12dig-all_PEAKS_81_proteins.fasta > 325-T12dig-all_PEAKS_81_proteins-noX.fasta

### 329 Day 0 undigested PeaksDB

In [30]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_329-T0nd-all_PEAKS_91/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_329-T0nd-all_PEAKS_91


In [32]:
# use sed to remove all occurrences of X from sequence lines (header lines starting with > are skipped)

!sed '/^>/!s/X//g' 329-T0nd-all_PEAKS_91_proteins.fasta > 329-T0nd-all_PEAKS_91_proteins-noX.fasta

### 330 Day 2 undigested PeaksDB

In [33]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_330-T2nd-all_PEAKS_94/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_330-T2nd-all_PEAKS_94


In [35]:
# use sed to remove all occurrences of X from sequence lines (header lines starting with > are skipped)

!sed '/^>/!s/X//g' 330-T2nd-all_PEAKS_94_proteins.fasta > 330-T2nd-all_PEAKS_94_proteins-noX.fasta

### 331 Day 5 undigested PeaksDB

In [36]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_331-T5nd-all_PEAKS_99/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_331-T5nd-all_PEAKS_99


In [37]:
ls

331-T5nd-all_PEAKS_99_DB-search-psm.csv
331-T5nd-all_PEAKS_99_DNO.csv
331-T5nd-all_PEAKS_99_peptide.csv
331-T5nd-all_PEAKS_99_peptides_1_0_0.mzid
331-T5nd-all_PEAKS_99_peptides.pep.xml
331-T5nd-all_PEAKS_99_peptide.xlsx
331-T5nd-all_PEAKS_99_protein-peptides.csv
331-T5nd-all_PEAKS_99_proteins.csv
331-T5nd-all_PEAKS_99_proteins.fasta


In [38]:
# use sed to remove all occurrences of X from sequence lines (header lines starting with > are skipped)

!sed '/^>/!s/X//g' 331-T5nd-all_PEAKS_99_proteins.fasta > 331-T5nd-all_PEAKS_99_proteins-noX.fasta

### 332 Day 12 undigested PeaksDB

In [41]:
cd /home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_332-T12nd-all_PEAKS_101/

/home/millieginty/Documents/git-repos/rot-mayer/data/MED_Weissrot_Fusion_UWPR2021/MED_Weissrot_Fusion_332-T12nd-all_PEAKS_101


In [43]:
# use sed to remove all occurrences of X from sequence lines (header lines starting with > are skipped)

!sed '/^>/!s/X//g' 332-T12nd-all_PEAKS_101_proteins.fasta > 332-T12nd-all_PEAKS_101_proteins-noX.fasta
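
Before uploading the cleaned files to Proteus2, a quick sanity check may be worthwhile: the headers should be identical to those in the original file (only residues are meant to change), and no X should remain in any sequence line. The helpers below are hypothetical names introduced for illustration, not part of the original notebook.

```python
def read_headers(fasta_path):
    """Return the list of header lines (starting with '>') in a FASTA file."""
    with open(fasta_path) as handle:
        return [line.rstrip("\n") for line in handle if line.startswith(">")]


def check_cleaned_fasta(original_path, cleaned_path):
    """Compare a cleaned FASTA file against its original.

    Returns a tuple (headers_match, x_left):
      headers_match - True if both files have the same headers in the same order
      x_left        - number of X characters still present in sequence lines
    """
    headers_match = read_headers(original_path) == read_headers(cleaned_path)
    with open(cleaned_path) as handle:
        x_left = sum(
            line.count("X") for line in handle if not line.startswith(">")
        )
    return headers_match, x_left


# Example call, assuming the notebook is in the sample directory:
# check_cleaned_fasta("332-T12nd-all_PEAKS_101_proteins.fasta",
#                     "332-T12nd-all_PEAKS_101_proteins-noX.fasta")
```

A result of `(True, 0)` means the headers survived untouched and the sequences are free of X; `headers_match` being `False` would suggest that X characters were also removed from protein names or accessions.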