## PROTEUS2 protein secondary structure analysis:

### This is a notebook to organize and evaluate the output of protein secondary structure prediction by [Proteus2](http://www.proteus2.ca/proteus2/), described in [Montgomerie et al., 2008](https://academic.oup.com/nar/article/36/suppl_2/W202/2506231) in _Nucleic Acids Research_. 

### Proteus2 returns its results by email in a nominally FASTA format, but with extra line breaks and spaces inside the protein and prediction sequences. The sequence name is also truncated (`>Thalas` in the example below).

#### The output looks like this:

#### >Thalas

MMKLAALAAL MGSAAAFAPA QTGKASTQLR AFEDELGAQP PLGFFDPFGM 

CCHHHHHHHH HHHHHCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC 

LSGDCTQERF DRLRYVEIKH GRICMLAFLG QIVTRAGIHL PGSINYAGDS 

CCCCCCHHHH HHHHHHHHHH HHHHHHHHHH HHHHHHHCCC CCCCCCCCCC 

FDSFPNGVAA LFGPNSIPTA GLVQIIAFIG VLECAFMRDV PGTGNEFVGD 

CCCCCCCCCC CCCCCCCCHH HHHHHHHHHH HHHHHHHHCC CCCCCCCCCC 

FRNGYIDFGW DDFDEETKLQ KRAIQSGTIS NMMKLAALAA LMGSAAAFAP 

CCCCCCCCCC CCCCHHHHHH HHHHHHHHHH HHHHHHHHHC CCCCCCCCCC 
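
The layout above can also be read record by record in Python. The following is a minimal sketch, not the method used later in this notebook: `parse_proteus2` is a hypothetical helper name, and it assumes that residue lines and prediction lines strictly alternate within each record. Because the names are truncated, several records may share the same name, so the result is a list rather than a dictionary.

```python
def _finish(name, chunks):
    """Join alternating residue/prediction chunks into one record."""
    sequence = "".join(chunks[0::2])    # 1st, 3rd, 5th ... lines are residues
    prediction = "".join(chunks[1::2])  # 2nd, 4th, 6th ... lines are predictions
    return (name, sequence, prediction)


def parse_proteus2(text):
    """Return a list of (name, sequence, prediction) tuples, one per '>' record."""
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = "".join(raw.split())  # remove every space and tab in the line
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append(_finish(name, chunks))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append(_finish(name, chunks))
    return records


example = """>Thalas

MMKLAALAAL MGSAAAFAPA

CCHHHHHHHH HHHHHCCCCC

LSGDCTQERF

CCCCCCHHHH
"""

records = parse_proteus2(example)
print(records)
```

Unlike the line-by-line cleanup further down, this keeps the boundaries between proteins, which may matter if per-protein statistics are needed.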



### The prediction codes mean:

- H = Helix
- E = Beta Strand
- C = Coil
- T = Membrane helix
- B = Membrane strand
- S = Signal peptide
- c = Cleavage site
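
The legend above can be kept as a small lookup table, which makes it easy to flag unexpected characters in a prediction string. This is an illustrative sketch; the names `PROTEUS2_CODES` and `unknown_codes` are assumptions and do not appear elsewhere in the notebook.

```python
# Legend copied from the list above; note that 'C' (coil) and 'c' (cleavage site)
# differ only by case, so all matching must be case-sensitive.
PROTEUS2_CODES = {
    "H": "Helix",
    "E": "Beta strand",
    "C": "Coil",
    "T": "Membrane helix",
    "B": "Membrane strand",
    "S": "Signal peptide",
    "c": "Cleavage site",
}


def unknown_codes(prediction):
    """Return the characters in a prediction string that are not in the legend."""
    return set(prediction) - set(PROTEUS2_CODES)


print(unknown_codes("CCHHHHTTTc"))  # set()
print(unknown_codes("CCHHXX"))      # {'X'}
```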

In [1]:
cd /home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/Proteins-Proteus2/

/home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/Proteins-Proteus2


In [2]:
# LIBRARIES
import os
os.getcwd()
# pandas for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# regular expressions (regex)
import re
# check the pandas version
pd.__version__

'1.0.5'

In [3]:
ls

PeaksDB_322_prot_trypsin_totals
PeaksDB_323_prot_trypsin_totals
PeaksDB_324_prot_trypsin_totals
PeaksDB_325_prot_trypsin_totals
PeaksDB_329_prot_undigested_totals
PeaksDB_330_prot_undigested_totals
PeaksDB_331_prot_undigested_totals
T0_322_trypsin_PeaksDB_protein_proteus
T0_322_trypsin_PeaksDB_protein_proteus_sort.csv
T0_322_trypsin_PeaksDB_protein_proteus.txt
T0_329_undigested_PeaksDB_protein_proteus
T0_329_undigested_PeaksDB_protein_proteus_sort.csv
T0_329_undigested_PeaksDB_protein_proteus.txt
T12_325_trypsin_PeaksDB_protein_proteus
T12_325_trypsin_PeaksDB_protein_proteus_sort.csv
T12_325_trypsin_PeaksDB_protein_proteus.txt
T12_332_undigested_PeaksDB_protein_proteus
T2_323_trypsin_PeaksDB_protein_proteus
T2_323_trypsin_PeaksDB_protein_proteus_sort.csv
T2_323_trypsin_PeaksDB_protein_proteus.txt
T2_330_undigested_PeaksDB_protein_proteus
T2_330_undigested_PeaksDB_protein_proteus_sort.csv
T2_330_undigested_PeaksDB_protein_proteus.txt
T5_324_trypsin_PeaksDB_protein

In [4]:
!head T12_332_undigested_PeaksDB_protein_proteus

>Thalas

HQRNLKMKLA VLAALFGSAA AFAPAQTGKA TSALNAFESE LGAQPPLGFF 
CCEEEEEHHH HHHHHHHHHH CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC 

DPLGLLDDAD QERFDRLRYV EIKHGRIAQL AFLGNIITRA GVHLPGNIDY 
CCCCCCCCTT TTTTTTTTTT TTTTTTTTTT TTTTTTTCCC CCCCEEEHHH 

AGNSFDSFPN GWAAILDAAF APAQTGKATS ALNAFESELG AQPPLGFFDP 
CCCCEEECCT TTTTTTTTTT TTTTTTTTTT EECCCCCCCC CCCCCTTTTT 


In [5]:
# remove empty (whitespace-only) lines
# remove the '>' header lines (protein sequence names)
# remove spaces and tabs within the remaining lines
# join each pair of lines (protein sequence, secondary structure prediction) into one space-separated line

!sed '/^[[:space:]]*$/d' T12_332_undigested_PeaksDB_protein_proteus \
| sed '/>/d' \
| tr -d "[:blank:]" > T12_332_undigested_PeaksDB_protein_proteus.txt

!awk '{printf "%s%s",$0,(NR%2?FS:RS)}' T12_332_undigested_PeaksDB_protein_proteus.txt > \
T12_332_undigested_PeaksDB_protein_proteus_sort.csv

In [6]:
!head T12_332_undigested_PeaksDB_protein_proteus_sort.csv

HQRNLKMKLAVLAALFGSAAAFAPAQTGKATSALNAFESELGAQPPLGFF CCEEEEEHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DPLGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPGNIDY CCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCCCCCEEEHHH
AGNSFDSFPNGWAAILDAAFAPAQTGKATSALNAFESELGAQPPLGFFDP CCCCEEECCTTTTTTTTTTTTTTTTTTTTTEECCCCCCCCCCCCCTTTTT
LGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPGNIDYAG TTTTTTTTTTTTTTTTTTCHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
NSFDSFPNGWAAISGPDAISGSGLGQIVAFVGFLELFVMKDVTGEGEFVG CCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
DFRNGALDFGWDKFDAETKLSKRAIELNNGRAAMMGILGLMVHEQLGGSL CCCCCCCCCCCCCCCHHHHHHHHHHHHHHCHHHHHHHHHHHHHHHHCCCC
PIVGEM CCCCCC
KMKLAILAALFGSAAAFAPSQTGKASTQLRAFEDELGAQPPLGFFDPFGM CCCCHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
LSGDCTQERFDRLRYVEIKHGRICMLAFLGQVVTRAGIHLPGSINYAGDS CCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
FDSFPNGVAALFGPNSIPTAGLVQIIAFIGVLECAFMRDVPGTGNEFVGD CCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCC


In [7]:
# read into pandas with whitespace as the delimiter
# (sep=r"\s+" is equivalent to delim_whitespace=True, which newer pandas versions deprecate)

PeaksDB_332_prot = pd.read_csv("T12_332_undigested_PeaksDB_protein_proteus_sort.csv", sep=r"\s+", header=None)


# name columns

PeaksDB_332_prot.columns = ['Stripped protein sequence', 'Secondary structure pred.']
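
Since each row is built by pairing consecutive lines, a quick sanity check is that the sequence and its prediction have the same length in every row. The sketch below uses a small hypothetical frame (`demo`) with the same column names; it is an optional check, not part of the original workflow.

```python
import pandas as pd

# Hypothetical two-row frame using the notebook's column names.
demo = pd.DataFrame({
    "Stripped protein sequence": ["PIVGEM", "MKLA"],
    "Secondary structure pred.": ["CCCCCC", "CCH"],  # second row is one code short
})

seq_len = demo["Stripped protein sequence"].str.len()
pred_len = demo["Secondary structure pred."].str.len()

# Rows where the two strings disagree in length point to a pairing problem upstream.
mismatched = demo[seq_len != pred_len]
print(mismatched.index.tolist())  # [1]
```

An empty result on the real data would suggest that the line pairing did not slip out of step.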

In [8]:
PeaksDB_332_prot.head()

| | Stripped protein sequence | Secondary structure pred. |
|---|---|---|
| 0 | HQRNLKMKLAVLAALFGSAAAFAPAQTGKATSALNAFESELGAQPP... | CCEEEEEHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCC... |
| 1 | DPLGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPG... | CCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCCCCCEE... |
| 2 | AGNSFDSFPNGWAAILDAAFAPAQTGKATSALNAFESELGAQPPLG... | CCCCEEECCTTTTTTTTTTTTTTTTTTTTTEECCCCCCCCCCCCCT... |
| 3 | LGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPGNI... | TTTTTTTTTTTTTTTTTTCHHHHHHHHHHHHHHHHHHHHCCCCCCC... |
| 4 | NSFDSFPNGWAAISGPDAISGSGLGQIVAFVGFLELFVMKDVTGEG... | CCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHCCCCCCC... |


In [9]:
# add a column with the length of each stripped sequence line (number of AAs)
PeaksDB_332_prot['Sequence length'] = PeaksDB_332_prot['Stripped protein sequence'].apply(len)

# use a count function to enumerate the # of C's (coil residues) for each peptide
PeaksDB_332_prot['C'] = PeaksDB_332_prot['Secondary structure pred.'].str.count("C")

# use a count function to enumerate the # of H's (helices residues) in each peptide
PeaksDB_332_prot['H'] = PeaksDB_332_prot['Secondary structure pred.'].str.count("H")

# use a count function to enumerate the # of E's (beta strand residues) in each peptide
PeaksDB_332_prot['E'] = PeaksDB_332_prot['Secondary structure pred.'].str.count("E")

# use a count function to enumerate the # of T's (membrane helix residues) in each peptide
PeaksDB_332_prot['T'] = PeaksDB_332_prot['Secondary structure pred.'].str.count("T")

# use a count function to enumerate the # of B's (membrane strand residues) in each peptide
PeaksDB_332_prot['B'] = PeaksDB_332_prot['Secondary structure pred.'].str.count("B")

# use a count function to enumerate the # of S's (signal peptide residues) in each peptide
PeaksDB_332_prot['S'] = PeaksDB_332_prot['Secondary structure pred.'].str.count("S")

# use a count function to enumerate the # of c's (cleavage site residues) in each peptide
PeaksDB_332_prot['c'] = PeaksDB_332_prot['Secondary structure pred.'].str.count("c")

# add a column with the % C (stored as a fraction between 0 and 1, not multiplied by 100)
PeaksDB_332_prot['% C'] = PeaksDB_332_prot['C'] / PeaksDB_332_prot['Sequence length']

# add a column with the % H
PeaksDB_332_prot['% H'] = PeaksDB_332_prot['H'] / PeaksDB_332_prot['Sequence length']

# add a column with the % E
PeaksDB_332_prot['% E'] = PeaksDB_332_prot['E'] / PeaksDB_332_prot['Sequence length']

# add a column with the % T
PeaksDB_332_prot['% T'] = PeaksDB_332_prot['T'] / PeaksDB_332_prot['Sequence length']

# add a column with the % B
PeaksDB_332_prot['% B'] = PeaksDB_332_prot['B'] / PeaksDB_332_prot['Sequence length']

# add a column with the % S
PeaksDB_332_prot['% S'] = PeaksDB_332_prot['S'] / PeaksDB_332_prot['Sequence length']

# add a column with the % c
PeaksDB_332_prot['% c'] = PeaksDB_332_prot['c'] / PeaksDB_332_prot['Sequence length']

# additive check: the seven fractions should sum to 1.0 in every row

PeaksDB_332_prot['% check'] = PeaksDB_332_prot['% C'] + PeaksDB_332_prot['% H'] + PeaksDB_332_prot['% E'] \
                                + PeaksDB_332_prot['% T'] + PeaksDB_332_prot['% B'] + \
                                PeaksDB_332_prot['% S'] + PeaksDB_332_prot['% c']
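
The count and fraction columns above all follow one pattern, so they can also be generated in a loop. This is a compact sketch under that assumption; `add_structure_counts` and `CODES` are hypothetical names, and the column names are copied from the cell above so that the resulting frame has the same layout.

```python
import pandas as pd

# Prediction codes in the same order as the columns created above.
CODES = ["C", "H", "E", "T", "B", "S", "c"]


def add_structure_counts(df,
                         seq_col="Stripped protein sequence",
                         pred_col="Secondary structure pred."):
    """Return a copy of df with length, per-code counts, fractions and a check column."""
    out = df.copy()
    out["Sequence length"] = out[seq_col].str.len()
    for code in CODES:
        # str.count is case-sensitive, so 'C' (coil) and 'c' (cleavage) stay separate
        out[code] = out[pred_col].str.count(code)
    for code in CODES:
        out[f"% {code}"] = out[code] / out["Sequence length"]
    out["% check"] = out[[f"% {code}" for code in CODES]].sum(axis=1)
    return out


demo = pd.DataFrame({
    "Stripped protein sequence": ["PIVGEM"],
    "Secondary structure pred.": ["CCCCHH"],
})
result = add_structure_counts(demo)
print(result.loc[0, ["C", "H", "% check"]])
```

Adding a new code to `CODES` would then be enough to extend both the counts and the check.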

In [10]:
PeaksDB_332_prot.head()

| | Stripped protein sequence | Secondary structure pred. | Sequence length | C | H | E | T | B | S | c | % C | % H | % E | % T | % B | % S | % c | % check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HQRNLKMKLAVLAALFGSAAAFAPAQTGKATSALNAFESELGAQPP... | CCEEEEEHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCC... | 50 | 32 | 13 | 5 | 0 | 0 | 0 | 0 | 0.64 | 0.26 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | DPLGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPG... | CCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCCCCCEE... | 50 | 15 | 3 | 3 | 29 | 0 | 0 | 0 | 0.3 | 0.06 | 0.06 | 0.58 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | AGNSFDSFPNGWAAILDAAFAPAQTGKATSALNAFESELGAQPPLG... | CCCCEEECCTTTTTTTTTTTTTTTTTTTTTEECCCCCCCCCCCCCT... | 50 | 19 | 0 | 5 | 26 | 0 | 0 | 0 | 0.38 | 0.0 | 0.1 | 0.52 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | LGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPGNI... | TTTTTTTTTTTTTTTTTTCHHHHHHHHHHHHHHHHHHHHCCCCCCC... | 50 | 12 | 20 | 0 | 18 | 0 | 0 | 0 | 0.24 | 0.4 | 0.0 | 0.36 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | NSFDSFPNGWAAISGPDAISGSGLGQIVAFVGFLELFVMKDVTGEG... | CCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHCCCCCCC... | 50 | 31 | 19 | 0 | 0 | 0 | 0 | 0 | 0.62 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [11]:
# sum each per-row fraction over all rows; '% check sum' then equals the number of rows
index = ['332 total']

data = {
        '% C total': PeaksDB_332_prot['% C'].sum(),
        '% H total': PeaksDB_332_prot['% H'].sum(),
        '% E total': PeaksDB_332_prot['% E'].sum(),
        '% T total': PeaksDB_332_prot['% T'].sum(),
        '% B total': PeaksDB_332_prot['% B'].sum(),
        '% S total': PeaksDB_332_prot['% S'].sum(),
        '% c total': PeaksDB_332_prot['% c'].sum(),
        '% check sum': PeaksDB_332_prot['% check'].sum()
       }

PeaksDB_332_prot_totals = pd.DataFrame(data, columns=['% C total', '% H total', '% E total', '% T total', \
                                                      '% B total',  '% S total', '% c total', \
                                                      '% check sum'], index=index)

PeaksDB_332_prot_totals['overall % sum'] = PeaksDB_332_prot_totals['% C total'] \
                                            + PeaksDB_332_prot_totals['% H total'] \
                                            + PeaksDB_332_prot_totals['% E total'] \
                                            + PeaksDB_332_prot_totals['% T total'] \
                                            + PeaksDB_332_prot_totals['% B total'] \
                                            + PeaksDB_332_prot_totals['% S total'] \
                                            + PeaksDB_332_prot_totals['% c total'] 


PeaksDB_332_prot_totals['overall % C'] = PeaksDB_332_prot_totals['% C total'] / PeaksDB_332_prot_totals['overall % sum']

PeaksDB_332_prot_totals['overall % H'] = PeaksDB_332_prot_totals['% H total'] / PeaksDB_332_prot_totals['overall % sum']

PeaksDB_332_prot_totals['overall % E'] = PeaksDB_332_prot_totals['% E total'] / PeaksDB_332_prot_totals['overall % sum']

PeaksDB_332_prot_totals['overall % T'] = PeaksDB_332_prot_totals['% T total'] / PeaksDB_332_prot_totals['overall % sum']

PeaksDB_332_prot_totals['overall % B'] = PeaksDB_332_prot_totals['% B total'] / PeaksDB_332_prot_totals['overall % sum']

PeaksDB_332_prot_totals['overall % S'] = PeaksDB_332_prot_totals['% S total'] / PeaksDB_332_prot_totals['overall % sum']

PeaksDB_332_prot_totals['overall % c'] = PeaksDB_332_prot_totals['% c total'] / PeaksDB_332_prot_totals['overall % sum']

# write to csv

PeaksDB_332_prot_totals.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/Proteins-Proteus2/PeaksDB_332_prot_undigested_totals")

PeaksDB_332_prot_totals.head()

| | % C total | % H total | % E total | % T total | % B total | % S total | % c total | % check sum | overall % sum | overall % C | overall % H | overall % E | overall % T | overall % B | overall % S | overall % c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 332 total | 43.865511 | 29.476685 | 9.121874 | 12.535931 | 0.0 | 0.0 | 0.0 | 95.0 | 95.0 | 0.461742 | 0.310281 | 0.09602 | 0.131957 | 0.0 | 0.0 | 0.0 |
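
The totals above sum the per-row fractions and then divide by their sum, so every row carries the same weight whatever its length. A short final line such as `PIVGEM` (6 residues) therefore counts as much as a 50-residue line. If a per-residue composition is wanted instead, one alternative is to pool the counts over all rows before dividing. The sketch below illustrates the difference on two made-up prediction strings; `residue_weighted_composition` is a hypothetical helper and not part of the original analysis.

```python
from collections import Counter


def residue_weighted_composition(predictions):
    """Fraction of each code over all residues pooled across the given strings."""
    counts = Counter()
    for pred in predictions:
        counts.update(pred)  # count every character of every prediction string
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Made-up rows: one long, mostly coil; one short, all helix.
rows = ["CCCCCCCCHH", "HH"]

# Row-averaged helix fraction, as in the totals above: (0.2 + 1.0) / 2
row_mean_H = sum(p.count("H") / len(p) for p in rows) / len(rows)

# Residue-weighted helix fraction: 4 helix residues out of 12 in total
weighted = residue_weighted_composition(rows)

print(row_mean_H)
print(weighted["H"])
```

With the notebook's data frame, the column `PeaksDB_332_prot['Secondary structure pred.']` could be passed in place of `rows`. Which weighting is appropriate depends on the question being asked, so this is offered as a comparison rather than a correction.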
