## PROTEUS2 peptide and protein secondary structure analysis:

### This is a notebook to organize and evaluate the output of peptide secondary structure prediction by [Proteus2](http://www.proteus2.ca/proteus2/), described in [Montgomerie et al., 2008](https://academic.oup.com/nar/article/36/suppl_2/W202/2506231) in _Nucleic Acids Research_. 

### Proteus2 delivers its output by email in nominally FASTA format, but with extra line breaks and spaces inside the peptide and prediction sequences.

#### They look like this:

#### >1

LPQVEGTGGD VQPSQDLVR
CCCCCCCCCC CCCCCCCCC

#### >2

VIGQNEAVDA VSNAIR
CCCCCCHHHH HHHCCC

#### >3

AIDLIDEAAS SIR
CCCCCCCCCC CCC


### Output means:

- H = Helix
- E = Beta Strand
- C = Coil
- T = Membrane helix
- B = Membrane strand
- S = Signal peptide
- c = Cleavage site
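
The legend above can also be written as a lookup table, which makes it easy to check that a prediction string contains only known codes and has one state per residue. This is a minimal sketch; the names `STATE_CODES` and `check_prediction` are illustrative and are not part of Proteus2 or of this notebook.

```python
# Proteus2 state codes, copied from the legend above.
STATE_CODES = {
    "H": "Helix",
    "E": "Beta strand",
    "C": "Coil",
    "T": "Membrane helix",
    "B": "Membrane strand",
    "S": "Signal peptide",
    "c": "Cleavage site",
}


def check_prediction(peptide, prediction):
    """Return a list of problems found; an empty list means the pair looks consistent."""
    problems = []
    # Each residue should have exactly one predicted state.
    if len(peptide) != len(prediction):
        problems.append(
            f"length mismatch: {len(peptide)} residues vs {len(prediction)} states"
        )
    # Any character outside the legend is flagged.
    unknown = sorted(set(prediction) - set(STATE_CODES))
    if unknown:
        problems.append(f"unknown state codes: {unknown}")
    return problems


print(check_prediction("VIGQNEAVDAVSNAIR", "CCCCCCHHHHHHHCCC"))  # prints []
```

Note that the codes are case-sensitive: `C` is coil, while `c` marks a cleavage site.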

In [1]:
cd /home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/

In [2]:
# LIBRARIES
import os
os.getcwd()
# pandas for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# regular expressions (regex)
import re
# check pandas version
pd.__version__

'1.0.5'

In [3]:
ls

PeaksDB_322_trypsin_totals
PeaksDB_323_trypsin_totals
PeaksDB_324_trypsin_totals
PeaksDB_325_trypsin_totals
PeaksDB_329_undigested_totals
PeaksDB_330_undigested_totals
T0_322_trypsin_PeaksDB_proteus
T0_322_trypsin_PeaksDB_proteus_sort.csv
T0_322_trypsin_PeaksDB_proteus.txt
T0_329_undigested_PeaksDB_proteus
T0_329_undigested_PeaksDB_proteus_sort.csv
T0_329_undigested_PeaksDB_proteus.txt
T12_325_trypsin_PeaksDB_proteus
T12_325_trypsin_PeaksDB_proteus_sort.csv
T12_325_trypsin_PeaksDB_proteus.txt
T12_332_undigested_PeaksDB_proteus
T12_332_undigested_PeaksDB_proteus_sort.csv
T12_332_undigested_PeaksDB_proteus.txt
T2_323_trypsin_PeaksDB_proteus
T2_323_trypsin_PeaksDB_proteus_sort.csv
T2_323_trypsin_PeaksDB_proteus.txt
T2_330_undigested_PeaksDB_proteus
T2_330_undigested_PeaksDB_proteus_sort.csv
T2_330_undigested_PeaksDB_proteus.txt
T5_324_trypsin_PeaksDB_proteus
T5_324_trypsin_PeaksDB_proteus_sort.csv
T5_324_trypsin_PeaksDB_proteus.txt
T5_331_undigested_PeaksDB_prot

In [4]:
!head T5_331_undigested_PeaksDB_proteus

>1

LPQVEGTGGD VQPSQDLVR
CCCCCCCCCC CCCCCCCCC

>2

VIGQNEAVDA VSNAIR
CCCCCCHHHH HHHCCC



In [5]:
# remove empty lines
# remove blanks (spaces and tabs) from the remaining lines
# collapse the 3 lines for each peptide (>#, peptide seq., secondary structure pred.) into one space-separated line

!sed '/^[[:space:]]*$/d' T5_331_undigested_PeaksDB_proteus \
| tr -d "[:blank:]" > T5_331_undigested_PeaksDB_proteus.txt

!awk '{printf "%s%s",$0,(NR%3?FS:RS)}' T5_331_undigested_PeaksDB_proteus.txt > \
T5_331_undigested_PeaksDB_proteus_sort.csv
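
The same clean-up can be done without leaving Python. The sketch below is an alternative to the `sed`/`tr`/`awk` pipeline, not the method used in this notebook. `parse_proteus2` is a hypothetical helper name, and the code assumes every record is exactly three non-empty lines (header, peptide, prediction), as in the `head` output above.

```python
def parse_proteus2(text):
    """Parse Proteus2 email output into (header, peptide, prediction) tuples.

    Assumes each record is three non-empty lines: '>N', the peptide sequence,
    and the prediction string, with optional blanks inside the last two.
    """
    # Drop empty lines and remove all whitespace inside the remaining lines.
    lines = ["".join(line.split()) for line in text.splitlines() if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected a multiple of 3 non-empty lines")
    records = []
    for i in range(0, len(lines), 3):
        header, peptide, prediction = lines[i : i + 3]
        if not header.startswith(">"):
            raise ValueError(f"expected a '>' header, got {header!r}")
        records.append((header.lstrip(">"), peptide, prediction))
    return records


example = """>1

LPQVEGTGGD VQPSQDLVR
CCCCCCCCCC CCCCCCCCC

>2

VIGQNEAVDA VSNAIR
CCCCCCHHHH HHHCCC
"""

for record in parse_proteus2(example):
    print(record)
```

Unlike the `awk` step, this version raises an error when the three-line assumption is violated instead of silently misaligning records.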

In [6]:
# read into pandas, splitting on whitespace (the file is space-delimited despite its .csv extension)

PeaksDB_331 = pd.read_csv("T5_331_undigested_PeaksDB_proteus_sort.csv", sep=r"\s+", header=None)

# delete the FASTA header column (">1", ">2", ...)

del PeaksDB_331[0]

# name columns

PeaksDB_331.columns = ['Stripped peptide sequence', 'Secondary structure pred.']

In [7]:
PeaksDB_331.head()

| | Stripped peptide sequence | Secondary structure pred. |
|---|---|---|
| 0 | LPQVEGTGGDVQPSQDLVR | CCCCCCCCCCCCCCCCCCC |
| 1 | VIGQNEAVDAVSNAIR | CCCCCCHHHHHHHCCC |
| 2 | AIDLIDEAASSIR | CCCCCCCCCCCCC |
| 3 | VTDAEIAEVLAR | CCHHHHHHHHCC |
| 4 | TGLIPHPVPL | CCCCCCCCCC |


In [8]:
# add a column with the stripped peptide length (number of AAs)
PeaksDB_331['Peptide length'] = PeaksDB_331['Stripped peptide sequence'].apply(len)

# use a count function to enumerate the # of C's (coil residues) for each peptide
PeaksDB_331['C'] = PeaksDB_331['Secondary structure pred.'].str.count("C")

# use a count function to enumerate the # of H's (helix residues) in each peptide
PeaksDB_331['H'] = PeaksDB_331['Secondary structure pred.'].str.count("H")

# use a count function to enumerate the # of E's (beta strand residues) in each peptide
PeaksDB_331['E'] = PeaksDB_331['Secondary structure pred.'].str.count("E")

# use a count function to enumerate the # of T's (membrane helix residues) in each peptide
#PeaksDB_331['T'] = PeaksDB_331['Secondary structure pred.'].str.count("T")

# use a count function to enumerate the # of B's (membrane strand residues) in each peptide
#PeaksDB_331['B'] = PeaksDB_331['Secondary structure pred.'].str.count("B")

# use a count function to enumerate the # of S's (signal peptide residues) in each peptide
#PeaksDB_331['S'] = PeaksDB_331['Secondary structure pred.'].str.count("S")

# use a count function to enumerate the # of c's (cleavage site residues) in each peptide
#PeaksDB_331['c'] = PeaksDB_331['Secondary structure pred.'].str.count("c")

# add a column with the fraction of coil residues (0 to 1; the column is labeled '% C')
PeaksDB_331['% C'] = PeaksDB_331['C'] / PeaksDB_331['Peptide length']

# add a column with the fraction of helix residues
PeaksDB_331['% H'] = PeaksDB_331['H'] / PeaksDB_331['Peptide length']

# add a column with the fraction of beta strand residues
PeaksDB_331['% E'] = PeaksDB_331['E'] / PeaksDB_331['Peptide length']

# additive check: the three fractions should sum to 1 when only C, H and E occur

PeaksDB_331['% check'] = PeaksDB_331['% C'] + PeaksDB_331['% H'] + PeaksDB_331['% E']
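
The count and fraction columns follow the same pattern for every state, so they can also be generated in a loop. This is a sketch that assumes the column names used above; the two-row frame `demo` and the `states` list are made up for illustration.

```python
import pandas as pd

# Small stand-in for PeaksDB_331 with the same two input columns.
demo = pd.DataFrame(
    {
        "Stripped peptide sequence": ["VIGQNEAVDAVSNAIR", "VTDAEIAEVLAR"],
        "Secondary structure pred.": ["CCCCCCHHHHHHHCCC", "CCHHHHHHHHCC"],
    }
)

demo["Peptide length"] = demo["Stripped peptide sequence"].str.len()

# States to tally; extend with "T", "B", "S", "c" if they occur in the data.
states = ["C", "H", "E"]

for state in states:
    # Number of residues assigned to this state in each peptide.
    demo[state] = demo["Secondary structure pred."].str.count(state)
    # Fraction (0 to 1) of the peptide's residues in this state.
    demo[f"% {state}"] = demo[state] / demo["Peptide length"]

# Fractions should add up to 1 when no other state codes are present.
demo["% check"] = demo[[f"% {s}" for s in states]].sum(axis=1)

print(demo[["Peptide length", "C", "H", "E", "% check"]])
```

A `% check` below 1 for any row would indicate that the prediction contains a state that is not in `states`.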

In [9]:
PeaksDB_331.head()

| | Stripped peptide sequence | Secondary structure pred. | Peptide length | C | H | E | % C | % H | % E | % check |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LPQVEGTGGDVQPSQDLVR | CCCCCCCCCCCCCCCCCCC | 19 | 19 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | VIGQNEAVDAVSNAIR | CCCCCCHHHHHHHCCC | 16 | 9 | 7 | 0 | 0.5625 | 0.4375 | 0.0 | 1.0 |
| 2 | AIDLIDEAASSIR | CCCCCCCCCCCCC | 13 | 13 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | VTDAEIAEVLAR | CCHHHHHHHHCC | 12 | 4 | 8 | 0 | 0.333333 | 0.666667 | 0.0 | 1.0 |
| 4 | TGLIPHPVPL | CCCCCCCCCC | 10 | 10 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 |


In [10]:
index = ['331 total']

# sum the per-peptide fractions over all peptides
# ('% check sum' equals the number of peptides when every per-peptide check is 1)
data = {
        '% C total': PeaksDB_331['% C'].sum(),
        '% H total': PeaksDB_331['% H'].sum(),
        '% E total': PeaksDB_331['% E'].sum(),
        '% check sum': PeaksDB_331['% check'].sum()
       }

PeaksDB_331_totals = pd.DataFrame(data, columns=['% C total', '% H total', '% E total', '% check sum'],
                                  index=index)

# normalize by the summed fractions so the overall C, H and E values add up to 1
PeaksDB_331_totals['overall % sum'] = (PeaksDB_331_totals['% C total'] + PeaksDB_331_totals['% H total']
                                       + PeaksDB_331_totals['% E total'])

PeaksDB_331_totals['overall % C'] = PeaksDB_331_totals['% C total'] / PeaksDB_331_totals['overall % sum']

PeaksDB_331_totals['overall % H'] = PeaksDB_331_totals['% H total'] / PeaksDB_331_totals['overall % sum']

PeaksDB_331_totals['overall % E'] = PeaksDB_331_totals['% E total'] / PeaksDB_331_totals['overall % sum']

# write to csv

PeaksDB_331_totals.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/PeaksDB_331_undigested_totals")

PeaksDB_331_totals.head()

| | % C total | % H total | % E total | % check sum | overall % sum | overall % C | overall % H | overall % E |
|---|---|---|---|---|---|---|---|---|
| 331 total | 199.424825 | 2.269384 | 15.305791 | 217.0 | 217.0 | 0.919008 | 0.010458 | 0.070534 |
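
Because the totals sum per-peptide fractions, `overall % C` is the unweighted mean of the per-peptide coil fractions (provided every `% check` is 1): each peptide counts equally regardless of its length. If a residue-level composition is wanted instead, the counts can be summed before dividing. The sketch below shows the difference on two rows taken from the table above; `demo` is an illustrative frame, not the real data, and the pooled calculation is an alternative rather than what this notebook computes.

```python
import pandas as pd

# Two peptides from the preview above: lengths and per-state residue counts.
demo = pd.DataFrame(
    {
        "Peptide length": [19, 12],
        "C": [19, 4],
        "H": [0, 8],
        "E": [0, 0],
    }
)

states = ["C", "H", "E"]

# Approach used above: average the per-peptide fractions (peptides weighted equally).
per_peptide = demo[states].div(demo["Peptide length"], axis=0)
mean_of_fractions = per_peptide.mean()

# Alternative: pool residues across peptides (longer peptides weigh more).
pooled = demo[states].sum() / demo["Peptide length"].sum()

print(pd.DataFrame({"mean of fractions": mean_of_fractions, "pooled residues": pooled}))
```

Which of the two is appropriate depends on whether the question is about peptides or about residues; the two agree only when all peptides have the same length or the same composition.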
