### Used the Galaxy manipulation tool `Filter sequences by ID from a tabular file` to get just the _T. weissflogii_ protein sequences that are 1) found on Day 12 and 2) from the chloroplast (most of them: all but 3 proteins).

### PROTEUS2 protein secondary structure analysis of remnant (survivor) proteins:

### This is a notebook to combine and evaluate the protein secondary structure output for Day 12 chloroplast protein sequences.

### Output means:

- H = Helix
- E = Beta Strand
- C = Coil
- T = Membrane helix
- B = Membrane strand
- S = Signal peptide
- c = Cleavage site

In [1]:
cd /home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/Proteins-Proteus2/

/home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/Proteins-Proteus2


In [2]:
# LIBRARIES
import os
os.getcwd()
# import pandas library for working with tabular data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
# import regular expression (regex) library
import re
# check pandas version
pd.__version__

'1.0.5'

### Galaxy wasn't letting me download the filtered FASTA (see above), so I did so manually for the Day 12 DB diatom chloroplast/membrane proteins. The file is called `T12-325-trypsin-DB-diatom-membrane.fasta`.

### Still have to remove any `X` residues, though, because Proteus2 will not accept sequences with them. Using `sed`:

In [4]:
# use sed to remove all occurrences of X

!sed 's/X//g' T12-325-trypsin-DB-diatom-membrane.fasta > T12-325-trypsin-DB-diatom-membrane-noX.fasta
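
The `sed` substitution above acts on every line, header lines included. That is harmless here because the `>Thalassiosira_weissflogii_...` headers contain no `X`, but a header containing an `X` would be altered too. A minimal Python sketch that only touches sequence lines is shown below; the function name `strip_x_from_fasta` and the example records are hypothetical, made up for illustration.

```python
def strip_x_from_fasta(lines):
    """Remove 'X' residues from sequence lines, leaving '>' header lines untouched."""
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # header line: keep as-is, even if it contains an X
            cleaned.append(line)
        else:
            # sequence line: drop every X residue
            cleaned.append(line.replace("X", ""))
    return cleaned


# made-up records, only to show the behaviour
example = [">Protein_X1", "MKSXAIIAXSL", ">Protein_2", "XXMMKLA"]
print(strip_x_from_fasta(example))
```

To apply this to a file, the lines could be read with `open(...)` and the cleaned list written back out with one record line per output line.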

In [5]:
# take a look

!head -30 T12-325-trypsin-DB-diatom-membrane-noX.fasta

>Thalassiosira_weissflogii_0171421366 
MKSAIIASLIAGAAAFAPTQTGRTPTSVSESKADLEAIASKANPFVKFYDPLNLAEQEFWGKSNEETIAWLRQSEIKHGRIAMFAFVGYIVQSNFVFPWAQTLAGAPHPSADLSPEAQWDAIPLGAKWQIFAVISMLELWDECGGGGAMPHYTKGRQPGKYPPFTLFRENVHFVLDLYDPFGFNKNMSAETKERRLVAELNNGRLAQLGIFGFLCADKIPGSVPVLNSIAIPYDGNPMIPFEGQFSYFN
>Thalassiosira_weissflogii_0201829098 
MMKLAALAALAGSAAAFAPSTGGRAVTSLSEKSVSLPFLETPPNTAGYVGDVGFDPFRFSDFIPVDFLREAELKHGRICMMAWLGFVAVDCGFRVYPIPEGFEGITAVQAHDAAVEQGGLSQMLLWFSLAEVFDLIAVVQMLEGSGRAPGDFGLDGGFLKGKSAAEIETMKLKEITHCRLAMFAFSGVVTQAVLTQGPFPYV
>Thalassiosira_weissflogii_0201860448 
HQRNLKMKLAVLAALFGSAAAFAPAQTGKATSALNAFESELGAQPPLGFFDPLGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPGNIDYAGNSFDSFPNGWAAILDAAFAPAQTGKATSALNAFESELGAQPPLGFFDPLGLLDDADQERFDRLRYVEIKHGRIAQLAFLGNIITRAGVHLPGNIDYAGNSFDSFPNGWAAISGPDAISGSGLGQIVAFVGFLELFVMKDVTGEGEFVGDFRNGALDFGWDKFDAETKLSKRAIELNNGRAAMMGILGLMVHEQLGGSLPIVGEM
>Thalassiosira_weissflogii_0203152572 
WAAISGPDAISGSGLGQIVAFVGFLELFVMKDVTGEGEFVGDFRNGALDFGWDKFDAETKLSKRAIELNNGRAAMMG

### This worked! Proteus2 accepted the submission. The results arrive in an email from Proteus2 in nominally FASTA format, but with extra line breaks and spaces inside the protein and prediction sequences. The sequence name also gets cut off.

#### They look like this:

#### >Thalas

MMKLAALAAL MGSAAAFAPA QTGKASTQLR AFEDELGAQP PLGFFDPFGM 

CCHHHHHHHH HHHHHCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC 

LSGDCTQERF DRLRYVEIKH GRICMLAFLG QIVTRAGIHL PGSINYAGDS 

CCCCCCHHHH HHHHHHHHHH HHHHHHHHHH HHHHHHHCCC CCCCCCCCCC 

FDSFPNGVAA LFGPNSIPTA GLVQIIAFIG VLECAFMRDV PGTGNEFVGD 

CCCCCCCCCC CCCCCCCCHH HHHHHHHHHH HHHHHHHHCC CCCCCCCCCC 

FRNGYIDFGW DDFDEETKLQ KRAIQSGTIS NMMKLAALAA LMGSAAAFAP 

CCCCCCCCCC CCCCHHHHHH HHHHHHHHHH HHHHHHHHHC CCCCCCCCCC 



In [6]:
!head T12-325-trypsin-DB-diatom-membrane-proteins

>Thalas

MKSAIIASLI AGAAAFAPTQ TGRTPTSVSE SKADLEAIAS KANPFVKFYD 
CCHHHHHHHH HHCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC 

PLNLAEQEFW GKSNEETIAW LRQSEIKHGR IAMFAFVGYI VQSNFVFPWA 
CCCCCCCCCC CCCCHHHHHH HHHHHHHHHH HHHHHHHHHH HHHHHCCCCC 

QTLAGAPHPS ADLSPEAQWD AIPLGAKWQI FAVISMLELW DECGGGGAMP 
CCCCCCCCCC CCCCCCCCCC CCCHHHHHHH HHHHHHHHHH HHHHHHHHCC 


In [7]:
# remove empty lines
# remove the '>' protein sequence name lines
# remove spaces and tabs from lines
# collapse each pair of lines (protein seq., secondary structure pred.) into one line

!sed '/^[[:space:]]*$/d' T12-325-trypsin-DB-diatom-membrane-proteins \
| sed '/>/d' \
| tr -d "[:blank:]" > T12-325-trypsin-DB-diatom-membrane-proteins.txt

!awk '{printf "%s%s",$0,(NR%2?FS:RS)}' T12-325-trypsin-DB-diatom-membrane-proteins.txt > \
T12-325-trypsin-DB-diatom-membrane-proteins_sort.csv
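
The `sed`/`awk` pipeline above discards the `>` header lines, so the resulting rows no longer record which protein each line of residues came from. If per-protein numbers are wanted later, the same cleanup could be done in Python while keeping record boundaries. The sketch below assumes the layout seen in the `head` output (a `>` header, then alternating sequence and prediction lines, with blank lines and spaces to ignore); `parse_proteus2` is a hypothetical helper name, not part of Proteus2 or any library.

```python
def parse_proteus2(lines):
    """Parse Proteus2 email-style output into one record per '>' header.

    Assumes that, once blank lines are dropped, each record is a header line
    followed by alternating sequence / prediction lines.
    """
    records = []
    current = None
    expecting_sequence = True
    for raw in lines:
        line = "".join(raw.split())  # drop spaces, tabs and the trailing newline
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # start a new record; the next non-blank line is a sequence line
            current = {"name": line[1:], "sequence": "", "prediction": ""}
            records.append(current)
            expecting_sequence = True
        elif current is not None:
            key = "sequence" if expecting_sequence else "prediction"
            current[key] += line
            expecting_sequence = not expecting_sequence
    return records


# shortened, made-up excerpt in the same layout as the Proteus2 output
example = [
    ">Thalas\n",
    "\n",
    "MKSAIIASLI AGAAAFAPTQ \n",
    "CCHHHHHHHH HHCCCCCCCC \n",
    "\n",
    "TGRTP\n",
    "CCCCC\n",
]
for record in parse_proteus2(example):
    print(record["name"], len(record["sequence"]), len(record["prediction"]))
```

Because every name is truncated to `Thalas`, the names are not unique; records would have to be matched back to the submitted FASTA by their order or by their sequence.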

In [8]:
!head T12-325-trypsin-DB-diatom-membrane-proteins_sort.csv

MKSAIIASLIAGAAAFAPTQTGRTPTSVSESKADLEAIASKANPFVKFYD CCHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
PLNLAEQEFWGKSNEETIAWLRQSEIKHGRIAMFAFVGYIVQSNFVFPWA CCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCC
QTLAGAPHPSADLSPEAQWDAIPLGAKWQIFAVISMLELWDECGGGGAMP CCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHCC
HYTKGRQPGKYPPFTLFRENVHFVLDLYDPFGFNKNMSAETKERRLVAEL CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHH
NNGRLAQLGIFGFLCADKIPGSVPVLNSIAIPYDGNPMIPFEGQFSYFN HHHHHHHHHHHHHHHHHHHCCCCCEECCCCCCCCCCCCEEEECCCCCCC
MMKLAALAALAGSAAAFAPSTGGRAVTSLSEKSVSLPFLETPPNTAGYVG CCHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DVGFDPFRFSDFIPVDFLREAELKHGRICMMAWLGFVAVDCGFRVYPIPE CCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCC
GFEGITAVQAHDAAVEQGGLSQMLLWFSLAEVFDLIAVVQMLEGSGRAPG CCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCC
DFGLDGGFLKGKSAAEIETMKLKEITHCRLAMFAFSGVVTQAVLTQGPFP CCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCC
HQRNLKMKLAVLAALFGSAAAFAPAQTGKATSALNAFESELGAQPPLGFF CCEEEEEHHHHHHHHHHHHHCCCC

In [9]:
# read into pandas with whitespace delimiter (the file is space-separated despite its .csv extension)

PeaksDB_325_mem_prot = pd.read_csv("T12-325-trypsin-DB-diatom-membrane-proteins_sort.csv", delim_whitespace=True, header=None)


# name columns

PeaksDB_325_mem_prot.columns =['Stripped protein sequence', 'Secondary structure pred.'] 

In [10]:
PeaksDB_325_mem_prot.head(13)

Unnamed: 0,Stripped protein sequence,Secondary structure pred.
0,MKSAIIASLIAGAAAFAPTQTGRTPTSVSESKADLEAIASKANPFV...,CCHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...
1,PLNLAEQEFWGKSNEETIAWLRQSEIKHGRIAMFAFVGYIVQSNFV...,CCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHC...
2,QTLAGAPHPSADLSPEAQWDAIPLGAKWQIFAVISMLELWDECGGG...,CCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHH...
3,HYTKGRQPGKYPPFTLFRENVHFVLDLYDPFGFNKNMSAETKERRL...,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHH...
4,NNGRLAQLGIFGFLCADKIPGSVPVLNSIAIPYDGNPMIPFEGQFSYFN,HHHHHHHHHHHHHHHHHHHCCCCCEECCCCCCCCCCCCEEEECCCCCCC
5,MMKLAALAALAGSAAAFAPSTGGRAVTSLSEKSVSLPFLETPPNTA...,CCHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...
6,DVGFDPFRFSDFIPVDFLREAELKHGRICMMAWLGFVAVDCGFRVY...,CCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCC...
7,GFEGITAVQAHDAAVEQGGLSQMLLWFSLAEVFDLIAVVQMLEGSG...,CCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCC...
8,DFGLDGGFLKGKSAAEIETMKLKEITHCRLAMFAFSGVVTQAVLTQ...,CCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCC...
9,HQRNLKMKLAVLAALFGSAAAFAPAQTGKATSALNAFESELGAQPP...,CCEEEEEHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCC...


In [11]:
# add a column with the stripped sequence length (number of AAs) for each row
PeaksDB_325_mem_prot['Sequence length'] = PeaksDB_325_mem_prot['Stripped protein sequence'].apply(len)

# Proteus2 codes: C = coil, H = helix, E = beta strand, T = membrane helix,
# B = membrane strand, S = signal peptide, c = cleavage site
codes = ['C', 'H', 'E', 'T', 'B', 'S', 'c']

# count the residues assigned to each code in each row
for code in codes:
    PeaksDB_325_mem_prot[code] = PeaksDB_325_mem_prot['Secondary structure pred.'].str.count(code)

# add a '% <code>' column for each code; values are fractions (0 to 1) of the row's residues
for code in codes:
    PeaksDB_325_mem_prot['% ' + code] = PeaksDB_325_mem_prot[code] / PeaksDB_325_mem_prot['Sequence length']

# additive check: the fractions should sum to 1.0 in every row

PeaksDB_325_mem_prot['% check'] = PeaksDB_325_mem_prot['% C'] + PeaksDB_325_mem_prot['% H'] + PeaksDB_325_mem_prot['% E'] \
                                + PeaksDB_325_mem_prot['% T'] + PeaksDB_325_mem_prot['% B'] + \
                                PeaksDB_325_mem_prot['% S'] + PeaksDB_325_mem_prot['% c']

In [12]:
PeaksDB_325_mem_prot.head()

Unnamed: 0,Stripped protein sequence,Secondary structure pred.,Sequence length,C,H,E,T,B,S,c,% C,% H,% E,% T,% B,% S,% c,% check
0,MKSAIIASLIAGAAAFAPTQTGRTPTSVSESKADLEAIASKANPFV...,CCHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,50,40,10,0,0,0,0,0,0.8,0.2,0.0,0.0,0.0,0.0,0.0,1.0
1,PLNLAEQEFWGKSNEETIAWLRQSEIKHGRIAMFAFVGYIVQSNFV...,CCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHC...,50,19,31,0,0,0,0,0,0.38,0.62,0.0,0.0,0.0,0.0,0.0,1.0
2,QTLAGAPHPSADLSPEAQWDAIPLGAKWQIFAVISMLELWDECGGG...,CCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHH...,50,25,25,0,0,0,0,0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,1.0
3,HYTKGRQPGKYPPFTLFRENVHFVLDLYDPFGFNKNMSAETKERRL...,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHH...,50,38,12,0,0,0,0,0,0.76,0.24,0.0,0.0,0.0,0.0,0.0,1.0
4,NNGRLAQLGIFGFLCADKIPGSVPVLNSIAIPYDGNPMIPFEGQFSYFN,HHHHHHHHHHHHHHHHHHHCCCCCEECCCCCCCCCCCCEEEECCCCCCC,49,24,19,6,0,0,0,0,0.489796,0.387755,0.122449,0.0,0.0,0.0,0.0,1.0
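
As a cross-check on the counting logic, the per-row numbers can be reproduced without pandas. The sketch below uses `collections.Counter` on a prediction string built to have the same composition as row 0 of the table above (2 coil, 10 helix, 38 coil). `structure_fractions` is a hypothetical helper, and the values it returns are fractions between 0 and 1, like the `% ...` columns.

```python
from collections import Counter

CODES = "CHETBSc"  # Proteus2 structure codes listed at the top of the notebook


def structure_fractions(prediction):
    """Return {code: fraction of residues} for one prediction string."""
    counts = Counter(prediction)
    unexpected = set(counts) - set(CODES)
    if unexpected:
        # anything outside the known codes would make the additive check fall below 1.0
        raise ValueError(f"unexpected prediction characters: {sorted(unexpected)}")
    return {code: counts[code] / len(prediction) for code in CODES}


row0 = "CC" + "H" * 10 + "C" * 38  # same composition as row 0 above
fractions = structure_fractions(row0)
print(fractions["C"], fractions["H"])  # 0.8 0.2
```

Raising on unknown characters plays the same role as the `% check` column: a row whose fractions do not add up to 1.0 contains a character that was not counted.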


In [14]:
index = ['325 total']

data = {
        '% C total': PeaksDB_325_mem_prot['% C'].sum(),
        '% H total': PeaksDB_325_mem_prot['% H'].sum(),
        '% E total': PeaksDB_325_mem_prot['% E'].sum(),
        '% T total': PeaksDB_325_mem_prot['% T'].sum(),
        '% B total': PeaksDB_325_mem_prot['% B'].sum(),
        '% S total': PeaksDB_325_mem_prot['% S'].sum(),
        '% c total': PeaksDB_325_mem_prot['% c'].sum(),
        '% check sum': PeaksDB_325_mem_prot['% check'].sum()
       }

PeaksDB_325_mem_prot_totals = pd.DataFrame(data, columns=['% C total', '% H total', '% E total', '% T total', \
                                                      '% B total',  '% S total', '% c total', \
                                                      '% check sum'], index=index)

PeaksDB_325_mem_prot_totals['overall % sum'] = PeaksDB_325_mem_prot_totals['% C total'] \
                                            + PeaksDB_325_mem_prot_totals['% H total'] \
                                            + PeaksDB_325_mem_prot_totals['% E total'] \
                                            + PeaksDB_325_mem_prot_totals['% T total'] \
                                            + PeaksDB_325_mem_prot_totals['% B total'] \
                                            + PeaksDB_325_mem_prot_totals['% S total'] \
                                            + PeaksDB_325_mem_prot_totals['% c total'] 


PeaksDB_325_mem_prot_totals['overall % C'] = PeaksDB_325_mem_prot_totals['% C total'] / PeaksDB_325_mem_prot_totals['overall % sum']

PeaksDB_325_mem_prot_totals['overall % H'] = PeaksDB_325_mem_prot_totals['% H total'] / PeaksDB_325_mem_prot_totals['overall % sum']

PeaksDB_325_mem_prot_totals['overall % E'] = PeaksDB_325_mem_prot_totals['% E total'] / PeaksDB_325_mem_prot_totals['overall % sum']

PeaksDB_325_mem_prot_totals['overall % T'] = PeaksDB_325_mem_prot_totals['% T total'] / PeaksDB_325_mem_prot_totals['overall % sum']

PeaksDB_325_mem_prot_totals['overall % B'] = PeaksDB_325_mem_prot_totals['% B total'] / PeaksDB_325_mem_prot_totals['overall % sum']

PeaksDB_325_mem_prot_totals['overall % S'] = PeaksDB_325_mem_prot_totals['% S total'] / PeaksDB_325_mem_prot_totals['overall % sum']

PeaksDB_325_mem_prot_totals['overall % c'] = PeaksDB_325_mem_prot_totals['% c total'] / PeaksDB_325_mem_prot_totals['overall % sum']

# write to csv

PeaksDB_325_mem_prot_totals.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/Proteins-Proteus2/Day12_325 PeaksDB_dia_mem_prot_trypsin_totals")

PeaksDB_325_mem_prot_totals.head()

Unnamed: 0,% C total,% H total,% E total,% T total,% B total,% S total,% c total,% check sum,overall % sum,overall % C,overall % H,overall % E,overall % T,overall % B,overall % S,overall % c
325 total,34.002081,24.316742,2.025002,6.656174,0.0,0.0,0.0,67.0,67.0,0.507494,0.362936,0.030224,0.099346,0.0,0.0,0.0
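
One thing to keep in mind when reading these totals: each row is one line of Proteus2 output (50 residues, or fewer at the end of a protein), and `overall % C` is the sum of the per-row fractions divided by `overall % sum`, which equals the number of rows when every row's check is 1.0. Short trailing rows therefore weigh as much as full 50-residue rows. A residue-weighted figure would divide total counts by total length instead. The sketch below, using made-up toy rows and a hypothetical `overall_fractions` helper, shows how the two can differ; it is an alternative calculation, not the one used in this notebook.

```python
def overall_fractions(predictions, codes="CHETBSc"):
    """Compare the mean of per-row fractions with residue-weighted fractions."""
    total_length = sum(len(p) for p in predictions)
    mean_of_rows = {}
    residue_weighted = {}
    for code in codes:
        # every row counts equally, whatever its length
        mean_of_rows[code] = sum(p.count(code) / len(p) for p in predictions) / len(predictions)
        # every residue counts equally
        residue_weighted[code] = sum(p.count(code) for p in predictions) / total_length
    return mean_of_rows, residue_weighted


# toy rows: one full 50-residue row and one short 10-residue row
rows = ["C" * 40 + "H" * 10, "E" * 8 + "C" * 2]
mean_of_rows, residue_weighted = overall_fractions(rows)
print(mean_of_rows["C"], residue_weighted["C"])
```

For the toy rows, coil is 0.5 as a mean of row fractions but 42/60 = 0.7 when weighted by residues. With the DataFrame in this notebook, the residue-weighted version would be along the lines of `PeaksDB_325_mem_prot['C'].sum() / PeaksDB_325_mem_prot['Sequence length'].sum()`.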


In [15]:
# Now for the 3 cytoplasm proteins in Day 12

!head T12-325-trypsin-DB-diatom-cytoplasm-proteins

>Thalas

MMKLVLLATL ASTATAFLSP FSTVQQQFRA PATRLFADEA EGDSENAPAP 
CEEEEEEHHH HHHHHHHCCC CCCCCCCCCC CCCCCCCCCC CCCCECCCCC 

RSAQELSALT SSVKTVFSLE DLAKVLPHRY PFLLVDKVVE YEMGKRAVGL 
CCCCCCCCCC CCCCCCCEHH HHHHHCCEEE CCCCEEECEE EECCCEEEEE 

KSVTLNEPHF TGHFPDRPLM PGVLQVEALA QLAGLVCLQM EGAEPGAVFF 
EEEECCCCHH HHCCCCCEEE CHHHHHHHHH HHHHHHHHHH CCCCCCCEEE 


In [16]:
# remove empty lines
# remove the '>' protein sequence name lines
# remove spaces and tabs from lines
# collapse each pair of lines (protein seq., secondary structure pred.) into one line

!sed '/^[[:space:]]*$/d' T12-325-trypsin-DB-diatom-cytoplasm-proteins \
| sed '/>/d' \
| tr -d "[:blank:]" > T12-325-trypsin-DB-diatom-cytoplasm-proteins.txt

!awk '{printf "%s%s",$0,(NR%2?FS:RS)}' T12-325-trypsin-DB-diatom-cytoplasm-proteins.txt > \
T12-325-trypsin-DB-diatom-cytoplasm-proteins_sort.csv

In [17]:
!head T12-325-trypsin-DB-diatom-cytoplasm-proteins_sort.csv

MMKLVLLATLASTATAFLSPFSTVQQQFRAPATRLFADEAEGDSENAPAP CEEEEEEHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCECCCCC
RSAQELSALTSSVKTVFSLEDLAKVLPHRYPFLLVDKVVEYEMGKRAVGL CCCCCCCCCCCCCCCCCEHHHHHHHCCEEECCCCEEECEEEECCCEEEEE
KSVTLNEPHFTGHFPDRPLMPGVLQVEALAQLAGLVCLQMEGAEPGAVFF EEEECCCCHHHHCCCCCEEECHHHHHHHHHHHHHHHHHHHCCCCCCCEEE
FAGVDGVKWKKPVVPGDTLVMEVELKKWNKRFGLATATGRSYVDGELAVE EEEEECEEEECEEECCEEEEEEEEEEEECCCCCEEEEEEEEEECCEEEEE
LDEMKFALAK EEEEEEEECC
MKFTLLTLSTLATTLAAFTPTSSSFVKTTPVAHRAASSSSTSLDMKYKVA CEEEEEEEEEEECCCCCCCCCCCCEEECCCCCCCCCCCCCCCCCCCEEEE
VVGGGPSGACAAELFAQEKGLDTVLFERKMDNAKPCGGALPLCMLGEFDL EECCCHHHHHHHHHHHHHCCEEEEEEECCCCCCCCCCCCCCHHHHHHHHH
PESTVDRKVRRMKLLSPTNVEVDLGDTLQPNEYLGMCRRELMDKFLRDRA CHHHHHCCCEEEEEECCCCEEEEEECCCCCCCEEEEEEHHHHHHHHHHHH
LSYGAEAVQGLVTSLDVPADHRSNENSRYTLHYQEYAEGTSAGVPKTMDV HHHEEEEECEEEEEEEEEEEEEEECCCEEEEEEEEECCCCCCCCEEEEEE
DLLVGADGANSRVAKAMDAGEYNFALAFQERLKLSEEKMQFYEEMAEMYV EEEEECCCCCCHHHHHCCCCCCCEEEEEEEEEECCCCCCCCCCCEEEEEE


In [18]:
# read into pandas with whitespace delimiter (the file is space-separated despite its .csv extension)

PeaksDB_325_cyt_prot = pd.read_csv("T12-325-trypsin-DB-diatom-cytoplasm-proteins_sort.csv", delim_whitespace=True, header=None)


# name columns

PeaksDB_325_cyt_prot.columns =['Stripped protein sequence', 'Secondary structure pred.'] 

In [19]:
PeaksDB_325_cyt_prot.head(13)

Unnamed: 0,Stripped protein sequence,Secondary structure pred.
0,MMKLVLLATLASTATAFLSPFSTVQQQFRAPATRLFADEAEGDSEN...,CEEEEEEHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCEC...
1,RSAQELSALTSSVKTVFSLEDLAKVLPHRYPFLLVDKVVEYEMGKR...,CCCCCCCCCCCCCCCCCEHHHHHHHCCEEECCCCEEECEEEECCCE...
2,KSVTLNEPHFTGHFPDRPLMPGVLQVEALAQLAGLVCLQMEGAEPG...,EEEECCCCHHHHCCCCCEEECHHHHHHHHHHHHHHHHHHHCCCCCC...
3,FAGVDGVKWKKPVVPGDTLVMEVELKKWNKRFGLATATGRSYVDGE...,EEEEECEEEECEEECCEEEEEEEEEEEECCCCCEEEEEEEEEECCE...
4,LDEMKFALAK,EEEEEEEECC
5,MKFTLLTLSTLATTLAAFTPTSSSFVKTTPVAHRAASSSSTSLDMK...,CEEEEEEEEEEECCCCCCCCCCCCEEECCCCCCCCCCCCCCCCCCC...
6,VVGGGPSGACAAELFAQEKGLDTVLFERKMDNAKPCGGALPLCMLG...,EECCCHHHHHHHHHHHHHCCEEEEEEECCCCCCCCCCCCCCHHHHH...
7,PESTVDRKVRRMKLLSPTNVEVDLGDTLQPNEYLGMCRRELMDKFL...,CHHHHHCCCEEEEEECCCCEEEEEECCCCCCCEEEEEEHHHHHHHH...
8,LSYGAEAVQGLVTSLDVPADHRSNENSRYTLHYQEYAEGTSAGVPK...,HHHEEEEECEEEEEEEEEEEEEEECCCEEEEEEEEECCCCCCCCEE...
9,DLLVGADGANSRVAKAMDAGEYNFALAFQERLKLSEEKMQFYEEMA...,EEEEECCCCCCHHHHHCCCCCCCEEEEEEEEEECCCCCCCCCCCEE...


In [20]:
# add a column with the stripped sequence length (number of AAs) for each row
PeaksDB_325_cyt_prot['Sequence length'] = PeaksDB_325_cyt_prot['Stripped protein sequence'].apply(len)

# Proteus2 codes: C = coil, H = helix, E = beta strand, T = membrane helix,
# B = membrane strand, S = signal peptide, c = cleavage site
codes = ['C', 'H', 'E', 'T', 'B', 'S', 'c']

# count the residues assigned to each code in each row
for code in codes:
    PeaksDB_325_cyt_prot[code] = PeaksDB_325_cyt_prot['Secondary structure pred.'].str.count(code)

# add a '% <code>' column for each code; values are fractions (0 to 1) of the row's residues
for code in codes:
    PeaksDB_325_cyt_prot['% ' + code] = PeaksDB_325_cyt_prot[code] / PeaksDB_325_cyt_prot['Sequence length']

# additive check: the fractions should sum to 1.0 in every row

PeaksDB_325_cyt_prot['% check'] = PeaksDB_325_cyt_prot['% C'] + PeaksDB_325_cyt_prot['% H'] + PeaksDB_325_cyt_prot['% E'] \
                                + PeaksDB_325_cyt_prot['% T'] + PeaksDB_325_cyt_prot['% B'] + \
                                PeaksDB_325_cyt_prot['% S'] + PeaksDB_325_cyt_prot['% c']

In [21]:
PeaksDB_325_cyt_prot.head()

Unnamed: 0,Stripped protein sequence,Secondary structure pred.,Sequence length,C,H,E,T,B,S,c,% C,% H,% E,% T,% B,% S,% c,% check
0,MMKLVLLATLASTATAFLSPFSTVQQQFRAPATRLFADEAEGDSEN...,CEEEEEEHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCEC...,50,33,10,7,0,0,0,0,0.66,0.2,0.14,0.0,0.0,0.0,0.0,1.0
1,RSAQELSALTSSVKTVFSLEDLAKVLPHRYPFLLVDKVVEYEMGKR...,CCCCCCCCCCCCCCCCCEHHHHHHHCCEEECCCCEEECEEEECCCE...,50,27,7,16,0,0,0,0,0.54,0.14,0.32,0.0,0.0,0.0,0.0,1.0
2,KSVTLNEPHFTGHFPDRPLMPGVLQVEALAQLAGLVCLQMEGAEPG...,EEEECCCCHHHHCCCCCEEECHHHHHHHHHHHHHHHHHHHCCCCCC...,50,17,23,10,0,0,0,0,0.34,0.46,0.2,0.0,0.0,0.0,0.0,1.0
3,FAGVDGVKWKKPVVPGDTLVMEVELKKWNKRFGLATATGRSYVDGE...,EEEEECEEEECEEECCEEEEEEEEEEEECCCCCEEEEEEEEEECCE...,50,11,0,39,0,0,0,0,0.22,0.0,0.78,0.0,0.0,0.0,0.0,1.0
4,LDEMKFALAK,EEEEEEEECC,10,2,0,8,0,0,0,0,0.2,0.0,0.8,0.0,0.0,0.0,0.0,1.0


In [22]:
index = ['325 total']

data = {
        '% C total': PeaksDB_325_cyt_prot['% C'].sum(),
        '% H total': PeaksDB_325_cyt_prot['% H'].sum(),
        '% E total': PeaksDB_325_cyt_prot['% E'].sum(),
        '% T total': PeaksDB_325_cyt_prot['% T'].sum(),
        '% B total': PeaksDB_325_cyt_prot['% B'].sum(),
        '% S total': PeaksDB_325_cyt_prot['% S'].sum(),
        '% c total': PeaksDB_325_cyt_prot['% c'].sum(),
        '% check sum': PeaksDB_325_cyt_prot['% check'].sum()
       }

PeaksDB_325_cyt_prot_totals = pd.DataFrame(data, columns=['% C total', '% H total', '% E total', '% T total', \
                                                      '% B total',  '% S total', '% c total', \
                                                      '% check sum'], index=index)

PeaksDB_325_cyt_prot_totals['overall % sum'] = PeaksDB_325_cyt_prot_totals['% C total'] \
                                            + PeaksDB_325_cyt_prot_totals['% H total'] \
                                            + PeaksDB_325_cyt_prot_totals['% E total'] \
                                            + PeaksDB_325_cyt_prot_totals['% T total'] \
                                            + PeaksDB_325_cyt_prot_totals['% B total'] \
                                            + PeaksDB_325_cyt_prot_totals['% S total'] \
                                            + PeaksDB_325_cyt_prot_totals['% c total'] 


PeaksDB_325_cyt_prot_totals['overall % C'] = PeaksDB_325_cyt_prot_totals['% C total'] / PeaksDB_325_cyt_prot_totals['overall % sum']

PeaksDB_325_cyt_prot_totals['overall % H'] = PeaksDB_325_cyt_prot_totals['% H total'] / PeaksDB_325_cyt_prot_totals['overall % sum']

PeaksDB_325_cyt_prot_totals['overall % E'] = PeaksDB_325_cyt_prot_totals['% E total'] / PeaksDB_325_cyt_prot_totals['overall % sum']

PeaksDB_325_cyt_prot_totals['overall % T'] = PeaksDB_325_cyt_prot_totals['% T total'] / PeaksDB_325_cyt_prot_totals['overall % sum']

PeaksDB_325_cyt_prot_totals['overall % B'] = PeaksDB_325_cyt_prot_totals['% B total'] / PeaksDB_325_cyt_prot_totals['overall % sum']

PeaksDB_325_cyt_prot_totals['overall % S'] = PeaksDB_325_cyt_prot_totals['% S total'] / PeaksDB_325_cyt_prot_totals['overall % sum']

PeaksDB_325_cyt_prot_totals['overall % c'] = PeaksDB_325_cyt_prot_totals['% c total'] / PeaksDB_325_cyt_prot_totals['overall % sum']

# write to csv

PeaksDB_325_cyt_prot_totals.to_csv("/home/millieginty/Documents/git-repos/rot-mayer/analyses/proteus2/Proteins-Proteus2/Day12_325 PeaksDB_dia_cyt_prot_trypsin_totals")

PeaksDB_325_cyt_prot_totals.head()

Unnamed: 0,% C total,% H total,% E total,% T total,% B total,% S total,% c total,% check sum,overall % sum,overall % C,overall % H,overall % E,overall % T,overall % B,overall % S,overall % c
325 total,9.748971,7.937481,6.313548,0.0,0.0,0.0,0.0,24.0,24.0,0.406207,0.330728,0.263065,0.0,0.0,0.0,0.0
