### Manipulation of Peaks de novo results of ETNP 2017 P2 samples LC-MS/MS data using python.

Starting with:

    Peaks de novo results (.csv) of PTM-optimized database searches

Goal:

    Files with stripped (no PTMs) peptide lists and
    Columns with #'s of each modification in every sequence
    Column with stripped peptide lengths (# amino acids)
    
### To use:

#### 1. Change the input file name in *IN 4*
#### 2. Change output file name in *IN 6*, *IN 7*, *IN 8*

We don't have technical duplicates here, sadly, unlike the MED4 Pro samples. I exported PeaksDN search results CSVs into my ETNP 2017 git repo:

In [1]:
cd /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/

/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt


In [2]:
ls

ETNP-SKQ17-231-100m-0.3-JA2_DN50.csv
ETNP-SKQ17-231-100m-0.3-JA2_DN50_ptm.csv
ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped.csv
ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.fas
ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.txt
ETNP-SKQ17-231-100m-0.3-JA2_DN50_totals.csv
ETNP-SKQ17-231-100m-0.3-JA2_DN80_stripped_peptides.fas
ETNP-SKQ17-231-100m-0.3-JA2_DN80_stripped_peptides.txt
ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN50_AA_NAAF.csv
ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN50.csv
ETNP-SKQ17-231-100m-0.3-JA2__PTMopt_DN50_totals.csv
ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN80_AA_NAAF.csv
ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN80.csv
ETNP-SKQ17-231-100m-0.3-JA2__PTMopt_DN80_totals.csv
ETNP-SKQ17-233-265m-0.3-JA2_DN50_stripped.csv
ETNP-SKQ17-233-265m-0.3-JA4_DN50.csv
ETNP-SKQ17-233-265m-0.3-JA4_DN50_ptm.csv
ETNP-SKQ17-233-265m-0.3-JA4_DN50_stripped.csv
ETNP-SKQ17-233-265m-0.3-JA4_DN50_stripped_peptides.txt
ETNP-SKQ17-233-265m-0.3-JA4_DN50_totals.csv
ETNP-SKQ17-233-265m-0.3-JA4

In [3]:
# LIBRARIES
#import pandas library for working with tabular data
import os
os.getcwd()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kde
#import regular expresson (regex)
import re
#check pandas version
pd.__version__

'1.0.5'

## 231: 100 m McLane pump filtered on 0.3 um GF-75

In [4]:
#read the CSV into a datadrame we name 'comet' using the pandas read_csv function
peaks231_50 = pd.read_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50.csv")

# also make a dataframe but keep only peptides  >80% ALC
peaks231_80 = peaks231_50.loc[peaks231_50['ALC (%)'] >= 80].copy()

# how many de novo sequence candidates >50% ALC?
print("# redundant Peaks peptides >50% ALC in dataframe", len(peaks231_50))
# how many de novo sequence candidates >80ALC?
print("# redundant Peaks peptides >80% ALC in dataframe", len(peaks231_80))

#look at the dataframe
peaks231_50.head()

# redundant Peaks peptides >50% ALC in dataframe 5312
# redundant Peaks peptides >80% ALC in dataframe 1140


Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,1,14924,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)LAALEK,8,98,8,444.7382,2,50.59,38500000.0,887.46,2.0,Deamidation (NQ),98 99 100 99 98 99 100 96,EN(+.98)LAALEK,CID
1,1,11873,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)N(+.98)LLAK,7,98,7,402.2114,2,41.36,7910000.0,802.4072,1.2,Deamidation (NQ),97 98 97 98 99 99 99,EN(+.98)N(+.98)LLAK,CID
2,1,16508,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TGN(+.98)FLDPK,8,98,8,446.7253,2,55.16,5860000.0,891.4338,2.6,Deamidation (NQ),94 97 99 99 99 99 99 98,TGN(+.98)FLDPK,CID
3,1,27852,20170410_ETNP-231-100m-0.3um-JA2_01.raw,WLVNHPR,7,97,7,461.2513,2,81.49,1550000.0,920.498,-10.8,,99 100 100 96 97 97 98,WLVNHPR,CID
4,1,15671,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TDENLPLGPK,10,97,10,542.2884,2,52.76,10900000.0,1082.5608,1.3,,98 99 100 99 100 97 98 95 98 95,TDENLPLGPK,CID


In [5]:
# use a count function to enumerate the # of A's (alanines) in each peptide
peaks231_50['A'] = peaks231_50['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
peaks231_50['C'] = peaks231_50['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
peaks231_50['D'] = peaks231_50['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
peaks231_50['E'] = peaks231_50['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
peaks231_50['F'] = peaks231_50['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
peaks231_50['G'] = peaks231_50['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
peaks231_50['H'] = peaks231_50['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in peaks231_50 output, there will be no isoleucines (they're lumped in with leucines)
peaks231_50['I'] = peaks231_50['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
peaks231_50['K'] = peaks231_50['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
peaks231_50['I/L'] = peaks231_50['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
peaks231_50['M'] = peaks231_50['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
peaks231_50['N'] = peaks231_50['Peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
peaks231_50['P'] = peaks231_50['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
peaks231_50['Q'] = peaks231_50['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
peaks231_50['R'] = peaks231_50['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
peaks231_50['S'] = peaks231_50['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
peaks231_50['T'] = peaks231_50['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
peaks231_50['V'] = peaks231_50['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
peaks231_50['W'] = peaks231_50['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
peaks231_50['Y'] = peaks231_50['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks231_50['c-carb'] = peaks231_50['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks231_50['m-oxid'] = peaks231_50['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
# used to use peaks231_50['n-deam'] = peaks231_50['Peptide'].str.count("N\(+.98") but that didn't work with the 'N'

peaks231_50['n-deam'] = peaks231_50['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks231_50['q-deam'] = peaks231_50['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of iron adducted K's in each peptide
peaks231_50['k-iron'] = peaks231_50['Peptide'].str.count("53.92")

# use a count function to enumerate the # of methylated K's in each peptide
peaks231_50['k-meth'] = peaks231_50['Peptide'].apply(lambda x: x.count('K(+14.02)'))

# use a count function to enumerate the # of methylated R's in each peptide
peaks231_50['r-meth'] = peaks231_50['Peptide'].apply(lambda x: x.count('R(+14.02)'))

# create a column with 'stripped' peptide sequences using strip
peaks231_50['stripped peptide'] = peaks231_50['Peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
peaks231_50['stripped length'] = peaks231_50['stripped peptide'].apply(len)

# total the number of modifications in sequence
peaks231_50['ptm-total'] = peaks231_50['c-carb'] + peaks231_50['m-oxid'] + peaks231_50['n-deam'] + peaks231_50['q-deam'] + peaks231_50['k-iron'] + peaks231_50['k-meth'] + peaks231_50['r-meth']

# calculate NAAF numerator for each peptide k
peaks231_50['NAAF num.'] = peaks231_50['Area'] / peaks231_50['stripped length']

# write modified dataframe to new txt file, same name + 'stripped'
peaks231_50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN50.csv")


# check out the results
peaks231_50.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length,ptm-total,NAAF num.
0,1,14924,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)LAALEK,8,98,8,444.7382,2,50.59,...,0,1,0,0,0,0,ENLAALEK,8,1,4812500.0
1,1,11873,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)N(+.98)LLAK,7,98,7,402.2114,2,41.36,...,0,2,0,0,0,0,ENLLAK,6,2,1318333.0
2,1,16508,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TGN(+.98)FLDPK,8,98,8,446.7253,2,55.16,...,0,1,0,0,0,0,TGNFLDPK,8,1,732500.0
3,1,27852,20170410_ETNP-231-100m-0.3um-JA2_01.raw,WLVNHPR,7,97,7,461.2513,2,81.49,...,0,0,0,0,0,0,WLVNHPR,7,0,221428.6
4,1,15671,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TDENLPLGPK,10,97,10,542.2884,2,52.76,...,0,0,0,0,0,0,TDENLPLGPK,10,0,1090000.0


In [6]:
# made a new dataframe that contains the sums of certain columns in the stripped peptide dataframe above (for >50% ALC)

index = ['sample total']

data = {'A': peaks231_50['A'].sum(),
        'C': peaks231_50['C'].sum(),
        'D': peaks231_50['D'].sum(),
        'E': peaks231_50['E'].sum(),
        'F': peaks231_50['F'].sum(),
        'G': peaks231_50['G'].sum(),
        'H': peaks231_50['H'].sum(),
        'I': peaks231_50['I'].sum(),
        'K': peaks231_50['K'].sum(),
        'I/L': peaks231_50['I/L'].sum(),
        'M': peaks231_50['M'].sum(),
        'N': peaks231_50['N'].sum(),
        'P': peaks231_50['P'].sum(),
        'Q': peaks231_50['Q'].sum(),
        'R': peaks231_50['R'].sum(),
        'S': peaks231_50['S'].sum(),
        'T': peaks231_50['T'].sum(),
        'V': peaks231_50['V'].sum(),
        'W': peaks231_50['W'].sum(),
        'Y': peaks231_50['Y'].sum(),
        'c-carb': peaks231_50['c-carb'].sum(),
        'm-oxid': peaks231_50['m-oxid'].sum(),
        'n-deam': peaks231_50['n-deam'].sum(),
        'q-deam': peaks231_50['q-deam'].sum(),
        'k-iron': peaks231_50['k-iron'].sum(),
        'k-meth': peaks231_50['k-meth'].sum(),
        'r-meth': peaks231_50['r-meth'].sum(),
        'Total area': peaks231_50['Area'].sum(),
        'Total length': peaks231_50['stripped length'].sum()
       }

totalpeaks231_50 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'I/L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', 'q-deam', 'k-iron', 'k-meth', 'r-meth', 'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalpeaks231_50['% C w/ carb'] = totalpeaks231_50['c-carb'] / totalpeaks231_50['C'] 

# calculate percentage of M's that are oxidized
totalpeaks231_50['% M w/ oxid'] = totalpeaks231_50['m-oxid'] / totalpeaks231_50['M'] 

# calculate percentage of N's that are deamidated
totalpeaks231_50['% N w/ deam'] = totalpeaks231_50['n-deam'] / totalpeaks231_50['N'] 

# calculate percentage of Q's that are deamidated
totalpeaks231_50['% Q w/ deam'] = totalpeaks231_50['q-deam'] / totalpeaks231_50['Q'] 

# calculate percentage of K's that are hydroxylated
totalpeaks231_50['% K w/ iron'] = totalpeaks231_50['k-iron'] / totalpeaks231_50['K'] 

# calculate percentage of K's that are methylated
totalpeaks231_50['% K w/ meth'] = totalpeaks231_50['k-meth'] / totalpeaks231_50['K'] 

# calculate percentage of R's that are methylated
totalpeaks231_50['% R w/ meth'] = totalpeaks231_50['r-meth'] / totalpeaks231_50['R'] 

# calculate NAAF denominator for all peptides in dataset i
totalpeaks231_50['NAAF denom.'] = totalpeaks231_50['Total area'] / totalpeaks231_50['Total length']

# write modified dataframe to new txt file
totalpeaks231_50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2__PTMopt_DN50_totals.csv")

totalpeaks231_50.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,I/L,...,Total area,Total length,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ iron,% K w/ meth,% R w/ meth,NAAF denom.
sample total,3927,626,2010,2762,1551,2101,1048,0,5403,4784,...,12646790000.0,41900,1.0,0.43424,0.202742,0.029278,0.0,0.147326,0.287637,301832.734206


In [7]:
# use the calculated NAAF factor (in totalpeaks231_ dataframe, above) to caluclate the NAAF 
# NAAF: normalized normalized area abundance factor

# don't have to worry here about DECOY hits messing with Area totals
# but we will with Comet results

NAAF50 = 301832.734206

# use NAAF >50% ALC to get NAAF
peaks231_50['NAAF factor'] = (peaks231_50['NAAF num.'])/NAAF50

# separate out the dataframe into AAs 
peaks231_AA50 = peaks231_50[['stripped peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'K', 'I/L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']].copy()

# multiply the NAAF50 factor by the AA total to normalize its abundance by peak area and peptide length

peaks231_AA50['A-NAAF50'] = peaks231_AA50['A'] * peaks231_50['NAAF factor']
peaks231_AA50['C-NAAF50'] = peaks231_AA50['C'] * peaks231_50['NAAF factor']
peaks231_AA50['D-NAAF50'] = peaks231_AA50['D'] * peaks231_50['NAAF factor']
peaks231_AA50['E-NAAF50'] = peaks231_AA50['E'] * peaks231_50['NAAF factor']
peaks231_AA50['F-NAAF50'] = peaks231_AA50['F'] * peaks231_50['NAAF factor']
peaks231_AA50['G-NAAF50'] = peaks231_AA50['G'] * peaks231_50['NAAF factor']
peaks231_AA50['H-NAAF50'] = peaks231_AA50['H'] * peaks231_50['NAAF factor']
peaks231_AA50['K-NAAF50'] = peaks231_AA50['K'] * peaks231_50['NAAF factor']
peaks231_AA50['I/L-NAAF50'] = peaks231_AA50['I/L'] * peaks231_50['NAAF factor']
peaks231_AA50['M-NAAF50'] = peaks231_AA50['M'] * peaks231_50['NAAF factor']
peaks231_AA50['N-NAAF50'] = peaks231_AA50['N'] * peaks231_50['NAAF factor']
peaks231_AA50['P-NAAF50'] = peaks231_AA50['P'] * peaks231_50['NAAF factor']
peaks231_AA50['Q-NAAF50'] = peaks231_AA50['Q'] * peaks231_50['NAAF factor']
peaks231_AA50['R-NAAF50'] = peaks231_AA50['R'] * peaks231_50['NAAF factor']
peaks231_AA50['S-NAAF50'] = peaks231_AA50['S'] * peaks231_50['NAAF factor']
peaks231_AA50['T-NAAF50'] = peaks231_AA50['T'] * peaks231_50['NAAF factor']
peaks231_AA50['V-NAAF50'] = peaks231_AA50['V'] * peaks231_50['NAAF factor']
peaks231_AA50['W-NAAF50'] = peaks231_AA50['W'] * peaks231_50['NAAF factor']
peaks231_AA50['Y-NAAF50'] = peaks231_AA50['Y'] * peaks231_50['NAAF factor']

# write the dataframe to a new csv
peaks231_AA50.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN50_AA_NAAF.csv")

peaks231_AA50.head()

Unnamed: 0,stripped peptide,NAAF factor,A,C,D,E,F,G,H,K,...,M-NAAF50,N-NAAF50,P-NAAF50,Q-NAAF50,R-NAAF50,S-NAAF50,T-NAAF50,V-NAAF50,W-NAAF50,Y-NAAF50
0,ENLAALEK,15.944261,2,0,0,2,0,0,0,1,...,0.0,15.944261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ENLLAK,4.367761,1,0,0,1,0,0,0,1,...,0.0,8.735523,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,TGNFLDPK,2.426841,0,0,1,0,1,1,0,1,...,0.0,2.426841,2.426841,0.0,0.0,0.0,2.426841,0.0,0.0,0.0
3,WLVNHPR,0.733614,0,0,0,0,0,0,1,0,...,0.0,0.733614,0.733614,0.0,0.733614,0.0,0.0,0.733614,0.733614,0.0
4,TDENLPLGPK,3.611272,0,0,1,1,0,1,0,1,...,0.0,3.611272,7.222543,0.0,0.0,0.0,3.611272,0.0,0.0,0.0


### Same process but for de novo peptide >80 % ALC:

In [8]:
# use a count function to enumerate the # of A's (alanines) in each peptide
peaks231_80['A'] = peaks231_80['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
peaks231_80['C'] = peaks231_80['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
peaks231_80['D'] = peaks231_80['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
peaks231_80['E'] = peaks231_80['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
peaks231_80['F'] = peaks231_80['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
peaks231_80['G'] = peaks231_80['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
peaks231_80['H'] = peaks231_80['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in peaks231_80 output, there will be no isoleucines (they're lumped in with leucines)
peaks231_80['I'] = peaks231_80['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
peaks231_80['K'] = peaks231_80['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
peaks231_80['I/L'] = peaks231_80['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
peaks231_80['M'] = peaks231_80['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
peaks231_80['N'] = peaks231_80['Peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
peaks231_80['P'] = peaks231_80['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
peaks231_80['Q'] = peaks231_80['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
peaks231_80['R'] = peaks231_80['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
peaks231_80['S'] = peaks231_80['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
peaks231_80['T'] = peaks231_80['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
peaks231_80['V'] = peaks231_80['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
peaks231_80['W'] = peaks231_80['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
peaks231_80['Y'] = peaks231_80['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks231_80['c-carb'] = peaks231_80['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks231_80['m-oxid'] = peaks231_80['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
# used to use peaks231_80['n-deam'] = peaks231_80['Peptide'].str.count("N\(+.98") but that didn't work with the 'N'

peaks231_80['n-deam'] = peaks231_80['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks231_80['q-deam'] = peaks231_80['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of iron adducted K's in each peptide
peaks231_80['k-iron'] = peaks231_80['Peptide'].str.count("53.92")

# use a count function to enumerate the # of methylated K's in each peptide
peaks231_80['k-meth'] = peaks231_80['Peptide'].apply(lambda x: x.count('K(+14.02)'))

# use a count function to enumerate the # of methylated R's in each peptide
peaks231_80['r-meth'] = peaks231_80['Peptide'].apply(lambda x: x.count('R(+14.02)'))

# create a column with 'stripped' peptide sequences using strip
peaks231_80['stripped peptide'] = peaks231_80['Peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
peaks231_80['stripped length'] = peaks231_80['stripped peptide'].apply(len)

# total the number of modifications in sequence
peaks231_80['ptm-total'] = peaks231_80['c-carb'] + peaks231_80['m-oxid'] + peaks231_80['n-deam'] + peaks231_80['q-deam'] + peaks231_80['k-iron'] + peaks231_80['k-meth'] + peaks231_80['r-meth']

# calculate NAAF numerator for each peptide k
peaks231_80['NAAF num.'] = peaks231_80['Area'] / peaks231_80['stripped length']

# write modified dataframe to new txt file, same name + 'stripped'
peaks231_80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN80.csv")


# check out the results
peaks231_80.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length,ptm-total,NAAF num.
0,1,14924,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)LAALEK,8,98,8,444.7382,2,50.59,...,0,1,0,0,0,0,ENLAALEK,8,1,4812500.0
1,1,11873,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)N(+.98)LLAK,7,98,7,402.2114,2,41.36,...,0,2,0,0,0,0,ENLLAK,6,2,1318333.0
2,1,16508,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TGN(+.98)FLDPK,8,98,8,446.7253,2,55.16,...,0,1,0,0,0,0,TGNFLDPK,8,1,732500.0
3,1,27852,20170410_ETNP-231-100m-0.3um-JA2_01.raw,WLVNHPR,7,97,7,461.2513,2,81.49,...,0,0,0,0,0,0,WLVNHPR,7,0,221428.6
4,1,15671,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TDENLPLGPK,10,97,10,542.2884,2,52.76,...,0,0,0,0,0,0,TDENLPLGPK,10,0,1090000.0


In [10]:
# use a count function to enumerate the # of A's (alanines) in each peptide
peaks231_80['A'] = peaks231_80['Peptide'].str.count("A")

# use a count function to enumerate the # of C's (cysteines) in each peptide
peaks231_80['C'] = peaks231_80['Peptide'].str.count("C")

# use a count function to enumerate the # of D's (aspartic acids) in each peptide
peaks231_80['D'] = peaks231_80['Peptide'].str.count("D")

# use a count function to enumerate the # of E's (glutamic acids) in each peptide
peaks231_80['E'] = peaks231_80['Peptide'].str.count("E")

# use a count function to enumerate the # of F's (phenylalanines) in each peptide
peaks231_80['F'] = peaks231_80['Peptide'].str.count("F")

# use a count function to enumerate the # of G's (glycines) in each peptide
peaks231_80['G'] = peaks231_80['Peptide'].str.count("G")

# use a count function to enumerate the # of H's (histidines) in each peptide
peaks231_80['H'] = peaks231_80['Peptide'].str.count("H")

# use a count function to enumerate the # of I's (isoleucines) in each peptide
# in peaks231_80 output, there will be no isoleucines (they're lumped in with leucines)
peaks231_80['I'] = peaks231_80['Peptide'].str.count("I")

# use a count function to enumerate the # of K's (lysines) in each peptide
peaks231_80['K'] = peaks231_80['Peptide'].str.count("K")

# use a count function to enumerate the # of L's (leucines) in each peptide
# also these include the isoleucines
peaks231_80['I/L'] = peaks231_80['Peptide'].str.count("L")

# use a count function to enumerate the # of M's (methionines) in each peptide
peaks231_80['M'] = peaks231_80['Peptide'].str.count("M")

# use a count function to enumerate the # of N's (asparagines) in each peptide
peaks231_80['N'] = peaks231_80['Peptide'].str.count("N")

# use a count function to enumerate the # of P's ([prolines]) in each peptide
peaks231_80['P'] = peaks231_80['Peptide'].str.count("P")

# use a count function to enumerate the # of Q's (glutamines) in each peptide
peaks231_80['Q'] = peaks231_80['Peptide'].str.count("Q")

# use a count function to enumerate the # of R's (arginines) in each peptide
peaks231_80['R'] = peaks231_80['Peptide'].str.count("R")

# use a count function to enumerate the # of S's (serines) in each peptide
peaks231_80['S'] = peaks231_80['Peptide'].str.count("S")

# use a count function to enumerate the # of T's (threonines) in each peptide
peaks231_80['T'] = peaks231_80['Peptide'].str.count("T")

# use a count function to enumerate the # of V's (valines) in each peptide
peaks231_80['V'] = peaks231_80['Peptide'].str.count("V")

# use a count function to enumerate the # of W's (tryptophans) in each peptide
peaks231_80['W'] = peaks231_80['Peptide'].str.count("W")

# use a count function to enumerate the # of Y's (tyrosines) in each peptide
peaks231_80['Y'] = peaks231_80['Peptide'].str.count("Y")

# use a count function to enumerate the # of carbamidomethylated C's in each peptide
peaks231_80['c-carb'] = peaks231_80['Peptide'].str.count("57.02")

# use a count function to enumerate the # of oxidized M's in each peptide
peaks231_80['m-oxid'] = peaks231_80['Peptide'].apply(lambda x: x.count('M(+15.99)'))

# use a lamba function to enumerate the # of deamidated N's in each peptide
# used to use peaks231_80['n-deam'] = peaks231_80['Peptide'].str.count("N\(+.98") but that didn't work with the 'N'

peaks231_80['n-deam'] = peaks231_80['Peptide'].apply(lambda x: x.count('N(+.98)'))

# use a count function to enumerate the # of deamidated Q's in each peptide
peaks231_80['q-deam'] = peaks231_80['Peptide'].apply(lambda x: x.count('Q(+.98)'))

# use a count function to enumerate the # of iron adducted K's in each peptide
peaks231_80['k-iron'] = peaks231_80['Peptide'].str.count("53.92")

# use a count function to enumerate the # of methylated K's in each peptide
peaks231_80['k-meth'] = peaks231_80['Peptide'].apply(lambda x: x.count('K(+14.02)'))

# use a count function to enumerate the # of methylated R's in each peptide
peaks231_80['r-meth'] = peaks231_80['Peptide'].apply(lambda x: x.count('R(+14.02)'))

# create a column with 'stripped' peptide sequences using strip
peaks231_80['stripped peptide'] = peaks231_80['Peptide'].str.replace(r"\(.*\)","")

# add a column with the stripped peptide length (number of AAs)
peaks231_80['stripped length'] = peaks231_80['stripped peptide'].apply(len)

# total the number of modifications in sequence
peaks231_80['ptm-total'] = peaks231_80['c-carb'] + peaks231_80['m-oxid'] + peaks231_80['n-deam'] + peaks231_80['q-deam'] + peaks231_80['k-iron'] + peaks231_80['k-meth'] + peaks231_80['r-meth']

# calculate NAAF numerator for each peptide k
peaks231_80['NAAF num.'] = peaks231_80['Area'] / peaks231_80['stripped length']

# write modified dataframe to new txt file
peaks231_80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN80.csv")


# check out the results
peaks231_80.head()

Unnamed: 0,Fraction,Scan,Source File,Peptide,Tag Length,ALC (%),length,m/z,z,RT,...,m-oxid,n-deam,q-deam,k-iron,k-meth,r-meth,stripped peptide,stripped length,ptm-total,NAAF num.
0,1,14924,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)LAALEK,8,98,8,444.7382,2,50.59,...,0,1,0,0,0,0,ENLAALEK,8,1,4812500.0
1,1,11873,20170410_ETNP-231-100m-0.3um-JA2_01.raw,EN(+.98)N(+.98)LLAK,7,98,7,402.2114,2,41.36,...,0,2,0,0,0,0,ENLLAK,6,2,1318333.0
2,1,16508,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TGN(+.98)FLDPK,8,98,8,446.7253,2,55.16,...,0,1,0,0,0,0,TGNFLDPK,8,1,732500.0
3,1,27852,20170410_ETNP-231-100m-0.3um-JA2_01.raw,WLVNHPR,7,97,7,461.2513,2,81.49,...,0,0,0,0,0,0,WLVNHPR,7,0,221428.6
4,1,15671,20170410_ETNP-231-100m-0.3um-JA2_01.raw,TDENLPLGPK,10,97,10,542.2884,2,52.76,...,0,0,0,0,0,0,TDENLPLGPK,10,0,1090000.0


In [11]:
# made a new dataframe that contains the sums of certain columns in the stripped peptide dataframe above (for >50% ALC)

index = ['sample total']

data = {'A': peaks231_80['A'].sum(),
        'C': peaks231_80['C'].sum(),
        'D': peaks231_80['D'].sum(),
        'E': peaks231_80['E'].sum(),
        'F': peaks231_80['F'].sum(),
        'G': peaks231_80['G'].sum(),
        'H': peaks231_80['H'].sum(),
        'I': peaks231_80['I'].sum(),
        'K': peaks231_80['K'].sum(),
        'I/L': peaks231_80['I/L'].sum(),
        'M': peaks231_80['M'].sum(),
        'N': peaks231_80['N'].sum(),
        'P': peaks231_80['P'].sum(),
        'Q': peaks231_80['Q'].sum(),
        'R': peaks231_80['R'].sum(),
        'S': peaks231_80['S'].sum(),
        'T': peaks231_80['T'].sum(),
        'V': peaks231_80['V'].sum(),
        'W': peaks231_80['W'].sum(),
        'Y': peaks231_80['Y'].sum(),
        'c-carb': peaks231_80['c-carb'].sum(),
        'm-oxid': peaks231_80['m-oxid'].sum(),
        'n-deam': peaks231_80['n-deam'].sum(),
        'q-deam': peaks231_80['q-deam'].sum(),
        'k-iron': peaks231_80['k-iron'].sum(),
        'k-meth': peaks231_80['k-meth'].sum(),
        'r-meth': peaks231_80['r-meth'].sum(),
        'Total area': peaks231_80['Area'].sum(),
        'Total length': peaks231_80['stripped length'].sum()
       }

totalpeaks231_80 = pd.DataFrame(data, columns=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'I/L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'c-carb', 'm-oxid', 'n-deam', 'q-deam', 'k-iron', 'k-meth', 'r-meth', 'Total area', 'Total length'], index=index)

# calculate percentage of C's with carb (should be 1.0)
totalpeaks231_80['% C w/ carb'] = totalpeaks231_80['c-carb'] / totalpeaks231_80['C'] 

# calculate percentage of M's that are oxidized
totalpeaks231_80['% M w/ oxid'] = totalpeaks231_80['m-oxid'] / totalpeaks231_80['M'] 

# calculate percentage of N's that are deamidated
totalpeaks231_80['% N w/ deam'] = totalpeaks231_80['n-deam'] / totalpeaks231_80['N'] 

# calculate percentage of Q's that are deamidated
totalpeaks231_80['% Q w/ deam'] = totalpeaks231_80['q-deam'] / totalpeaks231_80['Q'] 

# calculate percentage of K's that are hydroxylated
totalpeaks231_80['% K w/ iron'] = totalpeaks231_80['k-iron'] / totalpeaks231_80['K'] 

# calculate percentage of K's that are methylated
totalpeaks231_80['% K w/ meth'] = totalpeaks231_80['k-meth'] / totalpeaks231_80['K'] 

# calculate percentage of R's that are methylated
totalpeaks231_80['% R w/ meth'] = totalpeaks231_80['r-meth'] / totalpeaks231_80['R'] 

# calculate NAAF denominator for all peptides in dataset i
totalpeaks231_80['NAAF denom.'] = totalpeaks231_80['Total area'] / totalpeaks231_80['Total length']

# write modified dataframe to new txt file
totalpeaks231_80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2__PTMopt_DN80_totals.csv")

totalpeaks231_80.head()

Unnamed: 0,A,C,D,E,F,G,H,I,K,I/L,...,Total area,Total length,% C w/ carb,% M w/ oxid,% N w/ deam,% Q w/ deam,% K w/ iron,% K w/ meth,% R w/ meth,NAAF denom.
sample total,941,42,454,746,350,486,81,0,836,1415,...,8121100000.0,9476,1.0,0.60241,0.265203,0.091837,0.0,0.084928,0.157969,857017.763508


In [12]:
# use the calculated NAAF factor (in totalpeaks231_ dataframe, above) to caluclate the NAAF 
# NAAF: normalized normalized area abundance factor

# don't have to worry here about DECOY hits messing with Area totals
# but we will with Comet results

NAAF80 = 857017.763508

# use NAAF >80% ALC to get NAAF
peaks231_80['NAAF factor'] = (peaks231_80['NAAF num.'])/NAAF80

# separate out the dataframe into AAs 
peaks231_AA80 = peaks231_80[['stripped peptide', 'NAAF factor', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'K', 'I/L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']].copy()

# multiply the NAAF80 factor by the AA total to normalize its abundance by peak area and peptide length

peaks231_AA80['A-NAAF80'] = peaks231_AA80['A'] * peaks231_80['NAAF factor']
peaks231_AA80['C-NAAF80'] = peaks231_AA80['C'] * peaks231_80['NAAF factor']
peaks231_AA80['D-NAAF80'] = peaks231_AA80['D'] * peaks231_80['NAAF factor']
peaks231_AA80['E-NAAF80'] = peaks231_AA80['E'] * peaks231_80['NAAF factor']
peaks231_AA80['F-NAAF80'] = peaks231_AA80['F'] * peaks231_80['NAAF factor']
peaks231_AA80['G-NAAF80'] = peaks231_AA80['G'] * peaks231_80['NAAF factor']
peaks231_AA80['H-NAAF80'] = peaks231_AA80['H'] * peaks231_80['NAAF factor']
peaks231_AA80['K-NAAF80'] = peaks231_AA80['K'] * peaks231_80['NAAF factor']
peaks231_AA80['I/L-NAAF80'] = peaks231_AA80['I/L'] * peaks231_80['NAAF factor']
peaks231_AA80['M-NAAF80'] = peaks231_AA80['M'] * peaks231_80['NAAF factor']
peaks231_AA80['N-NAAF80'] = peaks231_AA80['N'] * peaks231_80['NAAF factor']
peaks231_AA80['P-NAAF80'] = peaks231_AA80['P'] * peaks231_80['NAAF factor']
peaks231_AA80['Q-NAAF80'] = peaks231_AA80['Q'] * peaks231_80['NAAF factor']
peaks231_AA80['R-NAAF80'] = peaks231_AA80['R'] * peaks231_80['NAAF factor']
peaks231_AA80['S-NAAF80'] = peaks231_AA80['S'] * peaks231_80['NAAF factor']
peaks231_AA80['T-NAAF80'] = peaks231_AA80['T'] * peaks231_80['NAAF factor']
peaks231_AA80['V-NAAF80'] = peaks231_AA80['V'] * peaks231_80['NAAF factor']
peaks231_AA80['W-NAAF80'] = peaks231_AA80['W'] * peaks231_80['NAAF factor']
peaks231_AA80['Y-NAAF80'] = peaks231_AA80['Y'] * peaks231_80['NAAF factor']

# write the dataframe to a new csv
peaks231_AA80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_PTMopt_DN80_AA_NAAF.csv")

peaks231_AA80.head()

Unnamed: 0,stripped peptide,NAAF factor,A,C,D,E,F,G,H,K,...,M-NAAF80,N-NAAF80,P-NAAF80,Q-NAAF80,R-NAAF80,S-NAAF80,T-NAAF80,V-NAAF80,W-NAAF80,Y-NAAF80
0,ENLAALEK,5.615403,2,0,0,2,0,0,0,1,...,0.0,5.615403,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ENLLAK,1.53828,1,0,0,1,0,0,0,1,...,0.0,3.07656,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,TGNFLDPK,0.854708,0,0,1,0,1,1,0,1,...,0.0,0.854708,0.854708,0.0,0.0,0.0,0.854708,0.0,0.0,0.0
3,WLVNHPR,0.258371,0,0,0,0,0,0,1,0,...,0.0,0.258371,0.258371,0.0,0.258371,0.0,0.0,0.258371,0.258371,0.0
4,TDENLPLGPK,1.271852,0,0,1,1,0,1,0,1,...,0.0,1.271852,2.543705,0.0,0.0,0.0,1.271852,0.0,0.0,0.0


### Visualizing the results

In [None]:
print("ALC max: ", peaks['ALC (%)'].max())
print("ALC min: ", peaks['ALC (%)'].min())

In [None]:
# take only AA totals and transpose for easier bar plotting in matplotlib

peaksaatot = totalpeaks[['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']].copy().T

# take only AA %s and transpose for easier bar plotting in matplotlib

peaksreltot = totalpeaks[['% C w/ carb.', '% M w/ oxid', '% N w/ deam', '% Q w/ deam', '% K w/ hydr', '% P w/ hydr', '% K w/ meth', '% R w/ meth']].copy().T

In [None]:
# bar plot of residue totals
# there is no isoleucine (I) in Peaks data, which is why L is really big and I is 0


x_labels = ['sample total']

ax = totalpeaks.plot(y=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'], kind="bar", title = '100 m suspended')
plt.xticks(rotation=0)
ax.get_legend().remove()
ax.set_xticklabels(x_labels)

In [None]:
# bar plot of residue totals
# there is no isoleucine (I) in Peaks data, which is why L is really big and I is 0

my_colors = [(x/10.0, x/20.0, 0.75) for x in range(len(peaksaatot))] # <-- Quick gradient example along the Red/Green dimensions.

ax = peaksaatot.plot(y=['sample total'], kind="bar", color = 'green', title = '100 m suspended')


In [None]:
# bar plot of relative modifications

ax = totalpeaks.plot(y=['% C w/ carb.', '% M w/ oxid', '% N w/ deam', '% Q w/ deam', '% K w/ hydr', '% P w/ hydr', '% K w/ meth', '% R w/ meth'], kind="bar", title = '100 m suspended')
ax.set_xticklabels([])

In [None]:
# bar plot of relative mods


ax = peaksreltot.plot(y=['sample total'], kind="bar", title = '100 m suspended')
plt.xticks(rotation=45)

In [None]:
# making evenly spaced bins for the ALC data based on the min and max, called above
bins = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
labels = ['50-55', '55-60', '60-65', '65-70', '70-75', '75-80', '80-85', '85-90', '90-95', '95-100']

# use pandas cut function to do the binning itself
peaks['binned'] = pd.cut(peaks['ALC (%)'], bins=bins, labels=labels)

# bar plots of binned PTM data

index = ['50-55', '55-60', '60-65', '65-70', '70-75', '75-80', '80-85', '85-90', '90-95', '95-100']
data = {'Total PTMs': [peaks.groupby('binned')['ptm-total'].sum()['50-55'], peaks.groupby('binned')['ptm-total'].sum()['55-60'], peaks.groupby('binned')['ptm-total'].sum()['60-65'], peaks.groupby('binned')['ptm-total'].sum()['65-70'], peaks.groupby('binned')['ptm-total'].sum()['70-75'], peaks.groupby('binned')['ptm-total'].sum()['75-80'], peaks.groupby('binned')['ptm-total'].sum()['80-85'], peaks.groupby('binned')['ptm-total'].sum()['85-90'], peaks.groupby('binned')['ptm-total'].sum()['90-95'], peaks.groupby('binned')['ptm-total'].sum()['95-100']],
        'Cys carb.': [peaks.groupby('binned')['c-carb'].sum()['50-55'], peaks.groupby('binned')['c-carb'].sum()['55-60'], peaks.groupby('binned')['c-carb'].sum()['60-65'], peaks.groupby('binned')['c-carb'].sum()['65-70'], peaks.groupby('binned')['c-carb'].sum()['70-75'], peaks.groupby('binned')['c-carb'].sum()['75-80'], peaks.groupby('binned')['c-carb'].sum()['80-85'], peaks.groupby('binned')['c-carb'].sum()['85-90'], peaks.groupby('binned')['c-carb'].sum()['90-95'], peaks.groupby('binned')['c-carb'].sum()['95-100']],
        'Met oxi.': [peaks.groupby('binned')['m-oxid'].sum()['50-55'], peaks.groupby('binned')['m-oxid'].sum()['55-60'], peaks.groupby('binned')['m-oxid'].sum()['60-65'], peaks.groupby('binned')['m-oxid'].sum()['65-70'], peaks.groupby('binned')['m-oxid'].sum()['70-75'], peaks.groupby('binned')['m-oxid'].sum()['75-80'], peaks.groupby('binned')['m-oxid'].sum()['80-85'], peaks.groupby('binned')['m-oxid'].sum()['85-90'], peaks.groupby('binned')['m-oxid'].sum()['90-95'], peaks.groupby('binned')['m-oxid'].sum()['95-100']],
        'Asp deam.': [peaks.groupby('binned')['n-deam'].sum()['50-55'], peaks.groupby('binned')['n-deam'].sum()['55-60'], peaks.groupby('binned')['n-deam'].sum()['60-65'], peaks.groupby('binned')['n-deam'].sum()['65-70'], peaks.groupby('binned')['n-deam'].sum()['70-75'], peaks.groupby('binned')['n-deam'].sum()['75-80'], peaks.groupby('binned')['n-deam'].sum()['80-85'], peaks.groupby('binned')['n-deam'].sum()['85-90'], peaks.groupby('binned')['n-deam'].sum()['90-95'], peaks.groupby('binned')['n-deam'].sum()['95-100']],
        'Glut deam.': [peaks.groupby('binned')['q-deam'].sum()['50-55'], peaks.groupby('binned')['q-deam'].sum()['55-60'], peaks.groupby('binned')['q-deam'].sum()['60-65'], peaks.groupby('binned')['q-deam'].sum()['65-70'], peaks.groupby('binned')['q-deam'].sum()['70-75'], peaks.groupby('binned')['q-deam'].sum()['75-80'], peaks.groupby('binned')['q-deam'].sum()['80-85'], peaks.groupby('binned')['q-deam'].sum()['85-90'], peaks.groupby('binned')['q-deam'].sum()['90-95'], peaks.groupby('binned')['q-deam'].sum()['95-100']],
        'Lys hydr': [peaks.groupby('binned')['k-hydr'].sum()['50-55'], peaks.groupby('binned')['k-hydr'].sum()['55-60'], peaks.groupby('binned')['k-hydr'].sum()['60-65'], peaks.groupby('binned')['k-hydr'].sum()['65-70'], peaks.groupby('binned')['k-hydr'].sum()['70-75'], peaks.groupby('binned')['k-hydr'].sum()['75-80'], peaks.groupby('binned')['k-hydr'].sum()['80-85'], peaks.groupby('binned')['k-hydr'].sum()['85-90'], peaks.groupby('binned')['k-hydr'].sum()['90-95'], peaks.groupby('binned')['k-hydr'].sum()['95-100']],
        'Pro hydr': [peaks.groupby('binned')['p-hydr'].sum()['50-55'], peaks.groupby('binned')['p-hydr'].sum()['55-60'], peaks.groupby('binned')['p-hydr'].sum()['60-65'], peaks.groupby('binned')['p-hydr'].sum()['65-70'], peaks.groupby('binned')['p-hydr'].sum()['70-75'], peaks.groupby('binned')['p-hydr'].sum()['75-80'], peaks.groupby('binned')['p-hydr'].sum()['80-85'], peaks.groupby('binned')['p-hydr'].sum()['85-90'], peaks.groupby('binned')['p-hydr'].sum()['90-95'], peaks.groupby('binned')['p-hydr'].sum()['95-100']],
        'Lys meth.': [peaks.groupby('binned')['k-meth'].sum()['50-55'], peaks.groupby('binned')['k-meth'].sum()['55-60'], peaks.groupby('binned')['k-meth'].sum()['60-65'], peaks.groupby('binned')['k-meth'].sum()['65-70'], peaks.groupby('binned')['k-meth'].sum()['70-75'], peaks.groupby('binned')['k-meth'].sum()['75-80'], peaks.groupby('binned')['k-meth'].sum()['80-85'], peaks.groupby('binned')['k-meth'].sum()['85-90'], peaks.groupby('binned')['k-meth'].sum()['90-95'], peaks.groupby('binned')['k-meth'].sum()['95-100']],
        'Arg meth.': [peaks.groupby('binned')['r-meth'].sum()['50-55'], peaks.groupby('binned')['r-meth'].sum()['55-60'], peaks.groupby('binned')['r-meth'].sum()['60-65'], peaks.groupby('binned')['r-meth'].sum()['65-70'], peaks.groupby('binned')['r-meth'].sum()['70-75'], peaks.groupby('binned')['r-meth'].sum()['75-80'], peaks.groupby('binned')['r-meth'].sum()['80-85'], peaks.groupby('binned')['r-meth'].sum()['85-90'], peaks.groupby('binned')['r-meth'].sum()['90-95'], peaks.groupby('binned')['r-meth'].sum()['95-100']]
        }

peaksbin = pd.DataFrame(data, columns=['Total PTMs','Cys carb.','Met oxi.','Asp deam.', 'Glut deam.', 'Lys hydr', 'Pro hydr', 'Lys meth.', 'Arg meth.'], index=index)

# write the peaks bin ptm dataframe to a csv:
peaksbin.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_ptm.csv")

ax1 = peaksbin.plot.bar(y='Total PTMs', rot=45)
ax1.set_title('Total PTMs')

ax2 = peaksbin.plot.bar(y='Cys carb.', rot=45)
ax2.set_title('Cysteine carbamidomethylation')

ax3 = peaksbin.plot.bar(y='Met oxi.', rot=45)
ax3.set_title('Methionine oxidation')

ax4 = peaksbin.plot.bar(y='Asp deam.', rot=45)
ax4.set_title('Asparagine deamidation')

ax5 = peaksbin.plot.bar(y='Glut deam.', rot=45)
ax5.set_title('Glutamine deamidation')

ax6 = peaksbin.plot.bar(y='Lys hydr', rot=45)
ax6.set_title('Lysine hydroxylation')

ax7 = peaksbin.plot.bar(y='Pro hydr', rot=45)
ax7.set_title('Proline hydroxylation')

ax8 = peaksbin.plot.bar(y='Lys meth.', rot=45)
ax8.set_title('Lysine methylation')

ax9 = peaksbin.plot.bar(y='Arg meth.', rot=45)
ax9.set_title('Arginine methylation')


### Exporting txt files of stripped peptides at confidence cutoffs:

In [None]:
# keep only peptide column >50% ALC
pep = peaks[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.txt", header=False, index=False)

# made the text file into a FASTA 

!awk '{print ">"NR"\n"$0}' /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.txt > /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.fas

# look

print("# of DN peptide >50% ALC", len(pep))
pep.head()

In [None]:
# keep only peptides  >80% ALC
peaks80 = peaks.loc[peaks['ALC (%)'] >= 80]

# see how many rows and double check
# peaks80.head(-10)

# keep only peptide column 
pep80 = peaks80[["stripped peptide"]]

# write altered dataframe to new txt file
# used header and index parameters to get rid of 'Peptide' header and the indexing

pep80.to_csv("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN80_stripped_peptides.txt", header=False, index=False)

# made the text file into a FASTA 

!awk '{print ">"NR"\n"$0}' /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN80_stripped_peptides.txt > /home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN80_stripped_peptides.fas

print("# of DN peptide >80% ALC", len(pep80))
pep80.head()

### Using BioPython to query peptide sequences

I installed the BioPython package using `pip install biopython`. All instructions and information [here](https://www.tutorialspoint.com/biopython/index.htm). 

GitHub project: https://github.com/biopython/biopython

I'm relying on the ProtParam module to parse sequences for relative AA composition, instability, secondary structure, instability, and hydrophobicity. You can read more about that module and the studies the indecies are derived from here:

https://biopython.org/wiki/ProtParam

In [None]:
# Bio.SeqIO is the standard Sequence Input/Output interface for BioPython 1.43 and later
# Bio.SeqIO provides a simple uniform interface to input and output assorted sequence file formats.
# (including multiple sequence alignments), but will only deal with sequences as SeqRecord objects

# for accepted file formats see https://biopython.org/wiki/SeqIO

from Bio import SeqIO
#for seq_record in SeqIO.parse("/home/millieginty/Documents/git-repos/2017-etnp/data/MED4/MED2_tryp_1raw_db_peptides_nmod.fasta", "fasta"):
    #print(seq_record.id)
    #print(repr(seq_record.seq))
    #print(len(seq_record))
    
# I commented the print functions out so the output doesn't take up too much space. 

In [None]:
# seeing what the ProtParam module can do with a single protein sequence:

from Bio.SeqUtils.ProtParam import ProteinAnalysis

test_seq = "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNGGHFLRILPDGTVDGTRDRSDQHIQLQLSAESVGEVYIKSTETGQYLAMDTSGLLYGSQTPSEECLFLERLEENHYNTYTSKKHAEKNWFVGLKKNGSCKRGPRTHYGQKAILFLPLPV"

analysed_seq = ProteinAnalysis(test_seq)
print("molecular weight of seq =", analysed_seq.molecular_weight())

# calculates the aromaticity value of a protein according to Lobry & Gautier (1994, Nucleic Acids Res., 22, 3174-3180). 
# it's simply the relative frequency of Phe+Trp+Tyr.

analysed_seq.aromaticity()
print("aromaticity of seq =", analysed_seq.aromaticity())

# secondary_structure_fraction:
# this methods returns a list of the fraction of amino acids which tend to be in helix, turn or sheet. 
# AAs in helix: V, I, Y, F, W, L
# AAs in turn: N, P, G, S
# AAs in sheet: E, M, A, L
# the returned list contains 3 values: [Helix, Turn, Sheet]

analysed_seq.secondary_structure_fraction()
print("frac in H T S =", analysed_seq.secondary_structure_fraction())

# the instability index, an implementation of the method of Guruprasad et al. (1990, Protein Engineering, 4, 155-161).
# this method tests a protein for stability. 
# any value above 40 means the protein is unstable (=has a short half life)
# NOT SURE WHAT THIS MEANS FOR PEPTIDES, BUT WE COULD DO THIS FOR PROTEINS

analysed_seq.instability_index()
print("instability =", analysed_seq.instability_index())

# count_amino_acids will do just that, and get_amino_acids_percent will return %'s for each AA across the sequence. 
analysed_seq.get_amino_acids_percent()

# taking the returned dictionary and converting to a dataframe

aadict = analysed_seq.get_amino_acids_percent()
aadf = pd.DataFrame(list(aadict.items()),columns = ['residue','% occurance']) 

aadf.head()

In [None]:
# use SeqIO and a loop to apply count_amino_acids to each sequence in the file
# aatot will give us the total number of each residue in the entire sample output

import collections
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

all_aas = collections.defaultdict(int)
for record in SeqIO.parse("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.fas", "fasta"):
    x = ProteinAnalysis(str(record.seq))
    #print(record.id, x.count_amino_acids())
    for aa, count in x.count_amino_acids().items():
        all_aas[aa] += count        
        
# made a dataframe for amino acid total counts        
data = (all_aas)
aatot = pd.DataFrame(data, index = ['sample sequence total'])
aatot.head()

In [None]:
from pandas import Series, DataFrame

with open('/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.fas') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))
        
        
#converting lists to pandas Series    
s1 = Series(identifiers, name='ID')
s2 = Series(lengths, name='length')

#Gathering Series into a pandas DataFrame and rename index as ID column
idseq = DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])

idseq.head()

In [None]:
from pandas import Series, DataFrame



with open('/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.fas) as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    aa = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))
        aa.count_amino_acids(seq_record.seq)
        
        
#converting lists to pandas Series    
s1 = Series(identifiers, name='ID')
s2 = Series(lengths, name='length')
s3 = Series(aa, name='AAs')

#Gathering Series into a pandas DataFrame and rename index as ID column
idseq = DataFrame(dict(ID=s1, length=s2, AAs=s3)).set_index(['ID'])

idseq.head()

In [None]:
from Bio import SeqIO
from Bio.SeqUtils import ProtParam

handle = open("/home/millieginty/Documents/git-repos/2017-etnp/data/pro2020/ETNP-SKQ17/PEAKS-PTMopt/ETNP-SKQ17-231-100m-0.3-JA2_DN50_stripped_peptides.fas") 
for record in SeqIO.parse(handle, "fasta"): 
    seq = str(record.seq)
    X = ProtParam.ProteinAnalysis(seq)
    print(X.count_amino_acids()) 
    #print X.get_amino_acids_percent() 
    #print X.molecular_weight() 
    #print X.aromaticity() 
    #print X.instability_index() 
    #print X.flexibility() 
    #print X.isoelectric_point() 
    #print X.secondary_structure_fraction()
    
# made a data series from the count_amino_acids function
# aacount = {X.count_amino_acids()}

# made a pandas dataframe from the series generated above
# aacount = pd.DataFrame(list(data.items()),columns = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']) 

aacount = pd.DataFrame(X.count_amino_acids(), index=[0])

# look at new dataframe

# aacount.head()